Is an ETL/ELT tool more pain than gain?

I worked in my career with different ETL/ELT Tools. 

Microsoft SSIS, SAP BO Data Services, IBM Data Stage, Informatica Power Center and now with Matillion. 

 

I was working as a consultant for a while so I worked with a lot of different people at a lot of different companies.

One of the main topics is always the culture clash between SQL only vs ETL/ELT only.

In some work environments everything is coded with cascaded SQLs/Views and the tool is only used the connect to different data sources and targets. 

In other environments SQL is forbidden and everything has to be build inside the tool to improve the standardization and make the "code" more readable and maintainable. 

 

I share the opinion that something well build using the components of the GUI and following good design patterns will result in more maintainable code than using plain SQL only. But i acknowledge that SQL has it own advantages. Development speed/sometimes performance advantages/generating code is easier)

 

My problem with all of this in mind is that an ETL/ELT tool should help me to make my work easier and not more complicated.

But most of the time I try to avoid these tools because they make my work more complicated compared to writing plain SQL code. 

This feels always wrong that the company pays for tools it employees try to avoid. 

 

The most annoying part is adding new attributes/columns. 

In a lot of cases this works in Matillion awesome without doing anything manual.

But especially using the join component it gets frustrating. 

If my source gets a new attribute that I need in the target table and the job has three joins I manually need to add this attribute in every join.

To avoid this I could delete all attributes and always use the add all feature but with this feature the names will get pretty messy after some joins because of the added default prefix. In some cases I don't need all the columns.

In this case something like an propagate column feature would be nice. 

 

 

What is your solution to this kind of problems where the tool makes the work more complicated compared to plain SQL coding?

Do you use plain SQL instead? 

Still use the GUI but 'allow' everything to get messy regarding names and orderings of attributes?

Do you force conventions and invest a lot of time to fix these components to make them more readable for someone else?

 

In my opinion this is always an trade off between saving time now but loosing time in the future for maintaining these jobs.

In our current agile IT landscape a lot of projects start small and have to deliver fast results.

For the management a result is already a finished solution so the next project is waiting to get fast results. 

That sounds worse than I think it actually is. 

As an data engineer I like to build perfect solutions but this is most of the time not needed and in some years everything we did will be migrated anyway.

So everything we build will be gone soon. (Famous last word regarding legacy systems) 

 

I always like to avoid manually stupid work and hope these kind of tools are actually tools that will help me in the future.

 

 

 

 

Hi Nils

 

For full disclosure, I’m a Product Manager at Matillion, but the views I share are entirely my own.

 

Thanks for raising this. It’s certainly a topic which comes up a fair bit. As a CS grad, I will always love a bit of code. For a full time developer coding is very often quicker and easier than any tool, irrespective of the language and purpose. I know from past experience as a Business Intelligence Manager that coding can go very wrong when your only developer leaves the company and has written poorly commented code that no-one can read but I appreciate this doesn't have to be the case. 

 

It does play to the argument that the code is only as good as the person who’s written it, in the same way your ETL job can be. You mention, for example, the issue with joins in Matillion in that you manually have to add new columns in. This is true. This is also true if your SQL specifies each column individually, you may have to edit your SQL to add all columns in and this can be more tricky tracing everything through than simply changing a property in a component. If you choose to write the SQL as generic “SELECT *” then you can lose confidence in the output you’re going to get if you find your input changes. Similarly with Matillion you could specify the job to use grid variables and dynamically pass those through to overcome the hard coding issue. This works well in some circumstances.

 

I’ve found a benefit of using an ETL/ELT tool is that you can get more colleagues than just your developers working on it, so for example you could have a technical BA. This means you can then concentrate development resources where you need most, which is often in the more tricky business rules, and allow some users to self serve.

 

Another way I’ve seen customers use Matillion is to just use the Orchestration side to manage the data load and the schedules and all transformations are written as SQL in SQL components. This works really well for some customers in that you can in some ways get the best of both worlds.

 

There’s a Matillion blog on this topic, which the link is here. Again full disclosure, I wrote it:

 

https://www.matillion.com/resources/blog/etl-coding-10-reasons-why-you-should-stop-hand-coding-your-etl-jobs

 

I’d love to hear others input on this, especially what we can do to make the Matillion ETL products easier for seasoned coding professionals. Please do feel free to submit the propagation feature to the ideas portal here

 

Thanks

Laura

 

 

First let's consider; As opposed to what?

 

ETL (or whatever folks want to call it) is best designed with the idea of constant change, to your point, things we use today might be different or non-existent tomorrow, so designing ETL with compartmentalization and simplicity is pretty essential if you don't want to have to maintain the beast you've built.

 

Yes, you can code everything. Not only could you write SQL scripts, you could use libraries in just about every language that abstract and consolidate, but ETL is more than the sum of it's extract, transform, and load parts. Writing out the 'what' our ETL does will always remain, in my opinion, the easier of the tasks at hand. The tougher propositions like 'when' does it do it, does it do it 'with' anything else, does it depend on the success or failure of anything else and of course, the feedback about what happened, not just in a blathering avalanche of logs, but in a sequential, step/task oriented way that is representative of how the process is expected to run.

 

I often tell people who claim they ETL'd an S3 file into Redshift that they did nothing of the sort, for ETL that performs single tasks is not the challenge, ETL is pipes and plumbing at scale, with complexity and when it has to go right. Sure, you can write up all of this stuff as discrete bits of code in a bunch of repos, stored procs, sql files, cron scripts - but at the end of the day, a good ETL system is the glue and the conductor, there is a reason that 'orchestrations' were a term that originated in enterprise ETL.

 

As for evolving messy stuff and getting things done fast before perfect, a good ETL tool will allow you to do both. If you find yourself becoming encumbered by your ETL tool and processes, it might become important to consider if you are overthinking your outcome and approach - because this is all too easy to do in good ETL systems - they tend to put the 'world' in front of you even if you need just the wee city block for what you're trying to accomplish. ETL gives you big powers meant to be used in manageable bite sized chunks so try and keep your solutions from becoming too monolithic.

 

Just my two cents.

 

Edward Hunter