Complex scheduling with dependencies between jobs

Hi all,

 

The number of jobs, schedules, and dependencies keeps growing.

 

How do you handle dependencies between different jobs?

 

With smaller projects and job plans it is possible to add all jobs into one orchestration job.

But with larger and more complex jobs this gets messy pretty fast.

 

I can imagine different solutions for this issue:

 

The first solution is to use logging tables and solve this issue using the database.

In this case the jobs need to poll the database to check whether their dependencies have finished. Alternatively, every finished job could trigger all jobs that depend on it, and each of those only starts once all of its requirements are fulfilled.

This might get messy fast as well, because the dependencies are maintained inside a database table. How do you handle errors and delays?
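
Roughly, I imagine the polling variant looking something like this. The job_status and job_dependencies tables are only placeholders for whatever logging schema we end up with, and sqlite3 just stands in for the real warehouse connection:

    import sqlite3
    import time

    # Hypothetical logging schema: job_status(job_name, status, run_date) and
    # job_dependencies(job_name, depends_on). sqlite3 is only a stand-in for
    # the real warehouse connection.
    conn = sqlite3.connect("scheduler.db")

    def dependencies_finished(job_name):
        """True when every upstream job has a FINISHED status row for today."""
        row = conn.execute(
            """
            SELECT COUNT(*)
            FROM job_dependencies d
            LEFT JOIN job_status s
                   ON s.job_name = d.depends_on
                  AND s.status   = 'FINISHED'
                  AND s.run_date = DATE('now')
            WHERE d.job_name = ?
              AND s.job_name IS NULL
            """,
            (job_name,),
        ).fetchone()
        return row[0] == 0  # zero unfinished dependencies remain

    # The polling loop itself - exactly the part that burns resources.
    while not dependencies_finished("load_table_a"):
        time.sleep(60)
    print("all dependencies finished - safe to start load_table_a")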

 

Another solution is to use e.g. Apache Airflow.

In this case Airflow needs to poll Matillion to monitor the job status.

Is anyone using Airflow with Matillion for larger projects? Is the polling an issue for Matillion? I have some trust issues with such an external solution in combination with the Matillion API.

 

Our requirement is to easily add new dependencies and restart jobs when they fail.

 

Am I missing any other solutions?

One more fancy solution might be to have jobs running all the time, with everything reaching some form of eventual consistency. But this won't work everywhere.

 

 

Kind regards,

Nils

 


@Nils​ - Based on your post, it sounds like you need dedicated scheduling software. In the past I have used Tidal scheduler (I believe it is now Tidal Workload Automation - https://www.tidalsoftware.com/), but I don't know if it would work with Matillion; I don't see it listed on their website. I used it with Informatica jobs, to call Linux scripts, and to execute database queries. Every job is set up in the scheduler, and dependencies can be based on many different things, including whether another job has finished. If a dependency fails, the downstream jobs wait until it runs successfully.

I would talk to Matillion and see if there are any third party schedulers that work with Matillion. If the tool can call a Python script, you could call scripts from the scheduler that start the Matillion jobs using the API (https://www.matillion.com/resources/blog/ask-a-solution-architect-can-i-execute-jobs-from-the-matillion-api).

You could also handle the calling of jobs by having a Python script check the table like you mentioned, then call the next job if it has finished, but that can get messy, as you said. I'm not sure how Apache Airflow would poll Matillion to see if the job finished, but if it is by calling the Matillion API for job status, check out the job endpoint with Postman or some other tool first to see what filtering is available. I just started looking at it: you can filter on a date range, which returns all jobs that finished in that range, and then you have to check the returned JSON to see if the job you are looking for is there. You can also filter on the job name, but that returns all instances of that job.
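
As a sketch of what that status check could look like - the endpoint path and the JSON field names here are my guesses from poking at it, so verify them against your instance with Postman first:

    import requests
    from requests.auth import HTTPBasicAuth

    # Placeholder v1 API base URL - adjust group/project names for your instance.
    BASE = "http://matillion-host/rest/v1/group/name/MyGroup/project/name/MyProject"
    auth = HTTPBasicAuth("api-user", "password")

    # Placeholder date-range filter: returns the tasks that finished in the range.
    resp = requests.get(
        f"{BASE}/task/filter/by/range/date/2021-06-01/to/2021-06-02",
        auth=auth,
    )
    resp.raise_for_status()

    # The response is a JSON list of task records; scan it for the job in question.
    finished = [
        t for t in resp.json()
        if t.get("jobName") == "load_table_a" and t.get("state") == "SUCCESS"
    ]
    print("job finished" if finished else "job not found in that range")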

Hope this helps a little.

I agree that polling does chew up resources. I would not do polling inside a Matillion Python module - if I were going to use a Python script to start a Matillion job, I would have the script outside Matillion and start it with some type of scheduler.

 

I am not familiar with AWS SQS, but it sounds like you have a plan for a solution.

Thanks for your response.

Actually, we were using a similar tool (Automic UC4) for our legacy Informatica DWH.

Matillion provides a stateless REST API.

This means starting a job is easy, but monitoring a job is annoying because you have to poll.

As soon as you build something that does polling, you start wasting resources.

Doing polling inside Matillion using Python is a really bad idea because of Matillion's massive limitations regarding parallel threads and resources. So best practice is to avoid using Matillion for tasks like polling, because we will need those resources elsewhere.

Building a polling mechanism using Apache Airflow is possible and we can get this running.
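
For reference, the Airflow side would be roughly something like this. The Matillion endpoints are placeholders and the helper functions are only illustrative:

    from datetime import datetime

    import requests
    from airflow import DAG
    from airflow.operators.python import PythonOperator
    from airflow.sensors.python import PythonSensor

    # Placeholder Matillion v1 API base URL - adjust group/project/version names.
    BASE = "http://matillion-host/rest/v1/group/name/MyGroup/project/name/MyProject"

    def start_job():
        # Fire-and-forget launch; the sensor below does the monitoring.
        requests.post(
            f"{BASE}/version/name/default/job/name/load_table_a/run"
        ).raise_for_status()

    def job_is_finished():
        # Placeholder status lookup; returns True once the job no longer runs.
        running = requests.get(f"{BASE}/task/running").json()
        return not any(t.get("jobName") == "load_table_a" for t in running)

    with DAG(
        "matillion_load",
        start_date=datetime(2021, 6, 1),
        schedule_interval="@daily",
        catchup=False,
    ) as dag:
        run = PythonOperator(task_id="start_matillion_job", python_callable=start_job)
        # mode="reschedule" releases the worker slot between pokes, so the
        # polling at least does not block an Airflow worker the whole time.
        wait = PythonSensor(
            task_id="wait_for_matillion_job",
            python_callable=job_is_finished,
            poke_interval=120,
            mode="reschedule",
        )
        run >> wait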

But the disadvantage is that we need to maintain another tool (we would use the AWS managed service for this). I am more worried about the reliability of Matillion's API.

That is the reason I am looking for someone who is actually using Airflow with Matillion on a large scale.

What I currently have in mind is some kind of logging framework with dependencies between jobs.

Before and after every job there is a component that writes the status of the job into the database.

So e.g. Started/Finished/Error. Everything will be triggered event-based.

So when a job finishes, the component will look up which dependent jobs exist and will start them using the SQS component. I don't know why, but using SQS to start Matillion jobs feels more mature than calling the API. (@ian.funnell​ I have the feeling you prefer SQS as well?)

When a job is started via the SQS component, it will look up the latest status of all the jobs it depends on. If any of them are not yet finished, the job will stop immediately. It will be started again when the next dependency finishes.
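
The SQS handoff could then look roughly like this. The queue URL and names are placeholders, and the message fields follow what I remember from the Matillion SQS listener docs, so please double-check them for your version:

    import json

    import boto3

    # Placeholder queue URL; the queue is the one Matillion's SQS listener reads.
    QUEUE_URL = "https://sqs.eu-central-1.amazonaws.com/123456789012/matillion-jobs"
    sqs = boto3.client("sqs")

    def trigger_dependents(dependent_jobs):
        # One message per dependent job; each started job then re-checks the
        # status table itself and stops immediately if anything upstream is
        # still unfinished.
        for job_name in dependent_jobs:
            sqs.send_message(
                QueueUrl=QUEUE_URL,
                MessageBody=json.dumps({
                    "group": "MyGroup",        # placeholder names throughout
                    "project": "MyProject",
                    "version": "default",
                    "environment": "prod",
                    "job": job_name,
                }),
            )

    trigger_dependents(["load_table_a", "load_table_b"])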

So this is just a rough draft in my head. Might have some flaws.

Hope it was not too confusing ;)

@Nils​ I like what you have described. It's more sophisticated than a timed schedule, and more sophisticated than a single Orchestration job which calls many others.

I have seen Matillion customers successfully implement a persistence-based solution like you described...

  • A database table which holds "state"
  • Another database table which holds dependency rules
  • A Matillion job which wakes up periodically, reads the "state" and the "rules" and decides: can I do anything? Then if yes, launch the appropriate job

It's like a microservices pattern, with the dependency rules driven by metadata.
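
As a minimal sketch of that pattern, with the two tables shown as in-memory dicts for brevity (the real periodic job would of course read the actual state and rules tables):

    # State and rules shown as in-memory dicts for brevity; the real job would
    # read them from the two database tables.
    state = {"load_b": "FINISHED", "load_c": "FINISHED", "load_a": "PENDING"}
    rules = {"load_a": ["load_b", "load_c"]}  # job -> upstream jobs it waits for

    def launchable(job):
        return state[job] == "PENDING" and all(
            state.get(dep) == "FINISHED" for dep in rules.get(job, [])
        )

    for job in rules:
        if launchable(job):
            print(f"launching {job}")  # e.g. send the SQS message / API call here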

That's one reason I think SQS (and similar) can be good.

  • It's easy to have branching inside a job, for example with an IF, or perhaps with the Success/Fail connectors. So the dependency rules can be part of the job definition
  • It can remove the immediate dependency between named jobs
  • It's resilient, and guarantees that the next job(s) will be run
  • It removes any requirement for parallel threads
  • You can make it launch the next job(s) only when this one has finished

However, monitoring becomes more difficult...

Then you may consider other tools, such as have been discussed in this thread.

Also, I do think there is a place for an API-based solution. The big thing you have to be careful about is that the REST call may fail, or it may succeed with a message that it did not launch the job! So you also have to parse and check the JSON output. This article on Airflow was one of mine, and might be useful. There's a link at the end where you can download the Python code.
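
A defensive launch could look roughly like this - note that the "success", "msg" and "id" fields are from memory, so check what your instance actually returns:

    import requests
    from requests.auth import HTTPBasicAuth

    # Placeholder run endpoint - adjust group/project/version/job names.
    url = ("http://matillion-host/rest/v1/group/name/MyGroup/project/name/MyProject"
           "/version/name/default/job/name/load_table_a/run")

    resp = requests.post(url, auth=HTTPBasicAuth("api-user", "password"))
    resp.raise_for_status()  # catches transport and HTTP-level failures

    body = resp.json()
    if not body.get("success"):  # catches "HTTP 200 but job not launched"
        raise RuntimeError(f"job was not launched: {body.get('msg')}")
    print(f"launched, task id {body.get('id')}")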

However, monitoring from the API can be difficult too. As you have seen, it's asynchronous so you don't know if it was successful, or even if the job started or finished.

I am curious to hear from more of our users who have implemented complex workflows using solutions external to Matillion.

As a suggestion, I am also curious if anyone has implemented a solution based on AWS Step Functions? If there is interest in this, I could write some enablement documentation, or maybe some Shared Jobs?

Hey Ian,

 

As always, thank you very much for your detailed feedback.

 

Regarding Step Functions:

This was actually something we discussed as well.

My issue is the JSON-based language.

But now you can use a graphical interface for editing as well!

 

We are already using Step Functions for some tasks, e.g. monitoring SageMaker jobs and starting Matillion jobs afterwards.

 

With the old design, writing everything in JSON is extremely complex for larger schedules.

But with the new interface maybe it makes sense for some customers to use such a solution. Compared to Airflow it is quite cheap and serverless. But Airflow is more powerful.

In my opinion, Step Functions' main use case is smaller state machines for polling or smaller flows.

But this is only my opinion, so maybe someone is using it for extremely large schedules as well.

 

It's been a year on this thread; just wondering if there are any new, optimal, and efficient complex warehousing solutions built using Matillion?

 

Much like Anudeep, I'm curious if anybody has fleshed out more solutions to this problem. My team is currently trying to figure out the best way to re-architect our ELT to be dependency driven. For example, if Table A requires Tables B and C to be loaded before it can begin loading, we want to be able to just specify somewhere that Tables B and C are requirements for Table A to load. Then when all requirements are fulfilled, the Table A load job will automatically start running. I'm looking into Airflow and SQS/SNS now, but was curious if there are other solutions out there as well.

We are thinking of a hybrid solution using a database which holds the state, combined with Matillion's own orchestration capabilities, i.e. an Orchestration of Orchestrations. We divided our jobs into Subject Areas (groups of Orchestration jobs). Each job in a subject area triggers a common wrapper job (which takes the actual orchestration job name as a parameter) that checks and validates its state in the database. If it has not yet run for the specified run date, it triggers the required Orchestration job. If it has already run for that date, it silently succeeds without processing and continues to the next job.
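
A rough sketch of the wrapper check, with placeholder table and column names (sqlite3 stands in for our actual warehouse):

    import sqlite3

    # Placeholder state table: job_run_state(job_name, run_date, status).
    # sqlite3 stands in for our actual warehouse connection.
    conn = sqlite3.connect("scheduler.db")

    def should_run(job_name, run_date):
        row = conn.execute(
            "SELECT COUNT(*) FROM job_run_state "
            "WHERE job_name = ? AND run_date = ? AND status = 'SUCCEEDED'",
            (job_name, run_date),
        ).fetchone()
        return row[0] == 0  # only run if there is no successful run yet

    if should_run("orch_load_sales", "2022-06-01"):
        print("trigger the actual Orchestration job")
    else:
        print("already ran for this date - succeed silently and continue")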