Hello, we are investigating why our jobs sometimes get stuck forever in the case below.
Use case: We created a shared job to incrementally load data for all tables from our primary databases. For each database we created an orchestration that includes this shared job, so we have about 16 orchestrations. But sometimes they get stuck forever while running. The state of these job runs shows "executing", but they are not actually processing anything anymore (only these incremental jobs are affected; other jobs on Matillion are still fine). The issue went away after we restarted Matillion.
We found that:
1) There were some auto cancellations of these jobs while they were running.
2) There were many auto cancellations due to "Prevent duplicate jobs scheduling"
From these 2 findings, we guess this happens because the auto cancellations of some of the 16 orchestration jobs above occupied threads, so the others got stuck. But if that were the case, the whole Matillion system should get stuck, not only these incremental jobs.
We are finding it difficult to investigate the underlying reason.
Does anyone have experience with a similar issue? Or does anyone really understand how the auto cancellation of schedules and the auto cancellation of running jobs work?
Many thanks.
Hi @biappgroup1626192256878,
It sounds like there may be a combination of three things happening there: 1) the Prevent Duplicate Jobs setting in your schedule, 2) calls to Transformation Jobs, which run in serial, and 3) the 16-job limit.
When you schedule a job you have the option to Prevent Duplicate Jobs by ticking the checkbox in the schedule definition (see this document). The setting influences how Matillion handles the case when the last instance of the job is still running when the newly scheduled instance is due to start.
With Prevent Duplicate Jobs ticked, new instances of the job are cancelled immediately.
With Prevent Duplicate Jobs not ticked, new instances of the job queue behind the running one, and they are shown with hourglasses.
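To make the two behaviours concrete, here is a minimal Python sketch of the decision described above. It is only an illustration of the rule, not Matillion's actual scheduler code; the function name `on_schedule_fire` and the job name `incremental_load` are made up for the example.

```python
from collections import deque

def on_schedule_fire(running, queue, job_name, prevent_duplicates):
    """Decide what happens to a newly scheduled instance of job_name.

    Hypothetical helper that models the rule described in the reply.
    """
    if job_name not in running:
        # No earlier instance is still running, so the new one starts.
        running.add(job_name)
        return "started"
    if prevent_duplicates:
        # Checkbox ticked: the new instance is cancelled straight away.
        return "cancelled"
    # Checkbox not ticked: the new instance waits behind the running one.
    queue.append(job_name)
    return "queued"

running, queue = set(), deque()
first = on_schedule_fire(running, queue, "incremental_load", True)
second = on_schedule_fire(running, queue, "incremental_load", True)
third = on_schedule_fire(running, queue, "incremental_load", False)
print(first, second, third)
```

The second and third calls both arrive while the first instance is still running; the only difference between them is the checkbox, which is why the same schedule can produce either a cancellation or a queued run.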
Now on to the second point: calling transformation jobs. Let’s say you have 16 different Orchestration Jobs, or 16 instances of a Shared Job, which are all running at the same time, and all want to run one Transformation Job with different parameters. At the time of writing those Transformation Job runs will be serialized, so the parent Orchestration Jobs or Shared Jobs will all show “executing”, although they are actually just waiting on a lock for their chance to run the Transformation Job. All the jobs will run eventually. From an architecture perspective you are less likely to hit this potential bottleneck if you design fewer, larger Transformations. If you would like to have your say in this area, please take a look at this thread in our Ideas Portal and vote accordingly!
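As a rough illustration of that serialization, the sketch below has 16 parent jobs that all need the same transformation. It assumes a single shared lock (`transformation_lock`, a hypothetical stand-in; the real locking inside Matillion is not described in this thread), so every parent counts as started while only one at a time does any work.

```python
import threading
import time

# Hypothetical stand-in for whatever serializes the Transformation Job runs.
transformation_lock = threading.Lock()
state_lock = threading.Lock()  # protects the counters below
active = 0          # transformation runs in progress right now
max_active = 0      # highest value 'active' ever reached
completed = []      # parents that finished, in completion order

def run_parent(db_name):
    global active, max_active
    # From here on the parent would show as "executing",
    # even while it is only waiting for the lock.
    with transformation_lock:
        with state_lock:
            active += 1
            max_active = max(max_active, active)
        time.sleep(0.01)  # pretend to do the transformation work
        with state_lock:
            active -= 1
            completed.append(db_name)

threads = [threading.Thread(target=run_parent, args=(f"db_{i}",)) for i in range(16)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print("max concurrent transformations:", max_active)
print("parents completed:", len(completed))
```

All 16 parents finish eventually, but the total elapsed time is roughly the sum of the individual transformation runs rather than the longest one, which is what can make a busy set of parents look stuck.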
The third point about the 16 job limit is documented in much more detail here. Matillion ETL instances allow up to 16 jobs concurrently. If you launch a 17th, it will queue until there’s an empty slot. Again from an architecture perspective you are less likely to hit this potential bottleneck if you design fewer, larger jobs.
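The slot behaviour can be sketched in a few lines of Python. This is a simplified model under the assumption that waiting jobs are released in first-in, first-out order; the class `JobSlots` and the job names are invented for the example and are not part of Matillion.

```python
from collections import deque

MAX_CONCURRENT_JOBS = 16  # the limit described above

class JobSlots:
    """Toy model of a fixed number of job slots with a waiting queue."""

    def __init__(self, limit=MAX_CONCURRENT_JOBS):
        self.limit = limit
        self.running = []
        self.waiting = deque()

    def launch(self, job):
        # Take a free slot if there is one, otherwise wait in the queue.
        if len(self.running) < self.limit:
            self.running.append(job)
            return "running"
        self.waiting.append(job)
        return "queued"

    def finish(self, job):
        # Free the slot and hand it to the longest-waiting job, if any.
        self.running.remove(job)
        if self.waiting:
            self.running.append(self.waiting.popleft())

slots = JobSlots()
statuses = [slots.launch(f"job_{i}") for i in range(1, 18)]
print(statuses[-1])   # status of the 17th job at launch time
slots.finish("job_1")
print("job_17" in slots.running)
```

In this model the 17th job only starts once one of the first 16 finishes, so a set of 16 long-running orchestrations leaves no slot free for anything else until one of them completes.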
Hope that all makes sense and is helpful.
Ian
@ian.funnell Got it. Very clear. Many thanks for your info.