@ian.funnell
I just want to give an update:
Let me share first what I actually wanted to do and why.
We are using SageMaker as our machine learning framework.
I was tasked with preparing the data for 26 models that share more or less the same input data but have different target variables.
So we are preparing 26 different data sets using Matillion and Snowflake.
This data is uploaded to S3, and then we need to trigger 26 SageMaker Batch Transform jobs that score it using previously trained models.
We are currently orchestrating everything with Matillion. We looked at Airflow, but for now we can still handle everything in Matillion. Maybe we will switch once the scheduling gets more complex.
That is why I needed Matillion to monitor my Batch Transform jobs. I dropped the approach of running one Python script per job that polls it, because that would lock a lot of resources just for polling.
My other approach was to run a loop that polls all still-running jobs at once every few seconds and logs their status to our database. As soon as all the jobs are finished, the results are pulled back from S3 into our database. I was not happy with this solution either, because I was using Matillion for tasks it was not built for.
So my current, more sophisticated approach is the following:
We use Matillion with Python/boto3 to trigger 26 AWS Step Functions.
Before starting these executions, we log to our database which job was started for which task.
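To make this more concrete, the trigger script in Matillion looks roughly like this (a simplified sketch: I'm showing 26 executions of one state machine, and the ARN, model names and the logging insert are just placeholders):

```python
import json
import uuid
import boto3

sfn = boto3.client("stepfunctions")

# Placeholders -- the real values come from Matillion job variables.
STATE_MACHINE_ARN = "arn:aws:states:eu-west-1:123456789012:stateMachine:batch-transform-scoring"
MODEL_NAMES = [f"model_{i:02d}" for i in range(1, 27)]

for model_name in MODEL_NAMES:
    # One Step Functions execution per model / Batch Transform job.
    response = sfn.start_execution(
        stateMachineArn=STATE_MACHINE_ARN,
        name=f"{model_name}-scoring-{uuid.uuid4().hex[:8]}",
        input=json.dumps({"model_name": model_name}),
    )
    # Here we also insert (model_name, executionArn, 'STARTED') into our
    # logging table through the Snowflake connection.
    print(model_name, response["executionArn"])
```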
These Step Functions contain a SageMaker Batch Transform sync task.
So AWS creates a CloudWatch Events rule for these triggered jobs and does the polling for us.
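The relevant part of the state machine definition looks roughly like this (again just a sketch, deployed here via boto3: instance type, S3 paths and the role ARN are placeholders, and I left out the error Catch and the notification step described below):

```python
import json
import boto3

# The "...createTransformJob.sync" resource is what makes Step Functions
# wait for the Batch Transform job instead of us polling it ourselves.
DEFINITION = {
    "StartAt": "RunBatchTransform",
    "States": {
        "RunBatchTransform": {
            "Type": "Task",
            "Resource": "arn:aws:states:::sagemaker:createTransformJob.sync",
            "Parameters": {
                "ModelName.$": "$.model_name",
                "TransformJobName.$": "$.transform_job_name",
                "TransformInput": {
                    "DataSource": {
                        "S3DataSource": {
                            "S3DataType": "S3Prefix",
                            "S3Uri.$": "$.input_s3_uri",
                        }
                    },
                    "ContentType": "text/csv",
                },
                "TransformOutput": {"S3OutputPath.$": "$.output_s3_uri"},
                "TransformResources": {
                    "InstanceType": "ml.m5.xlarge",
                    "InstanceCount": 1,
                },
            },
            "End": True,
        }
    },
}

sfn = boto3.client("stepfunctions")
sfn.create_state_machine(
    name="batch-transform-scoring",
    definition=json.dumps(DEFINITION),
    roleArn="arn:aws:iam::123456789012:role/stepfunctions-sagemaker-role",  # placeholder
)
```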
At this point Matillion is already done: it has sent 26 requests to the AWS API and written 26 entries into our logging table.
When a Step Function hits an error or the Batch Transform job finishes, it triggers a Lambda function that sends a message to an SQS queue Matillion is listening to.
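The Lambda is tiny; roughly like this (sketch only: the queue URL comes from an environment variable, and the exact fields depend on what the state machine passes into the Lambda):

```python
import json
import os
import boto3

sqs = boto3.client("sqs")
QUEUE_URL = os.environ["MATILLION_QUEUE_URL"]  # placeholder env variable


def lambda_handler(event, context):
    # Forward the outcome of the Batch Transform step to the SQS queue
    # that the Matillion SQS listener is subscribed to.
    message = {
        "transform_job_name": event.get("TransformJobName"),
        "status": event.get("TransformJobStatus", "Failed"),
    }
    sqs.send_message(QueueUrl=QUEUE_URL, MessageBody=json.dumps(message))
    return message
```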
This triggers a monitoring job that writes into the same log table which Batch Transform job has finished.
After every insert into this logging table, the monitoring job checks whether the latest status of all 26 Batch Transform jobs is "Completed".
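In Matillion this check is basically one query against the logging table; as a Python sketch (table and column names are placeholders, not our real schema):

```python
# Latest status per job, counted -- placeholder table/column names.
CHECK_SQL = """
    SELECT COUNT(*)
    FROM (
        SELECT job_name, status
        FROM batch_transform_log
        WHERE run_id = %(run_id)s
        QUALIFY ROW_NUMBER() OVER (PARTITION BY job_name ORDER BY logged_at DESC) = 1
    )
    WHERE status = 'Completed'
"""


def all_jobs_completed(cursor, run_id, expected_jobs=26):
    # Only kick off the final load once all 26 jobs report 'Completed'.
    cursor.execute(CHECK_SQL, {"run_id": run_id})
    return cursor.fetchone()[0] == expected_jobs
```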
Once every job is completed, we start the final job that pulls the data from S3 and integrates everything into one table.
It is not in production yet, but I like this solution so far.
It feels kind of strange to have this asynchronous communication, but I think for this purpose it is the best approach.
Btw, adding a dead letter queue to the listener would also be nice.
I already posted the idea: https://metlcommunity.matillion.com/s/idea/0874G000000kBIPQA2/detail