Preventing Outages in Matillion ETL and supporting near real time update pipelines

pratikdatta1 · May 27, 2022, 4:30am

We run Matillion ETL for Snowflake (Enterprise edition) on AWS, on an m4.xlarge EC2 instance.

It has been running fine for the last few years.

But in the last couple of weeks, the server has gone down 5 times due to out-of-memory issues.

The number of jobs running and the volume of data they process have scaled up in the last few months.

This issue is a blocker now because we have to deploy a new application that will run 18 hours a day, 7 days a week, with certain jobs running every 10 minutes for near-real-time updates.

Implementing HA in Matillion requires all currently live jobs to be idempotent and to have transaction control, which would require extensive remediation.

Can anyone suggest another scalable and cost-effective option to achieve near-zero downtime in Matillion and reduce such outages? And how feasible would it be for supporting near-real-time updates?

ChikaMatillionCommunityMgr · June 3, 2022, 3:04pm

Hi @pratikdatta1

The best approach here would be to understand what’s causing your out of memory issues. We can help you get to the bottom of that if you raise a support case on that issue in particular. Ideally, if you can address your out of memory issues, that might alleviate concerns around high availability.

Thanks,

Chika

Bryan · June 6, 2022, 9:28pm

Hi @pratikdatta1 ,

I agree with @ChikaMatillionCommunityMgr. It's definitely worth figuring out what is causing the out-of-memory errors. In many cases I have seen, out-of-memory errors are caused by one or more jobs that use large grid variables. This could be a job that loads data into a grid variable that is then iterated over, or just a large hardcoded grid variable. The other typical cause is Python scripts: if you are doing a lot of memory-intensive operations within your Python scripts, this can cause issues.
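As a rough illustration of the Python script point, here is a minimal sketch (not Matillion-specific) that compares holding every row in memory with processing rows in fixed-size batches. It uses the standard library's `tracemalloc` (Python 3.4+) to report peak memory. The names `fake_rows`, `total_all_at_once`, `total_in_batches` and `peak_bytes`, and the row counts, are made up for the example; a real script would read from a query or file instead.

```python
import tracemalloc
from itertools import islice


def fake_rows(n):
    """Hypothetical row source standing in for a query result or file."""
    for i in range(n):
        yield (i, "value-%d" % i)


def total_all_at_once(n):
    """Materialise every row in a list first, so memory grows with n."""
    rows = list(fake_rows(n))
    return sum(row[0] for row in rows)


def total_in_batches(n, batch_size=1000):
    """Process rows batch by batch, so only one batch is held at a time."""
    rows = fake_rows(n)
    total = 0
    while True:
        batch = list(islice(rows, batch_size))
        if not batch:
            break
        total += sum(row[0] for row in batch)
    return total


def peak_bytes(func, *args):
    """Return (result, peak traced memory in bytes) for a single call."""
    tracemalloc.start()
    try:
        result = func(*args)
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return result, peak


if __name__ == "__main__":
    n = 100000
    _, full_peak = peak_bytes(total_all_at_once, n)
    _, batch_peak = peak_bytes(total_in_batches, n)
    print("all at once:", full_peak, "bytes peak")
    print("in batches: ", batch_peak, "bytes peak")
```

Both functions return the same total, but the batched version should show a much lower peak because it never holds more than `batch_size` rows. The same idea applies to anything that builds a large list or grid before iterating over it.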

The other possibility is that your jobs are clean and efficient but, as you stated, you are simply doing a lot more for longer periods. In this case, the short-term solution could be to just increase the size of your single instance. As you increase the EC2 instance size, you get more memory (and, of course, CPU). Doing this will likely get you over the issue of the server going down. That said, given how critical it is for the jobs to run every 10 minutes, an HA setup seems like the better long-term solution.

I hope this helps!
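Before resizing, it can help to have some numbers on how close the instance gets to running out of memory. The sketch below assumes a Linux host and parses `/proc/meminfo`-style text to work out what fraction of memory the kernel reports as available. The helper names and the 10% threshold are arbitrary choices for the example. It only reflects OS-level memory, not Java heap usage inside the Matillion application, so treat it as one signal rather than a diagnosis.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of {field: integer value}.

    Most fields are reported in kB; the unit suffix is ignored here.
    """
    values = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts and parts[0].isdigit():
            values[name.strip()] = int(parts[0])
    return values


def available_fraction(meminfo):
    """Fraction of total memory that the kernel reports as available."""
    return meminfo["MemAvailable"] / float(meminfo["MemTotal"])


def should_alert(meminfo, threshold=0.10):
    """True when available memory drops below the (arbitrary) threshold."""
    return available_fraction(meminfo) < threshold


def read_meminfo(path="/proc/meminfo"):
    """Read and parse the live file; only works on a Linux host."""
    with open(path) as handle:
        return parse_meminfo(handle.read())
```

Calling `should_alert(read_meminfo())` from a scheduled script and logging the result over a few days would show whether memory runs low steadily or only around particular jobs, which in turn suggests whether a larger instance alone is likely to help.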

pratikdatta1 · June 9, 2022, 3:46am

Thank you @ChikaMatillionCommunityMgr @Bryan. Our DevOps team has reached out to Matillion Support and is working with them on this.

ChikaMatillionCommunityMgr · June 9, 2022, 3:34pm

You're welcome, @pratikdatta1. Please circle back and share the outcome with the Community.

PS: If you are comfortable, please upload a photo of yourself. We're all friends!

Cheers!

Chika
