We are running Matillion for Snowflake on AWS (Enterprise edition) on an m4.xlarge EC2 instance.
It has been running fine for the last few years.
Over the last couple of weeks, however, the server has gone down five times due to out-of-memory issues.
The number of jobs and the volume of data they process have both grown over the last few months.
This is a blocker now because we have to deploy a new application that will run 18 hours a day, 7 days a week, with certain jobs running every 10 minutes for near-real-time updates.
Implementing HA in Matillion requires all currently live jobs to be idempotent and to have transaction control, which would mean extensive remediation.
Can anyone suggest another scalable, cost-effective option for achieving near-zero downtime in Matillion and reducing such outages? And how feasible would that option be for supporting near-real-time updates?
The best approach here would be to understand what’s causing your out of memory issues. We can help you get to the bottom of that if you raise a support case on that issue in particular. Ideally, if you can address your out of memory issues, that might alleviate concerns around high availability.
I would agree with @ChikaMatillionCommunityMgr. It's definitely worth figuring out what is causing the out-of-memory errors. In many cases that I have seen, out-of-memory errors are caused by one or more jobs that use large Grid Variables. This could be a job that loads data into a grid variable that is then iterated over, or just a large hardcoded grid variable. The other typical cause is Python scripts. If you are doing a lot of memory-intensive operations within your Python scripts, this can cause issues.
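To illustrate the Python point, here is a minimal sketch of processing rows in fixed-size batches instead of loading everything into one list first. It is plain Python, not a Matillion API; the function names `iter_batches` and `total_amount` and the `amount` column are made up for the example.

```python
import csv
import io
from itertools import islice


def iter_batches(rows, batch_size):
    """Yield lists of at most batch_size items from any iterable."""
    it = iter(rows)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch


def total_amount(csv_file, batch_size=1000):
    """Sum the hypothetical 'amount' column while holding one batch in memory."""
    reader = csv.DictReader(csv_file)  # reads rows lazily from the file object
    total = 0.0
    for batch in iter_batches(reader, batch_size):
        total += sum(float(row["amount"]) for row in batch)
    return total


# Small in-memory stand-in for a large input file (illustrative data only).
sample = io.StringIO("id,amount\n1,10.5\n2,4.5\n3,5\n")
print(total_amount(sample, batch_size=2))
```

The idea is that peak memory is bounded by the batch size rather than by the size of the input, which is usually what you want in a script that runs on the same instance as everything else.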
The other possibility is that your jobs are clean and efficient but, as you stated, you are simply doing a lot more for longer periods. In that case, the short-term solution could be to increase the size of your single instance. As you increase the EC2 instance size, you get more memory (and, of course, CPU), which will likely get you past the issue of the server going down. That said, given how critical it is that these jobs run every 10 minutes, an HA setup sounds like the better long-term solution.
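If it helps when deciding whether a bigger instance is needed, below is a rough sketch for checking memory headroom on the Linux host. It assumes the standard `/proc/meminfo` format with `MemTotal` and `MemAvailable` lines; the function names are hypothetical, and this measures the whole server rather than Matillion specifically.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of integer values."""
    values = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        if not sep:
            continue  # skip lines without a "Key: value" shape
        parts = rest.split()
        if parts and parts[0].isdigit():
            values[key.strip()] = int(parts[0])
    return values


def available_fraction(meminfo):
    """Return the share of total memory the kernel reports as available."""
    return meminfo["MemAvailable"] / meminfo["MemTotal"]


def read_available_fraction(path="/proc/meminfo"):
    """Read the live value on a Linux host (path is the usual default)."""
    with open(path) as fh:
        return available_fraction(parse_meminfo(fh.read()))
```

Logging `read_available_fraction()` every minute or so around the times the server went down should show whether memory drains gradually as load grows or drops sharply when one particular job runs.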