Matillion out of memory

We are constantly getting a java.lang.OutOfMemoryError: Java heap space error.

This is often caused by a Tomcat process trying to allocate more than the 2 GB RAM limit, which triggers a Tomcat restart. For example, we get this error when we right-click a transformation/orchestration job and select "Task History".

This crashes the application and restarts all running jobs.

Hi. Sam from Matillion here.

With regard to the OutOfMemory errors you are experiencing, it's worth knowing that Matillion jobs are vulnerable to a variety of OOM problems, with the main symptoms including:

  • OOM
  • Java heap space
  • java.lang.OutOfMemoryError


You will need to identify which job is causing the OOM, because it's impossible to tell that from the Catalina log file. Although there are edge cases, it's usually caused by one of three things:

  1. A Python Script component that uses a lot of memory (especially if running in Jython mode, since Jython runs inside Matillion's own JVM and shares its heap)
  2. A Database Query component, especially one fetching many columns or wide columns (LOB, JSON, XML, etc.)
  3. An iterator in an orchestration job, especially in Concurrent mode, and especially if it contains either of the above
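As a generic illustration of the first cause (not Matillion-specific code), a script that materializes a whole result set in memory grows with the data, whereas one that streams rows in fixed-size batches keeps memory roughly bounded. The row source and function names below are hypothetical, and the sketch uses Python 3 syntax, so it would need adapting to run under Jython, which implements Python 2.7.

```python
from itertools import islice
from typing import Iterable, Iterator, List, Tuple

Row = Tuple[int, str]

def fetch_rows(n: int) -> Iterator[Row]:
    """Hypothetical row source that yields rows one at a time."""
    for i in range(n):
        yield (i, f"value-{i}")

def batched(rows: Iterable[Row], size: int) -> Iterator[List[Row]]:
    """Group rows into lists of at most `size` items."""
    it = iter(rows)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch

def total_memory_heavy(n: int) -> int:
    # Anti-pattern: list() materializes every row, so memory grows with n.
    rows = list(fetch_rows(n))
    return sum(row[0] for row in rows)

def total_streaming(n: int, batch_size: int = 1000) -> int:
    # Only one batch of at most `batch_size` rows is held as a list at a time.
    total = 0
    for batch in batched(fetch_rows(n), batch_size):
        total += sum(row[0] for row in batch)
    return total
```

Both functions compute the same result; only the streaming version keeps its peak memory independent of the total row count.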


After hitting an OOM:

  • You must restart Matillion because nothing works properly after an OOM. Matillion will detect an OOM and restart itself automatically, up to 5 times per day.
  • Watch out for large .hprof heap dump files in /tmp, which can quickly consume a lot of disk space. If the restart does not clean them up, they can simply be deleted.
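A minimal Python 3 sketch for reclaiming that disk space, assuming the dumps are written directly into /tmp with a .hprof extension (the usual JVM naming is java_pid<pid>.hprof, but check your own instance). The function names are hypothetical; deletion defaults to a dry run, and the min_age_seconds parameter helps avoid touching a dump that is still being written.

```python
import time
from pathlib import Path
from typing import List

def find_heap_dumps(directory: str = "/tmp", min_age_seconds: float = 0.0) -> List[Path]:
    """Return .hprof files directly inside `directory` at least `min_age_seconds` old."""
    now = time.time()
    dumps = []
    for path in Path(directory).glob("*.hprof"):
        if path.is_file() and now - path.stat().st_mtime >= min_age_seconds:
            dumps.append(path)
    return sorted(dumps)

def delete_heap_dumps(directory: str = "/tmp", min_age_seconds: float = 0.0,
                      dry_run: bool = True) -> int:
    """Delete (or, when dry_run is True, only report) heap dumps.

    Returns the number of bytes freed, or that would be freed in a dry run.
    """
    freed = 0
    for path in find_heap_dumps(directory, min_age_seconds):
        size = path.stat().st_size
        print(f"{'Would delete' if dry_run else 'Deleting'} {path} ({size} bytes)")
        if not dry_run:
            path.unlink()
        freed += size
    return freed
```

For example, `delete_heap_dumps(min_age_seconds=600)` lists dumps older than ten minutes, and adding `dry_run=False` actually removes them.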


Things to bear in mind:

  • Don't write Python scripts that require a lot of memory. If such processing is necessary, run it outside of Matillion
  • Don't query many columns or wide columns using the Database Query component
  • Run fewer things in parallel (don't use concurrent-mode iterators, and schedule jobs so that their start times are staggered)
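Matillion's iterators and schedules are configured in its UI rather than in code, so the following is only a generic Python analogy for the last point: capping how many tasks run at once (a sequential iterator corresponds to max_parallel=1) and spacing out scheduled start times. All names are hypothetical; ConcurrencyProbe exists only to show that the cap holds.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List

def staggered_offsets(job_count: int, gap_minutes: int) -> List[int]:
    """Start-time offsets (in minutes) so scheduled jobs do not all start together."""
    return [i * gap_minutes for i in range(job_count)]

def run_bounded(tasks: List[Callable[[], int]], max_parallel: int) -> List[int]:
    """Run tasks with at most `max_parallel` in flight; results keep input order."""
    with ThreadPoolExecutor(max_workers=max_parallel) as pool:
        return list(pool.map(lambda task: task(), tasks))

class ConcurrencyProbe:
    """Records the peak number of tasks running at the same time."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self.active = 0
        self.peak = 0

    def task(self, value: int) -> Callable[[], int]:
        def run() -> int:
            with self._lock:
                self.active += 1
                self.peak = max(self.peak, self.active)
            time.sleep(0.01)  # stand-in for real, memory-using work
            with self._lock:
                self.active -= 1
            return value
        return run
```

With `max_parallel=3`, no more than three tasks (and therefore no more than three tasks' worth of memory) are ever live at once, at the cost of a longer total run time.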


Customers with HA (high availability) enabled might have a worse experience with OOM issues, because a failing job can move from one instance to the next. What happens is:

  • The job runs on instance 1, and hits a java.lang.OutOfMemoryError: Java heap space error
  • Matillion detects the error and shuts down instance 1 for restart
  • Instance 2 then detects that the job needs to be run, launches the job, and then very likely hits the same java.lang.OutOfMemoryError: Java heap space error
  • Instance 2 shuts down for restart
  • Repeat

I hope this helps.


Thank you for your response.

We will try to adjust the implementation of the jobs that are causing these errors. However, it is important to us that Matillion also takes part in solving these issues, as they are known to affect multiple customers.

Other software uses disk caching and blocking implementations to overcome these issues. Is this something Matillion is evaluating?

A fix for this issue might also help with compiling larger transformations without splitting them into multiple steps, thereby reducing the complexity of customer implementations.

@Sam-Matillion

You mention that the following may cause an OOM:

"An iterator in an orchestration job, especially in Concurrent mode, and especially if it contains either of the above".

Does the recommendation to avoid Concurrent mode iterators in orchestration jobs, especially when they contain a Database Query, also apply when the Concurrent mode iterator is attached to a Run Orchestration component that runs an orchestration job containing a Database Query?