Graceful Matillion server shutdown

In our nightly process we start the Matillion VM for processing and shut it down at a specific time. We use Google Cloud Functions to shut down the VM, which essentially forces the Matillion server down and bypasses any graceful server shutdown features that might be available or necessary. If a job is still running (e.g., due to an error situation), this shutdown also kills the running job and the instance. This can be a problem when we investigate the job's failure: the database log and the catalina log may not contain all the information, because some logging information may not have been persisted to disk or to the database and was only available in memory.

 

In general, we want to change the time-triggered shutdown to an event-triggered shutdown after successful completion of the jobs. We have an idea of how to trigger the shutdown, either by using local Python system calls or by triggering our cloud function via Pub/Sub or a web API.

The open question for us is:

 

What is the preferred way to "gracefully" shut down a Matillion server so that we can follow a two-step shutdown:

  1. Stop the Matillion server
  2. Shut down the VM instance

 

I have not found an API call to gently shut down the Matillion server. Any suggestions?

 

Thanks,

Bodo

Hi Bodo,

 

I think we are in the same boat (again!). We thought about this before and already had some discussions with Matillion Support on this topic.

 

What we do now: A Lambda function on AWS spins up our Matillion instance early in the morning, and another Lambda function is responsible for the shutdown in the evening. This is all cron-triggered on a fixed schedule, and it works most of the time. BUT: We have also had a few situations where our daily batch took longer than expected, the job was killed, and we lost information on the job as well as some of our data from that batch.

 

We thought about two approaches for improving this. First, we could integrate the shutdown script into our daily batch (as the last step in the chain) and add a Python or Bash script that shuts down the instance once all jobs are completed. The second option is to extend our Lambda function with a call to the Matillion API that checks whether any jobs are still running on the instance. If some jobs are still running, it waits x minutes before the next shutdown attempt.
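A minimal Bash sketch of the second option (check for running jobs, retry later, only then stop). It assumes that the Matillion ETL v1 REST API exposes a per-project `task/running` endpoint that returns a JSON array, and that `curl` and `jq` are installed. The names `MATILLION_URL`, `MATILLION_USER`, `MATILLION_PASS`, `GROUP` and `PROJECT` are placeholders; please verify the endpoint against the API documentation for your Matillion version.

```shell
#!/usr/bin/env bash

# Print the number of running tasks in one project.
# Assumption: GET .../task/running returns a JSON array of running tasks.
count_running_jobs() {
  curl -sf -u "${MATILLION_USER}:${MATILLION_PASS}" \
    "${MATILLION_URL}/rest/v1/group/name/${GROUP}/project/name/${PROJECT}/task/running" \
    | jq 'length'
}

# wait_for_idle MAX_ATTEMPTS DELAY_SECONDS
# Returns 0 as soon as no job is running, 1 if jobs are still running
# (or the API could not be queried) after MAX_ATTEMPTS checks.
wait_for_idle() {
  local max_attempts=$1 delay=$2 attempt running
  for (( attempt = 1; attempt <= max_attempts; attempt++ )); do
    running=$(count_running_jobs)
    # Only an explicit "0" counts as idle; an empty answer is treated as "unknown".
    if [[ "$running" == "0" ]]; then
      return 0
    fi
    echo "Attempt ${attempt}: running jobs = '${running}', retrying in ${delay}s" >&2
    sleep "$delay"
  done
  return 1
}

# Example: up to 6 checks, 10 minutes apart, then stop the instance.
# wait_for_idle 6 600 && aws ec2 stop-instances --instance-ids "$INSTANCE_ID"
```

The endpoint is scoped to a single project, so an instance with several projects would need one check per project before the stop command is issued.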

 

Maybe this helps for your further investigations. Of course, I'm also interested in an "official" idea for a graceful shutdown approach.

 

Cheers,

Michael

We are in the same boat as well. We use an AWS CloudWatch rule to trigger a Lambda job on a schedule, which starts and stops the Matillion EC2 instance. But of course, if a job is in progress, it shuts down anyway. We appreciate the post here and are considering the API call for active jobs that Michael mentioned. Will update if we find success with that method.

Also, some months ago I posted a similar question to the ideas portal (https://metlcommunity.matillion.com/s/idea/0874G000000kAvuQAE/detail) to ask for a way to warn UI users when a shutdown is about to happen. One of the responses there was a link to a Power Off job in the Job Exchange. I haven't implemented it yet, but it could be of help here: https://exchange.matillion.com/s/exchange-job/a074G00001Ek5R0QAJ/power-off-matillion

 

We have extended our async job that extracts run stats after the daily job has ended. It now waits for all dailies in all environments (projects) and then shuts down the instance. If anybody is interested, I can share the Python code we use to wait for job completion. This only works if the wait is part of an additional job that runs asynchronously. We start this job via the REST API at the beginning of the daily, so it acts as a kind of watcher in the background.

 

@dhislop1581359283633 attached you'll find a code snippet to wait for a job. It waits a certain time if needed and then polls the REST API until a specific job has finished.
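For readers without access to the attachment, the wait-then-poll idea can be sketched roughly as follows in Bash. This is not the attached code. It assumes that a task can be queried by id through the v1 REST API, that the response carries a `state` field, and that a running task reports `RUNNING`; the endpoint path, the field name, the state value and the `MATILLION_*`, `GROUP` and `PROJECT` variables are all assumptions to check against your installation.

```shell
#!/usr/bin/env bash

# Print the state of one task.
# Assumption: GET .../task/id/<id> returns JSON with a "state" field.
get_task_state() {
  local task_id=$1
  curl -sf -u "${MATILLION_USER}:${MATILLION_PASS}" \
    "${MATILLION_URL}/rest/v1/group/name/${GROUP}/project/name/${PROJECT}/task/id/${task_id}" \
    | jq -r '.state'
}

# wait_for_task TASK_ID INITIAL_WAIT POLL_INTERVAL MAX_POLLS
# Sleeps INITIAL_WAIT seconds, then polls until the task is no longer running.
# Prints the final state and returns 0, or returns 1 if MAX_POLLS is exhausted.
wait_for_task() {
  local task_id=$1 initial_wait=$2 poll_interval=$3 max_polls=$4
  local i state
  sleep "$initial_wait"
  for (( i = 0; i < max_polls; i++ )); do
    state=$(get_task_state "$task_id")
    # An empty state (e.g. API not reachable) is treated like RUNNING: keep polling.
    if [[ -n "$state" && "$state" != "RUNNING" ]]; then
      echo "$state"
      return 0
    fi
    sleep "$poll_interval"
  done
  return 1
}

# Example: wait 30 minutes, then poll every 5 minutes for at most 2 hours.
# final_state=$(wait_for_task "$TASK_ID" 1800 300 24)
```

Capping the number of polls keeps a hanging job from blocking the watcher forever; the caller can then decide whether to shut down anyway or raise an alert.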

thank you, and look forward to giving it a try!

Can anyone please suggest the API used to shut down or start the Matillion service or instance?

@bodo yes, I am quite interested in the Python code for this job - that sounds like something we could use. Is it a long-running job with a 'sleep' / recheck component?

not sure the best way to exchange the code. one way would be to connect on https://www.linkedin.com/in/danhislop/ but maybe there is a way through this matillion portal as well which would be easiest. cheers

Hi, we have just added a Bash script component that runs unconditionally after our nightly orchestration job has finished. It contains this code:

if [ "${v_stop_instance}" = "ON" ]; then
  echo "Stopping matillion instance ${instanceId} in 60 seconds"
  # For testing add --dry-run
  # Put both sleep and the aws command in parentheses and run them in the
  # background to force the component to return
  ( sleep 60; aws ec2 stop-instances --region eu-west-1 --instance-ids "${instanceId}" ) &
else
  echo "Stopping of matillion instance is switched off. v_stop_instance: ${v_stop_instance}"
fi