Extract API task history to monitor it

Hi there,

We have been using the API Query component to extract data out of the Matillion Postgres database.

We are not really satisfied with the flattening in the RSD, and since we can do that later in Snowflake itself, we thought we would switch to the API Extract component, which in my eyes is more flexible and better feature-wise.

But now we are running into out-of-memory issues: the Matillion instance becomes unstable, hits heap space errors, and then Tomcat restarts.

This happens even when we use paging. I thought paging was supposed to prevent that, but it doesn't.

Has anybody got a working solution using API Extract to extract data from the task history? Or any other suggestions?

Thanks,

Marti

Hi mw,

I don't think Matillion implemented paging in that part of the API, so during processing it tries to hold up to an entire year's worth of task history in memory. That's what can cause the OutOfMemoryError (Java heap space) errors.

According to the docs, you can filter by a date range:

http://<InstanceAddress>/rest/v1/group/name/<groupName>/project/name/<projectName>/task/filter/by/start/range/date/<startDate>/time/<startTime>/to/date/<endDate>/time/<endTime>

So you may be able to loop, for example month by month, and fetch a smaller amount of data at a time.
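For illustration, here is a rough sketch in Python of that month-by-month loop against the date-range endpoint. The instance address, group/project names, and credentials are placeholders, and I'm assuming ISO-style date and time formats; check the docs for the exact format your version expects.

```python
import requests
from datetime import date, timedelta

# Placeholders -- substitute your own instance address, names, and credentials.
BASE = "http://<InstanceAddress>/rest/v1"
GROUP, PROJECT = "<groupName>", "<projectName>"
AUTH = ("api-user", "api-password")

def month_ranges(first, last):
    """Yield (first_day, last_day) pairs for each month from first to last."""
    current = date(first.year, first.month, 1)
    while current <= last:
        next_month = date(current.year + current.month // 12,
                          current.month % 12 + 1, 1)
        yield current, min(next_month - timedelta(days=1), last)
        current = next_month

all_tasks = []
for start, end in month_ranges(date(2021, 1, 1), date(2021, 12, 31)):
    # One month per request keeps the payload (and Matillion's heap) small.
    url = (f"{BASE}/group/name/{GROUP}/project/name/{PROJECT}"
           f"/task/filter/by/start/range/date/{start:%Y-%m-%d}/time/00:00:00"
           f"/to/date/{end:%Y-%m-%d}/time/23:59:59")
    response = requests.get(url, auth=AUTH)
    response.raise_for_status()
    all_tasks.extend(response.json())
```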

 

If you want quick summary stats, you could take a look at Matillion's own JDBC driver.

CK

I have experienced the same issue with the API Extract component and the Task History API. You have to add a filter to reduce the number of task history entries. From my experience, it is not possible to query more than a week of task history data; anything above that causes the OOM error. Of course, this depends highly on the number of jobs/tasks that you run per week.

I have dismissed my plans to use the API, and I now query the local Postgres database directly. This is much faster for larger amounts of data and does not cause memory issues. I know that this is not the officially supported way to access the data, but it works very well for us. Perhaps this is an option for you?

Hi @m.w,

There are some issues with API Extract which I suspect will get fixed and improved over time. Feature-wise, API Query profiles can do all the same things as API Extract; the difference is that you have to write the RSD yourself. We are doing exactly what you are attempting to do with Task History using API Query: we extract the history every night and load it into Snowflake.

We are doing this extract a little differently from what you are attempting. The first load needs to be done in batches, which is easy enough using start and end dates and times. Then we load incrementally every night based on the last load date and time. Keep in mind that there is an end to Task History: Matillion only keeps so much history before the oldest entries start falling off. This is another reason why our job runs daily; because we load every night, we are sure we are getting all of the history.
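In case it is useful, here is a minimal sketch of that incremental pattern, assuming you persist the last successfully loaded timestamp somewhere (a Snowflake table, a file, a job variable); the function name and credentials here are hypothetical.

```python
import requests
from datetime import datetime

BASE = "http://<InstanceAddress>/rest/v1"        # placeholder
GROUP, PROJECT = "<groupName>", "<projectName>"  # placeholders

def fetch_task_history_since(last_loaded: datetime, now: datetime):
    """Fetch only the task history added since the previous nightly load."""
    url = (f"{BASE}/group/name/{GROUP}/project/name/{PROJECT}"
           f"/task/filter/by/start/range"
           f"/date/{last_loaded:%Y-%m-%d}/time/{last_loaded:%H:%M:%S}"
           f"/to/date/{now:%Y-%m-%d}/time/{now:%H:%M:%S}")
    response = requests.get(url, auth=("api-user", "api-password"))
    response.raise_for_status()
    return response.json()

# e.g. delta = fetch_task_history_since(last_loaded_ts, datetime.now()),
# then append the delta to the Snowflake table and persist the new watermark.
```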

I have attached our API Query profile, which has all of the Matillion API methods/calls that we use. The specific ones that would be of interest to you would be these:

Here are a couple of screenshots that show the two jobs we built to do the nightly load, in case it helps paint a bit of a picture.

If you plan to load the JSON directly from the database, you should plan on it breaking at some point. There is no guarantee that the data will stay the same or that Matillion won't change the schema of the database from version to version. This is why the API method is safer: if they change the schema of the database or change the data, the APIs will follow those changes and just keep working. I hope this helps!

Could you elaborate on how you connect directly to the Postgres database? I too am not satisfied with their API method for extracting this information and would rather just get it directly from their DB tables.

Hi @MichaelBlack,

Sure, I will do that. It's actually quite easy:

Just use the Database Query component and choose to connect to a PostgreSQL database (there is also a small Python sketch after the table list below).

  • The connection string is: jdbc:postgresql://127.0.0.1:5432/
  • The database user is postgres (default user) with an empty password.
  • Look for the tables in the public schema.

We found the following tables useful:

  • AUDIT
  • AUDIT_SPECIFIERS
  • CREDENTIALS
  • CUSTOMERS
  • FILESYNC
  • JOBS
  • MELT_JOBS
  • MELT_METADATA
  • MODEL_METADATA
  • OAUTH
  • PERMISSION_GROUPS
  • PERMISSION_USERS
  • PERMISSIONS
  • PROJECTS
  • QRTZ_CRON_TRIGGERS
  • QRTZ_JOB_DETAILS
  • QRTZ_LOCKS
  • QRTZ_SCHEDULER_STATE
  • QRTZ_SIMPLE_TRIGGERS
  • QRTZ_TRIGGERS
  • RECYCLE_BIN
  • RUN_HISTORY_JOB
  • SCM_CREDENTIALS
  • SCM_FILESYNC
  • TASK_BATCH_HISTORY
  • TASK_PACKAGE_HISTORY
  • USERS
  • WEBHOOK_PAYLOAD
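To make the wiring concrete, here is a minimal sketch in Python (psycopg2) of the same connection outside the Database Query component, run on the Matillion instance itself. Note that unquoted identifiers in Postgres fold to lowercase, so I am assuming the tables appear as, e.g., task_batch_history; the columns vary by Matillion version, so the sketch inspects them rather than hard-coding names.

```python
import psycopg2

# Same connection details as above: localhost, default postgres user, no password.
conn = psycopg2.connect(host="127.0.0.1", port=5432, user="postgres", password="")
with conn, conn.cursor() as cur:
    # List the tables in the public schema to see what is available.
    cur.execute("""SELECT table_name FROM information_schema.tables
                   WHERE table_schema = 'public' ORDER BY table_name""")
    print([row[0] for row in cur.fetchall()])

    # Peek at task history; columns vary by version, so read them dynamically.
    cur.execute("SELECT * FROM task_batch_history LIMIT 5")
    print([col.name for col in cur.description])
    for row in cur.fetchall():
        print(row)
conn.close()
```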

Let me know how this works for you. Good luck!