Extract API task history to monitor it

Hi there,

We have been using the API Query component to extract data out of the Matillion Postgres database.

We are not really satisfied with the flattening in the RSD, and since we can do that later in Snowflake itself, we thought we would switch to the API Extract component, which in my eyes is more flexible and better feature-wise.

But now we are running into out-of-memory issues: the Matillion instance becomes unstable, hits heap space errors, and then Tomcat restarts.

This happens even when we use paging. I thought paging was supposed to prevent that, but it doesn't.

Has anybody got a working solution using API Extract to extract data from the task history? Or any other suggestions?

Thanks,

Marti

Hi mw,

I don't think Matillion implemented paging in that part of the API, so during processing it tries to hold up to an entire year's worth of task history in memory. That's what can cause the OutOfMemoryError (Java heap space) errors.

According to the docs, you can filter by a date range:

http://<InstanceAddress>/rest/v1/group/name/<groupName>/project/name/<projectName>/task/filter/by/start/range/date/<startDate>/time/<startTime>/to/date/<endDate>/time/<endTime>

So you may be able to loop, for example month by month, and fetch a smaller amount of data at a time.
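For illustration, here is a rough sketch in Python of that month-by-month loop against the date-range endpoint. The instance address, group/project names, and credentials are placeholders, and I'm assuming ISO-style date and time formats; check the docs for the exact format your version expects.

```python
import requests
from datetime import date, timedelta

# Placeholders -- substitute your own instance address, names, and credentials.
BASE = "http://<InstanceAddress>/rest/v1"
GROUP, PROJECT = "<groupName>", "<projectName>"
AUTH = ("api-user", "api-password")

def month_ranges(first, last):
    """Yield (first_day, last_day) pairs for each month from first to last."""
    current = date(first.year, first.month, 1)
    while current <= last:
        next_month = date(current.year + current.month // 12,
                          current.month % 12 + 1, 1)
        yield current, min(next_month - timedelta(days=1), last)
        current = next_month

all_tasks = []
for start, end in month_ranges(date(2021, 1, 1), date(2021, 12, 31)):
    # One month per request keeps the payload (and Matillion's heap) small.
    url = (f"{BASE}/group/name/{GROUP}/project/name/{PROJECT}"
           f"/task/filter/by/start/range/date/{start:%Y-%m-%d}/time/00:00:00"
           f"/to/date/{end:%Y-%m-%d}/time/23:59:59")
    response = requests.get(url, auth=AUTH)
    response.raise_for_status()
    all_tasks.extend(response.json())
```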

 

If you want quick summary stats, you could take a look at Matillion's own JDBC driver.

CK

I have experienced the same issue with the API Extract component and the Task History API. You have to add a filter to reduce the number of task history entries. From my experience, it is not possible to query more than a week of task history data; anything above that causes the OOM error. Of course, this depends highly on the number of jobs/tasks that you run per week.

I have dismissed my plans to use the API, and I now query the local Postgres database directly. This is much faster for larger amounts of data and does not cause memory issues. I know that this is not the officially supported way to access the data, but it works very well for us. Perhaps this is an option for you?

Hi @m.w,

There are some issues with API Extract which I suspect will get fixed and improved over time. Feature-wise, API Query profiles can do all the same things as API Extract; the difference is that you have to write the RSD yourself. We are doing exactly what you are attempting to do with Task History using API Query: we extract the history every night and load it into Snowflake.

We are doing this extract a little differently from what you are attempting. The first load needs to be done in batches, which is easy enough using start and end dates and times. Then we load incrementally every night based on the last load date and time. Keep in mind that there is an end to Task History: Matillion only keeps so much history before the oldest entries start falling off. This is another reason why our job runs daily; because we load every night, we are sure we are getting all of the history.
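In case it is useful, here is a minimal sketch of that incremental pattern, assuming you persist the last successfully loaded timestamp somewhere (a Snowflake table, a file, a job variable); the function name and credentials here are hypothetical.

```python
import requests
from datetime import datetime

BASE = "http://<InstanceAddress>/rest/v1"        # placeholder
GROUP, PROJECT = "<groupName>", "<projectName>"  # placeholders

def fetch_task_history_since(last_loaded: datetime, now: datetime):
    """Fetch only the task history added since the previous nightly load."""
    url = (f"{BASE}/group/name/{GROUP}/project/name/{PROJECT}"
           f"/task/filter/by/start/range"
           f"/date/{last_loaded:%Y-%m-%d}/time/{last_loaded:%H:%M:%S}"
           f"/to/date/{now:%Y-%m-%d}/time/{now:%H:%M:%S}")
    response = requests.get(url, auth=("api-user", "api-password"))
    response.raise_for_status()
    return response.json()

# e.g. delta = fetch_task_history_since(last_loaded_ts, datetime.now()),
# then append the delta to the Snowflake table and persist the new watermark.
```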

I have attached our API Query profile, which has all of the Matillion API methods/calls that we use. The specific ones that would be of interest to you would be these:

Here are a couple of screenshots that show the two jobs we built to do the nightly load, in case it helps paint a bit of a picture.

If you plan to load the JSON directly from the database, you should plan on it breaking at some point. There is no guarantee that the data will stay the same or that Matillion won't change the schema of the database from version to version. This is why the API method is safer: if they change the schema of the database or change the data, the APIs will follow those changes and just keep working. I hope this helps!

Could you elaborate on how you connect directly to the Postgres database? I too am not satisfied with their API method for extracting this information and would rather just get it directly from their DB tables.

Hi @MichaelBlack,

Sure, I will do that. It's actually quite easy:

Just use the Database Query component and choose to connect to a PostgreSQL database (there is also a small Python sketch after the table list below).

  • The connection string is: jdbc:postgresql://127.0.0.1:5432/
  • The database user is postgres (default user) with an empty password.
  • Look for the tables in the public schema.

We found the following tables useful:

  • AUDIT
  • AUDIT_SPECIFIERS
  • CREDENTIALS
  • CUSTOMERS
  • FILESYNC
  • JOBS
  • MELT_JOBS
  • MELT_METADATA
  • MODEL_METADATA
  • OAUTH
  • PERMISSION_GROUPS
  • PERMISSION_USERS
  • PERMISSIONS
  • PROJECTS
  • QRTZ_CRON_TRIGGERS
  • QRTZ_JOB_DETAILS
  • QRTZ_LOCKS
  • QRTZ_SCHEDULER_STATE
  • QRTZ_SIMPLE_TRIGGERS
  • QRTZ_TRIGGERS
  • RECYCLE_BIN
  • RUN_HISTORY_JOB
  • SCM_CREDENTIALS
  • SCM_FILESYNC
  • TASK_BATCH_HISTORY
  • TASK_PACKAGE_HISTORY
  • USERS
  • WEBHOOK_PAYLOAD
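To make the wiring concrete, here is a minimal sketch in Python (psycopg2) of the same connection outside the Database Query component, run on the Matillion instance itself. Note that unquoted identifiers in Postgres fold to lowercase, so I am assuming the tables appear as, e.g., task_batch_history; the columns vary by Matillion version, so the sketch inspects them rather than hard-coding names.

```python
import psycopg2

# Same connection details as above: localhost, default postgres user, no password.
conn = psycopg2.connect(host="127.0.0.1", port=5432, user="postgres", password="")
with conn, conn.cursor() as cur:
    # List the tables in the public schema to see what is available.
    cur.execute("""SELECT table_name FROM information_schema.tables
                   WHERE table_schema = 'public' ORDER BY table_name""")
    print([row[0] for row in cur.fetchall()])

    # Peek at task history; columns vary by version, so read them dynamically.
    cur.execute("SELECT * FROM task_batch_history LIMIT 5")
    print([col.name for col in cur.description])
    for row in cur.fetchall():
        print(row)
conn.close()
```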

Let me know how this works for you. Good luck!