We're getting tens of thousands of errors in the catalina log for our dual-node HA clustered environment (error message below). Is this related to a time sync issue between the nodes?

INFO [QuartzScheduler_QuartzScheduler-10.82.105.132_ClusterManager] org.quartz.impl.jdbcjobstore.JobStoreSupport.clusterRecover ClusterManager: Scanning for instance "x.x.x.x"'s failed in-progress jobs.

Hi @Frederick.Wright​,

This is probably a question for Matillion support, simply because I don't think there are many HA setups out there, so few people here will have run into your situation. It wouldn't hurt to post back what you find out from support, though. Sorry I couldn't be of more help. 😞

Hi @Frederick.Wright

That message is specific to the clustered HA architecture. In HA, you have two or more running nodes that can run Matillion jobs, and if one node fails, another node can take over its job workload. The message comes from that failover behavior: Matillion is checking whether any jobs were running on a node but failed because of a problem with that node. If it finds any, it triggers the failed job to run on an available node. So it's just normal Matillion behavior, and the messages are so frequent because the check runs often, which lets the HA architecture be as resilient as possible. I hope that clears up any confusion.
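If it helps to quantify the volume, here is a minimal sketch for tallying these scan messages per instance. It assumes the message text matches the line quoted in the question; the function name `count_scans` is made up for illustration and is not part of Matillion or Quartz.

```python
import re
from collections import Counter

# Captures the instance name quoted in the ClusterManager scan message.
# The pattern is based only on the single log line quoted in this thread.
SCAN_RE = re.compile(
    r'ClusterManager: Scanning for instance "([^"]+)"\'s failed in-progress jobs'
)


def count_scans(lines):
    """Return a Counter of scan messages keyed by the instance being scanned."""
    counts = Counter()
    for line in lines:
        match = SCAN_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


sample = [
    'INFO [QuartzScheduler_QuartzScheduler-10.82.105.132_ClusterManager] '
    'org.quartz.impl.jdbcjobstore.JobStoreSupport.clusterRecover '
    'ClusterManager: Scanning for instance "x.x.x.x"\'s failed in-progress jobs.',
    'INFO some unrelated line',
]
print(count_scans(sample))  # Counter({'x.x.x.x': 1})
```

To run it against a real log, pass an open file handle instead of `sample`, since iterating a file yields one line at a time.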

Thanks @Bryan - I have temporarily stopped the messages by idling the other node, and will check again after we have completed the upgrade to v1.61.6, since the dependency on NTP may be interfering: the ntp package (ntpd) was removed in RHEL 8.x in favor of chrony. So I suspect there is a synchronization issue, and fingers crossed it may be resolved after the upgrade!
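For anyone wondering why clock sync could matter here, this is a rough sketch of the general idea, not Quartz's or Matillion's actual code. It assumes each node stamps its heartbeat with its own system clock and that a peer is treated as failed once the heartbeat is older than the check-in interval plus a grace period. The function name and all the numbers are hypothetical.

```python
def looks_failed(last_checkin_ms, checkin_interval_ms, observer_now_ms, grace_ms=7_500):
    """Simplified rule: the peer looks failed if its last heartbeat is older
    than one check-in interval plus a grace period, by the observer's clock."""
    return last_checkin_ms + checkin_interval_ms + grace_ms < observer_now_ms


# Assumed values for illustration only.
peer_checkin = 100_000   # heartbeat time, stamped with the peer's own clock (ms)
interval = 20_000        # check-in interval (ms)
real_elapsed = 5_000     # real time since that heartbeat (ms)

for skew in (0, 30_000):  # how far the observer's clock runs ahead of the peer's (ms)
    observer_now = peer_checkin + real_elapsed + skew
    print(skew, looks_failed(peer_checkin, interval, observer_now))
# prints:
# 0 False
# 30000 True
```

In this toy model the peer is healthy in both cases. Only the 30-second skew makes it look failed, and that is the kind of false positive a sync problem between the nodes could produce.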