Communications Link Failure: The last packet sent successfully to the server was 0 milliseconds ago

Good day, all!

 

We are running into a critical production issue for a customer of ours who decided it wise to leave their Matillion -> Snowflake environment unattended for two years.

 

On 9/2 we were contacted and told that the ETL jobs were failing with the error message in the title: "Communications Link Failure: The last packet sent successfully to the server was 0 milliseconds ago".

 

Upon initial inspection, I discovered that their Matillion environment was still running 1.40, and immediately performed a backup/upgrade to the latest version. The driver they had been using was a JDBC v8 driver, which Matillion 1.40 did not officially support (I believe official support came in 1.42).

 

Long story short, they're now running the latest version of Matillion with an updated MySQL JDBC driver (8.0.26), and we're still receiving the same errors!

 

Now for the weirdest part: SOMETIMES the job will run and successfully complete the first (and sometimes second) queries in the job before failing. MOST of the time (like 95% of the time) it will fail on the first query in the job.

 

Thanks ahead of time for your collective brainpower with this. It is a critical issue for a rather large customer.

 

---------------------

Below is the Matillion support information for added clarity:

 

-- METL Version
1.51.8 (build 979)

-- License
None

-- Disk:Free
35.67 Gb

-- Disk:Usable
35.57 Gb

-- Disk:Total
39.25 Gb

-- Scheduler ID
NON_CLUSTERED

-- Scheduled Jobs Ran
5

-- Scheduler Durable
false

-- Persistent Mode
false

-- Server:TimeZone
Universal; GMT+0000

-- MySQL 8.0
mysql-connector-java-8.0.26.jar

-- Tomcat: Version
Apache Tomcat/8.5.69

-- Tomcat: Built
Jul 23 2021 20:47:50 UTC

-- Tomcat: Server Number
8.5.69.0

-- Operating System: Name
Linux

-- Operating System: Version
4.14.114-83.126.amzn1.x86_64

-- Operating System: Architecture
amd64

-- Operating System: System Info
Linux ip-10-255-1-137 4.14.114-83.126.amzn1.x86_64 #1 SMP Tue May 7 02:26:58 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux

-- Operating System: Distribution Info
NAME="Amazon Linux AMI"
VERSION="2018.03"
ID="amzn"
ID_LIKE="rhel fedora"
VERSION_ID="2018.03"
PRETTY_NAME="Amazon Linux AMI 2018.03"
ANSI_COLOR="0;33"
CPE_NAME="cpe:/o:amazon:linux:2018.03:ga"
HOME_URL="http://aws.amazon.com/amazon-linux-ami/"

-- JVM Version
1.8.0_201-b09

-- JVM Vendor
Oracle Corporation

-- Cloud Platform
aws

-- Target Warehouse
snowflake

-- Repository Database
Unknown

-- Cluster
Not clustered

-- Type
t3.medium

-- AMI ID
ami-0b901a7ee09e11f3d

-- ImageID
ami-0b901a7ee09e11f3d

-- AZ
us-east-2a

-- Region
us-east-2

-- Client:UserAgent
mozilla/5.0 (macintosh; intel mac os x 10_15_7) applewebkit/537.36 (khtml, like gecko) chrome/94.0.4606.71 safari/537.36

-- Client:Resolution
1680x914

-- Client:TimeZone
GMT-07:00

Here is a screenshot of the failure error

Hi @northcollin,

From what I can tell, the upgrade that was performed was an in-place upgrade. I say that because the AMI and OS are old, yet the Matillion version number is newer. I would highly suggest upgrading the EC2 instance, as a few Tomcat vulnerabilities have been found that are fixed in newer AMIs and OS updates.

Troubleshooting communication failures is a tough gig. It usually means something within the AWS infrastructure is misconfigured, or is overloaded to the point that packets are being dropped. It could also be that the MySQL DB is being backed up and happens to be offline at that time. I am assuming the MySQL instance is within AWS? If so, I would focus on the logs of the EC2 instance and MySQL. Download the Catalina logs from the EC2 instance and see if you can glean any extra detail from there.

The better question may be whether you can see the connection attempt on the MySQL side when the failure occurs. If you never see the attempt in MySQL, that suggests a failure within the network infrastructure or possibly the Matillion instance. From there, I would start working back from MySQL towards Matillion, which would require scrutinizing the CloudWatch logs as they pertain to network infrastructure, traffic, etc.
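One rough way to separate a network problem from a driver or Matillion problem is to test plain TCP reachability from the Matillion instance to the MySQL host, outside of JDBC entirely. The Python sketch below is only an illustration: `probe_tcp` and `probe_many` are names made up for this example, and the host and port you pass in are whatever your MySQL endpoint actually is. It only opens and closes a socket, so a success shows that the port is reachable, not that the MySQL handshake or login works.

```python
import socket
import time


def probe_tcp(host, port, timeout=3.0):
    """Attempt one TCP connection; return (ok, elapsed_ms, error_text)."""
    start = time.monotonic()
    try:
        # create_connection resolves the host name and opens the socket.
        with socket.create_connection((host, port), timeout=timeout):
            elapsed_ms = (time.monotonic() - start) * 1000.0
            return True, elapsed_ms, None
    except OSError as exc:
        # Covers refused connections, timeouts, and DNS failures.
        elapsed_ms = (time.monotonic() - start) * 1000.0
        return False, elapsed_ms, repr(exc)


def probe_many(host, port, attempts=20, pause=0.5, timeout=3.0):
    """Probe repeatedly and summarise how many attempts failed."""
    results = []
    for i in range(attempts):
        results.append(probe_tcp(host, port, timeout))
        if i < attempts - 1:
            time.sleep(pause)
    failures = [r for r in results if not r[0]]
    return {
        "attempts": attempts,
        "failures": len(failures),
        "errors": [r[2] for r in failures],
    }
```

For example, `print(probe_many("mysql.example.internal", 3306))` (a placeholder host name) run on the Matillion instance would report how many of 20 attempts failed. If the probes fail intermittently too, that points at the network path rather than the JDBC driver. Keep the attempt count modest, since MySQL may count bare connections that never complete a handshake as connection errors for that host.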

I hope this helps in some way. Good luck!