HELP! File iterator filter regex not working after Matillion update

CR295672 · September 30, 2024, 5:01pm

I recently updated from Matillion 1.73.1 to 1.75.10 and we have noticed that the file iterator component no longer behaves as expected. Note, I also experienced this same issue back when updating to 1.74 and I had to roll back to 1.73.1.

The current issue is that the file iterator checks an s3 bucket for files matching the filter regex. In previous versions of Matillion, this method worked without issue. With recent updates, our file iterator only processes a portion of the files and fails to match any others. All files are csv files that follow a naming standard, and they are processed in alphabetical order. Only the first 35 out of 49 files get processed.

A current regex filter expression looks like:

^((?!.*archive)${JOB_CLIENT_ID}.*myfile.*[.]csv)

Some details: The variable ${JOB_CLIENT_ID} holds the text name of the client id folder. The expression also starts with the negative lookahead (?!.*archive) to prevent any files in a folder named "archive" from being processed.
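For anyone who wants to sanity-check the expression outside Matillion, here is a minimal Python sketch of how the filter behaves once the variable is resolved. It assumes ${JOB_CLIENT_ID} is substituted as plain text and that the regex is tested against the object key from its start. The keys and the build_filter helper are hypothetical and only for illustration, and Python's regex engine is not necessarily the one Matillion uses.

```python
import re

def build_filter(client_id: str) -> str:
    # Plain text substitution of ${JOB_CLIENT_ID} (assumed to mirror
    # how the job variable is resolved before the regex is applied).
    template = r"^((?!.*archive)${JOB_CLIENT_ID}.*myfile.*[.]csv)"
    return template.replace("${JOB_CLIENT_ID}", client_id)

pattern = re.compile(build_filter("corpus_christi"))

# Hypothetical object keys, made up for this example.
keys = [
    "corpus_christi/myfile_2024.csv",
    "corpus_christi/archive/myfile_2024.csv",
    "corpus_christi/myfile_2024.txt",
    "vermont/myfile_2024.csv",
]

# Keep only the keys the filter accepts.
matched = [key for key in keys if pattern.match(key)]
print(matched)
```

Only the first key should come back: the "archive" key is rejected by the lookahead, the .txt key by the extension, and the "vermont" key by the client id.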

This is extremely frustrating and I checked my regex in an expression tester without issue.

CR295672 · September 30, 2024, 7:11pm

UPDATE - WITH MORE DETAIL

The file iterator behavior in my case goes like this... in my s3 bucket, there are 49 "folders" each having a single file that matches the file iterator regex filter. So there are a total of 49 files (49 iterations) that should process. After updating Matillion to 1.75.10 from 1.73.1, the iterator only treats the first 35 folders (files) as candidates. The remaining folders (sorted alphabetically) are ignored by the iterator. I can repeat this test with the same result.

I use the file iterator extensively in many jobs, so even if I could use a workaround, it would take a large amount of redevelopment/testing. I have just opened a support case for this but am still waiting to see it posted.

CR295672 · October 4, 2024, 2:43pm

UPDATE - SUPPORT CASE OPENED AND THERE ARE KNOWN BUGS WITH FILE ITERATOR

TLDR: There are at least one or two bugs with the Matillion file iterator component that first appeared when they modified regex behavior with the component in release 1.74. They knew of one bug where recursion was not being handled, and in their internal ticket they offered a regex filter that seems to support the intended behavior. But that is not this issue - the issues go beyond what they tested. I am seeing that the file iterator fails to process all subfolders in an s3 bucket. It will process the first 35 out of 49 folders (folder names sorted alphabetically). Inside each folder is a file name that matches the regex. They referred me to their internal ticket, which as I mentioned does not address my issue.

We use the file iterator extensively and this bug prevents us from updating past 1.73.x. URGING THE ENGINEERING TEAM TO ADDRESS THIS.

Below are the explanation and test scenarios I provided in my support case:

Thank you for sharing the similar issue and video, however the problem I have is different from your internal ticket and the workaround is not applicable to our issue. In fact, the problem I have involves a regex filter I have already been using, but unlike your workaround video, the file iterator is only evaluating the first 35 of 49 folders in the bucket, while the regex filter should be selecting files from all 49 folders. This bug was first noticed in the 1.74.0 update. I have uploaded a screen shot of the s3 bucket, one level down from the top-level folder. Each of these 49 folders contains a file that matches the regex filter. In Matillion v1.73.1 all 49 folders/files would be processed. But now in the 1.75.x version only the first 35 are processed by the file iterator. Below is the filter regex and two tests I ran:

Filter regex (contains three job variables):

^((?!.*archive)${JOB_FILE_SPEC}.*${JOB_CLIENT_ID}.*ota_kpi_${JOB_FILE_GRAIN}.*[.]csv)

Test 1: In s3 bucket screen shot, the YELLOW highlighted folder should be processed for client "corpus_christi"

Variables passed:

${JOB_FILE_SPEC} = (?!.*historical)

${JOB_CLIENT_ID} = corpus_christi

${JOB_FILE_GRAIN} = day

/corpus_christi/ota_kpi_day.csv <-- SELECTED BY ITERATOR, WHICH IS EXPECTED

/corpus_christi/ota_kpi_day_historical.csv <-- Not selected, which is expected

/corpus_christi/archive/ota_kpi_day.csv <-- Not selected, which is expected

Test 2: In s3 bucket screen shot, the GREEN highlighted folder should be processed for client "vermont"

Variables passed:

${JOB_FILE_SPEC} = (?!.*historical)

${JOB_CLIENT_ID} = vermont

${JOB_FILE_GRAIN} = day

/vermont/ota_kpi_day.csv <-- NOT SELECTED BY ITERATOR, WHICH IS NOT EXPECTED

/vermont/ota_kpi_day_historical.csv <-- Not selected, which is expected

/vermont/archive/ota_kpi_day.csv <-- Not selected, which is expected

When iterating over the entire s3 bucket, matching files from only the first 35 folders (sorted alphabetically) are processed. The iterator seems to ignore the bottom 14 folders.
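To show what the filter should select independent of the iterator, here is a rough Python sketch of Test 1 and Test 2. It assumes the three job variables are substituted as plain text and that the regex is applied to the full path from its start. The resolve, select, and keys_for helpers are hypothetical names used only for this sketch, and Python's regex engine may differ from the one Matillion uses.

```python
import re

FILTER_TEMPLATE = (
    r"^((?!.*archive)${JOB_FILE_SPEC}.*${JOB_CLIENT_ID}"
    r".*ota_kpi_${JOB_FILE_GRAIN}.*[.]csv)"
)

def resolve(template, variables):
    # Replace each ${NAME} placeholder with its value as plain text
    # (an assumption about how the job variables are resolved).
    for name, value in variables.items():
        template = template.replace("${" + name + "}", value)
    return template

def select(keys, variables):
    # Return the keys accepted by the resolved filter regex.
    pattern = re.compile(resolve(FILTER_TEMPLATE, variables))
    return [key for key in keys if pattern.match(key)]

def keys_for(client):
    # The three paths listed for each client in the tests above.
    return [
        f"/{client}/ota_kpi_day.csv",
        f"/{client}/ota_kpi_day_historical.csv",
        f"/{client}/archive/ota_kpi_day.csv",
    ]

all_keys = keys_for("corpus_christi") + keys_for("vermont")

results = {}
for client in ("corpus_christi", "vermont"):
    variables = {
        "JOB_FILE_SPEC": "(?!.*historical)",
        "JOB_CLIENT_ID": client,
        "JOB_FILE_GRAIN": "day",
    }
    results[client] = select(all_keys, variables)
    print(client, results[client])
```

Under these assumptions both clients select exactly one file, ota_kpi_day.csv in their own folder, so the regex itself treats "vermont" the same way as "corpus_christi". Running the same select function over a full listing of the bucket and comparing the result with the files the iterator actually processed would show which keys are being skipped.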
