UPDATE: SUPPORT CASE OPENED, AND THERE ARE KNOWN BUGS WITH THE FILE ITERATOR
TLDR: There is at least one bug, possibly two, in the Matillion file iterator component. The problems first appeared when Matillion changed the component's regex behavior in release 1.74. Matillion already knew of one bug, where recursion was not being handled, and their internal ticket offers a regex filter that seems to restore the intended behavior. That is not my issue, though; the problems go beyond what they tested. I am seeing the file iterator fail to process all subfolders in an s3 bucket: it processes only the first 35 of 49 folders (folder names sorted alphabetically), even though every folder contains a file whose name matches the regex. Support referred me to their internal ticket, which, as I mentioned, does not address my issue.
We use the file iterator extensively, and this bug prevents us from updating past 1.73.x. URGING THE ENGINEERING TEAM TO ADDRESS THIS.
Below are the explanation and test scenarios I provided in my support case:
Thank you for sharing the similar issue and video. However, the problem I have is different from the one in your internal ticket, and the workaround does not apply to it. My problem involves a regex filter I have already been using. Unlike in your workaround video, the file iterator evaluates only the first 35 of the 49 folders in the bucket, while the regex filter should select files from all 49 folders. I first noticed this bug in the 1.74.0 update. I have uploaded a screenshot of the s3 bucket, one level down from the top-level folder. Each of these 49 folders contains a file that matches the regex filter. In Matillion v1.31.1 all 49 folders/files would be processed, but now in the 1.75.x version only the first 35 are processed by the file iterator. Below are the filter regex and the two tests I ran:
Filter regex (contains three job variables):
^((?!.*archive)${JOB_FILE_SPEC}.*${JOB_CLIENT_ID}.*ota_kpi_${JOB_FILE_GRAIN}.*[.]csv)
Test 1: In the s3 bucket screenshot, the YELLOW highlighted folder should be processed for client "corpus_christi"
Variables passed:
${JOB_FILE_SPEC} = (?!.*historical)
${JOB_CLIENT_ID} = corpus_christi
${JOB_FILE_GRAIN} = day
/corpus_christi/ota_kpi_day.csv <-- SELECTED BY ITERATOR WHICH IS EXPECTED
/corpus_christi/ota_kpi_day_historical.csv <-- Does not get selected as expected
/corpus_christi/archive/ota_kpi_day.csv <-- Does not get selected as expected
Test 2: In the s3 bucket screenshot, the GREEN highlighted folder should be processed for client "vermont"
Variables passed:
${JOB_FILE_SPEC} = (?!.*historical)
${JOB_CLIENT_ID} = vermont
${JOB_FILE_GRAIN} = day
/vermont/ota_kpi_day.csv <-- NOT SELECTED BY ITERATOR, WHICH IS NOT EXPECTED (THIS IS THE BUG)
/vermont/ota_kpi_day_historical.csv <-- Does not get selected as expected
/vermont/archive/ota_kpi_day.csv <-- Does not get selected as expected
When iterating over the entire s3 bucket, matching files from only the first 35 folders (sorted alphabetically) are processed. The iterator seems to ignore the bottom 14 folders.