How can we set the PATH in the Azure Blob Storage Load component? Here is what I want to achieve:

COPY INTO ... FROM @"DEV_STG"."PUBLIC"."STAGE"/XX/IN PATTERN = '.*CONNECTIONS.*'

I got it working using the PATTERN option instead of the PATH. From a functional perspective this works the same, but performance is far worse: ADLS simply seems to be much less efficient when filtering with the PATTERN clause than when given a path directly. I did a simple test with the LIST command to demonstrate:

LIST @"DEV_STG"."PUBLIC"."STAGE"/XX/IN PATTERN = '.*CONNECTIONS.*'

-- <1 second

;

LIST @"DEV_STG"."PUBLIC"."STAGE" PATTERN = 'XX/IN/.*CONNECTIONS.*'

-- >60seconds

;

The output is the same. When our ADLS was really small this wasn't an issue, but once it grew to thousands of files (in other directories than the XX example!), performance suddenly became much worse.

The option to supply a relative path (as in the External Table component) seems to be missing from the Azure Blob Storage Load component in Matillion, or am I missing something? For reference, this is the COPY statement currently being generated:

COPY INTO "DEV_STG"."PUBLIC"."STAGE_CSV_FILE" FROM

@"STAGE"

PATTERN='.*'

FILE_FORMAT= (

FORMAT_NAME='"FF_CAR"'

)

ON_ERROR='ABORT_STATEMENT'

PURGE=FALSE

TRUNCATECOLUMNS=FALSE

FORCE=FALSE

Hi @BATENBURG,

There are a couple of ways to accomplish this, but a lot of it hinges on the folder structure of your Blob Storage. As you have probably seen, load performance is better if you can pare down the folders in your Blob Storage you are loading from and the quantity of files in those folders. If the folder structure in your Blob Storage follows best practices, then this could be fairly easy to implement.

The best practice is to partition your storage into folders based on some logic. The typical design is some variation of /YYYY/MM/DD/HH/MM/SS. This type of structure can speed up loads considerably as your data grows over time, while giving you flexibility around what data is loaded or reloaded. For instance, if you want to go back and load or reload all data from a specific day or hour, the folder path gives you a natural way to do that.
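To make that concrete, here is a minimal sketch of how a daily partition translates into a stage path prefix that a load (or reload) can target directly; the XX/IN prefix and the date are purely illustrative:

from datetime import date

# Illustrative only: build the stage path prefix for one day's partition,
# so a load or reload can target just that folder instead of scanning
# the whole container with a PATTERN.
day = date(2021, 6, 15)
prefix = f"XX/IN/{day:%Y/%m/%d}/"
print(prefix)  # -> XX/IN/2021/06/15/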

We use the above pattern. On the Matillion side, we generate a list of paths to load from using string manipulation in Python and store those paths in a grid variable. From there we use a Grid Iterator tied to the Load component, and on each iteration we pass the path into the load component. The load time never grows or shrinks, because it is always the same number of folders being loaded and the file count is similar each day. A sketch of that Python step is below.
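A minimal sketch of that script, assuming it runs in a Matillion Python Script component (which injects the context object) and that a grid variable named load_paths with a single path column already exists in the job; the XX/IN prefix and the three-day window are illustrative:

from datetime import date, timedelta

# Illustrative window: load the three most recent daily partitions.
days_back = 3
today = date.today()

# One row per folder path; each row is itself a list because grid
# variables are written as a list of rows.
paths = [[f"XX/IN/{(today - timedelta(days=n)):%Y/%m/%d}/"]
         for n in range(days_back)]

# context is provided by Matillion's Python Script component;
# this writes the rows into the "load_paths" grid variable.
context.updateGridVariable('load_paths', paths)

A Grid Iterator bound to load_paths then calls the Azure Blob Storage Load component once per row, passing the path in, so each load only ever scans a handful of dated folders.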

This may not apply directly to your situation, but hopefully it gives you an idea.