Hi, I am new to Matillion. I am trying to load almost 5 million JSON files into Snowflake using S3 Load. It is giving an error: 'Total size (>=1,073,742,040 bytes) for the list of file descriptors returned from the stage exceeded limit (1,073,741,824 bytes); Number'
Hi @nadeem.a,
I haven't seen this error before, but I wonder how you are loading the JSON files and how big they are.
Answers to the questions below will help us answer you better:
- Are you loading the files into a Variant column in a table?
- If so, are any of the files larger than 16 MB compressed?
- Do you have an external stage in Snowflake that is pointed to your S3 bucket?
There are a lot of different factors at play depending on the files and what you have configured in Snowflake. If you can answer those questions, it will give a clearer picture. It might also help if you could provide a screenshot of the S3 Load configuration that you have. Thanks for posting!
Welcome to the Matillion Community, @nadeem.a
Hi @Bryan (Customer) ,
Thanks for your response.
Each file is less than 500 bytes, and there are more than 5 million files in the folder.
Also, I am loading the files into a Variant column, and there is a stage pointing to the S3 bucket.
Hi @nadeem.a,
What you have looks correct. I have not personally tried to load that large a quantity of files all at once, which could be where the error is coming from. The best practice for S3 and Blob storage is to break your folder structure down by year, month, day, hour, minute, and second. With that said, there is some flexibility based on the data. For example, our S3 buckets are typically organized using the format /YYYY/MM/DD/. The core reason for this best practice is loading and querying the data: if you ever want to load a single day, you load just that folder. Snowflake also mentions this in their documentation: https://docs.snowflake.com/en/user-guide/data-load-considerations-stage.html#label-organizing-data-by-path
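For illustration, here is a rough Python sketch of writing files under a date-partitioned key. The prefix and file names are just placeholders I made up, not anything from your bucket:

```python
from datetime import datetime, timezone

# Hypothetical example: build a date-partitioned S3 key for a new JSON file.
# The "landing/events" prefix and the file name are placeholders.
def build_s3_key(prefix, file_name):
    now = datetime.now(timezone.utc)
    return f"{prefix}/{now:%Y/%m/%d}/{file_name}"

print(build_s3_key("landing/events", "event-000123.json"))
# e.g. landing/events/2024/05/17/event-000123.json
```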
When you are configuring the S3 Load component, you are effectively giving Matillion the information needed to create a "COPY INTO" statement that will be run in Snowflake. In a nutshell, Matillion isn't performing magic to load the data. It simply provides an easy-to-navigate interface that creates the necessary SQL statement to be run in Snowflake.
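To make that concrete, here is a rough sketch of the shape of statement it ends up generating. This is not Matillion's exact output, and the table, stage, and file format names below are placeholders:

```python
# Rough sketch of the kind of COPY INTO statement the S3 Load component builds
# behind the scenes. The table, stage, and file format names are placeholders.
def build_copy_into(table, stage, prefix, file_format):
    return (
        f"COPY INTO {table} (raw) "
        f"FROM @{stage}/{prefix} "
        f"FILE_FORMAT = (FORMAT_NAME = {file_format})"
    )

print(build_copy_into("MY_DB.RAW.JSON_LANDING", "MY_S3_STAGE", "2024/05/17/", "MY_JSON_FORMAT"))
```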
Applying the above to your situation, the idea is to set the S3 Object Prefix to a location that has the fewest files to scan while still loading all the necessary data without missing any. There can be overlap between files that are loaded and files that are not, but you want to keep that overlap as small as possible. When you tell Snowflake (via a COPY INTO statement) to look at the entire bucket, even if every file in the bucket except one has already been loaded, the overhead is still virtually the same as loading everything: it has to compare every single file every time the COPY INTO statement is executed. As you can imagine, as your 5 million files grow to 10 million and beyond, the loading is going to take longer and longer. You are also going to spend more money because your warehouse is running longer.
We implemented a solution that gets around the growing file count issue by using the best-practice folder structure I described and then dynamically creating a list of folder paths based on date. We load the last 7 days every single day, meaning we overlap 6 days' worth of data along with the current day's data. It's what we have deemed a safe overlap that still picks up files that may have changed since our last load. We use a Python script to generate the paths and store them in a grid variable. We then use a grid iterator to iterate over the paths; on each iteration it assigns the path to the S3 Load component's S3 Object Prefix parameter. This translates to running 7 COPY INTO statements every day instead of 1 that scans everything. This may seem inefficient, but it saves us close to 20 minutes a day on load times, and that number grows every day as more files show up.
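Here is a minimal sketch of the path-generation idea. The exact script we run is more involved, and the prefix format below assumes the /YYYY/MM/DD/ layout mentioned above:

```python
from datetime import date, timedelta

# Minimal sketch: build the last 7 days of /YYYY/MM/DD/ prefixes, oldest first.
# In our job, a list like this is written to a grid variable, and a grid
# iterator feeds each path into the S3 Load component's S3 Object Prefix.
def last_n_day_prefixes(n=7, today=None):
    today = today or date.today()
    return [(today - timedelta(days=i)).strftime("%Y/%m/%d/") for i in range(n - 1, -1, -1)]

print(last_n_day_prefixes())
# e.g. ['2024/05/11/', '2024/05/12/', ..., '2024/05/17/']
```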
I hope this helps explain the possible issue and the best practices. If you have more questions about this, please feel free to respond.