Handling an S3 multiple-folder JSON file structure using an S3 Load manifest (when you already know the folder and file names)

 

If you are struggling with a nested folder/file structure like the one in the image above, and you already know the folder and object names, you can load the objects in far less time without iterating over each file in a scheduled job.
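The goal is to build an Amazon Redshift load manifest: a small JSON document that lists every object to be loaded. A minimal example (the bucket and key names here are placeholders) looks like this:

{
 "entries" : [
  {"url" : "s3://my-bucket/folder1/file1.json", "mandatory" : true},
  {"url" : "s3://my-bucket/folder2/file2.json", "mandatory" : true}
 ]
}

The SQL views in the steps below simply build this string from the folder and object names stored in the source table.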

 

Step 1: First, find out whether the input source row count is below or above 500. The limit of 500 comes from Redshift's 65,535-byte cell size restriction: each manifest entry string is on the order of a hundred characters, so concatenating roughly 500 of them with LISTAGG approaches that limit. The following steps will make this clearer.

 

Using a Python Script component, we make the above-or-below-500 decision:

 

# Check whether the source table has 500 rows or fewer
cursor = context.cursor()
cursor.execute("select case when 500 >= (select count(1) from tablename) then 1 else 0 end")
result = cursor.fetchone()
result = int(result[0])

# Store the decision in a job variable for the If component
if result == 1:
    context.updateVariable('manifest_decision', 'below_500')
else:
    context.updateVariable('manifest_decision', 'above_500')

 

Based on this decision, we load the data in one of two ways using the If component.

 

Step 2: Below-500 way



 

In this branch, first create the manifest data as a view, using the folder name and object name columns from the source table:

 

create or replace view manifest_view as
(select replace('{
 "entries" : ['||listagg(manifest, '')||']
}', ',]', ']') from
(select
'{
  "url" : "s3://path/'||foldername
||'/'||objectname||'",
  "mandatory" : true
 },' as manifest from table limit 500))

 

Write the result of this view to a file on S3 using the Text Output component; the file extension should be ".manifest".

 

After creating the file, we need to clean up its contents so that they form valid manifest JSON. We can do this update with a Python Script component.
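For example (an assumed illustration; the exact quoting depends on the Text Output component settings), the raw file may wrap the JSON record in quotes and double the embedded quotes:

"{
 ""entries"" : [{""url"" : ""s3://path/folder/file.json"", ""mandatory"" : true}]
}"

The script below strips this quoting and writes clean JSON back to the same S3 key.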

 

import boto3
import json
import os

s3 = boto3.resource('s3')

# Read the raw manifest file written by the Text Output component
csv_object = s3.Object('lcu-matillion', 'path/file.manifest')
f = csv_object.get()['Body'].read().decode('utf-8')

# Strip the CSV-style quoting so the content becomes valid JSON
f = f.replace('""', '"').replace('"{', '{').replace('}"', '}')
f = json.loads(f)

# Write the cleaned manifest locally, then upload it over the original file
with open("manifest_test.manifest", "w") as outfile:
    json.dump(f, outfile)

s3.meta.client.upload_file('manifest_test.manifest', 'lcu-matillion', 'path/file.manifest')
os.remove("manifest_test.manifest")

 

Note: These two steps (CSV creation plus the Python clean-up) take less time than writing the whole manifest from scratch in Python.

 

Finally, we can bulk-load the objects using this manifest file. You can refer to the Matillion documentation for loading via a manifest file.
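The S3 Load component with the manifest option effectively issues a Redshift COPY using that manifest; as a rough sketch only (the target table, bucket, path and IAM role below are placeholders), the equivalent hand-written statement would be:

copy target_table
from 's3://lcu-matillion/path/file.manifest'
iam_role 'arn:aws:iam::123456789012:role/my-redshift-role'
format as json 'auto'
manifest;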

 

Step 3: Above-500 way

In this branch, we need to iterate over the data in sets of 500 until the maximum record count is reached.

 

As above, first create a helper table, here named 500_split, holding one row per batch of 500 with the start and end row numbers that batch covers (a possible shape is sketched below).
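The post does not show how 500_split itself is built, so the following is only an assumed sketch: an id per batch plus the row-number range it covers. The table name has to be double-quoted because it starts with a digit, and "end" is a Redshift reserved word.

create table "500_split" (
  id    int,
  start int,
  "end" int
);
insert into "500_split" values
  (1,    1,  500),
  (2,  501, 1000),
  (3, 1001, 1500);
-- extend with more rows to cover the largest expected object count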

 

Then create an additional view to work out the maximum record count and how many 500-row batches are needed:

 

create or replace view "500_split_view" as
(select * from "500_split" where
id <=
(select ceil((select max(row_num) from
(select foldername, objectname,
        row_number() over (order by foldername, objectname) as row_num
 from table) t) / 500::float))
order by id)

 

Using a Table Iterator, pass the start and end values from 500_split_view into job variables (${start} and ${end}).

 

 

 

Finally, create an orchestration job with the same flow as the below-500 case; only the first view creation differs, as shown below:

 

create or replace view staging.manifest_view as
(select replace('{
 "entries" : ['||listagg(manifest, '')||']
}', ',]', ']') from
(select
'{
  "url" : "s3://path/'||foldername
||'/'||objectname||'",
  "mandatory" : true
 },' as manifest
 from
 (select foldername, objectname,
         row_number() over (order by entityid, subentityid) as row_num
  from table) t
 where row_num between ${start} and ${end}
))

 

 

You can then load the data in multiple iterations, 500 objects per set.

 

 

Hint: In my design, loading 25 fields from 500 JSON files takes less than 5 minutes.

 

Hi @ganesh_pitchai

This is a brilliant post, thank you for sharing. I like the use of multiple Matillion ETL components, from the Table Iterator to the orchestration job, leading to the end result!

I think the community will find this really handy, and I cannot wait to hear other users' thoughts on this, as well as what similar things they have achieved. Amazing use of Matillion ETL!

Kind regards, Joe