Handling an S3 multiple-folder JSON file structure using an S3 Load manifest (when you already know the folder and file names)

 

If you are struggling with a nested folder/file structure like the one in the image above, and you already know the folder and object names, you can load the objects in far less time without iterating over each file in a scheduled job.
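The goal is to build an Amazon Redshift load manifest: a small JSON document that lists every object to be loaded. A minimal example (the bucket and key names here are placeholders) looks like this:

{
 "entries" : [
  {"url" : "s3://my-bucket/folder1/file1.json", "mandatory" : true},
  {"url" : "s3://my-bucket/folder2/file2.json", "mandatory" : true}
 ]
}

The SQL views in the steps below simply build this string from the folder and object names stored in the source table.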

 

Step 1: First, find out whether the input source row count is below or above 500. The limit of 500 comes from Redshift's 65,535-byte cell size restriction: each manifest entry string is on the order of a hundred characters, so concatenating roughly 500 of them with LISTAGG approaches that limit. The following steps will make this clearer.

 

Using a Python Script component, we make the above-or-below-500 decision:

 

# Check whether the source table has 500 rows or fewer
cursor = context.cursor()
cursor.execute("select case when 500 >= (select count(1) from tablename) then 1 else 0 end")
result = cursor.fetchone()
result = int(result[0])

# Store the decision in a job variable for the If component
if result == 1:
    context.updateVariable('manifest_decision', 'below_500')
else:
    context.updateVariable('manifest_decision', 'above_500')

 

Based on this decision, we load the data in one of two ways using the If component.

 

Step 2: Below-500 way



 

In this branch, first create the manifest data as a view, using the folder name and object name columns from the source table:

 

create or replace view manifest_view as
(select replace('{
 "entries" : ['||listagg(manifest, '')||']
}', ',]', ']') from
(select
'{
  "url" : "s3://path/'||foldername
||'/'||objectname||'",
  "mandatory" : true
 },' as manifest from table limit 500))

 

Write the result of this view to a file on S3 using the Text Output component; the file extension should be ".manifest".

 

After creating the file, we need to clean up its contents so that they form valid manifest JSON. We can do this update with a Python Script component.
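For example (an assumed illustration; the exact quoting depends on the Text Output component settings), the raw file may wrap the JSON record in quotes and double the embedded quotes:

"{
 ""entries"" : [{""url"" : ""s3://path/folder/file.json"", ""mandatory"" : true}]
}"

The script below strips this quoting and writes clean JSON back to the same S3 key.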

 

import boto3
import json
import os

s3 = boto3.resource('s3')

# Read the raw manifest file written by the Text Output component
csv_object = s3.Object('lcu-matillion', 'path/file.manifest')
f = csv_object.get()['Body'].read().decode('utf-8')

# Strip the CSV-style quoting so the content becomes valid JSON
f = f.replace('""', '"').replace('"{', '{').replace('}"', '}')
f = json.loads(f)

# Write the cleaned manifest locally, then upload it over the original file
with open("manifest_test.manifest", "w") as outfile:
    json.dump(f, outfile)

s3.meta.client.upload_file('manifest_test.manifest', 'lcu-matillion', 'path/file.manifest')
os.remove("manifest_test.manifest")

 

Note: These two steps (CSV creation plus the Python clean-up) take less time than writing the whole manifest from scratch in Python.

 

Finally, we can bulk-load the objects using this manifest file. You can refer to the Matillion documentation for loading via a manifest file.
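The S3 Load component with the manifest option effectively issues a Redshift COPY using that manifest; as a rough sketch only (the target table, bucket, path and IAM role below are placeholders), the equivalent hand-written statement would be:

copy target_table
from 's3://lcu-matillion/path/file.manifest'
iam_role 'arn:aws:iam::123456789012:role/my-redshift-role'
format as json 'auto'
manifest;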

 

Step 3: Above-500 way

In this branch, we need to iterate over the data in sets of 500 until the maximum record count is reached.

 

As above, first create a helper table, here named 500_split, holding one row per batch of 500 with the start and end row numbers that batch covers (a possible shape is sketched below).
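The post does not show how 500_split itself is built, so the following is only an assumed sketch: an id per batch plus the row-number range it covers. The table name has to be double-quoted because it starts with a digit, and "end" is a Redshift reserved word.

create table "500_split" (
  id    int,
  start int,
  "end" int
);
insert into "500_split" values
  (1,    1,  500),
  (2,  501, 1000),
  (3, 1001, 1500);
-- extend with more rows to cover the largest expected object count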

 

Then create an additional view to work out the maximum record count and how many 500-row batches are needed:

 

create or replace view "500_split_view" as
(select * from "500_split" where
id <=
(select ceil((select max(row_num) from
(select foldername, objectname,
        row_number() over (order by foldername, objectname) as row_num
 from table) t) / 500::float))
order by id)

 

Using a Table Iterator, pass the start and end values from 500_split_view into job variables (${start} and ${end}).

 

 

 

Finally, create an orchestration job with the same flow as the below-500 case; only the first view creation differs, as shown below:

 

create or replace view staging.manifest_view as
(select replace('{
 "entries" : ['||listagg(manifest, '')||']
}', ',]', ']') from
(select
'{
  "url" : "s3://path/'||foldername
||'/'||objectname||'",
  "mandatory" : true
 },' as manifest
 from
 (select foldername, objectname,
         row_number() over (order by entityid, subentityid) as row_num
  from table) t
 where row_num between ${start} and ${end}
))

 

 

You can then load the data in multiple iterations, 500 objects per set.

 

 

Hint: In my design, loading 25 fields from 500 JSON files takes less than 5 minutes.

 

Hi @ganesh_pitchai

This is a brilliant post, thank you for sharing. I like the use of multiple Matillion ETL components, from the Table Iterator to the orchestration job, leading to the end result!

I think the community will find this really handy, and I cannot wait to hear other users' thoughts on this, as well as what similar things they have achieved. Amazing use of Matillion ETL!

Kind regards, Joe