If you are struggling with a structure like the one in the image above, with many objects nested inside the folders, and you already know the folder and object names, you can load the objects in much less time than iterating over each file in the scheduled job.
Step 1: First we have to find out whether the input source row count is below or above 500. The reason is Redshift's 65,535-character cell limit: the manifest string we build with LISTAGG has to fit in a single varchar cell, so around 500 entries is a safe cutoff. You will see why in the following steps.
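As a rough sanity check (the figures here are assumptions, since the real entry length depends on your bucket, folder and object name lengths), 500 entries already get close to that limit:

# Rough estimate only: real entry length depends on your S3 path lengths
sample_entry = '{ "url" : "s3://path/folder/object", "mandatory" : true },'
approx_entry_len = len(sample_entry) + 60   # allow ~60 extra characters for long folder/object names
print(500 * approx_entry_len)               # roughly 59,000 characters vs. the 65,535-character limit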
Using a Python script component, we make the above/below-500 decision:
# Matillion Python script component: check whether the source table has 500 rows or fewer
cursor = context.cursor()
cursor.execute("select case when 500 >= (select count(1) from tablename) then 1 else 0 end")
result = cursor.fetchone()
result = int(result[0])

# Write the decision into a job variable so the If component can branch on it
if result == 1:
    context.updateVariable('manifest_decision', 'below_500')
else:
    context.updateVariable('manifest_decision', 'above_500')
Based on the decision, we can load the data in two different ways, branching with the If component.
Step 2: Below 500 way
In this part, first create the manifest data as a view, using the folder name and object name columns from the source table:
create or replace view manifest_view as
(select replace('{
"entries" : ['||listagg(manifest, '')||']
}', ',]', ']') from
(select
'{
"url" : "s3://path/'||foldername
||'/'||objectname||'",
"mandatory" : true
},' as manifest from table limit 500))
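For reference, the format we are trying to end up with is the standard Redshift manifest layout; a minimal sketch (the folder and object names below are placeholders) is:

import json

# Placeholder entries purely to illustrate the target manifest shape
manifest = {
    "entries": [
        {"url": "s3://path/folder1/object1.json", "mandatory": True},
        {"url": "s3://path/folder2/object2.json", "mandatory": True}
    ]
}
print(json.dumps(manifest, indent=2))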
From the view result, create a file using the Text Output component; the file extension should be ".manifest".
After creating the file, we need to clean up its contents so that it is valid manifest JSON. We can do that with a Python script component:
import boto3
import json
import os

s3 = boto3.resource('s3')
csv_object = s3.Object('lcu-matillion', 'path/file.manifest')

# Strip the extra quoting added by the text output step so the content becomes valid JSON
body = csv_object.get()['Body'].read().decode('utf-8')
body = body.replace('""', '"').replace('"{', '{').replace('}"', '}')
manifest = json.loads(body)

# Write the cleaned manifest locally, upload it back over the original object, then tidy up
with open('manifest_test.manifest', 'w') as outfile:
    json.dump(manifest, outfile)
s3.meta.client.upload_file('manifest_test.manifest', 'lcu-matillion', 'path/file.manifest')
os.remove('manifest_test.manifest')
Note: these two steps (the CSV creation and the Python clean-up) take less time than building the complete manifest in Python alone.
Finally, we can load the bulk of objects using this manifest file. You can refer to the Matillion documentation for loading from a manifest file.
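If you want to see what the load does under the hood, here is a minimal sketch of the Redshift COPY; the target table, IAM role and JSON format option are assumptions for illustration, not the exact statement Matillion generates:

# Sketch only: target table, IAM role and format option are placeholders
cursor = context.cursor()
cursor.execute("""
copy target_table
from 's3://lcu-matillion/path/file.manifest'
iam_role 'arn:aws:iam::123456789012:role/redshift-load-role'
format as json 'auto'
manifest
""")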
Step 2: Above 500 way
In this path we need to iterate over the data, processing it in sets of 500 until the maximum record count is reached.
As before, first create a temp table; here it is named 500_split.
Then create an additional view on top of it to work out the maximum record count and the number of 500-row splits:
create or replace view 500_split_view as
(select * from 500_split
where id <=
(select ceil(max(row_num) / 500::float) from
(select foldername, objectname,
row_number() over (order by foldername, objectname) as row_num
from table) numbered)
order by id)
Using a Table Iterator, pass the start and end values from 500_split_view into the job.
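The exact layout of 500_split is not shown above; one layout that fits this flow (an assumption, not the original table definition) is one row per batch holding id, start and end, which the iterator can then map to the ${start} and ${end} variables:

# Hypothetical sketch of the windows 500_split could hold: one (id, start, end) row per batch
total_rows = 1750        # e.g. max(row_num) from the source table
batch_size = 500
splits = [{"id": i + 1, "start": i * batch_size + 1, "end": (i + 1) * batch_size}
          for i in range((total_rows + batch_size - 1) // batch_size)]
print(splits)
# [{'id': 1, 'start': 1, 'end': 500}, {'id': 2, 'start': 501, 'end': 1000}, ...]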
Finally, create the orchestration job with the same flow as the below-500 path; only the first view creation differs, as shown below:
create or replace view staging.manifest_view as
(select replace('{
"entries" : ['||listagg(manifest, '')||']
}', ',]', ']') from
(select
'{
"url" : "s3://path/'||foldername
||'/'||objectname||'",
"mandatory" : true
},' as manifest
from
(select foldername, objectname, row_number() over (order by entityid, subentityid) as row_num
from table) numbered
where row_num between ${start} and ${end}
))
You can then load the data in multiple iterations of 500 objects each.
Hint: with this design, 500 JSON files of 25 fields each loaded in less than 5 minutes.