Which component i can use

Hi team,

In my current project i am having scala code. The scala code is all about it will take a file as input read each line , based on the data in each line it will generate new transformations and write to a new file. New file will have additional columns that are derived as part of transformations. I need to do the same thing in matillion . Can somebody help me how we can do this in matillion or even sf. Since snowflake is dwh i am using. I tried with snowflake stored proc. It is giving time out error. Since file size is big. ​

The first step is to load your file to a Snowflake table. Depending on the file format and structure this can be out of the box with Matillion. E.g., for CSV or JSON files on S3 this is easily done with the S3 Load component.

 

If your file has a custom format, you might need to transform it first.

 

Once the data is loaded to Snowflake, you can apply standard SQL or a Matillion Transformation Job to transform your data. You'll have to "translate" your Scala code to SQL. With that, it should also be feasible to process larger volumes of data.

Hi Michael,

I have done the samw thing. Loaded the file into table and from table i am iterating each line to transformed as per the requirement. Each line iterating is the issue here in snowflake. My file size is 15gb . Thats the problem. I am using large warehouse to process it. After 30 mins run i was able to iterate 30k records which is very expensive for 15gb file and i will get timed out. Any suggestion on the iteration part will highly appreciated. I am struggling to process it. ​

If you want to work with a SQL based database like Snowflake you have to stop processing the data with "line by line" iterations. SQL works on sets and you should consider this when designing your transformation processes.

Example: If you have a table with integer columns a and b want to calculate a + b, you don't apply the calculation to line1, then line2, then line3, ... Simply write SELECT a+b FROM table and the calculation is applied to all the lines/rows in the table.