Hi everyone, I am new to the Matillion ETL tool and I want to know how I can convert a CSV file into Avro format. The source platform might be Snowflake.

I am considering using Python for this, but I would like to hear from you whether there is another way to do it. Thanks in advance, Matillion community :)

Hi @Tony

Welcome — we're glad you're here!

Hey @Bryan​, this question made me think of you and your experience with CSV files, care to share what you know? Thanks in advance.

Again, glad to have you onboard, @Tony

Chika

PS. Feel free to share your best practices too!

Hi @Tony​,

Unfortunately, Matillion has nothing out of the box to do this conversion, so the Python approach you mentioned is going to be your best option. One thing I like to ask is: what are you trying to solve by converting a CSV to Avro? I ask because there may be other methods to get you to your end result.

It looks like you are using Snowflake, so I am making a general assumption that you want to load Avro files into Snowflake. You can load CSV files into Snowflake, but there are a few caveats with that. JSON tends to be the most flexible file type when loading data into a VARIANT column, and Snowflake gives you the most functionality for querying JSON natively, without having to transform it before use.
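To give a sense of what I mean, here is a rough, untested sketch of querying JSON sitting in a VARIANT column straight from Python with the snowflake-connector-python package. The connection details, the table name (my_json_staging_table), and the VARIANT column name (raw) are all placeholders, not anything from your environment:

    import snowflake.connector

    # Placeholder connection details -- swap in your own account info
    conn = snowflake.connector.connect(
        account="my_account",
        user="my_user",
        password="my_password",
        warehouse="my_warehouse",
        database="my_database",
        schema="my_schema",
    )

    try:
        cur = conn.cursor()
        # Snowflake lets you drill into JSON in a VARIANT column with the
        # colon/path syntax and cast the result with :: -- no transform step needed
        cur.execute("""
            SELECT raw:customer.name::string AS customer_name,
                   raw:order_total::number   AS order_total
            FROM   my_json_staging_table
            WHERE  raw:status::string = 'SHIPPED'
        """)
        for customer_name, order_total in cur.fetchall():
            print(customer_name, order_total)
    finally:
        conn.close()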

Converting CSV to Avro is not super common, but there are a few Python modules that can help. I tend to lean towards pandas for most things, but as far as I know pandas has nothing built in for Avro, so unfortunately I wouldn't be of much help in this space.

Let us know what end result you are working towards and maybe we can help give you ideas on how to get there. Thanks for reaching out!

Thank you tons, @Bryan

Hey Bryan, thank you very much for your kind response. Yes, I am using Snowflake. The task I need to complete is to load Snowflake tables (which store data that came from CSV-format files) and then transform those dataframes from CSV format into Avro format. I have found several ways to accomplish that using the Python node: according to my research we can use the PyArrow library (from the Apache foundation) or the Dask library, among other alternatives such as pandas, Koalas, PySpark, etc. PyArrow and Dask give us a more flexible setup in terms of optimization, the ability to partition the data, and so on. This is an extract of the code using Dask (the Avro step itself is sketched just after this extract):

    # Read every CSV in the folder lazily as one Dask dataframe
    import dask.dataframe as dd
    df = dd.read_csv('./data/people/*.csv')
    # As written, this writes Parquet rather than Avro
    df.to_parquet('./tmp/people_parquet2', write_index=False)
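That extract actually writes Parquet, so for the Avro output itself I am planning to pair Dask with the fastavro library. This is only a rough, untested sketch; the schema, column names, and paths are placeholders for my real data:

    import dask.dataframe as dd
    from fastavro import writer, parse_schema

    # Placeholder Avro schema describing the columns in my CSVs
    schema = parse_schema({
        "name": "people",
        "type": "record",
        "fields": [
            {"name": "id",   "type": "long"},
            {"name": "name", "type": "string"},
            {"name": "age",  "type": ["null", "long"], "default": None},
        ],
    })

    df = dd.read_csv('./data/people/*.csv')

    # Write one Avro file per Dask partition, keeping the partitioning benefit
    for i, part in enumerate(df.to_delayed()):
        pdf = part.compute()  # each partition becomes a small pandas dataframe
        with open(f'./tmp/people_{i:05d}.avro', 'wb') as out:
            writer(out, schema, pdf.to_dict(orient='records'))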

 

Thank you, @Bryan, I will appreciate your kind support. Thanks in advance.

That's the approach I would take if I had to do this myself. I am assuming you need the Avro version of the CSV file because you're loading some other system besides Snowflake, is that correct?

That is correct. Thanks very much, Bryan, for all your help.