Is it possible to load a file (from Azure Blob Storage) into Snowflake as a single varchar, or a single varchar per line?
My situation is that a source system provides data in a CSV file with unenclosed delimiters.
Example file illustrating the unenclosed delimiter character (the comma inside the amount `$123,456.78` is not quoted):

```
ID, name, amount, date
001,Tammy,$123,456.78,2012-12-12
```
Because this kind of malformed CSV is ambiguous, I don't expect a CSV handler to fix it on its own. We've written a script that solves the issue and could run it as a pre-processing step, but I'm interested in an ETL-style approach that handles the issue within Snowflake/Matillion.
Thank you in advance for your advice and support!
Hi @Brendan,
The first solution that comes to my mind is to use the regular CSV load process with an empty delimiter (if that's even possible) or a delimiter that doesn't exist anywhere in your input data. The idea is to treat every full line in the CSV as one attribute: define the target structure as a single VARCHAR column of maximum length, and the load should then give you one row per input line with the complete content.
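To make it concrete, here's a minimal sketch of the idea; the stage name and file path are placeholders, and I'm assuming `~` never appears anywhere in your data:

```sql
-- Sketch only: @my_azure_stage and source.csv are placeholder names.
CREATE OR REPLACE TABLE raw_lines (
  line VARCHAR  -- VARCHAR defaults to the maximum length (16 MB)
);

COPY INTO raw_lines
FROM @my_azure_stage/source.csv
FILE_FORMAT = (TYPE = CSV FIELD_DELIMITER = '~' SKIP_HEADER = 1);  -- '~' must not occur in the data
```

From there you can pull the fields apart inside Snowflake with SPLIT_PART or REGEXP_SUBSTR.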
Let me know if that works for you. Looking forward to hearing about the results.
Michael
Hey @Brendan, my suggestion of a VARIANT field is purely for the case where your data may not strictly be a string. For example, it could be a semi-structured construct such as a JSON object. VARIANT is the basic fallback for holding any type of data before formatting, and you can convert from a VARIANT to a string or to any other Snowflake data type.
Other than the data type of the destination field being a STRING or a VARIANT, nothing else changes here. Row order would still be maintained in the same way as when ingesting into a STRING; you would simply have a VARIANT field for every record of data instead of a STRING.
For bonus points, if it's useful, you can also include METADATA$FILE_ROW_NUMBER in your ingestion as another field, so the loaded data can easily be compared with the original file. You can pair this with METADATA$FILENAME.
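A hedged sketch of what that ingestion could look like (the stage, path, and table names are placeholders, not your exact pipeline):

```sql
-- Placeholder names throughout; illustrates the metadata columns.
CREATE OR REPLACE TABLE raw_lines (
  file_name VARCHAR,
  row_num   NUMBER,
  line      VARIANT
);

COPY INTO raw_lines (file_name, row_num, line)
FROM (
  SELECT METADATA$FILENAME,         -- staged file the record came from
         METADATA$FILE_ROW_NUMBER,  -- row number of the record within that file
         TO_VARIANT($1)             -- the complete line as a VARIANT
  FROM @my_azure_stage/source.csv
)
FILE_FORMAT = (TYPE = CSV FIELD_DELIMITER = NONE);
```

Ordering by `file_name, row_num` then reproduces the original file order regardless of how the load was parallelized.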
Hi @Brendan,
As Michael suggested, the idea is to treat every full line in the CSV as one attribute by loading with a delimiter that doesn't occur in your input data. You can create a table with a single column using the VARIANT data type:
```sql
CREATE TABLE csvload (
  data VARIANT  -- each full CSV line lands here
);
```

A VARIANT value can hold up to 16 MB, so a complete CSV line fits easily in a single column value.
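If it helps, a sketch of the matching load step (the stage and path are placeholders, and `~` stands in for any character that never occurs in your data):

```sql
-- Placeholder stage/path; each full line is cast to VARIANT on the way in.
COPY INTO csvload
FROM (
  SELECT TO_VARIANT($1)
  FROM @my_azure_stage/source.csv
)
FILE_FORMAT = (TYPE = CSV FIELD_DELIMITER = '~' SKIP_HEADER = 1);
```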
cheers
Subash S
Michael is correct here. A minor clarification: you can specifically set the FIELD_DELIMITER in your Snowflake file format to `NONE`.
Depending on your use case, you may also find it more helpful to ingest the data into a VARIANT column instead of a STRING, but that's completely your choice.
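For reference, a sketch of that setting as a named file format (the format name is made up):

```sql
-- Hypothetical format name; FIELD_DELIMITER = NONE loads each full line as one field.
CREATE OR REPLACE FILE FORMAT one_column_csv
  TYPE = CSV
  FIELD_DELIMITER = NONE
  SKIP_HEADER = 1;  -- assumes the header row shown in the sample file
```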
Thank you, Michael and Chris! Michael had the right path forward, and Chris's configuration suggestion was right on point.
Using the Azure Blob Storage Load component, I was able to set the field delimiter to NONE and handle the files' formatting within Matillion transformations on Snowflake. IT will not have to hear from me this week about their file formatting!
I'm not sure how VARIANT processing would be different. If I loaded this file in as a single VARIANT, would the row order be maintained? Set-based manipulation was sufficient for my needs, but I could see a benefit to using a VARIANT if order is maintained.