I have a job I'm working on that simply picks up a CSV file from an S3 bucket and attempts to load it into Snowflake. The S3 step is failing because there is a % in a value in one of the columns.
If I remember correctly you are using Matillion for Snowflake. Correct me if I am wrong on that. If you are loading from S3 into Snowflake using the S3 Load component, there are a couple of things to check. Since the error is "Error parsing JSON", I am wondering if the Format Type is wrong. If you are not using an explicit CSV file format that you created in Snowflake, then I would make sure you have the Format Type set to CSV. Another thing to take note of is whether your fields are enclosed with a delimiter like a double quote. If they are, you can add that to your file format, which will make for a cleaner load as well.
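If you do go the route of an explicit file format in Snowflake, a minimal sketch could look something like this (the format name and options are just examples, assuming a comma-delimited file with a header row and double-quoted fields):

-- Sketch of an explicit CSV file format; name and options are illustrative
CREATE OR REPLACE FILE FORMAT my_csv_format
  TYPE = CSV
  FIELD_DELIMITER = ','
  SKIP_HEADER = 1                      -- assumes the file has a header row
  FIELD_OPTIONALLY_ENCLOSED_BY = '"';  -- handles fields wrapped in double quotes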
The other thing that dawns on me is how you are loading the CSV file. Are you loading it into a variant column in a table and then transforming from there, or are you loading the CSV into a table where columns for each field have already been defined? I personally like to load into a variant column when possible, as it results in the truest data lineage. I realize that's sometimes not possible depending on the size of the file contents. I hope this helps. Post back if you have other thoughts or information.
You said that the issue you are getting is in the S3 Load component, correct? If this is the case, it sounds like it is an issue reading from S3, not an issue writing to Snowflake. Is it possible to have the values in that column enclosed in quotes and then convert back to a percentage in your job using a calculator component?
Hey @DataGuy, I believe the issue is actually in Snowflake. The way I understand it, Kerri's columns are enclosed in double quotes, but she has one or more CSV columns that contain a decimal value with the percent character (i.e. 1.00%). Since Snowflake doesn't have a numeric data type that handles decimal numbers with a percentage character, the load fails. It's basically trying to stuff what would be a string into a decimal-typed column. I have run into this issue several times in my SQL Server days when processing CSV files. The best approach I found was to create the table where all columns are of type string (aka varchar). You will almost never have load failures. Then you can use SQL skills to clean up and transform the data.
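To make that concrete, here is a rough sketch of the idea (the table, stage, and column names are just placeholders, not your actual schema):

-- Sketch only: land everything as strings so the load never fails on odd values
CREATE OR REPLACE TABLE raw_report (
  report_date  VARCHAR,
  metric_name  VARCHAR,
  pct_complete VARCHAR   -- the column holding values like '1.00%'
);

COPY INTO raw_report
FROM @my_s3_stage/report.csv   -- placeholder stage and path
FILE_FORMAT = (TYPE = CSV FIELD_OPTIONALLY_ENCLOSED_BY = '"' SKIP_HEADER = 1);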
The other method is to preprocess every file before the load to handle every bad-data possibility, which is likely going to require a ton of Python and moving files around. In my opinion, data cleanup is best done in a data system.
The way I solved this issue was in the initial Python step that was calling the API to get the report. Here is the specific line of code I used:
# Strip the last character (the trailing %) from every value in the column
df['Name of Column'] = list(map(lambda x: x[:-1], df['Name of Column'].values))
Thanks for your reply. Yes, you are correct, I am loading from S3 to Snowflake using the S3 Load component. Here are the answers to your great questions:
I am using CSV as the file format.
My columns are enclosed with a double quote delimiter.
I am loading into a table with the types already defined, and the data type I'm currently using for that column is NUMBER(5,2).
I have not ever loaded into a variant column followed by a transformation; however, I am interested in how that looks and/or what component is used. Here is a screenshot of my current job flow. The row count for this CSV file is 3k for now, but I don't foresee it becoming massive.
After more investigation, I stand corrected. I have used the variant column with great success on JSON and XML files, and my assumption was that CSV files would follow the same pattern, but that doesn't appear to be the case.
The benefit of the variant data type when working with semi-structured data is that flattening a potentially complex nested set of arrays becomes trivial. Obviously, this doesn't apply when you are talking about CSV files that are as flat as they can be.
So, variant columns are out, but there are other options which lend themselves to both data lineage and process flow. Following a true ELT pattern, I would create the table that you plan to load with columns of data type string. What this gives you is the ability to load the data in its raw form without worrying about formatting or adhering to strict data type constraints. You are effectively bulletproofing that raw data load. From there you can start transforming the data into a more materialized set of views or tables.

The reason for this pattern is that you are now debugging plain SQL transformation code rather than trying to debug a component and its configuration, where the logging and flexibility may not cover every possible situation. You can also leverage functions in Snowflake like try_cast, try_to_timestamp, try_to_decimal, etc. to test for bad data before materializing it. The idea is that the load will always succeed, and fixing failures around bad data becomes substantially easier since you are then just making tweaks to SQL.
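For your percentage column specifically, the cleanup step could be as simple as something like this (the table and column names are placeholders for your all-string landing table, and it assumes the raw value comes in as a string such as '1.00%'):

-- Sketch only: strip the % and cast, getting NULL instead of a load failure for bad values
CREATE OR REPLACE VIEW report_clean AS
SELECT
  TRY_TO_DECIMAL(REPLACE(pct_complete, '%', ''), 5, 2) AS pct_complete
FROM raw_report;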
I hope this helps in some way. Post back on your thoughts if you would like.