We are working on a POC to decrypt files from an S3 bucket to dump into Snowflake. I have read this option isn't an available Matillion component. I will most likely have to use Python or a bash script.
Does anyone have examples or template?
We are working on a POC to decrypt files from an S3 bucket to dump into Snowflake. I have read this option isn't an available Matillion component. I will most likely have to use Python or a bash script.
Does anyone have examples or template?
Hi @kerri.gathright,
What kind of encryption are you using? Are you using the S3 Encryption or KMS Encryption? Even though they are used in S3 they are different in nature. Also, what kind of files are they? (i.e. JSON, CSV, Parquet, etc.)
One thing to note here is that if you are not using some custom encryption on the individual files in the S3 bucket then you can leverage native Snowflake abilities. There are 2 ways of handling this (with a Snowflake Stage or without). If you don't want to create a Snowflake Stage because it's a one off load then you can use the S3 Load component and set the Encryption property appropriately. If you plan to load files from this location often, then I would create a Stage in Snowflake that will have the required encryption configuration within it.
Either route you took above the usage is still the same. You will want to use the S3 Load Component. If you are leveraging a Snowflake Stage then I don't believe you need to set the Encryption property on the S3 component since the Stage is going to handle the decryption at the time the COPY INTO statement is submitted to Snowflake. If you are not using a stage then you will want to set the Encryption property appropriately.
Again, you should only need to implement custom a solution if you had an outside process that encrypted the files before dropping them into the S3 bucket. If this is in fact the case, there are a handful of ways to handling that situation which we can discuss further if needed.
The file is encrypted when dropped into the S3 bucket and is gpg encrypted. So we will need the custom solution. Can you provide the ways in which we can handle this now that you know we need the custom solution.
Ok. That scenario is a bit hairy but can be done. Without getting super deep in the details. The workflow should be something like this. Have a subfolder in the S3 bucket to hold decrypted files. Use a Python Script component to read each file in S3 and then decrypt and place the decrypted file in the new sub folder. From there you can use the S3 Load component to load the file into Snowflake. After loaded, the file can be deleted from the unencrypted folder if needed.
Following the above workflow you would have to first find a Python module that can decrypt the files you have. I personally do this work on my own machine until I know what Python module will work since there will be a bit of trial and error. After you have the module that will work, you will want to install that module on your Matillion Ec2 instance. Once installed, that will make it available for use in the Python Script component. From there you will want to use the boto3 Python module which comes installed on the Matillion instance to connect to and iterate over files in an S3 bucket. On each iteration you will want to decrypt the file using the GPG module you found earlier and then place the decrypted file in the subfolder you created earlier. You now should have 1 or more decrypted files in a folder within the S3 bucket. The next component would be your S3 Load component to load the file into Snowflake.
There are a couple other ways that this can be done using AWS Lambdas to decrypt files as they land in S3 but you are then mixing AWS code with your Matillion work which could make it harder to debug and manage later.
Hopefully this helps paint a bit of a picture of what this looks like.
Some possible modules to start with for decrypting the files:
https://pypi.org/project/encryptedfile/
https://pypi.org/project/PGPy/
thanks so much for this!
I also appreciate this -- thanks. I am also getting .gpg files in S3 and need to decrypt them, before using the S3 Load component.
I was hoping Matillion would have a feature to decrypt them, and I wouldn't have to use a Python component. No such luck! Might be worth a feature request, though?
2 yrs later and I also need PGP decryption of S3 files. I'll check the Ideas portal and if it isn't there I will add it as a request. Cheers.