This is a must for us as well. We require 'base' columns in our customers' files but optionally allow them to provide extra columns, so we have to validate both the base columns and any additional columns they might have provided. With AWS S3, our biggest challenge is often that the files are simply too large to download just to get at the header row.
However, with a bash CLI command you can interrogate the S3 object and pull back just the first row of data.
This is accomplished with an oft-overlooked option on the aws s3api get-object CLI command: --range.
--range lets you make an HTTP range request for the first N bytes of the get-object target, which allows you to pull the header from even the largest S3 files without fetching the entire object.
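If you ever need to do the same thing from Python instead of the CLI, boto3's get_object supports the same Range parameter. Here is a minimal sketch with a placeholder bucket and key rather than real values:

import boto3

s3 = boto3.client('s3')

# Ask S3 for only the first ~2 KB of the object instead of the whole file.
response = s3.get_object(
    Bucket='my-bucket',           # placeholder bucket name
    Key='dir/target_file.csv',    # placeholder object key
    Range='bytes=0-2111'
)
sample = response['Body'].read().decode('utf-8', 'ignore')
print(sample.split('\n')[0])      # the header row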
In your orchestration, you'll need a couple of job variables set up: buffer_bytes and byte_sample.
buffer_bytes is important because, depending on the 'width' of the target file, you may need a larger byte range to make sure you capture everything up to the line break character in the file. It is okay to overshoot the line break, but you must at least include it in order to separate the header row from the next row in the file. It may take some experimentation, but we use 0000-2111 as the value of buffer_bytes.
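If you want to be defensive about very wide files, you can also verify that the sample actually contains a line break and widen the range until it does. This is only a rough sketch of that idea in Python with boto3; the function names are invented for illustration:

import boto3

s3 = boto3.client('s3')

def fetch_sample(bucket, key, end_byte):
    # Fetch only bytes 0..end_byte of the object, like the CLI --range option.
    resp = s3.get_object(Bucket=bucket, Key=key, Range='bytes=0-{}'.format(end_byte))
    return resp['Body'].read()

def fetch_header_row(bucket, key, end_byte=2111):
    # Keep doubling the byte range until the sample includes a line break,
    # so we know the whole header row was captured.
    sample = fetch_sample(bucket, key, end_byte)
    while b'\n' not in sample:
        end_byte *= 2
        sample = fetch_sample(bucket, key, end_byte)
    return sample.split(b'\n')[0].decode('ascii', 'ignore')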
byte_sample holds the actual data from the file, which we capture in two fairly easy bash steps.
First bash step:
# start with a clean slate
rm -f ./header_content
# fetch only the requested byte range of the object into ./header_content
aws s3api get-object --bucket $<variable of bucket name> --key $<variable of s3 key as /dir/target_file> --range bytes=$buffer_bytes ./header_content
# show the sample in the task output (tail -n +1 prints the whole file)
tail -n +1 ./header_content
The bash component can be a bit tricky with its output, so we use two steps; there might be a way to shorten this into a single bash step.
In this first step we're removing any file named 'header_content' to ensure a clean slate.
We then run the aws s3api get-object command, which writes its result to the newly created header_content file. The last command is really just for task output.
Second bash step:
The second bash step just does a cat ./header_content, which does nothing other than populate the built-in 'Message' variable of the component step with the output of the cat command.
Now, simply export the component's Message variable to the byte_sample job variable.
Now that you have the data available to your job, you can use it in a number of ways. We use a Python step to populate a job variable (header_string) with the actual header data, which is then consumed downstream by another orchestration task.
# take everything up to the first line break (the header row)
stringa = byte_sample.split('\n')[0]
# encode to ASCII, ignoring errors, to drop illegal characters and any BOM
stringa = stringa.encode('ascii', 'ignore').decode('ascii')
# strip any double quotes around the column names
header = stringa.replace('"', '')
# push the result into the header_string job variable
context.updateVariable('header_string', header)
print("Incoming header row: " + header)
This is not the most efficient Python code, but it is pretty easy to understand what it is doing:
- Split the data captured in byte_sample on the line break \n (substitute other line-ending characters as needed) and take the first element of the resulting array.
- Because of how this data arrives from the previous step, we encode it to ASCII and ignore errors, which drops illegal characters and any byte order mark (BOM).
- We eliminate any double quotes that might be contained in the string.
- We update our job variable with the result.
That's it. From there, you have the header data and can do whatever you wish with it to compare it against the headers you are expecting. Python preserves list order, so your output will be a correct ordinal representation of the file's columns.
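For example, the downstream comparison can be as simple as checking that every base column is present and treating anything else as an optional extra. This is only a rough sketch of the idea; the column names are invented and the delimiter is assumed to be a comma:

# Sketch only: the base column names below are placeholders, not our real schema.
BASE_COLUMNS = ['customer_id', 'first_name', 'last_name', 'email']

# header_string is the job variable populated by the Python step above.
incoming = [c.strip() for c in header_string.split(',')]

missing = [c for c in BASE_COLUMNS if c not in incoming]
extras = [c for c in incoming if c not in BASE_COLUMNS]

if missing:
    raise ValueError('File is missing required base columns: ' + ', '.join(missing))
print('Optional extra columns provided: ' + ', '.join(extras))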
Hope this helps!
Edward Hunter