Loading XML file with multiple container nodes

So I'm trying to get some XML files loaded and ultimately flattened out into a Snowflake table. The S3 Load component of course supports XML and even offers the option to "Strip outer element" but unfortunately, the file I need to ingest has two outer elements and despite a day of trying various methods I'm stumped on how to proceed.

 

If I remove one of the two outer elements, S3 Load does its thing and I end up with one row per <item> element, which is exactly what I need. If I use S3 Load without any modifications to the file, I end up with the entire XML file loaded into a variant column, which thus far I've been unable to query successfully with Snowflake SQL.

 

Unfortunately though I have to consume these from a URL provided by the vendor and unless I can figure out some way to either a ) strip one of the elements from the file before S3 Load, or b) figure out how to query the desired sub-elements after loading the whole file into one column, I'm stuck.

 

The XML in question is attached. Any suggestions appreciated!

Hi @chris.odell1617208923357​ ,

Hi Chris, there are a couple approaches to take on this but the most out of the box approach would be to leverage the power of Snowflake as much as possible. Implement the pattern you spoke about in this statement , "If I use S3 Load without any modifications to the file, I end up with the entire XML file loaded into a variant column". Once the data is in snowflake you have several options for handling the flattening which will allow you to materialize the data in to tables or simply expose the data via a view. This is what the SQL statement would look like to query that data once it's loaded into a variant column as a single row.

select

get(xmlget(r.$1, 'title'), '$')::string as FEED_TITLE,

get(xmlget(r.$1, 'link'), '$')::string as FEED_LINK,

get(xmlget(r.$1, 'description'), '$')::string as FEED_DESCRIPTION,

get(xmlget(parse_xml(f.THIS), 'g:id'), '$')::string as ITEM_ID,

get(xmlget(parse_xml(f.THIS), 'g:gtin'), '$')::string as ITEM_GTIN,

get(xmlget(parse_xml(f.THIS), 'title'), '$')::string as ITEM_TITLE,

get(xmlget(parse_xml(f.THIS), 'description'), '$')::string as ITEM_DESCRIPTION,

get(xmlget(parse_xml(f.THIS), 'g:brand'), '$')::string as ITEM_BRAND,

get(xmlget(parse_xml(f.THIS), 'g:google_product_category'), '$')::string as ITEM_GOOGLE_PRODUCT_CATEGORY,

get(xmlget(parse_xml(f.THIS), 'g:product_type'), '$')::string as ITEM_PRODUCT_TYPE,

get(xmlget(parse_xml(f.THIS), 'g:item_group_id'), '$')::string as ITEM_GROUP_ID,

get(xmlget(parse_xml(f.THIS), 'g:color'), '$')::string as ITEM_COLOR,

get(xmlget(parse_xml(f.THIS), 'g:size'), '$')::string as ITEM_SIZE,

get(xmlget(parse_xml(f.THIS), 'g:image_link'), '$')::string as ITEM_IMAGE_LINK,

get(xmlget(parse_xml(f.THIS), 'g:additional_image_link'), '$')::string as ITEM_ADDITIONAL_IMAGE_LINK,

get(xmlget(parse_xml(f.THIS), 'g:price'), '$')::string as ITEM_PRICE,

get(xmlget(parse_xml(f.THIS), 'g:availability'), '$')::string as ITEM_AVAILABILITY,

get(xmlget(parse_xml(f.THIS), 'g:condition'), '$')::string as ITEM_CONDITION,

get(xmlget(parse_xml(f.THIS), 'g:adwords_labels'), '$')::string as ITEM_ADWORDS_LABELS,

get(xmlget(parse_xml(f.THIS), 'g:shipping_label'), '$')::string as ITEM_SHIPPING_LABEL

from YOUR_SCHEMA.YOUR_TABLE r,

table(flatten(input => r.$1, recursive => true)) f

where f.KEY = '@' and f.VALUE = 'item';

I hope this helps.

Hey Bryan, thanks for the reply. Your suggestion to use xmlget() in a sql component is exactly what I ended up doing, though not quite as elegantly. Going to take some time later this week to study yours in detail 😄

 

I'm not sure why, but in my environment the "Strip outer element" option wasn't working until I deleted & re-added the S3 Load component. Once I fixed that, I was dealing with just one container node and was able to take a slightly simpler approach.

 

SELECT 

xmlget(t1."VALUE",'g:id'):"$"::STRING AS id, 

xmlget(t1."VALUE",'g:gtin'):"$"::STRING AS gtin,

xmlget(t1."VALUE",'title'):"$"::STRING AS title,

xmlget(t1."VALUE",'description'):"$"::STRING AS description,

xmlget(t1."VALUE",'g:brand'):"$"::STRING AS brand,

xmlget(t1."VALUE",'g:product_type'):"$"::STRING AS product_type,

xmlget(t1."VALUE",'g:color'):"$"::STRING AS color,

xmlget(t1."VALUE",'g:size'):"$"::STRING AS size,

xmlget(t1."VALUE",'g:image_link'):"$"::STRING AS image_link,

xmlget(t1."VALUE",'g:price'):"$"::STRING AS price,

xmlget(t1."VALUE",'g:availability'):"$"::STRING AS availability,

xmlget(t1."VALUE",'g:condition'):"$"::STRING AS condition,

xmlget(t1."VALUE",'g:additional_image_link'):"$"::STRING AS additional_image_link

from

(SELECT value

FROM VENDORETL."PUBLIC"."stage_GIANT_BIKES",

LATERAL flatten(VENDORETL."PUBLIC"."stage_GIANT_BIKES".FILECONTENTS:"$") ) AS t1 WHERE t1."VALUE" LIKE '<item>%'