Cloud Storage Loader Output Schema

Hello there!

I am trying to eventually ingest our Snowplow events into Snowflake, with the full pipeline implemented in GCP (Scala Stream Collector → Stream Enrich PubSub → Cloud Storage Loader). Since the Snowflake Loader doesn’t currently support GCP, we plan to sink our stream into a bucket via the Cloud Storage Loader, and then load it into Snowflake via Snowpipe.

Can anyone explain what the data output format will be once it is sunk into the bucket? Is it one super-wide file? Shredded into atomic, context, and custom event tables? CSV? TSV? Any clarification would be much appreciated!

Thanks so much!

Yep - the data that comes out of enrichment is in a wide TSV format. This format is consumable by any of the Analytics SDKs as well as by the shredder process.
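To make the wide TSV format concrete, here is a minimal sketch of splitting one enriched event line into named fields. The field names below follow the canonical event model's leading columns, but the exact set and order depend on your pipeline version (the full record has well over a hundred fields), and in practice you'd use one of the Analytics SDKs rather than hand-rolling this:

```python
# Sketch: parsing one line of Snowplow enriched wide-TSV output.
# Assumption: these are the first seven columns of the canonical
# enriched event; the real record continues with many more fields.
FIELDS = [
    "app_id", "platform", "etl_tstamp", "collector_tstamp",
    "dvce_created_tstamp", "event", "event_id",
]

def parse_enriched(line: str) -> dict:
    """Split a wide-TSV enriched event into a dict of its leading fields."""
    values = line.rstrip("\n").split("\t")
    return dict(zip(FIELDS, values))

# Hypothetical sample line for illustration only
sample = ("my-app\tweb\t2023-01-01 00:00:00\t2023-01-01 00:00:01\t"
          "2023-01-01 00:00:00\tpage_view\tf6c2e3a1")
event = parse_enriched(sample)
print(event["event"])  # page_view
```

The Analytics SDKs do essentially this, plus transform the TSV into self-describing JSON suitable for warehouse loading.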

The Snowflake model has its own dedicated shredder that runs on Spark, but I imagine it’s probably portable to GCP with some changes. There’s also a stream shredder (not dependent on Spark), but that’s not in a production-ready state yet.