We are currently testing out Snowplow on AWS using quickstart-examples. Is there a way to put the data in S3 in compressed but pure JSON format? Currently, it's a compressed document consisting of some metadata and then a JSON string. Landing it as pure JSON would remove the need for an intermediate transformation step to convert it to JSON before moving it to the final warehouse destination (Snowflake, in our case).
I attempted to use the `purpose` input of the terraform-aws-s3-loader-kinesis-ec2 module by setting it to 'JSON' ( GitHub - snowplow-devops/terraform-aws-s3-loader-kinesis-ec2 ), but data stopped landing in S3 after that. Any ideas?
Thanks in advance!
Hey @Pratik the S3 Loader does not do any transformation of the data before landing it in S3. The enriched data comes in a TSV format where certain fields are themselves JSON - this is the format you are seeing inside the GZipped files.
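To see this for yourself, you can pull one of the GZipped files down locally and split the first line on tabs. A minimal sketch (the file path is hypothetical; each line is one enriched event in Snowplow's tab-separated format):

```python
import gzip

def first_event(path):
    """Read the first enriched event from a GZipped S3 Loader output file
    and split it into its tab-separated columns."""
    with gzip.open(path, "rt") as f:
        return f.readline().rstrip("\n").split("\t")

# Usage (hypothetical local copy of one S3 Loader output file):
#   columns = first_event("enriched-events.gz")
#   print(columns[:5])
```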
To convert TSV → JSON you would generally use our Analytics SDKs inside a Lambda function or some other microservice-style consumer of the Kinesis stream. You would then re-publish each event to a new Kinesis stream, which would contain the JSON you want to use.
The flow therefore looks a bit like:
enriched stream > TSV-to-JSON processor > enriched JSON stream > S3 Loader
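The processor step can be sketched roughly as below. This is a simplified stand-in, not the real Analytics SDK: the field names are a hypothetical subset of the ~130 tab-separated enriched-event columns (the full mapping is exactly what the SDK's transform function implements for you), and the re-publish call is shown only as a comment:

```python
import json

# Hypothetical subset of Snowplow's enriched-event TSV columns;
# the real, complete mapping is provided by the Analytics SDKs.
FIELDS = ["app_id", "platform", "collector_tstamp", "event", "user_id"]

def tsv_to_json(line: str) -> str:
    """Simplified stand-in for the Analytics SDK transform:
    pair TSV values with field names, drop empty values, emit JSON."""
    values = line.rstrip("\n").split("\t")
    return json.dumps({k: v for k, v in zip(FIELDS, values) if v})

# Inside the Lambda handler you would then re-publish each converted
# event to the new stream, e.g. with boto3 (names are hypothetical):
#   kinesis.put_record(StreamName="enriched-json",
#                      Data=tsv_to_json(line),
#                      PartitionKey=partition_key)

print(tsv_to_json("site\tweb\t2021-01-01 00:00:00\tpage_view\tu123"))
```

In a real deployment you would swap `tsv_to_json` for the SDK's transformer so that all columns, including the embedded JSON ones, are handled correctly.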
The SDK documentation can be found here: Analytics SDKs - Snowplow Docs
Hope this helps!
Thanks Josh! This was really helpful! I will look into updating the pipeline using the above approach.
Alternatively, I also came across Snowflake Loader - Snowplow Docs, which we're wondering if we could use, since our end goal is to get the data into Snowflake. Is there any documentation on adding that to the existing pipeline? The existing documentation is helpful, but I'm a bit lost as to where to start. Currently, we have everything up to S3 set up. I would really appreciate it if you could provide some pointers on adding the Snowplow Snowflake Transformer and Snowplow Snowflake Loader pieces to the existing pipeline.
Hi @Pratik you are on the right track already with loading into Snowflake!
After the Enriched data is landed in S3 the following stages happen:
- EMR: Stages the data that has been landed in S3; the Snowflake Transformer converts it for loading into Snowflake and saves it to a new destination in S3
- EMR: The Snowflake Loader copies the data from the S3 staging bucket into Snowflake
The setup steps are fairly involved but should cover everything required. We are working hard on simplifying this, however, and on extending our open source modules to support loading into Snowflake with the same ease as the rest of the setup.
When we are closer to releasing that, would you be open to beta-testing the modules (which should automate this whole process)?
Hi @josh ! Apologies for the late response! Thank you so much for your response!
Yes, we would definitely be interested in beta-testing these modules. Do you have a rough timeline for when the release might go out?
So I believe they have all been created, and the team is just working on the docs and examples - if you follow this repo (GitHub - snowplow/quickstart-examples: Examples of how to automate creating a Snowplow Open Source pipeline) you should be able to find the Snowflake-specific sections when they are ready!