Hello Snowplow community,
My current pipeline for page_view events looks like this:
I have used Athena to query this data directly from the S3 files.
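For context, this is roughly the kind of external table I created over the enriched files (a trimmed sketch; the bucket path and database name are placeholders, and the real table has many more columns):

```sql
-- Rough sketch of my Athena setup, not the exact DDL:
-- the enriched events land as tab-separated files in S3,
-- so an external table over that prefix lets me query them.
CREATE EXTERNAL TABLE IF NOT EXISTS atomic.events (
  app_id           STRING,
  collector_tstamp STRING,
  event            STRING,
  event_id         STRING
)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '\t'
LOCATION 's3://my-snowplow-bucket/enriched/good/';  -- placeholder path
```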
Now for the unstructured events, I followed this link. As per the link, we need to use a Redshift cluster/database to store the events.
All our data analysis happens in BigQuery. For all the data in S3 (other than the data collected by Snowplow), we already have jobs that sync it to BigQuery. We want to reuse these existing jobs to move the Snowplow data from S3 to BigQuery so that our data analysis team can work with the newly captured events as well.
I have a few questions related to this.
- Is it mandatory to use Redshift for unstructured events?
- Can the unstructured events be stored in S3? If yes, how can I link `event_id` and `root_id` the way Redshift does with `FOREIGN KEY (root_id) REFERENCES atomic.events(event_id)`?
- If no, do I need to use Redshift for the structured (page_view) events as well, so that I can link unstructured and structured events using the same foreign-key relationship? (See the sketch after this list for the kind of join I mean.)
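To make the last two questions concrete, this is the kind of join I am hoping to reproduce outside Redshift (the table name `atomic.my_custom_event` is hypothetical, standing in for a shredded unstructured-event table):

```sql
-- Sketch only: joining an unstructured-event table back to the parent
-- events table via root_id, mirroring the Redshift foreign key.
SELECT
  e.event_id,
  e.collector_tstamp,
  u.root_id
FROM atomic.events AS e
JOIN atomic.my_custom_event AS u
  ON u.root_id = e.event_id;
```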
If there is any other way to improve the pipeline, let me know. The end goal is to move the data to BigQuery, as we want a hybrid (both AWS and GCP) cloud solution.