Archiving raw events in GCP

Hello!

This is more of a discussion than a real problem.

We’re about to move from an AWS Batch pipeline to a real-time pipeline in GCP.

In AWS, we used to keep an archive of all the raw events that the collector logged.
I can see how to replicate this in GCP (with the GCS Loader, although it could be a bit expensive), but these raw events (straight out of the collector) are Thrift records, and I can’t find any schema that lets me decode them.

I’ve tried both
CollectorPayload (https://github.com/snowplow/snowplow/blob/master/2-collectors/thrift-schemas/collector-payload-1/src/main/thrift/collector-payload.thrift)
and
SnowplowRawEvents (https://github.com/snowplow/snowplow/blob/master/2-collectors/thrift-schemas/snowplow-raw-event/src/main/thrift/snowplow-raw-event.thrift),
without any luck.

I think CollectorPayload is a schema for decoding the events later in the pipeline (after the enrichment step, mainly the bad rows), while SnowplowRawEvents is for another kind of collector. Is that correct?

So first: did I mess something up when decoding, and should one of those schemas work?

Otherwise, I’m curious what everyone’s approach is regarding raw events in GCP. For us, it would bring a feeling of security, knowing that we could replay things in case of a catastrophic failure, like we could in AWS.

Is it something you’re abandoning? Are you building custom solutions? Is archiving raw events not considered a “good practice” anymore? Or am I just missing something dumb? :smiley:

Thanks in advance for your help and your input :slight_smile:

Hi @Timmycarbone. The Scala Stream Collector (the one used in real-time Snowplow pipelines) outputs a Thrift-encoded binary payload. Those schemas are indeed valid, but the problem is that you can’t easily decode the binary data the way you would a string payload.
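
To actually decode those records you need a Thrift deserializer plus the classes generated from the collector-payload-1 schema you linked. Here’s a minimal sketch in Scala, assuming libthrift and the generated Java bindings are on the classpath (the package name below is the `java` namespace declared in that schema file, so double-check it against your generated code):

```scala
import org.apache.thrift.TDeserializer
import org.apache.thrift.protocol.TBinaryProtocol
// Generated from collector-payload.thrift (collector-payload-1)
import com.snowplowanalytics.snowplow.CollectorPayload.thrift.model1.CollectorPayload

// Deserialize one raw record (e.g. the body of a message on the raw
// Pub/Sub topic) into a CollectorPayload struct.
def decodeRawEvent(bytes: Array[Byte]): CollectorPayload = {
  val deserializer = new TDeserializer(new TBinaryProtocol.Factory())
  val payload = new CollectorPayload()
  deserializer.deserialize(payload, bytes) // fills the struct in place
  payload
}
```

Once decoded, fields like the querystring, body and user agent are plain strings again (e.g. `payload.getQuerystring`).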

You can use the GCS Loader to persist the data in blob storage. If you want to be able to query the data, it might be a better idea to sink the Pub/Sub topic that carries all the enriched data, rather than the raw collector data. Of course, this way you don’t have a permanent record of the data pre-enrichment. You can also sink the raw topic: the challenge in querying it will be much greater, but you can still use it to replay events through Enrichment, etc.
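
For the replay path specifically, the idea is just to read the archived bytes back and re-publish them, untouched, to the raw topic that Enrichment consumes. A rough sketch with the Java Pub/Sub client (the project, topic and one-record-per-file layout are all placeholder assumptions; the actual file layout depends on how your sink was configured):

```scala
import java.nio.file.{Files, Paths}
import com.google.cloud.pubsub.v1.Publisher
import com.google.protobuf.ByteString
import com.google.pubsub.v1.{PubsubMessage, TopicName}

// Re-publish one archived raw payload to the raw topic so Enrichment picks
// it up again. "my-project" and "raw-topic" are placeholders; this assumes
// the file holds a single Thrift record, byte-for-byte as the collector
// wrote it.
def replayRawEvent(path: String): Unit = {
  val publisher = Publisher.newBuilder(TopicName.of("my-project", "raw-topic")).build()
  try {
    val bytes = Files.readAllBytes(Paths.get(path))
    val message = PubsubMessage.newBuilder()
      .setData(ByteString.copyFrom(bytes))
      .build()
    publisher.publish(message).get() // block until the publish is acknowledged
  } finally {
    publisher.shutdown()
  }
}
```

The key point is that the payload has to reach the raw topic byte-for-byte as the collector produced it; any re-encoding along the way will likely make Enrichment fail to deserialize it.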

Hope this makes sense.

Makes total sense, thanks a lot!

I tried to use the GCS Loader to persist the raw events in GCS, then use Pub/Sub to pull those raw events back out of GCS and send them to Enrichment, but I failed. I guess I did something wrong; maybe the raw events got re-encoded before being stored in GCS, or something.

I’ll retry this approach then!

Otherwise, I guess persisting the output of Enrichment (both good and bad rows) makes sense too. It doesn’t cover a “critical failure” of the Enrichment step itself, though. I guess that shouldn’t happen, but better safe than sorry :slight_smile:

Thanks a lot!