Convert Snowplow thrift files (on S3) to parquet


#1

Hi,

We are using the Snowplow Scala collector to collect events. It’s a standard collection pipeline: the collector sinks events into Kinesis, and kinesis-s3 consumes from Kinesis and writes these events to S3.

Our intention is to use PrestoDB to analyze the S3 files. We’d like to convert these Thrift files to Parquet format, since Parquet supposedly performs better. Any suggestions on how to go about that? Also, is it possible to dump the events to S3 directly in Parquet format?

Cheers
Nitish


#2

Hi @shardnit - putting a query engine like Presto or Drill or Impala over the raw collector payloads isn’t going to get you very far - you’ll be missing out on all the format translation, schema validation and event enrichment (“dimension widening”) that the Stream Enrich (or indeed Hadoop Enrich) component does.

So we would always recommend analyzing the enriched event files in S3. In terms of Parquet support for these - it’s something we’d like to support in the future, but there’s a lot of work to do first to refactor our enriched event format (likely into Avro). You’ll find our first Avro milestone in our GitHub repo.

In the meantime, the recommended way of analyzing the enriched event files in S3 is to use Spark plus our Snowplow Python or Scala Analytics SDKs.
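
As a rough illustration of that approach, here is a minimal PySpark sketch that reads enriched event lines, runs them through the Python Analytics SDK's `transform` function, and writes the result out as Parquet. It is only a sketch under some assumptions: the `safe_transform` and `convert` helpers and the bucket paths are hypothetical names made up for this example, the job assumes `pyspark` and `snowplow_analytics_sdk` are installed on the cluster, and it assumes `transform` returns a JSON-serializable dict for each enriched event line. Lines that fail to transform are silently dropped here, which you may not want in production.

```python
import json


def safe_transform(line, transform):
    """Return the event as a JSON string, or None if the line cannot be transformed.

    `transform` is passed in so this helper can be exercised without the SDK installed.
    """
    try:
        return json.dumps(transform(line))
    except Exception:
        # Broad catch for brevity; the SDK raises its own exception type on bad lines.
        return None


def convert(in_path, out_path):
    """Read enriched events from in_path and write them to out_path as Parquet."""
    # Imported lazily so the module can be loaded without Spark or the SDK present.
    from pyspark.sql import SparkSession
    from snowplow_analytics_sdk.event_transformer import transform

    spark = SparkSession.builder.appName("enriched-to-parquet").getOrCreate()

    # Each line of the enriched files is one tab-separated event.
    lines = spark.sparkContext.textFile(in_path)
    events = (
        lines.map(lambda line: safe_transform(line, transform))
        .filter(lambda event: event is not None)
    )

    # Let Spark infer a schema from the JSON strings, then write Parquet.
    df = spark.read.json(events)
    df.write.mode("overwrite").parquet(out_path)
```

You would call it with something like `convert("s3a://my-bucket/enriched/", "s3a://my-bucket/parquet/")` (both paths are placeholders). Presto can then be pointed at the Parquet output rather than at the raw collector payloads.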