Minimal Enrich Setup?

noamgat · June 26, 2017, 7:47am

Hi,

We are trying to set up a minimal Snowplow collection pipeline for internal analytics.
We already have a system that can take unstructured JSON lines and process them.
What we would like is to set up a simple Track -> Collect -> Parse process.
We have set up the track and collect steps (using the CloudFront collector), and we are trying to understand the simplest way to convert the collected logs from the CloudFront format into something more manageable.
Which of the Snowplow solutions is the most minimal way to perform a “read CloudFront, write JSON” style operation?

ihor · June 26, 2017, 8:29pm

Hi @noamgat,

One approach you can take is to use Spark on EMR to analyze the enriched data. We provide the Scala Analytics SDK and the Python Analytics SDK, which take the enriched data in S3 and transform it into easy-to-work-with JSON, with the different context and event fields as nested properties, so you don’t need to join different tables the way you would in Redshift.

A tutorial to analyse your Snowplow data in Spark on EMR with Zeppelin can be found here.

As for using Spark in the Snowplow pipeline, our approach is to use our EventTransformer function, which should automatically take your Snowplow data (including the embedded JSONs) and turn it into a nice JSON format that is then straightforward to convert into a table.

Our intention is to create a standalone EventTransformer function and maintain it as we evolve our data structure so that any Hadoop based downstream process that starts with it will continue to work as that data structure evolves. This should be much more elegant than manually creating tables in Hive. You can see it in action in this blog post: http://snowplowanalytics.com/blog/2015/12/02/data-modeling-in-spark-exploring-spark-sql/. See the first section in particular (Loading Snowplow data into Spark.)

If you adopt this approach, you can pass the option --skip shred to EmrEtlRunner when you spin up the EMR cluster, as you do not need the shredded entities. You also do not need to use StorageLoader.

You can refer to this diagram to visualize the steps I’m referring to.

ihor · June 26, 2017, 8:50pm

Yet another approach is to use Athena (instead of Spark). Here’s the link to a just-published tutorial: Using AWS Athena to query the ‘good’ bucket on S3

noamgat · June 28, 2017, 3:09pm

I may have phrased that badly. My goal here is not to analyse the data but to prepare it for an internal system that expects files of JSON lines. I would just like to convert a CloudFront log file that follows Snowplow conventions (analytics params in URI query params, etc.) into a JSON-per-line file containing the good events, and I am trying to find the pipeline for this that is easiest to set up and maintain.

ihor · June 29, 2017, 10:50pm

@noamgat,

You won’t get the “raw” events in JSON format. Here’s the documentation on the format of the events from the Cloudfront collector: https://github.com/snowplow/snowplow/wiki/Collector-logging-formats#the-cloudfront-logging-format-with-cloudfront-naming-convention. In fact, it’s a format imposed by Amazon: http://docs.aws.amazon.com/AmazonCloudFront/latest/DeveloperGuide/AccessLogs.html.

If your “Parse process” can work with that format, then you don’t really need anything beyond what you already described: Tracker => CloudFront collector => S3 (“raw”) bucket => Parse process

I added S3 to the flow because the collector would have log rotation enabled, pushing the files to S3 on an hourly (or so) basis.
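For illustration, here is a rough sketch of what such a parse step could look like in plain Python. It is not a Snowplow component and does none of the validation or enrichment that Snowplow Enrich performs. It assumes the log file carries the usual `#Fields:` header line, that tracker hits are requests for the `/i` pixel, and that the tracker payload sits in `cs-uri-query`. The function name and the output keys are hypothetical choices, and real CloudFront logs may need extra decoding that this sketch does not attempt.

```python
import json
from urllib.parse import parse_qsl, unquote

# Assumption: Snowplow trackers request the "/i" pixel on the CloudFront collector.
SNOWPLOW_PIXEL_PATH = "/i"


def cloudfront_to_json_lines(lines):
    """Yield one JSON string per Snowplow pixel request in a CloudFront access log.

    Hypothetical helper: column names are read from the '#Fields:' header
    rather than hardcoded, and the output keys are illustrative only.
    """
    fields = None
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("#Fields:"):
            # Header looks like: "#Fields: date time x-edge-location ..."
            fields = line[len("#Fields:"):].split()
            continue
        if not line or line.startswith("#") or fields is None:
            continue  # skip other comment lines, blanks, and rows before the header
        row = dict(zip(fields, line.split("\t")))
        if row.get("cs-uri-stem") != SNOWPLOW_PIXEL_PATH:
            continue  # not a tracker hit (e.g. favicon requests)
        query = row.get("cs-uri-query", "-")
        if query == "-":
            continue  # CloudFront writes "-" for an empty field
        record = {
            "collector_tstamp": f"{row.get('date', '')} {row.get('time', '')}",
            "user_ipaddress": row.get("c-ip"),
            "useragent": unquote(row.get("cs(User-Agent)", "")),
            # Tracker payload, e.g. e=pv, aid=..., p=web, url=...
            "params": dict(parse_qsl(query, keep_blank_values=True)),
        }
        yield json.dumps(record, sort_keys=True)
```

You would feed it the lines of a (decompressed) log file and write each yielded string as one line of output. Note that “good events” here only means “looks like a pixel request with a query string”; anything stricter would need the real enrichment step.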
