Snowplow real-time analysis with on-premise pipeline


#1

Hi Snowplow Team,

We are currently evaluating Snowplow and had it running for almost a month with this pipeline: NSQ (For Raw and Enrich Data Processing) -> Logstash (For Data Transformation from TSV to JSON) -> Elasticsearch (Storage) -> Kibana (Analysis and Visualization). Though we’re able to get real time data using abovementioned pipeline, it was advised that Redshift can provide in-depth data which Elasticsearch may lack. I am actually looking what’s the best platform or pipeline for real-time analysis in case we want to use Redshift. Should we dump the current setup or NSQ can still work as our messaging platform?

Basing from what I have researched, it seems we must utilize full AWS pipeline if we opt to use Redshift which might be costly for evaluation. Well, least cost is ideal as much as possible and without the need to heavily depend on AWS except for Redshift, I guess. Is there any other approach to this setup? Can we really use Redshift for real-time processing or it is meant for batch processing?

Thanks in advance! Looking forward to your suggestions. I’d be glad to know what really works well for this.


On-premise Realtime Pipeline
#2

I have been looking to do this as well but haven’t found a “snowplow” specific way to do it. Depending on the complexity of the analysis the options I have looked at are:

  • Perform basic analytics on logs in Splunk, this is an expensive option at scale but requires very little setup (docker image Here that you can throw into ECS or EKS).

  • Query with sql in Kinesis itself through kinesis analytics. I don’t know how this would work with enriched snowplow events in their base64 and json format but this could be a very nifty serverless solution if someone could put it together (see blog from AWS here)

  • Go down the spark streaming route. This could potentially involve EMR clusters and connections from kinesis firehose. The advantage here is the speed, complexity and scalability of the ecosystem. Disadvantage is obviously the maintenance of the infrastructure.

If anyone else has managed to use snowplow in this way to achieve realtime aggregation or analysis with big timeframes that it would be great to hear your experiences


#4

What sort of queries are you looking to run?

Are they simple aggregations? Across what volumes of time/event volumes?


#5

We would like to analyze user experience from the start he hits the button, in between, up to the last action he made or whatever interruption there is. We have apps both on web and mobile and would be great to have the data in real time so we know what we need to further develop. I would say averagely 50 million events per month.


#6

You can use Kinesis Analytics on the top of the enriched stream (no base 64 in enriched is a huge advantage here). Another approach is kind of microbatch on the top of Elasticsearch. Unfortunately 50 million/month would require quite big cluster (currently I use 3* m3.medium for about 8 millions of events). Lots depends on user action history tail…


#7

For something on that sort of volume depending on how long your typical sessions are then Apache Spark (streaming) or Apache Dataflow would both be good fits for performing this kind of analysis.

You’d need to ensure that you’re using a compatible source for both of these tools (e.g., Kafka/Kinesis/PubSub etc).


#8

Hi @jenfiner - just to add some more thoughts on a couple of your points, I hope they’re helpful:

You shouldn’t have to use Logstash for preparing data for Elasticsearch - our Snowplow Elasticsearch Loader speaks NSQ and handles Snowplow events natively - the relevant release post is here:

You’re right - at the moment only AWS-based Snowplow pipelines support loading into Redshift (which is handled by the batch pipeline running on EMR). We will most likely extend the RDB Loader to be able to load NSQ->Postgres (so that we can offer this capability inside Snowplow Mini), but NSQ->Redshift will likely be lower priority than this.