Questions about setting up the real-time pipeline



Does anyone have a better step-by-step guide for setting up Snowplow (I want to use the Scala Stream Collector) on AWS? I am struggling a bit to follow the guides on GitHub.



Hi @stephan,

Sorry to hear you are having difficulties setting up the real-time pipeline.

Could you be more specific about the kind of problem you have encountered, please?



Well, I am basically new to AWS and Snowplow. I am struggling to understand exactly what I need to do to get it up and running. I am pretty good with PHP, MySQL and JS but have never done anything like this.


I have seen you changed the title of my question. I am not only struggling with the collector but with the entire setup. But I guess getting a collector installed would be a good step in the right direction.



Taking your comments into account, I would suggest first familiarizing yourself with the general concept of the Snowplow pipeline. Each component of the pipeline can be built and set up independently. Do it one by one, making sure each component works before you move on to the next.

Regardless of whether it is a batch pipeline or a real-time pipeline, a Snowplow pipeline consists of 6 main components. The ones you are most interested in at the moment are:

1.Tracker -> 2.Collector -> 3.Enrichment -> 4.Storage.
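To make the "one component at a time" idea concrete, here is a toy sketch in Python. It is not Snowplow code: every function name and event field below is made up for illustration, and each function merely stands in for the real component of the same name. The point is only that each stage can be checked on its own before the next one is wired in.

```python
from typing import Dict, List

Event = Dict[str, str]


def tracker() -> Event:
    # Stand-in for a tracker: emits one raw event (fields are hypothetical).
    return {"event": "page_view", "url": "https://example.com/"}


def collector(event: Event) -> Event:
    # Stand-in for the collector: accepts the raw event and marks it received.
    return {**event, "collected": "true"}


def enrich(event: Event) -> Event:
    # Stand-in for enrichment: adds derived information to the event.
    return {**event, "enriched": "true"}


def store(event: Event, sink: List[Event]) -> None:
    # Stand-in for storage: appends the event to an in-memory list.
    sink.append(event)


def run_pipeline() -> List[Event]:
    sink: List[Event] = []
    raw = tracker()
    collected = collector(raw)
    # Verify stage 2 works before moving on to stage 3.
    assert collected["collected"] == "true"
    enriched = enrich(collected)
    # Verify stage 3 works before moving on to stage 4.
    assert enriched["enriched"] == "true"
    store(enriched, sink)
    return sink
```

In a real setup the checks between stages would be things like "the collector answers HTTP requests" or "enriched events show up in the output stream", but the order of work is the same.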

You can start with either a JavaScript Tracker or Stream Collector.

The easiest way to install the Stream Collector is to download the compiled and zipped application. Just follow the instructions here. In fact, the archive comes with two more components, namely Stream Enrich and Kinesis Elasticsearch Sink. The former falls under the 3.Enrichment component in my diagram above. The latter is required if you want to store your data in Elasticsearch and corresponds to the 4.Storage component.

Before you can launch the collector, you need to configure it. Once the HOCON configuration file has been amended to reflect your pipeline setup, you can try to run it.
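Once the collector is running, a quick way to confirm it is reachable is to request a health endpoint over HTTP. The sketch below is only an illustration under assumptions: the `/health` path, the host and the port are placeholders, so check them against the documentation for your collector version and the interface/port you set in your configuration file.

```python
from urllib.request import urlopen


def health_url(host: str, port: int, path: str = "/health") -> str:
    # Build the URL to probe. The "/health" default is an assumption.
    return f"http://{host}:{port}{path}"


def is_healthy(url: str, opener=urlopen, timeout: float = 3.0) -> bool:
    # Return True only if the endpoint answers with HTTP 200.
    # `opener` is injectable so the function can be tested without a network.
    try:
        with opener(url, timeout=timeout) as response:
            return response.status == 200
    except OSError:
        # URLError and HTTPError are subclasses of OSError, so connection
        # failures, timeouts and non-2xx responses all end up here.
        return False
```

For example, `is_healthy(health_url("localhost", 8080))` would probe a collector on the local machine, where `localhost` and `8080` are placeholder values rather than defaults taken from the guides.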

Mind you, you could simply install the Snowplow Mini app, which is the whole real-time pipeline in miniature with a “one-click” (kind of) setup. The whole app runs on a single EC2 instance! See if this suits your needs.



What is the recommended AWS node for the Scala Stream Collector? Also, I am unable to find information on the scaling properties of this collector.
Secondly, the Kinesis setup steps seem to be missing from the setup guide.