Feedback on Snowplow documentation


#1

Hey @alex and team. I am working on implementing the Snowplow AWS real time pipeline and had some feedback on the documentation where I got a bit stuck. Maybe you can update the docs?

On github.com/snowplow/snowplow/wiki/Configure-the-Scala-Stream-Collector

HTTP settings

Also verify the settings of the HTTP service:

collector.interface
collector.port

Some Googling suggested you leave interface as-is then put in 80 for the normal use case.

Buffer settings

You will also need to set appropriate limits for:

collector.buffer.byte-limit
collector.buffer.record-limit
collector.buffer.time-limit

No idea what the normal use case would be. Would these values change depending on the number of shards? Also, time-limit does not stipulate whether it’s minutes, seconds or milliseconds in the sample config or setup guide.

On github.com/snowplow/snowplow/wiki/Hosted-assets

Link to http://dl.bintray.com/snowplow/snowplow-generic/kinesis_s3_0.5.0.zip is broken


#2

Hey @timgriffinau,

Link to http://dl.bintray.com/snowplow/snowplow-generic/kinesis_s3_0.5.0.zip is broken

It looks like the documentation for this download was updated a bit too eagerly. The last release is actually version 0.4.1; 0.5.0 is in RC still.

The working link is: http://dl.bintray.com/snowplow/snowplow-generic/kinesis_s3_0.4.1.zip


The documentation does need a bit of work to be clearer. However, to answer your questions here:

Some Googling suggested you leave interface as-is then put in 80 for the normal use case.

If you are implementing the collector behind something like an AWS Load Balancer you can put this component on any port you like - you will just need to configure the Listener to forward requests to the correct port. Port 80 is, however, recommended!

As the collector does not handle TLS termination itself, you will always need some form of Load Balancer / Proxy in front of it which can then route traffic from that point.
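For reference, the HTTP settings in the collector's HOCON config might look like this for the behind-a-load-balancer setup (the values here are illustrative, not taken from your environment):

```hocon
collector {
  # Bind on all interfaces; the load balancer terminates TLS
  # and forwards plain HTTP to this port.
  interface = "0.0.0.0"
  port = 80
}
```

The Listener on the load balancer would then forward incoming requests (e.g. HTTPS on 443) to port 80 on the collector instances.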

No idea what the normal use case would be. Would these values change depending on the number of shards? Also, time-limit does not stipulate whether it’s minutes, seconds or milliseconds in the sample config or setup guide.

From the sample configuration file here you can get a quick overview of what each of these settings does:

https://github.com/snowplow/snowplow/blob/master/2-collectors/scala-stream-collector/examples/config.hocon.sample#L105-L116

  • collector.buffer.time-limit: This is measured in milliseconds

No, you would not change these numbers based on the number of shards, although the shard count can have an impact on whether some of these settings succeed. Kinesis has several per-shard limits that need to be taken into account when choosing values for these buffers.

In the case of the Stream Collector the main thing to keep an eye on is that:

Each PutRecords request can support up to 500 records. Each record in the request can be as large as 1 MB, up to a limit of 5 MB for the entire request, including partition keys. Each shard can support writes up to 1,000 records per second, up to a maximum data write total of 1 MB per second.

Taken from: http://docs.aws.amazon.com/kinesis/latest/APIReference/API_PutRecords.html

So to ensure your application does not run into any issues pushing data to the stream it must adhere to these limits. Our default settings for this application are:

byte-limit: 4000000 # 4 MB
record-limit: 500
time-limit: 5000 # 5 seconds
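As a quick sanity check (an illustrative sketch, not code from the collector itself), you can confirm that these defaults stay within the PutRecords limits quoted above:

```python
# Check the default collector buffer settings against the Kinesis
# PutRecords API limits (500 records and 5 MB per request).
PUT_RECORDS_MAX_RECORDS = 500
PUT_RECORDS_MAX_BYTES = 5 * 1000 ** 2  # 5 MB per request, incl. partition keys

defaults = {
    "byte-limit": 4_000_000,   # 4 MB
    "record-limit": 500,
    "time-limit": 5_000,       # milliseconds, i.e. 5 seconds
}

# The buffer flushes before either API ceiling is reached.
assert defaults["record-limit"] <= PUT_RECORDS_MAX_RECORDS
assert defaults["byte-limit"] <= PUT_RECORDS_MAX_BYTES
print("defaults fit within a single PutRecords request")
```

The 4 MB byte-limit deliberately leaves headroom below the 5 MB request ceiling, since partition keys also count toward the total.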


Hope this helps!

Josh


#3

Link to http://dl.bintray.com/snowplow/snowplow-generic/kinesis_s3_0.5.0.zip is broken

It’s now been fixed, since 0.5.0 was published on Friday.