I’ve been looking to spin up a new real-time Snowplow streaming analytics pipeline, but the documentation is a bit confusing. As I understand it, a typical Snowplow streaming setup would run a separate S3 Loader application on its own server instance (or cluster) for each of the following:
- Loading from the collector-fed raw “good” stream to S3 [Thrift to LZO]
- Loading from the collector-fed raw “bad” stream to S3 [Thrift to LZO]
- Loading from the enriched “good” stream to S3 [TSV to GZIP]
- Loading from the enriched “bad” stream to S3 [TSV to GZIP]
Complicating matters, the S3 Loader config itself takes both an input stream and an output stream.
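For reference, here is roughly how I read the sample config for the (older, 0.x) S3 Loader; the key names may differ by version, and the stream/bucket names are just placeholders of mine:

```hocon
# Sketch of ONE S3 Loader instance, here sinking the enriched "good" stream.
# My reading: inStreamName is the Kinesis stream being read, and
# outStreamName is a *failure* stream for records that could not be written
# to S3 -- not a second data output.
source = "kinesis"
sink   = "kinesis"

aws {
  accessKey = "iam"
  secretKey = "iam"
}

kinesis {
  initialPosition = "LATEST"
  maxRecords      = 500
  region          = "us-east-1"
  appName         = "s3-loader-enriched-good"   # also the KCL checkpoint table name
}

streams {
  inStreamName  = "enriched-good"        # stream to sink to S3
  outStreamName = "s3-loader-failures"   # records that failed to write to S3
  buffer {
    byteLimit   = 1048576
    recordLimit = 500
    timeLimit   = 60000
  }
}

s3 {
  region     = "us-east-1"
  bucket     = "my-bucket/enriched/good"
  format     = "gzip"    # "lzo" for the raw streams, "gzip" for enriched
  maxTimeout = 10000
}
```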
Here are my questions:
- Is this correct as far as the number of separate S3 Loader applications that should be run (four)?
- What stream should be used for the S3 Loader output stream, and what would even consume it? As far as I can tell it would be fed events that already failed to make it through enrichment or RDB loading and then also failed to load to S3… aren’t those unrecoverable?
- If I’m understanding this correctly, that seems like a LOT of required resources just for the S3 logging alone. Wouldn’t it be easier AND cheaper to use Kinesis Firehose? I know the Snowplow team recommends the S3 Loader application, but I’ve seen mentions that Firehose can be used without too much difficulty (it does require a Lambda function to convert the enriched stream to the proper loading format; I’ve sketched my understanding of that below).
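In case it helps clarify what I mean, this is my (untested) sketch of that Firehose transformation Lambda, using the standard Firehose record-transformation contract. The only real work is appending a newline to each enriched TSV event, which is my assumption about what “proper loading format” refers to:

```python
import base64


def handler(event, context):
    """Kinesis Firehose data-transformation Lambda (sketch).

    Firehose passes a batch of base64-encoded records; each recordId must be
    returned with a result of "Ok", "Dropped", or "ProcessingFailed".
    """
    output = []
    for record in event["records"]:
        payload = base64.b64decode(record["data"])
        # Snowplow enriched events are single tab-separated lines with no
        # trailing newline; append one so the S3 objects Firehose writes
        # contain one event per line.
        transformed = payload + b"\n"
        output.append({
            "recordId": record["recordId"],
            "result": "Ok",
            "data": base64.b64encode(transformed).decode("utf-8"),
        })
    return {"records": output}
```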