Kinesis Streams vs S3 Buckets

I am a little bit confused about where to use Kinesis streams and when to load data into S3 in our pipeline.

  1. In the docs for the ScalaStreamCollector, it’s written that we have to provide two Kinesis streams; one for good events and one for events that are too big for a Kinesis stream or in the wrong format. But how would the latter be able to go through a Kinesis stream, if they are too big?

  2. We want to persist our raw event data to S3 as a safety fallback.
    2a) Where should we run the s3-loader? On the same instance as the collector? Or do we need a separate autoscaling group for this?
    2b) Could we just directly dump the content of the raw event Kinesis stream into S3 via a Kinesis Firehose? Or would it be stored in the wrong format like that?

  3. After StreamEnrich, we need to load the enriched data into S3 since that seems to be the input for the Snowflake transformer & loader. Do we need a separate instance here running the s3-loader?

  1. Events that are too large are generally unrecoverable. There is a new (soon to be released) bad row format that attempts to better address this.

  2. You could use Kinesis Analytics to pipe your raw stream to Kinesis Firehose that is configured to send data to s3 if you’re concerned that having only the enriched events in S3 but I think many might think thats overkill if you have proper testing of your event payloads against something like snowplow-mini.

  3. You can run the s3-loader on the same ec2 instance that you run the collector on if you want. Some users run many of the components of the pipeline in containers with other things running on those hosts.

2 Likes

Run this on a separate autoscaling instance from your collector.

Ideally yes - there’s nothing to stop you running these processes on the same instance but it’s not recommended in a production setting.

2 Likes

All good answers above - the only thing I’ll add is if you plan on using Firehose for Enriched events, it might be useful to first read the ticket with our rationale for not using it, and to search for the topic on discourse, since others have explored it too - for example this ticket might be interesting.

1 Like

Thanks for the replies!

I don’t understand then why the documentation, and comments in the config for the ScalaStreamCollector says

" # Events that are too big (w.r.t Kinesis 1MB limit) will be stored in the bad stream/topic
bad = {{bad}}
bad = ${?COLLECTOR_STREAMS_BAD}"

  • is this for non-Kinesis streams? Do I even specify a bad stream in the collector? If yes, what does it do?