I'm a little confused about when to use Kinesis streams and when to load data into S3 in our pipeline.
The docs for the Scala Stream Collector say we have to provide two Kinesis streams: one for good events, and one for events that are too big for a Kinesis record or in the wrong format. But how can the latter go through a Kinesis stream if they are too big for it?
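Our guess is that the collector must truncate such events before writing them to the bad stream. A hypothetical sketch of that idea (the names `KINESIS_MAX_BYTES` and `to_bad_row` are ours, not Snowplow's; we use ~1 MB as the per-record limit):

```python
import json

# Kinesis allows roughly 1 MB per record (the documented limit is 1 MiB).
KINESIS_MAX_BYTES = 1_000_000

def to_bad_row(payload: bytes, limit: int = KINESIS_MAX_BYTES) -> bytes:
    """Hypothetical: wrap an oversized payload in a small error record
    that keeps only a prefix, so the bad row itself fits in one record."""
    snippet = payload[:1000].decode("utf-8", errors="replace")
    bad_row = {
        "error": "Payload exceeds Kinesis record limit",
        "original_size": len(payload),
        "payload_prefix": snippet,
    }
    return json.dumps(bad_row).encode("utf-8")

oversized = b"x" * 2_000_000          # 2 MB fake payload
record = to_bad_row(oversized)
assert len(record) < KINESIS_MAX_BYTES  # the bad row fits in one record
```

If that's roughly what happens, it would explain how "too big" events still fit on the bad stream, but we'd like confirmation.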
We want to persist our raw event data to S3 as a safety fallback.
2a) Where should we run the s3-loader? On the same instance as the collector? Or do we need a separate autoscaling group for it?
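For context, this is roughly how we'd expect to configure the loader for the raw stream. Treat it purely as a sketch: the exact key names depend on the s3-loader version and should be taken from the reference HOCON config in the snowplow-s3-loader repo, and all stream/bucket names below are placeholders:

```hocon
# Sketch only: key names vary between s3-loader versions; check the
# reference config shipped with snowplow-s3-loader before using.
source = "kinesis"

kinesis {
  region          = "eu-west-1"     # placeholder region
  streamName      = "raw-good"      # placeholder: the collector's good stream
  initialPosition = "LATEST"
  appName         = "s3-loader-raw" # checkpoint table name
}

s3 {
  region = "eu-west-1"
  bucket = "our-raw-events"         # placeholder bucket
  format = "lzo"                    # raw collector payloads are binary
}
```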
2b) Could we instead dump the contents of the raw event Kinesis stream into S3 directly via Kinesis Firehose? Or would the data end up in the wrong format that way?
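To make the 2b concern concrete: as far as we understand, Firehose concatenates the records in a delivery batch back to back without adding any delimiter. A small illustration (the byte values are made up) of why that worries us for binary payloads like the collector's Thrift records:

```python
# Three fake binary records, each starting with an arbitrary marker byte pair.
records = [b"\x0b\x00payload-one", b"\x0b\x00pay", b"\x0b\x00payload-three!"]

# What a Firehose-style sink would write into a single S3 object:
# the records joined with no length prefix and no delimiter.
s3_object = b"".join(records)

# The marker bytes happen to survive here, but a marker could equally
# occur *inside* a payload, so splitting on it is not reliable. Without
# framing, the original record boundaries are unrecoverable.
assert s3_object.count(b"\x0b\x00") == len(records)
```

The Snowplow s3-loader presumably adds such framing when it writes to S3, which is why we suspect a plain Firehose dump would be in the wrong format, but we'd appreciate confirmation.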
3) After Stream Enrich, we need to load the enriched data into S3, since that seems to be the input for the Snowflake transformer and loader. Do we need another instance running the s3-loader here?