I'm a little confused about when to use Kinesis streams and when to load data into S3 in our pipeline.
The docs for the Scala Stream Collector say we have to provide two Kinesis streams: one for good events, and one for events that are too big for a Kinesis record or in the wrong format. But how can the latter go through a Kinesis stream if they are too big for it?
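Our guess is that the collector must truncate such events before writing them to the bad stream. A hypothetical sketch of that idea (the names `KINESIS_MAX_BYTES` and `to_bad_row` are ours, not Snowplow's; we use ~1 MB as the per-record limit):

```python
import json

# Kinesis allows roughly 1 MB per record (the documented limit is 1 MiB).
KINESIS_MAX_BYTES = 1_000_000

def to_bad_row(payload: bytes, limit: int = KINESIS_MAX_BYTES) -> bytes:
    """Hypothetical: wrap an oversized payload in a small error record
    that keeps only a prefix, so the bad row itself fits in one record."""
    snippet = payload[:1000].decode("utf-8", errors="replace")
    bad_row = {
        "error": "Payload exceeds Kinesis record limit",
        "original_size": len(payload),
        "payload_prefix": snippet,
    }
    return json.dumps(bad_row).encode("utf-8")

oversized = b"x" * 2_000_000          # 2 MB fake payload
record = to_bad_row(oversized)
assert len(record) < KINESIS_MAX_BYTES  # the bad row fits in one record
```

If that's roughly what happens, it would explain how "too big" events still fit on the bad stream, but we'd like confirmation.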
We want to persist our raw event data to S3 as a safety fallback.
2a) Where should we run the s3-loader? On the same instance as the collector? Or do we need a separate autoscaling group for it?
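For context, this is roughly how we'd expect to configure the loader for the raw stream. Treat it purely as a sketch: the exact key names depend on the s3-loader version and should be taken from the reference HOCON config in the snowplow-s3-loader repo, and all stream/bucket names below are placeholders:

```hocon
# Sketch only: key names vary between s3-loader versions; check the
# reference config shipped with snowplow-s3-loader before using.
source = "kinesis"

kinesis {
  region          = "eu-west-1"     # placeholder region
  streamName      = "raw-good"      # placeholder: the collector's good stream
  initialPosition = "LATEST"
  appName         = "s3-loader-raw" # checkpoint table name
}

s3 {
  region = "eu-west-1"
  bucket = "our-raw-events"         # placeholder bucket
  format = "lzo"                    # raw collector payloads are binary
}
```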
2b) Could we instead dump the contents of the raw event Kinesis stream into S3 directly via Kinesis Firehose? Or would the data end up in the wrong format that way?
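To make the 2b concern concrete: as far as we understand, Firehose concatenates the records in a delivery batch back to back without adding any delimiter. A small illustration (the byte values are made up) of why that worries us for binary payloads like the collector's Thrift records:

```python
# Three fake binary records, each starting with an arbitrary marker byte pair.
records = [b"\x0b\x00payload-one", b"\x0b\x00pay", b"\x0b\x00payload-three!"]

# What a Firehose-style sink would write into a single S3 object:
# the records joined with no length prefix and no delimiter.
s3_object = b"".join(records)

# The marker bytes happen to survive here, but a marker could equally
# occur *inside* a payload, so splitting on it is not reliable. Without
# framing, the original record boundaries are unrecoverable.
assert s3_object.count(b"\x0b\x00") == len(records)
```

The Snowplow s3-loader presumably adds such framing when it writes to S3, which is why we suspect a plain Firehose dump would be in the wrong format, but we'd appreciate confirmation.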
3) After Stream Enrich, we need to load the enriched data into S3, since that seems to be the input for the Snowflake transformer and loader. Do we need another instance running the s3-loader here?