Setup and run RDB Stream Shredder

Hello!

I’m trying to set up a new test pipeline based on the new RDB Stream Shredder, following the 1.0.0 Upgrade Guide page (Snowplow Docs).

But it’s not clear whether I need to run both RDB Stream Shredder and RDB Loader (I guess yes), or just RDB Stream Shredder. And should RDB Stream Shredder and RDB Loader share the same HOCON file (since it contains the config for both shredding and loading) and be started with the same arguments? i.e.

docker run \
  snowplow/snowplow-rdb-stream-shredder- \
  --iglu-config ewogICJzY2hlbWEiOiAiaWdsdTpjb20uc25vd3Bsb3dhbmFseXRp .... \
  --config ewogICJuYW1lIjogIkFjbWUgUmVkc2hpZnQiLAog ....

docker run \
  snowplow/snowplow-rdb-loader- \
  --iglu-config ewogICJzY2hlbWEiOiAiaWdsdTpjb20uc25vd3Bsb3dhbmFseXRp .... \
  --config ewogICJuYW1lIjogIkFjbWUgUmVkc2hpZnQiLAog ....

(I found this here: Run the RDB loader - Snowplow Docs)
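For reference, here’s how I generate the base64 arguments in the commands above. The file name is just what I use locally, and `-w0` (disable line wrapping) is the GNU coreutils flag:

```shell
# Stand-in for the real HOCON file, so the snippet is self-contained:
printf '{\n  "name": "Acme Redshift"\n}' > config.hocon

# Encode the file on a single line for the --config argument
# ("-w0" disables line wrapping; GNU coreutils)
CONFIG_B64=$(base64 -w0 config.hocon)

# Sanity check: decoding gives back the original file contents
echo "$CONFIG_B64" | base64 -d
```

The same encoding step produces the `--iglu-config` argument from the resolver JSON.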

Also, any recommendation for the type of EC2 instance? Is a t3a.micro (2 vCPU, 1 GB RAM) enough?

Hi @guillaume ,

Yes, the shredder and loader are two distinct apps and you need to run both.

They use the same config file.
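Roughly, the single file has one block read by the shredder and one read by the loader, alongside shared top-level settings. The field names and values below are a sketch from memory of the 1.0.0 config format — double-check them against the docs:

```hocon
{
  # Shared top-level settings (values are examples)
  "name": "Acme Redshift",
  "region": "eu-central-1",
  # SQS queue the shredder uses to tell the loader that a folder is ready
  "messageQueue": "acme-rdb-loader.fifo",

  # Read by the shredder (input stream / output bucket settings go here)
  "shredder": {
  },

  # Read by the loader (Redshift connection settings go here)
  "storage": {
  }
}
```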

The loader needs very few resources, and a t3a.micro is enough. For the shredder, the size of the EMR cluster depends on the amount of data that you have. You can find more details about how to run it here.

> For the shredder the size of the EMR cluster depends on the amount of data that you have. You can find more details about how to run it here.

Isn’t Stream Shredder supposed to not use EMR? From the announcement:

> Unlike existing Spark Shredder, the Stream Shredder reads data directly from enriched Kinesis stream and does not use Spark (neither EMR) - it’s a plain JVM application, like Stream Enrich or S3 Loader.
> Reading directly from Kinesis means that the Shredder can bypass long and error-prone S3DistCp staging/archiving steps. Another benefit is that it doesn’t work with bounded dataset anymore and can “emit” shredded folders based only on specified frequency.

Sorry, I missed the fact that you are using the new streaming shredder. In that case you indeed don’t need EMR. But please bear in mind that this component is not production-ready yet.

Just to add to @BenB’s comment about why it is not production-ready:

  • It does not scale horizontally: you cannot run more than one streaming shredder at the same time. The shredder sends an SQS message to tell the loader when to load, but this arrangement breaks if multiple shredders try to send the same SQS message.
  • It cannot do cross-batch deduplication of events.
  • We just have not battle-tested it in a high-throughput pipeline yet.

The streaming shredder will certainly be a core part of the Snowplow architecture in the future, just not yet.