Set up and run RDB Stream Shredder

Hello!

I’m trying to set up a new test pipeline based on the new RDB Stream Shredder, following this page: 1.0.0 Upgrade Guide - Snowplow Docs.

But it’s not very clear whether I need to run both RDB Stream Shredder and RDB Loader (I guess yes) or just RDB Stream Shredder. And should RDB Stream Shredder and RDB Loader share the same HOCON file (since it contains the config for both shredding and loading) and be started with the same args? i.e.

docker run \
  snowplow/snowplow-rdb-stream-shredder- \
  --iglu-config ewogICJzY2hlbWEiOiAiaWdsdTpjb20uc25vd3Bsb3dhbmFseXRp .... \
  --config ewogICJuYW1lIjogIkFjbWUgUmVkc2hpZnQiLAog ....

docker run \
  snowplow/snowplow-rdb-loader- \
  --iglu-config ewogICJzY2hlbWEiOiAiaWdsdTpjb20uc25vd3Bsb3dhbmFseXRp .... \
  --config ewogICJuYW1lIjogIkFjbWUgUmVkc2hpZnQiLAog ....

(found it here: Run the RDB loader - Snowplow Docs)
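
For context, both apps take the Iglu resolver and the pipeline config as base64-encoded strings. This is how I’m producing the blobs above, assuming the resolver JSON and HOCON config live in resolver.json and config.hocon (my own file names):

  # Base64-encode the resolver and config (GNU coreutils; -w 0 disables
  # line wrapping so each file becomes a single blob)
  base64 -w 0 resolver.json   # value for --iglu-config
  base64 -w 0 config.hocon    # value for --config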

Also, any recommendation for the type of EC2 instance? Is a t3a.micro enough (2 vCPU, 1 GB RAM)?

Hi @guillaume ,

Yes, the shredder and the loader are two distinct apps, and you need to run them both.

They use the same config file.

The loader needs very few resources and a t3a.micro is enough. For the shredder, the size of the EMR cluster depends on the amount of data that you have. You can find more details about how to run it here.
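
To illustrate the shared file: a rough sketch of its shape, with field names recalled from the 1.0.0 docs and all values as placeholders; check the upgrade guide for the authoritative schema:

  {
    # Both apps read this one file: the shredder uses the "shredder" section,
    # the loader uses "storage", and they coordinate via the SQS message queue
    "name": "Acme Redshift",
    "region": "us-east-1",
    "messageQueue": "acme-rdb-loader.fifo",
    "shredder": {
      # input (enriched events) and output (shredded folder) settings go here
    },
    "storage": {
      "type": "redshift"
      # host, database, port, roleArn, schema, credentials, ...
    }
  }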

Isn’t the Stream Shredder supposed not to use EMR, though? From the announcement:

Unlike the existing Spark Shredder, the Stream Shredder reads data directly from the enriched Kinesis stream and does not use Spark (nor EMR); it’s a plain JVM application, like Stream Enrich or S3 Loader.
Reading directly from Kinesis means that the Shredder can bypass the long and error-prone S3DistCp staging/archiving steps. Another benefit is that it no longer works with a bounded dataset and can “emit” shredded folders based only on a specified frequency.

Sorry, I missed the fact that you were using the new streaming shredder. In that case you indeed don’t need EMR. But please bear in mind that this component is not production-ready yet.

Just to add to @BenB’s comment about why it is not production-ready:

  • It does not scale horizontally: you cannot have more than one streaming shredder running at the same time. The shredder sends an SQS message to tell the loader when to load, but this arrangement breaks if multiple shredders try to send the same SQS message (see the sketch after this list).
  • It cannot do cross-batch deduplication of events.
  • We just have not battle-tested it in a high-throughput pipeline yet.
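
To make that first point concrete: the shredder’s message tells the loader which folder is complete and ready to load. The shape below is simplified and illustrative (the real payload is a self-describing JSON with more metadata); with two shredders running, both would announce the same folder and the loader would try to load it twice:

  {
    "schema": "iglu:com.snowplowanalytics.snowplow.storage.rdbloader/shredding_complete/jsonschema/1-0-0",
    "data": {
      # the shredded folder the loader should COPY into Redshift
      "base": "s3://acme-pipeline/shredded/run=2021-06-01-10-30-00/"
      # plus metadata: shredded types, timestamps, processor version, ...
    }
  }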

The streaming shredder will certainly be a core part of Snowplow architecture in the future, just not yet.

@BenB @istreeter Small update just to share that I’ve been running the Stream Shredder in production for 1 month now with no issues so far. I’m running a single instance shredding 3 million events/day on average (peak 6 million).


Hi @guillaume - thanks for the update, that’s great to hear!

We have many ideas for where we can go with the streaming shredder in future: enabling it to scale beyond just one instance, writing out to different file formats, and porting it so it can run on different clouds, not just AWS. They’re all just ideas at the moment, so it’s good to hear you’re finding it valuable in its first incarnation.


Hi @guillaume / @BenB / @istreeter - can anyone redirect me to the correct stream-shredder documentation? I am new to Snowplow and a bit lost in the documentation.
Thanks

Hi @Dhruvi,

You can find the docs here. Note that the RDB shredder has been renamed to the RDB transformer. For the moment, you may still see references to “shredder” in some docs, but we’re working on updating this.

Hello @lmath - Thank you for the link. In the documentation I see transformation support for Kinesis; is there any plan for a Kafka transformer?

Thanks

Hi @Dhruvi - a Kafka transformer isn’t on the immediate horizon for us, though we may still consider it in the future.

Hello @lmath - I see support for a Kafka transformer provided here. Is it specific to Azure only, or can I use it with any self-hosted Kafka?

I have a self-hosted Kafka on AWS and would like my output to be dumped into an S3 bucket.

Thanks,
Dhruvi

Hi @Dhruvi,

Yes, you can absolutely use it on AWS with self-hosted Kafka.

Hi @stanch - Thank you for the quick reply.
I have a few more questions about setting up rdb_loader 5.7.1 (transformer + loader).
As I understand it, the transformer takes input from a Kafka topic of enriched data and dumps it to an S3 bucket; after that, it puts a message on a Kafka topic for the loader. But I don’t see any configuration in the Redshift loader config here to read messages from a Kafka topic.

Thank you,
Dhruvi

Oh, good point. We currently only support this for the Snowflake loader :frowning:

Bummer :frowning_face:
@stanch - what about the other way around? Is it possible to configure the transformer to write to an SQS queue?

-Dhruvi

Not at the moment: the Kafka Transformer asset only reads from and writes to Kafka. For Redshift we use Kinesis on AWS, even though I understand that’s not what you want.

Hi @Dhruvi, actually it is probably possible to get the Redshift loader consuming from a Kafka topic. To be completely honest, the reason it’s not documented is that we have never tested that configuration. But you are welcome to give it a try; there’s no reason it shouldn’t work.

In your config file for the Redshift loader, try adding this block:

  "messageQueue": {
    "type": "kafka"
    "bootstrapServers": "your-kafka-server:9092"
    "topicName": "your-kafka-topic-name"
   },

Aside from that change to messageQueue, just follow the regular instructions for the Redshift loader on AWS.
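
For example, launching it might then look something like this; the image name/tag and file names here are my best guess and placeholders, so adjust them to your setup:

  docker run \
    snowplow/rdb-loader-redshift:5.7.1 \
    --iglu-config $(base64 -w 0 resolver.json) \
    --config $(base64 -w 0 config.hocon)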