Set up and run RDB Stream Shredder

Hello!

I’m trying to set up a new test pipeline based on the new RDB Stream Shredder, following this page: 1.0.0 Upgrade Guide - Snowplow Docs.

But it’s not very clear whether I need to run both RDB Stream Shredder and RDB Loader (I guess yes) or just RDB Stream Shredder. And should RDB Stream Shredder and RDB Loader share the same HOCON file (since it contains the config for both shredding and loading) and be started with the same args? i.e.

docker run \
  snowplow/snowplow-rdb-stream-shredder- \
  --iglu-config ewogICJzY2hlbWEiOiAiaWdsdTpjb20uc25vd3Bsb3dhbmFseXRp .... \
  --config ewogICJuYW1lIjogIkFjbWUgUmVkc2hpZnQiLAog ....

docker run \
  snowplow/snowplow-rdb-loader- \
  --iglu-config ewogICJzY2hlbWEiOiAiaWdsdTpjb20uc25vd3Bsb3dhbmFseXRp .... \
  --config ewogICJuYW1lIjogIkFjbWUgUmVkc2hpZnQiLAog ....

(found it here: Run the RDB loader - Snowplow Docs)
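
For context, both apps take the Iglu resolver and the pipeline config as base64-encoded strings. This is how I’m producing the blobs above, assuming the resolver JSON and HOCON config live in resolver.json and config.hocon (my own file names):

  # Base64-encode the resolver and config (GNU coreutils; -w 0 disables
  # line wrapping so each file becomes a single blob)
  base64 -w 0 resolver.json   # value for --iglu-config
  base64 -w 0 config.hocon    # value for --config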

Also, any recommendation for the type of EC2 instance? Is a t3a.micro enough (2 vCPU, 1 GB RAM)?

Hi @guillaume ,

Yes, the shredder and the loader are two distinct apps, and you need to run them both.

They use the same config file.

The loader needs very few resources and a t3a.micro is enough. For the shredder, the size of the EMR cluster depends on the amount of data that you have. You can find more details about how to run it here.
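
To illustrate the shared file: a rough sketch of its shape, with field names recalled from the 1.0.0 docs and all values as placeholders; check the upgrade guide for the authoritative schema:

  {
    # Both apps read this one file: the shredder uses the "shredder" section,
    # the loader uses "storage", and they coordinate via the SQS message queue
    "name": "Acme Redshift",
    "region": "us-east-1",
    "messageQueue": "acme-rdb-loader.fifo",
    "shredder": {
      # input (enriched events) and output (shredded folder) settings go here
    },
    "storage": {
      "type": "redshift"
      # host, database, port, roleArn, schema, credentials, ...
    }
  }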

Isn’t the Stream Shredder supposed not to use EMR, though? From the announcement:

Unlike the existing Spark Shredder, the Stream Shredder reads data directly from the enriched Kinesis stream and does not use Spark (nor EMR); it’s a plain JVM application, like Stream Enrich or S3 Loader.
Reading directly from Kinesis means that the Shredder can bypass the long and error-prone S3DistCp staging/archiving steps. Another benefit is that it no longer works with a bounded dataset and can “emit” shredded folders based only on a specified frequency.

Sorry, I missed the fact that you were using the new streaming shredder. In that case you indeed don’t need EMR. But please bear in mind that this component is not production-ready yet.

Just to add to @BenB’s comment about why it is not production-ready:

  • It does not scale horizontally: you cannot have more than one streaming shredder running at the same time. The shredder sends an SQS message to tell the loader when to load, but this arrangement breaks if multiple shredders try to send the same SQS message (see the sketch after this list).
  • It cannot do cross-batch deduplication of events.
  • We just have not battle-tested it in a high-throughput pipeline yet.
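
To make that first point concrete: the shredder’s message tells the loader which folder is complete and ready to load. The shape below is simplified and illustrative (the real payload is a self-describing JSON with more metadata); with two shredders running, both would announce the same folder and the loader would try to load it twice:

  {
    "schema": "iglu:com.snowplowanalytics.snowplow.storage.rdbloader/shredding_complete/jsonschema/1-0-0",
    "data": {
      # the shredded folder the loader should COPY into Redshift
      "base": "s3://acme-pipeline/shredded/run=2021-06-01-10-30-00/"
      # plus metadata: shredded types, timestamps, processor version, ...
    }
  }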

The streaming shredder will certainly be a core part of Snowplow architecture in the future, just not yet.

@BenB @istreeter Small update just to share that I’ve been running the Stream Shredder in production for 1 month now with no issues so far. I’m running a single instance shredding 3 million events/day on average (peak 6 million).


Hi @guillaume - thanks for the update, that’s great to hear!

We have many ideas for where we can go with the streaming shredder in future: enabling it to scale beyond just one instance, writing out to different file formats, and porting it so it can run on different clouds, not just AWS. They’re all just ideas at the moment, so it’s good to hear you’re finding it valuable in its first incarnation.


Hi @guillaume / @BenB / @istreeter - can anyone redirect me to the correct stream-shredder documentation? I am new to Snowplow and a bit lost in the documentation.
Thanks

Hi @Dhruvi,

You can find the docs here. Note that the RDB shredder has been renamed to the RDB transformer. For the moment, you may still see references to “shredder” in some docs, but we’re working on updating this.

Hello @lmath - Thank you for the link. In the documentation I see transformation support for Kinesis; is there any plan for a Kafka transformer?

Thanks

Hi @Dhruvi - a Kafka transformer isn’t on the immediate horizon for us, though we may still consider it in the future.

Hello @lmath - I see support for a Kafka transformer provided here. Is it specific to Azure only, or can I use it with any self-hosted Kafka?

I have a self-hosted Kafka on AWS and would like my output to be dumped into an S3 bucket.

Thanks,
Dhruvi

Hi @Dhruvi,

Yes, you can absolutely use it on AWS with self-hosted Kafka.

Hi @stanch - Thank you for the quick reply.
I have a few more questions about setting up rdb_loader 5.7.1 (transformer + loader).
As I understand it, the transformer takes input from a Kafka topic of enriched data and dumps it to an S3 bucket; after that, it puts a message on a Kafka topic for the loader. But I don’t see any configuration in the Redshift loader config here to read messages from a Kafka topic.

Thank you,
Dhruvi

Oh, good point. We currently only support this for the Snowflake loader :frowning:

Bummer :frowning_face:
@stanch - what about the other way around? Is it possible to configure the transformer to write to an SQS queue?

-Dhruvi

Not at the moment: the Kafka Transformer asset only reads from and writes to Kafka. For Redshift we use Kinesis on AWS, even though I understand that’s not what you want.

Hi @Dhruvi, actually it is probably possible to get the Redshift loader consuming from a Kafka topic. To be completely honest, the reason it’s not documented is that we have never tested that configuration. But you are welcome to give it a try; there’s no reason it shouldn’t work.

In your config file for the Redshift loader, try adding this block:

  "messageQueue": {
    "type": "kafka"
    "bootstrapServers": "your-kafka-server:9092"
    "topicName": "your-kafka-topic-name"
   },

Aside from that change to messageQueue, just follow the regular instructions for the Redshift loader on AWS.
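
For example, launching it might then look something like this; the image name/tag and file names here are my best guess and placeholders, so adjust them to your setup:

  docker run \
    snowplow/rdb-loader-redshift:5.7.1 \
    --iglu-config $(base64 -w 0 resolver.json) \
    --config $(base64 -w 0 config.hocon)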