Is the stream shredder still experimental?

shimpeko · March 10, 2022, 9:23am

Hi,

I’m trying to run the shredder in our environment, and I have a question about the stream shredder.

Is the stream shredder still experimental? Or is it suitable for use in a production environment?

I’m assuming it is experimental, as the README (snowplow-rdb-loader/README.md at master · snowplow/snowplow-rdb-loader · GitHub) says it is experimental and there is no documentation for the stream shredder.

Thanks,
shimpeko

dilyan · March 17, 2022, 9:50am

Hi @shimpeko,

Apologies for the long silence on this one.

We’ve branded the stream shredder ‘experimental’, which was probably not the best use of the word. What we meant is that it has some limitations compared with the batch shredder. You can certainly use it in a production environment if you accept those limitations.

The first of these is that we’ve tested the stream shredder only in single-node deployments. In a distributed architecture, we expect there might be some race conditions between the different KCL workers. Running on a single node means that performance is capped by the resources of the machine you run it on. That is appropriate for low-volume pipelines, where the overhead of using Spark on EMR is not justified.

Secondly, there is no deduplication in the stream shredder. If duplicates are not a concern, or you can deal with them after the data has been loaded into the data warehouse, then this point is irrelevant.

We are currently working on the documentation for the stream shredder, but in the meantime, here’s what you need to know:

You can get the jar file from the GitHub release page or an image from Docker Hub under snowplow/snowplow-rdb-stream-shredder:2.2.0.
It takes the same config.hocon and iglu_resolver.json config files as the batch shredder. The only difference in the HOCON file is that the source is no longer an S3 bucket but a Kinesis stream. You can find the reference config file here.

You don’t need Dataflow Runner or EMR to run it. It can be as simple as:

$ docker run snowplow/snowplow-rdb-stream-shredder:2.2.0 \
--iglu-config 'base64-resolver' \
--config 'base64-config'
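Since both arguments are base64-encoded file contents, the encoding step can be sketched as below. This is a minimal sketch, not an official script: the helper names `encode_file` and `run_stream_shredder` are hypothetical, and the file names `config.hocon` and `iglu_resolver.json` are assumed to sit in the current directory. Any credentials or environment variables the container may need are left out.

```shell
#!/usr/bin/env bash

# Hypothetical helper: base64-encode a file onto a single line.
# Piping through tr strips the line wrapping that some base64 builds add.
encode_file() {
  base64 < "$1" | tr -d '\n'
}

# Hypothetical wrapper around the docker command shown above.
# $1 = path to config.hocon, $2 = path to iglu_resolver.json
run_stream_shredder() {
  docker run snowplow/snowplow-rdb-stream-shredder:2.2.0 \
    --iglu-config "$(encode_file "$2")" \
    --config "$(encode_file "$1")"
}

# Example call (commented out so that sourcing this file starts nothing):
# run_stream_shredder config.hocon iglu_resolver.json
```

Stripping newlines keeps each encoded value as a single argument, whichever base64 implementation is installed.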

Do let us know if you have any feedback once you give it a try.

shimpeko · April 14, 2022, 9:12pm

Hi @dilyan

I much appreciate your response. I understand the limitations and will discuss with my team whether we’d like to adopt it.

Thanks,
Shimpeko
