I am pretty new to Snowplow and was working on migrating our S3 Loader to Fargate. I understand the loader uses DynamoDB to keep track of it’s position(for writing to S3) but was wondering if it’s even possible to run multiple instances of the S3 Loader. I tried spinning up several instances of the loader but I think each instance was using the same DynamoDB table which was causing issues. Is it possible to run multiple S3 Loaders and if so are there any suggestions on how to do it correctly?
Hi Michael, did you use a different kinesis.appName in the config for each loader?
I hadn’t yet but a fellow developer had mentioned that. That sounds like my next step. Quick dumb question. If each S3 Loader is writing to the same S3 bucket and the strategy is TRIM_HORIZON and they are using different DDB tables, how does each instance of the S3 Loader avoid writing out the same records? Thanks again for the response, Jeroen.
Aha, multiple loaders to the same target. In that case you indeed need to use the same application name. I’m using multiple enrichers for example and they share the same appName. They also subscribe to Kinesis so the process is probably the same.
I haven’t tried with multiple S3 loaders but I guess it should also work, same as for the enrichers.
Another thing that might block you from doing this is when you do not have enough shards. Each loader will claim a shard.
Hope this helps.
I just found out another dev was using the same dynamodb table with another S3 loader, so that was my issue.
Just to add some clarity into how the Kinesis applications work on the Snowplow pipeline here. Each app (S3 Loader, ES Loaders, Stream Enrich) leverages the KCL (Kinesis Client Library) under the hood.
Each app is capable of scaling horizontally as long as they are pointing to the same Kinesis Stream and use the same DynamoDB Lease table - this lease table contains all the shards in the stream that the application is consuming and ensures that work is spread evenly amongst the available workers. This table name in turn is controlled by the “app name” you define in the configuration files.
To sum up:
If you want to have more compute available to process 1 Kinesis Stream into 1 Destination - use the same app name and it will scale out horizontally.
If you instead want to have that same stream loaded to different destinations you need to use a different app name so that it has a different lease table and different checkpointing on the stream.
Hope this helps!
That’s very good information. Thanks, josh.
Are there any documents on scaling this stack out(using kinesis)? We are currently working on several strategies around CPU utilization and traffic but it’s still a work in progress.
There is scattered documentation but nothing very concrete. Internally we scale the consumers based on CPU usage and have found this to be the most reliable metric to scale off of. Adding step-scaling rules in place to handle spikes effectively can help also.
For Kinesis you have really just two options:
- Develop an auto-scaling solution for it - there is nothing out of the box here so does require a bit of investment OR;
- Monitor and set CloudWatch Alarms on Byte GET / PUT rates to the stream and scale up / down as needed by hand
Depending on your traffic variance you might get away with just a static shard allocation for your stream without needing to do anything else.
One thing to keep in mind is that multiple servers cannot process the same shard so your ability to process data is dependant on having enough shards in Kinesis. Essentially having more servers than shards will not yield better performance as you cannot split the work.