So we have been collecting with the stream collector without issues and feeding into Elasticsearch and S3. We recently added some more event types, which has increased the number of events we are processing, and all of a sudden the ES sink is falling behind.
We have increased the number of Kinesis shards to 4 to handle the traffic and have 4 ES sink workers running, but they just cannot keep up, whereas the S3 sink has no issues.
I have tried to increase the number of events we send in each request to ES, but I get a timeout when trying to send more than about 800 per batch.
Hi @lookaflyingdonkey - we have found that, at scale, certain node types in the Amazon Elasticsearch cluster do not work very well. To resolve these kinds of issues we have had to bump the ES nodes to at least an m3/m4.xlarge.
You can then send more events more often without incurring lag.
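On the batching side, since requests above roughly 800 events were timing out, one interim option is to cap the bulk batch size explicitly. A minimal sketch in Python (illustrative only - the sink itself is not written in Python, and `chunk` is a hypothetical helper):

```python
def chunk(events, max_batch=800):
    """Split events into batches no larger than max_batch, so each
    Elasticsearch bulk request stays below the size that times out."""
    return [events[i:i + max_batch] for i in range(0, len(events), max_batch)]

# 2000 events become batches of 800, 800 and 400:
batches = chunk(list(range(2000)))
```

The same effect is normally achieved by tuning the sink's buffer record limit rather than chunking by hand.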
From the language they use on that page I would say the data nodes: they say you can use T2 masters for clusters of up to 10 instances, and then they reference “instances” when talking about the HTTP request limits.
I am not sure of the exact request routing that Amazon does behind the scenes. I would imagine that they have a load balancer hooked directly to the data nodes to ensure that the masters do not have too much work to do!
You can see there that 2-3 batches of 15k fire off and then nothing for a while. Is this just the buffer, or is something else happening - like waiting for the ES cluster to settle?
Hey @lookaflyingdonkey - if there are no more records to process at that time then it will sleep. Another issue could be the provisioned throughput of the application's KCL DynamoDB table - if your app is checkpointing very frequently, that can cause throttling.
You are also somewhat limited in how often you can get records from the stream - so if you are attempting to fetch more aggressively than the stream can handle, the worker will back off as well.
Is the application keeping up with the processing now?
Which looks very strange - we shouldn't have that sudden drop-off, and we have approx 40 million events a day, so the last 24 hours has been much lower than expected. Either something is happening at the collector/enricher or the sink is still broken. I am just running the last 24 hours through the EMR process and will count the events there to see what's happening.
Nothing is going into the ES bad stream either, so it is not a bad-format issue.
What do the Kinesis Stream metrics look like for the Enriched stream that these consumers are reading from?
The failure cases we see come up with consumers are generally:

- Exceeding the egress limits of the stream; i.e. you are consuming faster than the stream allows
- Too many workers on the stream; aggressively attempting to get records from an under-provisioned stream
- DynamoDB table limits being exceeded; while you have not hit the limits yet, we often see the write units exceeding the default of 10 at these kinds of loads
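To illustrate the DynamoDB point with a back-of-envelope sketch (the worker and checkpoint numbers here are hypothetical, and the ~1 write unit per checkpoint is an approximation):

```python
def checkpoint_write_load(workers, checkpoints_per_worker_per_sec):
    """Approximate DynamoDB write units per second consumed by KCL
    checkpointing, assuming roughly one write unit per checkpoint."""
    return workers * checkpoints_per_worker_per_sec

# 4 workers each checkpointing 5 times a second already exceeds
# the default 10 provisioned write units:
load = checkpoint_write_load(4, 5)
```

If the estimate lands above the table's provisioned write capacity, checkpoint writes get throttled and the workers stall.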
A few other things to check:
How is your Elasticsearch cluster configured? If you are using the TTL capabilities at very high load, this can have quite serious performance implications: by default, every record is checked every 60 seconds to see whether it has exceeded its TTL.
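In the Elasticsearch versions current at the time, that purge frequency was governed by the `indices.ttl.interval` setting (default 60s). A hedged example of relaxing it in `elasticsearch.yml` - the 300s value is purely illustrative:

```yaml
# elasticsearch.yml - illustrative: scan for expired TTL documents
# every 5 minutes instead of the default 60 seconds
indices.ttl.interval: 300s
```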
Are you running the sinks in a public-facing subnet, or in a private subnet proxying through NAT instances? In the latter case, a bottleneck can be the throughput of those instances if they are not provisioned with sufficient capacity.
The other thing to try would be to slightly lower your buffer limits. From before, it looked like you were sending somewhere around 15k events at a time, which is much higher than the maximum you can get from Kinesis in a single call (10k). This might cause workers to fetch records very aggressively to fill that buffer and trigger backoff issues. I would recommend lowering the event buffer limit to 10k or less.
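To make the arithmetic concrete, a small sketch (illustrative Python; `fetches_to_fill` is a hypothetical helper, and 10k is the per-call record cap on Kinesis GetRecords):

```python
KINESIS_GET_RECORDS_MAX = 10_000  # max records returned by one GetRecords call

def fetches_to_fill(buffer_limit, per_call=KINESIS_GET_RECORDS_MAX):
    """Minimum number of GetRecords calls needed to fill the sink buffer."""
    return -(-buffer_limit // per_call)  # ceiling division

# A 15k buffer always needs at least two calls; a 10k (or lower)
# buffer can fill in a single call:
fetches_to_fill(15_000)
fetches_to_fill(10_000)
```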
Sorry that there is no specific answer! There are a lot of edge cases in realtime processing that can cause issues.
If I think of anything else to try I will post it here.
The enriched stream looks to be filling OK. You can see in the metrics below that the iterator age has grown significantly, which I would expect if nothing is being read from the stream.
I do have the TTL set to 7 days on ES, so that may be it. Although I just reset the DynamoDB table for the sink and set the initial position to LATEST, and it has started syncing events again without issue, in blocks of ~15k.
So this is a strange one, it seems to keep up for around 3 days and then just stops.
Another place to check is the Elasticsearch cluster itself. What does the monitoring for this look like? Is it possible that it is not large enough to handle the ingress rates?
@josh You may be onto something there. I had just scaled that up the other day, but you can see high CPU and JVM memory usage while the workers are feeding in.
Hey @lookaflyingdonkey - another thing to watch out for is that the m4.large instance is still limited to 10 MB requests. I would recommend bumping up to m4.xlarge instances to get the 100 MB request allowance.
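As a rough illustration of what that limit means for batch sizing (the ~1 KB average event size is an assumption for the sketch, not a measured figure):

```python
def max_events_per_request(payload_limit_mb, avg_event_kb):
    """Rough cap on events per bulk request under an HTTP payload limit."""
    return (payload_limit_mb * 1024) // avg_event_kb

# At ~1 KB per enriched event, a 10 MB limit caps a request near 10k
# events, while 100 MB allows around 100k:
max_events_per_request(10, 1)
max_events_per_request(100, 1)
```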
The other thing to be aware of is your index shard count and how those shards are distributed across your nodes. If, for example, you have 4 shards and 1 replica of the index for HA (8 shards to balance), you will not be able to balance evenly across the cluster, causing some nodes to fill up - even if you have space available in the cluster overall. If this happens you will not be able to load data either, which could well be the issue here.
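That balancing constraint can be sketched as simple arithmetic (illustrative Python; the node counts are hypothetical):

```python
def total_shards(primaries, replicas):
    """Total shard copies Elasticsearch must place across the data nodes."""
    return primaries * (1 + replicas)

def balances_evenly(primaries, replicas, data_nodes):
    """True only when the shard copies divide cleanly across the nodes."""
    return total_shards(primaries, replicas) % data_nodes == 0

# 4 primaries with 1 replica is 8 shards: even on 4 nodes, skewed on 3.
balances_evenly(4, 1, 4)  # True
balances_evenly(4, 1, 3)  # False
```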
What does the minimum available space on the dashboard screen say?