Elasticsearch Sink falling behind

So we have been collecting with the Stream Collector without issues and feeding into Elasticsearch and S3. We recently added some new event types that have increased the number of events we are processing, and all of a sudden the ES sink is falling behind.

We have increased the number of Kinesis shards to 4 to handle the traffic and have 4 ES sink workers running, but they just cannot keep up, whereas the S3 sink has no issues.

I have tried to increase the number of events we send in each request to ES, but I just get a timeout when trying to send any more than around 800 per batch.

The error I get when trying to increase the buffer sizes on the ES sink is:

Any suggestions?

Cheers,
Dean

Hi @lookaflyingdonkey - we have found that at scale certain node types in an Amazon Elasticsearch cluster do not work very well. To resolve these kinds of issues we have had to bump the ES nodes up to at least an m3/m4.xlarge.

You can then send more events more often without incurring lag.

What is your current Elasticsearch cluster setup?

Hi @josh,

Thanks for the heads up. I am assuming this is because of the AWS ES limits described at https://docs.aws.amazon.com/elasticsearch-service/latest/developerguide/aes-limits.html

I was on the large instance, which only allows 10 MB incoming requests, whereas xlarge and above allow 100 MB.

I will try scaling up and see if it helps. I did set the sinks to buffer a maximum of 8 MB before sending, but I guess it just can't keep up.
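
For what it's worth, this is the rough back-of-the-envelope check I am using to compare the sink buffer byte limit against the per-instance request cap (the headroom factor below is just my own assumption):

```python
# Back-of-the-envelope check, not a hard rule: does the sink's buffer byte
# limit leave some headroom under the Amazon ES HTTP request-size cap?
# Caps are from the AWS limits page above; the headroom factor is my own guess.
HTTP_CAP_MB = {"large": 10, "xlarge_and_above": 100}

def fits_under_cap(buffer_limit_mb: float, instance_class: str, headroom: float = 0.75) -> bool:
    """True if the buffer limit stays at or under 75% of the request cap."""
    return buffer_limit_mb <= HTTP_CAP_MB[instance_class] * headroom

print(fits_under_cap(8, "large"))             # False - 8 MB is very close to the 10 MB cap
print(fits_under_cap(8, "xlarge_and_above"))  # True - plenty of room under 100 MB
```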

Do you know if the master nodes need to be xlarge or just the data nodes?

Cheers,
Dean

From the language they use on that page I would say just the data nodes, as they say you can use T2 masters for up to 10 instances, and they reference “instance” when talking about the HTTP limits.

Will report back!

Hey @lookaflyingdonkey,

I am not sure of the exact request routing that amazon does behind the scenes. I would imagine that they have a load balancer hooked directly to the data nodes to ensure that the masters do not have too much work to do!

Looking forward to the results!

Cheers,
Josh

Hey @josh,

So scaling up to an xlarge, which allows 100 MB requests vs 10 MB, is now letting us send ~15k events per batch vs the ~800 we were getting before.

There still seems to be some strange behaviour, but I am not sure if it is by design or not.

We have 4 shards on our Kinesis stream, so 2 workers get 2 leases each.

They seem to have a little spurt of sending events and then sleep for a while - is this normal?

You can see there that 2-3 batches of 15k fire off and then nothing for a while. Is this just the buffer, or is something else happening, like waiting for the ES cluster to settle?

Cheers,
Dean

Hey @lookaflyingdonkey - if there are no more records to process at that time then it will sleep. Another issue could be with the provisioned throughput of the application's KCL table - if your app is checkpointing very frequently then that can be a cause of throttling.

You are also limited somewhat in how often you can get records from the stream - so if you are attempting to fetch more aggressively than the stream can handle, the worker will back off as well.
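
If you want to confirm whether the workers are actually being throttled on reads, a quick CloudWatch query along these lines would show it (the stream name and region are placeholders - substitute your own):

```python
# Hedged sketch: check whether the consumers are being throttled on reads.
# A non-zero ReadProvisionedThroughputExceeded sum means the workers asked for
# more than the shards allow during that period and the KCL will back off.
from datetime import datetime, timedelta

import boto3

STREAM_NAME = "enriched-good"   # placeholder - your enriched stream name
REGION = "us-east-1"            # placeholder - your region

cloudwatch = boto3.client("cloudwatch", region_name=REGION)

resp = cloudwatch.get_metric_statistics(
    Namespace="AWS/Kinesis",
    MetricName="ReadProvisionedThroughputExceeded",
    Dimensions=[{"Name": "StreamName", "Value": STREAM_NAME}],
    StartTime=datetime.utcnow() - timedelta(hours=1),
    EndTime=datetime.utcnow(),
    Period=300,
    Statistics=["Sum"],
)

for point in sorted(resp["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], point["Sum"])
```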

Is the application keeping up with the processing now?

Hey @josh,

So I had reset the cursor so that it had a whole 24 hours of data to push to ES, so I don't think it was running out of records.

The DynamoDB table is well below its limits.

So I flushed ES, reset the cursors, and tried to reprocess the last 24 hours of data, and it looks like this:

That looks very strange - we wouldn't have that sudden drop-off, and we have approximately 40 million events a day, so it seems the last 24 hours has been much lower. Either something is happening at the collector/enricher or the sink is still broken. I am running the last 24 hours through the EMR process now and will count the events there to see what's happening.

Nothing is going into the ES bad stream either, so it is not a bad format issue.

Cheers,
Dean

So it happened again - it just stopped reporting.

Data is still coming through the batch pipeline fine.

I see the following logs from the worker:

Even after resetting, it comes back online for a short period but then just stalls.

Hey @lookaflyingdonkey,

What do the Kinesis Stream metrics look like for the Enriched stream that these consumers are reading from?

The failure cases we see come up with consumers are generally:

  1. Exceeding the egress limits of the stream, i.e. consuming faster than the stream allows
  2. Too many workers on the stream aggressively attempting to get records from an under-provisioned stream
  3. DynamoDB table limits being exceeded; while you have not hit the limits yet, we often see the write units exceeding the default of 10 at these kinds of loads (see the sketch just below this list)
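
For point 3, a rough check along these lines (the table name is a placeholder - the KCL names the table after your application) will show whether the lease table's consumed write capacity is creeping up on what is provisioned:

```python
# Hedged sketch: compare the KCL lease table's provisioned write units with
# what it is actually consuming, since frequent checkpointing can push past
# the default of 10 write units.
from datetime import datetime, timedelta

import boto3

TABLE_NAME = "snowplow-elasticsearch-sink"  # placeholder - your KCL app name
REGION = "us-east-1"                        # placeholder - your region

dynamodb = boto3.client("dynamodb", region_name=REGION)
cloudwatch = boto3.client("cloudwatch", region_name=REGION)

provisioned = dynamodb.describe_table(TableName=TABLE_NAME)[
    "Table"]["ProvisionedThroughput"]["WriteCapacityUnits"]

consumed = cloudwatch.get_metric_statistics(
    Namespace="AWS/DynamoDB",
    MetricName="ConsumedWriteCapacityUnits",
    Dimensions=[{"Name": "TableName", "Value": TABLE_NAME}],
    StartTime=datetime.utcnow() - timedelta(hours=1),
    EndTime=datetime.utcnow(),
    Period=300,
    Statistics=["Sum"],
)

for point in sorted(consumed["Datapoints"], key=lambda p: p["Timestamp"]):
    # The metric is a sum over the period, so divide by the period length
    # to compare against the per-second provisioned figure.
    print(point["Timestamp"], round(point["Sum"] / 300.0, 2), "of", provisioned, "WCU/s")
```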

A few other things to check:

How is your Elasticsearch cluster configured? If you are using the TTL capabilities at very high load, this can have some quite serious performance implications, as by default every record is checked every 60 seconds to see whether it has exceeded its TTL.
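
If you want to confirm whether TTLs are actually enabled, a quick look at the index mapping will show it - something like this (the endpoint and index name are placeholders, and it assumes the 1.x/2.x-style `_ttl` meta-field that Amazon ES exposes):

```python
# Hedged sketch: look at the index mapping to see whether the _ttl meta-field
# is enabled (assumes the 1.x/2.x-style mapping that Amazon ES exposes).
import requests

ES_ENDPOINT = "https://search-mydomain.us-east-1.es.amazonaws.com"  # placeholder
INDEX = "snowplow"                                                  # placeholder

mapping = requests.get(f"{ES_ENDPOINT}/{INDEX}/_mapping").json()

for index_name, body in mapping.items():
    for doc_type, type_mapping in body.get("mappings", {}).items():
        ttl = type_mapping.get("_ttl", {})
        print(index_name, doc_type,
              "ttl enabled:", ttl.get("enabled", False),
              "default:", ttl.get("default"))
```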

Are you running the sinks in a public-facing subnet or in a private subnet proxying through NAT instances? In the latter case a bottleneck can be the throughput capability of those instances if they are not provisioned with sufficient capacity.

The other thing to try would be to slightly lower your buffer limits. From before it looked like you were sending somewhere around 15k events at a time, which is higher than the maximum you can get from Kinesis in a single request (10k). This might cause the workers to fetch records very aggressively to fill that buffer and then back off. I would recommend lowering the event buffer limit to 10k or less.
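
As a very rough illustration of the numbers (examples only, nothing from your actual config):

```python
# Illustrative numbers only: keep the sink's record buffer at or below what a
# single Kinesis GetRecords call can return (10,000 records), so a worker does
# not fetch aggressively just to fill an oversized buffer.
KINESIS_GET_RECORDS_MAX = 10_000

def suggested_record_limit(configured_limit: int) -> int:
    """Clamp the configured buffer record limit to the GetRecords ceiling."""
    return min(configured_limit, KINESIS_GET_RECORDS_MAX)

print(suggested_record_limit(15_000))  # 10000 - the ~15k setting gets pulled down
print(suggested_record_limit(8_000))   # 8000 - already under the ceiling
```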


Sorry that there is no specific answer! There are a lot of edge cases in realtime processing that can cause issues.

If I think of anything else to try I will post it here.

Cheers,

Josh

Hey @josh,

Thanks for the suggestions, some more data below

The enriched stream looks to be filling OK. You can see in the metrics below that the iterator age has grown significantly, which I would expect if nothing is being read from the stream.

I only have 2 instances running against that stream; they each have 2 workers I believe, so a total of 4 workers for the 4 shards on that stream.

Everything looks in order with the DynamoDB setup too

We are not approaching the read or write limits at all.

I do have the TTL set to 7 days on ES, so that may be it. However, I just reset the DynamoDB table for the sink, set the initial position to LATEST, and it has started syncing events again without issue, in blocks of ~15k.

So this is a strange one - it seems to keep up for around 3 days and then just stops.

Cheers,
Dean

Another place to check is the Elasticsearch cluster itself. What does the monitoring for this look like? Is it possible that it is not large enough to handle the ingress rates?

@josh you may be onto something there. I had just scaled that up the other day, but you can see high CPU and JVM memory usage while the workers are feeding in.

The cluster config is as follows:

with 512 GB EBS volumes on each node.

Hey @lookaflyingdonkey - another thing to watch out for is that the m4.large instance is still limited to 10 MB requests. I would recommend bumping up to m4.xlarge instances to get the 100 MB request allowance.

The other thing to be aware of is your index shard count and how those shards are distributed across your nodes. If, for example, you have 4 shards and 1 replica of the index for HA (8 shards to balance), you may not be able to balance them evenly across the cluster, causing some nodes to fill up even if you have space available in the cluster overall. If this happens you will not be able to load data either, which could well be the issue here.

What does the minimum available space on the dashboard screen say?
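
If it helps, a couple of quick calls against the cluster's `_cat` APIs will show how the shards and disk usage are spread across the nodes (the endpoint below is a placeholder):

```python
# Hedged sketch: eyeball how shards and disk usage are spread across the data
# nodes; with 4 primaries + 1 replica (8 shards) an uneven spread will show up
# here as some nodes holding more shards and less free disk than others.
import requests

ES_ENDPOINT = "https://search-mydomain.us-east-1.es.amazonaws.com"  # placeholder

# Disk used / available per node.
print(requests.get(f"{ES_ENDPOINT}/_cat/allocation?v").text)

# Which shards of which index live on which node.
print(requests.get(f"{ES_ENDPOINT}/_cat/shards?v").text)
```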