After upgrading the ES loaders from version 0.10.1 to the latest 2.0.5/6, we are seeing the following happen intermittently: the loaders’ KCL apps become idle on random shards for indeterminate periods of time.
That effectively leads to spikes in the GetRecords.IteratorAgeMillis metric on the source Kinesis stream, and also results in incomplete data in Elasticsearch: some events are simply missing there, because their respective Kinesis shards have not been read yet.
Restarting a loader alleviates the problem for a while, but it later recurs. The loader’s error log shows no errors.
Please see the attached screenshots, which illustrate the problem. The upper chart shows incoming bytes, per shard, into the source Kinesis stream. The bottom chart shows the records processed per shard by the ES loader’s KCL app.
In the first screenshot, before the spikes on the bottom chart, we can see incoming bytes evenly dispersed across all the shards.
In the second screenshot, before the spikes on the bottom chart, we can see there are no records-processed metrics for some of the shards.
In the third screenshot, right after the spike on the bottom chart, we see the missing metrics have reappeared.
Hi @Arthur_Khakimov ,
Welcome to the Snowplow community!
Do you see any errors in the logs before the problem happens?
Hi @BenB ,
Thank you! We didn’t see any warnings or errors at the INFO logging level. The issue still persists. We reached out to AWS support and provided them with the DEBUG-level logs; they escalated to their engineering team for investigation.
The problem you describe seems very similar to this one. We were seeing that problem with oversized events. May I ask if you have an estimate of the average size of the items sent to the Elasticsearch Loader?
Another thing we can recommend at the moment is using version 1.0.7. We believe this problem started appearing with 2.0.0. If you switch to 1.0.7 and hit this problem again, could you let us know, please?
Hi @enes_aldemir ,
Depending on the event type, sizes range from several KB to a couple of dozen KB per event, no bigger.
Thank you for the suggestion regarding 1.0.7. (Sure, we will let you know if we try it.)
AWS engineers investigated the logs and found occasions where our ES cluster sporadically responds with Forbidden to a write request. Soon after that, we begin to see log lines like:
2022-07-29 09:48:11:938 +0000 [main] DEBUG com.amazonaws.services.kinesis.clientlibrary.lib.worker.ShardConsumer - Previous PROCESS task still pending for shard shardId-00000000000X since MILLISECONDS ms ago. Not submitting new task.
where the MILLISECONDS value keeps growing.
Other shards can keep working normally, i.e. their records continue to be loaded into the cluster, but shard X stays stuck.
KCL does not publish metrics when there is no activity, and it waits for the customer’s record processor to return before it attempts to retrieve more records from Kinesis. In this case the record processor hit authorization failures from OpenSearch and could not complete its work.
To prevent KCL from getting blocked, make sure the RecordProcessor does not get stuck when it encounters an authorization failure from the ES cluster.
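The loader itself runs on the JVM, but the advice above can be sketched in a few lines of stdlib-only Python. All names here (`index_with_bounded_retry`, the `send` callback, the use of `PermissionError` as a stand-in for an HTTP 403 Forbidden response) are hypothetical, not part of the actual loader or KCL API; the point is simply that the write path retries a bounded number of times with backoff and then gives up, so the `processRecords` call can return and the shard consumer is never blocked indefinitely:

```python
import time


def index_with_bounded_retry(send, records, max_attempts=5, base_delay=0.1,
                             sleep=time.sleep):
    """Try to index a batch of records via `send`.

    On repeated failure (e.g. a Forbidden response from the cluster,
    modelled here as PermissionError), stop after `max_attempts` instead
    of retrying forever. Returns the records that could not be indexed,
    so the caller can route them to a dead-letter destination and return
    control to the shard consumer.
    """
    for attempt in range(max_attempts):
        try:
            send(records)
            return []  # success: nothing left to retry
        except PermissionError:
            # Back off exponentially between attempts, but never
            # sleep after the final attempt.
            if attempt < max_attempts - 1:
                sleep(base_delay * (2 ** attempt))
    # Persistent failure: hand the whole batch back instead of looping.
    return list(records)
```

With a design like this, a sporadic Forbidden costs a few backoff sleeps, while a persistent one surfaces the batch as failed output rather than leaving the PROCESS task pending forever, which is exactly the symptom in the `ShardConsumer` log line above.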
I understand that the cluster’s Forbidden response is an issue in itself on our end. Please let me know if you need more info.