Enricher high CPU utilisation issue

With 16 shards we can’t have 400k events per second. Is it 400k events per day?

To determine the optimal number of shards, you need to know the maximal throughput per second; then you can use the formula here.
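For illustration, the standard Kinesis sizing rule (each shard accepts up to 1 MiB/s or 1,000 records/s of writes, and serves up to 2 MiB/s of reads) can be sketched as follows; the per-shard limits are the documented Kinesis quotas, while the traffic figures in the example are hypothetical:

```scala
// Sketch of the standard Kinesis shard-sizing rule.
def requiredShards(recordsPerSec: Long, avgRecordBytes: Long, consumers: Int): Int = {
  val inMiBps        = recordsPerSec * avgRecordBytes / (1024.0 * 1024.0)
  val writeByBytes   = inMiBps                   // 1 MiB/s write limit per shard
  val writeByRecords = recordsPerSec / 1000.0    // 1,000 records/s write limit per shard
  val readByBytes    = inMiBps * consumers / 2.0 // 2 MiB/s read limit per shard
  math.ceil(List(writeByBytes, writeByRecords, readByBytes).max).toInt
}

// e.g. 5,000 records/s of ~1 KiB each with a single consumer:
// requiredShards(5000, 1024, 1) == 5
```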

Then what we usually do is set up an auto-scaling group that decides the number of enricher instances based on CPU usage.

@BenB Sorry for the late reply… Here is a correction to my input: we are getting around 3,000 to 5,000 records per second, as I saw in the Kinesis monitoring graph… Please find below the Get and Put Records graph from the Kinesis snowplow-good queue:

Also, we have increased the number of CPUs for the enricher… below is the configuration:

2 machines of 16 cores each and 5 instances of 2 CPUs each… After increasing the CPUs to 36, we are still getting over-utilisation of the CPUs (CPU getting utilised over 100%).

So, any suggestion on the number of CPUs required to balance the enricher? We have tried deploying this solution in K8s with auto-scaling: the enricher started properly, but it kept creating new instances every few minutes, which causes problems in the K8s cluster.

@BenB any suggestions on the above?

Hi @karan, sorry for the late reply.

Your k8s configuration might be correct. The problem comes from Stream Enrich itself, with the CPU going to 100% for no apparent reason, which leads to more instances being created.

We have just released Stream Enrich 1.3.0, which we hope will fix the issue thanks to a bump of the KCL library; feel free to try it out and see if the problem persists. If it does, we will use profiling to check what is using the CPU in Stream Enrich and identify the culprit.

Thanks @BenB for the reply… I will definitely try the new enricher and let you know the outcome…

@BenB I tried the new enricher; it looks much more stable and consumes less CPU compared to the previous enricher. But when I checked the logs, most of the records are getting dumped into the snowplow-enriched-bad queue, whereas with the previous enricher most of the records were getting dumped into snowplow-enriched-good.

Also, the major thing is processing speed: the previous enricher is able to process events very fast compared to the new enricher (with the previous enricher we are able to process 150K+ records, but with the 1.3.0 enricher it’s only around 80-90K, almost half).

New enricher logs (v1.3.0):

[RecordProcessor-0001] INFO com.snowplowanalytics.snowplow.enrich.stream.sources.KinesisSource - Initializing record processor for shard: shardId-00XXXXXXXX

[RecordProcessor-0000] INFO com.snowplowanalytics.snowplow.enrich.stream.sinks.KinesisSink - Successfully wrote 251 out of 251 records

[RecordProcessor-0000] INFO com.snowplowanalytics.snowplow.enrich.stream.sinks.KinesisSink - Writing 500 records to Kinesis stream snowplow-enriched-bad

[RecordProcessor-0000] INFO com.snowplowanalytics.snowplow.enrich.stream.sinks.KinesisSink - Successfully wrote 500 out of 500 records

[RecordProcessor-0000] INFO com.snowplowanalytics.snowplow.enrich.stream.sinks.KinesisSink - Writing 500 records to Kinesis stream snowplow-enriched-bad

[RecordProcessor-0001] INFO com.snowplowanalytics.snowplow.enrich.stream.sources.KinesisSource - Processing 601 records from shardId-000000XXXXXXX

[RecordProcessor-0000] INFO com.snowplowanalytics.snowplow.enrich.stream.sinks.KinesisSink - Successfully wrote 249 out of 500 records

[RecordProcessor-0000] ERROR com.snowplowanalytics.snowplow.enrich.stream.sinks.KinesisSink - 251 records failed with error code ProvisionedThroughputExceededException. Example error message: Rate exceeded for shard shardId-00000XXXXX in stream snowplow-enriched-bad under account.

[RecordProcessor-0000] ERROR com.snowplowanalytics.snowplow.enrich.stream.sinks.KinesisSink - Retrying all failed records in 8179 milliseconds...

[RecordProcessor-0001] INFO com.snowplowanalytics.snowplow.enrich.stream.sinks.KinesisSink - Writing 445 records to Kinesis stream snowplow-enriched-good

[RecordProcessor-0001] INFO com.snowplowanalytics.snowplow.enrich.stream.sinks.KinesisSink - Successfully wrote 445 out of 445 records

[RecordProcessor-0001] INFO com.snowplowanalytics.snowplow.enrich.stream.sinks.KinesisSink - Writing 500 records to Kinesis stream snowplow-enriched-bad

[RecordProcessor-0001] INFO com.snowplowanalytics.snowplow.enrich.stream.sinks.KinesisSink - Successfully wrote 500 out of 500 records

[RecordProcessor-0001] INFO com.snowplowanalytics.snowplow.enrich.stream.sinks.KinesisSink - Writing 500 records to Kinesis stream snowplow-enriched-bad

[RecordProcessor-0001] INFO com.snowplowanalytics.snowplow.enrich.stream.sinks.KinesisSink - Successfully wrote 249 out of 500 records

[RecordProcessor-0001] ERROR com.snowplowanalytics.snowplow.enrich.stream.sinks.KinesisSink - 251 records failed with error code ProvisionedThroughputExceededException. Example error message: Rate exceeded for shard shardId-00XXXXXXXX in stream snowplow-enriched-bad under account.

[RecordProcessor-0001] ERROR com.snowplowanalytics.snowplow.enrich.stream.sinks.KinesisSink - Retrying all failed records in 5919 milliseconds...

[RecordProcessor-0000] INFO com.snowplowanalytics.snowplow.enrich.stream.sinks.KinesisSink - Successfully wrote 251 out of 251 records

[RecordProcessor-0000] INFO com.snowplowanalytics.snowplow.enrich.stream.sinks.KinesisSink - Writing 76 records to Kinesis stream snowplow-enriched-bad

[RecordProcessor-0000] INFO com.snowplowanalytics.snowplow.enrich.stream.sinks.KinesisSink - Successfully wrote 76 out of 76 records

[RecordProcessor-0000] INFO com.snowplowanalytics.snowplow.enrich.stream.sources.KinesisSource - Checkpointing shard shardId-0000XXXXXXXX

[RecordProcessor-0000] INFO com.snowplowanalytics.snowplow.enrich.stream.sources.KinesisSource - Processing 397 records from shardId-0000XXXXXXX

Previous version logs:

[RecordProcessor-0002] INFO com.snowplowanalytics.snowplow.enrich.stream.sinks.KinesisSink - Writing 500 records to Kinesis stream snowplow-enriched-good

[RecordProcessor-0002] INFO com.snowplowanalytics.snowplow.enrich.stream.sinks.KinesisSink - Successfully wrote 500 out of 500 records

[RecordProcessor-0002] INFO com.snowplowanalytics.snowplow.enrich.stream.sinks.KinesisSink - Writing 500 records to Kinesis stream snowplow-enriched-good

[RecordProcessor-0003] INFO com.snowplowanalytics.snowplow.enrich.stream.sources.KinesisSource - Processing 46 records from shardId-00000

[RecordProcessor-0002] INFO com.snowplowanalytics.snowplow.enrich.stream.sinks.KinesisSink - Successfully wrote 500 out of 500 records

[RecordProcessor-0002] INFO com.snowplowanalytics.snowplow.enrich.stream.sinks.KinesisSink - Writing 43 records to Kinesis stream snowplow-enriched-good

[RecordProcessor-0005] INFO com.snowplowanalytics.snowplow.enrich.stream.sources.KinesisSource - Processing 49 records from shardId-0000

[RecordProcessor-0002] INFO com.snowplowanalytics.snowplow.enrich.stream.sinks.KinesisSink - Successfully wrote 43 out of 43 records

[RecordProcessor-0002] INFO com.snowplowanalytics.snowplow.enrich.stream.sources.KinesisSource - Checkpointing shard shardId-0000000

[RecordProcessor-0004] INFO com.snowplowanalytics.snowplow.enrich.stream.sources.KinesisSource - Processing 557 records from shardId-00000

[RecordProcessor-0002] INFO com.snowplowanalytics.snowplow.enrich.stream.sources.KinesisSource - Processing 24 records from shardId-00000

[RecordProcessor-0005] INFO com.snowplowanalytics.snowplow.enrich.stream.sources.KinesisSource - Processing 26 records from shardId-00000

[RecordProcessor-0005] INFO com.snowplowanalytics.snowplow.enrich.stream.sinks.KinesisSink - Writing 500 records to Kinesis stream snowplow-enriched-good

[RecordProcessor-0003] INFO com.snowplowanalytics.snowplow.enrich.stream.sources.KinesisSource - Processing 552 records from shardId-0000000

[RecordProcessor-0005] INFO com.snowplowanalytics.snowplow.enrich.stream.sinks.KinesisSink - Successfully wrote 500 out of 500 records

[RecordProcessor-0005] INFO com.snowplowanalytics.snowplow.enrich.stream.sinks.KinesisSink - Writing 66 records to Kinesis stream snowplow-enriched-good

[RecordProcessor-0005] INFO com.snowplowanalytics.snowplow.enrich.stream.sinks.KinesisSink - Successfully wrote 66 out of 66 records

[RecordProcessor-0005] INFO com.snowplowanalytics.snowplow.enrich.stream.sources.KinesisSource - Checkpointing shard shardId-00000

[RecordProcessor-0002] INFO com.snowplowanalytics.snowplow.enrich.stream.sources.KinesisSource - Processing 46 records from shardId-0000

[RecordProcessor-0005] INFO com.snowplowanalytics.snowplow.enrich.stream.sources.KinesisSource - Processing 61 records from shardId-00000

[RecordProcessor-0002] INFO com.snowplowanalytics.snowplow.enrich.stream.sources.KinesisSource - Processing 26 records from shardId-000000

[RecordProcessor-0005] INFO com.snowplowanalytics.snowplow.enrich.stream.sources.KinesisSource - Processing 30 records from shardId-000

[RecordProcessor-0002] INFO com.snowplowanalytics.snowplow.enrich.stream.sinks.KinesisSink - Writing 500 records to Kinesis stream snowplow-enriched-good

[RecordProcessor-0005] INFO com.snowplowanalytics.snowplow.enrich.stream.sinks.KinesisSink - Writing 500 records to Kinesis stream snowplow-enriched-good

[RecordProcessor-0002] INFO com.snowplowanalytics.snowplow.enrich.stream.sinks.KinesisSink - Successfully wrote 500 out of 500 records

[RecordProcessor-0002] INFO com.snowplowanalytics.snowplow.enrich.stream.sinks.KinesisSink - Writing 7 records to Kinesis stream snowplow-enriched-good

[RecordProcessor-0002] INFO com.snowplowanalytics.snowplow.enrich.stream.sinks.KinesisSink - Successfully wrote 7 out of 7 records

[RecordProcessor-0002] INFO com.snowplowanalytics.snowplow.enrich.stream.sources.KinesisSource - Checkpointing shard shardId-0000

[RecordProcessor-0005] INFO com.snowplowanalytics.snowplow.enrich.stream.sinks.KinesisSink - Successfully wrote 500 out of 500 records

[RecordProcessor-0005] INFO com.snowplowanalytics.snowplow.enrich.stream.sinks.KinesisSink - Writing 11 records to Kinesis stream snowplow-enriched-good

So, how can I increase the processing speed of the new enricher?

Thanks a lot for the feedback @karan.

It looks much more stable and consumes less CPU compared to the previous enricher.

That’s great news. Does this mean that the CPU isn’t going to 100% any more?

But when I checked the logs, most of the records are getting dumped into the snowplow-enriched-bad queue, whereas with the previous enricher most of the records were getting dumped into snowplow-enriched-good.

Could you please take a look at the errors contained in this bad queue? They will say what the problem is.
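If you don’t already have a consumer on that stream, a quick way to peek at a few of these bad rows is something like the following (a rough sketch using the AWS SDK for Java v1 from Scala, not an official Snowplow tool; the region, stream name, and Scala 2.13 collection converters are assumptions, and a real reader would loop on the next shard iterator):

```scala
// Rough sketch: sample a handful of records from the bad stream.
import java.nio.charset.StandardCharsets.UTF_8
import com.amazonaws.services.kinesis.AmazonKinesisClientBuilder
import com.amazonaws.services.kinesis.model.{GetRecordsRequest, GetShardIteratorRequest}
import scala.jdk.CollectionConverters._

object PeekBadRows extends App {
  val kinesis = AmazonKinesisClientBuilder.standard().withRegion("eu-west-1").build()

  // Read from the first shard, starting at the oldest available record
  val shardId = kinesis.describeStream("snowplow-enriched-bad")
    .getStreamDescription.getShards.asScala.head.getShardId

  val iterator = kinesis.getShardIterator(
    new GetShardIteratorRequest()
      .withStreamName("snowplow-enriched-bad")
      .withShardId(shardId)
      .withShardIteratorType("TRIM_HORIZON")
  ).getShardIterator

  // A single GetRecords call is enough to sample a few payloads
  kinesis
    .getRecords(new GetRecordsRequest().withShardIterator(iterator).withLimit(10))
    .getRecords.asScala
    .foreach(record => println(UTF_8.decode(record.getData).toString))
}
```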

with the previous enricher we are able to process 150K+ records, but with the 1.3.0 enricher it’s only around 80-90K, almost half

This might be related to all these bad rows that you have. In the new logs we can see Rate exceeded for shard shardId-00XXXXXXXX in stream snowplow-enriched-bad: the number of shards of the bad queue is not big enough for the number of bad rows that you have, so Stream Enrich is being throttled, and while it waits it’s not processing new data, hence the drop in throughput. You could increase the number of shards of the bad queue, but we still need to see what these bad rows contain and fix the underlying issue.
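If you do decide to reshard, UpdateShardCount does it in place, e.g. (again a sketch with the AWS SDK for Java v1; the region and target count are only examples):

```scala
// Sketch: bumping the bad stream's shard count.
// UNIFORM_SCALING splits every shard evenly.
import com.amazonaws.services.kinesis.AmazonKinesisClientBuilder
import com.amazonaws.services.kinesis.model.UpdateShardCountRequest

object ScaleBadStream extends App {
  val kinesis = AmazonKinesisClientBuilder.standard().withRegion("eu-west-1").build()
  kinesis.updateShardCount(
    new UpdateShardCountRequest()
      .withStreamName("snowplow-enriched-bad")
      .withTargetShardCount(2) // example: from 1 to 2 shards
      .withScalingType("UNIFORM_SCALING")
  )
}
```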

@BenB: No, it’s again consuming 100%+ CPU. I removed the SQL enrichment, due to which records were getting failed, and checked the usage: it’s between 100% and 200% CPU utilisation.

Below is the error for which the SQL enrichment is failing… The same messages pass fine in the old enricher.

{"enrichment":{"schemaKey":"iglu:com.snowplowanalytics.snowplow.enrichments/sql_query_enrichment_config/jsonschema/1-0-0","identifier":"sql-query"},"message":{"error":"The placeholder map error. The map: Some(IntMap()), where count is: 1"}}]}

To be precise, every message is failing the SQL enrichment.

Below is my SQL enrichment config:

{
  "schema": "iglu:com.snowplowanalytics.snowplow.enrichments/sql_query_enrichment_config/jsonschema/1-0-0",
  "data": {
    "name": "sql_query_enrichment_config",
    "vendor": "com.snowplowanalytics.snowplow.enrichments",
    "enabled": true,
    "parameters": {
      "inputs": [
        {
          "placeholder": 1,
          "json": {
            "field": "contexts",
            "schemaCriterion": "iglu:com.snowplowanalytics.iglu/anything-a/jsonschema/1-*-*",
            "jsonPath": "$.question_id"
          }
        }
      ],
      "database": {
        "mysql": {
          "host": "host",
          "port": port,
          "sslMode": false,
          "username": "user",
          "password": "password",
          "database": "database"
        }
      },
      "query": {
        "sql": "select field1,field2... fieldn from questions_meta a inner join  questions b on a.question_id = b.question_id AND a.question_id = ?"
      },
      "output": {
        "expectedRows": "AT_MOST_ONE",
        "json": {
          "schema": "iglu:com.snowplowanalytics.iglu/anything-b/jsonschema/1-0-0",
          "describes": "EVERY_ROW",
          "propertyNames": "AS_IS"
        }
      },
      "cache": {
        "size": 3000,
        "ttl": 60
      }
    }
  }
}

Hi @karan!

Regarding the SQL enrichment failing, we identified an issue where the enrichment was failing if one of the placeholders (question_id in your case) was null: the error you pasted means the enrichment expected 1 placeholder value but could extract none from the event, hence the empty IntMap. This has been fixed in Stream Enrich 1.3.1.

We managed to reproduce the CPU issue and used profiling on Stream Enrich to investigate. A race condition in scala-lru-map, used by the IP lookups enrichment, was causing a memory leak: memory usage kept growing until it stayed close to 100% all the time, triggering constant garbage collection and using all the CPU for that. This explains why Stream Enrich stopped processing events without actually failing. This issue was solved in 1.3.0.
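To give an intuition for this kind of bug, here is a toy sketch (not the actual scala-lru-map code, just an illustration of the failure mode) of how an unsynchronized LRU can grow past its bound when several threads hit it concurrently:

```scala
// Toy illustration: two threads can both read the same head entry to
// evict; one removal succeeds, the other is a no-op, and both
// insertions still happen, so the map ends up one entry over capacity.
// Repeat this race and the structure keeps growing.
import scala.collection.mutable

class UnsafeLru[K, V](capacity: Int) {
  private val entries = mutable.LinkedHashMap.empty[K, V]

  def put(key: K, value: V): Unit = {
    if (entries.size >= capacity)     // check size ...
      entries.remove(entries.head._1) // ... evict the oldest entry ...
    entries.put(key, value)           // ... insert: these steps are not atomic
  }

  def size: Int = entries.size
}
```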

It’s not clear to me what 200% CPU usage means. Would you have a way to provide us with a graph showing the CPU usage of each machine running Stream Enrich over time, like this one (in CloudWatch metrics)?

Also, which command do you use to start Stream Enrich?

Hi @BenB
We are running the below command for the enricher:
sudo docker run -d -v /snowplow/config:/snowplow/config snowplow-docker-registry.bintray.io/snowplow/stream-enrich-kinesis:0.21.0 --config /snowplow/config/enrich.hocon --resolver file:/snowplow/config/resolver.json --enrichments file:/snowplow/config/enrichments/ --force-cached-files-download

Regarding the stats, I will update you soon.

Hi @karan,

The latest enrich version is 1.3.1, and it’s available directly on Docker Hub; we recommend using this version. Please note that it comes with a new format for the bad rows emitted by enrich; you can read about it in this blog post.

The upgrade guides up to this version can be found here.

Thanks @BenB, I will definitely use 1.3.1 and let you know.

Hi @BenB

I tried 1.3.1, and with it I am getting both the CPU utilisation issue and the SQL enrichment failure issue.

{"schemaKey":"iglu:com.snowplowanalytics.snowplow.enrichments/sql_query_enrichment_config/jsonschema/1-0-0","identifier":"sql-query"},"message":{"error":"The placeholder map error. The map: Some(IntMap()), where count is: 1"}}]}

Hi @karan,

Sorry for the back and forth, but we recently realised that the SQL enrichment bug wasn’t actually fixed in 1.3.1; please give a try to the fresh 1.3.2.

However, I’m a bit puzzled by the CPU utilisation bug, as in our experience it definitely went away in the 1.3.x branch.

Hi @anton,

Thanks for the SQL enrichment fix… 1.3.2 is working fine for SQL enrichments now.

But the CPU utilisation issue is still there. I tried 1.3.2 with the SQL enrichment removed as well, but there is no improvement in CPU utilisation; it keeps going over 100%.

Hi @karan,

Great that the SQL enrichment is now working!

Regarding the CPU going over 100%, it’s hard to guess what can be wrong. We managed to troubleshoot the same issue in one of our pipelines using profiling. To do that, you need to add -Dcom.sun.management.jmxremote.port=5555 -Dcom.sun.management.jmxremote.rmi.port=5555 -Dcom.sun.management.jmxremote.ssl=false -Dcom.sun.management.jmxremote.authenticate=false -Djava.rmi.server.hostname=127.0.0.1 to the Java options when running enrich (5555 being the port that you want to use), and then you can inspect the JVM that runs enrich with a tool like VisualVM, for instance.

In our case we discovered that enrich was constantly doing garbage collection, and we found out that this was due to a memory leak (fixed by 48e4ce8be913).

Using profiling you’ll be able to determine how the CPU is used and to find the culprit.

Hi @BenB,

Sorry for the late reply… I will definitely test this and share the outcome with you.

Hi @BenB,

I tested the fix and the SQL enrichment is working as expected. The only concern I have is that the enricher is lagging behind the collector, meaning the number of records pushed by the collector is higher than the number of records processed by the enricher. Due to this, a huge amount of lag gets introduced into the pipeline.

Any suggestion to overcome this issue?

I tested the fix and the SQL enrichment is working as expected.

Good to hear!!

the enricher is lagging behind the collector

Have you checked in the Kinesis metrics of the collector payloads stream that you didn’t reach the read quota?
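For reference, that metric can also be pulled programmatically, something like this (a sketch with the AWS SDK for Java v1; the stream name, region, and Scala 2.13 converters are assumptions):

```scala
// Sketch: pull the last hour of read-throttling counts for the raw
// stream from CloudWatch.
import java.util.Date
import com.amazonaws.services.cloudwatch.AmazonCloudWatchClientBuilder
import com.amazonaws.services.cloudwatch.model.{Dimension, GetMetricStatisticsRequest}
import scala.jdk.CollectionConverters._

object ReadThrottles extends App {
  val cw  = AmazonCloudWatchClientBuilder.standard().withRegion("eu-west-1").build()
  val now = System.currentTimeMillis()

  val result = cw.getMetricStatistics(
    new GetMetricStatisticsRequest()
      .withNamespace("AWS/Kinesis")
      .withMetricName("ReadProvisionedThroughputExceeded")
      .withDimensions(new Dimension().withName("StreamName").withValue("snowplow-good"))
      .withStartTime(new Date(now - 3600 * 1000))
      .withEndTime(new Date(now))
      .withPeriod(300)
      .withStatistics("Sum")
  )

  result.getDatapoints.asScala.foreach(dp => println(s"${dp.getTimestamp}: ${dp.getSum}"))
}
```

A non-zero Sum there means consumers are being throttled on reads of that stream.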

Aren’t there any errors in the enrich logs in CloudWatch?

@BenB

Thanks for the reply…

I will check this and get back