Enricher high CPU utilisation issue

Hi,

I am running stream-enrich-kinesis in production. After running the enricher for only a few minutes, the machine hosting it goes down due to high CPU utilisation by the enricher.

The CPU utilisation frequently goes beyond 100%. Also, I am receiving around 200K records, but my enricher is only able to process 120K, which causes very high latency in the Snowplow pipeline.

I am running 8 instances of the enricher with 2 CPUs and 8Gi memory each, and I have 16 shards in Kinesis.

Below is the config of my enricher.

enrich {

  streams {
    appName = "snowplow-enrich"
    sourceSink {
      enabled = kinesis
      region = <>

      threadPoolSize = 10

      aws {
        accessKey = <>
        secretKey = <>
      }
      initialPosition = LATEST
      backoffPolicy {
        minBackoff = 3000   # minimum delay before retrying failed writes, in ms
        maxBackoff = 600000 # maximum retry delay, in ms
      }
      maxRecords = 10000
    }

    in {
      raw = "snowplow-good"
    }

    out {
      enriched = "snowplow-enriched-good"
      bad = "snowplow-enriched-bad"
      pii = "snowplow-enriched-pii"
      partitionKey = "user_ipaddress"
    }

    buffer {
      byteLimit = 4500000 # flush to Kinesis once 4.5 MB are buffered...
      recordLimit = 500   # ...or 500 records...
      timeLimit = 3000    # ...or 3000 ms, whichever comes first
    }

  }
}

Kindly help me resolve this issue. All of our use cases depend on Snowplow's performance.

Hi @karan,

Welcome to the Snowplow community! Thank you for reporting the issue.

We’re currently investigating what the issue could be; we’ll get back to you once we’ve identified the culprit.

Thanks… Please let me know ASAP, as my pipelines depend on Snowplow only.

@karan could you please share the logs of Stream Enrich when this happens?

@BenB Please find the enricher logs below. We are getting the errors below, but they are not causing the enricher to fail. Also, CPU usage increases abruptly while writing and processing records.

[RecordProcessor-0010] INFO com.snowplowanalytics.snowplow.enrich.stream.sinks.KinesisSink - Writing 500 records to Kinesis stream snowplow-enriched-good

[RecordProcessor-0010] INFO com.snowplowanalytics.snowplow.enrich.stream.sinks.KinesisSink - Successfully wrote 249 out of 500 records

[RecordProcessor-0010] ERROR com.snowplowanalytics.snowplow.enrich.stream.sinks.KinesisSink - 251 records failed with error code ProvisionedThroughputExceededException. Example error message: Rate exceeded for shard shardId-000 in stream snowplow-enriched-good under account 942XXXXX.

[RecordProcessor-0010] ERROR com.snowplowanalytics.snowplow.enrich.stream.sinks.KinesisSink - Retrying all failed records in 8161 milliseconds…

[RecordProcessor-0011] INFO com.snowplowanalytics.snowplow.enrich.stream.sinks.KinesisSink - Successfully wrote 140 out of 140 records

[RecordProcessor-0011] INFO com.snowplowanalytics.snowplow.enrich.stream.sinks.KinesisSink - Writing 500 records to Kinesis stream snowplow-enriched-good

[RecordProcessor-0011] INFO com.snowplowanalytics.snowplow.enrich.stream.sinks.KinesisSink - Successfully wrote 500 out of 500 records

Thanks for the logs @karan.

This happens when Kinesis quotas are reached. Please check the AWS documentation to learn more about these quotas.

In your configuration we can see partitionKey = "user_ipaddress". Could you please check the throughput of each shard and confirm that the shards are evenly balanced when using the IP address as the partition key?

If you don’t need the data partitioned by IP downstream, we would recommend not setting this parameter, so that the data is evenly partitioned by random UUIDs.
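
For a rough sanity check, here is a minimal sketch of that capacity reasoning, assuming the published per-shard Kinesis write limits (1 MiB/s or 1,000 records/s per shard); the average record size below is an assumption you would replace with your own figure:

# Illustrative Kinesis write-capacity check (not Snowplow code).
# Assumed published per-shard write limits: 1 MiB/s or 1,000 records/s.
SHARD_BYTES_PER_SEC = 1_048_576  # 1 MiB/s per shard
SHARD_RECORDS_PER_SEC = 1_000    # 1,000 records/s per shard

def stream_can_absorb(shards: int, records_per_sec: int, avg_record_bytes: int) -> bool:
    """True if the aggregate write rate fits within the stream's provisioned capacity."""
    fits_records = records_per_sec <= shards * SHARD_RECORDS_PER_SEC
    fits_bytes = records_per_sec * avg_record_bytes <= shards * SHARD_BYTES_PER_SEC
    return fits_records and fits_bytes

# Example: 16 shards, ~5,000 enriched records/s, ~2 KB per record (assumed size)
print(stream_can_absorb(16, 5_000, 2_048))  # -> True in aggregate

Note that even when the aggregate capacity is sufficient, a skewed partition key (for example, one very frequent IP address) can push a single shard over its limit and trigger ProvisionedThroughputExceededException, which is why a random partition key tends to behave better.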

The IP is not required; I will remove it and update you on the performance… Thanks

@BenB Thanks again… I have made the changes and removed the partition key. Please find the configuration below:

    out {
      enriched = "snowplow-enriched-good"
      bad = "snowplow-enriched-bad"
      pii = "snowplow-enriched-pii"
      partitionKey = ""
    }

The above configuration change helped increase the processing speed of the enricher by 50% (it is now processing around 180K+ records, compared to the previous 120K).
However, the enricher is still showing high CPU utilisation on the machines. Please find screenshots of the CPU utilisation by the enricher below.

Thanks for your feedback @karan; good to hear that you could increase the speed of Enrich with the same number of machines.

Regarding the high CPU utilization, we suspect the KCL (Kinesis Client Library) to be the culprit. In this comment we can see someone else reporting the CPU going to 100%. Stream Enrich is still using an old version of the library and we plan to bump it to the latest (PR here). We will keep you updated when the new version is released.

Do the CPUs stay at 100% forever, or do they get back to normal after some time?

Initially it behaves normally, but after a few minutes it starts utilising CPU above 100% continuously.

Also, could you let me know what the enricher config should be to process 400K records?

Right now, I have 16 shards (Kinesis is behaving normally) and 8 instances with 2 CPUs and 4Gi memory each.

With 16 shards we can’t have 400K events per second; is it 400K events per day?

To determine the optimal number of shards, you need to know the maximum throughput per second, and then you can use the formula here.
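
As an illustration, assuming the usual AWS sizing formula (shards = max(incoming write bandwidth in KiB/s ÷ 1024, outgoing read bandwidth in KiB/s ÷ 2048)), a rough calculation could look like the sketch below; the record size and the single-consumer assumption are placeholders to replace with your own numbers:

import math

# Hedged sketch of shard sizing, assuming the standard AWS formula:
#   shards = max(write_KiB_per_sec / 1024, read_KiB_per_sec / 2048)
def shards_needed(write_kib_per_sec: float, read_kib_per_sec: float) -> int:
    return math.ceil(max(write_kib_per_sec / 1024, read_kib_per_sec / 2048))

# Example: 5,000 records/s at an assumed ~2 KiB each, read by one consumer
write_kib = 5_000 * 2
read_kib = write_kib * 1
print(shards_needed(write_kib, read_kib))  # -> 10 shards at peak

For comparison, 16 shards cap out at roughly 16,000 records/s (or 16 MiB/s) of writes, so 400K events per second would be far out of reach, while 400K events per day is trivial.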

Then, what we usually do is set up an auto scaling group that adjusts the number of enricher instances based on CPU usage.
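
As a sketch of that setup (the group name, region, and target value below are placeholders, not our actual policy), a target-tracking policy on average CPU can be attached to the enrichers' auto scaling group, for example with boto3:

import boto3

# Hedged example: scale the enricher auto scaling group on average CPU.
autoscaling = boto3.client("autoscaling", region_name="eu-west-1")  # example region

autoscaling.put_scaling_policy(
    AutoScalingGroupName="stream-enrich-asg",  # placeholder ASG name
    PolicyName="enrich-cpu-target-tracking",
    PolicyType="TargetTrackingScaling",
    TargetTrackingConfiguration={
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "ASGAverageCPUUtilization"
        },
        "TargetValue": 60.0,  # example target: scale out above ~60% average CPU
    },
)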

@BenB Sorry for the late reply… Here is a correction to my input: we are getting around 3,000 to 5,000 records per second, as I saw in the Kinesis monitoring graph… Please find below the Get and Put records
graph from the Kinesis snowplow-good stream:

Also, we have increased the number of CPUs for the enricher… below is the configuration:

2 machines of 16 cores each, with 5 instances of 2 CPUs each… After increasing the CPUs to 36, we are still seeing over-utilisation of the CPUs (CPU usage over 100%).

So, any suggestion on the number of CPUs required to balance the enricher? We have tried deploying this solution in K8s with autoscaling: the enricher in K8s started properly, but it kept creating new instances every few minutes, which causes problems in the K8s cluster.