Kinesis Enricher CPU usage recovers slowly after peak

Hello,

We are using Snowplow on a news website that sees traffic peaks during the day. Whenever a peak occurs, we observe that CPU usage does not recover for some enrichers.

Our events are distributed uniformly (via EventID) across our partitions. However, it seems random which enrichers recover quickly and which ones recover slowly. Only the enrichers run on our EC2 instances (Kubernetes nodes), and there are no Kinesis errors either.

Has anyone else observed something similar, or can you give us a hint? We have already tested different Kinesis-related enricher configurations, but that did not solve the issue.

Side information:

  • 32 enrichers (64 shards)
  • 2 shards per container (2 GB RAM limit, 1.9 vCPU limit)
  • 8 vCPU per EC2 machine
  • ~1000 events/sec
  • Enricher Version: 1.4.2

Our enrichment configuration:

    sourceSink {
      enabled =  "kinesis"

      region = eu-central-1
      threadPoolSize = 100
      disableCloudWatch = true
      aws {
        accessKey = default
        secretKey = default
      }

      maxRecords = 10000
      initialPosition = TRIM_HORIZON
      initialTimestamp = "2017-05-17T10:00:00Z"

      backoffPolicy {
        minBackoff = 1000
        maxBackoff = 10000
      }
    }
    buffer {
      byteLimit = 4500000
      recordLimit = 500 # Not supported by Kafka; will be ignored
      timeLimit = 250
    }

    appName = "{{ .Values.environment }}-{{ .Values.tenant }}-stream-enrich-manifest"
  }

Thank you :slight_smile:

Hi @capchriscap
Welcome to the Snowplow community.
Cheers,
Eddie

This seems unusual.

This isn’t a particularly old version of enrich (~1.4.2) but it might be worth updating anyway.

Do you scale the number of pods for enrich on your Kubernetes cluster or is this a static number? Do you have some logs about which containers are picking up / discarding leases?

Hey @mike,

Thanks for the quick reply.

We hope to update soon, but we are a little cautious with major updates :slight_smile:

Regarding scaling: we don’t scale at the moment because we would first like to “stabilize” (and understand) the enrichers under load in order to find the right scaling policy.
Unfortunately, I also did not find any logs about lease changes; the leases seem to be stable.

Could these issues be caused by garbage collection or something similar? Up to 60-80% CPU load the enrichers work smoothly, but between 80-100% CPU load the performance seems to be reduced (for some enrichers) :thinking:

Btw: I really appreciate that you support us users here so well, even if some of us are “only” using the open-source variant! :+1: :partying_face:

PS: At first we thought it was an EC2 CPU limitation by AWS, but since we are using M-family instances we don’t have any CPU (credit) limitations.

Hi @capchriscap ,

The issue seems similar to this one, but that was fixed in 1.3.0.

Are you using a partitionKey for the data written to Kinesis?
Have you checked that you’re not reaching Kinesis quotas?
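
If you want to check, the stream-level throttling metrics can be queried from CloudWatch, e.g. something along these lines (sketch only; the stream name and time window are placeholders):

    # write throttling on the enriched (output) stream; use ReadProvisionedThroughputExceeded
    # and the raw stream name to check the read side
    aws cloudwatch get-metric-statistics \
      --namespace AWS/Kinesis \
      --metric-name WriteProvisionedThroughputExceeded \
      --dimensions Name=StreamName,Value=<your-enriched-stream> \
      --statistics Sum \
      --period 300 \
      --start-time <start-time> \
      --end-time <end-time> \
      --region eu-central-1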

Could these issues be caused by garbage collection or something similar? Up to 60-80% CPU load the enrichers work smoothly, but between 80-100% CPU load the performance seems to be reduced (for some enrichers)

It’s hard to make guesses. Would you be able to run the enrichers with the JMX port open on the JVM, so that when an enricher gets stuck you can attach a profiler (e.g. VisualVM)? To do that you can add -Dcom.sun.management.jmxremote.port=5555 -Dcom.sun.management.jmxremote.rmi.port=5555 -Dcom.sun.management.jmxremote.ssl=false -Dcom.sun.management.jmxremote.authenticate=false -Djava.rmi.server.hostname=127.0.0.1 to the JAVA_OPTS.
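
For example, something along these lines in the container environment should do it (sketch; shell syntax, port 5555 is just an example and the JAVA_OPTS variable name depends on how your image starts the JVM):

    # expose JMX without SSL/auth, intended for local (port-forwarded) access only
    export JAVA_OPTS="$JAVA_OPTS \
      -Dcom.sun.management.jmxremote.port=5555 \
      -Dcom.sun.management.jmxremote.rmi.port=5555 \
      -Dcom.sun.management.jmxremote.ssl=false \
      -Dcom.sun.management.jmxremote.authenticate=false \
      -Djava.rmi.server.hostname=127.0.0.1"

With java.rmi.server.hostname=127.0.0.1 you can then reach it from your machine with e.g. kubectl port-forward <pod> 5555:5555 and point VisualVM at localhost:5555.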

Hi @BenB,

We are using “event_id” as the partition key to distribute the events evenly, even though we have “power users” (e.g. a company network behind a single IP) that make more requests than others and would otherwise break the even distribution.
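
For reference, this is roughly how the out section of our Stream Enrich config looks (sketch only; stream names replaced with placeholders, field layout as in the reference example config):

    out {
      enriched = "<enriched-stream-name>"
      bad = "<bad-stream-name>"

      # spread enriched events evenly across shards
      partitionKey = event_id
    }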

Thanks for sharing the similar post. However, as we are not using the SQL enrichment, I assumed it does not apply to us (even though the CPU example you posted looks pretty similar to ours!). Our enrichments are:

  • anon_ip
  • campaign_attribution
  • iab_spiders_and_robots (fetched from S3)
  • referer_parser
  • yauaa

Additionally, our Kinesis streams show no “Write throughput exceeded” errors, and our enriched bad events are also pretty low in number, with no spikes during the CPU peaks.

It’s hard to make guesses. Would you be able to run the enrichers with the JMX port open on the JVM, so that when an enricher gets stuck you can attach a profiler (e.g. VisualVM)? To do that you can add -Dcom.sun.management.jmxremote.port=5555 -Dcom.sun.management.jmxremote.rmi.port=5555 -Dcom.sun.management.jmxremote.ssl=false -Dcom.sun.management.jmxremote.authenticate=false -Djava.rmi.server.hostname=127.0.0.1 to the JAVA_OPTS.

I can always set those options inside our containers via environment variables. I hope I can reproduce the issue with a load test running over several days; maybe I will know more then. But from your words I gather that you have not seen this issue before, which suggests it is probably not a known issue.

As I have no clue what else it could be, I was thinking of JVM issues like the garbage collector.
You scale at 60% CPU to prevent these high-CPU issues out of the box, don’t you? I read that somewhere in a previous post. Maybe one option is to add more Kinesis shards and more enrichers (with a lower CPU limit, e.g. 0.8 vCPU per enricher shard instead of 0.95 vCPU).
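
Regarding the garbage-collector theory: I guess we could enable GC logging through the same JAVA_OPTS mechanism and compare the logs during a peak (sketch; the -Xlog syntax assumes a JDK 9+ image, an older JDK 8 image would need -XX:+PrintGCDetails / -Xloggc instead):

    # append unified GC logging (JDK 9+) to the existing JVM options
    export JAVA_OPTS="$JAVA_OPTS -Xlog:gc*:file=/tmp/gc.log:time,uptime,level,tags"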