RDB Stream Shredder: java.lang.OutOfMemoryError: Java heap space

Hello,

I just discovered that my RDB Stream Shredder has been broken since Nov 16 (my bad for not monitoring it) with the following error: java.lang.OutOfMemoryError: Java heap space. It’s strange because it had been working without any issue for more than a month prior to that date (see my other topic). It could be due to an increase in traffic during this period, but I’m not sure.

I’m adding the logs at the end.

It was running on a t3a.small and I just tried upgrading it to a t3a.medium and then to a t3a.large, without success. After a few minutes the error comes back.

I eventually installed htop on the instance and the memory is not even close to reaching its limit.

Some more details:

  • I’m running the shredder in AWS Elastic Beanstalk Docker
  • I have 1 day of data retention on the Kinesis stream, so I guess when I restart the shredder now it shreds yesterday’s events (it had been failing since Nov 16 and I only noticed the problem today)

Dockerfile:

FROM snowplow/snowplow-rdb-stream-shredder:1.2.0

# Required by Beanstalk
EXPOSE 8000

CMD [\
  "--config", \
  # Must be in base 64. To generate it: base64 config.hocon
  "ewogICMgSHVtYW4t...", \
  "--iglu-config", \
  # Must be in base 64. To generate it: base64 iglu-resolver.json
  "ewogICJzY2hlbWEiOiAia..." \
]
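For reference, here is a sketch of how the base64 strings above can be generated from the plain files (piping through `tr -d '\n'` keeps the output on one line without relying on the GNU-only `-w0` flag; the demo file name and content below are made up):

```shell
# Demo: encode a toy HOCON file the same way as config.hocon above.
printf '{ "name": "Acme Redshift" }' > demo.hocon

# base64-encode and strip newlines so it fits on a single CMD argument line
DEMO_B64=$(base64 demo.hocon | tr -d '\n')
echo "$DEMO_B64"

# Round-trip check: decoding must give the original file content back
printf '%s' "$DEMO_B64" | base64 -d
```

The same two commands, pointed at config.hocon and iglu-resolver.json, produce the values for `--config` and `--iglu-config`.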

config.hocon:

{
  # Human-readable identifier, can be random
  "name": "Acme Redshift",
  # Machine-readable unique identifier, must be a UUID
  "id": "123e4567-e89b-12d3-a456-426655440000",

  # Data Lake (S3) region
  "region": ${S3_REGION},
  # SQS topic name used by Shredder and Loader to communicate
  "messageQueue": ${SQS_TOPIC},

  # Shredder-specific configs
  "shredder": {
    # "batch" for Spark job and "stream" for fs2 streaming app
    "type" : "stream",
    # For batch: path to enriched archive (must be populated separately with run=YYYY-MM-DD-hh-mm-ss directories) for S3 input
    # "input": "s3://bucket/input/",
    # For stream: appName, streamName, region triple for kinesis
    "input": {
     # kinesis and file are the only options for stream shredder
     "type": "kinesis",
     # KCL app name - a DynamoDB table will be created with the same name
     "appName": ${DYNAMO_TABLE},
     # Kinesis Stream name
     "streamName": ${KINESIS_STREAM_ENRICHER_GOOD},
     # Kinesis region
     "region": ${KINESIS_REGION},
     # Kinesis position: LATEST or TRIM_HORIZON
     "position": "TRIM_HORIZON"
    },
    # For stream shredder: frequency to emit the loading-finished message - 5,10,15,20,30,60 etc. minutes
    "windowing": "5 minutes",
    # Path to shredded archive
    "output": {
      # Path to shredded output
      "path": ${S3_BUCKET_SCHREDDED},
      # Shredder output compression, GZIP or NONE
      "compression": "GZIP"
    }
  },

  # Schema-specific format settings (recommended to leave all three groups empty and use TSV as default)
  "formats": {
    # Format used by default (TSV or JSON)
    "default": "TSV",
    # Schemas to be shredded as JSONs, corresponding JSONPath files must be present. Automigrations will be disabled
    "json": [
      "iglu:com.acme/json-event/jsonschema/1-0-0",
      "iglu:com.acme/json-event/jsonschema/2-*-*"
    ],
    # Schemas to be shredded as TSVs, presence of the schema on Iglu Server is necessary. Automigrations enabled
    "tsv": [ ],
    # Schemas that won't be loaded
    "skip": [
      "iglu:com.acme/skip-event/jsonschema/1-*-*"
    ]
  },

  # Optional. S3 path that holds JSONPaths
  #"jsonpaths": "s3://bucket/jsonpaths/",

  # Warehouse connection details
  "storage" : {
    # Database, redshift is the only acceptable option
    "type": "redshift",
    # Redshift hostname
    "host": ${REDSHIFT_HOST},
    # Database name
    "database": ${REDSHIFT_DATABASE},
    # Database port
    "port": 5439,
    # AWS Role ARN allowing Redshift to load data from S3
    "roleArn": ${REDSHIFT_IAM_ROLE_ARN},
    # DB schema name
    "schema": ${REDSHIFT_SCHEMA},
    # DB user with permissions to load data
    "username": ${REDSHIFT_USER},
    # DB password
    "password": ${REDSHIFT_PASSWORD},
    # Custom JDBC configuration
    "jdbc": {"ssl": true},
    # MAXERROR, amount of acceptable loading errors
    "maxError": 10
  },

  # Additional steps. analyze, vacuum and transit_load are valid values
  "steps": ["analyze"],

  # Observability and reporting options
  "monitoring": {
  #   # Snowplow tracking (optional)
  #   "snowplow": {
  #     "appId": "redshift-loader",
  #     "collector": "snplow.acme.com",
  #   }

  #   # Optional, for tracking runtime exceptions
  #   "sentry": {
  #     "dsn": "http://sentry.acme.com"
  #   },

  #   # Optional, configure how metrics are reported
  #   "metrics": {
  #     # Optional, send metrics to StatsD server
  #     "statsd": {
  #       "hostname": "localhost",
  #       "port": 8125,
  #       # Any key-value pairs to be tagged on every StatsD metric
  #       "tags": {
  #         "app": "rdb-loader"
  #       }
  #       # Optional, override the default metric prefix
  #       # "prefix": "snowplow.rdbloader."
  #     },

  #     # Optional, print metrics on stdout (with slf4j)
  #     "stdout": {
  #       # Optional, override the default metric prefix
  #       # "prefix": "snowplow.rdbloader."
  #     }
  #   }
  }
}
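(For anyone reusing this config: the `${...}` placeholders are HOCON environment-variable substitutions, so the matching variables must exist in the container environment, which Beanstalk provides in my setup. A minimal sketch with made-up example values; the variable names must match the placeholders exactly:)

```shell
# Hypothetical values; the real ones are set in the Beanstalk environment.
export S3_REGION="eu-west-1"
export SQS_TOPIC="sp-rdb-loader-queue"
export DYNAMO_TABLE="sp-shredder-kcl"
export KINESIS_STREAM_ENRICHER_GOOD="enriched-good"
export KINESIS_REGION="eu-west-1"

echo "Shredder will read ${KINESIS_STREAM_ENRICHER_GOOD} in ${KINESIS_REGION}"
```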

Logs when I discovered the incident: https://pastebin.com/raw/Ezzd4d9K

Note the last log line: INFO software.amazon.kinesis.leases.LeaseCleanupManager - Number of pending leases to clean before the scan : 0. This message had been looping from Nov 16 to Dec 01. There was nothing else at the tail of the logs.

Logs when I restart the shredder, after a while: https://pastebin.com/raw/gvvuCeuc

It works for 5-10 minutes and then java.lang.OutOfMemoryError: Java heap space comes back.

Any idea how I could solve the issue?

I might try creating new Kinesis/SQS resources in the meantime, in case the problem comes from there.

Edit: the last restart failed again but with a different error. What’s interesting in this case is that it eventually shuts itself down, so Docker was able to restart it automatically: https://pastebin.com/raw/Yw7bsQXD

Edit2: there is a new error about the KinesisAsyncClient: https://pastebin.com/raw/RD5g3YPW. It’s now looping on this error.

Edit3: from what I’m starting to understand:

after a few minutes there is a message: INFO software.amazon.kinesis.coordinator.PeriodicShardSyncManager - Skipping shard sync for ${kinesis_stream} due to the reason - Hash Ranges are complete for ${kinesis_stream} (I don’t know what that means though; maybe a sync issue between Kinesis, DynamoDB and the Shredder?)

which is directly followed by something about “elected leaders”: INFO software.amazon.kinesis.coordinator.DeterministicShuffleShardSyncLeaderDecider - Elected leaders: ${uuid}

which is directly followed by an attempt to publish metrics to CloudWatch, which always fails: WARN software.amazon.kinesis.metrics.CloudWatchMetricsPublisher - Could not publish 20 datums to CloudWatch ... java.util.concurrent.TimeoutException (is there something to configure in the hocon, or do I need to modify the IAM role?)

from there it stays in error forever, and eventually java.lang.OutOfMemoryError: Java heap space messages start to appear.

Edit4: I added cloudwatch:* on resource:* to the IAM role and sending the metrics to CloudWatch still fails: https://pastebin.com/raw/3BfMTRna. Or maybe it’s a more general network issue? (all the ping and connection errors). It seems to be restarting by itself though, so at least it keeps shredding (for now).

Hi @guillaume, you could try configuring the app to use more of the available memory, e.g. extend your CMD with these command-line options: -Xms6G -Xmx6G. These are standard JVM flags for setting the heap size. By default the JVM will use only 25% of the available memory, which explains why htop says it is not close to reaching its limit.
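To make that 25% default concrete, here is a quick back-of-the-envelope sketch of the default heap each of the instance types tried above would get (nominal AWS memory sizes, JVM default MaxRAMPercentage of 25), which shows why upgrading the instance alone only raised the heap a little:

```shell
# Default heap at 25% of RAM for each instance type mentioned in the thread.
for entry in "t3a.small:2" "t3a.medium:4" "t3a.large:8"; do
  type=${entry%%:*}       # instance type, e.g. t3a.small
  ram_gb=${entry##*:}     # nominal RAM in GB
  heap_mb=$(( ram_gb * 1024 / 4 ))   # 25% of RAM, in MB
  echo "$type: ${ram_gb}GB RAM -> ~${heap_mb}MB default heap"
done
```

So even the t3a.large only gets about 2GB of heap by default, far less than the 6GB suggested above.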

Having said that… I don’t really know why your stream shredder requires so much memory. The stream shredder has never been used much in production, so we haven’t explored its failure scenarios.


@istreeter thanks for the help, I had no idea the JVM was using only 25% of the available memory.

And sorry for the late follow up.

Just in case it can help someone else:

I couldn’t pass -Xms6G -Xmx6G as arguments to the CMD:

[ioapp-compute-0] ERROR com.snowplowanalytics.snowplow.rdbloader.shredder.stream.Main - Configuration error: Unexpected option: -X

But I was able to do it by using env variables in the Dockerfile:

ENV JAVA_TOOL_OPTIONS="-Xms6G -Xmx6G"

Which logs:

Picked up JAVA_TOOL_OPTIONS: -Xms6G -Xmx6G

And indeed it increases the memory allocated to the app.

I’m still testing whether it solves the issue though.
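Putting the pieces together, the Dockerfile now looks roughly like this (base64 strings abbreviated, same as earlier; the -XX:MaxRAMPercentage line is an alternative I have not tested that scales the heap with the instance size on JDK 10+ instead of hard-coding 6G):

```dockerfile
FROM snowplow/snowplow-rdb-stream-shredder:1.2.0

# Required by Beanstalk
EXPOSE 8000

# Heap flags must go through the environment, not through CMD,
# because the app's own argument parser rejects -X options.
ENV JAVA_TOOL_OPTIONS="-Xms6G -Xmx6G"
# Alternative (JDK 10+): let the heap scale with the instance:
# ENV JAVA_TOOL_OPTIONS="-XX:MaxRAMPercentage=80"

CMD ["--config", "ewogICMgSHVtYW4t...", "--iglu-config", "ewogICJzY2hlbWEiOiAia..."]
```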

FYI I didn’t test it until now because the issue never happened again since December 2021. I recreated the pipeline at that time and everything was fine until this week.

But the same issue just happened three days in a row (around the same time of day; might it be related to a surge in traffic?). Until today I was just recreating the pipeline when the issue happened (with Terraform it takes a few minutes), but if it happens every day that won’t be sustainable.

I’m testing with the Java heap space arguments for now, and I’m going to update the stream-shredder to version 3.0.1 in case some of the changes solved the issue.

Edit: I think the surge-in-traffic theory could be the right one, because it first happened a few days before Christmas (we usually see a jump from 1.5M to 3M events/day on our biggest site), and this week one of the trackers grew almost 10x (150k → 1M events/day).

Is it normal that it keeps increasing? Could there be a memory leak?

After restarting the container:

We’ve observed with enrich-kinesis (which also uses the KCL) that while the app is running, its memory keeps increasing until it reaches the maximum allowed, and then it plateaus.

@BenB that matches what I observed.

But it seems like once it plateaus, if there is a surge in traffic it doesn’t have enough memory to handle the surge (hence the java.lang.OutOfMemoryError: Java heap space, even with 6GB allocated to the JVM).

Although the last time it happened, just restarting the container fixed the issue (the used memory went back to 3GB, as seen in the screenshots, so it again had enough headroom to handle the surge). So I just added an alert that tells me when to restart the container.
