Versions 3.1.3 through 3.2.0 gave us issues, mostly around Kinesis shards, and the patch notes for each new version mirror our experience. We have now load tested 3.2.1 heavily and found it very much improved.
However, from an automation point of view (i.e. the pod should be killed and restarted based on log parsing/polling), the following errors should probably be logged at a level other than “error”:
- ERROR software.amazon.kinesis.retrieval.polling.PrefetchRecordsPublisher
- ERROR com.snowplowanalytics.snowplow.enrich.kinesis.Sink
- ERROR software.amazon.kinesis.coordinator.Scheduler - Worker.run caught exception
They aren’t really errors while enrich is running; they are more like informational warnings. A full log example would be (note the WARN followed by the ERROR):
[pool-1-thread-2] WARN com.snowplowanalytics.snowplow.enrich.kinesis.KinesisRun - Skipping checkpointing of shard shardId-000000000011 because this worker no longer owns the lease
[prefetch-cache-shardId-000000000011-0000] ERROR software.amazon.kinesis.retrieval.polling.PrefetchRecordsPublisher -
A suggested change would be to move away from ERROR to something like CAUGHT EXCEPTION:
[pool-1-thread-2] WARN com.snowplowanalytics.snowplow.enrich.kinesis.KinesisRun - Skipping checkpointing of shard shardId-000000000011 because this worker no longer owns the lease
[prefetch-cache-shardId-000000000011-0000] CAUGHT EXCEPTION software.amazon.kinesis.retrieval.polling.PrefetchRecordsPublisher -
This came about because when we parse the logs we get a warn (yellow) followed by an error (red), which makes log parsing and polling difficult. Generally, if we see an error we trigger a restart, while a warn is just logged.
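In the meantime, a rough sketch of how the parsing side could work around this, in case it helps anyone else. It assumes one log entry per line in the "[thread] LEVEL logger - message" shape shown above; the `classify` function, the allow-list and the "restart"/"log"/"ignore" labels are all made up for illustration and are not part of enrich or the KCL. Treating every ERROR from these three loggers as benign is also an assumption, so you may want to match on the message as well.

```python
import re

# Loggers whose ERROR lines are treated as informational (hypothetical allow-list,
# taken from the three loggers named in this post).
BENIGN_ERROR_LOGGERS = (
    "software.amazon.kinesis.retrieval.polling.PrefetchRecordsPublisher",
    "com.snowplowanalytics.snowplow.enrich.kinesis.Sink",
    "software.amazon.kinesis.coordinator.Scheduler",
)

# Assumed line shape: "[thread] LEVEL logger.name - message" (message may be empty).
LINE_RE = re.compile(
    r"^\[(?P<thread>[^\]]+)\]\s+(?P<level>[A-Z]+)\s+(?P<logger>\S+)\s+-\s*(?P<message>.*)$"
)


def classify(line):
    """Return "restart", "log" or "ignore" for a single log line."""
    match = LINE_RE.match(line.strip())
    if match is None:
        # Stack trace continuation or an unrecognised format: take no action.
        return "ignore"
    level = match["level"]
    logger = match["logger"]
    if level == "ERROR":
        # Downgrade ERRORs from the allow-listed loggers to "log only".
        return "log" if logger in BENIGN_ERROR_LOGGERS else "restart"
    if level == "WARN":
        return "log"
    return "ignore"
```

With this, the PrefetchRecordsPublisher line from the example above would only be logged, while an ERROR from any other logger would still trigger a restart.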