Kinesis stream enrich failing

Hey guys - over the past few days I've started running into some issues with our real-time Stream Enrich.

I have a collection of errors that I keep running into and was hoping you could provide some insight:

ERROR com.snowplowanalytics.snowplow.enrich.kinesis.sinks.KinesisSink - Writing failed. java.util.concurrent.TimeoutException: Futures timed out after [10 seconds]

ERROR com.snowplowanalytics.snowplow.enrich.kinesis.sinks.KinesisSink - Writing failed. com.amazonaws.AmazonClientException: Unable to marshall request to JSON: Java heap space

I am running this on an EC2 instance (m4.large) against a Kinesis stream with 9 shards.
config:

enrich {
  source = "kinesis"
  sink = "kinesis"
  aws {
    access-key: "default"
    secret-key: "default"
  }
  streams {
    in: {
      raw: "good"
      maxRecords: 100
      buffer: {
        byte-limit: 1000000
        record-limit: 50
        time-limit: 30000
      }
    }
    out: {
      enriched: "enrich"
      bad: "bad"

      backoffPolicy: {
        minBackoff: 10000
        maxBackoff: 600000
      }
    }
    app-name: "dynamo-table"
    initial-position = "TRIM_HORIZON"
    region: "region"
  }
}

I have also tried adjusting the backoff policy both up and down, as well as adjusting maxRecords and the buffer limits.

Thanks!

Hi @13scoobie,

Going by one of the errors you quoted:

You have reached the maximum capacity of your Kinesis stream. In CloudWatch you can check whether you are losing data. I would suggest adding a shard or two ASAP. Note that the enriched stream has bigger throughput requirements than the raw one, due to the extra data added by enrichment itself.
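To make that concrete, here is a rough boto3 sketch of the kind of check described above: look at the WriteProvisionedThroughputExceeded metric for the enriched stream and, if writes are being throttled, add shards. The stream name "enrich" comes from the config quoted earlier; the region and the +2 shard increment are placeholders, not values from this thread.

```python
# Rough sketch: check for throttled writes on the enriched stream and scale out.
from datetime import datetime, timedelta, timezone

import boto3

STREAM_NAME = "enrich"   # enriched (output) stream from the config above
REGION = "us-east-1"     # placeholder; use your actual region

cloudwatch = boto3.client("cloudwatch", region_name=REGION)
kinesis = boto3.client("kinesis", region_name=REGION)

# Sum of throttled write records over the last hour.
now = datetime.now(timezone.utc)
stats = cloudwatch.get_metric_statistics(
    Namespace="AWS/Kinesis",
    MetricName="WriteProvisionedThroughputExceeded",
    Dimensions=[{"Name": "StreamName", "Value": STREAM_NAME}],
    StartTime=now - timedelta(hours=1),
    EndTime=now,
    Period=300,
    Statistics=["Sum"],
)
throttled = sum(point["Sum"] for point in stats["Datapoints"])
print(f"Throttled writes in the last hour: {throttled}")

# If writes are being throttled, add a couple of shards (uniform scaling).
if throttled > 0:
    summary = kinesis.describe_stream_summary(StreamName=STREAM_NAME)
    open_shards = summary["StreamDescriptionSummary"]["OpenShardCount"]
    kinesis.update_shard_count(
        StreamName=STREAM_NAME,
        TargetShardCount=open_shards + 2,
        ScalingType="UNIFORM_SCALING",
    )
```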

Thanks @grzegorzewald - it turns out that Kinesis was not the problem (from what I have found so far); instead it had to do with scaling the number of EC2 instances running the Scala app.

We had started to see latency in getting data into Elasticsearch, but have since put each of the Scala apps into an ASG, and data is flowing in.

It is weird that we would get a provisioned-throughput error, but I am thinking that some of the problem came from trying to read from and write to so many shards with a single worker, rather than putting the app behind an ASG and scaling the number of workers based on the amount of data.

That makes sense @13scoobie - for our Managed Service customers we choose an instance type that can comfortably support 2 shard workers (within 1 KCL instance), and then we basically need shard count / 2 as the number of EC2 instances hosting that app within the given ASG.
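As a quick illustration of that rule of thumb (the 2-shards-per-instance figure is the assumption stated above, and the function name is just for illustration):

```python
# Sizing rule sketch: one KCL instance per ~2 shards,
# so desired ASG capacity = ceil(shard_count / 2).
import math

def desired_asg_capacity(shard_count: int, shards_per_instance: int = 2) -> int:
    """Number of EC2 instances needed to host the shard workers."""
    return math.ceil(shard_count / shards_per_instance)

# With the 9-shard stream mentioned earlier this works out to 5 instances.
print(desired_asg_capacity(9))  # -> 5
```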


Awesome, thank you! Going to update my ASGs now.

Cool! Remember that with the KCL you can have more shards than workers (though of course that could lead to lag), but more workers than shards will leave workers sitting idle.