Kinesis stream enrich failing


#1

Hey guys - in the past few days I've started running into some issues with our real-time Stream Enrich.

I have a collection of errors that I keep running into and was hoping you could provide some insight:

ERROR com.snowplowanalytics.snowplow.enrich.kinesis.sinks.KinesisSink - Writing failed. java.util.concurrent.TimeoutException: Futures timed out after [10 seconds]

ERROR com.snowplowanalytics.snowplow.enrich.kinesis.sinks.KinesisSink - Writing failed. com.amazonaws.AmazonClientException: Unable to marshall request to JSON: Java heap space

I am running this on an EC2 instance (m4.large) against a Kinesis stream with 9 shards.
Config:

enrich {
  source = "kinesis"
  sink = "kinesis"
  aws {
    access-key: "default"
    secret-key: "default"
  }
  streams {
    in: {
      raw: "good"
      maxRecords: 100
      buffer: {
        byte-limit: 1000000
        record-limit: 50
        time-limit: 30000
      }
    }
    out: {
      enriched: "enrich"
      bad: "bad"

      backoffPolicy: {
        minBackoff: 10000
        maxBackoff: 600000
      }
    }
    app-name: "dynamo-table"
    initial-position = "TRIM_HORIZON"
    region: "region"
  }
}

I have also tried adjusting the backoff policy values both up and down, as well as tuning maxRecords and the buffer sizes.
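For illustration only, these are the kinds of knobs I was changing (example values, not the exact ones I tried):

buffer: {
  byte-limit: 500000   # smaller flushes -> smaller PutRecords requests
  record-limit: 25
  time-limit: 10000
}
backoffPolicy: {
  minBackoff: 3000
  maxBackoff: 300000
}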

Thanks!


#2

Hi @13scoobie,

As per the description of one of your errors:

You have reached the maximum capacity of your Kinesis stream. In CloudWatch you can check whether you are actually losing data. I would suggest adding a shard or two ASAP. Note that the enriched stream has bigger throughput requirements than the raw one, due to the extra data added by enrichment itself.
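If you want to script that check, something along these lines should work (rough sketch using the AWS SDK for Java v1; the stream name and target shard count are placeholders for your own setup):

import java.util.Date
import scala.collection.JavaConverters._
import com.amazonaws.services.cloudwatch.AmazonCloudWatchClientBuilder
import com.amazonaws.services.cloudwatch.model.{Dimension, GetMetricStatisticsRequest}
import com.amazonaws.services.kinesis.AmazonKinesisClientBuilder
import com.amazonaws.services.kinesis.model.{ScalingType, UpdateShardCountRequest}

object KinesisThrottleCheck {
  def main(args: Array[String]): Unit = {
    val streamName = "enrich" // placeholder: the enriched (output) stream

    // Sum the throttled-write datapoints for the last hour
    val cw = AmazonCloudWatchClientBuilder.defaultClient()
    val stats = cw.getMetricStatistics(
      new GetMetricStatisticsRequest()
        .withNamespace("AWS/Kinesis")
        .withMetricName("WriteProvisionedThroughputExceeded")
        .withDimensions(new Dimension().withName("StreamName").withValue(streamName))
        .withStartTime(new Date(System.currentTimeMillis() - 3600 * 1000L))
        .withEndTime(new Date())
        .withPeriod(300)
        .withStatistics("Sum"))
    val throttled = stats.getDatapoints.asScala.map(_.getSum.doubleValue).sum
    println(s"Throttled writes in the last hour: $throttled")

    // Any throttling means the stream is at capacity - scale it up
    if (throttled > 0) {
      AmazonKinesisClientBuilder.defaultClient().updateShardCount(
        new UpdateShardCountRequest()
          .withStreamName(streamName)
          .withTargetShardCount(11) // placeholder target
          .withScalingType(ScalingType.UNIFORM_SCALING))
    }
  }
}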


#3

Thanks @grzegorzewald - it turns out that Kinesis was not the problem (from what I have found so far); instead it had to do with scaling the number of EC2 instances running the Scala app.

We had started to see latency in getting data into Elasticsearch, but have since put each of the Scala apps into an ASG, and data is flowing in.

It is weird that we would get a provisioned-throughput error, but I am thinking some of the problem came from trying to read from and write to so many shards with a single worker, rather than putting the workers behind an ASG and scaling their number based on the amount of data.


#4

That makes sense @13scoobie - for our Managed Service customers we choose an instance type that can comfortably support 2 shard workers (within 1 KCL instance), and then we basically need shard count / 2 as the number of EC2 instances hosting that app within the given ASG.
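As a rough sketch of that rule applied to a 9-shard stream (the ASG name below is a placeholder), setting the desired capacity could look something like this with the AWS SDK for Java v1:

import com.amazonaws.services.autoscaling.AmazonAutoScalingClientBuilder
import com.amazonaws.services.autoscaling.model.SetDesiredCapacityRequest

object EnrichAsgSizing {
  def main(args: Array[String]): Unit = {
    val shardCount = 9
    // 2 shards per worker instance -> ceil(9 / 2) = 5 instances
    val desired = math.ceil(shardCount / 2.0).toInt

    AmazonAutoScalingClientBuilder.defaultClient().setDesiredCapacity(
      new SetDesiredCapacityRequest()
        .withAutoScalingGroupName("stream-enrich-asg") // placeholder ASG name
        .withDesiredCapacity(desired))
  }
}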


#5

Awesome, thank you! Going to update my ASGs now.


#6

Cool! Remember that with the KCL you can have more shards than workers (though of course that could lead to lag), but more workers than shards will leave workers sitting idle.