S3 loader writes the same number of records

#1

We have varying levels of collector traffic throughout the day, and this is confirmed both at the Kinesis data stream and at the HTTPS level. However, we are seeing that the S3 loader writes the same number of records every hour.

This makes the manifest after ETL extremely uniform, down to the thousand, which looks very suspicious.

We run the S3 loader as an ECS scheduled job which starts and stops the S3 loader Docker container on each run.

Is our configuration correct?

kinesis {
  # LATEST: most recent data.
  # TRIM_HORIZON: oldest available data.
  # "AT_TIMESTAMP": Start from the record at or after the specified timestamp
  # Note: This only affects the first run of this application on a stream.
  initialPosition = "TRIM_HORIZON"

  # Needs to be specified when initialPosition is "AT_TIMESTAMP".
  # Timestamp format needs to be "yyyy-MM-ddTHH:mm:ssZ".
  # Ex: "2017-05-17T10:00:00Z"
  # Note: Time needs to be specified in UTC.
  initialTimestamp = "{{timestamp}}"

  # Maximum number of records to read per GetRecords call
  maxRecords = 10

  region = "us-east-1"

  # "appName" is used for a DynamoDB table to maintain stream state.
  appName = "s3-loader"

  ## Optional endpoint url configuration to override aws kinesis endpoints,
  ## this can be used to specify local endpoints when using localstack
  # customEndpoint = {{kinesisEndpoint}}
}

streams {
  # Input stream name
  inStreamName = "snowplow-raw-good"

  # Stream for events for which the storage process fails
  outStreamName = "snowplow-s3-bad"

  # Events are accumulated in a buffer before being sent to S3.
  # The buffer is emptied whenever:
  # - the combined size of the stored records exceeds byteLimit or
  # - the number of stored records exceeds recordLimit or
  # - the time in milliseconds since it was last emptied exceeds timeLimit
  buffer {
    byteLimit = 104857600 # 100 MB
    recordLimit = 1000000 # 1M records
    timeLimit = 3600000 # 1 hour
  }
}

s3 {
  region = "us-east-1"
  bucket = "test-snowplow"
  # optional directory structure to use while storing data on s3, you can place date format strings which will be
  # replaced with date time values
  # eg: directoryPattern = "enriched/good/run_yr={YYYY}/run_mo={MM}/run_dy={DD}"
  # directoryPattern = "enriched/good/run_yr={YYYY}/run_mo={MM}/run_dy={DD}"
  # directoryPattern = ""

  # Format is one of lzo or gzip.
  # Note that gzip can only be used for the enriched data stream.
  format = "lzo"

  # Maximum Timeout that the application is allowed to fail for (in milliseconds)
  maxTimeout = 10000

  ## Optional endpoint url configuration to override aws s3 endpoints,
  ## this can be used to specify local endpoints when using localstack
  # customEndpoint = {{kinesisEndpoint}}
}
#2

Correction to my previous post: I do run this as an ECS task, but it's not scheduled.

Follow-up questions:

  buffer {
    byteLimit = 104857600 # 100 MB
    recordLimit = 1000000 # 1M records
    timeLimit = 3600000 # 1 hour
  }

Q1. Does anyone know whether the byte limit applies to the uncompressed or the compressed size?
Q2. Is there a way to enable more logging showing which buffer rule was hit to trigger a write?
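
To make Q2 more concrete, here is a rough sketch of how we read the three buffer limits interacting. This is not the loader's code, and the average record size and ingest rate are assumed numbers rather than our real traffic:

BYTE_LIMIT = 104_857_600      # 100 MB, from byteLimit
RECORD_LIMIT = 1_000_000      # from recordLimit
TIME_LIMIT_MS = 3_600_000     # 1 hour, from timeLimit

AVG_RECORD_BYTES = 800        # assumption: average raw record size
RECORDS_PER_SEC = 400         # assumption: steady ingest rate

# Time (ms) until each limit would be reached at that steady rate
ms_to_byte_limit = BYTE_LIMIT / (AVG_RECORD_BYTES * RECORDS_PER_SEC) * 1000
ms_to_record_limit = RECORD_LIMIT / RECORDS_PER_SEC * 1000
ms_to_time_limit = TIME_LIMIT_MS

first = min(
    ("byteLimit", ms_to_byte_limit),
    ("recordLimit", ms_to_record_limit),
    ("timeLimit", ms_to_time_limit),
    key=lambda pair: pair[1],
)
print(f"first limit hit: {first[0]} after ~{first[1] / 1000:.0f} s")

With those assumed numbers byteLimit would trip first after roughly five and a half minutes, but we currently have no visibility into which rule actually fires in production.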