Can't see any data sunk into S3


#1

I have deployed kinesis-s3-sink. After deploying it on an EC2 instance it
runs successfully and repeatedly logs “Successfully published X datums”.
But I don't see any data in my S3 bucket. Is there something I am
missing? If not, is the sinking process realtime, or does the initial
setup take a long time?


#2

Hi @shailesh17mar,

Could you possibly post the configuration with sensitive details removed here? The “successfully published X datums” message tends to be something that’s output by the Amazon Kinesis client when publishing metrics to Cloudwatch, rather than S3.


#3

This is my sink config. I am a newbie with AWS and Snowplow, so pardon my stupid mistakes.
# Default configuration for kinesis-lzo-s3-sink

sink {

  # The following are used to authenticate for the Amazon Kinesis sink.
  #
  # If both are set to 'default', the default provider chain is used
  # (see http://docs.aws.amazon.com/AWSJavaSDK/latest/javadoc/com/amazonaws/auth/DefaultAWSCredentialsProviderChain.html)
  #
  # If both are set to 'iam', use AWS IAM Roles to provision credentials.
  #
  # If both are set to 'env', use environment variables AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY
  aws {
    access-key: "iam
    secret-key: "iam"
  }

  kinesis {
    in {
      # Kinesis input stream name
      stream-name: "raw-events-pipe"

      # LATEST: most recent data.
      # TRIM_HORIZON: oldest available data.
      # Note: This only affects the first run of this application
      # on a stream.
      initial-position: "TRIM_HORIZON"

      # Maximum number of records to read per GetRecords call     
      max-records: "100"
    }


    out {
      # Stream for events for which the storage process fails
      stream-name: "enriched-events-pipe"
    }
    region:"us-west-2"
    # "app-name" is used for a DynamoDB table to maintain stream state.
  # You can set it automatically using: "SnowplowLzoS3Sink-${sink.kinesis.in.stream-name}"
    app-name: "snowplow-sink"
  }

  s3 {
    # If using us-east-1, then endpoint should be "http://s3.amazonaws.com".
    # Otherwise "http://s3-<<region>>.s3.amazonaws.com", e.g.
    region:"us-west-2"
    bucket: "stream-sink"

    # Format is one of lzo or gzip
    # Note, that you can use gzip only for enriched data stream.
    format: "lzo"

    # Maximum Timeout that the application is allowed to fail for
    max-timeout: "30000"
  }

  # Events are accumulated in a buffer before being sent to S3.
  # The buffer is emptied whenever:
  # - the combined size of the stored records exceeds byte-limit or
  # - the number of stored records exceeds record-limit or
  # - the time in milliseconds since it was last emptied exceeds time-limit
  buffer {
    byte-limit: 128000000
    record-limit: 40000
    time-limit: 7200000
  }

  # Set the Logging Level for the S3 Sink
  # Options: ERROR, WARN, INFO, DEBUG, TRACE
  logging {
    level: "INFO"
  }

}

#4

No worries. There are a few minor things that I’ve noticed, though they may not be the root cause.

  1. On line 12, there’s a missing closing quote for “iam”.
  2. max-records and max-timeout should both be integers rather than strings.
  3. Your limits in the buffer are quite high (128 MB, or 40000 records, or 120 minutes). This part of the configuration controls when buffered Kinesis records are flushed to S3; it seems possible that you are never hitting these buffer limits, in which case you would not see any data being flushed to S3.

#5

Point 1 was a mistake made while pasting here. I’ve fixed the 2nd point.
What should the configuration be for just testing Snowplow? If I set it to (1, 1, 1), would it work then?


#6

Perhaps try setting byte-limit to 50000 bytes, record-limit to 100 and time-limit to 60000 (60 seconds). That way, if everything is working correctly (with a small number of test events), you should see data arrive in S3 (or at least an attempted write to S3) approximately every 60 seconds.
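For reference, those suggested test values would slot into the buffer section of the sink config above like this:

```hocon
# Test-friendly buffer settings: flush frequently so data shows up
# in S3 quickly even with a trickle of events.
buffer {
  byte-limit: 50000    # flush once ~50 KB has accumulated, or
  record-limit: 100    # once 100 records have accumulated, or
  time-limit: 60000    # once 60 seconds have passed since the last flush
}
```

Whichever limit is hit first triggers the flush, so with low traffic the time-limit is the one that matters.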


#7
WARNING: idleTimeBetweenReads is greater than bufferTimeMillisecondsLimit. For best results, ensure that bufferTimeMillisecondsLimit is more than or equal to idleTimeBetweenReads 

WARNING: Received configuration for both region name as us-west-2, and Amazon Kinesis endpoint as https://kinesis.us-west-2.amazonaws.com. Amazon Kinesis endpoint will overwrite region name.

I get these warnings, but after that it starts publishing metrics to CloudWatch. No sign of S3 whatsoever. Do I need to set some permissions on the bucket as well? My S3 bucket is still empty. I have set the parameters as suggested, still no change. :confused:

Aug 15, 2016 5:18:11 AM com.amazonaws.services.kinesis.metrics.impl.CWPublisherRunnable publishMetrics
INFO: Successfully published 13 datums.
Aug 15, 2016 5:18:21 AM com.amazonaws.services.kinesis.metrics.impl.CWPublisherRunnable publishMetrics
INFO: Successfully published 13 datums.
Aug 15, 2016 5:18:31 AM com.amazonaws.services.kinesis.metrics.impl.CWPublisherRunnable publishMetrics
INFO: Successfully published 13 datums.
Aug 15, 2016 5:18:41 AM com.amazonaws.services.kinesis.metrics.impl.CWPublisherRunnable publishMetrics
INFO: Successfully published 13 datums.
Aug 15, 2016 5:18:51 AM com.amazonaws.services.kinesis.metrics.impl.CWPublisherRunnable publishMetrics
INFO: Successfully published 13 datums.
Aug 15, 2016 5:19:01 AM com.amazonaws.services.kinesis.metrics.impl.CWPublisherRunnable publishMetrics
INFO: Successfully published 18 datums.
Aug 15, 2016 5:19:06 AM com.amazonaws.services.kinesis.clientlibrary.lib.worker.Worker info
INFO: Current stream shard assignments: shardId-000000000000
Aug 15, 2016 5:19:06 AM com.amazonaws.services.kinesis.clientlibrary.lib.worker.Worker info
INFO: Sleeping ...

#8

Hey @mike, it’s working now. Apparently my collector wasn’t working: I had stopped the machine, so its public IP changed, and because of that my test Android application was not streaming events to it. My fault.


#9

Cool, glad to hear it’s now working!


#10

I’m facing the same issue as mentioned above.

The log shows no errors:

Apr 24, 2017 11:56:48 AM com.amazonaws.services.kinesis.metrics.impl.CWPublisherRunnable publishMetrics
INFO: Successfully published 4 datums.
Apr 24, 2017 11:56:58 AM com.amazonaws.services.kinesis.metrics.impl.CWPublisherRunnable publishMetrics
INFO: Successfully published 4 datums.
Apr 24, 2017 11:57:09 AM com.amazonaws.services.kinesis.metrics.impl.CWPublisherRunnable publishMetrics
INFO: Successfully published 4 datums.
Apr 24, 2017 11:57:19 AM com.amazonaws.services.kinesis.metrics.impl.CWPublisherRunnable publishMetrics
INFO: Successfully published 14 datums.
Apr 24, 2017 11:57:29 AM com.amazonaws.services.kinesis.metrics.impl.CWPublisherRunnable publishMetrics
INFO: Successfully published 4 datums.
Apr 24, 2017 11:57:39 AM com.amazonaws.services.kinesis.metrics.impl.CWPublisherRunnable publishMetrics
INFO: Successfully published 4 datums.
Apr 24, 2017 11:57:42 AM com.amazonaws.services.kinesis.clientlibrary.lib.worker.Worker info
INFO: No activities assigned
Apr 24, 2017 11:57:42 AM com.amazonaws.services.kinesis.clientlibrary.lib.worker.Worker info
INFO: Sleeping …

However, no data is available in S3. My configuration is similar to the one above, including the parameters suggested for testing.

I’ve manually checked the input kinesis stream for records and they exist.

What could be the issue?


#11

I have found the issue. I was using the same app-name for both my collector and enrich components (the app-name is used to point to a DynamoDB table that keeps track of stream state). Changing the app-name in my enrich component fixed the issue.
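In other words, each KCL-based component needs its own app-name, since each name maps to a distinct DynamoDB lease table. A sketch of the idea (the names are illustrative, and the surrounding config structure differs per component):

```hocon
# In the kinesis-s3 sink config: one lease table for the sink
app-name: "snowplow-s3-sink"

# In the stream-enrich config: a different name, so enrich
# tracks its stream position in its own DynamoDB table
app-name: "snowplow-enrich"
```

If two components share an app-name, they fight over the same shard leases and one of them silently stops processing.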


#12

Useful thread guys, nice to see an example config file for the kinesis-s3 connector.

Could you give me some guidance or post an example of the minimum IAM user permissions that are required to get this working?
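(No full policy was posted in-thread, but since the sink is built on the Kinesis Client Library, a starting point would need read access to the input stream, read/write access to the KCL's DynamoDB lease table, CloudWatch metrics publishing, and write access to the bucket. The ARNs and account ID below are placeholders; treat this as an unverified sketch, not an official minimal policy.)

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "ReadInputStream",
      "Effect": "Allow",
      "Action": [
        "kinesis:DescribeStream",
        "kinesis:GetShardIterator",
        "kinesis:GetRecords",
        "kinesis:ListStreams"
      ],
      "Resource": "arn:aws:kinesis:us-west-2:123456789012:stream/raw-events-pipe"
    },
    {
      "Sid": "KclLeaseTable",
      "Effect": "Allow",
      "Action": [
        "dynamodb:CreateTable",
        "dynamodb:DescribeTable",
        "dynamodb:GetItem",
        "dynamodb:PutItem",
        "dynamodb:UpdateItem",
        "dynamodb:DeleteItem",
        "dynamodb:Scan"
      ],
      "Resource": "arn:aws:dynamodb:us-west-2:123456789012:table/snowplow-sink"
    },
    {
      "Sid": "PublishMetrics",
      "Effect": "Allow",
      "Action": ["cloudwatch:PutMetricData"],
      "Resource": "*"
    },
    {
      "Sid": "WriteToBucket",
      "Effect": "Allow",
      "Action": ["s3:PutObject"],
      "Resource": "arn:aws:s3:::stream-sink/*"
    }
  ]
}
```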


#13

Hi,

My problem is kinda similar to this one. I have a realtime application producing event data onto a Kinesis stream, and I used the KCL to read the stream and write each event as a file to S3. The operation completes successfully and I can see some files in S3, but I do not see all of the event data that is being written.

My ingestion rate is ~3 events/sec, and I get the success message after every event, but files in S3 are created (or become visible) only once every 3-4 minutes. I read about eventual consistency, but I never see the files that were supposedly written to S3 (even after the application has been running for 24 hrs). Any insight into this?