Thanks for the recommendation @alex. I've been testing several configurations and here's one that seems to work for now -
Kinesis S3 sink config.hocon
byte-limit: 1000000000 # 1GB
record-limit: 600000 # 600K
time-limit: 10800000 # 180 mins
Java Heap size - 24GB
I also tried a time limit of 4 hrs, but that sometimes leads to out-of-memory errors. The raw file size in S3 ranges from 250 MB to 750 MB, depending on traffic and time of day.
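As a quick sanity check on the units above, here is a minimal Python sketch. It assumes `time-limit` is in milliseconds (consistent with the "180 mins" comment). It also assumes, without confirmation from the sink itself, that the buffer is flushed as soon as any one of the three limits is reached. The function names are hypothetical and only for illustration.

```python
# Values copied from the sink config above.
BYTE_LIMIT = 1_000_000_000    # bytes (1 GB)
RECORD_LIMIT = 600_000        # records
TIME_LIMIT_MS = 10_800_000    # milliseconds


def ms_to_minutes(ms):
    """Convert milliseconds to minutes."""
    return ms / 60_000


def hours_to_ms(hours):
    """Convert hours to milliseconds, e.g. for a 4 hr time limit."""
    return int(hours * 3_600_000)


def should_flush(buffered_bytes, buffered_records, elapsed_ms):
    """ASSUMPTION: the buffer flushes when any single limit is reached."""
    return (
        buffered_bytes >= BYTE_LIMIT
        or buffered_records >= RECORD_LIMIT
        or elapsed_ms >= TIME_LIMIT_MS
    )


print(ms_to_minutes(TIME_LIMIT_MS))  # time limit in minutes
print(hours_to_ms(4))                # the 4 hr setting, in milliseconds
```

Under that assumption, files of 250 MB to 750 MB would mean the byte limit is not the one being hit, so the time or record limit is what usually triggers the flush.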
EMR config.yml -
core_instance_ebs: # Optional. Attach an EBS volume to each core instance.
  volume_size: 500 # Gigabytes
  volume_iops: 400 # Optional. Will only be used if volume_type is "io1"
  ebs_optimized: true # Optional. Will default to true
task_instance_count: 2 # Increase to use spot instances
task_instance_bid: x.xx # In USD. Adjust bid, or leave blank for non-spot-priced (i.e. on-demand) task instances
Here are some entries from the manifest with the configs above -
etl_tstamp               commit_tstamp               event_count  shredded_cardinality
2017-03-08 18:07:12.698  2017-03-08 23:36:01.769959  4153081      26
2017-03-09 18:31:42.146  2017-03-09 22:37:26.136232  8443300      15
2017-03-11 14:11:17.092  2017-03-11 17:09:09.50273   13051161     16
2017-03-12 15:37:12.826  2017-03-12 23:06:16.329663  36738025     16
2017-03-13 03:55:09.428  2017-03-13 06:56:47.938757  13362932     15
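To compare runs, the manifest rows can be turned into an elapsed time and a rough throughput. The Python sketch below assumes the first timestamp column marks the start of the run and `commit_tstamp` marks when the load was committed, so the difference approximates the end-to-end run time. That interpretation, and the helper name `run_stats`, are assumptions rather than anything stated in the manifest itself.

```python
from datetime import datetime

# (start timestamp, commit_tstamp, event_count) copied from the manifest rows above.
ROWS = [
    ("2017-03-08 18:07:12.698", "2017-03-08 23:36:01.769959", 4153081),
    ("2017-03-09 18:31:42.146", "2017-03-09 22:37:26.136232", 8443300),
    ("2017-03-11 14:11:17.092", "2017-03-11 17:09:09.50273", 13051161),
    ("2017-03-12 15:37:12.826", "2017-03-12 23:06:16.329663", 36738025),
    ("2017-03-13 03:55:09.428", "2017-03-13 06:56:47.938757", 13362932),
]

FMT = "%Y-%m-%d %H:%M:%S.%f"  # %f accepts 1 to 6 fractional digits when parsing


def run_stats(start, commit, events):
    """Return (elapsed hours, events per hour) for one manifest row."""
    elapsed = datetime.strptime(commit, FMT) - datetime.strptime(start, FMT)
    hours = elapsed.total_seconds() / 3600
    return hours, events / hours


for start, commit, events in ROWS:
    hours, rate = run_stats(start, commit, events)
    print(f"{start}  {hours:5.2f} h  {rate:12,.0f} events/h")
```

Read this way, elapsed time does not scale linearly with event count across these five runs, which is worth keeping in mind when sizing for peak traffic.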
We're using all enrichments under -
These configs will need to change for holiday traffic (Black Friday, Christmas, etc.), and we'll be fine-tuning them further.
Please let me know if you have any thoughts.