Snowplow R88 Angkor Wat released

We are pleased to announce the release of Snowplow R88 Angkor Wat.

This release includes important configuration refactoring as well as long-awaited DynamoDB-powered cross-batch natural deduplication.


Great work, looking forward to testing it out! :slight_smile:

Question regarding DynamoDB costs: can we decrease the throughput to 1 after the EMR job finishes, and switch it back to 100 right before the next run? This should allow a significant cost reduction.

Cheers,
Bernardo

Hey @bernardosrulzon - you can, but you are limited in how many DynamoDB throughput changes you can make to a given table per day:

You can decrease the ReadCapacityUnits or WriteCapacityUnits settings for a table, but no more than four times per table in a single UTC calendar day.

So it depends a bit on how often you run the batch pipeline…

That should be enough for everyone running two batches per day. Does it make sense to update snowplow-runner.sh with an option to change the throughput configuration before and after the EMR run? That would be a one-liner with the AWS CLI.
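
For illustration, here is a minimal sketch of that before/after step using boto3 rather than the CLI. The table name, region and capacity values are placeholders, not what snowplow-runner.sh actually does; note the AWS limit on throughput decreases quoted above.

    import boto3

    dynamodb = boto3.client("dynamodb", region_name="us-east-1")  # assumed region

    def set_throughput(table_name, read_units, write_units):
        # DynamoDB only allows a limited number of throughput decreases
        # per table per UTC day (four at the time of this thread).
        dynamodb.update_table(
            TableName=table_name,
            ProvisionedThroughput={
                "ReadCapacityUnits": read_units,
                "WriteCapacityUnits": write_units,
            },
        )

    # Hypothetical manifest table name - substitute your own
    set_throughput("snowplow-event-manifest", read_units=100, write_units=100)  # before the EMR run
    # ... run the batch pipeline ...
    set_throughput("snowplow-event-manifest", read_units=1, write_units=1)      # after the EMR run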

@anton The wget URL in the post should be: https://raw.githubusercontent.com/snowplow/snowplow/master/5-data-modeling/event-manifest-populator/run.py

It’s currently pointing at the development branch, which is a 404 now.

Ah, thanks Bernardo! Fixed.

Is it possible to use the EC2 role as the credential for the DynamoDB target?

@anton The event manifest populator job is failing with the following error. Any ideas what could be causing it?

17/04/28 19:39:34 INFO Client: Deleted staging directory hdfs://ip-10-169-52-242.ec2.internal:8020/user/hadoop/.sparkStaging/application_1493407256271_0001
Exception in thread "main" org.apache.spark.SparkException: Application application_1493407256271_0001 finished with failed status
	at org.apache.spark.deploy.yarn.Client.run(Client.scala:1167)
	at org.apache.spark.deploy.yarn.Client$.main(Client.scala:1213)
	at org.apache.spark.deploy.yarn.Client.main(Client.scala)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:738)
	at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:187)
	at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:212)
	at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:126)
	at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
17/04/28 19:39:34 INFO ShutdownHookManager: Shutdown hook called
17/04/28 19:39:34 INFO ShutdownHookManager: Deleting directory /mnt/tmp/spark-684b653d-74a8-44b8-ac50-68a1b0ca3e23

@bernardosrulzon there should be a detailed traceback somewhere in the logs for application_1493407256271_0001. I believe this is either an unavailable S3 path or some permissions issue.

Thanks @anton - the underlying issue was that I was trying to load events older than R73. Setting the --since argument solved the issue.

Regarding DynamoDB: does Snowplow perform any read operations at all, or does it just rely on conditional writes to identify duplicates? I see that the consumed read capacity on my table is zero at all times. If that’s indeed the case, we can set the read throughput to 1, halving the DynamoDB costs presented in the post.

Thanks!

Hello @bernardosrulzon,

There are no explicit reads/queries/scans in the deduplication code. However, if I remember right, an insignificant amount of read capacity was consumed in our test loads. Also, I’m quite surprised it halves costs, as it should save about 8 USD per month at most.

Anyway, you can leave the read capacity at very low values to see how this impacts real costs - it won’t interrupt the job or throw any exception - so you’re free to experiment with it.

You’re right - I assumed reads and writes had the same cost per unit, but it turns out writes are 5x more expensive. The cost reduction should be something on the order of 15% then.
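
To spell out the arithmetic (assuming equal read and write capacity is provisioned, e.g. 100 units of each, and taking the 5x write/read price ratio above): reads account for 100 / (100 + 5 × 100) ≈ 17% of the throughput bill, so dropping read capacity to 1 saves roughly 15-17% rather than half.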

Btw, I’ve just run the EMR job with ReadThroughput=1 and saw no noticeable impact on run time. It should be safe to use as the default.

Thanks @bernardosrulzon! If I remember correctly, we believed that you would need some read throughput for the scenario where a material number of the conditional writes did not result in a write (because otherwise those reads are effectively “free”), but perhaps that is not in fact the case…

Exactly @alex - per DynamoDB documentation, a write operation will never consume read capacity units, even when the item already exists: http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/WorkingWithItems.html#WorkingWithItems.ConditionalUpdate
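
For anyone curious what a read-free duplicate check looks like in practice, here is a minimal boto3 sketch of such a conditional write; the table name, region and attribute names are illustrative only, not the actual Snowplow manifest schema.

    import boto3
    from botocore.exceptions import ClientError

    dynamodb = boto3.client("dynamodb", region_name="us-east-1")  # assumed region

    def try_record_event(table_name, event_id, fingerprint):
        # PutItem with a condition expression: the write succeeds only if no
        # item with this key exists yet. Per the DynamoDB docs linked above,
        # this consumes write capacity only - never read capacity.
        try:
            dynamodb.put_item(
                TableName=table_name,
                Item={
                    "eventId": {"S": event_id},        # hypothetical attribute names
                    "fingerprint": {"S": fingerprint},
                },
                ConditionExpression="attribute_not_exists(eventId)",
            )
            return True   # first time this event has been seen
        except ClientError as e:
            if e.response["Error"]["Code"] == "ConditionalCheckFailedException":
                return False  # duplicate - the item was already in the manifest
            raise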
