Snowplow R88 Angkor Wat released


#1

We are pleased to announce the release of Snowplow R88 Angkor Wat.

This release includes important configuration refactoring as well as long-awaited DynamoDB-powered cross-batch natural deduplication.


#2

Great work, looking forward to testing it out! :slight_smile:

Question regarding DynamoDB costs - can we decrease throughput to 1 after EMR runs, and switch it back to 100 right before the next EMR run? This should allow a great cost reduction.

Cheers,
Bernardo


#3

Hey @bernardosrulzon - you can, but you are limited in how many DynamoDB throughput changes you can make to a given table per day:

You can decrease the ReadCapacityUnits or WriteCapacityUnits settings for a table, but no more than four times per table in a single UTC calendar day.

So it depends a bit on how often you run the batch pipeline…

http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/Limits.html
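To make the "depends on how often you run" point concrete, here is a minimal sketch that checks a schedule against the limit quoted above. It assumes one scale-down after each pipeline run and that all runs fall within the same UTC day; the function and constant names are only illustrative.

```python
# Limit as quoted above: at most four capacity decreases per table per UTC day.
MAX_DECREASES_PER_UTC_DAY = 4


def decreases_needed(runs_per_day: int) -> int:
    """Assumed schedule: scale up before each run, scale down once after it."""
    return runs_per_day


def fits_decrease_limit(runs_per_day: int,
                        limit: int = MAX_DECREASES_PER_UTC_DAY) -> bool:
    """True if scaling down after every run stays within the daily limit."""
    return decreases_needed(runs_per_day) <= limit


for runs in (2, 4, 6):
    print(runs, "runs/day fits the limit:", fits_decrease_limit(runs))
```

Under these assumptions, scaling down after every run works for up to four runs per UTC day; beyond that, some runs would have to leave the throughput raised.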


#4

Should be enough for everyone running 2 batches per day. Does it make sense to update snowplow-runner.sh with an option to change the throughput config before and after EMR? That would be a one-liner with the AWS CLI.
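As a rough sketch of what such a step could look like, the snippet below builds the `aws dynamodb update-table` call and prints it by default instead of running it. The table name and the helper names are hypothetical, and this is not part of snowplow-runner.sh; it only illustrates the idea under those assumptions.

```python
import subprocess


def update_throughput_cmd(table: str, read_units: int, write_units: int) -> list:
    """Build the AWS CLI call that changes a table's provisioned throughput."""
    return [
        "aws", "dynamodb", "update-table",
        "--table-name", table,
        "--provisioned-throughput",
        f"ReadCapacityUnits={read_units},WriteCapacityUnits={write_units}",
    ]


def set_throughput(table: str, read_units: int, write_units: int,
                   dry_run: bool = True) -> list:
    """Print the command (dry run) or execute it with the AWS CLI."""
    cmd = update_throughput_cmd(table, read_units, write_units)
    if dry_run:
        print(" ".join(cmd))
    else:
        # Requires the AWS CLI to be installed and credentials configured.
        subprocess.run(cmd, check=True)
    return cmd


TABLE = "snowplow-event-manifest"  # hypothetical table name

set_throughput(TABLE, 100, 100)  # right before the EMR run
set_throughput(TABLE, 1, 1)      # after the EMR run has finished
```

Passing `dry_run=False` would actually invoke the CLI, so the scale-down call counts against the daily decrease limit mentioned earlier in the thread.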


#5

@anton The wget URL on the post should be: https://raw.githubusercontent.com/snowplow/snowplow/master/5-data-modeling/event-manifest-populator/run.py

It’s currently pointing at the development branch, which is a 404 now.


#6

Ah, thanks Bernardo! Fixed.


#7

Is it possible to use the EC2 Role as credential for the DynamoDB target?


#8

@anton The event manifest populator job is failing with the following error. Any ideas what could be causing it?

17/04/28 19:39:34 INFO Client: Deleted staging directory hdfs://ip-10-169-52-242.ec2.internal:8020/user/hadoop/.sparkStaging/application_1493407256271_0001
Exception in thread "main" org.apache.spark.SparkException: Application application_1493407256271_0001 finished with failed status
	at org.apache.spark.deploy.yarn.Client.run(Client.scala:1167)
	at org.apache.spark.deploy.yarn.Client$.main(Client.scala:1213)
	at org.apache.spark.deploy.yarn.Client.main(Client.scala)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:738)
	at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:187)
	at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:212)
	at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:126)
	at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
17/04/28 19:39:34 INFO ShutdownHookManager: Shutdown hook called
17/04/28 19:39:34 INFO ShutdownHookManager: Deleting directory /mnt/tmp/spark-684b653d-74a8-44b8-ac50-68a1b0ca3e23

#9

@bernardosrulzon there should be a detailed traceback somewhere in the logs for application_1493407256271_0001. I believe this is either an unavailable S3 path or some permissions issue.


#10

Thanks @anton - the underlying issue was that I was trying to load events older than R73. Setting the --since argument solved the issue.

Regarding DynamoDB: does Snowplow perform any read operations at all, or just rely on conditional writes to identify duplicates? I see that the consumed read capacity is zero on my table at all times. If that’s indeed the case, we can set the read throughput to 1, halving the Dynamo costs presented in the post.

Thanks!


#11

Hello @bernardosrulzon,

There are no explicit reads/queries/scans in the deduplication code. However, if I remember right, there was some insignificant amount of reads consumed in our test loads. Also, I’m quite surprised it halves costs, as it should save about 8 USD per month at most.

Anyway, you can leave read capacity at a very low value to see how this impacts real costs - it won’t interrupt the job or throw any exception - so you’re free to experiment with it.


#12

You’re right - I assumed read and write had the same cost per unit, but it turns out writing is 5x more expensive. Cost reduction should be something in the order of 15% then.

Btw, I’ve just run the EMR job with ReadThroughput=1 and saw no noticeable impact on run time. Should be safe to use as the default.
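For anyone wanting to check the estimate, here is a minimal sketch of the arithmetic. It assumes read and write capacity are both provisioned at 100 units, that the read side is dropped to 1, and that a write unit costs 5x a read unit as stated above; costs are expressed in "read-unit equivalents", so no actual prices are needed.

```python
def relative_saving(read_before: int, read_after: int, write_units: int,
                    write_to_read_price_ratio: float = 5.0) -> float:
    """Fraction of the provisioned-capacity cost saved by lowering reads."""
    cost_before = read_before + write_to_read_price_ratio * write_units
    cost_after = read_after + write_to_read_price_ratio * write_units
    return (cost_before - cost_after) / cost_before


# Writes 5x the price of reads: (100 - 1) / (100 + 5 * 100) = 99 / 600
print(f"{relative_saving(100, 1, 100):.1%}")

# Equal price per unit (the earlier, mistaken assumption): 99 / 200
print(f"{relative_saving(100, 1, 100, write_to_read_price_ratio=1.0):.1%}")
```

With the 5x ratio the saving comes out at about 16.5%, in line with the rough 15% figure; with equal prices it would have been close to half, which is where the "halving" idea came from.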


#13

Thanks @bernardosrulzon! If I remember correctly, we believed that you would need some read throughput for the scenario where a material number of the conditional writes did not result in a write (because otherwise those reads are effectively “free”), but perhaps that is not in fact the case…


#14

Exactly @alex - per DynamoDB documentation, a write operation will never consume read capacity units, even when the item already exists: http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/WorkingWithItems.html#WorkingWithItems.ConditionalUpdate
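To illustrate the pattern being discussed, here is a small in-memory sketch of deduplication by conditional write. It is a stand-in only: it does not call DynamoDB and is not Snowplow's deduplication code, and the key fields (`event_id`, `fingerprint`) and class name are assumptions chosen for the example.

```python
class InMemoryManifest:
    """In-memory stand-in for an event manifest table (not the real service)."""

    def __init__(self):
        self._items = set()
        self.write_attempts = 0  # every call is a write attempt, never a read

    def put_if_absent(self, event_id: str, fingerprint: str) -> bool:
        """Mimic a put guarded by an 'item must not already exist' condition.

        Returns True if the item was stored (first time seen) and False if
        the condition failed because the item already exists (a duplicate).
        """
        self.write_attempts += 1
        key = (event_id, fingerprint)
        if key in self._items:
            return False
        self._items.add(key)
        return True


def deduplicate(events, manifest):
    """Keep only events whose conditional write succeeded."""
    return [e for e in events
            if manifest.put_if_absent(e["event_id"], e["fingerprint"])]


manifest = InMemoryManifest()
batch = [
    {"event_id": "a", "fingerprint": "f1"},
    {"event_id": "b", "fingerprint": "f2"},
    {"event_id": "a", "fingerprint": "f1"},  # duplicate of the first event
]
unique = deduplicate(batch, manifest)
print(len(unique), "unique events from", manifest.write_attempts, "write attempts")
```

The point of the sketch is that duplicates are detected by the failed condition on the write itself, so the job never issues a separate read to look an event up.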