The guide you are referring to is not up-to-date with the latest Snowplow releases. In particular, R87 moved the "archive_raw" step (step 10) into the EMR cluster:
> we have migrated the archival code for raw collector payloads from EmrEtlRunner into the EMR cluster itself, where the work is performed by the S3DistCp distributed tool. This should reduce the strain on your server running EmrEtlRunner, and should improve the speed of that step. Note that as a result of this, the raw files are now archived in the same way as the enriched and shredded files, using `run=` sub-folders.
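To illustrate, the archived raw files now end up under timestamped `run=` sub-folders, roughly like the following (the bucket name and file names here are just placeholders):

```
s3://<raw-archive-bucket>/run=2017-03-26-22-05-01/part-00000.lzo
s3://<raw-archive-bucket>/run=2017-03-26-22-05-01/part-00001.lzo
```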
As a result of this change, the recovery step that previously used `--skip staging,emr` should now use `--skip staging,enrich,shred`, which should also work for earlier releases.
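For example, a recovery run would then look something like this (the config and resolver paths are placeholders for your own files):

```bash
./snowplow-emr-etl-runner \
  --config config/config.yml \
  --resolver config/iglu_resolver.json \
  --skip staging,enrich,shred
```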
Running the pipeline with `--skip staging,enrich` fails at shredding because no enriched files are available in HDFS once the EMR cluster has been terminated. Hence, you need to delete "enriched:good" and spin up the EMR cluster again with `--skip staging`.
As for deduplication handling, the events in the manifest table are tracked by means of:

- Event id - used to identify the event
- Event fingerprint - used in conjunction with the event id to identify natural duplicates
- ETL timestamp - used to check whether a previous Hadoop Shred run was aborted and the event is being reprocessed

together with the conditional update feature of DynamoDB.
Thus, it is easy to determine whether the same event (same event id and event fingerprint) has already been processed, and in which run (ETL timestamp). Note that reprocessing the events generates a new ETL timestamp. The cross-batch deduplication process is described here: https://github.com/snowplow/snowplow/wiki/Relational-Database-Shredder#33-cross-batch-natural-de-duplication
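To make the conditional-update mechanics more concrete, here is a minimal sketch of the kind of conditional write the shredder performs against the manifest, assuming a DynamoDB table keyed by event id and fingerprint. This is not the actual Hadoop Shred code (which is written in Scala), and the table and attribute names (`event-manifest`, `eventId`, `fingerprint`, `etlTime`) are illustrative assumptions:

```python
import boto3
from botocore.exceptions import ClientError

# Hypothetical manifest table, keyed by (eventId, fingerprint)
manifest = boto3.resource("dynamodb").Table("event-manifest")

def try_claim_event(event_id, fingerprint, etl_tstamp):
    """Record the event in the manifest for this ETL run.

    Returns True if the event is new or belongs to an aborted run being
    reprocessed (same ETL timestamp); returns False if another run has
    already processed it, i.e. it is a cross-batch natural duplicate.
    """
    try:
        manifest.put_item(
            Item={"eventId": event_id, "fingerprint": fingerprint, "etlTime": etl_tstamp},
            # Conditional update: only succeed if no record exists yet for this
            # (eventId, fingerprint) pair, or if the existing record carries the
            # same ETL timestamp (i.e. a previous aborted attempt of the same run).
            ConditionExpression="attribute_not_exists(eventId) OR etlTime = :etl",
            ExpressionAttributeValues={":etl": etl_tstamp},
        )
        return True
    except ClientError as e:
        if e.response["Error"]["Code"] == "ConditionalCheckFailedException":
            return False  # seen in an earlier run with a different ETL timestamp
        raise
```

The key part is the condition expression: a re-run carrying the same ETL timestamp is allowed through, while a genuine duplicate from an earlier run fails the condition and is dropped.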