What to do with raw data after EmrEtlRunner completes?

zeldein · June 27, 2018, 9:05am

After EmrEtlRunner completes its task (shredding, importing to Postgres), it does not delete the raw-in data. My question is, do I need to schedule a job to delete the raw data? Or, should I just leave the raw data alone?

My worry is that if the raw data remains there, the next time EmrEtlRunner runs it will process the old data again, causing duplicate records. Or maybe it is smart enough to skip the old data?

I would really appreciate it if anyone could clear up my confusion.

mike · June 27, 2018, 9:33am

The diagram for batch processing here gives you an idea of the steps in the EMR process. As part of this (step 12), the raw data from the EMR cluster is copied back to S3 into an archive bucket.

Raw data isn’t deleted (in case you need to reprocess or re-enrich it later), but you can delete it (or move it to Glacier or another S3 storage class) if required. EmrEtlRunner uses a ‘staging’ bucket/S3 path (step 1, where data is moved from raw-in to raw-processing) so that each file is only ever processed once. Once the data has been processed (from raw-processing), that bucket is cleared out ready for the next run (which will again move files from raw-in to raw-processing).
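
If it helps to see the idea in code, here is a minimal sketch of the staging pattern described above. It uses local directories in place of S3 buckets, and the function names (`stage`, `run_once`) are made up for illustration; this is not EmrEtlRunner's actual code, just the general shape of "move, process, archive".

```python
import shutil
import tempfile
from pathlib import Path


def stage(raw_in: Path, raw_processing: Path) -> list:
    """Move every file from raw_in into raw_processing (hypothetical helper)."""
    staged = []
    for f in sorted(raw_in.iterdir()):
        dest = raw_processing / f.name
        shutil.move(str(f), str(dest))
        staged.append(dest)
    return staged


def run_once(raw_in: Path, raw_processing: Path, archive: Path) -> list:
    """One simulated run: stage, 'process', then archive the raw files."""
    staged = stage(raw_in, raw_processing)
    # Stand-in for the real enrich/shred/load work: just record the file names.
    processed = [p.name for p in staged]
    # Archive the raw files, which leaves raw_processing empty for the next run.
    for p in staged:
        shutil.move(str(p), str(archive / p.name))
    return processed


if __name__ == "__main__":
    root = Path(tempfile.mkdtemp())
    raw_in, raw_processing, archive = (
        root / name for name in ("raw-in", "raw-processing", "archive")
    )
    for d in (raw_in, raw_processing, archive):
        d.mkdir()
    (raw_in / "a.log").write_text("event 1")
    (raw_in / "b.log").write_text("event 2")

    print(run_once(raw_in, raw_processing, archive))  # ['a.log', 'b.log']
    print(run_once(raw_in, raw_processing, archive))  # [] (nothing new to process)
```

Because files are moved out of raw-in rather than copied, the second run finds nothing to stage, so nothing is processed twice. The archived copies stay around until you decide to delete them or move them to cheaper storage.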

Hope that clears it up a bit!

zeldein · June 27, 2018, 6:39pm

thanks for the super quick reply~
your input really cleared the puzzle for me !!
thanks a lot, bro!
