After EmrEtlRunner completes its task (shredding, importing to Postgres), it does not delete the raw-in data. My question is: do I need to schedule a job to delete the raw data, or should I just leave it alone?
My worry is that if the raw data remains there, the next time EmrEtlRunner runs it will process the old data again, causing duplicate records. I don't know; maybe it is smart enough to skip the old data?
I'd really appreciate it if anyone could clear up my confusion.