Reprocessing / rerunning logs after Iglu server failure for unstructured events


#1

Morning everyone,

You can read what happened in the post below. Our Iglu server was unavailable for about a week, so all the unstructured events that couldn't be validated against Iglu never made it to Redshift; the structured events were fine and loaded. Any advice on the best way to rerun the data for the last 7 days without causing duplicates in Redshift? Would having de-duplication turned on in enrichment cause only the rows that are missing from Redshift to be loaded? The other option is to just re-process everything and then run the de-duplication SQL that I saw in another post.

http://discourse.snowplowanalytics.com/t/both-r89-and-r88-are-taking-forever-to-enrich/1233


#2

Do you use POST with your trackers, or just GET?


#3

@alex I believe only GET. Would that affect rerunning the data?

And thanks for your help.


#4

Would just rerunning the data, accepting the dupes in Redshift, and then using this tutorial work?
http://discourse.snowplowanalytics.com/t/de-deduplicating-events-in-hadoop-and-redshift-tutorial/248

I don't think de-duplication in enrichment would work, since we've never had it on and it only operates within a single batch run, right? It doesn't compare what's already in Redshift against what's being run through the ETL, so we'd have to run all the logs since 6/14 as one big batch for it to work. And even then we'd still have dupes in Redshift from the previous good loads.
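For what it's worth, the linked tutorial's core idea is to remove the exact duplicates inside Redshift after the reload. A minimal sketch of that idea, assuming the standard `atomic.events` table and a reload window starting 2017-06-14 (the year and exact column names are assumptions; the tutorial's actual SQL is more thorough):

```sql
-- Sketch only: de-duplicate exact duplicate rows created by re-running logs.
-- Assumes atomic.events and collector_tstamp; adjust to your deployment.
BEGIN;

-- Rows that are byte-for-byte identical collapse to one via DISTINCT.
CREATE TEMP TABLE events_dedup AS
SELECT DISTINCT *
FROM atomic.events
WHERE collector_tstamp >= '2017-06-14';  -- assumed start of the reload window

-- Swap the de-duplicated rows back into the main table.
DELETE FROM atomic.events
WHERE collector_tstamp >= '2017-06-14';

INSERT INTO atomic.events
SELECT * FROM events_dedup;

COMMIT;
```

Note this only handles fully identical rows (the kind produced by loading the same logs twice), not "natural" duplicates with differing fields; the tutorial covers those cases properly.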


#5

Hi @mjensen - check out this documentation:

And this tutorial:

http://discourse.snowplowanalytics.com/t/using-hadoop-event-recovery-to-recover-events-with-a-missing-schema-tutorial/351

Because you are only using GETs, you won't encounter the duplication problem that can occur when recovering bad events from POST payloads (see the caveats section in the documentation).