Reprocessing / rerunning logs after Iglu server failure for unstructured events


#1

Morning everyone,

You can read what happened in the post below. Our Iglu server was unavailable for about a week, so all the unstructured events that couldn't be validated against Iglu never made it to Redshift; the structured events were fine and loaded. Any advice on the best way to rerun the data for the last 7 days without causing duplicates in Redshift? Would having de-duplication turned on in enrichment cause only the rows that are missing from Redshift to be loaded? The other option is to just re-process everything and then run the de-duplication SQL that I saw in another post.

http://discourse.snowplowanalytics.com/t/both-r89-and-r88-are-taking-forever-to-enrich/1233


#2

Do you use POST with your trackers, or just GET?


#3

@alex I believe only GET. Would that affect rerunning the data?

And thanks for your help.


#4

Would just rerunning the data, accepting the dupes in Redshift, and then using this tutorial work?
http://discourse.snowplowanalytics.com/t/de-deduplicating-events-in-hadoop-and-redshift-tutorial/248

I don't think de-duplication in enrichment would work, since we've never had it on and it only operates within a single batch run, right? It doesn't compare what's already in Redshift against what's being run through the ETL, so we'd have to run all the logs since 6/14 as one big batch for it to work. And even then we'd still have dupes in Redshift from the previous good loads.
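For what it's worth, the linked tutorial's core idea is to remove the exact duplicates inside Redshift after the reload. A minimal sketch of that idea, assuming the standard `atomic.events` table and a reload window starting 2017-06-14 (the year and exact column names are assumptions; the tutorial's actual SQL is more thorough):

```sql
-- Sketch only: de-duplicate exact duplicate rows created by re-running logs.
-- Assumes atomic.events and collector_tstamp; adjust to your deployment.
BEGIN;

-- Rows that are byte-for-byte identical collapse to one via DISTINCT.
CREATE TEMP TABLE events_dedup AS
SELECT DISTINCT *
FROM atomic.events
WHERE collector_tstamp >= '2017-06-14';  -- assumed start of the reload window

-- Swap the de-duplicated rows back into the main table.
DELETE FROM atomic.events
WHERE collector_tstamp >= '2017-06-14';

INSERT INTO atomic.events
SELECT * FROM events_dedup;

COMMIT;
```

Note this only handles fully identical rows (the kind produced by loading the same logs twice), not "natural" duplicates with differing fields; the tutorial covers those cases properly.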


#5

Hi @mjensen - check out this documentation:

And this tutorial:

http://discourse.snowplowanalytics.com/t/using-hadoop-event-recovery-to-recover-events-with-a-missing-schema-tutorial/351

Because you are only using GETs, you won't encounter the duplication problem that can occur when recovering bad events from POST payloads (see the caveats section in the documentation).