How to debug the shredding process?


#1

I’ve just wasted quite a lot of time debugging an issue with my shredding process - I created a custom JavaScript enrichment that outputs a custom type. The enrichment worked fine, but my JSON schema was incorrect, so the records were never shredded. It took me forever to locate the issue, though, because there were no errors whatsoever telling me what had happened - the bad shredded output bucket contained only 0-byte files, and the records were completely missing from the good shredded bucket.

Is there something I’m missing or have misconfigured? Should there be an error message or a bad record somewhere when validation fails during shredding? Is there any way to test just the shredding process outside of EMR, so that debugging a change doesn’t take 20-30 minutes? Thanks!
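(Editor's note: not an official Snowplow tool, but one way to shortcut the EMR feedback loop is to validate the enrichment's output against your JSON schema locally with Python's `jsonschema` library before launching a job. The schema and event below are hypothetical stand-ins for your custom type.)

```python
import jsonschema

# Hypothetical schema for a custom self-describing type; substitute
# the actual JSON schema your enrichment's output must conform to.
schema = {
    "$schema": "http://json-schema.org/draft-04/schema#",
    "type": "object",
    "properties": {
        "interactionId": {"type": "string"},
        "durationMs": {"type": "integer", "minimum": 0},
    },
    "required": ["interactionId"],
    "additionalProperties": False,
}

# Example payload as the enrichment might emit it - here durationMs
# is wrongly a string, which would make shredding validation fail.
event = {"interactionId": "abc-123", "durationMs": "250"}

# Snowplow schemas are typically draft-04, so use the matching validator.
validator = jsonschema.Draft4Validator(schema)
for err in sorted(validator.iter_errors(event), key=lambda e: list(e.path)):
    print(f"{list(e.path) if False else list(err.path)}: {err.message}")
```

Seeing the validation error printed locally in seconds beats inferring it from empty buckets after a 20-30 minute EMR run.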


#2

Hi @mrosack - that’s very unusual. Did you check the errors bucket as well - was that also empty? Every row of input should end up in either good, bad, or errors. If it doesn’t, then that’s a bug in our shredding process.

Would you be able to share an example enriched event which disappears in shredding? We can then trace that through and figure out what is going on.
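(Editor's note: a quick way to check the invariant above is to sync the relevant prefixes locally, e.g. with `aws s3 sync`, and compare row counts. This is a sketch with hypothetical local paths, not an official Snowplow utility.)

```python
import gzip
from pathlib import Path

def count_rows(prefix):
    """Count newline-delimited rows across all part files under a
    locally synced bucket prefix (gzipped or plain)."""
    total = 0
    for part in Path(prefix).rglob("part-*"):
        opener = gzip.open if part.suffix == ".gz" else open
        with opener(part, "rt") as fh:
            total += sum(1 for _ in fh)
    return total

# Hypothetical local copies of the input and the three output locations:
# enriched = count_rows("local/enriched")
# accounted = (count_rows("local/shredded/good")
#              + count_rows("local/shredded/bad")
#              + count_rows("local/errors"))
# If enriched != accounted, rows are vanishing somewhere in shredding.
```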


#3

I thought the errors bucket was the culprit at first - I didn’t have one configured, but even after I set it up it stayed empty and nothing changed. Attached are the enriched events I was testing with - if the shredding process can’t find a schema for, or fails to validate, com.ferritelabs.snowplow/touchpoint_interaction/jsonschema/1-0-0, the records disappear.
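(Editor's note: one thing worth ruling out here is the schema lookup itself. An Iglu static registry serves a key like the one above from `schemas/{vendor}/{name}/jsonschema/{version}` under the registry root. The small helper below - my own, not part of any Snowplow library - maps the key to that path so you can confirm the schema file exists and parses before blaming validation.)

```python
from pathlib import Path

def iglu_schema_path(repo_root, schema_key):
    """Map a schema key such as
    'com.ferritelabs.snowplow/touchpoint_interaction/jsonschema/1-0-0'
    to the file path an Iglu static registry serves it from."""
    vendor, name, fmt, version = schema_key.split("/")
    return Path(repo_root) / "schemas" / vendor / name / fmt / version

path = iglu_schema_path(
    "/opt/iglu-repo",  # hypothetical registry root
    "com.ferritelabs.snowplow/touchpoint_interaction/jsonschema/1-0-0",
)
print(path)
# If no file exists at this path (or it isn't valid JSON), the
# shredder's resolver will fail the lookup rather than validation.
```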

https://s3-us-west-1.amazonaws.com/snowplow-example-shred-failure/part-00001.gz

Thanks for your help!


#4

Thanks, created:

The title reflects the fact that we won’t know for sure if Spark Shred exhibits the same issue you found in Hadoop Shred.