EMR job writes empty files in enriched.bad and shredded.bad buckets


#1

I noticed some bad records in the real-time pipeline in Elasticsearch. However, when I looked into the batch-processing pipeline, I noticed that the EMR job just writes empty files in the enriched.bad and shredded.bad buckets. It looks like this:

Any idea why it might happen?


#2

Hey @tyomo4ka - getting bads in your RT pipeline but not in batch suggests some difference between the two pipelines in terms of event validation or enrichment.

What messages are you seeing in the bads in your RT pipeline in Elasticsearch?


#3

Hi @alex - I see messages like this in the bad index for the RT pipeline in Elasticsearch:

{
  "level": "error",
  "message": "error: instance type (string) does not match any allowed primitive type (allowed: [\"integer\"])\n    level: \"error\"\n    schema: {\"loadingURI\":\"#\",\"pointer\":\"/properties/age\"}\n    instance: {\"pointer\":\"/age\"}\n    domain: \"validation\"\n    keyword: \"type\"\n    found: \"string\"\n    expected: [\"integer\"]\n"
}

It’s a pretty obvious issue: it’s expected that EmrEtlRunner won’t enrich and shred the data, as the data doesn’t match the schema.

My problem is that instead of bad events in the enriched.bad bucket in S3, I get these empty files.

P.S.: I also have some empty files in shredded.bad; I guess it’s related to the same issue.
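For reference, the validation error above corresponds to a schema constraint like the following. This is a hypothetical Iglu self-describing JSON Schema fragment - the vendor, name, and surrounding fields are assumptions; only the `age: integer` constraint is taken from the error message:

```json
{
  "$schema": "http://iglucentral.com/schemas/com.snowplowanalytics.self-desc/schema/jsonschema/1-0-0#",
  "description": "Hypothetical schema fragment for the failing event",
  "self": {
    "vendor": "com.acme",
    "name": "user",
    "format": "jsonschema",
    "version": "1-0-0"
  },
  "type": "object",
  "properties": {
    "age": {
      "type": "integer"
    }
  }
}
```

An event payload that sends `"age": "42"` (a JSON string) instead of `"age": 42` would fail validation with exactly the `found: "string"` / `expected: ["integer"]` message shown above.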


#4

Hi @tyomo4ka - that’s odd. Empty files mean that no events failed validation in that run. Have you checked all the run folders?


#5

Hi @alex!

Yeah, I did check all the run folders. I was unable to find any non-empty files in the enriched.bad and shredded.bad folders.
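To double-check this systematically, one option is to sync the bad buckets locally (e.g. with `aws s3 sync`) and scan every run folder for files that actually contain data. A minimal sketch, assuming a local copy of the bucket; the folder layout is an assumption:

```python
import os

def find_nonempty_bad_files(root):
    """Walk a locally synced copy of a bad bucket (e.g. fetched with
    `aws s3 sync s3://my-bucket/enriched/bad ./enriched-bad`, a
    hypothetical path) and return the files that are not empty."""
    nonempty = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # Report only files with at least one byte of content
            if os.path.getsize(path) > 0:
                nonempty.append(path)
    return nonempty
```

If this returns an empty list across every `run=...` folder, then the batch job really did emit only zero-byte part files, rather than the bad rows hiding in a folder that was overlooked.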

In the shredded.good folder I found correctly shredded data, and I can also see correct data in Redshift. The only problem is those empty files.

I use self-describing events, in case that’s relevant.