Encoded bad rows in Elasticsearch - advanced debugging support


After loading bad events into Elasticsearch, I’m seeing that the ‘line’ field is base64 encoded (and, I believe, Thrift serialized). I think this is because I’m using the Scala collector. The tutorials about debugging bad rows don’t touch on this point, and the lines shown there do not appear to be encoded. Is there some way, when using the Scala collector, to load these events unencoded (and even deserialized)? It would go a long way toward enabling more in-depth interrogation of the errors: for instance, monitoring failing event types, error counts per application, and so on.


To clarify, the bad rows I’m loading are coming from Scala Hadoop Enrich (R85). From reading various posts here, it seems these should no longer be Thrift messages. Is there a way to control the output format of bad rows from Hadoop Enrich, possibly in a later version? While this question is about debugging, I’ll also need to recover these events. I think I’ll need them in TSV for Event Recovery too, correct?

Any advice on how to handle bad rows from Enrichment where the collector format is set to thrift is appreciated.



Indeed, the bad events generally take the form

   "line": "original raw event as a string record",
   "message": "Here's what's wrong with this event"

This is true regardless of the collector you use, though in your case the line value would be base64 encoded.

Unfortunately, we do not currently provide a way to present that data decoded.

Here’s the tutorial describing how to debug the bad data in Elasticsearch using curl: Debugging bad rows in Elasticsearch using curl (without Kibana) [tutorial]. The approach is to filter out the events we do not care about (generated by bots, resulting from OPTIONS requests, etc.). The remaining events need to be examined to determine the reason for failure. That means decoding the value in the line field (and fixing the underlying cause).
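
As a rough illustration of that decoding step, here is a minimal Python sketch. It only undoes the base64 layer and then pulls readable ASCII fragments (IP addresses, paths, query strings) out of the resulting bytes; it does not deserialize the Thrift payload, which would need a Thrift library and the collector payload schema. The function names `decode_line` and `printable_fragments` are made up for this example and are not part of any Snowplow tooling.

```python
import base64
import re
from typing import List


def decode_line(line_b64: str) -> bytes:
    """Base64-decode the `line` value of a bad row into raw bytes."""
    return base64.b64decode(line_b64)


def printable_fragments(raw: bytes, min_len: int = 4) -> List[str]:
    """Return runs of printable ASCII at least `min_len` characters long.

    This is a crude inspection aid (similar to the Unix `strings` tool),
    not a Thrift deserializer: binary field headers are simply skipped.
    """
    pattern = rb"[\x20-\x7e]{%d,}" % min_len
    return [m.decode("ascii") for m in re.findall(pattern, raw)]
```

You would pass in the `line` string copied from an Elasticsearch hit, for example `printable_fragments(decode_line(line_value))`, and scan the fragments for the querystring or body that caused the failure.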

With regard to recovering data: if the batch side of a Lambda architecture is not part of your setup (that is, the “bad” events are not saved in S3), then I’m afraid you won’t be able to recover them. Having the events in S3 allows you to recover them and feed them back into the batch pipeline with the help of Hadoop Event Recovery. I think that’s reasonable given the nature of “real-time” vs “batch” pipelines.