Debugging bad rows in Elasticsearch using curl (without Kibana) [tutorial]


In the "Debugging bad data in Elasticsearch and Kibana" guide, I walked through how to debug bad data in Elasticsearch using Kibana.

Sometimes Kibana is not an option: we’ve had occasions when it was simply too slow or unresponsive in the browser. In those situations, you can perform the same exercise by querying Elasticsearch directly with curl.

How many bad rows are there that shouldn’t be?

As discussed in the other post, a number of bad rows can safely be ignored.

We can get a count of the total number of bad rows in Elasticsearch using the following curl command:

$ curl -XGET 'https://search-snowplow-placeholder-name.us-east-1.es.amazonaws.com/snowplow/bad_rows/_search?search_type=count' -d '
{
   "query": {
      "match_all": {}
   }
}'

We can filter out bad rows we do not need to worry about and get a count of the remaining rows with the following query:

$ curl -XGET 'https://search-snowplow-placeholder-name.us-east-1.es.amazonaws.com/snowplow/bad_rows/_search?search_type=count' -d '
{
   "query": {
      "bool": {
          "must_not": [
              { "match_phrase": { "errors.message": "(/)vendor/version(/) pattern nor is a legacy /i(ce.png) request" }},
              { "match_phrase": { "errors.message": "Unrecognized event [null]" }}
          ]
      }
   }
}'

Diagnosing underlying data collection problems

Now that we know how many bad rows we need to diagnose, let’s start by identifying the first error message. We can pull a sample of ten results from Elasticsearch:

$ curl -XGET 'https://search-snowplow-placeholder-name.us-east-1.es.amazonaws.com/snowplow/bad_rows/_search?pretty=true&size=10' -d '
{
   "query": {
      "bool": {
          "must_not": [
              { "match_phrase": { "errors.message": "(/)vendor/version(/) pattern nor is a legacy /i(ce.png) request" }},
              { "match_phrase": { "errors.message": "Unrecognized event [null]" }}
          ]
      }
   }
}'

and inspect the output to see both the error messages generated and the actual data generating those errors.

Once we’ve identified the issue, we can add the new value of errors.message to the query above as a further must_not clause, which filters out those bad rows and shows how many remain. By iterating through this process, we can work through the different error messages until there are none left. The same process is described in more detail in the Kibana blog post.
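As the list of known error messages grows, editing the query body by hand gets fiddly. The sketch below is one way to generate it instead. build_exclusion_query is a hypothetical helper name (not part of Snowplow or Elasticsearch) and ES_URL is an assumed variable holding your cluster endpoint. The function assumes a POSIX shell with sed, takes the messages to exclude as arguments, and prints a bool query with one must_not clause per message, passed as an array.

```shell
# Hypothetical helper: build a bool/must_not query body from the error
# messages passed as arguments (one match_phrase clause per message).
build_exclusion_query() {
  clauses=""
  for msg in "$@"; do
    # Escape backslashes and double quotes so the message is a valid JSON string.
    escaped=$(printf '%s' "$msg" | sed 's/\\/\\\\/g; s/"/\\"/g')
    clauses="${clauses}${clauses:+,}{\"match_phrase\":{\"errors.message\":\"${escaped}\"}}"
  done
  printf '{"query":{"bool":{"must_not":[%s]}}}\n' "$clauses"
}

# Example usage (ES_URL is an assumption; curl reads the body from stdin via -d @-):
# build_exclusion_query \
#   "(/)vendor/version(/) pattern nor is a legacy /i(ce.png) request" \
#   "Unrecognized event [null]" |
#   curl -XGET "$ES_URL/snowplow/bad_rows/_search?search_type=count" -d @-
```

Each time a new error has been diagnosed, its message is appended to the argument list and the count is rerun.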

One thing we may want to understand for a particular error is whether it is ongoing or confined to a particular period. We can find out by using Elasticsearch’s aggregation API to count the number of bad rows that match the error message by day or by hour, as in the example below:

$ curl -XGET 'https://search-snowplow-placeholder-name.us-east-1.es.amazonaws.com/snowplow/bad_rows/_search?search_type=count' -d '
{
   "query": {
      "match_phrase": {
          "errors.message": "Could not find schema with key iglu:com.mycompany/event_name/jsonschema/1-0-0 in any repository"
      }
   },
   "aggs": {
       "day": {
           "date_histogram": {
               "field": "failure_tstamp",
               "interval": "day",
               "format": "yyyy-MM-dd"
           }
       }
   }
}'

The above query might show us that an error only occurred for a day or two and has already been resolved.
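To switch between daily and hourly buckets without retyping the body, the aggregation query can be generated by a small function. This is a sketch assuming a POSIX shell with sed. build_histogram_query, the by_day / by_hour aggregation names, the hourly format string and ES_URL are all made up for illustration, while errors.message and failure_tstamp are the fields used in the query above.

```shell
# Hypothetical helper: emit a date_histogram query body for one error message.
# $1 = error message phrase, $2 = bucket interval ("day" or "hour").
build_histogram_query() {
  case "$2" in
    day)  format="yyyy-MM-dd" ;;
    hour) format="yyyy-MM-dd HH" ;;   # assumed label format for hourly buckets
    *)    echo "interval must be 'day' or 'hour'" >&2; return 1 ;;
  esac
  # Escape backslashes and double quotes so the message is a valid JSON string.
  message=$(printf '%s' "$1" | sed 's/\\/\\\\/g; s/"/\\"/g')
  printf '{"query":{"match_phrase":{"errors.message":"%s"}},' "$message"
  printf '"aggs":{"by_%s":{"date_histogram":{"field":"failure_tstamp","interval":"%s","format":"%s"}}}}\n' \
    "$2" "$2" "$format"
}

# Example usage (ES_URL is an assumption):
# build_histogram_query "Could not find schema with key" hour |
#   curl -XGET "$ES_URL/snowplow/bad_rows/_search?search_type=count" -d @-
```

Running it once with day and then with hour for the busiest day is usually enough to tell a one-off incident from an ongoing problem.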

Removing bad rows from Elasticsearch

Once the errors have been identified, we recommend removing the associated bad rows from Elasticsearch so that the index does not grow indefinitely, which also keeps the associated hardware bills down. This is described in the Kibana guide.

