Debugging bad rows in Athena (Real-time tutorial)


#1

Having recently gone through the process of debugging data using Athena for a real-time pipeline, I thought I’d share the steps I went through and the queries I was able to use successfully. The reason being that there are some differences from doing this for a batch Snowplow setup which was covered brilliantly in Yali’s Debugging Bad Row In Athena Tutorial.

The key difference

With one of our batch pipelines, you’ll notice that the enriched bad events (data which hasn’t landed in your production database) are partitioned by run:

The equivalent S3 bucket that you’ll need to query if you’re using one of our real-time pipelines will be something like sp-com-yourcompany-12345-0-kinesis-s3-bad/main. You’ll notice that the bad rows aren’t partitioned by run:

This has implications on how you can create and query tables in Athena. I’ll now run through the process outlined in Yali’s tutorial for a real-time pipeline.

1. Log into Athena and create the bad rows table

In the AWS console navigate to Athena. Enter the following statement to create your bad rows table, being sure to update the S3 path to the location of your bad stream. (This should be identifiable from your Stream Enrich config.)

CREATE EXTERNAL TABLE IF NOT EXISTS bad_rows (
  line string,
  errors array<struct<level: string, message: string>>,
  failure_tstamp string
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
STORED AS TEXTFILE
LOCATION 's3://sp-com-yourcompany-kinesis-s3-bad/main/'

We are now ready to start analysing the bad stream!

Query warning

Because there are no partitions in S3 with a real-time pipeline, this means that you need to be extra-vigilant when querying the data. The table created above will allow you to query your full enriched bad dataset without the need to manually load partitions (as you have to do for a batch pipeline). AWS charge based on the amount of data scanned per query therefore it is worth limiting the amount of data scanned as much as possible!

2. Querying the data

2a. Count the number of bad rows per day

We can then see how many bad rows are generated per day.

SELECT 
substr(failure_tstamp, 1, 10),
COUNT(*)
FROM bad_rows
WHERE failure_tstamp IS NOT NULL
GROUP BY 1 
ORDER BY 1
LIMIT 10

2b. Viewing a selection of bad rows for a particular day

SELECT 
errors
FROM bad_rows
WHERE substr(failure_tstamp, 1, 10) = '2018-07-25'
LIMIT 100;

As discussed in our earlier post, there are a number of bad rows that we don’t need to worry about. We can easily filter these out:

SELECT
errors[1].message
FROM bad_rows
WHERE substr(failure_tstamp, 1, 10) > '2018-07-25' 
AND position('does not match (/)vendor/version(/) pattern nor is a legacy' in errors[1].message) = 0
AND position('Unrecognized event [null]' in errors[1].message) = 0
AND position('Querystring is empty: no raw event to process' in errors[1].message) = 0
LIMIT 100;

Note that in the above example we’re filtering out bad rows with the two specific error messages listed in the earlier post using the position function, which returns 0 if the specified text cannot be found the error message.

Identifying an error and measuring what impact it’s having

Below are some of the rows that were extracted from the query above:

error: Could not find schema with key iglu:com.snowplow/video_event/jsonschema/1-0-1 in any repository, tried: level: "error" repositories: ["Iglu Central - Mirror 01 [HTTP]","Iglu Central [HTTP]","Iglu Client Embedded [embedded]","Iglu com.snowplow-main [HTTP]"]|
error: Could not find schema with key iglu:com.snowplow/video_event/jsonschema/1-0-1 in any repository, tried: level: "error" repositories: ["Iglu Central - Mirror 01 [HTTP]","Iglu Central [HTTP]","Iglu Client Embedded [embedded]","Iglu com.snowplow-main [HTTP]"]|
error: Could not find schema with key iglu:com.snowplow/video_event/jsonschema/1-0-1 in any repository, tried: level: "error" repositories: ["Iglu Central - Mirror 01 [HTTP]","Iglu Central [HTTP]","Iglu Client Embedded [embedded]","Iglu com.snowplow-main [HTTP]"]|
error: Could not find schema with key iglu:com.snowplow/page_view_event/jsonschema/1-0-1 in any repository, tried: level: "error" repositories: ["Iglu Central - Mirror 01 [HTTP]","Iglu Central [HTTP]","Iglu Client Embedded [embedded]","Iglu com.snowplow-main [HTTP]"]|

As you can see we’ve identified errors related to two of our schemas video_event and page_view_event.

We can inspect a set of lines associated with the event:

SELECT
line
FROM bad_rows
WHERE substr(failure_tstamp, 1, 10) > '2018-07-25' 
AND position('video_view' in errors[1].message) > 0
LIMIT 100;

By base64 decoding the payload body we can understand where this event came from and use that to home in on the tracking issue that generated it. In this particular example there is no need to do so since the schema associated with the event could not be found - implying that the schema was not uploaded or named incorrectly (e.g the schema uploaded was called video_play).

We can also look at how many bad rows this error has accounted for over time:

SELECT
substr(failure_tstamp, 1, 10) AS error_date,
COUNT(*) AS errors
FROM bad_rows
WHERE substr(failure_tstamp, 1, 10) > '2018-07-25' 
AND position('video_view' in errors[1].message) > 0
GROUP BY 1
ORDER BY 1

Which outputted:

 	error_date	errors
1	2018-07-30	100
2	2018-07-31	200

Assuming that we’ve since fixed the issue, we can easily filter out these rows to continue debugging the remaining bad rows:

SELECT
errors[1].message
FROM bad_rows
WHERE substr(failure_tstamp, 1, 10) > '2018-07-25'
AND position('does not match (/)vendor/version(/) pattern nor is a legacy' in errors[1].message) = 0
AND position('Unrecognized event [null]' in errors[1].message) = 0
AND position('Querystring is empty: no raw event to process' in errors[1].message) = 0
-- Adding the new condition to our filter below so we can see what errors remain to be diagnosed
AND position('video_view' in errors[1].message) = 0
LIMIT 100;