Hi,
We are trying to recover some events that failed during the enrichment step because the schema wasn’t published in our iglu-schema repository. We are following the docs and using snowplow-event-recovery-spark-0.3.1.jar.
When the job finishes, the recovered output on S3 is essentially empty (a single 42-byte file), and all the events end up in the failed output:
aws s3 ls s3://{THE-BUCKET}/recovery_output_2021-12-02-23/ --recursive --human-readable
2022-01-07 14:57:03 0 Bytes recovery_output_2021-12-02-23/_SUCCESS
2022-01-07 14:57:03 42 Bytes recovery_output_2021-12-02-23/part-r-00000.lzo
aws s3 ls s3://{THE-BUCKET}/recovery_failed_output_2021-12-02-23/ --recursive --human-readable
2022-01-07 15:09:59 0 Bytes recovery_failed_output_2021-12-02-23/com.snowplowanalytics.snowplow.badrows.recovery_error/_SUCCESS
2022-01-07 15:03:12 6.0 GiB recovery_failed_output_2021-12-02-23/com.snowplowanalytics.snowplow.badrows.recovery_error/part-00000-535cdfba-8591-4581-b91e-a2a10ffb8390-c000
All the failed events under the S3 path s3://{THE-BUCKET}/recovery_failed_output_2021-12-02-23/ have this error:
{
"schema": "iglu:com.snowplowanalytics.snowplow.badrows/collector_payload_format_violation/jsonschema/1-0-0",
"data": {
"processor": {
"artifact": "snowplow-event-recovery",
"version": "0.2.0"
},
"failure": {
"timestamp": "2022-01-06T15:59:56.293Z",
"loader": "",
"message": {
"error": "Attempt to decode value on failed cursor: DownField(error),DownField(failure),DownField(data)"
}
},
"payload": "{\"schema\":\"iglu:com.snowplowanalytics.snowplow.badrows/schema_violations/jsonschema/2-0-0\",\"data\":{\"payload\":{\"enriched\":{\"mkt_network\":null,\"tr_total\":null,\"br_name\":null,\"doc_charset\":\"UTF-8\",\"br_features_director\":null,\"page_urlpath\":null,\"br_features_quicktime\":null,\"tr_total_base\":null,\"mkt_term\":null,\"mkt_source\":null,\"ti_price\":null,\"tr_tax\":null,\"br_renderengine\":null,\"refr_urlhost\":null,\"v_tracker\":\"js-3.1.6\",\"mkt_clickid\":null,\"page_urlscheme\":null,\"mkt_campaign\":null,\"doc_height\":5199,\"geo_timezone\":null,\"app_id\":\"consumer-web\",\"ip_domain\":null,\"mkt_medium\":null,\"geo_longitude\":null,\"br_features_java\":null,\"refr_urlscheme\":null,\"user_id\":null,\"geo_region_name\":null,\"page_referrer\":\"https://www.google.com/\",\"os_timezone\":null,\"refr_source\":null,\"geo_region\":null,\"dvce_ismobile\":null,\"page_urlquery\":null,\"br_cookies\":1,\"useragent\":\"Mozilla/5.0 (Linux; Android 11; SAMSUNG SM-N986U) AppleWebKit/537.36 (KHTML, like Gecko) SamsungBrowser/16.0 Chrome/92.0.4515.16
....
ERROR Attempt to decode value on failed cursor: DownField(error),DownField(failure),DownField(data)
Spark Submit command executed inside the AWS EMR cluster:
spark-submit \
--deploy-mode cluster \
--master yarn \
snowplow-event-recovery-spark-0.3.1.jar \
--input s3://{THE-BUCKET}/enriched/bad/date_at=2021-12-02/hour=23/ \
--failedOutput s3://{THE-BUCKET}/recovery_failed_output_2021-12-02-23/ \
--unrecoverableOutput s3://{THE-BUCKET}/recovery_unrecoverable_output_2021-12-02-23/ \
--directoryOutput s3://{THE-BUCKET}/recovery_output_2021-12-02-23/ \
--region eu-west-1 \
--resolver "eyJzY2hlbWEiOiJpZ2x1OmN..........................=" \
--config "eyAic2NoZW1hIjogImlnbHU6Y29tLnNub3dwbG93YW5hbHl0aWNzLnNub3dwbG93L3JlY292ZXJpZXMvanNvbnNjaGVtYS8zLTAtMCIsICJkYXRhIjogeyAiaWdsdTpjb20uc25vd3Bsb3dhbmFseXRpY3Muc25vd3Bsb3cuYmFkcm93cy9zY2hlbWFfdmlvbGF0aW9ucy9qc29uc2NoZW1hLzItMC0wIjogW3sibmFtZSI6ICJwYXNzdGhyb3VnaCIsICJjb25kaXRpb25zIjogW10sICJzdGVwcyI6IFtdfV19fQ=="
Configuration, based on the docs (passed base64-encoded via --config):
{ "schema": "iglu:com.snowplowanalytics.snowplow/recoveries/jsonschema/3-0-0", "data": { "iglu:com.snowplowanalytics.snowplow.badrows/schema_violations/jsonschema/2-0-0": [{"name": "passthrough", "conditions": [], "steps": []}]}}
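For anyone reproducing this, here is a minimal sketch of how a configuration like the one above can be turned into the base64 string passed to --config, and decoded again to double-check it. The helper names `encode_config` and `decode_config` are made up for illustration; only Python's standard `base64` and `json` modules are used. The resulting string will not necessarily match the one in the command above byte for byte, since whitespace in the JSON affects the encoding.

```python
import base64
import json

# Recovery configuration copied from the post: a single "passthrough" flow
# for schema_violations 2-0-0 with no conditions and no steps.
config = {
    "schema": "iglu:com.snowplowanalytics.snowplow/recoveries/jsonschema/3-0-0",
    "data": {
        "iglu:com.snowplowanalytics.snowplow.badrows/schema_violations/jsonschema/2-0-0": [
            {"name": "passthrough", "conditions": [], "steps": []}
        ]
    },
}


def encode_config(cfg: dict) -> str:
    """Serialize the config to JSON and base64-encode it (hypothetical helper)."""
    return base64.b64encode(json.dumps(cfg).encode("utf-8")).decode("ascii")


def decode_config(encoded: str) -> dict:
    """Reverse of encode_config: base64-decode and parse the JSON."""
    return json.loads(base64.b64decode(encoded))


encoded = encode_config(config)
print(encoded)

# Round-trip check: decoding must give back exactly the same configuration.
assert decode_config(encoded) == config
```

Running `decode_config` on the string actually passed to spark-submit is a quick way to confirm that the job received the intended configuration.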
Bad Event Example:
{
"schema": "iglu:com.snowplowanalytics.snowplow.badrows/schema_violations/jsonschema/2-0-0",
"data": {
"payload": {
"enriched": {
"mkt_network": null,
.......
"true_tstamp": null
},
"raw": {
"headers": [
"Timeout-Access: <function1>",
"X-Forwarded-For: 2.234.152.192, 64.252.144.71",
"X-Forwarded-Proto: https",
"X-Forwarded-Port: 443",
.....
"application/json"
],
"ipAddress": "fasf",
"useragent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.55 Safari/537.36 Edg/96.0.1054.34",
"encoding": "UTF-8",
"version": "tp2",
"userId": "1d5e2027-6d33-450e-90c1-011b099f8d2b",
"refererUri": "https://onefootball.com/",
"hostname": "thehost.onefootball.com",
"loaderName": "ssc-2.3.1-kinesis",
"vendor": "com.snowplowanalytics.snowplow",
"parameters": ".....",
"contentType": "application/json",
"timestamp": "2021-12-01T20:21:03.743Z"
}
},
"failure": {
"messages": [
{
"error": {
"lookupHistory": [
{
"lastAttempt": "2021-12-01T20:14:00.159Z",
"repository": "Iglu Central",
"errors": [
{
"error": "NotFound"
}
],
"attempts": 52
},
{
"lastAttempt": "2021-12-01T20:14:00.253Z",
"repository": "Iglu Central - GCP Mirror",
"errors": [
{
"error": "NotFound"
}
],
"attempts": 52
},
{
"lastAttempt": "2021-12-01T09:00:35.386Z",
"repository": "Iglu Client Embedded",
"errors": [
{
"error": "NotFound"
}
],
"attempts": 1
},
{
"lastAttempt": "2021-12-01T20:13:43.815Z",
"repository": "Production Repo",
"errors": [
{
"error": "NotFound"
}
],
"attempts": 52
}
],
"error": "ResolutionError"
},
"schemaKey": "iglu:com.onefootball/consumer_web_stream_context/jsonschema/1-0-2"
}
],
"timestamp": "2021-12-01T20:21:06.066019Z"
},
"processor": {
"artifact": "stream-enrich",
"version": "1.4.2"
}
}
}
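To relate the error message to this bad row, here is a small sketch that checks which key along a JSON path is missing. It assumes the message is a circe-style cursor history, which lists the most recent operation first, so `DownField(error),DownField(failure),DownField(data)` would correspond to the path data.failure.error. That reading is an assumption about how the recovery job reports decoding failures, and `first_missing` is a made-up helper; the `bad_row` dictionary is a trimmed-down stand-in for the example above.

```python
from typing import Any, List, Optional


def first_missing(doc: Any, path: List[str]) -> Optional[str]:
    """Return the dotted path of the first absent key, or None if the whole path exists."""
    current = doc
    for i, key in enumerate(path):
        if not isinstance(current, dict) or key not in current:
            return ".".join(path[: i + 1])
        current = current[key]
    return None


# Trimmed-down stand-in for the bad row above: only the keys relevant to the check.
bad_row = {
    "schema": "iglu:com.snowplowanalytics.snowplow.badrows/schema_violations/jsonschema/2-0-0",
    "data": {
        "failure": {"messages": [], "timestamp": "2021-12-01T20:21:06.066019Z"},
    },
}

# The path from the error message (assumed order: data -> failure -> error).
print(first_missing(bad_row, ["data", "failure", "error"]))
# The path that actually exists in a schema_violations bad row.
print(first_missing(bad_row, ["data", "failure", "messages"]))
```

With this stand-in, the first call reports data.failure.error as missing and the second returns None, which matches the example: the failure object contains messages and timestamp, but no error key directly under it.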
I have already searched other threads for this error, but could not find anything related to "Attempt to decode value on failed cursor: DownField(error),DownField(failure),DownField(data)".
One important thing to mention: these bad events were retrieved from our Elasticsearch bad index, because we hadn’t set up an S3 loader for the enrich_bad Kinesis stream at that time. We wrote a Python script that retrieves the bad events from the Elasticsearch bad index and saves them to the S3 path enriched/bad.
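A minimal sketch of such an export step is shown below, as an illustration only and not the script that was actually run. It assumes each Elasticsearch hit is a dictionary whose `_source` holds the complete bad row, and that the recovery job expects one self-describing JSON document per line; both are assumptions. The function name `hits_to_ndjson` and the sample hit are hypothetical, and no Elasticsearch or S3 client is involved.

```python
import json
from typing import Dict, Iterable, Iterator

# Schema prefix shared by the bad rows shown in this post.
EXPECTED_PREFIX = "iglu:com.snowplowanalytics.snowplow.badrows/"


def hits_to_ndjson(hits: Iterable[Dict]) -> Iterator[str]:
    """Yield one compact JSON line per hit, rejecting documents that do not
    look like self-describing bad rows (hypothetical helper)."""
    for hit in hits:
        source = hit["_source"]
        schema = source.get("schema", "")
        if not schema.startswith(EXPECTED_PREFIX) or "data" not in source:
            raise ValueError(f"not a self-describing bad row: schema={schema!r}")
        # Compact separators keep each document on a single line without padding.
        yield json.dumps(source, separators=(",", ":"))


# Hypothetical sample hit; real hits would come from an Elasticsearch query.
sample_hits = [
    {
        "_source": {
            "schema": EXPECTED_PREFIX + "schema_violations/jsonschema/2-0-0",
            "data": {"processor": {"artifact": "stream-enrich", "version": "1.4.2"}},
        }
    }
]

for line in hits_to_ndjson(sample_hits):
    print(line)
```

A check like this only confirms the outer envelope. If the index mapping or the loader changed any field inside data, the exported documents could still differ from what the enrich_bad stream originally contained.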