Question regarding Snowplow event recovery 1.0.0 configuration format

Hello,

We have a bad row that looks like this:

{
  "line": "<base64 encoded string>",
  "errors": [
    {
      "level": "error",
      "message": "error: object instance has properties which are not allowed by the schema: [\"submitted\"]\n    level: \"error\"\n    schema: {\"loadingURI\":\"#\",\"pointer\":\"\"}\n    instance: {\"pointer\":\"\"}\n    domain: \"validation\"\n    keyword: \"additionalProperties\"\n    unwanted: [\"submitted\"]\n"
    }
  ],
  "failure_tstamp": "2021-03-21T10:03:26.280Z"
}

We are trying to use snowplow-event-recovery-spark-0.1.0.jar to correct the bad row, but we are unsure what to give as the error filter in the config. Specifically, what should go inside the "error" property of the configuration? Should we just copy the message field, as follows?

{
  "schema": "iglu:com.snowplowanalytics.snowplow/recoveries/jsonschema/1-0-0",
  "data": [
    {
      "name": "RemoveFromBody",
      "error": "error: object instance has properties which are not allowed by the schema: [\"submitted\"]\n    level: \"error\"\n    schema: {\"loadingURI\":\"#\",\"pointer\":\"\"}\n    instance: {\"pointer\":\"\"}\n    domain: \"validation\"\n    keyword: \"additionalProperties\"\n    unwanted: [\"submitted\"]\n",
      "toRemove": "\"submitted\":\".*\",?"
    }
  ]
}

@onnu_thonala_ad, yes, you can either use the whole string, with exactly the same characters as in the bad row's error message, or just a part of it that is sufficient to identify the rejected events you are after. For example, you could use just "object instance has properties which are not allowed by the schema: [\"submitted\"]".

This is also shown in the doc example:

    # Removes a field which shouldn't be there
    {
      "name": "RemoveFromBody",
      "error": "object instance has properties which are not allowed by the schema: [\"test\"]",
      "toRemove": "\"test\":\".*\",?"
    }
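
To illustrate why a partial string is enough, here is a minimal sketch of substring-based error matching. `ErrorFilter` and `ErrorFilterDemo` are hypothetical names used only for illustration, and the assumption that the job does a plain substring check is inferred from the reply above rather than confirmed against the recovery job's source.

```scala
// Hypothetical stand-in for a recovery scenario's error filter.
// Assumption: a bad row is selected when any of its error messages
// contains the configured "error" string as a plain substring.
final case class ErrorFilter(error: String) {
  def matches(messages: List[String]): Boolean =
    messages.exists(_.contains(error))
}

object ErrorFilterDemo {
  // Shortened version of the message from the bad row in this thread
  val message: String =
    "error: object instance has properties which are not allowed by the schema: [\"submitted\"]\n" +
      "    level: \"error\"\n    keyword: \"additionalProperties\"\n"

  val partial = ErrorFilter("not allowed by the schema: [\"submitted\"]")
  val other   = ErrorFilter("not allowed by the schema: [\"test\"]")

  def main(args: Array[String]): Unit = {
    println(partial.matches(List(message))) // true
    println(other.matches(List(message)))   // false
  }
}
```

Under that assumption, the shorter the filter string, the more bad rows it can select, so it is worth keeping the part that names the offending property.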

Thank you @ihor

@ihor could you help us with one more thing? We have a field in the body (base64 encoded) that we need to replace. We need to replace

{
  "submitted": {
    "data": "xyz"
  }
}

with

{
  "submitted_modified": "xyz"
}

Could you help us with the regex for this? Thanks!

@onnu_thonala_ad, I think it would be something like this:

"toReplace": "\"submitted\":\{\"data\":\"(.*)\"\}",
"replacement": "\"submitted_modified\":\"$1\""

I’m not sure if the raw data already has curly brackets escaped. If so, you would have \\{ and \\} in "toReplace".
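
The suggested pattern can be checked outside the recovery job with a small sketch like the one below. `ReplaceInBodyDemo` and `recover` are hypothetical names, and the sketch assumes the job applies "toReplace" and "replacement" to the decoded body roughly the way `String.replaceAll` does; that is not confirmed here.

```scala
object ReplaceInBodyDemo {
  // The regex as the regex engine should see it, before any JSON escaping.
  // Triple quotes keep the backslashes literal.
  val toReplace: String = """"submitted":\{"data":"(.*)"\}"""

  // $1 refers to the first capture group (the value of "data").
  val replacement: String = "\"submitted_modified\":\"$1\""

  // Assumption: the job applies the pattern roughly like String.replaceAll.
  def recover(body: String): String = body.replaceAll(toReplace, replacement)

  def main(args: Array[String]): Unit = {
    // Compact JSON input: the pattern makes no allowance for whitespace.
    println(recover("""{"submitted":{"data":"xyz"}}"""))
    // {"submitted_modified":"xyz"}
  }
}
```

Note that the pattern only matches compact JSON. If the payload contains whitespace between tokens, or if other string fields follow on the same line, a stricter pattern (for example `\s*` between tokens and `([^"]*)` instead of the greedy `(.*)`) may be needed.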


Thanks a lot for helping out @ihor

Hello @ihor, I tried the regex that you gave, but it didn’t work. I tested it locally and it threw errors. Basically, parsing the recoveryScenarios JSON with the circe parser fails for nested JSON.

val recoveryScenarios = io.circe.parser.parse(getResourceContent("/recovery_scenarios.json"))
    .flatMap(_.hcursor.get[List[RecoveryScenario]]("data"))
    .fold(f => throw new Exception(s"invalid recovery scenarios: ${f.getMessage}"), identity)
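
One possible cause of the parse failure, offered as a guess since the actual errors are not shown: inside a JSON string a backslash must itself be escaped, and `\{` is not one of the escape sequences JSON allows, so a value such as `"\"submitted\":\{...` would be rejected by a JSON parser before the regex is ever evaluated. The sketch below shows how a regex could be turned into a valid JSON string value. `JsonEscapeDemo` and `toJsonString` are hypothetical helpers written for this illustration and are not part of circe or the recovery job.

```scala
object JsonEscapeDemo {
  // Minimal escaping for embedding a regex in a JSON string literal:
  // double every backslash first, then escape the double quotes.
  // Control characters are not handled; this is for illustration only.
  def toJsonString(raw: String): String =
    "\"" + raw.replace("\\", "\\\\").replace("\"", "\\\"") + "\""

  // The regex as the regex engine should see it.
  val regex: String = """"submitted":\{"data":"(.*)"\}"""

  def main(args: Array[String]): Unit =
    println("\"toReplace\": " + toJsonString(regex))
    // "toReplace": "\"submitted\":\\{\"data\":\"(.*)\"\\}"
}
```

With the doubled backslashes the config file is valid JSON, and after parsing the regex engine receives `\{` and `\}` as intended.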

I tried three different regexes, but none of them worked. I am attaching screenshots for your reference.

Do you know where we’re going wrong?

Thanks!

@ihor Could you kindly help with this?

@onnu_thonala_ad, could you share an example of the whole bad record?

@ihor I’m afraid I can’t share it on the public forum because of company policy. Is there a way I can DM or email you?