Redshift loading error: null byte - field longer than 1 byte

Hi Snowplowers!

Today we stumbled upon this issue: a client-side, self-describing event was fired with a field that contained the Unicode null character (\u0000).

To unblock the storage loader, we manually removed the character from the event and reran it. It’s only the second time this has happened to us (the last time was about a year ago), but it could be quite disruptive if someone started sending such events on purpose.
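
For context, the offending payload looked roughly like this (the schema and field names here are made up):

// Hypothetical self-describing event JSON. JSON serializers must escape the
// NUL control character, so it travels as the six-character sequence \u0000,
// which is what ends up in the enriched event’s unstruct_event field.
val unstructEvent: String =
  "{\"schema\":\"iglu:com.acme/link_click/jsonschema/1-0-0\"," +
  "\"data\":{\"label\":\"foo\\u0000bar\"}}"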

We considered how to fix the issue properly and we came up with some options:

  • Force tracker users to remove those chars before sending the event. It’s not ideal because the sanitizing code would end up spread across many places.
  • Remove those chars in the tracker. We don’t like it much either, because it would require changing a lot of trackers.
  • Use the event schemas to invalidate events that contain this character. We don’t like it much because it would complicate most of the schemas, and the whole event would be discarded, which is not ideal.
  • Add some sanitizing code in scala-common-enrich that removes null characters from all self-describing events. Something like event.unstruct_event = sanitizeString(event.unstruct_event) (see the sketch after this list).
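
To make that concrete, here is a minimal sketch of option 4 (sanitizeString is a hypothetical helper, not the actual scala-common-enrich API; EnrichedEvent is the enriched event class there):

// Sketch only. The NUL arrives as the escaped six-character sequence
// \u0000 inside the event JSON, hence the double backslash in the
// Scala string literal.
def sanitizeString(value: String): String =
  if (value != null) value.replace("\\u0000", "") else value

def sanitizeSelfDescribingEvent(event: EnrichedEvent): EnrichedEvent = {
  event.unstruct_event = sanitizeString(event.unstruct_event)
  event
}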

Of the alternatives we prefer the 4th one, because it centralizes the logic in one place and doesn’t force the tracker or its users to deal with a DB-specific issue. On the other hand, the solution is quite blunt.

What do you think? Maybe you have other options?

Thanks,
Dani

Thanks for sharing @danisola - and for tracking down the exact issue. What do Snowplow users reckon about the various options?

It’s funny you mention this, as I ran into a similar issue this week - though instead of a few odd Unicode characters, it was a large number of them causing a load to fail. Unfortunately MAXERROR wouldn’t help here, so the only way through was to remove the offending lines from the file.
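
For anyone else hitting this, the clean-up was along these lines (file names are made up; it relies on the NUL appearing as the escaped \u0000 sequence in the enriched file):

import scala.io.Source
import java.io.PrintWriter

// Drop the lines containing the escaped NUL sequence so the COPY can proceed.
// "\\u0000" is the literal six-character text \u0000, not the NUL byte itself.
val in  = Source.fromFile("enriched-events.tsv")
val out = new PrintWriter("enriched-events-clean.tsv")
try in.getLines().filterNot(_.contains("\\u0000")).foreach(line => out.println(line))
finally { in.close(); out.close() }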

I think you make a good argument for option 4 as keeping the logic centralised and therefore consistent across enrichers rather than the numerous trackers makes a lot of sense.

What do you think, should I create a PR with the 4th option @alex?

Hey @danisola - just thinking about this some more - as you said:

> doesn’t force the tracker or its users to deal with a DB-specific issue

If it’s a DB-specific issue, we should probably add the sanitization code into Hadoop Shred, which is the DB-specific component, not Common Enrich, which is meant to be database-agnostic.

Makes sense, I will open a PR soon.

Thanks everyone!

Thanks @danisola!

Hi @danisola - did you manage to get round to this? I’ve created a ticket here:

Hey @alex,
Sorry for the late reply, I was on holiday. To get around the issue I did a quick fix (hack?) in the shredder code that I’m not very proud of; here it is in case you want to use it:

import com.snowplowanalytics.snowplow.enrich.common.outputs.EnrichedEvent

def sanitizeUnstructEvent(event: EnrichedEvent): EnrichedEvent = {
  // The NUL reaches the shredder as the escaped six-character sequence
  // \u0000 inside the unstruct_event JSON, so we strip that literal
  // sequence (hence the double backslash in the Scala string).
  def sanitizeString(value: String): String =
    if (value != null) value.replace("\\u0000", "")
    else value

  event.unstruct_event = sanitizeString(event.unstruct_event)
  event
}
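
If the same character ever turns up in the contexts field (which carries JSON as well), the same treatment could presumably be applied there; something like:

// Speculative extension, inside the function above: contexts is the other
// JSON-carrying field on EnrichedEvent.
event.contexts = sanitizeString(event.contexts)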

Thanks @danisola! I’ve updated the GH issue.