Unstructured data handling and storage

Hey @raghavn!

Good to hear you’ve got up and running!

I set up a basic pipeline using Snowplow to check whether it works for us or not. It worked for us, though I see a lot of bad data along with good data.

If you have a lot of bad data, it’s generally down to issues that can be fixed in the tracking, or by adapting schemas. If you take a look at the bad data itself, the error messages should give you an indication of what’s going wrong, and with a little investigation it can usually be fixed. Here are some docs on digging into bad rows (assuming you’re on the latest version).

JS tracker → Kinesis Firehose → S3

Since you mentioned you’re getting good and bad rows, I assume the architecture is actually JS tracker → Snowplow pipeline → Kinesis Firehose → S3.

Generally, we don’t recommend using Kinesis Firehose for the S3 sink - there are some differences in how the data format is handled. Instead, we recommend the Snowplow S3 loader - it’s designed to be compatible with the rest of the pipeline components (including the loaders, which I’ll get to in a bit).

Now, I have to set up a production system. We have 100+ custom events, and the fields within these events change frequently. One of the requirements is to store the events in a DB. As per the Snowplow GitHub, for storage we have Redshift, Postgres, and S3.

As I mentioned, the S3 loader is best for S3. For Redshift, we use the RDB loader - a set of tools to ‘shred’ the data (i.e. split out the self-describing JSONs into a flat relational table structure) and load it into Redshift.
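
To make the ‘shredding’ idea a bit more concrete, here’s a rough sketch - the vendor, schema name and fields below are hypothetical examples, just to show the shape of things:

```javascript
// A self-describing event as it appears in the enriched data
// (hypothetical vendor, schema and fields, purely for illustration):
const selfDescribingEvent = {
  schema: 'iglu:com.acme/checkout_step/jsonschema/1-0-0',
  data: {
    step: 2,
    paymentMethod: 'card'
  }
};

// Shredding splits this out of the main event row: RDB loader writes it to
// its own Redshift table (named roughly after the schema, e.g.
// atomic.com_acme_checkout_step_1) with one column per field, plus
// root_id / root_tstamp columns that join back to atomic.events.
```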

As for Postgres, all we have released so far are experimental attempts at this, but we have been working on a more fully featured loader specifically for Postgres. I don’t have much detail or timelines (I’m not on the team that works on it), but watch this space. :slight_smile:

On the topic of changing schemas - as long as you follow SchemaVer semantic versioning, the evolution of database tables will be handled for you if you use our loader setup. Just a heads-up that this means a production schema is immutable, and changes require a new version. If you break that rule by editing a schema in place, you’re opening the door to big headaches (regardless of what tech you use - some changes just don’t work with databases, e.g. type changes).
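
To illustrate what ‘immutable plus versioned’ looks like in practice, here’s a rough sketch using the JavaScript tracker (the schemas are hypothetical, and I’m assuming the v2-style trackSelfDescribingEvent call):

```javascript
// SchemaVer is MODEL-REVISION-ADDITION. A backwards-compatible change
// (e.g. adding an optional field) bumps the ADDITION component:
window.snowplow('trackSelfDescribingEvent', {
  schema: 'iglu:com.acme/checkout_step/jsonschema/1-0-1', // was 1-0-0
  data: { step: 2, paymentMethod: 'card' }
});

// A breaking change (e.g. changing a field's type) bumps the MODEL
// component, so the loaders treat it as a new table rather than trying
// to mutate the existing one in place:
window.snowplow('trackSelfDescribingEvent', {
  schema: 'iglu:com.acme/checkout_step/jsonschema/2-0-0',
  data: { step: 'review', paymentMethod: 'card' }
});
```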

To your specific questions:

  1. As we have 100+ different events, which means 100+ different Iglu schemas, I do not think an RDBMS is the best option to choose.

It’s a design decision to make for sure - but we run Redshift pipelines for customers with huge numbers of custom events and contexts using the above-mentioned RDB loader. It works, but depending on how you use the data, Redshift mightn’t be the best storage target.

If you load to S3, you can query the data directly from there using Athena or Redshift Spectrum too, data lake style - here’s a guide to using Athena in this way.

  1. Does Snowplow support NoSQL/Document DBs?
  2. If it does support NoSQL, then how do I handle this scenario?

We have a loader for Elasticsearch - that’s one option. For other NoSQL targets, you’ll need to figure out your own load process, which you can set up to read the S3 data or to read from the ‘enriched good’ Kinesis stream. I know that several people in our community have built their own custom loaders using the Elasticsearch loader as a jumping-off point. Others have used AWS Lambda to get data into a different target.
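
As a very rough illustration of the Lambda route - just a sketch, not production code: the field positions in the enriched TSV are assumptions here, and in practice you’d want the Snowplow Analytics SDK to parse the enriched format rather than splitting it by hand:

```javascript
// Sketch of an AWS Lambda handler reading the 'enriched good' Kinesis
// stream and forwarding events to a NoSQL store of your choosing.
exports.handler = async (event) => {
  for (const record of event.Records) {
    // Kinesis hands Lambda each enriched event as a base64-encoded TSV line
    const line = Buffer.from(record.kinesis.data, 'base64').toString('utf8');

    // Naive split - in real code, use the Snowplow Analytics SDK to turn
    // the TSV into JSON rather than relying on field positions like this.
    const fields = line.split('\t');
    const doc = {
      appId: fields[0],   // assumption: app_id is the first field
      eventId: fields[6], // assumption: position of event_id
      raw: line
    };

    // await yourNoSqlClient.insert(doc); // hypothetical client
  }
};
```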

  1. Should I create 100+ schemas or one schema with all the entries?

I would definitely advise against one schema for everything. But additionally, the fact that you’ve got 100+ different custom schemas to begin with suggests to me that it’s probably wise to re-examine the tracking design. One can certainly end up with that many, but generally it’s a good idea to start small and iterate. There are a few things that might help with this:

  • If you’ve gone with a ‘track everything’ approach, I would look at what the data is to be used for, and design the tracking to suit that - starting with the most important use cases.
  • If you haven’t already, consider using custom entities (aka contexts) in your tracking design - repeated information can be attached to more generic events separately in this way.
  • In general, consider whether you can rationalise your events to track a single, more generic event instead of multiple more specific ones - it’s generally best to design for user interactions rather than for everything that happens (“measure the behaviour, not the technology”). See the sketch just after this list.

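To make the last two points concrete, here’s a rough sketch with the JavaScript tracker (again assuming the v2-style API; the schemas and fields are hypothetical). Instead of lots of near-identical specific events, you track one generic interaction event and attach the repeated product details as an entity:

```javascript
// One generic event, with the repeated information attached as a custom
// entity (context) rather than baked into many near-identical event schemas.
window.snowplow('trackSelfDescribingEvent',
  {
    schema: 'iglu:com.acme/ui_interaction/jsonschema/1-0-0',
    data: { action: 'click', target: 'add_to_basket' }
  },
  [
    {
      schema: 'iglu:com.acme/product/jsonschema/1-0-0',
      data: { sku: 'ABC-123', category: 'shoes' }
    }
  ]
);
```
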
On this topic, this blog series is a nice way to frame things - I’d recommend taking a look.

  1. If I have to change the unstructured JSON fields to multiple DB columns, how can I achieve that?

I think this is covered - for Redshift, the shred process via the RDB loader handles it, so there’s no need to do it yourself. (By the way, Snowflake handles this kind of data in a much more amenable fashion - it is quite expensive, but if you do need that many custom schemas it might be a better experience. We also have a loader for that DB.)

  1. Do I need to use S3 as an intermediate phase before moving the data to a DB? Or can I use a DB as storage directly?

For Redshift and Snowflake, currently yes. We have plans to implement stream loading for them, but for now the closest we have is very frequent micro-batches, still reading from S3 (you can use a persistent cluster and load very frequently, though!).


Reading this back, I did not realise the answer would be this long when I started! :smiley: The TL;DR is that a) we have tech to deal with the loading-related things, so do use it! and b) 100+ schemas right off the bat is a very high number, so it’s a good idea to take a step back and figure out ways to simplify the tracking design if you can.

I hope that’s helpful! Do shout if you have follow-up questions.
