With the recent introduction of our full support for Snowflake DB, we’d love to hear from you what your experiences with the databases are or why you’re considering making the switch. And, now that we have Snowflake support, what other storage targets are you interested in seeing from us?
A thing that we are actually considering is moving away from specific data storage target but rather keeping the event data in a data lake manner with an MetaStore on top (e.g. Hive or AWS Glue). This may be piped out to various storage targets, as many come with individual advantages and disadvantages. Also we can address multi-tenency and access policy issues more easily then having everything in a single data storage.
How do you see that? Looking at the last re:Invent, especially AWS Glue, what are your thoughts considering SnowPlow architecture?
Would be awesome to have Redshift Spectrum capability without having to do extra work on the directory structure etc
Also as far as additional storage targets, Google BigQuery seems like an obvious one
First of all - thanks for kicking off this discussion. The approach you described quite precisely matches our vision of how data should be stored and processed. Snowplow never was all about single data warehouse - we’re empowering smart people to do more with their data and we don’t want these people to be restricted in how they analyze the data.
Redshift is a great warehouse and it continues to serve very well to our users as most stable solution. But while we worked on Redshift improvements and Snowplow in general, we developed a very generic and flexible framework on top of Iglu that allows us to load data to each storage in a way friendly to this particular storage. Snowflake was just another step in this direction and for example shaped up our load-tracking approach, which is going to be back-ported to Redshift soon.
You can think of
Iglu + Loaders platform as of some kind of high-level and analytics-friendly generalization of
AWS Glue approach. “Generalization” here implies that technologies can embrace each other, not necessary to compete. What is also very important is that unlike AWS Glue approach, which is coupled with S3 data lake, our approach is cloud-provider-independent. And our next big goal as @travisdevitt mentioned is Google BigQuery (as per RFC). So, stay tuned and as always - new ideas and proposals are very welcome.
To add to @anton’s great answer - it’s important to flag that the Snowplow enriched event archive in S3 is what we currently recommend using as your long-term event archive, feeding various downstream specific data storage targets such as Redshift or Snowflake.
As our approach to storage loading manifests improves, it should get easier and easier to be more agile and selective in terms of what eventsets you load into these downstream stores (e.g. “just load last quarter’s events into a new Redshift cluster for analysis”).
An Avro+Parquet+Hive Metastore+S3 storage target (let’s call this APMS for brevity) is definitely something we are interested in, and it’s a common request from Snowplow users and customers. Why do I call APMS a storage target, rather than a data lake or event archive? Really for a couple of reasons:
- Transforming Snowplow enriched events into Avro is a lossy process, because Avro is less expressive than our heterogeneous JSON Schema-based event. Anything that is lossy can’t be the definitive Snowplow event archive
- As soon as we are using Parquet (a columnar format), we are optimizing for OLAP-style queries. This is great for typical data analytics workloads, but not appropriate for row-wise operations like bulk-loading downstream databases
To be clear - there is definitely plenty of scope for improving our Snowplow enriched event archive format and layout on disk - but it’s worth decoupling this from the awesome idea of a Avro+Parquet+Hive Metastore+S3 storage target, as they are different things.