Snowplow for Google Cloud Platform is here – Snowplow


#1

Since the early days of Snowplow Analytics, we’ve been committed to giving our users very granular, highly structured data because we believe that’s what you need to be truly data driven. Doing awesome things with this data, though, has been historically challenging because of how detailed it is. Thanks to Google, we have a solution to that problem.


This is a companion discussion topic for the original entry at https://snowplowanalytics.com/blog/2018/12/03/snowplow-for-google-cloud-platform-is-here/

#2

You talked about the benefits of BigQuery over Redshift. You also support SnowFlake, which is a very exciting database. How do the three compare and are there use cases where Redshift or SnowFlake is a better choice than BigQuery?


#3

Hi @John,

I like this question, thanks for asking :slight_smile:

I’ve got at least some experience with all three, so I can give you a brief answer here. The below are obviously all opinions, I’d be interested to hear what other people’s perspectives are on the whole thing.

(E: I should add that I don’t consider myself an expert by any means, but I have a lot of experience with Redshift, a considerable amount with Snowflake, and some experience with BigQuery - but still learning on all three fronts. Please do add a comment to correct me if you notice anything in the below that isn’t quite right as Mike has done below).

How do the three compare

They’re totally different types of Database.

Redshift

Redshift is a traditional relational database which chunks data into blocks, sorts via a SORTKEY and distributes via a DISTKEY under the hood. The dialect is very close to POSTGRESQL and so is familiar to most analysts who have turned their hands to relational SQL. A major drawback is that it requires manual work to maintain very high volume, and doesn’t support nested data. So you’re always trying to balance priorities across jobs to maintain order, writing SQL to produce efficient joins, and figuring ways around the headaches associated with volume.

If you’re a business who has a lot of back end systems, or a lot of ETL/modeling tasks which are written to a relational structure, or have consumers of the data who are used to relational structures, Redshift has advantages.

Snowflake

Snowflake is a different beast, but still ostensibly a type of relational database (As far as I understand it but maybe someone knows better than me). It utilises a set of cluster keys rather than sortkey/distribution and has a very smart auto-clustering algorithm. It supports nested data much better than Redshift. However, I’ve found that several (quite technical) people I’ve spoken to about it have struggled to wrap their heads around the concept of cluster keys and finding fields of the ideal cardinality to use - although you don’t need to manually set a cluster key unless your volumes are very high so that’s not too much of a problem.

It can be tricky to find efficient ways to query data in Snowflake if you need to both join and unnest data in the same queries. A final drawback is that I’ve come across the occasional bug - but there’s always been a workaround and the Snowflake team are super responsive and do release fixes on a decent turnaround so it’s not a major issue.

The major advantage of Snowflake over Redshift is that you don’t have to choose between compute and storage nodes, as they’re billed separately - and you can scale each according to your requirements. So choose a warehouse size that lines up well with your task and you’re on your way.

BigQuery

BigQuery isn’t so similar a relational database in the traditional sense. I think of it more a datalake with a query syntax on top - a bit like AWS’s Athena, in that the data is stored under the hood in ColumnIO format, and partitions & sharding rather than cluster keys/distribution keys/sort keys do the work of data distribution as far as I understand it. The pricing model is different too - you get charged for the amount of data scanned (with a 1TB/month free allowance).

Querying BQ is super super fast and it handles massive datasets really well. But it can’t do some things that other databases can - if you’re SELECTing a lot of data then ordering that data can cause an out of memory issue in my experience (or similar, can’t remeber the specific error message). If you normally need to run window functions over large datasets then you’re going to have to get pretty creative about it (A similar issue has cropped up in running certain types of window function, at least). Plus, you better have done a good job of deciding upon your partition key and getting your partition/sharding strategy right, because the in-built pseudo-partitions don’t always limit scans (and therefore don’t limit your bill). Also only Javascript UDFs are compatible at this point, which is annoying for some (I don’t really mind that tbh).

But I can’t stress enough how unbelievably impressive BQ is on performance - it’s a monster for performance in a good way. And the BQ UI and GCP ecosystem plays very nicely together - AWS have lot of products, Google have seamlessly integrated products. BQ supports querying external data sources (eg. CloudStorage) either as Native tables (implicit load to BQ, which is free if coming from CloudStorage), or External tables (can update live). And DataStudio is an amazing visualisation tool, which has its obvious advantages.


This is turning into an essay so I’ll leave the above stream-of-consciousness synopsis where it is. I think the above addresses your second question, but to quickly summarise the answer directly:

are there use cases where Redshift or SnowFlake is a better choice than BigQuery?

Yup, depending on your requirements, what data you’re putting in the database, and what trade-offs make sense for your business. At the end of the day every piece of tech involves some design decisions that make some trade-off, so you can only really compare them in the context of what you need to do/what your business looks like (in terms of architecture, skillsets and requirements).

I’m very curious to hear whether people disagree with any of the above or if people have their own perspectives though!

Best,

(Edits to rephrase poorly worded sentences and make my brain dump coherent
Further edits to correct some inaccuracies as pointed out by Mike below)


#4

@Colm has given a very comprehensive answer above but I’ll add a few things in. I have less experience with Snowflake DB so I’ve added in Redshift and BigQuery items.

Redshift

  • A major advantage to having Redshift is a predictable fixed cost. You only pay for the nodes that you’ve provisioned.
  • You can nest data in Redshift (as JSON) but these columns are essentially just larger VARCHAR columns. Redshift doesn’t calculate statistics within these columns so the query optimiser can’t take advantage of knowing things about the structure of the data ahead of time.
  • Resizing used to be a pain but has been made better by elastic resize which was released about 2 months ago that allows you to scale up and down compute and storage (from hours to minutes).
  • User management is pretty easy with Redshift whether you are doing it natively (Postgres-style), through IAM (AWS) or through Active Directory
  • There’s no autovacuum and VACUUM SORT operations can often be expensive if not run regularly enough.
  • Redshift doesn’t pick compression encodings automatically (though it can suggest some based on sampling table data) so this is worth doing properly
  • Redshift used to not select distribution styles automatically and these would have to be manually specified. The November release has added DISTKEY AUTO which helps out however.
  • Redshift Spectrum allows you to run queries on external data sources (like S3). Note that Spectrum uses a different query optimiser to Athena however so you won’t get identical speeds.
  • Supports UDFs in Python and SQL

BigQuery

  • Supports real time streaming. It’s the only of the 3 options here where streaming (rather than microbatching) has been designed in to the architecture.
  • Fast* to run queries across large amounts of data. Compute is decoupled from storage so queries can run at a consistently fast speed across very large volumes of data.
  • Fast: BigQuery is not a good database to use if you’re expecting sub-second queries like you would get in an OLTP database. Most queries that are submitted run more than once and the fastest result is returned from BigQuery.
  • Data can be queried on external sources (like Google Cloud Storage)
  • Supports Javascript UDFs
  • Has some support for geographic functions and basic machine learning built in
  • You can limit the amount of data scanned in a query by using maximumBytesBilled which can help with cost control. This can be set at a query, user or project level.

I believe the data under the hood is still in ColumnIO format which is Googles proprietary compressed format rather than JSON.

You can add fields to a column with type RECORD but you can’t make any change that isn’t backwards compatible (e.g., deleting a field, renaming a field, changing it’s type etc).


#6

Thanks @mike, some awesome additional input there!

Thanks also for correcting my inaccuracies - still learning the ropes on BigQuery and I definitely said the wrong thing here and there. Have edited the comment now to reflect more accurate statements - as always your input helps me get better :slight_smile: .

For those interested:

in that the data is stored under the hood in JSON format

I believe the data under the hood is still in ColumnIO format which is Googles proprietary compressed format rather than JSON.

Correct you are! I got confused on this point - what I was thinking of is that you can build a table from JSON format data - which Snowplow tables are. In fact, the speed of BQ isn’t to do with this but is to do with their awesome compression plus I think how it efficiently scans compressed data.

Finally you can’t edit an existing nested column like you can for Snowflake, so if you’re adding fields to a nested JSON column

You can add fields to a column with type RECORD but you can’t make any change that isn’t backwards compatible (e.g., deleting a field, renaming a field, changing it’s type etc).

I can’t remember the details of this - but we have definitely had issues with amending STRUCT objects to add new fields. I wasn’t the one to solve that problem so I’d need to figure out what the issue was specifically - but it lead me to believe that you couldn’t do it which may not in fact be the case, looking at the docs.

Best,