Google Cloud Dataflow example project released


#1

We are pleased to announce the Google Cloud Dataflow example project.

This project will help you start your own real-time event processing pipeline, using some of the great services and tools offered by Google Cloud Platform: Pub/Sub (for distributed message queueing), Dataflow (for data processing) and Bigtable (for NoSQL storage).
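
To make the shape of such a pipeline concrete, here is a minimal sketch in plain Python. It does not use any GCP client library: the list of events stands in for messages pulled from Pub/Sub, `aggregate` stands in for the Dataflow processing step, and the `store` dict stands in for a Bigtable table. The event fields (`timestamp`, `type`), the per-minute counting and the row-key layout are all assumptions made for illustration, not details taken from the project.

```python
from collections import defaultdict
from datetime import datetime


def minute_bucket(timestamp):
    """Truncate an ISO-8601 timestamp string to the start of its minute."""
    parsed = datetime.fromisoformat(timestamp)
    return parsed.replace(second=0, microsecond=0).isoformat()


def aggregate(events):
    """Count events per (minute bucket, event type).

    Stand-in for the processing step; `events` is any iterable of dicts
    with hypothetical "timestamp" and "type" fields.
    """
    counts = defaultdict(int)
    for event in events:
        key = (minute_bucket(event["timestamp"]), event["type"])
        counts[key] += 1
    return dict(counts)


def write_counts(store, counts):
    """Add the new counts to the running totals in a key-value store.

    `store` is a plain dict standing in for a NoSQL table; the
    "bucket|type" row key is a hypothetical layout.
    """
    for (bucket, event_type), count in counts.items():
        row_key = "{}|{}".format(bucket, event_type)
        store[row_key] = store.get(row_key, 0) + count


# Example: three events arriving from the (simulated) message queue.
events = [
    {"timestamp": "2017-03-01T12:00:15", "type": "Green"},
    {"timestamp": "2017-03-01T12:00:45", "type": "Green"},
    {"timestamp": "2017-03-01T12:01:05", "type": "Red"},
]
store = {}
write_counts(store, aggregate(events))
```

The aggregates are computed as events arrive and written to the store ready to be read, rather than being computed at query time from raw events.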


#2

This is the first output from Gui’s winter internship at Snowplow working on Google Cloud Platform - great work @colobas!


#3

Very interesting, thanks!

A couple of questions:

  1. Is this a test pilot, or are there plans for future improvements to the sample pipeline?
  2. Any reason why you haven’t used Kafka + BigQuery and the existing Stream Collector + Enrich?

Cheers,
Evaldas


#4

Hi @evaldas! I can probably answer for @colobas:

  • This is a standalone test pilot, independent of any future work porting Snowplow to GCP
  • We didn’t use Kafka because we were trying to learn about GCP, not Kafka!
  • We thought about using BigQuery, but as we had done some experimentation with BigQuery a while back, it was more interesting to try out Bigtable. Plus, it fitted this analytics-on-write use case better
  • We didn’t use existing Snowplow components because this is meant to be a standalone example project - no prior knowledge of Snowplow required

Hope this helps. Stay tuned for the Snowplow on GCP Request for Comments @evaldas - it sounds like this is more what you are looking for…


#5

Hi @alex, thanks for the info. I had been meaning to try all of those standard GCP parts, as that is the canonical setup Google usually demos for event streaming example projects. Now I’ll have a good reason to try this out. I guess it does make sense to use Bigtable in some cases, especially if you need high throughput (QPS), which Bigtable provides. For data-warehouse-style analysis, though, BigQuery has more advantages: it is a columnar store, it supports nested data structures, and you don’t need to manage the nodes yourself. It also supports streaming inserts, which are not available in Redshift (though they come with some caveats too).

Dataflow seems interesting, especially if you combine it with the Apache Beam abstraction to manage the pipelines: it might offer the best of both worlds, a cloud option without lock-in and the ability to switch to another solution.

Will be very interesting to see the RFC for GCP!


#6

Btw, when I try any inv command I always get " did not receive all required positional arguments!", even though vagrant up completed OK.


#7

Hi @evaldas - yes, the potential for “programming to the interface” and using Beam for other, non-GCP environments too is super interesting.

Feel free to raise a bug in the repository!


#8

Hey @evaldas, I’m sorry it took me so long to answer. I forked the repo and tried to correct the problem, but unfortunately I have no GCP account w/ available resources to test it out right now. If you have the time, could you try it out? If it works I’ll open a PR. It’s here: https://github.com/colobas/google-cloud-dataflow-example-project .

I believe the problem has to do with Python 3 vs 2 conflicts, but it was my fault for sure - I probably didn’t test the helper script properly inside the vagrant machine (Python 2 environment), and only tested it in my development environment (Python 3 environment). Sorry!
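
As an illustration only (this is not the change made in the fork), one way to surface this kind of interpreter mismatch early is a small guard at the top of a helper script. The function name `interpreter_problem` and the expected major version are hypothetical; the sketch is written so that it runs under both Python 2 and Python 3.

```python
import sys


def interpreter_problem(expected_major, version_info=sys.version_info):
    """Return a message if the running major version differs, else None.

    `version_info` defaults to the running interpreter; passing a tuple
    such as (3, 6, 0) makes the check easy to exercise by hand.
    """
    if version_info[0] != expected_major:
        return "expected Python {}, but running under Python {}.{}".format(
            expected_major, version_info[0], version_info[1]
        )
    return None


# Hypothetical usage at the top of a helper script:
#     problem = interpreter_problem(expected_major=2)
#     if problem:
#         sys.exit(problem)
```

A check like this fails with a clear message instead of an unrelated error further down the script.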


#9

Hey @colobas, thanks for the fix. I’ve tried your fork and ran into another error, which I posted here: https://github.com/snowplow/google-cloud-dataflow-example-project/issues/5

Wouldn’t it be easier just to update Vagrant to use Python 3 instead?

Cheers,
Evaldas