As a digital analyst I want reliable analytics data without much administration overhead, so that I can focus on supporting the business with numbers and insights.
I tried to find the best solution for the following criteria:
- stream processing, enabling near-realtime data
- ideally fully-managed services, no administration/maintenance after initial deployment
- ideally no fixed costs, pay-per-use
- ideally automatic scaling, no managing of instance sizes, numbers, clusters, load balancing etc.
I believe that Google Cloud Platform is a good fit for these criteria, because it offers fully-managed services:
- Firebase Hosting comes with a free SSL certificate for a custom domain (see here)
- Firebase Hosting handles dynamic requests with Cloud Functions (see here)
- Firebase Hosting Blaze plan is pay-per-use with no monthly fee; the first 2,000,000 Cloud Function invocations are free
- Google Cloud Pub/Sub is pay-per-use with no monthly fee; the first 10GB are free (see here)
- Google Cloud Dataflow is pay-per-use with no monthly fee (see here)
- Google BigQuery is pay-per-use and costs $0.05 per GB for streaming data inserts and $0.02 per GB stored after the first 10GB (see here)
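To get a feel for the BigQuery part of the bill, here is a rough back-of-the-envelope sketch in Python. It only uses the two rates quoted above ($0.05 per GB streamed, $0.02 per GB stored beyond the first 10GB). The function name is made up for illustration, and I am assuming both rates apply to the same billing period. Pub/Sub, Dataflow and Cloud Functions costs are not included.

```python
# Rates quoted in the list above (USD). Treat them as a snapshot, not current pricing.
STREAMING_INSERT_USD_PER_GB = 0.05
STORAGE_USD_PER_GB = 0.02
FREE_STORAGE_GB = 10.0


def estimate_bigquery_cost(streamed_gb: float, stored_gb: float) -> float:
    """Estimate the BigQuery cost for one period from the two quoted rates.

    Hypothetical helper: assumes streaming and storage are billed over the
    same period and ignores query costs entirely.
    """
    if streamed_gb < 0 or stored_gb < 0:
        raise ValueError("volumes must be non-negative")
    streaming_cost = streamed_gb * STREAMING_INSERT_USD_PER_GB
    # Only storage beyond the free 10GB is billed.
    billable_storage_gb = max(0.0, stored_gb - FREE_STORAGE_GB)
    storage_cost = billable_storage_gb * STORAGE_USD_PER_GB
    return round(streaming_cost + storage_cost, 2)


if __name__ == "__main__":
    # 20GB streamed and 50GB stored: 20 * 0.05 + (50 - 10) * 0.02
    print(estimate_bigquery_cost(20, 50))
```

For a small site that stays under 10GB of stored data, only the streaming inserts show up in this estimate.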
How it would work
- The /i path on the custom domain HTTPS host points to a Node.js Cloud Function
- The Cloud Function takes care of cookie management and puts the payload into Google Cloud Pub/Sub (SnowCannon might be useful)
- Cloud Pub/Sub triggers Snowplow’s scala-stream-collector on Cloud Dataflow
- From the scala-stream-collector the data goes to Snowplow’s stream-enrich on Cloud Dataflow
- From Cloud Dataflow the data is streamed into BigQuery, where it can be queried directly, from Google Data Studio and Cloud Datalab, or via third-party analytics and visualization tools that support BigQuery, including Apache Superset.
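The first two steps (cookie management and handing the payload to Pub/Sub) could look roughly like the sketch below. It is written in Python with only the standard library to keep it short, whereas the proposal uses a Node.js Cloud Function. The cookie name `sp`, the one-year lifetime, the message fields and the `publish` callable are all assumptions for illustration; in a real deployment `publish` would wrap a Pub/Sub client and the message format would have to match what the downstream Snowplow components expect.

```python
import json
import time
import uuid
from http.cookies import CookieError, SimpleCookie
from typing import Callable, Optional

COOKIE_NAME = "sp"  # hypothetical name of the visitor-id cookie
COOKIE_MAX_AGE_S = 365 * 24 * 60 * 60  # assumed lifetime of one year


def get_or_create_visitor_id(cookie_header: Optional[str]) -> str:
    """Reuse the visitor id from the Cookie header, or mint a new one."""
    cookie = SimpleCookie()
    if cookie_header:
        try:
            cookie.load(cookie_header)
        except CookieError:
            pass  # unparseable header: fall through and issue a fresh id
    if COOKIE_NAME in cookie:
        return cookie[COOKIE_NAME].value
    return str(uuid.uuid4())


def handle_hit(
    query_string: str,
    cookie_header: Optional[str],
    publish: Callable[[bytes], None],
    now: Callable[[], float] = time.time,
) -> str:
    """Handle one request to /i and return the Set-Cookie header value.

    `publish` stands in for the Pub/Sub publisher (an assumption here),
    so the sketch runs without any cloud dependency.
    """
    visitor_id = get_or_create_visitor_id(cookie_header)
    message = {
        "visitor_id": visitor_id,
        "collector_tstamp_ms": int(now() * 1000),
        "querystring": query_string,  # raw tracker payload, passed through as-is
    }
    publish(json.dumps(message).encode("utf-8"))
    return f"{COOKIE_NAME}={visitor_id}; Max-Age={COOKIE_MAX_AGE_S}; Path=/; Secure; HttpOnly"


if __name__ == "__main__":
    outbox = []  # stands in for the Pub/Sub topic
    print(handle_hit("e=pv&url=https%3A%2F%2Fexample.com", None, outbox.append))
    print(outbox[0].decode("utf-8"))
```

Passing the publisher in as a callable keeps the cookie logic testable on its own; the enrichment and BigQuery steps are left to the Dataflow side as described above.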
What I like about this is that a website or business of any size can benefit from real-time clickstream analytics, and the costs directly correlate with the amount of data.
What do you think about the solution proposed above?
You are more than welcome to join me in making the adjustments necessary to deploy to GCP. The following projects might be useful resources:
- Google Cloud Dataflow example project, see here and here
- Cloud Functions Pub/Sub example, see here
- (old) Node.js collector, see here
- Streaming data into BigQuery from Java, see here