Low cost stream pipeline

Hey all,
I've been working with multiple clients to provide proof-of-concept stacks with Snowplow. Most of these clients have fewer than 250k events per month and very basic requirements. However, with all the changes happening around third-party cookies (Google's SameSite update, ITP, etc.), I need to update the offering and can't really do batch pipelines anymore. Before setting anything up, I was wondering whether anyone here is running a setup in the range of 250k-500k events per month or less, and what it costs per month?

Can’t speak to Snowplow Mini from experience, but I’ve heard it may be able to handle that sort of volume. Might be worth a look for POC Snowplow RT installs.

As for low-cost RT pipelines, we’ve been toying with running the GCP real-time pipeline in batches by firing up Beam Enrich for a few hours each day. We haven’t tested it at volume, but I imagine it’ll be cost-effective compared with AWS EMR/Batch. Using an n1-standard-4 instance for a couple of hours each day should keep your costs to no more than $0.50/day for enrichment/loading.
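To make the arithmetic behind that estimate explicit, here is a rough sketch. The hourly rate of 0.25 is purely an assumed placeholder chosen so that two hours lands on the $0.50/day ceiling mentioned above; it is not a quoted GCP price, so plug in the current rate for your machine type and region. The function name `estimate_cost` is also just illustrative.

```shell
#!/bin/bash

# Rough cost of running one worker for a fixed window each day.
# Arguments: hourly rate (USD), hours per day, days per month (default 30).
# NOTE: the rate passed below is an assumed placeholder, not a real price.
estimate_cost() {
    local hourly_rate="$1" hours_per_day="$2" days="${3:-30}"
    awk -v r="$hourly_rate" -v h="$hours_per_day" -v d="$days" \
        'BEGIN { printf "daily=%.2f monthly=%.2f\n", r * h, r * h * d }'
}

# Two-hour daily window vs. leaving the same worker up around the clock
estimate_cost 0.25 2
estimate_cost 0.25 24
```

At the same assumed rate, the always-on case comes out twelve times more expensive than the two-hour window, which is the whole appeal of only running the job for part of the day.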

Just need crontab and a simple shell script to orchestrate this:

  1. Start Beam Enrich @ 6am
  2. Stop Beam Enrich @ 7am

```shell
#!/bin/bash

# Drain active beam-enrich jobs
JOBNAME="beam-enrich"
REGION="us-central1"
# Collect the IDs (first column) of active jobs whose listing matches the job name
JOBID=$(gcloud dataflow jobs list --status=active --region "$REGION" | grep "$JOBNAME" | awk '{ print $1 }')

if [[ -z "$JOBID" ]]; then
    echo "No active $JOBNAME jobs"
else
    echo "Stopping $JOBNAME: $JOBID"
    # $JOBID is deliberately unquoted here so each matching ID is drained in turn
    for JOB in $JOBID; do
        gcloud dataflow jobs drain "$JOB" --region "$REGION"
    done
fi
```

Then just rinse and repeat with BQ Loader and BQ Mutator.
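For the crontab side, something along these lines should do as a starting point. Both script paths are hypothetical placeholders (the start script isn't shown in this thread), and the times simply mirror the 6am/7am schedule above in the machine's local timezone. The snippet only writes the entries to a file and prints them, so nothing gets installed until you choose to.

```shell
#!/bin/bash

# Sketch only: both script paths below are hypothetical placeholders.
CRON_FILE="${CRON_FILE:-./beam-enrich.cron}"

cat > "$CRON_FILE" <<'EOF'
# min hour day-of-month month day-of-week command
0 6 * * * /opt/snowplow/start-beam-enrich.sh
0 7 * * * /opt/snowplow/drain-beam-enrich.sh
EOF

# Print the fragment for review before installing it
cat "$CRON_FILE"
```

Running `crontab "$CRON_FILE"` would install it, but note that this replaces the current user's existing crontab, so merge the lines in with `crontab -e` if you already have other entries.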

Hi @fwahlqvist and @robkingston,

On using Mini for a production pipeline… it’s not a recommended approach. We do not actively test Mini’s ability to perform in a production capacity, so you’re more likely to hit issues if you do, and not necessarily ones related to scaling.

Hey @fwahlqvist,

So I’m not sure if this fits your use case, but there’s a community project that might be worth looking at, which leverages serverless functions. I also can’t vouch for it myself since I haven’t looked into using it, but perhaps there are some others who can tell you more about their experience. I certainly think it’s a cool idea.

I believe it was built by someone working with charity-sector organisations, which have neither the budget nor the volumes to justify a full-fat pipeline. That sounds a lot like what you’re looking at.

I think this (spinning up Dataflow temporarily) is a really interesting approach. Dataflow tends to be the most expensive part of the pipeline at lower volumes, and there’s currently no autoscaling option that scales down to 0 workers, so you’re running at least 1 worker for each Dataflow job.

That said, there would be nothing to stop someone running Apache Beam on VMs directly rather than managed through Dataflow, though the management overhead generally isn’t worth it.
