BigQuery Loader and emr-etl-runner


Hey guys,
Excited about the new BigQuery Loader. I have a working enrichments process setup and I’m a little unclear if I can still use it. I have a working r88 emr-etl-runner. Do I need to upgrade to R110 Valle dei Templi to have this work.

Really appreciate it as always,


I mean, I have enriched data sitting in my s3 bucket waiting to be loaded into BigQuery.


Hi @morris206 - from my understanding the BigQuery loader that was recently released is designed for loading data directly from Pubsub to BigQuery via the Beam job. This means that there isn’t currently a mechanism for loading from S3 to BigQuery using the current loader however it may be possible to adapt the Beam job to use the filesystem as an input rather than a Pubsub topic.


Hi @morris206,

@mike is right, BigQuery Loader loads data straight from PubSub and has very different architecture from our AWS batch pipeline, so there’s no connection between EmrEtlRunner and Snowplow GCP pipeline.

I think the shortest way to load data from S3 to BigQuery would be to stream whole your enriched archive into PubSub topic listened by BigQuery Loader. Messages in input topic should have usual TSV+JSON format.


So what enrichment process is preferred?


Also, I have to change the structure of the tracker itself. Can I use the


still? like where it shows the cloudfront collector ID, can I just change that to the Pub/Sub ID?



Do I have to use the Scala tracker instead? I really don’t want to change it, I am using the JavaScript, which I am comfortable with, and I do not know anything about Scala or how it works


There’s a few questions here so let me know if I’m not addressing them all.

If you want to sink data into BigQuery the current preferred method is to run the entire pipeline on GCP. The pipeline on GCP is real time and differs quite a bit from the AWS pipeline.

The GCP pipeline consists of the scala stream collector, beam enrich and the BigQuery loader. Data moves between these components using PubSub.

The AWS (realtime) pipeline consists of the scala stream collector, stream enrich and from there various other components (such as the S3 loader, Elasticsearch loader, and EMR ETL runner that includes loading and shredding). Data moves between these components using Kinesis (or Kafka).

The Javascript tracker sends events to the Scala stream collector. You don’t need to use the Scala tracker unless you are planning on sending events in Scala to the tracker.

Perhaps post a new question for the Javascript tracker / Cloudfront question.


Here is what I came to. I am going to use the Pub/Sub and Scala Stream to get the data into BigQuery- bypassing and shutting down the AWS pipeline. Thank you for the info. Much appreciated.