BigQuery Loader and emr-etl-runner

morris206 · December 4, 2018, 10:19pm

Hey guys,
Excited about the new BigQuery Loader. I have a working enrichment process set up and I’m a little unclear whether I can still use it. I have a working r88 emr-etl-runner. Do I need to upgrade to R110 Valle dei Templi for this to work?

Really appreciate it as always,
Thanks,
Morris

morris206 · December 4, 2018, 10:27pm

I mean, I have enriched data sitting in my s3 bucket waiting to be loaded into BigQuery.

mike · December 5, 2018, 4:10am

Hi @morris206 - from my understanding, the recently released BigQuery Loader is designed to load data directly from PubSub into BigQuery via a Beam job. This means there isn’t currently a mechanism for loading from S3 into BigQuery with the current loader. However, it may be possible to adapt the Beam job to use the filesystem as an input rather than a PubSub topic.

anton · December 5, 2018, 6:15am

Hi @morris206,

@mike is right: BigQuery Loader loads data straight from PubSub and has a very different architecture from our AWS batch pipeline, so there’s no connection between EmrEtlRunner and the Snowplow GCP pipeline.

I think the shortest way to load data from S3 into BigQuery would be to stream your whole enriched archive into the PubSub topic that BigQuery Loader listens to. Messages in the input topic should have the usual TSV+JSON format.
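A minimal sketch of that replay, assuming the enriched files have already been copied from S3 to local disk (for example with `aws s3 sync`) and that each line of a file is one enriched event. `replayEnrichedFile` and the topic name are hypothetical names made up for illustration. The publisher is passed in as a function so the sketch runs with only Node's standard library; the wiring in the trailing comment assumes a recent version of the `@google-cloud/pubsub` client and is not executed here.

```javascript
const fs = require("fs");
const zlib = require("zlib");
const readline = require("readline");

// Publish every non-empty line of one enriched-event file as its own message.
// `publish` is any async function that accepts a Buffer.
// Returns the number of lines that were published.
async function replayEnrichedFile(path, publish) {
  let input = fs.createReadStream(path);
  // Enriched archives are often gzipped; decompress on the fly if so.
  if (path.endsWith(".gz")) {
    input = input.pipe(zlib.createGunzip());
  }
  const lines = readline.createInterface({ input, crlfDelay: Infinity });

  let count = 0;
  for await (const line of lines) {
    // Do not trim: trailing tabs are empty TSV fields and must be preserved.
    if (line.length === 0) continue;
    await publish(Buffer.from(line, "utf8"));
    count += 1;
  }
  return count;
}

// Possible wiring with the Pub/Sub client (assumption, not run here):
//   const { PubSub } = require("@google-cloud/pubsub");
//   const topic = new PubSub().topic("enriched-good"); // hypothetical topic name
//   await replayEnrichedFile("part-00000.gz", (data) => topic.publishMessage({ data }));
```

Publishing one message at a time keeps the sketch simple but will be slow for a large archive, so some batching or concurrency would probably be needed in practice.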

morris206 · December 6, 2018, 4:27pm

So what enrichment process is preferred?

morris206 · December 6, 2018, 6:34pm

Also, I have to change the structure of the tracker itself. Can I still use this snippet?

```
<script type="text/javascript">
;(function(p,l,o,w,i,n,g){if(!p[i]){p.GlobalSnowplowNamespace=p.GlobalSnowplowNamespace||[];
p.GlobalSnowplowNamespace.push(i);p[i]=function(){(p[i].q=p[i].q||[]).push(arguments)
};p[i].q=p[i].q||[];n=l.createElement(o);g=l.getElementsByTagName(o)[0];n.async=1;
n.src=w;g.parentNode.insertBefore(n,g)}}(window,document,"script","//d1fc8wv8zag5ca.cloudfront.net/2.9.0/sp.js","snowplow_name_here"));
</script>
```

Where it shows the CloudFront collector ID, can I just change that to the Pub/Sub ID?

Thanks,
morris

morris206 · December 6, 2018, 6:39pm

Do I have to use the Scala tracker instead? I really don’t want to change it. I am using the JavaScript tracker, which I am comfortable with, and I do not know anything about Scala or how it works.

mike · December 6, 2018, 10:26pm

There are a few questions here, so let me know if I’m not addressing them all.

If you want to sink data into BigQuery the current preferred method is to run the entire pipeline on GCP. The pipeline on GCP is real time and differs quite a bit from the AWS pipeline.

The GCP pipeline consists of the Scala Stream Collector, Beam Enrich and the BigQuery Loader. Data moves between these components using PubSub.

The AWS (realtime) pipeline consists of the Scala Stream Collector, Stream Enrich and, from there, various other components (such as the S3 Loader, the Elasticsearch Loader, and EmrEtlRunner, which includes shredding and loading). Data moves between these components using Kinesis (or Kafka).

The JavaScript tracker sends events to the Scala Stream Collector. You don’t need to use the Scala tracker unless you are planning on sending events from Scala code to the collector.
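For what it's worth, here is a sketch of where the collector is configured in the 2.x JavaScript tracker. The CloudFront URL in the loader snippet only says where `sp.js` is downloaded from; the collector endpoint is the third argument to `newTracker`. The tracker talks to the collector over HTTP and the collector writes to Pub/Sub, so the value to put there is the collector's hostname rather than a Pub/Sub ID. `collector.example.com`, the `sp` namespace and the `my-site` app ID are placeholders, and the fallback stub exists only so the snippet also runs outside a browser.

```javascript
// In the browser, the loader snippet defines window.snowplow_name_here as a
// function that queues calls until sp.js has loaded. Outside a browser
// (e.g. in Node), fall back to a local stub that records the calls.
const queued = [];
const snowplow =
  typeof window !== "undefined" && window.snowplow_name_here
    ? window.snowplow_name_here
    : (...args) => { queued.push(args); };

// Placeholder hostname of a Scala Stream Collector running on GCP.
const COLLECTOR_HOST = "collector.example.com";

// newTracker(tracker namespace, collector endpoint, configuration)
snowplow("newTracker", "sp", COLLECTOR_HOST, {
  appId: "my-site", // placeholder application ID
  platform: "web",
});

// Send a page view to the collector configured above.
snowplow("trackPageView");
```

With this layout the loader snippet itself can stay as it is; only the `newTracker` endpoint changes when the collector moves.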

Perhaps post a new question for the Javascript tracker / Cloudfront question.

morris206 · December 6, 2018, 11:23pm

Here is what I came to. I am going to use Pub/Sub and the Scala Stream Collector to get the data into BigQuery, bypassing and shutting down the AWS pipeline. Thank you for the info. Much appreciated.

Morris
