Enrich (AWS) to BigQuery Loader


#1

Hi. Currently, my Scala Stream Collector runs in AWS and publishes to Google Pub/Sub, which Beam Enrich in GCP subscribes to. To optimize costs, I would also like to run Stream Enrich in AWS and have it push to topics (Kinesis/Pub/Sub) that BigQuery Loader in GCP can pick up using Dataflow. Is this possible? If not, what are the other possibilities? My goal is to keep as much as possible in AWS, with BigQuery as the final storage.
Thanks


#2

Hey @crimsonb,

This is quite an atypical case, so apologies that I’m not able to give you a confident answer here.

Neither the AWS nor the GCP version of the pipeline was built to be modular in this way. They’re designed for stability, scalability and data completeness, principles that feel a bit at odds with plugging components together across different platforms.

I’m not at all sure that the loader can be set up to accept data in this way, so it would need to be tested. What you’d need to do is send data from the Kinesis enriched stream to Pub/Sub, make sure the Pub/Sub message format is exactly the same as the output of the GCP enriched Pub/Sub topic, and pipe it to the loader.
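A minimal sketch of that relay step, assuming each Kinesis record carries one enriched TSV line and that the loader expects the same TSV bytes as the Pub/Sub message data (this would need checking against the real GCP enriched topic). The names `relay_records` and `publish` are hypothetical: `publish` stands in for whatever wraps the Pub/Sub client, and `records` for the list returned by a Kinesis `GetRecords` call.

```python
from typing import Callable, Iterable, Mapping


def relay_records(
    records: Iterable[Mapping[str, bytes]],
    publish: Callable[[bytes], None],
) -> int:
    """Forward enriched events from Kinesis-style records to a publisher.

    Each record is assumed to look like {"Data": b"<enriched TSV line>"}.
    `publish` is a hypothetical callable that sends one payload to Pub/Sub.
    """
    sent = 0
    for record in records:
        payload = record["Data"]
        # Skip empty payloads rather than publishing blank messages.
        if not payload.strip():
            continue
        # Forward the bytes untouched so the TSV stays byte-identical.
        publish(payload)
        sent += 1
    return sent


# Local dry run with an in-memory list standing in for Pub/Sub.
outbox = []
count = relay_records(
    [{"Data": b"app\tweb\t2020-01-01"}, {"Data": b""}],
    outbox.append,
)
```

Passing the publisher in as a callable keeps the relay testable without cloud credentials; retries, batching and checkpointing are deliberately left out.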

Another, probably much safer, approach is to run the pipeline on AWS without setting up a database load, then send the enriched events from S3 to Cloud Storage (potentially reformatting them or running shredding on AWS to make life easier), and either query the data directly from Cloud Storage or set up a scheduled job to load it from Cloud Storage into BigQuery. You lose real-time loading, but you’re not maintaining a really complex infrastructure.
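As a rough illustration of the scheduled-load option, here is a sketch that builds the body of a BigQuery load job for files already copied to Cloud Storage. It assumes the events were reformatted to newline-delimited JSON before the copy; the helper `build_load_job` and the project, dataset, table and bucket names are all hypothetical, and submitting the job (and scheduling it) is left out.

```python
import json


def build_load_job(project: str, dataset: str, table: str, uri: str) -> dict:
    """Build a BigQuery load-job body for files sitting in Cloud Storage."""
    return {
        "configuration": {
            "load": {
                "sourceUris": [uri],
                "destinationTable": {
                    "projectId": project,
                    "datasetId": dataset,
                    "tableId": table,
                },
                # Assumes events were reformatted to one JSON object per line.
                "sourceFormat": "NEWLINE_DELIMITED_JSON",
                # Let BigQuery infer the schema from the files.
                "autodetect": True,
                # Append each scheduled batch instead of overwriting the table.
                "writeDisposition": "WRITE_APPEND",
            }
        }
    }


# Hypothetical project, dataset, table and bucket names.
job = build_load_job(
    "my-project", "snowplow", "events", "gs://my-bucket/enriched/*.json"
)
print(json.dumps(job, indent=2))
```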

However, there are a number of reasons why it’s worth rethinking your approach.

Firstly, you’ll still need a loader, a mutator and an Iglu instance in GCP. The Iglu instance will need to be exactly the same as the one in AWS (or you’ll need to have the mutator speak to AWS). There’s a lot that can go wrong here, and you might end up with more infrastructure to maintain and more complex maintenance. It’s definitely worth weighing how much time you can afford to spend maintaining this kind of setup and fixing issues against what you might save on resources.

Secondly, I’m not entirely confident that running the enrich component on AWS is going to work out cheaper - in my experience GCP has turned out to be cheaper for several of our customers, and the pricing model is really straightforward compared to AWS. Those are very high volume pipelines (thousands of events per second), so potentially it’s different at a lower scale (in which case why use BigQuery at all?).

I hope that’s helpful - sorry I can’t be more specific about what you directly want to do.

Best,


#3

Thank you @Colm for the detailed reply nonetheless. I’m now definitely considering my options along the lines you have suggested.
Just a follow-up: can I not make do without the loader and mutator if I am syncing from S3? As far as I can tell, BigQuery can connect directly to an external source (Cloud Storage) and has schema auto-detect. Am I missing something here?
Thanks


#4

That’s a possible option, but I believe you’ll have to handle the nested/custom data first, and make sure the table you’re loading into is correctly structured (custom events and contexts, multiple contexts, etc.).
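To make the nested-data point concrete, here is a small sketch of one way to regroup a contexts field before loading, so that each schema ends up in its own repeated column. The function `group_contexts`, the sample schemas and the column naming are illustrative assumptions, not the naming the real loader and mutator use.

```python
import json
from collections import defaultdict


def group_contexts(contexts_json: str) -> dict:
    """Group context entities by schema so each schema maps to one repeated column."""
    envelope = json.loads(contexts_json)
    grouped = defaultdict(list)
    for entity in envelope["data"]:
        # "iglu:com.acme/page/jsonschema/1-0-0" -> vendor, name, format, version
        vendor, name, _format, version = entity["schema"][len("iglu:"):].split("/")
        major = version.split("-")[0]
        # Hypothetical column naming; the real loader's naming may differ.
        column = f"contexts_{vendor}_{name}_{major}".replace(".", "_")
        grouped[column].append(entity["data"])
    return dict(grouped)


# Made-up sample: two entities of one schema, one entity of another.
sample = json.dumps({
    "schema": "iglu:com.snowplowanalytics.snowplow/contexts/jsonschema/1-0-0",
    "data": [
        {"schema": "iglu:com.acme/page/jsonschema/1-0-0", "data": {"id": "a"}},
        {"schema": "iglu:com.acme/page/jsonschema/1-0-0", "data": {"id": "b"}},
        {"schema": "iglu:com.acme/user/jsonschema/2-0-1", "data": {"plan": "free"}},
    ],
})
columns = group_contexts(sample)
```

Because several entities can share a schema, each column holds a list, which is why the target table needs repeated record fields rather than flat columns.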

But basically all the complication here comes from loading to BigQuery - the mutator and loader do this work for you. If you wanted to you could pipe the data to Cloud Storage and just create external/native tables from it in order to query.
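For the external-table route, a sketch of what the table definition might look like, assuming newline-delimited JSON files in Cloud Storage and schema auto-detect. The helper `build_external_table` and all resource names are hypothetical; creating the table from this definition is not shown.

```python
def build_external_table(project: str, dataset: str, table: str, uri: str) -> dict:
    """Build a BigQuery table resource that reads directly from Cloud Storage."""
    return {
        "tableReference": {
            "projectId": project,
            "datasetId": dataset,
            "tableId": table,
        },
        "externalDataConfiguration": {
            "sourceUris": [uri],
            # Assumes one JSON object per line in each file.
            "sourceFormat": "NEWLINE_DELIMITED_JSON",
            # Ask BigQuery to infer the schema instead of declaring it.
            "autodetect": True,
        },
    }


# Hypothetical project, dataset, table and bucket names.
external = build_external_table(
    "my-project", "snowplow", "events_external", "gs://my-bucket/enriched/*.json"
)
```

With an external table the data stays in Cloud Storage and is read at query time, whereas a native table requires a load step first.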

If that’s what you’re going for I think the simplest way to go would be to run shredding on the data in S3 before you pipe it to Cloud Storage. That way you won’t have to handle the nested data, but the data will be in a federated table structure.

It’s probably worth testing both with and without shred to see what works best. Am keen to hear how you get on with this so keep us posted!