There are a few questions here, so let me know if I’m not addressing them all.
If you want to sink data into BigQuery, the currently preferred approach is to run the entire pipeline on GCP. The GCP pipeline is real-time and differs quite a bit from the AWS pipeline.
The GCP pipeline consists of the Scala Stream Collector, Beam Enrich, and the BigQuery Loader. Data moves between these components via Pub/Sub.
The AWS (real-time) pipeline consists of the Scala Stream Collector, Stream Enrich, and from there various other components (such as the S3 Loader, the Elasticsearch Loader, and the EMR ETL Runner, which handles shredding and loading). Data moves between these components via Kinesis (or Kafka).
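To make the shared shape of both pipelines concrete, here is a minimal conceptual sketch (plain Python, not Snowplow code): each stage reads from an inbound queue and writes to an outbound one, the way the real components are decoupled by Pub/Sub topics on GCP or Kinesis streams on AWS. The stage names and event fields are illustrative assumptions, not the actual component interfaces.

```python
import queue

def collector(raw_event: str) -> dict:
    # Stands in for the Scala Stream Collector: wrap the raw payload.
    return {"payload": raw_event}

def enrich(event: dict) -> dict:
    # Stands in for Beam Enrich (GCP) / Stream Enrich (AWS):
    # attach derived context to the collected event.
    event["enriched"] = True
    return event

def loader(event: dict, sink: list) -> None:
    # Stands in for a loader (BigQuery, S3, Elasticsearch):
    # write the enriched event to the destination.
    sink.append(event)

def run_pipeline(raw_events, sink):
    # The two queues play the role of the "collected" and "enriched"
    # Pub/Sub topics or Kinesis streams between the components.
    collected = queue.Queue()
    enriched = queue.Queue()
    for raw in raw_events:
        collected.put(collector(raw))
    while not collected.empty():
        enriched.put(enrich(collected.get()))
    while not enriched.empty():
        loader(enriched.get(), sink)

sink = []
run_pipeline(["page_view", "link_click"], sink)
print(sink)
```

The point of the sketch is only that the stages never call each other directly; swapping the queue for a different transport (Pub/Sub, Kinesis, Kafka) is what distinguishes the GCP and AWS flavours of the pipeline.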