Use Micro-Batching instead of streaming inserts

Hi,

Dataflow jobs offer the possibility of sharding. Would it be possible to use this for micro-batching instead of streaming inserts into BigQuery? This could save some costs, as streaming inserts are billed separately…

Example Dataflow job, which loads data in micro-batches:

.apply("Write to Custom BigQuery",
BigQueryIO.writeTableRows()
.withNumFileShards(30)
.withTriggeringFrequency(Duration.standardSeconds(90))
.withMethod(BigQueryIO.Write.Method.FILE_LOADS)
.withSchema(tableSchema)
.to(table);

Cheers
Andreas

Hi @volderette,

Snowplow BigQuery Loader supports batch mode since version 0.1.0: https://github.com/snowplow-incubator/snowplow-bigquery-loader/wiki/Setup-guide#loading-mode. However, I have to admit we have never used it internally, and I have a vague memory of someone on this forum complaining that it was throwing OOM errors.

If you have files on Google Cloud Storage, then loading should be straightforward, as BigQuery handles load jobs quickly even for bigger files. It is even better if the data is partitioned by date, in which case you could safely reload the files belonging to a single date multiple times without having to worry about duplication. Though I'm not certain this is supported by Dataflow.
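For illustration only, here is a minimal sketch of that reload pattern outside of Dataflow, using the google-cloud-bigquery Java client; the bucket, dataset, table and date are made-up placeholders. Targeting the partition decorator (events$20190101) with WRITE_TRUNCATE replaces only that day's partition, which is what makes rerunning the load safe against duplicates:

import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.FormatOptions;
import com.google.cloud.bigquery.Job;
import com.google.cloud.bigquery.JobInfo;
import com.google.cloud.bigquery.LoadJobConfiguration;
import com.google.cloud.bigquery.TableId;

public class PartitionReload {
  public static void main(String[] args) throws InterruptedException {
    BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();

    // "$20190101" targets a single day's partition of a date-partitioned table
    TableId partition = TableId.of("my_dataset", "events$20190101");

    LoadJobConfiguration loadConfig =
        LoadJobConfiguration.newBuilder(partition, "gs://my-bucket/events/2019-01-01/*.json")
            .setFormatOptions(FormatOptions.json())
            // WRITE_TRUNCATE overwrites the previous contents of this partition,
            // so reloading the same date multiple times does not duplicate rows
            .setWriteDisposition(JobInfo.WriteDisposition.WRITE_TRUNCATE)
            .build();

    // Run the load job and wait for it to finish
    Job job = bigquery.create(JobInfo.of(loadConfig)).waitFor();
    if (job == null || job.getStatus().getError() != null) {
      throw new RuntimeException("Load job failed: "
          + (job == null ? "job no longer exists" : job.getStatus().getError()));
    }
  }
}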