Dataflow jobs support sharding. Could this be used for micro-batching instead of streaming inserts into BigQuery? That could save some cost, since streaming inserts are billed separately…
Example Dataflow job that loads data in micro-batches:
.apply("Write to Custom BigQuery",
    BigQueryIO.writeTableRows()
        .to(table)
        .withSchema(tableSchema)
        // Use periodic batch load jobs instead of streaming inserts
        .withMethod(BigQueryIO.Write.Method.FILE_LOADS)
        // With FILE_LOADS on an unbounded source, a triggering
        // frequency and number of file shards must be set
        .withTriggeringFrequency(Duration.standardSeconds(90))
        .withNumFileShards(30));
If you have files on Google Cloud Storage, loading should be fairly straightforward; BigQuery load jobs are fast even for larger files. Even better if the data is partitioned by date: then you could safely reload only the files belonging to a single date, multiple times, without having to worry about duplication, by truncating just that date's partition on each load. Though I'm not certain this is supported by Dataflow.
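One way the idempotent per-date reload could be sketched in Beam (the bucket path, table names, and `JsonToTableRowFn` parse step below are placeholders, not from the original): writing to a partition decorator (`table$YYYYMMDD`) with `WRITE_TRUNCATE` replaces only that partition, so rerunning the load for one date does not create duplicates.

```java
// Sketch: reload one day's files into that day's partition of a
// date-partitioned table. Repeated runs truncate and replace only
// the targeted partition, so the load is safe to retry.
p.apply("Read one day's files",
        TextIO.read().from("gs://my-bucket/events/2019-01-15/*.json"))
 .apply("Parse JSON to TableRow",
        ParDo.of(new JsonToTableRowFn()))  // hypothetical parse DoFn
 .apply("Load into single date partition",
        BigQueryIO.writeTableRows()
            // "$20190115" is the partition decorator for 2019-01-15
            .to("my-project:my_dataset.my_table$20190115")
            .withSchema(tableSchema)
            .withMethod(BigQueryIO.Write.Method.FILE_LOADS)
            // Truncate that partition before loading: idempotent reloads
            .withWriteDisposition(
                BigQueryIO.Write.WriteDisposition.WRITE_TRUNCATE));
```

This is a bounded (batch) pipeline, so no triggering frequency is needed; the micro-batching question above applies to the unbounded case.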