Partitioning/Clustering with BigQuery Loader

Hi all,

I have a question about BigQuery Loader.
I can see in the documentation that there is available specification for column partition.

BigQuery also supports table clustering that can provide some performance and cost improvement.

Do you know if this cluster specification is with any modifications available?
Are there any technical limitations here that I might miss?

Thanks in advance.

Hi ral,

The clustered column are configured the moment the table is configured.

Is possible to specify up to four top-level columns and should be specified in descending priority order.

If you already have the table created, it will need to be recreated, is the only limitation I can see at the moment, other than that, it should work without further modifications.

Regards

3 Likes

Hi @ral
Welcome to the Snowplow Community.
Cheers,
Eddie

1 Like

Thanks for the response @jrluis

As I can see, one of the options to introduce clustering to the destination table would be to create it ready before inserting data into it. Do I get this one correctly?

Is it somehow also possible to define column clustering order as part of BigQuery Loaders mutator (sub-command create)?

Thanks @EddieM !

Hi ral,

Yes, is a good option to manualy create the table before inserting the data.

Regarding the change in the mutator sub command, I believe is not currently on the roadmap, but we are happy to review any PR that would add that feature.

Regards

1 Like

Just for those stumbling up on this question at later point.

In scenario where you have already set up schema with partitioning, but would like to add clustering, the following can be considered.

It is possible to use DDL to create both BigQuery Partitions and Clusters.
In this Stackoverflow question an approach to derive created tables DDL from metadata can be seen.

SELECT
 table_name, ddl
FROM
 `bigquery-public-data`.census_bureau_usa.INFORMATION_SCHEMA.TABLES;

By this you can find the DDL for generating same table schema.
In order to add clustering, follow the Google Clouds documentation.

-- Set up a table with clustering.
CREATE TABLE myDataset.data (column1 INT64, column2 INT64)
PARTITION BY _PARTITIONDATE
CLUSTER BY column1, column2;
1 Like

Regarding the clustering, you can enable the clustering after you’ve created the table. Here is the command you can use in your cli.

bq update --clustering_fields=${CLUSTERING_FIELD} ${DATASET}.${TABLE}
2 Likes

This definitely works well - but it should be noted that this will only cluster data after this action has been performed - any data that is already in that table will remain unclustered.

2 Likes