Google Cloud Platform data pipeline optimization

Hi @anton,

It works now! I pulled an all-nighter and was super tired, so it was probably just some silly mistake that caused the issue. I’d now like to use this thread to optimize the setup, since there are still a few minor issues.

I really like Simo’s automation approach and mostly followed the tutorial below, with some adjustments to reduce costs and add enrichments:

As you can see below, I added the enrichment configurations and downsized the Compute Engine instances (I also created a /srv/snowplow directory for everything Snowplow-related):

#!/bin/bash
enrich_version="0.1.0"
bq_version="0.1.0"
bucket_name="BUCKET-NAME"
project_id="PROJECT-ID"
region="europe-west1"

sudo apt-get update
sudo apt-get -y install default-jre
sudo apt-get -y install unzip

mkdir /srv/snowplow
cd /srv/snowplow

wget https://dl.bintray.com/snowplow/snowplow-generic/snowplow_beam_enrich_$enrich_version.zip
unzip snowplow_beam_enrich_$enrich_version.zip

wget https://dl.bintray.com/snowplow/snowplow-generic/snowplow_bigquery_loader_$bq_version.zip
unzip snowplow_bigquery_loader_$bq_version.zip

wget https://dl.bintray.com/snowplow/snowplow-generic/snowplow_bigquery_mutator_$bq_version.zip
unzip snowplow_bigquery_mutator_$bq_version.zip

gsutil cp gs://$bucket_name/iglu_resolver.json .
gsutil cp gs://$bucket_name/bigquery_config.json .
gsutil cp -r gs://$bucket_name/enrichments .

# Start Beam Enrich as a streaming Dataflow job (reads raw events from good-sub)
./beam-enrich-$enrich_version/bin/beam-enrich --runner=DataFlowRunner --project=$project_id --streaming=true --region=$region --gcpTempLocation=gs://$bucket_name/temp-files --job-name=beam-enrich --raw=projects/$project_id/subscriptions/good-sub --enriched=projects/$project_id/topics/enriched-good --bad=projects/$project_id/topics/enriched-bad --resolver=iglu_resolver.json --workerMachineType=n1-standard-1 --enrichments=enrichments

# One-off: create the BigQuery table described in bigquery_config.json
./snowplow-bigquery-mutator-$bq_version/bin/snowplow-bigquery-mutator create --config $(cat bigquery_config.json | base64 -w 0) --resolver $(cat iglu_resolver.json | base64 -w 0)

# Keep the mutator listening for new types so it can add columns (runs in the background)
./snowplow-bigquery-mutator-$bq_version/bin/snowplow-bigquery-mutator listen --config $(cat bigquery_config.json | base64 -w 0) --resolver $(cat iglu_resolver.json | base64 -w 0) &

# Start the BigQuery Loader as a second streaming Dataflow job
./snowplow-bigquery-loader-$bq_version/bin/snowplow-bigquery-loader --config=$(cat bigquery_config.json | base64 -w 0) --resolver=$(cat iglu_resolver.json | base64 -w 0) --runner=DataFlowRunner --project=$project_id --region=$region --gcpTempLocation=gs://$bucket_name/temp-files --workerMachineType=n1-standard-1
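
To confirm that both streaming jobs actually came up on the smaller n1-standard-1 workers, I check the Dataflow job list from the same instance. Something like the following should work (flag names may differ slightly between gcloud versions):

# List the Dataflow jobs currently running in the region used above
gcloud dataflow jobs list --region=europe-west1 --status=active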

My bigquery_config.json looks like this:

{
  "schema": "iglu:com.snowplowanalytics.snowplow.storage/bigquery_config/jsonschema/1-0-0",
  "data": {
    "name": "Snowplow Atomic Events Data",
    "id": "RANDOM-UUID",
    "projectId": "PROJECT-ID",
    "datasetId": "snowplow_dataset",
    "tableId": "all_data",
    "input": "enriched-good-sub",
    "typesTopic": "bq-types",
    "typesSubscription": "bq-types-sub",
    "badRows": "bq-bad-rows",
    "failedInserts": "bq-failed-inserts",
    "load": {
      "mode": "STREAMING_INSERTS",
      "retry": false
    },
    "purpose": "ENRICHED_EVENTS"
  }
}
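
One note in case anyone copies this config: as far as I understand it, input and typesSubscription are Pub/Sub subscriptions, the remaining names are Pub/Sub topics, and id is just a random UUID (RANDOM-UUID above is a placeholder). This is roughly how I created them beforehand:

# Subscription on the enriched-good topic that the loader reads from
gcloud pubsub subscriptions create enriched-good-sub --topic=enriched-good

# Topic plus subscription the mutator uses to pick up new column types
gcloud pubsub topics create bq-types
gcloud pubsub subscriptions create bq-types-sub --topic=bq-types

# Topics for rows that fail validation or insertion
gcloud pubsub topics create bq-bad-rows
gcloud pubsub topics create bq-failed-inserts

# Fresh random UUID for the "id" field
uuidgen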

My enrichment configurations contain the following:

{
  "schema": "iglu:com.snowplowanalytics.snowplow/anon_ip/jsonschema/1-0-0",
  "data": {
    "name": "anon_ip",
    "vendor": "com.snowplowanalytics.snowplow",
    "enabled": true,
    "parameters": {
      "anonOctets": 1
    }
  }
}

{
  "schema": "iglu:com.snowplowanalytics.snowplow/campaign_attribution/jsonschema/1-0-1",
  "data": {
    "name": "campaign_attribution",
    "vendor": "com.snowplowanalytics.snowplow",
    "enabled": true,
    "parameters": {
      "mapping": "static",
      "fields": {
        "mktMedium": ["utm_medium"],
        "mktSource": ["utm_source"],
        "mktTerm": ["utm_term"],
        "mktContent": ["utm_content"],
        "mktCampaign": ["utm_campaign", "cmp"]
      }
    }
  }
}

{
  "schema": "iglu:com.snowplowanalytics.snowplow/referer_parser/jsonschema/1-0-0",
  "data": {
    "name": "referer_parser",
    "vendor": "com.snowplowanalytics.snowplow",
    "enabled": true,
    "parameters": {
      "internalDomains": [
        "FQDN-1",
        "FQDN-2",
        "FQDN-3"
      ]
    }
  }
}

{
  "schema": "iglu:com.snowplowanalytics.snowplow/ua_parser_config/jsonschema/1-0-0",
  "data": {
    "vendor": "com.snowplowanalytics.snowplow",
    "name": "ua_parser_config",
    "enabled": true,
    "parameters": {}
  }
}

Open issues:

  1. The ua_parser_config enrichment fails when I try to configure the database as described here (it complains that 2 parameters were given while only 0 are accepted): ua parser enrichment · snowplow/snowplow Wiki · GitHub

  2. When I enable the ip_lookups enrichment with the configuration below, I get an error in Dataflow complaining that http is an unknown file type (or something along those lines; I will recreate it and share the exact error message). I guess it is related to the external resource, although this kind of remote fetch works fine with Iglu. What I plan to try next is noted below the config:

{
  "schema": "iglu:com.snowplowanalytics.snowplow/ip_lookups/jsonschema/2-0-0",
  "data": {
    "name": "ip_lookups",
    "vendor": "com.snowplowanalytics.snowplow",
    "enabled": true,
    "parameters": {
      "geo": {
        "database": "GeoLite2-City.mmdb",
        "uri": "http://snowplow-hosted-assets.s3.amazonaws.com/third-party/maxmind"
      }
    }
  }
}
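
My working theory is that Beam Enrich cannot fetch the database over http, so my next attempt will be to host the file in my own bucket and point the uri at a gs:// path instead. I still need to verify that Beam Enrich resolves gs:// URIs at all, so treat this as a sketch:

# Stage the MaxMind database in my own bucket
# (URL assembled from the uri and database values in the config above)
wget http://snowplow-hosted-assets.s3.amazonaws.com/third-party/maxmind/GeoLite2-City.mmdb
gsutil cp GeoLite2-City.mmdb gs://BUCKET-NAME/maxmind/
# ...and then set "uri" in the enrichment to "gs://BUCKET-NAME/maxmind"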

  3. Can you please explain how to use time partitioning? The built-in option is to partition by ingestion time, but your recommendation is to use derived_tstamp. In order to select derived_tstamp as the partitioning field, I added it to the table schema manually (I didn’t add any other fields). Is that the correct approach? Should all the missing fields then be created automatically by the mutator? Unfortunately it didn’t work, so I fell back to the built-in ingestion-time partitioning, but with no luck either. However, this can also be attributed to my all-nighter. :wink: The command I used is right below these questions.

  4. When updating the BigQuery configuration for a new project, dataset, or table, would you recommend generating a new UUID?
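
For question 3, this is roughly the command I used to create the table with derived_tstamp as the partitioning field before letting the mutator fill in the remaining columns; I’m not sure it is the right approach, so please treat it as a sketch (project, dataset and table names as in my config above):

# Create the events table partitioned by day on derived_tstamp,
# defining only that one column and leaving the rest to the mutator
bq mk --table \
  --time_partitioning_type=DAY \
  --time_partitioning_field=derived_tstamp \
  PROJECT-ID:snowplow_dataset.all_data \
  derived_tstamp:TIMESTAMP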

Final thoughts:

  1. I support your plan to remove the version numbers from all self-describing schemas.

  2. Have you ever considered trying the Google App Engine flexible environment instead of Compute Engine instances? Unfortunately, I’m not a JVM expert, but it looks to me as though it could work; see running JAR files: The Java 8 runtime | Google App Engine flexible environment docs | Google Cloud. If I find some time over the holidays, I will try to deploy at least the collector to GAE Flex, since I generally prefer managed services.

Thanks again for this great piece of software and your work!!! :slight_smile:

Best regards,
Ian