Backlog for BQ Stream Loader on App Engine

Hey,

I have the BQ Stream Loader set up on App Engine flexible. The current scaling metric is CPU utilization, set at 10% (the loader doesn't scale enough with other values).

I performed a load test on the pipeline at 8k requests/second for 90 minutes. The BQ loader scaled to 163 instances (max set to 200), but a huge backlog built up and kept increasing until the load test ended. Once the test was over, the backlog did start decreasing.

Backlog (screenshot attached):

The Dockerfile for the BQ Stream Loader is:

FROM openjdk:18-alpine

COPY snowplow-bigquery-streamloader-1.1.0.jar snowplow-bigquery-streamloader-1.1.0.jar
COPY config.hocon config.hocon
COPY resolver.json resolver.json
COPY script.sh script.sh

RUN apk add jq

CMD sh script.sh

The script.sh contents are:

jq '.data.repositories[0].connection.http.uri=env.SCHEMA_BUCKET' resolver.json >> tmp.json && mv tmp.json resolver.json
java -jar snowplow-bigquery-streamloader-1.1.0.jar --config $(cat config.hocon | base64 -w 0) --resolver $(cat resolver.json | base64 -w 0)
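
For context, the jq line just injects the SCHEMA_BUCKET environment variable into the Iglu resolver before start-up. A rough illustration of what it does (the repository entry and URL below are placeholders, not my real config):

# resolver.json is the standard Iglu resolver config, shaped roughly like:
#   {"schema": "iglu:com.snowplowanalytics.iglu/resolver-config/jsonschema/1-0-1",
#    "data": {"cacheSize": 500,
#             "repositories": [{"name": "My repo", "priority": 0,
#                               "vendorPrefixes": ["com.example"],
#                               "connection": {"http": {"uri": "http://to-be-replaced"}}}]}}
# The filter overwrites that uri with whatever SCHEMA_BUCKET holds, e.g.:
SCHEMA_BUCKET=https://my-schema-bucket.example.com \
  jq '.data.repositories[0].connection.http.uri=env.SCHEMA_BUCKET' resolver.json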

The App Engine service config is:

runtime: custom
api_version: '1.0'
env: flexible
threadsafe: true
env_variables: ...
automatic_scaling:
  cool_down_period: 120s
  min_num_instances: 2
  max_num_instances: 200
  cpu_utilization:
    target_utilization: 0.1
network: ...
liveness_check:
  initial_delay_sec: 300
  check_interval_sec: 30
  timeout_sec: 4
  failure_threshold: 4
  success_threshold: 2
readiness_check:
  check_interval_sec: 5
  timeout_sec: 4
  failure_threshold: 2
  success_threshold: 2
  app_start_timeout_sec: 300
service_account: ...

There are errors showing up for the BQ Stream Loader (see attached screenshot).

Is there a way to have the BQ loader handle the load more efficiently, so the backlog doesn't keep growing while under load?

Could you help with this?

Have you figured out where the bottleneck is (e.g. processing, acknowledgement, sinking, etc.)? 163 is a huge number of instances for that amount of data, so I wouldn't be wildly surprised if the instances start to trip over each other and potentially hit some quotas as Pub/Sub figures out how to distribute messages between that many subscriber clients on a single subscription.

The deadline_exceeded errors suggest that something in your processing is slowing down between receiving an event and acking it, so the ack doesn't happen within the timeout window.
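
If it helps, one quick way to sanity-check the subscription side is something along these lines (the project and subscription names below are placeholders for your loader's input subscription):

# Check the ack deadline on the loader's input subscription - this is the
# timeout window the deadline_exceeded errors relate to
gcloud pubsub subscriptions describe projects/YOUR_PROJECT/subscriptions/YOUR_ENRICHED_SUB \
  --format='yaml(ackDeadlineSeconds)'

# The backlog itself is the Cloud Monitoring metric
#   pubsub.googleapis.com/subscription/num_undelivered_messages
# which you can chart in Metrics Explorer filtered to that subscription.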

Hi @mike,

Thank you for the response.

I checked the quota limits for Pub/Sub and haven't come close to hitting them. The BigQuery limits are also fine.

So I tried updating the resources for the instances: I set each instance to use 1 core and 1.6 GB of RAM, and it worked perfectly!
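
For reference, the change is just a resources block in the flexible environment's app.yaml, roughly like this (field names per the App Engine flexible docs, values as described above):

resources:
  cpu: 1
  memory_gb: 1.6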

Thank you!