GCS Loader upgrade?

Hey,

Will there be an update for the GCS Loader? The current version only runs as a Dataflow job. Would it be possible to make it work on any other service?

Thank you!

Hi @siv, that’s a good question! We talk about this often at Snowplow, because the GCS Loader is the last remaining Snowplow component that still relies on Dataflow. Personally, I would love to see a non-Dataflow solution.

On the other hand, it is not on our immediate roadmap, just because we are prioritising other things at the moment. So it might be a while until this idea becomes reality.

It’d certainly be possible to have this run on another service as the Dataflow job isn’t doing anything particularly complicated (reading from a PubSub subscription, and then writing out partitioned data to GCS). This could likely be moved to a scheduled job that runs on a virtual machine / within Kubernetes etc.
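
A minimal sketch of that pull, partition, and write loop, under some loud assumptions: `pull`, `upload`, and `ack` are hypothetical callables standing in for Pub/Sub and GCS client calls, the message tuple is a simplified stand-in, and the object naming is illustrative rather than the loader's actual scheme. It only acknowledges messages after every file in the batch has been written.

```python
from collections import defaultdict
from datetime import datetime, timedelta, timezone
from typing import Callable, Dict, Iterable, List, Tuple

# (ack id, publish time, payload): a simplified stand-in for a Pub/Sub message.
# Publish times are assumed to be timezone-aware.
Message = Tuple[str, datetime, str]

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def window_start(ts: datetime, window: timedelta) -> datetime:
    """Floor a timestamp to the start of its fixed-size window."""
    return ts - ((ts - EPOCH) % window)


def object_name(start: datetime, prefix: str = "output", suffix: str = ".txt") -> str:
    """Build a date-partitioned object name (illustrative naming only)."""
    return f"{start:%Y/%m/%d/%H}/{prefix}-{start:%Y%m%dT%H%M%S}{suffix}"


def partition(messages: Iterable[Message], window: timedelta) -> Dict[str, str]:
    """Group payloads by window and join each group into one file body."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for _ack_id, ts, payload in messages:
        groups[object_name(window_start(ts, window))].append(payload)
    return {name: "\n".join(lines) + "\n" for name, lines in groups.items()}


def run_once(
    pull: Callable[[], Iterable[Message]],
    upload: Callable[[str, str], None],
    ack: Callable[[List[str]], None],
    window: timedelta = timedelta(minutes=5),
) -> int:
    """Pull one batch, write one object per window, then acknowledge."""
    batch = list(pull())
    files = partition(batch, window)
    for name, body in files.items():
        upload(name, body)
    # Acknowledge only after all uploads succeeded, so a failed write
    # leaves the messages on the subscription for redelivery.
    if batch:
        ack([ack_id for ack_id, _ts, _payload in batch])
    return len(files)
```

A scheduler (cron on a VM, a Kubernetes CronJob, and so on) could call `run_once` repeatedly, with the three callables wired to real client libraries.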

Thank you @istreeter and @mike!

I’ll try running it on a VM.

Hey,

So I’ve tried running the GCS Loader on App Engine by building a custom Docker image for it. The image runs the GCS Loader launcher script.

The Docker image runs successfully both locally and on App Engine. It acknowledges events from the Pub/Sub subscription, but it doesn’t write any files to the GCS bucket.

Could someone help me out?

The reason I’m trying to run it outside of a Dataflow job is here.

Contents of the Dockerfile:

FROM openjdk:8-jdk-alpine

COPY snowplow-google-cloud-storage-loader-0.3.2/bin /bin
COPY snowplow-google-cloud-storage-loader-0.3.2/lib /lib
COPY script.sh script.sh

RUN apk update && apk add bash

CMD sh script.sh

Contents of script.sh:

./bin/snowplow-google-cloud-storage-loader \
--project="${PROJECT}" \
--runner=DirectRunner \
--inputSubscription="${INPUT_PUBSUB_SUB}" \
--outputDirectory="${GCS_BUCKET}" \
--outputFilenamePrefix=output \
--shardTemplate=-W-P-SSSSS-of-NNNNN \
--outputFilenameSuffix=.txt \
--windowDuration="${WINDOW_DURATION}" \
--compression=none \
--numShards=1 \
--dateFormat=YYYY/MM/dd/HH/

Thank you!

Hi @siv ,

There is no need to create a new Docker image; you can use the original one directly. Instructions can be found here under the Docker image section.

That being said, whichever method you choose to launch it, the loader will still run as a Dataflow job until we write a standalone app to replace it.

Hi @BenB ,

Thank you for the response.

So the GCS Loader cannot be run locally with the DirectRunner?

No, we don’t support the DirectRunner. Using it would require declaring it in the loader’s dependencies, which we don’t do.

Okay, got it. Thank you, @BenB!