Compression not working when running Google Cloud Storage Loader in GCP Dataflow

I’m running version 0.3.1 of the Google Cloud Storage Loader as a Dataflow job and am providing the following arguments:

runner="DataFlowRunner"
jobName="my-storage-loader"
inputSubscription="enriched-good-sub",
outputDirectory="gs://my_bucket_name",
outputFilenamePrefix="output", 
shardTemplate="-W-P-SSSSS-of-NNNNN", 
outputFilenameSuffix=".gzip", 
windowDuration=5, 
compression="gzip", 
numShards=1 
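
For context, my understanding is that in Beam it's the compression setting (not the filename suffix) that actually gzips the output - roughly like this minimal Python sketch. This is only an illustrative Beam analogue, not the loader's actual Scala code; the prefix and suffix just mirror the arguments above, and the local path stands in for the gs:// output directory:

```python
# Minimal Beam (Python) sketch -- NOT the GCS loader's actual Scala code.
# It only illustrates that file_name_suffix names the files, while
# compression_type is what actually gzips the bytes the sink writes.
import apache_beam as beam
from apache_beam.io.filesystem import CompressionTypes
from apache_beam.io.textio import WriteToText

with beam.Pipeline() as p:
    (
        p
        | "FakeEnrichedEvents" >> beam.Create(["event-1", "event-2"])  # stand-in for the Pub/Sub input
        | "WriteGzipped" >> WriteToText(
            "output",                                # local stand-in for gs://my_bucket_name/output
            file_name_suffix=".gzip",                # naming only, does not compress anything
            num_shards=1,
            compression_type=CompressionTypes.GZIP,  # this is what produces gzip-compressed files
        )
    )
```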

When I look at the logs for the output part of the Dataflow job, I don’t even see a compression argument being passed in to make the compression happen.

Does anyone know what I’m doing wrong, or if this even works with Dataflow?

Hi @blackknight467,

Your configuration looks correct. If you go to the Dataflow UI for the job, do you see compression in the Pipeline options?

[screenshot: Pipeline options in the Dataflow UI]
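
If it’s easier, you can also try dumping the job’s recorded pipeline options programmatically. This is only a rough sketch against the Dataflow v1b3 REST API via google-api-python-client; the project, region and job ID are placeholders, and Application Default Credentials are assumed:

```python
# Sketch: dump the pipeline options recorded for a Dataflow job.
# Assumes Application Default Credentials and the google-api-python-client package.
from googleapiclient.discovery import build

PROJECT = "my-project"   # placeholder
REGION = "us-central1"   # placeholder
JOB_ID = "my-job-id"     # placeholder (visible in the Dataflow UI)

dataflow = build("dataflow", "v1b3")
job = (
    dataflow.projects()
    .locations()
    .jobs()
    .get(projectId=PROJECT, location=REGION, jobId=JOB_ID, view="JOB_VIEW_ALL")
    .execute()
)

# The sdkPipelineOptions block should show whether compression made it into the job.
print(job["environment"]["sdkPipelineOptions"])
```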

When I look at the logs

You mean that when you look at the data on GCS, it’s there uncompressed?

Hi @BenB

yup! it’s there.

[screenshot: the output files in the Cloud Storage console]

When I look at the logs - yes, in Google Cloud Storage the files have the .gzip filename per the suffix argument, but the file type is plain text. But I also mean that you can select the output step of the Dataflow job and press Logs to see what it’s executing, and I don’t see the compression argument there either.

Hi @blackknight467,

I deployed GCS loader 0.3.1 for both bad rows and enriched events, with gzip activated, and files are correctly gzipped in both cases.

but the file type is plain text

Where do you see that? Just in case: if you download a gzipped text file and open it with vim, vim decompresses it for you.

@BenB sorry for the delayed reply - in my case the file type just says plain text in the Cloud Storage console.

When you download the data, is it text/plain? The content type on Cloud Storage won’t necessarily reflect the actual file type.
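
A quick way to settle it is to look at the object’s metadata and its first bytes rather than at how a browser or editor renders it. Here’s a rough sketch with the google-cloud-storage Python client; the bucket and object names are placeholders, and gzip data always starts with the magic bytes 1f 8b:

```python
# Sketch: check whether a GCS object is really gzip-compressed, independent of
# how the console, a browser, or an editor displays it. Bucket and object
# names are placeholders; assumes the google-cloud-storage package and
# default credentials.
from google.cloud import storage

client = storage.Client()
bucket = client.bucket("my_bucket_name")              # placeholder
blob = bucket.blob("output-...-00000-of-00001.gzip")  # placeholder

blob.reload()  # fetch the object's metadata from GCS
print("content_type:    ", blob.content_type)      # what the console shows as the "type"; metadata only
print("content_encoding:", blob.content_encoding)

# The payload itself: a gzip stream always starts with the bytes 1f 8b.
first_two = blob.download_as_bytes(start=0, end=1)
print("really gzipped:  ", first_two == b"\x1f\x8b")
```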

Pressing the download button opens it in a new tab and it is in plain readable text.