Forming a batch of a given size for the Transformer

How do you carve a batch of a given size out of the enriched data? We use an Airflow task to build a list of files in S3 with a total size of 1.5 GB and copy those files to a separate directory; only after that do we launch the dataflow runner. Is there perhaps an easier way, via steps on the EMR cluster itself with S3DistCp or something similar?
I didn’t find anything similar in the discussions.
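
Roughly, our Airflow task does the following. This is a simplified sketch; the bucket name and prefixes are placeholders, not our real configuration:

```python
import boto3

# Placeholder names, not our real configuration.
BUCKET = "my-pipeline-bucket"
ENRICHED_PREFIX = "enriched/"
BATCH_PREFIX = "transformer-input/"
BATCH_SIZE_BYTES = int(1.5 * 1024 ** 3)  # 1.5 GB budget per batch


def stage_batch():
    """Copy enriched files into a staging prefix until the
    cumulative size reaches the batch budget, then stop."""
    s3 = boto3.client("s3")
    staged, total = [], 0
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=BUCKET, Prefix=ENRICHED_PREFIX):
        for obj in page.get("Contents", []):
            if total + obj["Size"] > BATCH_SIZE_BYTES:
                return staged  # budget reached: this batch is complete
            s3.copy_object(
                Bucket=BUCKET,
                Key=obj["Key"].replace(ENRICHED_PREFIX, BATCH_PREFIX, 1),
                CopySource={"Bucket": BUCKET, "Key": obj["Key"]},
            )
            staged.append(obj["Key"])
            total += obj["Size"]
    return staged
```

Once the batch is staged, we point the dataflow runner at the staging prefix.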

We use:

  • dataflow_runner_0.7.1
  • snowplow-enrich-kinesis-3.1.5
  • snowplow-s3-loader-2.2.2
  • snowplow-transformer-batch-4.2.1

Hi @Edward_Kim , maybe you can achieve what you want with a combination of S3DistCp settings: --groupBy and --targetSize. The first lets you create batches by specifying a group-by condition, and the second caps how big those batches can get.

You can refer to the S3DistCp documentation for details about these settings.
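
For example, adding a step along these lines to the EMR cluster would concatenate the enriched files into chunks of at most ~1.5 GB. The cluster ID, S3 paths and group-by pattern below are purely illustrative and would need adapting to your file layout:

```python
import boto3

emr = boto3.client("emr")

# Illustrative values only: substitute your own cluster ID and S3 paths.
emr.add_job_flow_steps(
    JobFlowId="j-XXXXXXXXXXXXX",
    Steps=[{
        "Name": "Batch enriched files with S3DistCp",
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": [
                "s3-dist-cp",
                "--src", "s3://my-pipeline-bucket/enriched/",
                "--dest", "s3://my-pipeline-bucket/transformer-input/",
                # Concatenate files whose names match the capture group...
                "--groupBy", ".*(run=[0-9-]+).*",
                # ...into output files of at most ~1.5 GB (value is in MiB).
                "--targetSize", "1536",
            ],
        },
    }],
)
```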

Thanks for the advice @dilyan . But that only helps to merge several files, grouped in a certain way, into one file; it still picks up all the files in the specified directory. I need to select from the directory exactly as many files as the transformer can process in one run.
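
In case it is useful to others: since each batch has to start where the previous one ended, our task also retires the selected source keys after a successful transformer run. A minimal sketch, reusing the key list returned by the staging task above:

```python
import boto3


def retire_batch(bucket, keys):
    """Delete the source keys that went into the batch so the next
    selection starts from fresh files. delete_objects accepts at
    most 1000 keys per call, so delete in chunks."""
    s3 = boto3.client("s3")
    for i in range(0, len(keys), 1000):
        s3.delete_objects(
            Bucket=bucket,
            Delete={"Objects": [{"Key": k} for k in keys[i:i + 1000]]},
        )
```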