We have implemented the Snowplow pipeline up to the point where “stream-enrich-kinesis” writes TSV files into the S3 loader bucket. We have now decided to shred our enriched events into separate entities using the RDB Shredder. It seems that RDB Shredder is part of the EmrEtlRunner batch process; however, it is also mentioned here that it can be run manually. I have the following questions regarding the Shredder setup:
Does running the Shredder manually actually mean using this Snowplow hosted asset?
What is the difference between what “stream-enrich-kinesis” does and the enrichment step of EmrEtlRunner?
Can we actually run EmrEtlRunner (in case it is necessary for the shredding job) within a Fargate instance? If not, what is the recommended implementation?
How should the EmrEtlRunner config file be set up for the whole s3 block (below) in our case? (We have only one bucket, which collects the good enriched events.) Are all the buckets in the block required? Are they all outputs of the Shredder? If not, which one is required for which step?
s3:
  region: ADD HERE
  buckets:
    assets: s3://snowplow-hosted-assets # DO NOT CHANGE unless you are hosting the jarfiles etc yourself in your own bucket
    jsonpath_assets: # If you have defined your own JSON Schemas, add the s3:// path to your own JSON Path files in your own bucket here
    log: ADD HERE
    encrypted: false # Whether the buckets below are encrypted using server side encryption (SSE-S3)
    raw:
      in: # This is a YAML array of one or more in buckets - you MUST use hyphens before each entry in the array, as below
        - ADD HERE # e.g. s3://my-old-collector-bucket
        - ADD HERE # e.g. s3://my-new-collector-bucket
      processing: ADD HERE
      archive: ADD HERE # e.g. s3://my-archive-bucket/raw
    enriched:
      good: ADD HERE # e.g. s3://my-out-bucket/enriched/good
      bad: ADD HERE # e.g. s3://my-out-bucket/enriched/bad
      errors: ADD HERE # Leave blank unless :continue_on_unexpected_error: set to true below
      archive: ADD HERE # Where to archive enriched events to, e.g. s3://my-archive-bucket/enriched
    shredded:
      good: ADD HERE # e.g. s3://my-out-bucket/shredded/good
      bad: ADD HERE # e.g. s3://my-out-bucket/shredded/bad
      errors: ADD HERE # Leave blank unless :continue_on_unexpected_error: set to true below
      archive: ADD HERE # Where to archive shredded events to, e.g. s3://my-archive-bucket/shredded
  consolidate_shredded_output: true # Whether to combine files when copying from hdfs to s3