EmrEtlRunner config.yml, cloudfront format


In the configuration for snowplow-emr-etl-runner, version r95, should the value of collectors.format be “tsv/com.amazon.aws.cloudfront/wd_access_log”, or just “cloudfront”?

The former is what I find in the docs (https://github.com/snowplow/snowplow/blob/master/3-enrich/emr-etl-runner/config/config.yml.sample), the latter is what seems to work.

If I use “tsv/com.amazon.aws.cloudfront/wd_access_log”, then atomic.events.app_id is always null.

Context: I am replacing a two-year-old deployment of Snowplow (version number unknown). To test the r95 installation, I copied the S3 buckets (specified in config.yml under aws.s3.buckets.raw.in) to new buckets. In each CloudFront log file entry, the query string is the 12th field, and the app_id value is in the “aid” parameter.
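
For anyone wanting to confirm that the raw logs really carry the value, here is a minimal Python sketch of pulling “aid” out of a single log line. It assumes the fields are tab-separated, that lines starting with “#” are headers, and that an empty query string is logged as “-”. The function name extract_app_id and the sample line in the usage note are made up for illustration; this is not how Snowplow itself parses the logs.

```python
from urllib.parse import parse_qs


def extract_app_id(log_line):
    """Return the 'aid' value from a CloudFront log line, or None.

    Assumes tab-separated fields with the query string in the 12th field.
    """
    # Header lines (e.g. "#Version", "#Fields") carry no event data.
    if log_line.startswith("#"):
        return None

    fields = log_line.rstrip("\n").split("\t")
    if len(fields) < 12:
        return None

    query = fields[11]  # 12th field, zero-based index 11
    if query == "-":
        return None

    # parse_qs maps each parameter name to a list of decoded values.
    values = parse_qs(query).get("aid")
    return values[0] if values else None
```

Calling extract_app_id on a line whose 12th field is e=pv&aid=myapp&p=web returns "myapp". If that works on the raw files but atomic.events.app_id is still null, the problem is in the ETL configuration rather than in the logs.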


Just “cloudfront”, @wleftwich.

The other format you mention is for parsing CloudFront access logs as plain access logs (i.e. requests for the various objects served over your CDN), rather than as Snowplow CloudFront collector logs.