App_id is null even though aid is set in cs_uri_query

#1

Hi,

I’m running the JavaScript tracker -> CloudFront collector -> EMR ETL Runner (latest versions).

Everything seems to work: data gets populated into atomic.events and com_amazon_aws_cloudfront_wd_access_log_1.

However, many fields in atomic.events (app_id, page_url, etc.) are null even though the corresponding data is present in cs_uri_query (aid, url, etc.).

This is how I’m running the job:

./snowplow-emr-etl-runner run --config config.yml.sample -r iglu.conf --target redshift/ --enrichments enrichements

Extract from the config:

    jobflow:
      job_name: Snowplow ETL # Give your job a name
      master_instance_type: m1.medium
      core_instance_count: 2
      core_instance_type: m1.medium
      core_instance_ebs:    # Optional. Attach an EBS volume to each core instance.
        volume_size: 100    # Gigabytes
        volume_type: "gp2"
        volume_iops: 400    # Optional. Will only be used if volume_type is "io1"
        ebs_optimized: false # Optional. Will default to true
      task_instance_count: 0 # Increase to use spot instances
      task_instance_type: m1.medium
      task_instance_bid: # In USD. Adjust bid, or leave blank for non-spot-priced (i.e. on-demand) task instances
    bootstrap_failure_tries: 3 # Number of times to attempt the job in the event of bootstrap failures
    configuration:
      yarn-site:
        yarn.resourcemanager.am.max-attempts: "1"
      spark:
        maximizeResourceAllocation: "true"
    additional_info:        # Optional JSON string for selecting additional features
collectors:
  format: tsv/com.amazon.aws.cloudfront/wd_access_log # For example: 'clj-tomcat' for the Clojure Collector, 'thrift' for Thrift records, 'tsv/com.amazon.aws.cloudfront/wd_access_log' for Cloudfront access logs or 'ndjson/urbanairship.connect/v1' for UrbanAirship Connect events
enrich:
  versions:
    spark_enrich: 1.17.0 # Version of the Spark Enrichment process
  continue_on_unexpected_error: false # Set to 'true' (and set :out_errors: above) if you don't want any exceptions thrown from ETL
  output_compression: NONE # Compression only supported with Redshift, set to NONE if you have Postgres targets. Allowed formats: NONE, GZIP

I’m using the default iglu.conf:

{
  "schema": "iglu:com.snowplowanalytics.iglu/resolver-config/jsonschema/1-0-1",
  "data": {
    "cacheSize": 500,
    "repositories": [
      {
        "name": "Iglu Central",
        "priority": 0,
        "vendorPrefixes": [ "com.snowplowanalytics" ],
        "connection": {
          "http": {
            "uri": "http://iglucentral.com"
          }
        }
      }
    ]
  }
}

I’ve only added anon_ip.json to the enrichments folder.

What am I missing?

#2

Not all atomic fields are meant to be populated for every event; it depends on the type of event and your tracking code. For example, app_id is set manually in the initializing tag, and page_url is expected for pageview events unless it is configured manually for other event types.

As for com_amazon_aws_cloudfront_wd_access_log_1, it is a different type of event altogether. It relates to the AWS access logs rather than to your tracking code, so I expect it to contain neither app_id nor page_url.

#3

Hi @oldpa, the issue is in your EMR ETL Runner configuration, where you have:

format: tsv/com.amazon.aws.cloudfront/wd_access_log

You should have:

format: "cloudfront"

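The format you currently have set is only for extracting information from the CloudFront access logs themselves, not for processing Snowplow events. That is why rows land in com_amazon_aws_cloudfront_wd_access_log_1 while fields such as app_id and page_url stay null.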
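
For reference, the collectors section from the config in the first post would then look roughly like this. This is a sketch that assumes nothing else in the file needs to change; the comment is my own wording, not the text from config.yml.sample.

```yaml
collectors:
  # 'cloudfront' is for Snowplow events logged by the Cloudfront collector.
  # 'tsv/com.amazon.aws.cloudfront/wd_access_log' only parses the raw
  # Cloudfront access log fields and does not read the tracker payload.
  format: cloudfront
```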

#4

Wow, super helpful. This fixed it.

I would perhaps change the comment in config.yml.sample. This comment fooled me :slight_smile:

'tsv/com.amazon.aws.cloudfront/wd_access_log' for Cloudfront access logs
#5

Glad you got it working! The sample can be a bit confusing for this scenario, so I have logged a ticket to make it more explicit.