Having Trouble Tracking Unstructured Events


#1

I’ve followed the instructions on the schema registry page and the configuring shredding page, but whenever I run an EMR process the data is not uploaded to the custom table I created. Here are the pertinent files I created (some of the names and words have been changed). I assume there’s just some minor detail I’m missing. Any help would be greatly appreciated.
  custom_event_1.json - located at S3 bucket rr-snowplow-cloudfront-iglu-central/jsonpaths/com.rigdigbi

{
    "jsonpaths": [
        "$.schema.vendor",
        "$.schema.name",
        "$.schema.format",
        "$.schema.version",
        "$.hierarchy.rootId",
        "$.hierarchy.rootTstamp",
        "$.hierarchy.refRoot",
        "$.hierarchy.refTree",
        "$.hierarchy.refParent",
        "$.data.app_id",
        "$.data.g_number",
        "$.data.org_id",
        "$.data.user_id",
        "$.data.user_name"
    ]
}

  custom_event.json - located at rr-snowplow-cloudfront-iglu-central/schemas/com.rigdigbi/custom_event/jsonschema/1-0-0

{
  "$schema": "http://rr-snowplow-cloudfront-iglu-central/schemas/com.rigdigbi/custom_event/jsonschema/1-0-0",
  "description": "Schema for custom organization event",
  "self": {
    "vendor": "com.rigdigbi",
    "name": "custom_event",
    "format": "jsonschema",
    "version": "1-0-0"
  },

  "type": "object",
  "properties": {
    "app_id": {
       "type": "string",
       "maxLength": 255
    },
    "g_number": {
       "type": "string",
       "maxLength": 255
    },
    "org_id": {
       "type": "integer"
    },
    "user_id": {
       "type": "string",
       "maxLength": 255
    },
    "user_name": {
       "type": "string",
       "maxLength": 255
    }
  },
  "required": ["app_id", "g_number", "org_id", "user_id", "user_name"],
  "additionalProperties": false
}

  etl.resolver.json - located in Linux environment

{
  "schema": "iglu:com.snowplowanalytics.iglu/resolver-config/jsonschema/1-0-0",
  "data": {
    "cacheSize": 500,
    "repositories": [
      {
        "name": "Iglu Central",
        "priority": 0,
        "vendorPrefixes": [
          "com.snowplowanalytics"
        ],
        "connection": {
          "http": {
            "uri": "http://iglucentral.com"
          }
        }
      },
        {
                "name": "Custom Event",
                "priority": 0,
                "vendorPrefixes": [
                        "com.rigdigbi"
                ],
                "connection": {
                      "http": {
                        "uri": "http://d2y4sajn4og0gq.cloudfront.net/" #this reads from bucket rr-snowplow-cloudfront-iglu-central

                }
            }
        }
    ]
  }
}

  etl.cf.yml - located in Linux environment

# this config file is for the cloudfront collector.
# Don't use for clojure collector
aws:
  access_key_id: 11111
  secret_access_key: aaaaa
  s3:
    region: us-east-1
    buckets:
      assets: s3://sp-hosted-assets
      log: s3://sp-log/emr
      raw:
        in:
        - s3://sp-cloudfront-dev-logs/
        processing: s3://sp-cloudfront-processing
        archive:    s3://sp-cloudfront-archive/raw
      enriched:
        good: s3://sp-cloudfront-enriched/good
        bad:  s3://sp-cloudfront-enriched/bad
        errors: s3://sp-cloudfront-enriched/errors
        archive: s3://sp-cloudfront-archive/enriched/good
      shredded:
        good: s3://sp-cloudfront-shredded/good
        bad: s3://sp-cloudfront-shredded/bad
        errors: s3://sp-cloudfront-shredded/errors
        archive: s3://sp-cloudfront-archive/shredded/good
      jsonpath_assets: s3://sp-cloudfront-iglu-central/jsonpaths
  emr:
    ami_version: 3.6.0
    region: us-east-1
    placement: us-east-1c
    ec2_subnet_id:
    jobflow_role: EMR_EC2_DefaultRole
    service_role: EMR_DefaultRole
    ec2_key_name: SisenseKeyPair
    software:
      hbase: # not used for ami_version 3.6.0
      lingual: # not used for ami_version 3.6.0
    jobflow:
      master_instance_type: m1.medium
      core_instance_count: 2
      core_instance_type: c3.xlarge # m1.large
      task_instance_count: 0
      task_instance_type: m1.medium
      task_instance_bid: 0.015
    bootstrap_failure_tries: 3
collectors:
  format: cloudfront
enrich:
  job_name: Snowplow Cloudfront ETL
  versions:
    hadoop_enrich: 1.0.0
    hadoop_shred: 0.4.0
  continue_on_unexpected_error: false
  output_compression: NONE
storage:
  download:
    folder:
  targets:
  - name: RR Snowplow Events
    type: redshift
    host: snowplow.aaa111.us-east-1.redshift.amazonaws.com
    database: events
    port: 1234
    table: atomic.events
    username: user
    password: 1234567
    maxerror: 10
    comprows: 200000
monitoring:
  tags: {}
  logging:
    level: INFO
  snowplow:

  index.html - Only Unstruct Event tracking code is shown

        snowplow_name_here('trackUnstructEvent', {
            schema: 'iglu:rr-snowplow-cloudfront-iglu-central/schemas/com.rigdigbi/custom_event/jsonschema/1-0-0',
            data: {
                app_id: 'etltesting',
                g_number: 'abc123',
                org_id: 1,
                user_id: 'zzz999',
                user_name: 'user',
            }
        });

#2

Hi @wyip,

A few notes here.

  1. In the tracking code, you need to use schema: 'iglu:com.rigdigbi/custom_event/jsonschema/1-0-0'. The tracker doesn’t need to know where your schema is physically stored.
  2. In cf.yml: which version of Snowplow are you using? ami_version: 3.6.0 looks very outdated. This is generally fine, but it would be easier to get help with a more recent version.
  3. In resolver.json: the config looks valid, but note that your schema should be available at http://d2y4sajn4og0gq.cloudfront.net/schemas/com.rigdigbi/custom_event/jsonschema/1-0-0. This is where the ETL (enrich and shred) process will look for it.
  4. In the JSON Schema: $schema should be http://iglucentral.com/schemas/com.snowplowanalytics.self-desc/schema/jsonschema/1-0-0#. This is metadata and is identical across all JSON Schemas. I recommend using Igluctl to check for this kind of mistake.

The JSONPaths file looks good.

Also, please note that data is loaded not during the ETL process but during the StorageLoader run (in pre-R90 versions), and the Redshift tables must already exist at that point.
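To make note 3 concrete, here is a minimal sketch of how an Iglu schema URI maps onto the URL a static repository is expected to serve. The helper name `igluUriToUrl` is hypothetical (it is not part of any Snowplow library); the base URL and schema URI are the ones from this thread.

```javascript
// Hypothetical helper: maps an Iglu schema URI to the URL where a static
// Iglu repository is expected to serve that schema.
function igluUriToUrl(repoBaseUrl, igluUri) {
  // An Iglu URI has exactly four parts: vendor/name/format/version.
  const match = /^iglu:([^/]+)\/([^/]+)\/([^/]+)\/(\d+-\d+-\d+)$/.exec(igluUri);
  if (match === null) {
    throw new Error("Not a valid Iglu schema URI: " + igluUri);
  }
  const [, vendor, name, format, version] = match;
  const base = repoBaseUrl.replace(/\/+$/, ""); // drop trailing slashes
  return `${base}/schemas/${vendor}/${name}/${format}/${version}`;
}

const url = igluUriToUrl(
  "http://d2y4sajn4og0gq.cloudfront.net/",
  "iglu:com.rigdigbi/custom_event/jsonschema/1-0-0"
);
console.log(url);
// prints http://d2y4sajn4og0gq.cloudfront.net/schemas/com.rigdigbi/custom_event/jsonschema/1-0-0
```

The same helper rejects the URI used in the original tracking code, because a bucket name and a `schemas` segment are not part of an Iglu URI.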


#3

Hello Anton, thanks for the response. From a CDN, I’m using Snowplow version 2.3.0. I have a custom table called “atomic.com_rigdigbi_custom_event_1”, which corresponds to the properties in the JSON file “custom_event_1.json”. I also have a Linux script that runs the EMR process and the storage loader process.
  Unfortunately, even though I made the changes you suggested, data is not ending up in the custom table. Whenever I run the script, it fails on the step “Shredded HDFS” and I have to manually execute the EMR process to finish it, which might be interfering with the data upload. I will continue to make modifications to the files and test the script.
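The table name and JSONPaths file name in this post both follow from the schema key. The sketch below derives them for a vendor and name that are already lowercase snake_case, as they are here. `shredTargets` is a hypothetical helper written for illustration, not Snowplow code, and it does not handle camelCase names or other edge cases.

```javascript
// Hypothetical helper: derives the Redshift table and JSONPaths file that
// correspond to a self-describing event's schema key.
// Assumes vendor and name are already lowercase snake_case.
function shredTargets(igluUri) {
  const match = /^iglu:([^/]+)\/([^/]+)\/jsonschema\/(\d+)-\d+-\d+$/.exec(igluUri);
  if (match === null) {
    throw new Error("Not a valid Iglu schema URI: " + igluUri);
  }
  // Only the first version number (the model) appears in the names.
  const [, vendor, name, model] = match;
  return {
    table: `atomic.${vendor.replace(/\./g, "_")}_${name}_${model}`,
    jsonpaths: `jsonpaths/${vendor}/${name}_${model}.json`,
  };
}

const targets = shredTargets("iglu:com.rigdigbi/custom_event/jsonschema/1-0-0");
console.log(targets.table);     // prints atomic.com_rigdigbi_custom_event_1
console.log(targets.jsonpaths); // prints jsonpaths/com.rigdigbi/custom_event_1.json
```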


#4

Hi @wyip,

Could you provide the EMR stdout/stderr logs? Also, you need to check whether your event reaches the enriched.good and shredded.good buckets or remains in one of the bad buckets (you can do this by inspecting both bad buckets).

My current guess is that the ETL simply cannot fetch the schema from your Iglu repository (point #3 in my previous post) and therefore invalidates the event and leaves it in enriched.bad.

“From a CDN, I’m using Snowplow version 2.3.0”

I guess this is the JS Tracker’s version. The tracker can be upgraded to the most recent release, 2.8.2. Also, the Snowplow ETL itself looks very outdated with hadoop_enrich: 1.0.0 and hadoop_shred: 0.4.0, which corresponds to R66. A lot has happened since then. Your problem is probably not caused by the outdated versions, but it is generally a good idea to use newer ones.


#5

I’m using a JS tracker. The good news is that events are now making it into the atomic.events table, though there is still nothing in the custom table. I’ve checked the buckets, and all the shredded output is ending up in the bad folder. Here’s one error message:

{“line”:“test\tweb\t2017-08-23 15:34:58.123\t2017-08-23 15:26:56.000\t2017-08-23 15:26:56.337\tunstruct\t356a5c5e-5f2b-4764-b270-cc13d3c02018\t\tcf\tjs-2.3.0\tcloudfront\thadoop-1.0.0-common-0.14.0\t\t111.111.111.111\t2960142284\t601773da1808eae6\t1\t\tUS\tNC\tSalisbury\t21111\t11.133804\t-11.006\tNorth Carolina\t\t\t\t\thttp://dnyvknlia0uzv.cloudfront.net/in4.html\t\t\thttp\tdnyvknlia0uzv.cloudfront.net\t80\t/in4.html\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t{“schema”:“iglu:com.snowplowanalytics.snowplow/contexts/jsonschema/1-0-0”,“data”:[{“schema”:“iglu:com.google.analytics/cookies/jsonschema/1-0-0”,“data”:{}}]}\t\t\t\t\t\t{“schema”:“iglu:com.snowplowanalytics.snowplow/unstruct_event/jsonschema/1-0-0”,“data”:{“schema”:“iglu:com.rigdigbi/custom_event/jsonschema/1-0-0”,“data”:{“app_id”:“etltesting”,“great_plains_number”:“abc123”,“org_id”:1,“user_id”:“zzz999”,“user_name”:“yipwill”}}}\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\tMozilla/5.0 (Windows NT 6.1; WOW64; rv:54.0) Gecko/20100101 Firefox/54.0\tFirefox 5\tFirefox\t54.0\tBrowser\tGECKO\ten-US\t0\t1\t0\t0\t0\t0\t0\t0\t0\t1\t24\t1920\t566\tWindows 7\tWindows\tMicrosoft Corporation\tAmerica/New_York\tComputer\t0\t1920\t1080\tUTF-8\t1920\t566\t\t\t\t\t\t\tUSD\tAmerica/New_York\t\t\t\t\t\t\t{“schema”:“iglu:com.snowplowanalytics.snowplow/contexts/jsonschema/1-0-1”,“data”:[{“schema”:“iglu:com.snowplowanalytics.snowplow/ua_parser_context/jsonschema/1-0-0”,“data”:{“useragentFamily”:“Firefox”,“useragentMajor”:“54”,“useragentMinor”:“0”,“useragentPatch”:null,“useragentVersion”:“Firefox 54.0”,“osFamily”:“Windows 7”,“osMajor”:null,“osMinor”:null,“osPatch”:null,“osPatchMinor”:null,“osVersion”:“Windows 7”,“deviceFamily”:“Other”}}]}\t\t”,“errors”:[{“level”:“error”,“message”:“Could not find schema with key iglu:com.rigdigbi/custom_event/jsonschema/1-0-0 in any repository, tried:”,“repositories”:[“Iglu Central [HTTP]”,“Iglu Client Embedded [embedded]”,“Custom Event [HTTP]”]}]}
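A bad row like the one above is a JSON object with a `line` field (the original event) and an `errors` array. When there are many of them, a small script can pull out just the error messages. This is a minimal sketch assuming each file line is one valid JSON bad row of that shape. `summarizeBadRow` is a hypothetical helper, and the `line` value in the sample is deliberately shortened.

```javascript
// Hypothetical helper: extracts readable error messages from one bad row.
// Assumes the row is valid JSON with an "errors" array, as in the sample above.
function summarizeBadRow(badRowJson) {
  const row = JSON.parse(badRowJson);
  return (row.errors || []).map((err) => {
    if (typeof err === "string") {
      return err; // tolerate rows whose errors are plain strings
    }
    const repos = Array.isArray(err.repositories)
      ? " " + err.repositories.join(", ")
      : "";
    return `[${err.level}] ${err.message}${repos}`;
  });
}

// Sample built from the error above; the "line" field is shortened here.
const sampleBadRow = JSON.stringify({
  line: "test\tweb\t2017-08-23 15:34:58.123",
  errors: [
    {
      level: "error",
      message:
        "Could not find schema with key iglu:com.rigdigbi/custom_event/jsonschema/1-0-0 in any repository, tried:",
      repositories: [
        "Iglu Central [HTTP]",
        "Iglu Client Embedded [embedded]",
        "Custom Event [HTTP]",
      ],
    },
  ],
});

console.log(summarizeBadRow(sampleBadRow));
```

For this row the summary makes the cause easy to spot: the schema lookup was attempted against all three repositories, including “Custom Event [HTTP]”, and none of them returned it.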

Here’s the syslog message:

2017-08-23 15:22:10,553 INFO com.amazon.elasticmapreduce.s3distcp.S3DistCp (main): Running with args: --src hdfs:///local/snowplow/shredded-events/ --dest s3://william-test-bucket/shredded/good/run=2017-08-23-15-09-31/ --srcPattern .*part-.* --s3Endpoint s3.amazonaws.com 
2017-08-23 15:22:10,790 INFO com.amazon.elasticmapreduce.s3distcp.S3DistCp (main): S3DistCp args: --src hdfs:///local/snowplow/shredded-events/ --dest s3://william-test-bucket/shredded/good/run=2017-08-23-15-09-31/ --srcPattern .*part-.* --s3Endpoint s3.amazonaws.com 
2017-08-23 15:22:16,679 INFO com.amazon.ws.emr.hadoop.fs.EmrFileSystem (main): Consistency disabled, using com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem as filesystem implementation
2017-08-23 15:22:17,228 INFO amazon.emr.metrics.MetricsSaver (main): MetricsConfigRecord disabledInCluster: false instanceEngineCycleSec: 60 clusterEngineCycleSec: 60 disableClusterEngine: false 
2017-08-23 15:22:17,228 INFO amazon.emr.metrics.MetricsSaver (main): Created MetricsSaver j-G79KQ0H31N8C:i-0b15f39e1351ee99d:RunJar:04079 period:60 /mnt/var/em/raw/i-0b15f39e1351ee99d_20170823_RunJar_04079_raw.bin
2017-08-23 15:22:19,511 FATAL com.amazon.elasticmapreduce.s3distcp.S3DistCp (main): Failed to get source file system
java.io.FileNotFoundException: File does not exist: hdfs:/local/snowplow/shredded-events
	at org.apache.hadoop.hdfs.DistributedFileSystem$17.doCall(DistributedFileSystem.java:1128)
	at org.apache.hadoop.hdfs.DistributedFileSystem$17.doCall(DistributedFileSystem.java:1120)
	at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
	at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1120)
	at com.amazon.elasticmapreduce.s3distcp.S3DistCp.run(S3DistCp.java:630)
	at com.amazon.elasticmapreduce.s3distcp.S3DistCp.run(S3DistCp.java:614)
	at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
	at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:84)
	at com.amazon.elasticmapreduce.s3distcp.Main.main(Main.java:22)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:606)
	at org.apache.hadoop.util.RunJar.main(RunJar.java:212)
2017-08-23 15:22:19,519 INFO amazon.emr.metrics.MetricsSaver (Thread-4): Saved 2:2 records to /mnt/var/em/raw/i-0b15f39e1351ee99d_20170823_RunJar_04079_raw.bin

Here’s the stderr message:

Exception in thread "main" java.lang.RuntimeException: Failed to get source file system
	at com.amazon.elasticmapreduce.s3distcp.S3DistCp.run(S3DistCp.java:633)
	at com.amazon.elasticmapreduce.s3distcp.S3DistCp.run(S3DistCp.java:614)
	at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
	at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:84)
	at com.amazon.elasticmapreduce.s3distcp.Main.main(Main.java:22)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:606)
	at org.apache.hadoop.util.RunJar.main(RunJar.java:212)
Caused by: java.io.FileNotFoundException: File does not exist: hdfs:/local/snowplow/shredded-events
	at org.apache.hadoop.hdfs.DistributedFileSystem$17.doCall(DistributedFileSystem.java:1128)
	at org.apache.hadoop.hdfs.DistributedFileSystem$17.doCall(DistributedFileSystem.java:1120)
	at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
	at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1120)
	at com.amazon.elasticmapreduce.s3distcp.S3DistCp.run(S3DistCp.java:630)
	... 9 more

#6

By changing the JS tracker code, I can now get the EMR process to run successfully without error; see the code below. However, the storage loader is now failing with “Cannot find JSON Paths file to load”, which is strange since I do have a JSON Paths file and the yaml file references the bucket it is in. I feel that I am very close to getting the custom event set up, though of course any advice would speed up the process.

        snowplow_name_here('trackUnstructEvent', {
            schema: 'iglu:com.snowplowanalytics.snowplow/unstruct_event/jsonschema/1-0-0',
            data: {
              schema: 'iglu:com.rigdigbi/custom_event/jsonschema/1-0-0',
              data: {
                  app_id: 'etltesting',
                  great_plains_number: 'abc123',
                  org_id: 1,
                  user_id: 'zzz999',
                  user_name: 'yipwill',
              }
            }
        });

#7

Hi @wyip,

Unfortunately, this tracking code does not do what you want. You need to remove the unstruct_event wrapper, because the Enrich process mistakenly thinks the wrapper is your custom unstructured event, and the pipeline then looks for a JSONPaths file for unstruct_event. No such file exists: unstruct_event is an auxiliary JSON Schema that users don’t need to touch.

As I said in my previous messages, and as the message in the bad rows states, your problem is that the ETL process cannot access your JSON Schema. It must be available at http://d2y4sajn4og0gq.cloudfront.net/schemas/com.rigdigbi/custom_event/jsonschema/1-0-0, but it isn’t (I cannot wget it).
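For reference, here is a sketch of the call shape these notes describe: the custom schema URI at the top level and no hand-written unstruct_event wrapper, since the tracker adds that envelope itself (it is visible in the bad row earlier in the thread). The stub `snowplow_name_here` and the `recordedCalls` array are stand-ins so the snippet runs on its own; on a real page the function comes from the Snowplow loader snippet. The property names follow the schema from post #1.

```javascript
// Stand-in for the tracker function that the Snowplow loader snippet
// defines on a real page; here it only records its arguments.
const recordedCalls = [];
function snowplow_name_here(...args) {
  recordedCalls.push(args);
}

snowplow_name_here('trackUnstructEvent', {
  // Vendor/name/format/version only: no bucket name, no "schemas" segment.
  schema: 'iglu:com.rigdigbi/custom_event/jsonschema/1-0-0',
  // The event's own properties go directly here, without a nested
  // schema/data pair.
  data: {
    app_id: 'etltesting',
    g_number: 'abc123',
    org_id: 1,
    user_id: 'zzz999',
    user_name: 'user'
  }
});
```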


#8

I have good news: custom events are now being tracked and uploaded to the custom event table. My co-worker read the above message and noticed that clicking the schema links on iglucentral prompts a file download. So, I put the schema information in a file named “1-0-0” and placed it in the jsonschema folder. I had previously misunderstood the schema creation section of the schema registry page, but now everything is working fine. Thanks for all the help!
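The fix comes down to the object key: the schema must be a file named after the version, sitting directly in the jsonschema folder, not a file inside a folder named `1-0-0`. A quick sketch of a check over bucket keys follows. `isServableSchemaKey` is a hypothetical helper, and the third key is an invented example of another easy mistake.

```javascript
// Hypothetical check: does this bucket key match the layout a static Iglu
// repository serves, i.e. schemas/<vendor>/<name>/jsonschema/<version>?
const SCHEMA_KEY_PATTERN = /^schemas\/[^/]+\/[^/]+\/jsonschema\/\d+-\d+-\d+$/;

function isServableSchemaKey(key) {
  return SCHEMA_KEY_PATTERN.test(key);
}

const candidateKeys = [
  // Working layout from this post: a file literally named "1-0-0".
  "schemas/com.rigdigbi/custom_event/jsonschema/1-0-0",
  // Layout from post #1: "1-0-0" used as a folder holding custom_event.json.
  "schemas/com.rigdigbi/custom_event/jsonschema/1-0-0/custom_event.json",
  // Invented example: version file with an extension appended.
  "schemas/com.rigdigbi/custom_event/jsonschema/1-0-0.json",
];

for (const key of candidateKeys) {
  console.log(isServableSchemaKey(key) ? "ok   " : "wrong", key);
}
```

Only the first key passes; the other two would leave the resolver unable to find the schema at the URL it requests.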


#9

Glad to hear it, @wyip!


#10

Anton: @wyip and @sonnypolaris really appreciate your help on this one!