Clojure Enrichment not processing but also no logs


#1

I am just setting up a Clojure enrichment. It appears to start running, but after 15 seconds I get no further output:

>     ./snowplow-emr-etl-runner --config config/config.yml --resolver config/iglu_resolver.json --enrichments enrichments
>     D, [2017-03-23T07:28:41.156000 #24651] DEBUG -- : Staging raw logs...

I assumed there was some sort of error, but when I looked in the log bucket set in my yml file, it was empty. Below is my config file:

> aws:
>   # Credentials can be hardcoded or set in environment variables
>   access_key_id: XXXXX
>   secret_access_key: XXX
>   s3:
>     region: us-west-2
>     buckets:
>       assets: s3://snowplow-hosted-assets # DO NOT CHANGE unless you are hosting the jarfiles etc yourself in your own bucket
>       jsonpath_assets: # If you have defined your own JSON Schemas, add the s3:// path to your own JSON Path files in your own bucket here
>       log: s3://elasticbeanstalk-us-west-2-XXXXX/resources/environments/emrlog
>       raw:
>         in:                  # This is a YAML array of one or more in buckets - you MUST use hyphens before each entry in the array, as below
>           - s3://elasticbeanstalk-us-west-2-XXXXX/resources/environments/logs/publish/e-cftpjpq6vh         # e.g. s3://my-old-collector-bucket
>         processing: s3://XXXXX-etl/processing
>         archive: s3://XXXXX-archive/raw    # e.g. s3://my-archive-bucket/raw
>       enriched:
>         good: s3://XXXXX-data/enriched/good       # e.g. s3://my-out-bucket/enriched/good
>         bad: s3://XXXXX-data/enriched/bad        # e.g. s3://my-out-bucket/enriched/bad
>         errors:      # Leave blank unless :continue_on_unexpected_error: set to true below
>         archive: s3://XXXXX-data/enriched/archive    # Where to archive enriched events to, e.g. s3://my-archive-bucket/enriched
>       shredded:
>         good: s3://XXXXX-data/shredded/good       # e.g. s3://my-out-bucket/shredded/good
>         bad: s3://XXXXX-data/shredded/bad        # e.g. s3://my-out-bucket/shredded/bad
>         errors:      # Leave blank unless :continue_on_unexpected_error: set to true below
>         archive: s3://XXXXX-data/shredded/archive    # Where to archive shredded events to, e.g. s3://my-archive-bucket/shredded
>   emr:
>     ami_version: 4.5.0
>     region: us-west-2        # Always set this
>     jobflow_role: EMR_EC2_DefaultRole # Created using $ aws emr create-default-roles
>     service_role: EMR_DefaultRole     # Created using $ aws emr create-default-roles
>     placement: us-west-2b     # Set this if not running in VPC. Leave blank otherwise
>     ec2_subnet_id: ADD HERE # Set this if running in VPC. Leave blank otherwise
>     ec2_key_name: XXXXX
>     bootstrap: []           # Set this to specify custom bootstrap actions. Leave empty otherwise
>     software:
>       hbase:                # Optional. To launch on cluster, provide version, "0.92.0", keep quotes. Leave empty otherwise.
>       lingual:              # Optional. To launch on cluster, provide version, "1.1", keep quotes. Leave empty otherwise.
>     # Adjust your Hadoop cluster below
>     jobflow:
>       master_instance_type: m1.medium
>       core_instance_count: 2
>       core_instance_type: m1.medium
>       core_instance_ebs:    # Optional. Attach an EBS volume to each core instance.
>         volume_size: 100    # Gigabytes
>         volume_type: "gp2"
>         volume_iops: 400    # Optional. Will only be used if volume_type is "io1"
>         ebs_optimized: false # Optional. Will default to true
>       task_instance_count: 0 # Increase to use spot instances
>       task_instance_type: m1.medium
>       task_instance_bid: 0.015 # In USD. Adjust bid, or leave blank for non-spot-priced (i.e. on-demand) task instances
>     bootstrap_failure_tries: 3 # Number of times to attempt the job in the event of bootstrap failures
>     additional_info:        # Optional JSON string for selecting additional features
> collectors:
>   format: clj-tomcat # For example: 'clj-tomcat' for the Clojure Collector, 'thrift' for Thrift records, 'tsv/com.amazon.aws.cloudfront/wd_access_log' for Cloudfront access logs or 'ndjson/urbanairship.connect/v1' for UrbanAirship Connect events
> enrich:
>   job_name: XXXXX ETL # Give your job a name
>   versions:
>     hadoop_enrich: 1.8.0 # Version of the Hadoop Enrichment process
>     hadoop_shred: 0.10.0 # Version of the Hadoop Shredding process
>     hadoop_elasticsearch: 0.1.0 # Version of the Hadoop to Elasticsearch copying process
>   continue_on_unexpected_error: false # Set to 'true' (and set :out_errors: above) if you don't want any exceptions thrown from ETL
>   output_compression: NONE # Compression only supported with Redshift, set to NONE if you have Postgres targets. Allowed formats: NONE, GZIP
> storage:
>   download:
>     folder: downloads # Postgres-only config option. Where to store the downloaded files. Leave blank for Redshift
>   targets:
>     - name: "XXXXX"
>       type: postgres
>       host: 127.0.0.1 # Hostname of database server
>       database: XXXXXetl # Name of database
>       port: 5432 # Default Postgres port
>       ssl_mode: disable # One of disable (default), require, verify-ca or verify-full
>       table: atomic.events
>       username: XXXXX
>       password: XXXXX
>       maxerror: # Not required for Postgres
>       comprows: # Not required for Postgres
> monitoring:
>   tags: {} # Name-value pairs describing this job
>   logging:
>     level: DEBUG # You can optionally switch to INFO for production

Any help is very much appreciated.


#2

Hi @sevenm,

Did you have a look at the enriched/good or enriched/bad buckets? Also, what do you see in your AWS EMR console? Is there any log output or step failure information?

My guess is that you didn’t wait until the enrichment and shred jobs had finished. The ETL process takes quite a long time even on very small volumes of data (because of the bootstrap phases).
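
If the console is awkward to read, the same step information can be pulled with a script. This is only a rough sketch, assuming Python with boto3 installed and AWS credentials already configured; `step_report`, `unfinished` and `main` are names made up for illustration, and the region is taken from the config above. It reads only the first page of results from each call.

```python
def step_report(emr, cluster_id):
    """Return (step name, state, failure message) tuples for one EMR cluster.

    `emr` is expected to be a boto3 EMR client (or anything with a
    compatible list_steps method). Only the first page of steps is read.
    """
    report = []
    for step in emr.list_steps(ClusterId=cluster_id)["Steps"]:
        status = step["Status"]
        # FailureDetails is only present when a step has failed.
        failure = status.get("FailureDetails", {}).get("Message", "")
        report.append((step["Name"], status["State"], failure))
    return report


def unfinished(report):
    """Names of steps that are still pending or running."""
    return [name for name, state, _ in report if state in ("PENDING", "RUNNING")]


def main():
    """Print every cluster visible to these credentials, with its steps."""
    import boto3  # assumption: boto3 is installed

    emr = boto3.client("emr", region_name="us-west-2")  # region from the config above
    for cluster in emr.list_clusters()["Clusters"]:
        print(cluster["Id"], cluster["Name"], cluster["Status"]["State"])
        for name, state, failure in step_report(emr, cluster["Id"]):
            print("   ", state, name, failure)
```

Calling `main()` prints one line per cluster followed by its steps. If no cluster shows up at all, the job most likely never got as far as launching EMR, which would also explain an empty log bucket.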