Emr-etl-runner run folder is not a folder, it's a file ending in $folder$


#1


This is the enriched/good folder with my latest run.

Why is this happening? The command I run is ./r90-emr-etl-runner run --c config90.yml --r iglu_resolver.json --skip rdb_load. The --skip rdb_load is there because I have built a custom script to copy the part-0000* files from S3 to GCP.

aws:
  # Credentials can be hardcoded or set in environment variables
  access_key_id: <%= ENV['AWS_ACCESS'] %>
  secret_access_key: <%= ENV['AWS_SECRET'] %>
  s3:
    region: us-west-2
    buckets:
      assets: s3://snowplow-hosted-assets # DO NOT CHANGE unless you are hosting the jarfiles etc yourself in your own bucket
      jsonpath_assets: # If you have defined your own JSON Schemas, add the s3:// path to your own JSON Path files in your own bucket here
      log: s3://xx-logs/logs
      raw:
        in:                  # This is a YAML array of one or more in buckets - you MUST use hyphens before each entry in the array, as below
          - s3://xx-logs        # e.g. s3://my-new-collector-bucket
        processing: s3://xx/processing_data
        archive: s3://xx/archive_data    # e.g. s3://my-archive-bucket/raw
      enriched:
        good: s3://xx/enriched/good       # e.g. s3://my-out-bucket/enriched/good
        bad: s3://xx/enriched/bad        # e.g. s3://my-out-bucket/enriched/bad
        errors: s3://xx/enriched/errors     # Leave blank unless :continue_on_unexpected_error: set to true below
        archive: s3://xx/enriched/archive    # Where to archive enriched events to, e.g. s3://my-archive-bucket/enriched
      shredded:
        good: s3://xxt/shredded/good       # e.g. s3://my-out-bucket/shredded/good
        bad: s3://xx/shredded/bad        # e.g. s3://my-out-bucket/shredded/bad
        errors: s3://xx/shredded/errors     # Leave blank unless :continue_on_unexpected_error: set to true below
        archive: s3://xx/shredded/archive    # Where to archive shredded events to, e.g. s3://my-archive-bucket/shredded
  emr:
    ami_version: 5.5.0
    region: us-west-2        # Always set this
    jobflow_role: EMR_EC2_DefaultRole # Created using $ aws emr create-default-roles
    service_role: EMR_DefaultRole     # Created using $ aws emr create-default-roles
    placement:      # Set this if not running in VPC. Leave blank otherwise
    ec2_subnet_id: subnet-xxx  # Set this if running in VPC. Leave blank otherwise
    ec2_key_name: xx
    bootstrap: []           # Set this to specify custom boostrap actions. Leave empty otherwise
    software:
      hbase:
      lingual:              # Optional. To launch on cluster, provide version, "1.1", keep quotes. Leave empty otherwise.
    # Adjust your Hadoop cluster below
    jobflow:
      job_name: Snowplow ETL # Give your job a name
      master_instance_type: m1.medium
      core_instance_count: 2
      core_instance_type: m1.medium
      core_instance_ebs:    # Optional. Attach an EBS volume to each core instance.
        volume_size: 100    # Gigabytes
        volume_type: "gp2"
        volume_iops: 400    # Optional. Will only be used if volume_type is "io1"
        ebs_optimized: false # Optional. Will default to true
      task_instance_count: 0 # Increase to use spot instances
      task_instance_type: m1.medium
      task_instance_bid: 0.015 # In USD. Adjust bid, or leave blank for non-spot-priced (i.e. on-demand) task instances
    bootstrap_failure_tries: 3 # Number of times to attempt the job in the event of bootstrap failures
    configuration:
      yarn-site:
        yarn.resourcemanager.am.max-attempts: "1"
      spark:
        maximizeResourceAllocation: "true"
    additional_info:        # Optional JSON string for selecting additional features
collectors:
  format: cloudfront # For example: 'clj-tomcat' for the Clojure Collector, 'thrift' for Thrift records, 'tsv/com.amazon.aws.cloudfront/wd_access_log' for Cloudfront access logs or 'ndjson/urbanairship.connect/v1' for UrbanAirship Connect events
enrich:
  versions:
    spark_enrich: 1.9.0 # Version of the Spark Enrichment process
  continue_on_unexpected_error: false # Set to 'true' (and set :out_errors: above) if you don't want any exceptions thrown from ETL
  output_compression: NONE # Compression only supported with Redshift, set to NONE if you have Postgres targets. Allowed formats: NONE, GZIP
storage:
  versions:
    rdb_loader: 0.12.0
    rdb_shredder: 0.12.0        # Version of the Spark Shredding process
    hadoop_elasticsearch: 0.1.0 # Version of the Hadoop to Elasticsearch copying process
monitoring:
  tags: {} # Name-value pairs describing this job
  logging:
    level: DEBUG # You can optionally switch to INFO for production
  snowplow:
    method: get
    app_id: xx  # e.g. snowplow
    collector: xx.cloudfront.net

I am using the r90 version of emr-etl-runner.


#2

What do you mean by “this”? An empty file with a name ending in $folder$?

If so, this is a side effect of using the AWS S3DistCp utility to move files between buckets: https://aws.amazon.com/premiumsupport/knowledge-center/emr-s3-empty-files/. I’m afraid we have no control over it at the moment.
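
For a custom copy script like the one mentioned in the question, one possible workaround is to ignore these placeholder files on the reading side. Below is a minimal sketch in Python, assuming the script already has the S3 object keys as plain strings. The helper names `is_folder_marker` and `data_keys` are hypothetical, and the sample keys (including the run folder name) are made up for illustration.

```python
# Suffix of the empty placeholder objects described above.
FOLDER_MARKER_SUFFIX = "$folder$"


def is_folder_marker(key: str) -> bool:
    """Return True for placeholder objects whose names end in $folder$."""
    return key.endswith(FOLDER_MARKER_SUFFIX)


def data_keys(keys):
    """Keep only real data objects, dropping the placeholder markers."""
    return [k for k in keys if not is_folder_marker(k)]


if __name__ == "__main__":
    # Illustrative keys only; the run folder name is an assumption.
    listing = [
        "enriched/good/run=2017-01-01-00-00-00_$folder$",
        "enriched/good/run=2017-01-01-00-00-00/part-00000",
        "enriched/good/run=2017-01-01-00-00-00/part-00001",
    ]
    for key in data_keys(listing):
        print(key)
```

Filtering by suffix leaves the bucket untouched, so it should not interfere with anything on the EMR side that expects the markers to be present.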

On a side note, if you are skipping rdb_load, you might want to skip shred,rdb_load,archive_shredded instead. You do not need to produce shredded files if no data is loaded into Redshift.
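
If the run is launched from a script, the longer skip list could be assembled along these lines. This is only a sketch: the binary name, config file, resolver file and flag spellings are copied from the command in the question, and `build_runner_command` is a hypothetical helper, not part of EmrEtlRunner.

```python
import shlex


def build_runner_command(skip_steps):
    """Build the argument list, joining the skipped steps with commas."""
    return [
        "./r90-emr-etl-runner", "run",
        "--c", "config90.yml",          # config file name taken from the question
        "--r", "iglu_resolver.json",    # resolver file name taken from the question
        "--skip", ",".join(skip_steps),
    ]


if __name__ == "__main__":
    cmd = build_runner_command(["shred", "rdb_load", "archive_shredded"])
    # Print a shell-safe version of the command for inspection.
    print(" ".join(shlex.quote(part) for part in cmd))
```

The resulting list can be passed to `subprocess.run` as-is, which avoids having to quote the arguments by hand.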


#4

Interesting, because it worked before. So there’s nothing I can do about this happening?


#5

Use of S3DistCp on the EMR cluster has been introduced gradually over various Snowplow releases. If you have upgraded your pipeline, that could be why the empty files have only started appearing now. Prior to S3DistCp, we used our home-built Sluice utility.

You can see when S3DistCp was introduced by going over the dataflow diagrams for the relevant releases: https://github.com/snowplow/snowplow/wiki/Batch-pipeline-steps.


#6

Thank you, Ihor. You have been very helpful with all my questions.