Add tags to EMR EC2 instances


#1

Hey guys, I currently use the EMR enrichment process and would like to tag the EC2 instances that it automatically creates.
Is that possible?
I tried to set the tags on the config file, but they don’t get replicated to the EMR cluster.

Thanks in advance!


#2

Hi thiagophx,

Setting the monitoring.tags section of the configuration to a map of key-value pairs should add tags to your EMR cluster. However, it won’t necessarily add those tags to the instances which make up the cluster. According to Tagging Amazon EMR Clusters:

For Amazon EMR tags to propagate to your Amazon EC2 instances, your IAM policy for Amazon EC2 needs to allow permissions to call the Amazon EC2 CreateTags and DeleteTags APIs.
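
For illustration, a policy granting those two permissions could look roughly like the one generated below. This is only a sketch: the `"Resource": "*"` scope and the helper name `build_tagging_policy` are assumptions, and which role or user the policy needs to be attached to depends on your setup.

```python
import json


def build_tagging_policy():
    """Return an IAM policy document allowing EC2 tag creation and deletion.

    Assumption: Resource "*" is used for brevity; narrow it for real use.
    """
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": ["ec2:CreateTags", "ec2:DeleteTags"],
                "Resource": "*",
            }
        ],
    }


if __name__ == "__main__":
    # Print the policy as JSON so it can be pasted into the IAM console.
    print(json.dumps(build_tagging_policy(), indent=2))
```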

Could you paste in your configuration file with sensitive information removed? Also, could you confirm whether the cluster itself has the tags you added?
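
If the console is not convenient, the cluster's tags can also be read through the EMR API. The sketch below assumes boto3 is installed and AWS credentials are configured; `tags_from_description` and `fetch_cluster_tags` are hypothetical helper names, and the cluster ID in the usage comment is a placeholder.

```python
def tags_from_description(description):
    """Extract tags as a dict from an EMR DescribeCluster-style response."""
    tags = description.get("Cluster", {}).get("Tags", [])
    return {tag["Key"]: tag.get("Value", "") for tag in tags}


def fetch_cluster_tags(cluster_id, region=None):
    """Query EMR for a cluster's tags (needs boto3 and AWS credentials)."""
    import boto3  # imported lazily so the helper above works without it

    emr = boto3.client("emr", region_name=region)
    return tags_from_description(emr.describe_cluster(ClusterId=cluster_id))


# Usage (placeholder cluster ID, replace with your own):
#   fetch_cluster_tags("j-XXXXXXXXXXXXX", region="eu-west-1")
```

An empty result would suggest the tags never reached the cluster, which points at the configuration rather than at IAM permissions.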

Regards,
Fred


#3

Hi Fred.
My config file looks like this:
monitoring: tags: { Name: "data-pipeline-enrichment" } logging: level: DEBUG snowplow: method: get app_id: emr-enrich collector: endpoint-to-collector
Indentation is wrong, but that’s just the editor.

In my case I don’t see the tags on either the EMR cluster or the EC2 instances.

Thanks.


#4

When I run the app with the following in the configuration file, it works - the tag Name = data-pipeline-enrichment is visible in the console.

monitoring:
  tags: { Name: "data-pipeline-enrichment" }

Could you paste your configuration file with sensitive information removed to verify that the indentation is correct? To do this, type three backticks, paste the code on the next line, and then type three more backticks on the final line.
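
As a rough way to sanity-check the nesting before pasting, something like the following could be used. It is not a YAML parser, just a crude comparison of leading spaces, and `is_nested_under` is a hypothetical helper written for this example.

```python
def is_nested_under(config_text, parent, child):
    """Crudely check that `child:` is indented somewhere beneath a top-level `parent:`.

    This is not a YAML parser: it only compares leading spaces line by line,
    and it matches the child at any depth under the parent.
    """
    inside_parent = False
    for line in config_text.splitlines():
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue  # skip blank lines and comments
        indent = len(line) - len(line.lstrip(" "))
        if indent == 0:
            # A new top-level key starts; track whether it is the parent.
            inside_parent = stripped.startswith(parent + ":")
        elif inside_parent and stripped.startswith(child + ":"):
            return True
    return False


good = 'monitoring:\n  tags: { Name: "data-pipeline-enrichment" }\n'
flat = 'monitoring: tags: { Name: "data-pipeline-enrichment" }\n'
print(is_nested_under(good, "monitoring", "tags"))
print(is_nested_under(flat, "monitoring", "tags"))
```

The first call reports the properly indented snippet as nested; the second, where everything sits on one line, does not.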

Additionally, if you have set ami_version to something less than 4.3.0, could you try it again with ami_version: 4.3.0?

Regards,
Fred


#5

Hi Fred.

I tried to bump the ami_version, but it doesn’t work. It seems that I’m using an older version of Snowplow and need to upgrade it first before I can use the new ami_version.

This is my current config file:

aws:
  access_key_id:
  secret_access_key:
  s3:
    region: "eu-west-1"
    buckets:
      assets: s3://snowplow-hosted-assets  # DO NOT CHANGE unless you are hosting the jarfiles etc yourself in your own bucket
      jsonpath_assets: s3://my-bucket/jsonpaths
      log: s3n://my-bucket/logs/
      raw:
        in: ["s3n://my-bucket/raw/"]
        processing: s3n://my-bucket/processing/
        archive: s3n://my-bucket/archived/
      enriched:
        good: s3n://my-bucket/enriched/good       # e.g. s3://my-out-bucket/enriched/good
        bad: s3n://my-bucket/enriched/bad        # e.g. s3://my-out-bucket/enriched/bad
        errors: # Leave blank unless :continue_on_unexpected_error: set to true below
        archive: s3n://my-bucket/enriched/archived    # Where to archive enriched events to, e.g. s3://my-out-bucket/enriched/archive
      shredded:
        good: s3n://my-bucket/shredded/good/       # e.g. s3://my-out-bucket/shredded/good
        bad: s3n://my-bucket/shredded/bad/        # e.g. s3://my-out-bucket/shredded/bad
        errors: # Leave blank unless :continue_on_unexpected_error: set to true below
        archive: s3n://my-bucket/shredded/archived/   # Not required for Postgres currently
  emr:
    ami_version: 3.7.0      # Don't change this
    region: eu-west-1        # Always set this
    jobflow_role: EMR_EC2_DefaultRole # Created using $ aws emr create-default-roles
    service_role: EMR_DefaultRole     # Created using $ aws emr create-default-roles
    placement: eu-west-1b     # Set this if not running in VPC. Leave blank otherwise
    ec2_key_name: my-key
    bootstrap: []           # Set this to specify custom bootstrap actions. Leave empty otherwise
    software:
      lingual: "1.1"             # To launch on cluster, provide version, "1.1", keep quotes
    # Adjust your Hadoop cluster below
    jobflow:
      master_instance_type: m1.medium
      core_instance_count: 3
      core_instance_type: m1.large
      task_instance_count: 0 # Increase to use spot instances
      task_instance_type: m1.large
      task_instance_bid: 0.015 # In USD. Adjust bid, or leave blank for non-spot-priced (i.e. on-demand) task instances
    bootstrap_failure_tries: 3 # Number of times to attempt the job in the event of bootstrap failures
collectors:
  format: thrift # For example: 'clj-tomcat' for the Clojure Collector, 'thrift' for Thrift records, 'tsv/com.amazon.aws.cloudfront/wd_access_log' for Cloudfront access logs or 'ndjson/urbanairship.connect/v1' for UrbanAirship Connect events
enrich:
  job_name: Snowplow ETL # Give your job a name
  versions:
    hadoop_enrich: 1.3.0 # Version of the Hadoop Enrichment process
    hadoop_shred: 0.6.0 # Version of the Hadoop Shredding process
    hadoop_elasticsearch: 0.1.0 # Version of the Hadoop to Elasticsearch copying process
  continue_on_unexpected_error: false # Set to true (and set :out_errors: above) if you don't want any exceptions thrown from ETL
  output_compression: NONE # Compression only supported with Redshift, set to NONE if you have Postgres targets. Allowed formats: NONE, GZIP
storage:
  download:
    folder: # Postgres-only config option. Where to store the downloaded files. Leave blank for Redshift
  targets:
    - name: "My Redshift database"
      type: redshift
      host: "" # The endpoint as shown in the Redshift console
      database: "" # Name of database
      port: 5439 # Default Redshift port
      ssl_mode: disable # One of disable (default), require, verify-ca or verify-full
      table: atomic.events
      username: ""
      password: ""
      maxerror: 1 # Stop loading on first error, or increase to permit more load errors
      comprows: 200000 # Default for a 1 XL node cluster. Not used unless --include compupdate specified
monitoring:
  tags: { Name: "data-pipeline-enrichment" } # Name-value pairs describing this job
  logging:
    level: DEBUG # You can optionally switch to INFO for production
  snowplow:
    method: get
    app_id: emr-enrich
    collector: ""

#6

Hi @thiagophx - yes that makes sense - you should only update AMI versions as part of regular Snowplow upgrades; there are some dependency issues between AMI versions and the Snowplow Hadoop Enrich and Shred components.