Hey guys, I currently use the EMR enrichment process and I would like to tag the EC2 instances created automatically.
Is that possible?
I tried to set the tags on the config file, but they don’t get replicated to the EMR cluster.
Setting the monitoring.tags section of the configuration to a map of key-value pairs should add tags to your EMR cluster. However, it won't necessarily add those tags to the instances that make up the cluster. According to Tagging Amazon EMR Clusters:

> For Amazon EMR tags to propagate to your Amazon EC2 instances, your IAM policy for Amazon EC2 needs to allow permissions to call the Amazon EC2 CreateTags and DeleteTags APIs.
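As a rough sketch, the statement you'd need in the IAM policy attached to the cluster's EC2 instance role (typically EMR_EC2_DefaultRole, per your jobflow_role setting) would look something like this — scope the Resource down if your policy conventions require it:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "ec2:CreateTags",
        "ec2:DeleteTags"
      ],
      "Resource": "*"
    }
  ]
}
```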
Could you paste in your configuration file with sensitive information removed? Also, could you confirm whether the cluster itself has the tags you added?
Hi Fred.
My config file looks like this:

```yaml
monitoring:
  tags: { Name: "data-pipeline-enrichment" }
  logging:
    level: DEBUG
  snowplow:
    method: get
    app_id: emr-enrich
    collector: endpoint-to-collector
```
In my case, I don't see the tags on either the EMR cluster or the EC2 instances.
Could you paste your configuration file with sensitive information removed to verify that the indentation is correct? To do this, type three backticks, paste the code on the next line, and then type three more backticks on the final line.
Additionally, if you have set ami_version to something less than 4.3.0, could you try it again with ami_version: 4.3.0?
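For reference, and assuming the rest of your pipeline components are compatible, that would mean changing the emr section like so:

```yaml
emr:
  ami_version: 4.3.0
```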
I tried to bump the ami_version, but it doesn't work. It seems I'm running an older version of Snowplow, so I first need to upgrade it before I can use the new ami_version.
This is my current config file:
```yaml
aws:
  access_key_id:
  secret_access_key:
  s3:
    region: "eu-west-1"
    buckets:
      assets: s3://snowplow-hosted-assets # DO NOT CHANGE unless you are hosting the jarfiles etc yourself in your own bucket
      jsonpath_assets: s3://my-bucket/jsonpaths
      log: s3n://my-bucket/logs/
      raw:
        in: ["s3n://my-bucket/raw/"]
        processing: s3n://my-bucket/processing/
        archive: s3n://my-bucket/archived/
      enriched:
        good: s3n://my-bucket/enriched/good # e.g. s3://my-out-bucket/enriched/good
        bad: s3n://my-bucket/enriched/bad # e.g. s3://my-out-bucket/enriched/bad
        errors: # Leave blank unless :continue_on_unexpected_error: set to true below
        archive: s3n://my-bucket/enriched/archived # Where to archive enriched events to, e.g. s3://my-out-bucket/enriched/archive
      shredded:
        good: s3n://my-bucket/shredded/good/ # e.g. s3://my-out-bucket/shredded/good
        bad: s3n://my-bucket/shredded/bad/ # e.g. s3://my-out-bucket/shredded/bad
        errors: # Leave blank unless :continue_on_unexpected_error: set to true below
        archive: s3n://my-bucket/shredded/archived/ # Not required for Postgres currently
  emr:
    ami_version: 3.7.0 # Don't change this
    region: eu-west-1 # Always set this
    jobflow_role: EMR_EC2_DefaultRole # Created using $ aws emr create-default-roles
    service_role: EMR_DefaultRole # Created using $ aws emr create-default-roles
    placement: eu-west-1b # Set this if not running in VPC. Leave blank otherwise
    ec2_key_name: my-key
    bootstrap: [] # Set this to specify custom bootstrap actions. Leave empty otherwise
    software:
      lingual: "1.1" # To launch on cluster, provide version, "1.1", keep quotes
    # Adjust your Hadoop cluster below
    jobflow:
      master_instance_type: m1.medium
      core_instance_count: 3
      core_instance_type: m1.large
      task_instance_count: 0 # Increase to use spot instances
      task_instance_type: m1.large
      task_instance_bid: 0.015 # In USD. Adjust bid, or leave blank for non-spot-priced (i.e. on-demand) task instances
    bootstrap_failure_tries: 3 # Number of times to attempt the job in the event of bootstrap failures
collectors:
  format: thrift # For example: 'clj-tomcat' for the Clojure Collector, 'thrift' for Thrift records, 'tsv/com.amazon.aws.cloudfront/wd_access_log' for Cloudfront access logs or 'ndjson/urbanairship.connect/v1' for UrbanAirship Connect events
enrich:
  job_name: Snowplow ETL # Give your job a name
  versions:
    hadoop_enrich: 1.3.0 # Version of the Hadoop Enrichment process
    hadoop_shred: 0.6.0 # Version of the Hadoop Shredding process
    hadoop_elasticsearch: 0.1.0 # Version of the Hadoop to Elasticsearch copying process
  continue_on_unexpected_error: false # Set to true (and set :out_errors: above) if you don't want any exceptions thrown from ETL
  output_compression: NONE # Compression only supported with Redshift, set to NONE if you have Postgres targets. Allowed formats: NONE, GZIP
storage:
  download:
    folder: # Postgres-only config option. Where to store the downloaded files. Leave blank for Redshift
  targets:
    - name: "My Redshift database"
      type: redshift
      host: "" # The endpoint as shown in the Redshift console
      database: "" # Name of database
      port: 5439 # Default Redshift port
      ssl_mode: disable # One of disable (default), require, verify-ca or verify-full
      table: atomic.events
      username: ""
      password: ""
      maxerror: 1 # Stop loading on first error, or increase to permit more load errors
      comprows: 200000 # Default for a 1 XL node cluster. Not used unless --include compupdate specified
monitoring:
  tags: { Name: "data-pipeline-enrichment" } # Name-value pairs describing this job
  logging:
    level: DEBUG # You can optionally switch to INFO for production
  snowplow:
    method: get
    app_id: emr-enrich
    collector: ""
```
Hi @thiagophx - yes, that makes sense. You should only update AMI versions as part of regular Snowplow upgrades, since there are dependency issues between AMI versions and the Snowplow Hadoop Enrich and Shred components.
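Once you're on a compatible version, you can check whether the tags actually propagated without clicking through the console. Here's a minimal sketch (my assumptions: boto3 installed, AWS credentials configured, and a placeholder cluster id) that compares your configured tags against what the EMR API reports:

```python
def missing_tags(expected, actual):
    """Return the expected key/value pairs absent from an AWS-style Tags list.

    `actual` has the shape the EMR/EC2 APIs return:
    [{"Key": "Name", "Value": "data-pipeline-enrichment"}, ...]
    """
    present = {t["Key"]: t["Value"] for t in actual}
    # Keep only the pairs that are missing or have the wrong value
    return {k: v for k, v in expected.items() if present.get(k) != v}

# Usage against a live cluster (requires boto3 and AWS credentials;
# replace "j-..." with your real cluster id):
#   import boto3
#   emr = boto3.client("emr", region_name="eu-west-1")
#   tags = emr.describe_cluster(ClusterId="j-...")["Cluster"]["Tags"]
#   print(missing_tags({"Name": "data-pipeline-enrichment"}, tags))
```

An empty dict means every configured tag is present on the cluster; the same helper works on an instance's `Tags` list from `ec2.describe_instances` to confirm propagation.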