Help with provisioning rdb loader


#1

Hi, I was wondering if I could get some provisioning help. In our pipeline, the last step is loading shredded files from S3 into Redshift. Here’s our JSON for the load step:

{
        "type": "CUSTOM_JAR",
        "name": "rdb load step",
        "actionOnFailure": "CANCEL_AND_WAIT",
        "jar": "s3://snowplow-hosted-assets/4-storage/rdb-loader/snowplow-rdb-loader-0.14.0.jar",
        "arguments": [
          "--config",
          "{{base64File "/root/dataflow-runner_dir/configs/emr.yml"}}",
          "--target",
          "{{base64File "/root/dataflow-runner_dir/configs/targets/redshift.conf"}}",
          "--resolver",
          "{{base64File "/root/dataflow-runner_dir/configs/resolver.json"}}",
          "--folder",
          "s3n://piv-data-{{systemEnv "AWS_REGION"}}-{{systemEnv "PRODUCTION_ENV"}}-good/shredded/good/run={{timeWithFormat "1540322909" "2006-01-02-15-04-05"}}/",
          "--logkey",
          "s3n://piv-data-{{systemEnv "AWS_REGION"}}-{{systemEnv "PRODUCTION_ENV"}}-good/log/rdb-loader-{{timeWithFormat "1540322909" "2006-01-02-15-04-05"}}"
        ]
      }

As you can see, I’m supplying the AWS region and production environment as environment variables. The problem appears when we get to the actual RDB load execution. The environment variables are passed along to EMR, and I reference them in the .yml (as documented here):

aws:
  # Credentials can be hardcoded or set in environment variables
  access_key_id: <%= ENV['AWS_ACCESS_KEY_ID'] %>
  secret_access_key: <%= ENV['AWS_SECRET_ACCESS_KEY'] %>
  s3:
    region: <%= ENV['AWS_REGION'] %> # us-west-2
    buckets:
      assets: s3://snowplow-hosted-assets # DO NOT CHANGE unless you are hosting the jarfiles etc yourself in your own bucket
      jsonpath_assets: s3://piv-data-<%= ENV['AWS_REGION'] %>-${PRODUCTION_ENV}-iglu
      log: s3://piv-data-<%= ENV['AWS_REGION'] %>-${PRODUCTION_ENV}-good/log/emr
      encrypted: false # Whether the buckets below are encrypted using server side encryption (SSE-S3)
      enriched:
        good: s3://piv-data-<%= ENV['AWS_REGION'] %>-${PRODUCTION_ENV}-good/staging
        archive: s3://piv-data-<%= ENV['AWS_REGION'] %>-${PRODUCTION_ENV}-good/archive
        stream: s3://piv-data-<%= ENV['AWS_REGION'] %>-${PRODUCTION_ENV}-good/staging
      shredded:
        good: s3://piv-data-<%= ENV['AWS_REGION'] %>-${PRODUCTION_ENV}-good/shredded/good
        bad: s3://piv-data-<%= ENV['AWS_REGION'] %>-${PRODUCTION_ENV}-good/shredded/bad
        errors: 
        archive: s3://piv-data-<%= ENV['AWS_REGION'] %>-${PRODUCTION_ENV}-good/archive/shredded
  emr:
    ami_version: 5.9.0
    region: <%= ENV['AWS_REGION'] %>        # Always set this
    jobflow_role: EMR_EC2_DefaultRole # Created using $ aws emr create-default-roles
    service_role: EMR_DefaultRole     # Created using $ aws emr create-default-roles
    placement: <%= ENV['AWS_REGION'] %>a # Set this if not running in VPC. Leave blank otherwise
    ec2_subnet_id: # Set this if running in VPC. Leave blank otherwise
    ec2_key_name: snowplow-rocket-key
    security_configuration: 
    bootstrap: []           # Set this to specify custom bootstrap actions. Leave empty otherwise
    software:
      hbase:                # Optional. To launch on cluster, provide version, "0.92.0", keep quotes. Leave empty otherwise.
      lingual:              # Optional. To launch on cluster, provide version, "1.1", keep quotes. Leave empty otherwise.
    # Adjust your Hadoop cluster below
    jobflow:
      job_name: Snowplow ETL # Give your job a name
      master_instance_type: m1.medium
      core_instance_count: 2
      core_instance_type: m1.medium
      core_instance_ebs:    # Optional. Attach an EBS volume to each core instance.
        volume_size: 100    # Gigabytes
        volume_type: "gp2"
        volume_iops: 400    # Optional. Will only be used if volume_type is "io1"
        ebs_optimized: false # Optional. Will default to true
      task_instance_count: 0 # Increase to use spot instances
      task_instance_type: m1.medium
      task_instance_bid: 0.015 # In USD. Adjust bid, or leave blank for non-spot-priced (i.e. on-demand) task instances
    bootstrap_failure_tries: 3 # Number of times to attempt the job in the event of bootstrap failures
    configuration:
      yarn-site:
        yarn.resourcemanager.am.max-attempts: "1"
      spark:
        maximizeResourceAllocation: "true"
    additional_info:        # Optional JSON string for selecting additional features
collectors:
  format: 'thrift' # For example: 'clj-tomcat' for the Clojure Collector, 'thrift' for Thrift records, 'tsv/com.amazon.aws.cloudfront/wd_access_log' for Cloudfront access logs or 'ndjson/urbanairship.connect/v1' for UrbanAirship Connect events
enrich:
  versions:
    spark_enrich: 1.16.0 # Version of the Spark Enrichment process
  continue_on_unexpected_error: false # Set to 'true' (and set :out_errors: above) if you don't want any exceptions thrown from ETL
  output_compression: NONE # Compression only supported with Redshift, set to NONE if you have Postgres targets. Allowed formats: NONE, GZIP
storage:
  versions:
    rdb_loader: 0.14.0
    rdb_shredder: 0.13.1        # Version of the Spark Shredding process
    hadoop_elasticsearch: 0.1.0 # Version of the Hadoop to Elasticsearch copying process
monitoring:
  tags: {} # Name-value pairs describing this job
  logging:
    level: DEBUG # You can optionally switch to INFO for production
  snowplow:
    method: get
    protocol: http
    port: 80
    app_id: piv-data # e.g. snowplow
    collector: 127.0.0.1

Everything works except the region parameter for S3. If I don’t hardcode the region as ‘us-west-2’, I get the error:

ERROR: Data loading error [Amazon](500310) Invalid operation: syntax error at or near "AWS_REGION" 
Position: 220;
Following steps completed: [Discover]

Is it just me, or does this error make no sense? Why would only the s3 section of the config fail to load the environment variable, when the emr section doesn’t seem to care? Thanks in advance for any help.


#2

If you’re using double quotes within a string (for folder and logkey) you’ll want to escape them and try again.


#3

Yeah, I saw that in the example JSON, but there hasn’t been any problem finding the buckets on S3. I’ll escape the characters and see if that changes anything.


#4

Again, everything works fine if the S3 region is hardcoded in the RDB Loader config.


#5

Here’s the full JSON with escaped quotes:

{
  "schema": "iglu:com.snowplowanalytics.dataflowrunner/PlaybookConfig/avro/1-0-1",
  "data": {
    "region": "{{systemEnv \"AWS_REGION\"}}",
    "credentials": {
      "accessKeyId": "{{systemEnv \"AWS_ACCESS_KEY_ID\"}}",
      "secretAccessKey": "{{systemEnv \"AWS_SECRET_ACCESS_KEY\"}}"
    },
    "steps": [
      {
        "type": "CUSTOM_JAR",
        "name": "S3DistCp Step: Enriched events -> staging S3",
        "actionOnFailure": "CANCEL_AND_WAIT",
        "jar": "/usr/share/aws/emr/s3-dist-cp/lib/s3-dist-cp.jar",
        "arguments": [
          "--src","s3n://piv-data-{{systemEnv \"AWS_REGION\"}}-{{systemEnv \"PRODUCTION_ENV\"}}-good/enriched/",
          "--dest","s3n://piv-data-{{systemEnv \"AWS_REGION\"}}-{{systemEnv \"PRODUCTION_ENV\"}}-good/staging/run={{timeWithFormat \"1540322909\" \"2006-01-02-15-04-05\"}}/"
        ]
      },
      {
        "type": "CUSTOM_JAR",
        "name": "rdb shred step",
        "actionOnFailure": "CANCEL_AND_WAIT",
        "jar": "command-runner.jar",
        "arguments": [
          "spark-submit",
          "--class", "com.snowplowanalytics.snowplow.storage.spark.ShredJob",
          "--master", "yarn",
          "--deploy-mode", "cluster",
          "s3://snowplow-hosted-assets/4-storage/rdb-shredder/snowplow-rdb-shredder-0.14.0.jar",
          "--iglu-config",
          "{{base64File \"/root/dataflow-runner_dir/configs/resolver.json\"}}",
          "--input-folder",
          "s3n://piv-data-{{systemEnv \"AWS_REGION\"}}-{{systemEnv \"PRODUCTION_ENV\"}}-good/staging/run={{timeWithFormat \"1540322909\" \"2006-01-02-15-04-05\"}}/",
          "--output-folder",
          "s3n://piv-data-{{systemEnv \"AWS_REGION\"}}-{{systemEnv \"PRODUCTION_ENV\"}}-good/shredded/good/run={{timeWithFormat \"1540322909\" \"2006-01-02-15-04-05\"}}/",
          "--bad-folder",
          "s3n://piv-data-{{systemEnv \"AWS_REGION\"}}-{{systemEnv \"PRODUCTION_ENV\"}}-good/shredded/bad/run={{timeWithFormat \"1540322909\" \"2006-01-02-15-04-05\"}}/"
        ]
      },
      {
        "type": "CUSTOM_JAR",
        "name": "rdb load step",
        "actionOnFailure": "CANCEL_AND_WAIT",
        "jar": "s3://snowplow-hosted-assets/4-storage/rdb-loader/snowplow-rdb-loader-0.14.0.jar",
        "arguments": [
          "--config",
          "{{base64File \"/root/dataflow-runner_dir/configs/emr.yml\"}}",
          "--target",
          "{{base64File \"/root/dataflow-runner_dir/configs/targets/redshift.conf\"}}",
          "--resolver",
          "{{base64File \"/root/dataflow-runner_dir/configs/resolver.json\"}}",
          "--folder",
          "s3n://piv-data-{{systemEnv \"AWS_REGION\"}}-{{systemEnv \"PRODUCTION_ENV\"}}-good/shredded/good/run={{timeWithFormat \"1540322909\" \"2006-01-02-15-04-05\"}}/",
          "--logkey",
          "s3n://piv-data-{{systemEnv \"AWS_REGION\"}}-{{systemEnv \"PRODUCTION_ENV\"}}-good/log/rdb-loader-{{timeWithFormat \"1540322909\" \"2006-01-02-15-04-05\"}}"
        ]
      }
    ],
    "tags": [
    ]
  }
}

#6

Escaping the quotes as in the JSON above yields the error:

template: rs-load:4: unexpected "\\" in operand

#7

rs-load is the name I gave the JSON file.


#8

@ihor, since you recommended this method, do you have any suggestions?


#9

I’m also looking for more suggestions. Thanks for everything I’ve found so far, but if anyone can help with anything more, I would deeply appreciate it!