Incorrect IP Address in the batch pipeline

ramandamodar · April 9, 2018, 12:41pm

In Redshift, many of my events have an incorrect IP address: the stored value is just 2, 2001, 2003, 2401, 2402, 2405, 2406, 2407, 2409, 2600, 2601, 2604, 2620, or 2804 instead of a full address.

Here is my EmrEtlRunner config file:

aws:
  # Credentials can be hardcoded or set in environment variables
  access_key_id: ###
  secret_access_key: ###
  s3:
    region: us-east-1
    buckets:
      assets: s3://snowplow-hosted-assets # DO NOT CHANGE unless you are hosting the jarfiles etc yourself in your own bucket
      jsonpath_assets: s3://snowplow-iglu-repository-replica/jsonpaths
      log: s3://snowplow-emretl-enricher/logs
      raw:
        in:                  # This is a YAML array of one or more in buckets - you MUST use hyphens before each entry in the array, as below
          - s3://snowplow-tracking-unenriched        # e.g. s3://my-old-collector-bucket
        processing: s3://snowplow-emretl-enricher/raw/processing-new
        #processing: s3://snowplow-emretl-enricher/raw/archive/run=2018-01-20-12-00-19
        archive: s3://snowplow-emretl-enricher/raw/archive
      enriched:
        good: s3://snowplow-emretl-enricher/enriched/good       # e.g. s3://my-out-bucket/enriched/good
        bad: s3://snowplow-emretl-enricher/enriched/bad        # e.g. s3://my-out-bucket/enriched/bad
        errors: s3://snowplow-emretl-enricher/enriched/errors     # Leave blank unless :continue_on_unexpected_error: set to true below
        archive: s3://snowplow-emretl-enricher/enriched/archive    # Where to archive enriched events to, e.g. s3://my-archive-bucket/enriched
      shredded:
        good: s3://snowplow-emretl-enricher/shredded/good       # e.g. s3://my-out-bucket/shredded/good
        bad: s3://snowplow-emretl-enricher/shredded/bad        # e.g. s3://my-out-bucket/shredded/bad
        errors: s3://snowplow-emretl-enricher/shredded/errors     # Leave blank unless :continue_on_unexpected_error: set to true below
        archive: s3://snowplow-emretl-enricher/shredded/archive    # Where to archive shredded events to, e.g. s3://my-archive-bucket/shredded
  emr:
    ami_version: 5.5.0
    region: us-east-1        # Always set this
    jobflow_role: EMR_EC2_DefaultRole # Created using $ aws emr create-default-roles
    service_role: EMR_DefaultRole     # Created using $ aws emr create-default-roles
    placement:    # Set this if not running in VPC. Leave blank otherwise
    ec2_subnet_id: subnet-e0edddec # Set this if running in VPC. Leave blank otherwise
    ec2_key_name: snowplow-emr
    bootstrap: []           # Set this to specify custom bootstrap actions. Leave empty otherwise
    software:
      hbase:                # Optional. To launch on cluster, provide version, "0.92.0", keep quotes. Leave empty otherwise.
      lingual:              # Optional. To launch on cluster, provide version, "1.1", keep quotes. Leave empty otherwise.
    # Adjust your Hadoop cluster below
    jobflow:
      job_name: Snowplow EMR ETL # Give your job a name
      master_instance_type: m4.xlarge
      core_instance_count: 4
      core_instance_type: m4.xlarge
      core_instance_ebs:    # Optional. Attach an EBS volume to each core instance.
        volume_size: 200    # Gigabytes
        volume_type: "gp2"
        volume_iops: 800    # Optional. Will only be used if volume_type is "io1"
        ebs_optimized: false # Optional. Will default to true
      task_instance_count: 0 # Increase to use spot instances
      task_instance_type: m4.xlarge
      task_instance_bid: 0.015 # In USD. Adjust bid, or leave blank for non-spot-priced (i.e. on-demand) task instances
    bootstrap_failure_tries: 3 # Number of times to attempt the job in the event of bootstrap failures
    configuration:
      yarn-site:
        yarn.resourcemanager.am.max-attempts: "1"
      spark:
        maximizeResourceAllocation: "true"
    additional_info:        # Optional JSON string for selecting additional features
collectors:
  format: thrift # For example: 'clj-tomcat' for the Clojure Collector, 'thrift' for Thrift records, 'tsv/com.amazon.aws.cloudfront/wd_access_log' for Cloudfront access logs or 'ndjson/urbanairship.connect/v1' for UrbanAirship Connect events
enrich:
  versions:
    spark_enrich: 1.9.0 # Version of the Spark Enrichment process
  continue_on_unexpected_error: true # Set to 'true' (and set :out_errors: above) if you don't want any exceptions thrown from ETL
  output_compression: NONE # Compression only supported with Redshift, set to NONE if you have Postgres targets. Allowed formats: NONE, GZIP
storage:
  versions:
    rdb_loader: 0.12.0
    rdb_shredder: 0.12.0        # Version of the Spark Shredding process
    hadoop_elasticsearch: 0.1.0 # Version of the Hadoop to Elasticsearch copying process
monitoring:
  tags: {} # Name-value pairs describing this job
  logging:
    level: DEBUG # You can optionally switch to INFO for production
  snowplow:
    method: get
    app_id: snowplow-emr-etl # e.g. snowplow
    collector: track.popxo.com # e.g. d3rkrsqld9gmqf.cloudfront.net

The same events are stored with the correct IP address by the real-time pipeline.
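The values listed above look like the leading group of an IPv6 address (for example, the documentation address 2001:db8::1 starts with 2001). That reading is an assumption, not something confirmed in this thread. A minimal Python sketch for testing it against the stored values could look like the following. The helper names `is_valid_ip`, `looks_like_truncated_ipv6`, and `first_group` are hypothetical and not part of Snowplow.

```python
import ipaddress

# Values reported in this post as appearing in the IP address column.
REPORTED = ["2", "2001", "2003", "2401", "2402", "2405", "2406", "2407",
            "2409", "2600", "2601", "2604", "2620", "2804"]

HEX_DIGITS = set("0123456789abcdefABCDEF")


def is_valid_ip(value: str) -> bool:
    """Return True if the string parses as a complete IPv4 or IPv6 address."""
    try:
        ipaddress.ip_address(value)
        return True
    except ValueError:
        return False


def looks_like_truncated_ipv6(value: str) -> bool:
    """Heuristic (assumption): 1 to 4 hex digits that are not a valid address,
    i.e. what would remain of an IPv6 address cut at its first colon."""
    return (not is_valid_ip(value)
            and 1 <= len(value) <= 4
            and all(c in HEX_DIGITS for c in value))


def first_group(address: str) -> str:
    """Return the text before the first colon, to illustrate the suspected truncation."""
    return address.split(":", 1)[0]


if __name__ == "__main__":
    for value in REPORTED:
        print(value, looks_like_truncated_ipv6(value))
    # 2001:db8::1 is a documentation address, used here purely as an illustration.
    print(first_group("2001:db8::1"))
```

Running the same check over IP values exported from both the batch and the real-time tables would show whether only IPv6 traffic is affected, which would help narrow down where in the batch pipeline the value is being cut.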
