Redshift empty after successful job run

ckrishnamoorthy · August 15, 2020, 8:41am

I am using the following pipeline.

ScalaStreamCollector → StreamEnricherToKinesis → S3SnowPlowLoader → S3 → EmrEtlRunner in Enrich Mode → DataLoader → Redshift

Here is my config file:

aws:
  # Credentials can be hardcoded or set in environment variables
  access_key_id: <%= ENV['AWS_SNOWPLOW_ACCESS_KEY'] %>
  secret_access_key: <%= ENV['AWS_SNOWPLOW_SECRET_KEY'] %>
  s3:
    region: us-east-2
    buckets:
      assets: s3n://snowplow-hosted-assets-us-east-2 # DO NOT CHANGE unless you are hosting the jarfiles etc yourself in your own bucket
      jsonpath_assets: # If you have defined your own JSON Schemas, add the s3:// path to your own JSON Path files in your own bucket here
      log: s3n://snowplow-logs.instacar.in/logs
      encrypted: false # Whether the buckets below are enrcrypted using server side encryption (SSE-S3)
      enriched:
        good: s3n://snowplow-databucket.instacar.in/good # e.g. s3://my-out-bucket/enriched/good
        bad: s3n://snowplow-databucket.instacar.in/bad # e.g. s3://my-out-bucket/enriched/bad
        errors: s3n://snowplow-databucket.instacar.in/errors # Leave blank unless :continue_on_unexpected_error: set to true below
        archive: s3n://snowplow-databucket.instacar.in/archive #Where to archive enriched events to, e.g. s3://my-archive-bucket/enriched
        stream: s3n://snowplow-enrichedrecords.instacar.in
      shredded:
        good: s3n://snowplow-databucket.instacar.in/shredded/good # e.g. s3://my-out-bucket/shredded/good
        bad: s3n://snowplow-databucket.instacar.in/shredded/bad # e.g. s3://my-out-bucket/shredded/bad
        errors: s3n://snowplow-databucket.instacar.in/shredded/errors # Leave blank unless :continue_on_unexpected_error: set to true below
        archive: s3n://snowplow-databucket.instacar.in/shredded/archive # Where to archive shredded events to,e.g. s3://my-archive-bucket/shredded
    consolidate_shredded_output: false # Whether to combine files when copying from hdfs to s3
  emr:
    ami_version: 5.9.0
    region: us-east-2 # Always set this
    jobflow_role: EMR_EC2_DefaultRole # Created using $ aws emr create-default-roles
    service_role: EMR_DefaultRole     # Created using $ aws emr create-default-roles
    placement:    # Set this if not running in VPC. Leave blank otherwise
    ec2_subnet_id: subnet-2e0b6f47   # Set this if running in VPC. Leave blank otherwise
    ec2_key_name: snowplow1
    security_configuration: # Specify your EMR security configuration if needed. Leave blank otherwise
    bootstrap: []           # Set this to specify custom boostrap actions. Leave empty otherwise
    software:
      hbase:                # Optional. To launch on cluster, provide version, "0.92.0", keep quotes. Leave empty otherwise.
      lingual:              # Optional. To launch on cluster, provide version, "1.1", keep quotes. Leave empty otherwise.
    # Adjust your Hadoop cluster below
    jobflow:
      job_name: SnowplowETL # Give your job a name
      master_instance_type:  m4.2xlarge
      core_instance_count: 2
      core_instance_type:  m4.2xlarge
      core_instance_bid: 0.1
      core_instance_ebs:    # Optional. Attach an EBS volume to each core instance.
        volume_size: 100    # Gigabytes
        volume_type: "gp2"
        volume_iops: 400    # Optional. Will only be used if volume_type is "io1"
        ebs_optimized: false # Optional. Will default to true
      task_instance_count: 0 # Increase to use spot instances
      task_instance_type: m1.medium
      task_instance_bid: 0.1  # In USD. Adjust bid, or leave blank for non-spot-priced (i.e. on-demand) task instances
    bootstrap_failure_tries: 3 # Number of times to attempt the job in the event of bootstrap failures
    configuration:
      yarn-site:
        yarn.nodemanager.vmem-check-enabled: "false"
        yarn.nodemanager.resource.memory-mb: "27648"
        yarn.scheduler.maximum-allocation-mb: "27648"
        yarn.resourcemanager.am.max-attempts: "2"
      spark:
        maximizeResourceAllocation: "true"
      spark-defaults:
        spark.dynamicAllocation.enabled: "false"
        spark.executor.instances: "2"
        spark.yarn.executor.memoryOverhead: "1024"
        spark.executor.memory: 8G
        spark.executor.cores: "1"
        spark.yarn.driver.memoryOverhead: "1024"
        spark.driver.memory: 8G
        spark.driver.cores: "1"
        spark.default.parallelism: "8"
    additional_info:        # Optional JSON string for selecting additional features
collectors:
  format: thrift # For example: 'clj-tomcat' for the Clojure Collector, 'thrift' for Thrift records, 'tsv/com.amazon.aws.cloudfront/wd_access_log' for Cloudfront access logs or 'ndjson/urbanairship.connect/v1' for UrbanAirship Connect events
enrich:
  versions:
    spark_enrich: 1.18.0 # Version of the Spark Enrichment process
  continue_on_unexpected_error: false # Set to 'true' (and set :out_errors: above) if you don't want any exceptions thrown from ETL
  output_compression: NONE # Compression only supported with Redshift, set to NONE if you have Postgres targets. Allowed formats: NONE, GZIP
storage:
  versions:
    rdb_loader: 0.14.0
    rdb_shredder: 0.13.1       # Version of the Spark Shredding process
    hadoop_elasticsearch: 0.1.0 # Version of the Hadoop to Elasticsearch copying process
monitoring:
  tags: {} # Name-value pairs describing this job
  logging:
    level: DEBUG # You can optionally switch to INFO for production

Here is my Redshift target file ( all usernames and pwds have been redacted ) :

{
    "schema": "iglu:com.snowplowanalytics.snowplow.storage/redshift_config/jsonschema/2-1-0",
    "data": {
        "name": "Redshiftenriched events storage",
        "host": "inscompanysnowflakecluster.c2nasstlico1.us-east-2.redshift.amazonaws.com",
        "database": "snowplow",
        "port": 5439,
        "username": "myuser",
        "password": "mypassword",
        "schema": "atomic",
        "compRows": 1000,
        "sshTunnel": null,
        "sslMode": "DISABLE",
        "purpose": "ENRICHED_EVENTS",
        "id": "56gh2265fs-dbc1-11ea-87d0-0242ac130003",
        "compRows": 10000,
        "maxError": 10,
        "roleArn": "arn:aws:iam::605934722134:role/RedshiftLoadRole"
    }
}

I see the archive folder in and records all tab separated.
Here is the archived folder with atomic events in S3. I checked RedshiftLogs, and i see a COPy to Redshift. But appears , nothing was copied since the source was empty.
src here is:

s3://snowplow-databucket.instacar.in/shredded/good/run=2020-08-15-07-03-11/atomic-events/

COPY atomic.events
FROM
  's3://snowplow-databucket.instacar.in/shredded/good/run=2020-08-15-07-03-11/atomic-events/' CREDENTIALS '' REGION AS 'us-east-2' DELIMITER ' ' MAXERROR 2 EMPTYASNULL FILLRECORD TRUNCATECOLUMNS TIMEFORMAT 'auto' ACCEPTINVCHARS

It seems like the src bucket for s3 from where dataloader picks up records is empty. No able to figure out, where .

ihor · August 15, 2020, 11:41pm

@ckrishnamoorthy, have you checked enriched:good to ensure there is good data to shred in the first place? What about shredded:bad bucket to see if all your enriched data failed at shredding?

I expect that enriched:stream bucket is the bucket you upload your streamed data into with S3 Loader? The workflow of data is like this: enriched:stream -> enriched:good -> enriched:shredded -> archive. It is depicted here.

By the way, using m4.2xlarge for master node is an overkill. The m4.large type will do just fine.

ckrishnamoorthy · August 16, 2020, 4:00am

Thanks @ihor for the quick response.

Yes, enriched:stream is the destination folder for s3snowplowloader.
shredded:bad has entries but they are all zero bytes, i see no bad records there
halfway through the process, i checked enriched:good, and it has valid records.
I was using m4.large initially, but often it would get stuck saying resource not available in EC2 , so figured to avoid delays, i would bump up a larger instance might be available all the time without delays . I would bump this down to m4.large soon.
shredded:archive also seems to have valid records.
And i dont see any credentials related errors in any of the lo

Another factor worth noting is , i am testing with latest snowplow js tracker file ( sp.js ) and i use a simple html file where i added trackers. I see trackers firing in Firefox and loading all the way to S3. Does this have anything to do with bad records in S3, that is having a html file , instead of a real website as source?

I think I will run the EmrEtlRunner without a db target and ensure enriched good s3 bucket has records. And then may be run emrEtlRunner again with just loader.

shredded:bad bucket:

shredded:archive bucket:

ckrishnamoorthy · August 18, 2020, 3:11am

Hi @ihor, some tips here could help me. I am essentially blocked at this step.

Cheers
Krishnamoorthy

ihor · August 18, 2020, 4:54pm

@ckrishnamoorthy, I do not see any obvious reason except that you seem to be using older versions of RDB Loader/Shredder and perhaps EmrEtlRunner as well. Your maxError is set to 10 and if there are any issues loading the data to Redshift (up to 10 events) it could go unnoticed. You could check your stl_load_errors table for any failed inserts perhaps?

ckrishnamoorthy · August 20, 2020, 9:52am

Hi @ihor,
I changed maxError to 1. And also bumped up RDB Loader/Shredder to 0.16.0 / 0.15.0. I also bumped up all the versions of EmrEtlRunner and other StreamEnricher to the latest available. ami_version is 5.9.0

I run into an error “Failing this attempt.Diagnostics: Exception from container-launch” .

Thanks,
Krishna

Topic		Replies	Views
Should I run rdb_load only? For engineers	7	1015	February 11, 2020
Can't load data back into redshift Troubleshooting	7	1759	June 4, 2018
Storage Loader successful but not loading Redshift or Postgres DB Storage targets	4	1918	March 28, 2017
Loading data from s3 to Redshift after EmrEtlRunner Troubleshooting	7	3267	November 19, 2018
Rerun storage loader from archived files Storage targets	3	1466	January 29, 2017

Redshift empty after successful job run

Related Topics