Shredding fails with custom schema in eu-central-1


#1

We’ve been running EMR ETL Runner (r95_ellora) on a 24-hour cron in eu-central-1 for the past month without issues. Yesterday we started tracking a custom event with an associated Iglu schema. This appears to have caused the “Elasticity Spark Step: Shred Enriched Events” step (com.snowplowanalytics.snowplow.storage.spark.ShredJob) to start failing; the error log contains:

18/01/26 09:55:03 INFO SparkContext: Created broadcast 0 from textFile at ShredJob.scala:341
18/01/26 09:55:06 WARN RoleMappings: Found no mappings configured with 'fs.s3.authorization.roleMapping', credentials resolution may not work as expected
18/01/26 09:55:14 ERROR ApplicationMaster: User class threw exception: java.lang.ExceptionInInitializerError
java.lang.ExceptionInInitializerError
	at com.amazon.ws.emr.hadoop.fs.rolemapping.DefaultS3CredentialsResolver.resolve(DefaultS3CredentialsResolver.java:23)
	at com.amazon.ws.emr.hadoop.fs.guice.CredentialsProviderOverrider.override(CredentialsProviderOverrider.java:26)
	at com.amazon.ws.emr.hadoop.fs.s3.lite.executor.GlobalS3Executor.executeOverriders(GlobalS3Executor.java:95)
	at com.amazon.ws.emr.hadoop.fs.s3.lite.executor.GlobalS3Executor.execute(GlobalS3Executor.java:76)
	at com.amazon.ws.emr.hadoop.fs.s3.lite.AmazonS3LiteClient.invoke(AmazonS3LiteClient.java:176)
	at com.amazon.ws.emr.hadoop.fs.s3.lite.AmazonS3LiteClient.doesBucketExist(AmazonS3LiteClient.java:88)
	at com.amazon.ws.emr.hadoop.fs.s3n.Jets3tNativeFileSystemStore.ensureBucketExists(Jets3tNativeFileSystemStore.java:138)
	at com.amazon.ws.emr.hadoop.fs.s3n.Jets3tNativeFileSystemStore.initialize(Jets3tNativeFileSystemStore.java:116)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:191)
	at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
	at com.sun.proxy.$Proxy45.initialize(Unknown Source)
	at com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem.initialize(S3NativeFileSystem.java:446)
	at com.amazon.ws.emr.hadoop.fs.EmrFileSystem.initialize(EmrFileSystem.java:109)
	at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2717)
	at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:93)
	at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2751)
	at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2733)
	at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:377)
	at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
	at org.apache.spark.sql.execution.datasources.DataSource.writeInFileFormat(DataSource.scala:407)
	at org.apache.spark.sql.execution.datasources.DataSource.write(DataSource.scala:474)
	at org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:48)
	at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58)
	at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56)
	at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:74)
	at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
	at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
	at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:138)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
	at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:135)
	at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:116)
	at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:92)
	at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:92)
	at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:610)
	at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:233)
	at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:217)
	at org.apache.spark.sql.DataFrameWriter.text(DataFrameWriter.scala:555)
	at com.snowplowanalytics.snowplow.storage.spark.ShredJob.run(ShredJob.scala:404)
	at com.snowplowanalytics.snowplow.storage.spark.ShredJob$.run(ShredJob.scala:79)
	at com.snowplowanalytics.snowplow.storage.spark.SparkJob$class.main(SparkJob.scala:32)
	at com.snowplowanalytics.snowplow.storage.spark.ShredJob$.main(ShredJob.scala:48)
	at com.snowplowanalytics.snowplow.storage.spark.ShredJob.main(ShredJob.scala)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:635)
Caused by: com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.SdkClientException: Unable to find a region via the region provider chain. Must provide an explicit region in the builder or setup environment to supply a region.
	at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.client.builder.AwsClientBuilder.setRegion(AwsClientBuilder.java:371)
	at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.client.builder.AwsClientBuilder.configureMutableProperties(AwsClientBuilder.java:337)
	at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.client.builder.AwsSyncClientBuilder.build(AwsSyncClientBuilder.java:46)
	at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.services.securitytoken.AWSSecurityTokenServiceClientBuilder.defaultClient(AWSSecurityTokenServiceClientBuilder.java:44)
	at com.amazon.ws.emr.hadoop.fs.util.AWSSessionCredentialsProviderFactory.<clinit>(AWSSessionCredentialsProviderFactory.java:19)
	... 51 more
18/01/26 09:55:14 INFO ApplicationMaster: Final app status: FAILED, exitCode: 15, (reason: User class threw exception: java.lang.ExceptionInInitializerError)

Our config looks like:

aws:
  access_key_id: "<snip>"
  secret_access_key: "<snip>"
  s3:
    region: eu-central-1
    buckets:
      assets: s3://snowplow-hosted-assets
      jsonpath_assets: 
      log: s3://bucket-name/log
      raw:
        in:
          - s3://bucket-name/
        processing: s3://bucket-name/processing
        archive: s3://bucket-name/archive
      enriched:
        good: s3://bucket-name/enriched/good
        bad: s3://bucket-name/enriched/bad
        errors:
        archive: s3://bucket-name/enriched/archive
      shredded:
        good: s3://bucket-name/shredded/good
        bad: s3://bucket-name/shredded/bad
        errors:
        archive: s3://bucket-name/shredded/archive
  emr:
    ami_version: 5.9.0
    region: eu-central-1 
    jobflow_role: EMR_EC2_DefaultRole
    service_role: EMR_DefaultRole 
    placement:
    ec2_subnet_id: <snip>
    ec2_key_name: <snip>
    bootstrap: []
    software:
      hbase:
      lingual: 
    jobflow:
      job_name: Snowplow ETL (stage)
      master_instance_type: m4.large
      core_instance_count: 2
      core_instance_type: m4.large
      core_instance_bid: 0.07
      core_instance_ebs: 
        volume_size: 100 
        volume_type: "gp2"
        volume_iops: 400 
        ebs_optimized: false 
      task_instance_count: 0
      task_instance_type: m4.large
      task_instance_bid: 0.07 
    bootstrap_failure_tries: 3
    configuration:
      yarn-site:
        yarn.resourcemanager.am.max-attempts: "1"
      spark:
        maximizeResourceAllocation: "true"
    additional_info: 
collectors:
  format: thrift 
enrich:
  versions:
    spark_enrich: 1.10.0 
  continue_on_unexpected_error: false 
  output_compression: NONE 
storage:
  versions:
    rdb_loader: 0.14.0
    rdb_shredder: 0.13.0
    hadoop_elasticsearch: 0.1.0 
monitoring:
  tags: {}
  logging:
    level: DEBUG

EMR can read from and write to the specified S3 bucket in the other steps. Any idea how we can fix this?

Many thanks,
Chris


#2

Hey @chriswarren - can you share the (anonymized) contents of your --targets directory?


#3

Hi @alex,

Thanks! --targets contains only the following redshift.json file:

{
    "schema": "iglu:com.snowplowanalytics.snowplow.storage/redshift_config/jsonschema/2-1-0",
    "data": {
        "name": "AWS Redshift enriched events storage",
        "host": "<snip>",
        "database": "snowplow",
        "port": 5439,
        "sslMode": "DISABLE",
        "username": "<snip>",
        "password": "<snip>",
        "roleArn": "<snip>",
        "schema": "atomic",
        "maxError": 1,
        "compRows": 20000,
        "sshTunnel": null,
        "purpose": "ENRICHED_EVENTS"
    }
}

(based off https://github.com/snowplow/snowplow/blob/master/4-storage/config/targets/redshift.json)


#4

Hey @chriswarren,

I think we need to sum up the following facts:

  1. Yesterday we started to track a custom event with an associated iglu schema
  2. The line causing the exception (com.snowplowanalytics.snowplow.storage.spark.ShredJob.run(ShredJob.scala:404)) refers to the ShredJob writing to the bad folder.

Is it possible that your s3.buckets.shredded.bad is not configured properly? Maybe it refers to another bucket. I have a feeling that your shred job never previously wrote anything to the bad bucket, and when you added the new schema it started to write there and stumbled upon this issue.

Have you tried looking into the bad bucket? One more thing I’d check is the region of the bad bucket: are you sure it is eu-central-1?
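(For reference, the bucket-region check above can be scripted with the AWS CLI; the bucket name here is a placeholder. One wrinkle worth knowing: get-bucket-location returns a null LocationConstraint for buckets in us-east-1, which the CLI prints as "None", so a sketch of the check with that normalisation might look like:)

```shell
# Sketch of the bucket-region check, assuming the AWS CLI is installed and
# credentials are configured. The real lookup would be:
#   region=$(aws s3api get-bucket-location --bucket bucket-name \
#     --query 'LocationConstraint' --output text)
region='eu-central-1'   # stand-in for the command output above
# Buckets in us-east-1 report a null LocationConstraint, printed as "None":
if [ "$region" = "None" ]; then
  region='us-east-1'
fi
echo "$region"
```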


#5

Hi @anton & @alex,

I’ve tried --resume-from shred today and the step completed successfully (it failed repeatedly yesterday), so the issue appears to be intermittent.
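(For anyone landing here later, a sketch of the resume invocation; the resolver/targets paths are placeholders and exact flag names may differ between EmrEtlRunner releases, so check your release's --help:)

```
# Hypothetical r95-era invocation; paths are placeholders.
./snowplow-emr-etl-runner \
  --config config.yml \
  --resolver iglu_resolver.json \
  --targets targets/ \
  --resume-from shred
```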

The config points to a different S3 key prefix for each s3 config option, but they all use the same bucket:

shredded:
  good: s3://beamly-data-snowplow-lzo-stage/shredded/good
  bad: s3://beamly-data-snowplow-lzo-stage/shredded/bad
  errors:

The bucket is in eu-central-1 (and emr-etl-runner is run in ECS in eu-central-1, as is the EMR cluster).

I’ll update this thread if we see this happen again.

Thanks,
Chris


#6

Thanks @chriswarren - keep us posted!