EMR intermittently fails at Loading S3 to Redshift


#1

Hi,

I am getting intermittent failures in the Snowplow EMR job during the “Elasticity Custom Jar Step: Load Redshift Storage Target” step. Is anyone else running into the same problem? I am on the latest release, v92. I think the problem might be a lag in S3: one step uploads a large number of files to S3, and the next step then immediately tries to access those files to load them into Redshift. Below is the stdout error from EMR:

Data loading error [Amazon](500310) Invalid operation: S3ServiceException:The specified key does not exist.,Status 404,Error NoSuchKey,Rid 0F095DB7B73D7948,ExtRid A1MS6DMeTvlqAn8hCfLs9HqH2Pn0LN9hHql+iSe4k9sC+LKArkFf+oPobirRC0wZuMakt6tE4lQ=,CanRetry 1
Details: 
 -----------------------------------------------
  error:  S3ServiceException:The specified key does not exist.,Status 404,Error NoSuchKey,Rid 0F095DB7B73D7948,ExtRid A1MS6DMeTvlqAn8hCfLs9HqH2Pn0LN9hHql+iSe4k9sC+LKArkFf+oPobirRC0wZuMakt6tE4lQ=,CanRetry 1
  code:      8001
  context:   S3 key being read : s3://XXXXX/shredded/good/run=2017-09-27-22-00-18/atomic-events/part-00061.gz
  query:     208888
  location:  table_s3_scanner.cpp:352
  process:   query3_68 [pid=10325]
  -----------------------------------------------;
ERROR: Data loading error [Amazon](500310) Invalid operation: S3ServiceException:The specified key does not exist.,Status 404,Error NoSuchKey,Rid 0F095DB7B73D7948,ExtRid A1MS6DMeTvlqAn8hCfLs9HqH2Pn0LN9hHql+iSe4k9sC+LKArkFf+oPobirRC0wZuMakt6tE4lQ=,CanRetry 1
Details: 
 -----------------------------------------------
  error:  S3ServiceException:The specified key does not exist.,Status 404,Error NoSuchKey,Rid 0F095DB7B73D7948,ExtRid A1MS6DMeTvlqAn8hCfLs9HqH2Pn0LN9hHql+iSe4k9sC+LKArkFf+oPobirRC0wZuMakt6tE4lQ=,CanRetry 1
  code:      8001
  context:   S3 key being read : s3://XXXXX/shredded/good/run=2017-09-27-22-00-18/atomic-events/part-00061.gz
  query:     208888
  location:  table_s3_scanner.cpp:352
  process:   query3_68 [pid=10325]
  -----------------------------------------------;
Following steps completed: [Discover]
INFO: Logs successfully dumped to S3 [s3://XXXXX/log/rdb-loader/2017-09-27-23-00-18/16dc63e6-6720-43d1-bbd9-097c06dffeec]
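
In case it is useful to anyone debugging the same thing, here is roughly how I have been checking whether the keys are actually visible at load time. This is just a quick boto3 sketch; the bucket name and run prefix are placeholders standing in for my own values:

import boto3
from botocore.exceptions import ClientError

s3 = boto3.client("s3")

bucket = "XXXXX"  # placeholder, same bucket as in the error above
prefix = "shredded/good/run=2017-09-27-22-00-18/atomic-events/"

# List what is currently visible under the run prefix (pagination omitted for brevity)
resp = s3.list_objects_v2(Bucket=bucket, Prefix=prefix)
keys = [obj["Key"] for obj in resp.get("Contents", [])]
print(len(keys), "keys visible under", prefix)

# HEAD the specific part file Redshift complained about; an error here means
# the key is not readable yet (or no longer exists)
try:
    s3.head_object(Bucket=bucket, Key=prefix + "part-00061.gz")
    print("part-00061.gz is readable")
except ClientError as e:
    print("part-00061.gz not readable:", e.response["Error"]["Code"])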

#2

Hello @neekipatel,

I believe this error happens due to an invalid Role ARN. It must look like arn:aws:iam::719197435995:role/RedshiftLoadRole, and it also must have the AmazonS3ReadOnlyAccess permission:

Then you need to choose Amazon Redshift -> AmazonS3ReadOnlyAccess, then choose a role name, for example RedshiftLoadRole. Once created, copy the Role ARN as you will need it in the next section.
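
If you prefer scripting this rather than using the console, it is roughly the following with boto3. The role name is only an example, the trust policy may differ in your setup, and you would still need to associate the role with your Redshift cluster afterwards:

import json
import boto3

iam = boto3.client("iam")

# Trust policy letting Redshift assume the role
trust_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"Service": "redshift.amazonaws.com"},
        "Action": "sts:AssumeRole",
    }],
}

role = iam.create_role(
    RoleName="RedshiftLoadRole",  # example name
    AssumeRolePolicyDocument=json.dumps(trust_policy),
)

iam.attach_role_policy(
    RoleName="RedshiftLoadRole",
    PolicyArn="arn:aws:iam::aws:policy/AmazonS3ReadOnlyAccess",
)

# This is the Role ARN that goes into the Redshift storage target configuration
print(role["Role"]["Arn"])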


#3

Hi @anton,

Thank you for your help. I double-checked the permissions and they seem to be set properly. If they weren’t, wouldn’t it always fail instead of failing intermittently? On occasions when the error does occur, I re-run snowplow-emr-etl-runner with --resume-from="rdb_load" and everything works out fine.


#4

Hi @neekipatel,

Sorry, you’re totally right; I must have misread that it fails intermittently.

In that case, I believe it happens due to the notorious S3 eventual consistency issue. What’s the typical number of files you’re loading (both in atomic-events and shredded)?

The problem is that when you have too many files, the discovery logic can give you a wrong list of files, where some of the files are basically ghosts from a previous load. S3 will become consistent, but only “eventually”, not right away. Meanwhile Redshift tries to load these ghost files and (correctly) fails.

We added some logic to RDB Loader to check and wait for some time, but unfortunately there’s no silver bullet against eventual consistency; in the end we have to wait.
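
Just to illustrate the idea (this is only a sketch, not the actual RDB Loader code, which is Scala), the check-and-wait approach amounts to something like:

import time
import boto3
from botocore.exceptions import ClientError

s3 = boto3.client("s3")

def wait_until_visible(bucket, keys, timeout=300, interval=15):
    """Return True once every key answers a HEAD request, False if the deadline passes."""
    deadline = time.time() + timeout
    pending = set(keys)
    while pending and time.time() < deadline:
        for key in list(pending):
            try:
                s3.head_object(Bucket=bucket, Key=key)
                pending.discard(key)  # key is visible now, stop checking it
            except ClientError:
                pass  # still a "ghost" from the listing, check again later
        if pending:
            time.sleep(interval)
    return not pending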


#5

Just wanted to report back: after some more investigation we found the issue was related to S3 versioning. Since we turned off S3 versioning, the issue hasn’t occurred in the last 3 days.
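
For reference, the change itself was basically suspending versioning on the bucket, something along these lines with boto3 (the bucket name is a placeholder; note that once versioning has been enabled it can only be suspended, not removed entirely):

import boto3

s3 = boto3.client("s3")
s3.put_bucket_versioning(
    Bucket="XXXXX",  # placeholder bucket name
    VersioningConfiguration={"Status": "Suspended"},
)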


#6

Thanks for sharing, @neekipatel!