Transient rdbloader error: [Amazon](60000) Error setting/closing connection


#1

Yesterday, the rdbloader step of our batch pipeline (R92) failed after data discovery and logged the following error message:

ERROR: Data loading error Amazon Error setting/closing connection: Connection reset by peer.
Following steps completed: [Discover]

I was able to load the shredded data by re-running with the --resume_from rdbloader option, but I want to see if there’s something I should be doing to prevent this from happening again.

We’re currently running R92 (i.e., rdb_shredder 0.12.0 and rdb_loader 0.13.0). I’ve seen a few references (linked below) to the 60000 error, but am unsure if those are what caused the above error.

We upgraded to R92 several weeks ago, and things have been working well. In addition, the failure we saw yesterday was on a small, weekend volume of data compared to what we normally process for a regular business day.

Is there a way I can definitively tell whether or not our error is related to one of the following?

1. https://github.com/snowplow/snowplow-rdb-loader/issues/73
2. https://github.com/snowplow/snowplow-rdb-loader/issues/85

Thanks for your help!


#3

Hello @bsweger,

We did indeed have an issue with RDB Loader’s connection that resulted in a similar error. It was caused by the NAT Gateway, which makes RST and FIN packets misbehave, and it was solved by an EMR bootstrap script. However, you mentioned that the volume was quite small in your case, whereas the problem I’ve described happens only when loading takes more than ~10 minutes, so I’m surprised you didn’t encounter it before. Do you remember making any changes to your infrastructure recently?


#4

Hi @anton,

TL;DR: More poking around confirms that you’re right about the NAT Gateway. Thank you! I can’t quite follow the GitHub threads, though. Is there a specific release we should upgrade to for this fix?

More info for anyone else who might find this thread…

We’ve had no recent infrastructure changes.

I was skeptical that we hit that NAT Gateway issue on a low-volume day after running R92 successfully for several weeks, so I checked our Redshift logs (STL_QUERY):

  • The total elapsed time of all the COPY statements on normal-volume runs is between 5 and 8 minutes
  • The atomic.events load took ~ 11 min on the day we got the RDB Loader error (not sure why, since it only took 40 seconds on the re-run, but that’s a different issue)

Thanks again for confirming this.


#5

Hey @bsweger,

I’m glad we’ve figured out what happened!

Sorry for the mess in the GH threads; most of the conversation did indeed happen internally. The quickest workaround for this problem is to add s3://snowplow-hosted-assets/common/emr/snowplow-ami5-bootstrap-0.1.0.sh to aws.emr.bootstrap (bear in mind that this is an array, not a string) in your config.yml. Another option is to update EmrEtlRunner to R102, which runs this script by default.
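
As a rough sketch, the relevant fragment of config.yml might look like the following. It assumes the usual EmrEtlRunner layout, where bootstrap is nested under aws and then emr. All other keys are omitted here and should stay as they are in your file:

```yaml
# Fragment only: assumes the usual EmrEtlRunner nesting (aws -> emr -> bootstrap).
# Leave every other setting in your config.yml unchanged.
aws:
  emr:
    # bootstrap is an array, so the script goes in as a list item, not a bare string
    bootstrap:
      - s3://snowplow-hosted-assets/common/emr/snowplow-ami5-bootstrap-0.1.0.sh
```

If you already have other bootstrap actions listed, this script would simply be added as one more item in the same list.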

Hope this helps.