We’re testing the upgrade from R80 to R93.
Our environment runs in an AWS VPC and the EmrEtlRunner is used as the collector; the Redshift cluster has SSL enabled in its parameter group (require_ssl=true). The EMR instances and the Redshift cluster reside within the same VPC subnet.
The EMR job fails when trying to load data into Redshift. From what I understand from the attached log, this is because it cannot establish a connection to the cluster.
I’ve done the following setup as part of the upgrade process:
- added an inbound rule to the Redshift security group that allows connections from the SG of the master EMR node to the Redshift port.
- created an IAM role that has read-only access to S3 and assigned it to the Redshift cluster.
Trying to troubleshoot the problem I did the following:
- placed the events back in the “in” bucket and started the EMR job all over again, from the “staging” phase. While it was performing the initial steps, I logged in to the master instance. From there I ran psql with the same parameters as in the redshift.json file:
psql -h <redshift_cluster>.<my_region>.redshift.amazonaws.com -U <user> -d <db> -p <port>
-> SSL connection was established and I was able to query it:
<db>=# select * from atomic.manifest;
etl_tstamp | commit_tstamp | event_count | shredded_cardinality
I then ran tcpdump on the Redshift port (sudo tcpdump dst port <port> -w tcpdump.log), but nothing was captured, even though the rdb_load step took 1 min to fail. I couldn’t debug it further because the server was terminated afterwards.
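The psql check above can also be scripted, so it could be run from whichever node the load step actually executes on. Below is a minimal sketch, assuming Redshift answers the standard PostgreSQL SSLRequest message the same way it does for psql. The function name probe_ssl and the 5-second timeout are my own choices for illustration, not anything taken from the loader.

```python
import socket
import struct

# PostgreSQL wire protocol SSLRequest: int32 message length (8)
# followed by the int32 request code 80877103.
SSL_REQUEST = struct.pack("!II", 8, 80877103)


def probe_ssl(host, port, timeout=5.0):
    """Open a TCP connection, send SSLRequest and return the 1-byte reply.

    b"S" means the server is willing to negotiate SSL, b"N" means it is not.
    An OSError (for example a timeout) means the TCP connection itself
    failed, which would point at security groups or routing rather than SSL.
    """
    # probe_ssl is a hypothetical helper name; timeout value is an assumption.
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(SSL_REQUEST)
        return sock.recv(1)
```

Calling probe_ssl("<redshift_cluster>.<my_region>.redshift.amazonaws.com", <port>) only tells apart "cannot reach the port" from "port reachable, SSL offered or refused"; it does not complete the TLS handshake or authenticate.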
I then tried downgrading the rdb_loader version from 0.13.0 to 0.12.0 and resumed: same error.
Next, I disabled the “require_ssl” setting in the Redshift parameter group.
Finally, I resumed the EMR job from the rdb_load step (after setting the SSL mode to DISABLED in the redshift.json file); this time it succeeded:
I, [2017-10-10T20:32:24.615000 #14639] INFO -- : RDB Loader successfully completed following steps: [Discover, Load, Analyze]
D, [2017-10-10T20:32:24.616000 #14639] DEBUG -- : EMR jobflow j-XXXXXX completed successfully.
I, [2017-10-10T20:32:24.617000 #14639] INFO -- : Completed successfully
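For reference, the redshift.json change in that last step can be sketched as below. This assumes the target file is a self-describing JSON whose settings live under a "data" object with an "sslMode" key; both names are assumptions here, as is the helper name set_ssl_mode, and the exact spelling of the accepted mode values should be checked against the target's JSON schema.

```python
import json


def set_ssl_mode(config_text, mode):
    """Return the target config JSON with data.sslMode replaced by `mode`.

    Assumes a {"data": {"sslMode": ...}} layout; every other key is
    passed through unchanged.
    """
    config = json.loads(config_text)
    config["data"]["sslMode"] = mode
    return json.dumps(config, indent=4)


# Trimmed, hypothetical config used only to illustrate the call.
sample = '{"data": {"host": "<redshift_cluster>", "sslMode": "REQUIRE"}}'
# "DISABLE" is an assumed spelling of the mode value.
print(set_ssl_mode(sample, "DISABLE"))
```

Keeping the change in a small function like this makes it easy to flip the mode back once require_ssl is re-enabled on the parameter group.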
I’ve attached the redshift.json & the global config.yml.
Any idea what the problem might be and how to solve it? Are there any further debugging steps I could try?
BTW, probably unrelated but worth mentioning: I’m using a test Redshift DB that was launched from a snapshot of the production cluster and uses an identical Redshift configuration (security group, subnet, parameter group, etc.).
Thanks a lot for your help!