Choice of Redshift instances

Hi all!

I can imagine most of the production Snowplow / Redshift users here have had similar thoughts about this topic, so I thought I’d see if I can learn from anyone’s experience.

We currently run Redshift on a 4-node dc1.large cluster. We are bound both by disk space (so we archive fairly aggressively) and by CPU (heavy queries will max out the CPU on all instances).

I think we will need about 3-4 times more disk space than we have now and some more CPU capacity.
The thing is, given the options and the cost of the instances, I’m not sure if we’re better off moving to, say, 12-16 dc1.large instances, 3-4 ds2.xlarge, or even a single dc1.8xlarge, all of which would be approximately the same in terms of pricing. I know Redshift is meant to be linearly scalable, but there’s always a tradeoff between having more, weaker machines and having fewer, stronger machines.
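For sizing, the current per-node usage can be pulled straight from Redshift’s system tables. A rough sketch in Python with psycopg2 — the connection details are placeholders, and note that STV_PARTITIONS is only visible to superusers:

```python
import psycopg2

# Placeholder connection details - point this at your own cluster endpoint.
conn = psycopg2.connect(
    host="my-cluster.xxxxxxxx.eu-west-1.redshift.amazonaws.com",
    port=5439, dbname="snowplow", user="admin", password="...",
)

# STV_PARTITIONS reports used/capacity in 1 MB blocks, one row per disk partition,
# so summing per node shows how close each node is to running out of space.
DISK_SQL = """
    SELECT owner AS node,
           SUM(used)     / 1024 AS used_gb,
           SUM(capacity) / 1024 AS capacity_gb
    FROM stv_partitions
    GROUP BY owner
    ORDER BY owner;
"""

with conn.cursor() as cur:
    cur.execute(DISK_SQL)
    for node, used_gb, capacity_gb in cur.fetchall():
        print("node {0}: {1:.0f} GB used of {2:.0f} GB".format(node, used_gb, capacity_gb))
```

Multiplying the current used figure by the 3-4x growth we expect gives a rough target capacity to hold against the node-type options.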

Can anyone advise based on past experience?

Thanks in advance

Hey Daniel,

I would think that 12-16 dc1.large nodes would help with S3 parallelization, giving you great performance for your COPY commands, but you’d lose some disk space per node.

12 × dc1.large = 24 vCPU, 180 GB RAM, ~2 TB of disk, 2.4 GB/s I/O, ~$3-4 per hour
4 × ds2.xlarge = 16 vCPU, 124 GB RAM, 8 TB of disk, 1.6 GB/s I/O, ~$5 per hour
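(Those totals are just the per-node specs scaled up, so it’s easy to script the comparison for other node counts. A throwaway sketch with approximate per-node figures; prices are deliberately left out since they vary by region and reserved-instance terms:)

```python
# Approximate per-node specs for the two node types being compared.
NODE_SPECS = {
    "dc1.large":  {"vcpu": 2, "ram_gb": 15, "disk_gb": 160,  "io_gb_s": 0.20},
    "ds2.xlarge": {"vcpu": 4, "ram_gb": 31, "disk_gb": 2000, "io_gb_s": 0.40},
}

def cluster_totals(node_type, node_count):
    """Scale the per-node specs up to a whole cluster."""
    per_node = NODE_SPECS[node_type]
    return {spec: round(value * node_count, 2) for spec, value in per_node.items()}

print("12 x dc1.large :", cluster_totals("dc1.large", 12))
print(" 4 x ds2.xlarge:", cluster_totals("ds2.xlarge", 4))
```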

Also - TripAdvisor gave a talk at re:Invent last year (BDT401) about tuning and best practices where they compared 32 dc1.large nodes vs 8 ds1.8xlarge (not sure why not ds2, maybe it wasn’t available yet).

BDT401 - slide 41

Summary:
ds1 = 4-hour job, dc1 = 40 minutes

Based on those numbers, I would lean towards a cluster of more, smaller dc1.large nodes for performance, especially if you already have aggressive archiving in place.

That’s very useful information, thanks for that!

I also did those calculations. My gut feeling is that there’s a cost, which is hard to account for, of moving data between nodes when, for example, the DIST keys of joined tables don’t match.
This is also where throughput comes into play: dc1.large has 0.20 GB/s of “I/O” per node, and that per-node figure doesn’t grow as you add more nodes. In a way that’s a limiting factor when scaling a cluster.
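One way to see whether that redistribution cost is actually biting is to check the EXPLAIN output of the heavy queries for DS_BCAST_* / DS_DIST_* steps. A rough sketch with psycopg2, where `conn` is assumed to be an open connection to the cluster:

```python
import psycopg2  # `conn` below is assumed to be an open psycopg2 connection

# Plan attributes that indicate rows being broadcast or redistributed between nodes;
# DS_DIST_NONE / DS_DIST_ALL_NONE in a plan mean no data movement was needed.
REDISTRIBUTION_MARKERS = ("DS_BCAST_INNER", "DS_DIST_INNER", "DS_DIST_OUTER", "DS_DIST_BOTH")

def redistribution_steps(conn, query):
    """Return the EXPLAIN plan lines for `query` that involve inter-node data movement."""
    with conn.cursor() as cur:
        cur.execute("EXPLAIN " + query)
        plan_lines = [row[0] for row in cur.fetchall()]
    return [line for line in plan_lines
            if any(marker in line for marker in REDISTRIBUTION_MARKERS)]

# Hypothetical usage - an empty result means the join keys line up with the DIST keys
# and no rows get shipped between nodes for this query:
# print(redistribution_steps(conn, "SELECT ... FROM atomic.events e JOIN ... ON ..."))
```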

Just to make sure - I’m not trying to contradict your advice in any way. Just adding some more thoughts to the mix.

Cheers

No worries - I didn’t realize/think about the per-node I/O not scaling, which does add a bit to the mix, given that ds2 has double the I/O per node.

Both good points, interested to see if others here have more insight - and maybe we can adjust our cluster too!

Hi Daniel,

I’ve seen various ways of handling this, but it really depends on how your current implementation works, so there are a few options here.

A few questions first up.

  1. What sort of queries are maxing out the CPUs on the cluster currently? (e.g., are they mostly ad hoc or can you cache partial results?)
  2. Is the execution load balanced across all 8 vCPUs for these queries?
  3. Are the column compression encodings the same as the ones in the Snowplow table definitions? (a quick way to check is sketched below)
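On point 3, the applied encodings can be pulled from PG_TABLE_DEF and diffed against the Snowplow table definitions. A small sketch, assuming the standard atomic schema and an open psycopg2 connection:

```python
import psycopg2  # `conn` is assumed to be an open psycopg2 connection

def column_encodings(conn, table, schema="atomic"):
    """Return (column, type, encoding) rows for `table`, to compare against the Snowplow DDL.

    PG_TABLE_DEF only shows tables whose schema is on the search_path, hence the SET below.
    """
    with conn.cursor() as cur:
        cur.execute("SET search_path TO " + schema)  # schema name, not user input
        cur.execute(
            'SELECT "column", type, encoding FROM pg_table_def '
            "WHERE schemaname = %s AND tablename = %s",
            (schema, table),
        )
        return cur.fetchall()

# e.g. for the main events table:
# for col, coltype, enc in column_encodings(conn, "events"):
#     print(col, coltype, enc)
```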

Depending on how much data you have in Redshift, it’s not uncommon to have another store (other than your primary query cluster) for colder storage (S3 or a ds2 cluster), plus hot storage for insertion and for queries that need more recent data - depending on the data size, that could be a dc1 cluster.
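If you do split hot and cold like that, UNLOAD to S3 is the usual way to push older data off the hot cluster (it can always be COPYed back in later). A minimal sketch - the bucket, IAM role and cutoff are placeholders, and the table/column names assume the standard Snowplow atomic.events schema:

```python
import psycopg2  # `conn` is assumed to be an open psycopg2 connection to the hot cluster

def archive_events(conn, cutoff_date, s3_prefix, iam_role):
    """UNLOAD events older than `cutoff_date` to S3 as gzipped files.

    `s3_prefix` (e.g. 's3://my-archive-bucket/events/') and `iam_role` are placeholders
    for your own bucket and a role with write access to it. Quotes inside the inner
    SELECT are doubled because UNLOAD takes it as a quoted string.
    """
    inner = "SELECT * FROM atomic.events WHERE collector_tstamp < ''{0}''".format(cutoff_date)
    unload = """
        UNLOAD ('{query}')
        TO '{prefix}'
        IAM_ROLE '{role}'
        GZIP ALLOWOVERWRITE;
    """.format(query=inner, prefix=s3_prefix, role=iam_role)
    with conn.cursor() as cur:
        cur.execute(unload)
    conn.commit()
    # After verifying the unloaded files, DELETE the same rows from the hot cluster
    # and run VACUUM / ANALYZE to reclaim the space.
```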