Provisioning for Elasticsearch


#1

Hey all, I’ve been looking to add the (Good) Kinesis Elasticsearch Sink to my real-time pipeline. I’m relatively new to ES and have been digging into the docs to better understand it, but I’m still struggling with how best to set up the Elasticsearch cluster for production Snowplow RT. I was hoping someone could shed some light on the following:

  1. Is ES intended to be used as a long-term persistent data store in the RT pipeline? I would imagine the indices get very large for companies sending millions of events per day, so I’m curious whether a delete process is used to drop data after some amount of time (while batch loading to S3/Redshift runs in parallel for more permanent storage).

  2. If ES is intended to be a long-term analytics data store, do Snowplow users with ever-increasing event volumes over-provision their cluster in terms of both nodes and shards?

  3. What is a reasonable size and number of nodes (and shards) for an ES cluster for a pipeline with, let’s say, half a million enriched events per day, growing to 1 million per day a year from now?


#2

Great questions @travisdevitt

  1. Most implementations I’ve worked on don’t use Elasticsearch as a persistent data store; rather, it’s often used (in conjunction with Kibana) as a tool for building simple but useful real-time dashboards. You’re right in thinking that data volumes can get quite large if you’re storing a longer history and sending millions of events a day. The typical way to drop events from an index over time is to use the TTL setting, which automatically evicts documents older than a maximum age.
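     For reference, on the ES 1.x versions this applies to (the `_ttl` field was deprecated in 2.0 and removed in later releases, where time-based indices dropped by a scheduled job are the usual replacement), enabling TTL on a mapping looks roughly like this; the index and type names here are hypothetical:

     ```shell
     # Hypothetical index/type names; assumes an ES 1.x cluster on localhost:9200.
     # Enables TTL so documents are evicted automatically after 30 days.
     curl -XPUT 'http://localhost:9200/snowplow/_mapping/enriched_event' -d '{
       "enriched_event": {
         "_ttl": { "enabled": true, "default": "30d" }
       }
     }'
     ```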

  2. It’s not a terrible idea to over-provision slightly anyway, to account for things like spikes in traffic which may cause backpressure. Over-provisioning storage isn’t often required if you’re using TTL, as it’s possible to estimate the data volume for a given retention period reasonably accurately.
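     As a back-of-the-envelope sketch of that estimate (the per-event size and replica count here are assumptions, not measured figures):

     ```shell
     # Rough retention sizing sketch. All figures are assumptions:
     # ~2 KiB per enriched event, 30-day TTL, 1 replica (2 copies of each doc).
     EVENTS_PER_DAY=500000
     BYTES_PER_EVENT=2048
     RETENTION_DAYS=30
     COPIES=2   # primary shard + 1 replica

     TOTAL_BYTES=$(( EVENTS_PER_DAY * BYTES_PER_EVENT * RETENTION_DAYS * COPIES ))
     echo "Approx. steady-state index size: $(( TOTAL_BYTES / 1024 / 1024 / 1024 )) GiB"
     ```

     Under those assumptions the cluster settles at roughly 57 GiB of storage, and the estimate scales linearly, so doubling event volume simply doubles it.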

  3. For a production setting you generally want a minimum of 2 nodes for failover; 3 or more is better if possible, because of the way ES elects a new master based on quorum size after losing a master node. I haven’t really bothered tuning the sharding for these indexes and have left it at the ES default.
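     On that quorum point: for self-managed clusters of this ES generation, the split-brain safeguard is the `discovery.zen.minimum_master_nodes` setting, which should be a majority of the master-eligible nodes. A sketch of the relevant `elasticsearch.yml` line, assuming a 3-node cluster:

     ```yaml
     # elasticsearch.yml on each node -- assumes 3 master-eligible nodes.
     # Require a majority (floor(3/2) + 1 = 2) before electing a master,
     # which prevents split-brain if one node is partitioned away.
     discovery.zen.minimum_master_nodes: 2
     ```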

I’d thoroughly recommend using the AWS Elasticsearch Service if it’s an option, as it abstracts away a lot of the headaches of managing an Elasticsearch cluster in production. You can provision a cluster with a flexible number of nodes, specify the instance type and attach additional (SSD) storage if required. For Elasticsearch clusters the medium instances tend to provide a good balance of CPU and RAM, so depending on your data volumes you can probably get away with m3.mediums. Note that if you run into issues running the cluster, the service also makes it possible to resize it and change instance types later.
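As a sketch, provisioning such a cluster via the AWS CLI might look like the following; the domain name, node count and storage size are placeholder values, so check `aws es create-elasticsearch-domain help` for the options your CLI version supports:

```shell
# Hypothetical values throughout -- adjust to your event volumes.
aws es create-elasticsearch-domain \
  --domain-name snowplow-rt \
  --elasticsearch-cluster-config InstanceType=m3.medium.elasticsearch,InstanceCount=3 \
  --ebs-options EBSEnabled=true,VolumeType=gp2,VolumeSize=100
```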