Hey all, I’ve been looking to add the (Good) Kinesis Elasticsearch Sink to my real-time pipeline. I’m relatively new to ES and have been digging into the docs to better understand it, but I’m still struggling with how best to set up the Elasticsearch cluster for production Snowplow RT. I was hoping someone could shed light on the following:
Is ES intended to be used as a long-term persistent data store in the RT pipeline? I would imagine the indices get very large for companies sending millions of events per day, so I’m curious whether a delete process is typically used to drop data after some retention period (while the batch pipeline keeps loading S3/Redshift in parallel for more permanent storage).
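To make the delete-process part of that concrete, here’s a minimal sketch of the kind of retention job I have in mind, using the elasticsearch-py client. The endpoint, the `snowplow-YYYY.MM.DD` daily index naming, and the 30-day window are all placeholders I made up, not anything from the Snowplow docs:

```python
# Minimal sketch (assumptions: elasticsearch-py client, daily indices named
# like "snowplow-2016.01.01", 30-day retention): delete any index whose date
# suffix falls outside the retention window.
from datetime import datetime, timedelta
from elasticsearch import Elasticsearch

RETENTION_DAYS = 30  # placeholder retention window
es = Elasticsearch("http://localhost:9200")  # placeholder endpoint

cutoff = datetime.utcnow() - timedelta(days=RETENTION_DAYS)
for index in es.indices.get(index="snowplow-*"):
    # Parse the date out of the index name; skip anything that doesn't match.
    try:
        day = datetime.strptime(index.split("snowplow-", 1)[1], "%Y.%m.%d")
    except (IndexError, ValueError):
        continue
    if day < cutoff:
        es.indices.delete(index=index)
```

Dropping whole time-based indices like this is much cheaper than deleting individual documents, which is why I’m assuming daily indices here.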
If ES is intended to be a long-term analytics data store, do Snowplow users with ever-increasing event volumes over-provision their cluster in terms of both nodes and shards?
What is a reasonable size and number of nodes (and shards) for an ES cluster for a pipeline with, let’s say, half a million enriched events per day, growing to 1 million per day a year from now?
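And to make the shard part of questions 2 and 3 concrete, this is the kind of index-creation setting I’m asking about (same elasticsearch-py assumption as above; the shard and replica counts are placeholders, not a recommendation):

```python
# Minimal sketch: creating a daily index with explicit shard/replica counts.
# The values are placeholders; picking them sensibly is exactly my question.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # placeholder endpoint

es.indices.create(
    index="snowplow-2016.01.01",  # placeholder daily index name
    body={
        "settings": {
            "number_of_shards": 3,    # fixed at creation; reindex to change
            "number_of_replicas": 1,  # can be changed later on a live index
        }
    },
)
```

My understanding is that `number_of_shards` can’t be changed on an existing index without reindexing (replicas can), which is why I’m wondering whether people over-provision shards up front to absorb growth.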