Snowplow Mini - multiple sinks for failover

Hi,

We are using the snowplow-mini version - without kinesis. I am wondering if it is possible to have multiple sinks from stdin (apart from elasticsearch) like S3 / hive. We are trying to address the data loss because of a failed elasticsearch system. Can you please suggest?

Thanks,
Manju

Hi @manju,

Snowplow Mini is designed for non-production use cases: trialling Snowplow, integration testing, iterative schema development. It is not intended for production use. Single-instance systems are not a good fit for production event pipelines.

If data loss in Snowplow Mini’s built-in Elasticsearch system is a concern, then that means you should be setting up a full Snowplow stack with horizontal scaling, failover support etc.

@alex,

Thanks a lot for responding. We do have a full snowplow stack as well. Wondering how do I provide failover support - for elasticsearch. The kinesis-elasticsearch sink configuration file provides placeholder for only one ip address. Is there a way to add multiple elasticsearch sinks simultaneously? Could you please suggest a fail-safe architechture as a best practice? Sorry for being so ignorant about this. Also we would like to avoid kinesis and get kafka / flume if possible…Thank you so much in advance.
Thanks,
Manju

The easiest way to do this is to add a single Elasticsearch endpoint for an Elasticsearch cluster. If you don’t have too much experience managing an ES cluster the easiest option is to spin up a multinode cluster using AWS Elasticsearch Service.

Hello @mike,

Sure, we are having a single elasticsearch endpoint for the ES cluster. But my concern is what if the endpoint fails. How do I persist data? Looking forward to your response.

Thanks,
Manju

Which part of the endpoint are you concerned is going to fail? ESS will replace unhealthy nodes automatically and if required it’s possible to use cross-zone replication in the event of a single zone failure.

In terms of persistence - most people who use Snowplow don’t tend to opt for long term persistence (depending on data volumes beyond 7-30 days) as data should be sinking to S3 where it can remain, if required, indefinitely.

Many Thanks @mike. May be it is a stupid question. Am talking about the ES end point we provide in kinesis-elasticsearch sink (kinesis-Es config file). If that server is down, then how will the data pass on from enrichment to elasticsearch cluster. Is there another option? Again thanks for helping.

Thanks,
Manju

What scenario are you concerned about that would cause the ES cluster to fail completely?

Hello @mike, Thanks for the response. I am worried if the server going down (the ipaddress mentioned in kinesis-elasticsearch sink configuration file). Then no data will go to cluster. Will the config file in kinesis-elasticsearch sink be able to route to a different ES node in case one node goes down?

Thanks,
Manju

Providing you are pointing the configuration at the ES endpoint (the address will end in es.amazonaws.com) then yes - you shouldn’t need to worry about individual node failures.

You should avoiding pointing to a single IP in the cluster (some of these can be listed using dig hostname.aws-region.es.amazonaws.com.

Thanks @mike,

@manju - if you are concerned about a total Elasticsearch cluster failure as well, then this ticket will be of interest too:

https://github.com/snowplow/snowplow/issues/2918

@alex, @mike,
Really appreciate your help and attention. Thank you so much for this.
Is there an option to have failover endpoints(to replace the one below in the config file) if we want to manage our own ES cluster without AWS? Currently we are having our own ES cluster not from AWS.

(kinesis-elasticsearch-sink.config):
elasticsearch.client.type : The type of client to use (http or transport)
elasticsearch.client.endpoint : The Elasticsearch cluster endpoint
elasticsearch.client.port : The Elasticsearch cluster port
elasticsearch.client.max-timeout : The Elasticesarch maximum timeout in milliseconds
elasticsearch.client.http.conn-timeout : The connection timeout for the HTTP client
elasticsearch.client.http.read-timeout : The read timeout for the HTTP client

Can s3 sink be run parallely?Having data in s3 sink - does that mean I have to kinesis streams only? Again, thanks for your wonderful support.

Thanks,
Manju

There’s no option in the configuration file to have multiple Elasticsearch endpoints at the moment - it’s assumed that you’ll only need to sink to cluster.

If you’re running your own AWS cluster in production you’ll want to consider how you scale when things go wrong that ESS takes care of for you (loss of a single node, snapshots, restoration of a node, replication between nodes).

Yes - you can run the S3 sink or another application that consumes Kinesis directly in parallel. Depending on how many consumers you are running you’ll need to consider the Kinesis limits particularly around read rates (2 MB/s/shard) which can be modified by increasing the number of shards in a stream (splitting).

Hello @mike,

Thank you so much for responding. It certainly helps. By any chance is it possible to have flume / kafka in place of kinesis streams? Thanks again so much for clarifying all this.

Thanks,
Manju

Release 85 of Snowplow introduced some beta support for Kafka to partially replace Kinesis. If you search the main repository for Kafka issues you should get an overview of what’s been completed and what’s planned.

As for Flume - I can’t speak for the Snowplow team here - but I can’t imagine first class citizen support is a priority. Flume overlaps significantly with existing parts of the Snowplow pipeline and doesn’t really have nearly as much community support behind it. That said, if you really wanted to use Flume you may (?) be able to use it by dumping events to stdout/disk and reading them in.

@mike is right - Apache Flume support is not on our roadmap; it’s something of a legacy tech at this point and would be substitutive for Snowplow rather than complementary.

Kafka support on the other hand is something we are taking very seriously!