We’re in the process of building infrastructure that will allow Snowplow data to be used in other applications (e.g. a Python-based algorithm to predict the number of service providers interested in a given service). We currently use Redshift as a data warehouse, copying our transactional MySQL databases into it so they can be analyzed alongside web-analytics events.
We have basically two use cases:
1. When data could be outdated by a few days
In simple cases, where we can load historical data to train a model, the data could be a few days stale without significant impact. We could connect to Redshift, load the data into memory, and work from there rather than keep a connection open, since Redshift has a small-ish limit on concurrent connections. Is there a better option?
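To make the batch case concrete, here’s a rough sketch of what we have in mind (the column list and the `days_back` window are placeholders, not our actual schema; `atomic.events` is the standard Snowplow events table in Redshift):

```python
from datetime import date, timedelta


def snapshot_query(days_back: int = 90) -> str:
    """Build a query for a bounded window of history to train on.

    A few days of staleness is acceptable here, so we only need an
    occasional snapshot.  The column list is a placeholder for whatever
    features the model actually needs.
    """
    cutoff = date.today() - timedelta(days=days_back)
    return (
        "SELECT domain_userid, event_name, collector_tstamp "
        "FROM atomic.events "
        f"WHERE collector_tstamp >= '{cutoff.isoformat()}'"
    )


def load_training_data(dsn: str, days_back: int = 90):
    """Open a short-lived connection, pull the snapshot, and disconnect,
    so we don't hold one of Redshift's limited concurrency slots longer
    than needed."""
    import psycopg2  # Redshift speaks the Postgres wire protocol

    with psycopg2.connect(dsn) as conn:
        with conn.cursor() as cur:
            cur.execute(snapshot_query(days_back))
            return cur.fetchall()
```

The idea would be to run something like this on a schedule, cache the result in memory (or on local disk), and let the model work against that copy instead of hitting Redshift directly.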
2. When data must be fresh
In more complex cases, we need data that is no more than a few seconds old. With the Snowplow real-time pipeline implemented, Elasticsearch would be an obvious choice here. But the ES cluster would then need to store massive amounts of data (including all historical data) - is that the way to go? What other options should we consider?
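For the real-time case, this is roughly how we imagine querying ES for the freshest events (the index name and host are placeholders, and we're assuming the enriched events carry a `collector_tstamp` field, as in the Snowplow canonical event model):

```python
from datetime import datetime, timedelta, timezone


def recent_events_query(seconds: int = 30) -> dict:
    """Elasticsearch query body for events from the last few seconds."""
    since = datetime.now(timezone.utc) - timedelta(seconds=seconds)
    return {
        "query": {
            "range": {
                "collector_tstamp": {"gte": since.isoformat()}
            }
        },
        "sort": [{"collector_tstamp": {"order": "desc"}}],
    }


def fetch_recent(es_host: str = "http://localhost:9200",
                 index: str = "snowplow-enriched",  # placeholder name
                 seconds: int = 30):
    """Pull the most recent events from the real-time pipeline's index."""
    from elasticsearch import Elasticsearch  # official ES client

    es = Elasticsearch(es_host)
    resp = es.search(index=index, body=recent_events_query(seconds))
    return [hit["_source"] for hit in resp["hits"]["hits"]]
```

One variant we've considered: keep only a rolling window (days, not years) in ES for the fresh queries, and leave the full history in Redshift - which would sidestep the "ES must hold everything" problem, but means stitching two stores together in application code.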
I’d really appreciate it if you could share your experience on this subject.