I’m currently testing snowplow-mini in preparation for a production deployment of snowplow. An issue I just encountered: the good index seems to only hold 7 days of data? This is around 316k-456k+ docs. Do you know what is causing this? I.e. does snowplow{-mini} have a setting/tool that clears the ES index except for the most recent 7 days? Additionally, where is the rest of the data being stored? I don’t see it in pg?
By default both the good and bad indices in Snowplow Mini have a time to live (TTL) of 7 days. This TTL is an Elasticsearch setting that evicts documents after a specified amount of time.
To retrieve the currently set TTL in Snowplow Mini for the good index you can use the following:
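A sketch, assuming Elasticsearch is listening on its default port 9200; the TTL appears under the `_ttl` key of each type's mapping:

```bash
# Fetch the mapping (including any _ttl setting) for the good index
curl -XGET 'http://{IP}:9200/good/_mapping?pretty'
```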
where {IP} is the IP address or host of your instance. You can also swap good for bad to inspect the bad index. To extend this TTL from 7 days to 14 days you could perform the following:
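A sketch of the corresponding put-mapping call for Elasticsearch 1.x/2.x, where the `_ttl` meta-field is still available; it assumes the mapping type inside the good index is also named good (swap in your own type name if it differs):

```bash
# Update the default TTL on the good/good mapping to 14 days
curl -XPUT 'http://{IP}:9200/good/_mapping/good' -d '
{
  "good": {
    "_ttl": {
      "enabled": true,
      "default": "14d"
    }
  }
}'
```

Note that a changed default only applies to documents indexed after the update; documents already in the index keep the TTL they were given when they were indexed.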
As far as I know there’s no way to recover documents that have already been evicted.
TTL should still work in 2.x (though the _ttl field is deprecated there); it was removed entirely in 5.x.
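If you’re unsure which version your instance is running, the Elasticsearch root endpoint reports it:

```bash
# The version is listed under version.number in the response
curl -XGET 'http://{IP}:9200/?pretty'
```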
No - Elasticsearch is the only configured sink for Snowplow Mini (and uses pipes to join the different processes).
Yes - in a production environment data can be configured to be sent to S3/Kinesis/Kafka/Redshift/Postgres etc. Snowplow has great durability and makes it quite difficult to lose data if configured correctly.