Snowplow Mini only storing 7 days of data in Elasticsearch?


#1

Hi,

I’m currently testing Snowplow Mini in preparation for a production deployment of Snowplow. An issue I just encountered: the good index seems to only hold 7 days of data — around 316k–456k+ docs. Do you know what is causing this? I.e. does Snowplow Mini have a setting/tool that clears the ES index except for the most recent 7 days? Additionally, where is the rest of the data being stored? I don’t see it in Postgres.

Thanks in advance!


#2

By default both the good and bad indices in Snowplow Mini have a time to live (TTL) of 7 days. This TTL is a setting of Elasticsearch that evicts documents after a specified amount of time.

To retrieve the currently set TTL for the good index in Snowplow Mini you can use

curl -XGET "http://{IP}:9200/good/good/_mapping/field/_ttl"

where {IP} is the IP address or hostname of your instance. You can also swap out good above for bad to target the bad index. To change the TTL from 7 days to 14 days, you could run the following:

curl -XPUT "http://{IP}:9200/good/good/_mapping" -d '
{
  "good": {
    "_ttl": {
      "enabled": true,
      "default": "14d"
    }
  }
}'

This TTL change will only apply to documents indexed after the operation — documents indexed before it will retain their original TTL.
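As a side note, Elasticsearch 1.x (which Snowplow Mini ships with) also lets you set a TTL per document at index time via the `ttl` query parameter, overriding the mapping default. A minimal sketch — the document ID and body here are hypothetical, and {IP} is a placeholder as above:

```shell
# Index a single document into good/good with a 30-day TTL, overriding the
# 7-day default set in the mapping (requires _ttl to be enabled, which it
# is by default in Snowplow Mini's Elasticsearch 1.x).
curl -XPUT "http://{IP}:9200/good/good/example-doc-id?ttl=30d" -d '
{
  "app_id": "test"
}'
```

This only helps for documents you index yourself; the Snowplow Mini pipeline will keep using the mapping default for the events it writes.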


#3

Hi Mike,

Thanks for the reply.

  1. The old docs are completely flushed, I assume? No way to retrieve these?
  2. TTL is deprecated in ES 2+ — does this affect migrating to the recommended production setup of Snowplow?
  3. Data isn’t being stored anywhere else?
  4. In prod data is also pushed to Redshift/pg, correct?

Thanks


#4
  1. As far as I know there’s no way to recover documents that have already been flushed.
  2. TTL should still work in 2.x - I’m not too sure about 5.x however.
  3. No - Elasticsearch is the only configured sink for Snowplow Mini (and uses pipes to join the different processes).
  4. Yes - in a production environment data can be configured to send to S3/Kinesis/Kafka/Redshift/Postgres etc. Snowplow has great durability and makes it quite difficult to lose data if configured correctly.