Raw data consists of messy, Thrift-formatted payloads. They're still in exactly the format the trackers sent them in, and the raw stream also contains junk data that later gets filtered out by the validation step (which happens in the enrich component).
Generally you should never need to work directly with raw data; doing so is a lot of work and pain.
The usual reason to load raw data to S3 is as a failsafe. If some drastic issue happens downstream, the raw data is sitting in S3 and can be reprocessed (note that reprocessing is not easy; it's a last resort).
Enriched (good) data has passed validation, so it contains only high-quality events. It also has information added to it during the enrichment process, and it's in TSV format.
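If you want a feel for that format, here's a minimal sketch of inspecting a few enriched events from a local TSV file. The file name is hypothetical, and the column indices (app_id, platform, collector_tstamp, event_id) are assumptions based on the canonical event model, so verify them against your pipeline version:

```python
# Minimal sketch: inspect a few enriched (good) events from a TSV file.
import csv

# Hypothetical local copy of an enriched file pulled down from S3.
ENRICHED_FILE = "enriched-events.tsv"

# Assumed column indices (canonical event model) -- verify for your version.
APP_ID, PLATFORM, COLLECTOR_TSTAMP, EVENT_ID = 0, 1, 3, 6

with open(ENRICHED_FILE, newline="") as f:
    reader = csv.reader(f, delimiter="\t")
    for row in reader:
        print(row[APP_ID], row[PLATFORM], row[COLLECTOR_TSTAMP], row[EVENT_ID])
```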
The enrichment process also produces bad data: events that failed validation. Normally, the bad data is loaded to S3 so that one can use tools like Athena to debug issues (for example, a tracking mistake where an int is sent as a string).
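As a rough illustration, here's a hedged sketch of running such a debugging query with boto3 and Athena. The database, table, column names, region, and results bucket are all assumptions; you'd first need to define a table over your bad-rows bucket (e.g. with a JSON SerDe) matching your bad-row format:

```python
# Minimal sketch: query bad rows in S3 with Athena via boto3.
import time
import boto3

athena = boto3.client("athena", region_name="us-east-1")  # assumed region

# Hypothetical database.table and columns -- adjust to your bad-row schema.
QUERY = """
SELECT errors, line
FROM snowplow.bad_rows
WHERE failure_tstamp > '2019-01-01'
LIMIT 10
"""

resp = athena.start_query_execution(
    QueryString=QUERY,
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},  # assumed bucket
)
qid = resp["QueryExecutionId"]

# Poll until the query finishes, then fetch the first page of results.
while True:
    state = athena.get_query_execution(QueryExecutionId=qid)["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

if state == "SUCCEEDED":
    for row in athena.get_query_results(QueryExecutionId=qid)["ResultSet"]["Rows"]:
        print([col.get("VarCharValue") for col in row["Data"]])
```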
Both the good and bad enriched data can be accessed via Elasticsearch, or directly from the streams. So if you never need to use the data in S3, don't care about having a backup copy in file storage, and aren't loading to Redshift or Snowflake, then there's no real reason to load to S3.
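For instance, assuming a Kinesis-based pipeline, a minimal sketch of reading enriched events straight off the stream might look like this (the stream name and region are assumptions, and for simplicity it reads one shard from the oldest record):

```python
# Minimal sketch: read enriched events from a Kinesis stream via boto3.
import boto3

kinesis = boto3.client("kinesis", region_name="us-east-1")  # assumed region

STREAM = "enriched-good"  # hypothetical stream name

# Read the first shard from its oldest available record.
shard_id = kinesis.describe_stream(StreamName=STREAM)["StreamDescription"]["Shards"][0]["ShardId"]
iterator = kinesis.get_shard_iterator(
    StreamName=STREAM,
    ShardId=shard_id,
    ShardIteratorType="TRIM_HORIZON",
)["ShardIterator"]

records = kinesis.get_records(ShardIterator=iterator, Limit=10)["Records"]
for record in records:
    # Enriched events arrive as tab-separated bytes.
    print(record["Data"].decode("utf-8").split("\t")[:5])
```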
If I were using Snowplow for the first time and my starting volumes were low, I'd probably start by loading both the enriched good and bad data to S3, just to give me a way to dig into the data directly. Then, after getting to grips with things, I'd turn off whatever I don't think I'll need to keep.