ok, thanks for that additional detail.
If you’re just playing around a bit, and not giving anybody direct SQL access to your data, you’re fine.
But if you increase your activity on the cluster, maybe even a bit more mission-critical (e.g. embedded charts, or a scheduled report), then there’s a bit of upfront work you can do now, and it will help you in the long run.
I wrote that up in 3 Things to Avoid When Setting Up an Amazon Redshift Cluster
- give your ETL pipeline a separate user, e.g. ‘snowplow_etl’
- add a “load” queue in your WLM, assign your ETL user to that queue
- create a separate schema for your data loads
By following that set-up, you’ll have more visibility into your ETL, which will make isolating any errors / exceptions straightforward. By having a separate “raw schema”, you’ll protect your raw data (or “atomic data”, as the Snowplow teams like to say) from ad-hoc use. If your end-users start building queries straight on top of your raw data, it’ll be hard to change those tables later.
For your data loads, there are also some best practices.
For example it’s important to select a single timestamp format and enforce it across all tables in your schemas. Also when possible validate simple strings like emails, URLs, or other IDs to avoid needing to do that later.
Columns with a CHAR data type only accept single-byte UTF-8 characters, up to byte value 127, or 7F hex, which is also the ASCII character set. VARCHAR columns accept multibyte UTF-8 characters, to a maximum of four bytes.
Primary keys are not enforced! De-duplication is your responsibility.
Hope that helps! Ping me here if you have more questions, happy to help.