Monitoring snowplow


#1

I am setting up snowplow to track user events from web and mobile apps in my platform. I got it up and running and it is now logging events to Redshift.

Web/mobile -> Clojure collector -> S3 -> EMREtl -> Redshift

Now, I want to monitor if there are any errors getting event data to the collector or if the EMR ran as it was intended to and dumped the data into redshift. What do you guys use to log any errors or just to monitor the overall health of the pipeline.

Thanks in advance for your input.


#2

Can you share a bit more about your ETL pipeline?

  • Any orchestration tool that you’re using, like Airflow or Luigi?
  • do you have a separate user for your ETL in Redshift? (i.e. no shared login)
  • how frequent are you dumping data into Redshift?
  • do you have a separate staging schema for those data dumps in Redshift?
  • have you set up workload management in Redshift?

#3

Hi Lars,

Thanks Lars for replying. Please find the answers below

  • No orchestration tools yet.
  • No separate user
  • Every two hours
  • No separate staging schema
  • No workload management set up yet

#4

ok, thanks for that additional detail.

If you’re just playing around a bit, and not giving anybody direct SQL access to your data, you’re fine.

But if you increase your activity on the cluster, maybe even a bit more mission-critical (e.g. embedded charts, or a scheduled report), then there’s a bit of upfront work you can do now, and it will help you in the long run.

I wrote that up in 3 Things to Avoid When Setting Up an Amazon Redshift Cluster

tl;dr

  • give your ETL pipeline a separate user, e.g. ‘snowplow_etl’
  • add a “load” queue in your WLM, assign your ETL user to that queue
  • create a separate schema for your data loads

By following that set-up, you’ll have more visibility into your ETL, which will make isolating any errors / exceptions straightforward. By having a separate “raw schema”, you’ll protect your raw data (or “atomic data”, as the Snowplow teams like to say) from ad-hoc use. If your end-users start building queries straight on top of your raw data, it’ll be hard to change those tables later.

For your data loads, there are also some best practices.

For example it’s important to select a single timestamp format and enforce it across all tables in your schemas. Also when possible validate simple strings like emails, URLs, or other IDs to avoid needing to do that later.

Columns with a CHAR data type only accept single-byte UTF-8 characters, up to byte value 127, or 7F hex, which is also the ASCII character set. VARCHAR columns accept multibyte UTF-8 characters, to a maximum of four bytes.

Primary keys are not enforced! De-duplication is your responsibility.

Hope that helps! Ping me here if you have more questions, happy to help.