Monitoring Snowplow

grvregmi · May 7, 2018, 5:23pm

I am setting up Snowplow to track user events from web and mobile apps on my platform. I got it up and running and it is now logging events to Redshift.

Web/mobile -> Clojure collector -> S3 -> EMREtl -> Redshift

Now I want to monitor whether there are any errors getting event data to the collector, and whether the EMR job ran as intended and loaded the data into Redshift. What do you use to log errors, or just to monitor the overall health of the pipeline?

Thanks in advance for your input.

Lars · May 7, 2018, 6:10pm

Can you share a bit more about your ETL pipeline?

Are you using an orchestration tool, like Airflow or Luigi?
Do you have a separate user for your ETL in Redshift (i.e. no shared login)?
How frequently are you dumping data into Redshift?
Do you have a separate staging schema for those data dumps in Redshift?
Have you set up workload management in Redshift?

grvregmi · May 7, 2018, 7:22pm

Hi Lars,

Thanks for replying. Please find the answers below, in the same order as your questions:

No orchestration tools yet.
No separate user
Every two hours
No separate staging schema
No workload management set up yet

Lars · May 7, 2018, 8:16pm

OK, thanks for that additional detail.

If you’re just playing around a bit, and not giving anybody direct SQL access to your data, you’re fine.

But if activity on the cluster increases, or it becomes more mission-critical (e.g. embedded charts or a scheduled report), then there’s a bit of upfront work you can do now that will help you in the long run.

I wrote that up in 3 Things to Avoid When Setting Up an Amazon Redshift Cluster

tl;dr

give your ETL pipeline a separate user, e.g. ‘snowplow_etl’
add a “load” queue in your WLM, assign your ETL user to that queue
create a separate schema for your data loads
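
To make those three steps a bit more concrete, here is a rough Python sketch that only builds the SQL statements and a WLM queue definition as text; it does not connect to anything. The group name `etl_load`, the schema name `snowplow_raw`, the slot counts and the memory split are all placeholders I made up for illustration, so adjust them to your cluster. As far as I know, the JSON goes into the `wlm_json_configuration` parameter of the cluster's parameter group, but check the Redshift docs before applying it.

```python
import json


def etl_setup_statements(user="snowplow_etl", group="etl_load", schema="snowplow_raw"):
    """Build Redshift SQL for a dedicated ETL user, user group and load schema.

    The group and schema names are hypothetical placeholders.
    """
    return [
        f"CREATE USER {user} PASSWORD '<set-a-password>';",
        f"CREATE GROUP {group} WITH USER {user};",
        f"CREATE SCHEMA IF NOT EXISTS {schema} AUTHORIZATION {user};",
    ]


def wlm_config(group="etl_load", load_slots=2, load_memory_pct=30):
    """Build a two-queue WLM definition: a "load" queue plus the default queue.

    Slot counts and memory percentages are illustrative assumptions.
    """
    queues = [
        # Queries run by members of the ETL group are routed to this queue.
        {
            "user_group": [group],
            "query_concurrency": load_slots,
            "memory_percent_to_use": load_memory_pct,
        },
        # The last queue acts as the default for everything else.
        {
            "query_concurrency": 5,
            "memory_percent_to_use": 100 - load_memory_pct,
        },
    ]
    return json.dumps(queues, indent=2)


if __name__ == "__main__":
    print("\n".join(etl_setup_statements()))
    print(wlm_config())
```

Putting the ETL user in a group (rather than naming the user directly) means you can add a second loader later without touching the WLM configuration.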

With that setup, you’ll have more visibility into your ETL, which makes isolating errors and exceptions straightforward. A separate “raw schema” also protects your raw data (or “atomic data”, as the Snowplow team likes to say) from ad-hoc use. If your end users start building queries straight on top of your raw data, it’ll be hard to change those tables later.

For your data loads, there are also some best practices.

For example, it’s important to select a single timestamp format and enforce it across all tables in your schemas. Also, where possible, validate simple strings like emails, URLs, or other IDs at load time so you don’t have to do it later.
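
As an illustration of the timestamp and validation advice, here is a minimal sketch in Python. The list of input formats, the target format, and the assumption that timestamps without an offset are already in UTC are all mine, not something Snowplow or Redshift prescribes. The email and URL checks are deliberately loose plausibility checks, not full validators.

```python
import re
from datetime import datetime, timezone
from urllib.parse import urlparse

# Single target format for every table (an assumed choice).
TARGET_FORMAT = "%Y-%m-%d %H:%M:%S"
# Input formats we expect to see; extend this for your own sources.
KNOWN_FORMATS = ("%Y-%m-%d %H:%M:%S", "%Y-%m-%dT%H:%M:%S%z", "%d/%m/%Y %H:%M")

EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def normalize_timestamp(raw):
    """Parse a timestamp in any known format and return it in TARGET_FORMAT (UTC)."""
    for fmt in KNOWN_FORMATS:
        try:
            parsed = datetime.strptime(raw, fmt)
        except ValueError:
            continue
        if parsed.tzinfo is not None:
            # Convert offset-aware values to UTC; naive values are assumed UTC.
            parsed = parsed.astimezone(timezone.utc)
        return parsed.strftime(TARGET_FORMAT)
    raise ValueError(f"unrecognised timestamp: {raw!r}")


def is_plausible_email(value):
    """Loose check: something@something.something, with no whitespace."""
    return bool(EMAIL_RE.match(value))


def is_plausible_url(value):
    """Loose check: an http(s) scheme and a non-empty host."""
    parts = urlparse(value)
    return parts.scheme in ("http", "https") and bool(parts.netloc)
```

Rows that fail these checks are usually better routed to a separate "rejects" table than silently dropped, so you can see what your sources are actually sending.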

Columns with a CHAR data type only accept single-byte UTF-8 characters, up to byte value 127, or 7F hex, which is also the ASCII character set. VARCHAR columns accept multibyte UTF-8 characters, to a maximum of four bytes.
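
A quick way to see the difference before loading is to check strings in Python. This is just a sketch with helper names I made up; it relies on the declared length of both column types being a byte count, so a multibyte character uses up more than one unit of a VARCHAR's length.

```python
def fits_char(value, length):
    """True if value is ASCII-only and fits in a CHAR(length) column."""
    return value.isascii() and len(value) <= length


def fits_varchar(value, byte_length):
    """True if the UTF-8 encoding of value fits in a VARCHAR(byte_length) column."""
    return len(value.encode("utf-8")) <= byte_length


if __name__ == "__main__":
    word = "caf\u00e9"  # four characters, five bytes in UTF-8
    print(fits_char(word, 4), fits_varchar(word, 4), fits_varchar(word, 5))
```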

Primary keys are not enforced! De-duplication is your responsibility.
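
One possible way to handle that before the load is to de-duplicate in the ETL step. The sketch below keeps the earliest row per key. I am assuming rows are dicts with an `event_id` key and a sortable `collector_tstamp` (both exist in Snowplow's events table, but whether "keep the earliest" is the right rule for your data is your call).

```python
def deduplicate(rows, key="event_id", order_by="collector_tstamp"):
    """Keep one row per key value: the one with the smallest order_by value.

    Output order follows the first appearance of each key in the input.
    """
    earliest = {}
    for row in rows:
        current = earliest.get(row[key])
        if current is None or row[order_by] < current[order_by]:
            earliest[row[key]] = row
    return list(earliest.values())


if __name__ == "__main__":
    events = [
        {"event_id": "a", "collector_tstamp": "2018-05-07 10:00:00"},
        {"event_id": "b", "collector_tstamp": "2018-05-07 10:01:00"},
        {"event_id": "a", "collector_tstamp": "2018-05-07 09:59:00"},
    ]
    for event in deduplicate(events):
        print(event)
```

The same idea can be expressed in SQL inside Redshift with a window function over the key, which avoids pulling data out of the cluster.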

Hope that helps! Ping me here if you have more questions, happy to help.
