Visualise Snowplow data using Airbnb Caravel & Redshift [tutorial]

robkingston · August 12, 2016, 8:23am

Exposing data from Redshift can be a pain when many dashboarding tools don’t support it OOB. As big R users we often generate reports out of RStudio using rmarkdown and knitr for performing useful analytics on the data. But for some reports it’s just impractical as clients want to explore their up-to-date data in an accessible dashboard. Since dashboards aren’t a core part of our business we couldn’t justify thousands of dollars a year for BI tools like Periscope or Tableau. Meanwhile other services like Slemma were OK but offered limited control and slow reports.

Introducing Airbnb Caravel

Airbnb decided to roll their own interactive BI solution and open source it. It has a large following on Github thanks to it being built on Python, Flask, Pandas and SQLAlchemy. I think the Snowplow community will love it for:

Diverse set of visualisations to choose from
Create and manage users with fine-grained permissions
Connect to Redshift and many other DBs supported by SQLAlchemy
Caching for taxing queries
JSON/CORS support for exposing Redshift data across domains
Interactively query your Snowplow events from the web

The feature set is still on the basic side but it’s really promising. Whereas before I would jump into Google Analytics to answer quick questions of web analytics data, we now use Caravel to access every single event in Redshift.

Getting started

You have two options:

I find it’s simplest to just manage the whole process with Docker containers, because you may struggle getting the environment ready with all the dependencies in Python. Note: If you’re installing this directly you will need to install the sqlalchemy-redshift package to connect to Redshift using: pip install sqlalchemy-redshift.

1, Fire up an instance and make sure you have Docker installed: https://docs.docker.com/engine/installation/linux/ubuntulinux/

2, Run the Docker container automatically managed by amancevice on Github (it’s an automated build kept up to date and contains everything you need to access Redshift out of the box). If you’re just trialling Caravel for the first time you may like to use the SQLite DB before committing to MySQL:

docker run --detach --name caravel \
    --env SECRET_KEY="mySUPERsecretKEY" \
    --env SQLALCHEMY_DATABASE_URI="sqlite:////home/caravel/db/caravel.db" \
    --publish 8088:8088 \
    --volume ~/caravel:/home/caravel/db \
    amancevice/caravel

3, Initialise the Caravel DB by running the init command and answer some questions to setup your admin user details:

docker exec -it caravel caravel-init

4, Make sure you open up port 8080 and you should be able to login to your Caravel instance. Boom.

5, After logging in, browse to “Sources” > “Databases” and add your Redshift DB to Caravel and Test the connection:

It should follow this format: redshift+psycopg2://username@host.amazonaws.com:5439/database

As always, make sure Redshift is accessible by the new host. You’ll know it’s working when you get a successful message upon hitting “Test connection”

6, Next, you’ll need to define your tables. Under the “Sources” menu, hit “Tables” and add a new row:

To get started you may like to add your atomic.events table:

Table name: events
Database: snowplow (Or whatever you named the DB you just created)
Schema: atomic
Main dttm: collector_tstamp
Offset: However many hours you want to offset from UTC time.

If Caravel can’t find the table, make sure your user permissions allow access to the table and failing that, check that the atomic schema is in Redshift’s search path - you’ll need to modify your cluster parameters as described here by Redshift.

7, Once your table is saved, you’ll want to edit the table and let Caravel know how to treat each column. I won’t cover this here as it should be fairly self-explantory:

This affects which variables will be available to you in Exploration mode.

8, Now you have all your data setup, you should be ready to create slices for a dashboard and begin exploring your data interactively. Just browse to your newly created events table and open it, right away it will show you a table with a count:

You can perform simple aggregations in the table using Group By and count distinct:

But most importantly you can change the visualisation to any number of the built in charts:

How awesome is that!

9, When you’re finished exploring your data and you want to save it as a slice for viewing later, just hit save and add it to a dashboard.

Voila… you’ve now got all your Snowplow data at your fingertips!

Exciting new features are coming soon to Caravel

Tabbed SQL editor to write and run queries directly in a web interface
Table joins support coming

Let me know what you think of this guide and if you run into any issues, just let me know.

christophe · August 12, 2016, 2:45pm

Thanks for sharing @robkingston. I had been wanting to give Caravel a spin for a long time, and this is just the nudge I needed

robkingston · August 12, 2016, 2:54pm

Thanks @christophe - you’re welcome.

Knew the Snowplow community would be interested in a tool like this.

matiasg · August 15, 2016, 1:37am

I’ve tried both Caravel and Redash, and while Caravel is much more powerful, I like the simplicity of Redash. https://github.com/getredash/redash

mike · August 15, 2016, 1:47am

This is awesome, thanks @robkingston

travisdevitt · August 15, 2016, 7:54pm

@robkingston great stuff…are you deploying Caravel on ec2? I’m curious what infrastructure is required to run it for a small(ish) company (<100 people)

robkingston · August 16, 2016, 2:54pm

@travisdevitt - cheers. Our instance isn’t running on AWS, we use another local provider where the specs are:

6 “virtual CPUs”
4gb RAM
For ~$35USD/month

Even this is seriously overkill. It barely breaks a sweat when you’re opening up 5-10 dashboards at once. Here’s our CPU usage for the past week (minimal usage - with MariaDB behind it, and Nginx/OpenSSL running in front):

I tested it on an instance as small as 1 vCPU, 768mb RAM and it worked there too, but I don’t think it would hold up well to multiple active users. A single T2 Small should do a fine job with room to spare.

But as you’d expect Redshift is a real bottleneck. I want to configure datasource caching so we’re not constantly hitting our Redshift cluster every time we open a dashboard.

travisdevitt · August 16, 2016, 3:28pm

@robkingston that all makes a ton of sense…thanks Rob!

mike · August 18, 2016, 2:03am

Note: If you’re installing this directly and running into errors attempting to connect to Redshift make sure you pip install sqlalchemy-redshift, Caravel won’t install this automatically.

fwahlqvist · August 18, 2016, 8:29pm

Thanks @robkingston,
Good tutorial and good overview of Caravel.

bernardosrulzon · August 20, 2016, 8:04pm

Great stuff, thanks! Has anyone tried out Metabase? I gave it a spin last week and was impressed by the recent updates. Seems to work good for simple dashboard.

robkingston · August 22, 2016, 12:50am

Thanks @mike - Good spot. Added…

@bernardosrulzon - I haven’t but it looks interesting.

13scoobie · August 22, 2016, 2:36pm

We have used metabase for quite some time, enjoy the more polished interface and built-in slack integration. Definitely worth checking out!

tclass · August 23, 2016, 8:57am

@robkingston Do you have any things that are really annoying in caravel? I’m evaluating some BI tools for my company and that I can’t write plain SQL yet would be a big downside for example.

robkingston · August 23, 2016, 1:06pm

@tclass - hah, well no SQL editor is inconvenient but it’s in the works:

via this PR: https://github.com/airbnb/caravel/pull/514

Other pain points:

Confusing to setup initial permissions for users/groups - They’re way too granular and it should be simple
When a query returns null, you’ll have a big red ugly error on your dashboard - not the most professional looking / client friendly error
Very few guides to get up and running
No way to cancel long-running queries (yet)
Development looks really chaotic, so there’s no telling what features will land in the next release (or when that may be)
Design feels a little dated
Interface feels unintuitive in places “Add DB -> Add table -> Explore data -> Add slice to dashboard” workflow is not obvious for instance

yali · September 7, 2016, 9:04pm

Hey @robkingston - finally getting an opportunity to work through your tutorial.

A couple of things I found I had to change - noting them here in case other people find them useful:

Launching the docker image

I found I had to run:

sudo docker run --detach --name caravel \
    --env SECRET_KEY="FbLUn0gMn2oL" \
    --env SQLALCHEMY_DATABASE_URI="sqlite:////home/caravel/db/caravel.db" \
    --publish 8088:8088 \
    --volume ~/caravel/caravel.db:/home/caravel/caravel.db \
    amancevice/caravel

Note the

    --volume ~/caravel/caravel.db:/home/caravel/caravel.db

rather than

    --volume ~/caravel:/home/caravel/db

If I tried running the above without the modification the container wouldn’t start/stopped immediately. It looks like I had this issue but in reverse(?) Needs further investigation…

Connecting to Redshift

I was almost certainly being dense, but it took me a while to figure out that the SQLAlchemy URI needed to be of the form:

redshift+psycopg2://username:password@host.amazonaws.com:5439/database

Note the :password - this is key

I was also put off because every time I tried to test the connection it failed. However if I ignored the failed test I could subsequently go on to connect fine.

Excited to see what we can do with Caravel…!

tclass · September 8, 2016, 4:42pm

@yali did you run in this issue yet? It isn’t working for me at all https://github.com/airbnb/caravel/issues/682

yali · September 9, 2016, 9:28am

No - I haven’t run into that issue @tclass…

robkingston · November 30, 2016, 1:19pm

For those interested, Caravel has made quite a few updates in the past few months.

To name a few:

It’s now called “Superset”
It comes with a SQL editor (called SQL Lab) - but it’s very basic
They’ve largely restyled it to look more modern
Loads of bug fixes
Really nice documentation http://airbnb.io/superset/

Topic		Replies	Views
Sample Data For Redshift Redshift	3	1897	December 22, 2016
Snowplow real-time analysis with on-premise pipeline For engineers	6	2411	January 4, 2018
Real-time visualization	15	898	February 10, 2023
Approaches to access data in S3 For data modelers & consumers	2	1393	May 18, 2021
Using Snowplow data to feed other applications For data modelers & consumers	1	1979	September 5, 2016