Visualise Snowplow data using Airbnb Caravel & Redshift [tutorial]


#1

Exposing data from Redshift can be a pain when many dashboarding tools don’t support it OOB. As big R users we often generate reports out of RStudio using rmarkdown and knitr for performing useful analytics on the data. But for some reports it’s just impractical as clients want to explore their up-to-date data in an accessible dashboard. Since dashboards aren’t a core part of our business we couldn’t justify thousands of dollars a year for BI tools like Periscope or Tableau. Meanwhile other services like Slemma were OK but offered limited control and slow reports.

Introducing Airbnb Caravel

Airbnb decided to roll their own interactive BI solution and open source it. It has a large following on Github thanks to it being built on Python, Flask, Pandas and SQLAlchemy. I think the Snowplow community will love it for:

  • Diverse set of visualisations to choose from
  • Create and manage users with fine-grained permissions
  • Connect to Redshift and many other DBs supported by SQLAlchemy
  • Caching for taxing queries
  • JSON/CORS support for exposing Redshift data across domains
  • Interactively query your Snowplow events from the web

The feature set is still on the basic side but it’s really promising. Whereas before I would jump into Google Analytics to answer quick questions of web analytics data, we now use Caravel to access every single event in Redshift.

Getting started

You have two options:

I find it’s simplest to just manage the whole process with Docker containers, because you may struggle getting the environment ready with all the dependencies in Python. Note: If you’re installing this directly you will need to install the sqlalchemy-redshift package to connect to Redshift using: pip install sqlalchemy-redshift.

1, Fire up an instance and make sure you have Docker installed: https://docs.docker.com/engine/installation/linux/ubuntulinux/

2, Run the Docker container automatically managed by amancevice on Github (it’s an automated build kept up to date and contains everything you need to access Redshift out of the box). If you’re just trialling Caravel for the first time you may like to use the SQLite DB before committing to MySQL:

docker run --detach --name caravel \
    --env SECRET_KEY="mySUPERsecretKEY" \
    --env SQLALCHEMY_DATABASE_URI="sqlite:////home/caravel/db/caravel.db" \
    --publish 8088:8088 \
    --volume ~/caravel:/home/caravel/db \
    amancevice/caravel

3, Initialise the Caravel DB by running the init command and answer some questions to setup your admin user details:

docker exec -it caravel caravel-init

4, Make sure you open up port 8080 and you should be able to login to your Caravel instance. Boom.

5, After logging in, browse to “Sources” > “Databases” and add your Redshift DB to Caravel and Test the connection:

It should follow this format: redshift+psycopg2://username@host.amazonaws.com:5439/database

As always, make sure Redshift is accessible by the new host. You’ll know it’s working when you get a successful message upon hitting “Test connection”

6, Next, you’ll need to define your tables. Under the “Sources” menu, hit “Tables” and add a new row:

To get started you may like to add your atomic.events table:

Table name: events
Database: snowplow (Or whatever you named the DB you just created)
Schema: atomic
Main dttm: collector_tstamp
Offset: However many hours you want to offset from UTC time.

If Caravel can’t find the table, make sure your user permissions allow access to the table and failing that, check that the atomic schema is in Redshift’s search path - you’ll need to modify your cluster parameters as described here by Redshift.

7, Once your table is saved, you’ll want to edit the table and let Caravel know how to treat each column. I won’t cover this here as it should be fairly self-explantory:

This affects which variables will be available to you in Exploration mode.

8, Now you have all your data setup, you should be ready to create slices for a dashboard and begin exploring your data interactively. Just browse to your newly created events table and open it, right away it will show you a table with a count:

You can perform simple aggregations in the table using Group By and count distinct:

But most importantly you can change the visualisation to any number of the built in charts:

How awesome is that!

9, When you’re finished exploring your data and you want to save it as a slice for viewing later, just hit save and add it to a dashboard.

Voila… you’ve now got all your Snowplow data at your fingertips!

Exciting new features are coming soon to Caravel

  • Tabbed SQL editor to write and run queries directly in a web interface
  • Table joins support coming

Let me know what you think of this guide and if you run into any issues, just let me know.


#2

Thanks for sharing @robkingston. I had been wanting to give Caravel a spin for a long time, and this is just the nudge I needed :slight_smile:


#3

Thanks @christophe - you’re welcome.

Knew the Snowplow community would be interested in a tool like this.


#4

I’ve tried both Caravel and Redash, and while Caravel is much more powerful, I like the simplicity of Redash. https://github.com/getredash/redash


#5

This is awesome, thanks @robkingston


#6

@robkingston great stuff…are you deploying Caravel on ec2? I’m curious what infrastructure is required to run it for a small(ish) company (<100 people)


#7

@travisdevitt - cheers. Our instance isn’t running on AWS, we use another local provider where the specs are:

6 "virtual CPUs"
4gb RAM
For ~$35USD/month

Even this is seriously overkill. It barely breaks a sweat when you’re opening up 5-10 dashboards at once. Here’s our CPU usage for the past week (minimal usage - with MariaDB behind it, and Nginx/OpenSSL running in front):

I tested it on an instance as small as 1 vCPU, 768mb RAM and it worked there too, but I don’t think it would hold up well to multiple active users. A single T2 Small should do a fine job with room to spare.

But as you’d expect Redshift is a real bottleneck. I want to configure datasource caching so we’re not constantly hitting our Redshift cluster every time we open a dashboard.


#8

@robkingston that all makes a ton of sense…thanks Rob!


#9

Note: If you’re installing this directly and running into errors attempting to connect to Redshift make sure you pip install sqlalchemy-redshift, Caravel won’t install this automatically.


#10

Thanks @robkingston,
Good tutorial and good overview of Caravel.


#11

Great stuff, thanks! Has anyone tried out Metabase? I gave it a spin last week and was impressed by the recent updates. Seems to work good for simple dashboard.


#12

Thanks @mike - Good spot. Added…

@bernardosrulzon - I haven’t but it looks interesting.


#13

We have used metabase for quite some time, enjoy the more polished interface and built-in slack integration. Definitely worth checking out!


#14

@robkingston Do you have any things that are really annoying in caravel? I’m evaluating some BI tools for my company and that I can’t write plain SQL yet would be a big downside for example.


#15

@tclass - hah, well no SQL editor is inconvenient but it’s in the works:

via this PR: https://github.com/airbnb/caravel/pull/514

Other pain points:

  • Confusing to setup initial permissions for users/groups - They’re way too granular and it should be simple
  • When a query returns null, you’ll have a big red ugly error on your dashboard - not the most professional looking / client friendly error
  • Very few guides to get up and running
  • No way to cancel long-running queries (yet)
  • Development looks really chaotic, so there’s no telling what features will land in the next release (or when that may be)
  • Design feels a little dated
  • Interface feels unintuitive in places “Add DB -> Add table -> Explore data -> Add slice to dashboard” workflow is not obvious for instance

#16

Hey @robkingston - finally getting an opportunity to work through your tutorial.

A couple of things I found I had to change - noting them here in case other people find them useful:

Launching the docker image

I found I had to run:

sudo docker run --detach --name caravel \
    --env SECRET_KEY="FbLUn0gMn2oL" \
    --env SQLALCHEMY_DATABASE_URI="sqlite:////home/caravel/db/caravel.db" \
    --publish 8088:8088 \
    --volume ~/caravel/caravel.db:/home/caravel/caravel.db \
    amancevice/caravel

Note the

    --volume ~/caravel/caravel.db:/home/caravel/caravel.db

rather than

    --volume ~/caravel:/home/caravel/db

If I tried running the above without the modification the container wouldn’t start/stopped immediately. It looks like I had this issue but in reverse(?) Needs further investigation…

Connecting to Redshift

I was almost certainly being dense, but it took me a while to figure out that the SQLAlchemy URI needed to be of the form:

redshift+psycopg2://username:password@host.amazonaws.com:5439/database

Note the :password - this is key :smile:

I was also put off because every time I tried to test the connection it failed. However if I ignored the failed test I could subsequently go on to connect fine.

Excited to see what we can do with Caravel…!


#17

@yali did you run in this issue yet? It isn’t working for me at all https://github.com/airbnb/caravel/issues/682


#18

No - I haven’t run into that issue @tclass


#19

For those interested, Caravel has made quite a few updates in the past few months.

To name a few:

  • It’s now called “Superset”
  • It comes with a SQL editor (called SQL Lab) - but it’s very basic
  • They’ve largely restyled it to look more modern
  • Loads of bug fixes
  • Really nice documentation http://airbnb.io/superset/