Get started analyzing Snowplow data with R - part 1

Getting started analyzing Snowplow data with R (part 1)

This guide is geared towards data analysts who have limited or no experience with R. It aims to get the reader started with R through concrete examples with Snowplow data - so that by the end of the guide, the reader can pull Snowplow data into R, play around with it and plot it. That should leave the reader comfortable enough to use other resources to build out her knowledge of R, and give her enough knowledge to start using R to perform productive analytics on Snowplow data.

  1. Why use R?
  2. Getting started: plotting the number of uniques per day
  3. Having a look at the data: an introduction to data frames in R
  4. Creating our first plot

In the first section of this guide, we outline why R is such an excellent tool to use. Next (sections 2 to 4), we work through how to pull a simple data set from Snowplow and visualize it in R. In sections 5 to 7, we work through a more complicated set of examples: one in which we pull a more involved data set from Snowplow, and visualize it in a number of ways that are difficult with traditional BI / OLAP tools in Excel.

1. Why use R?

R is a fantastic analytics tool. To date, however, only a minority of web analysts use it. To list just some of what makes it so excellent, especially when compared with traditional general purpose tools like Excel, and BI tools like Tableau:

  • It is great for producing a wide variety of visualizations - including a much wider than those supported by BI tools
  • It has fantastic tools for extending existing visualizations and creating new visualizations all together
  • It is easy to document analytics performed with R and retrace your steps. (Simply by copying and pasting your steps into the R console.) This makes R a much safer and more robust environment to interrogate data than e.g. Excel or Tableau, where if you realize you made an error eight steps back, retracing your steps can be difficult and time consuming.
  • It is blindingly fast to both perform complicated analytics and generate beautiful visualizations. The syntax is incredibly concise - what you can achieve in one line of R can take hours of working through menus and options on other systems.
  • It helps you think through what you do with your data in a more rigorous way. R forces you to define your data types much more specifically than either Excel or BI tools like Tableau. That rigour is helpful - it means you do things faster, and you’re less liable to make mistakes.
  • It is great at statistics. Traditional BI tools and Excel suck at statistics. Sure you can calculate means, medians, quartiles etc. But actually pulling these together into meaningful distribution plots, or plotting how they change over time, is a pain in these tools, which are much better at counting and summing data. R is simply much better at statistics
  • It is a great tool for modeling data: it is straightforward to create models in R, and compare those models to your actual data sets, either formally through calculation, or by plotting you model(s) against your data(sets)
  • It has an enormous library of packages available for performing just about any type of analytics imaginable
  • It is free
  • It works on all platforms

In spite of all the benefits listed above, people often find struggle with R to start off with. The command line is a difficult place to start, and R has some idiosyncracies that need to be understood.

In this guide, we cover them by working through practical examples with Snowplow data, paying special attention to covering some of those key concepts and idiosyncracies.

Back to top.

2. Getting started: plotting the number of unique visitors by day

If you have not installed R yet, instructions on doing so can be found here.

This guide assumes you have installed R, and installed the RPostgreSQL package required to get R to talk to PostgreSQL databases including Amazon Redshift. If you have not, installed this, you can do so by executing the following at the R prompt:

> install.packages("RPostgreSQL")

R will ask you to select a mirror to download the package from. Select a local mirror and the package should install automatically.

Now we need to tell R to use the package:

> library("RPostgreSQL")

The library lets us execute a SQL query against our data in Redshift, and pulls the result of the query into R as a data frame. (We’ll explain a bit about data frames shortly.) The query we want to execute, to find out how many unique visitors we’ve received to our site by day, is:

SELECT
to_char(collector_tstamp, 'YYYY-MM-DD') AS "date",
count(distinct(domain_userid)) AS uniques
FROM "public"."events"
GROUP BY "date"
ORDER BY "date";

To execute the query in R, first we have to setup a connection to our Redshift database, by executing the following commands:

drv <- dbDriver("PostgreSQL")
con <- dbConnect(drv, host="<<ENTER HOST DETAILS HERE>>", port="<<ENTER PORT DETAILS HERE>>",dbname="<<ENTER DB NAME HERE>>", user="<<ENTER USERNAME HERE>>", password="<<ENTER PASSWORD HERE>>")

(For detials about where to find your host, port, dbname and username see the [setup guide] install-r.)

Now we can execute the query, and assign it to a variable called uniquesByDay, by executing the following:

uniquesByDay <- dbGetQuery(con, "
	SELECT
	to_char(collector_tstamp, 'YYYY-MM-DD') AS \"date\",
	count(distinct(domain_userid)) AS uniques
	FROM \"public\".\"events\"
	WHERE collector_tstamp > '2013-02-22'
	GROUP BY \"date\"
	ORDER BY \"date\"
")

Note how double inverted commas have to be escaped with a backslash.

Back to top.

3. Having a look at the data: an introduction to data frames in R

We can have a look at the uniquesByDay variable by simply entering it at the command line:

uniquesByDay

R responds by dumping the contents of the variable to the screen. We see that the data is in the form of a table with three columns. The first column is the row number, the second column is the date column, and the third is the uniques column. We can view just the top 5 lines of data by entering head(uniquesByDay) at the prompt. We’ve pasted the contents of the screen, including what R returns, in our case:

> head(uniquesByDay)
        date uniques
1 2013-02-22      78
2 2013-02-23      70
3 2013-02-24      76
4 2013-02-25     125
5 2013-02-26     130
6 2013-02-27      88
>

We can find out more about our uniquesByDay object, by typing summary(uniquesByDay) at the prompt:

> summary(uniquesByDay)
     date              uniques
 Length:124         Min.   : 16.00  
 Class :character   1st Qu.: 75.75  
 Mode  :character   Median :116.00  
                    Mean   :114.73  
                    3rd Qu.:150.25  
                    Max.   :274.00  
>

R tells us that our data contains two columns: date and uniques. It tells us what type each column is: date is of type character, and uniques is a numeric field, for which it gives us some basic statistical information, so we can get a sense of the range of values.

Our date column really corresponds to date. We can update the data type for this column:

> uniquesByDay$date <- as.Date(uniquesByDay$date)

When we now look at the summary for our data frame, the type of data in the date column has changed:

> summary(uniquesByDay)
      date               uniques
 Min.   :2013-02-22   Min.   : 16.00  
 1st Qu.:2013-03-24   1st Qu.: 75.75  
 Median :2013-04-24   Median :116.00  
 Mean   :2013-04-24   Mean   :114.73  
 3rd Qu.:2013-05-25   3rd Qu.:150.25  
 Max.   :2013-06-25   Max.   :274.00  
>

We can get further information on the structure of the data frame by executing the str() function on it:

> str(uniquesByDay)
'data.frame':	124 obs. of  2 variables:
 $ date   : Date, format: "2013-02-22" "2013-02-23" ...
 $ uniques: num  78 70 76 125 130 88 84 69 35 75 ...

This confirms that the type of the first column has been set to a date format.

Back to top.

4. Creating our first plot

There are a number of built in and additional libraries for plotting data visualizations in R. We recommend starting out with the absolutely excellent ggplot2.

First, install the ggplot2 package, if you have not done so previously:

> install.packages("ggplot2")

Now that it’s installed, you can load the library:

> library("ggplot2")

Now we can plot the number of uniques over time, using the qplot (quickplot) command:

> qplot(date, uniques, data=uniquesByDay, geom="line")

The plot appears in a new window:

Back to top.

Here you can find part 2 of this guide.