This guide is geared towards data analysts who have limited or no experience with R. It aims to get the reader started with R through concrete examples with Snowplow data - so that by the end of the guide, the reader can pull Snowplow data into R, play around with it and plot it. That should leave the reader comfortable enough to use other resources to build out her knowledge of R, and give her enough knowledge to start using R to perform productive analytics on Snowplow data.
- Why use R?
- Getting started: plotting the number of uniques per day
- Having a look at the data: an introduction to data frames in R
- Creating our first plot
In the first section of this guide, we outline why R is such an excellent tool to use. Next (sections 2 to 4), we work through how to pull a simple data set from Snowplow and visualize it in R. In sections 5 to 7, we work through a more complicated set of examples: one in which we pull a more involved data set from Snowplow, and visualize it in a number of ways that are difficult with traditional BI / OLAP tools or Excel.
R is a fantastic analytics tool. To date, however, only a minority of web analysts use it. To list just some of the things that make it so excellent, especially when compared with traditional general purpose tools like Excel, and BI tools like Tableau:
- It is great for producing a wide variety of visualizations - including a much wider range than those supported by BI tools
- It has fantastic tools for extending existing visualizations and creating new visualizations altogether
- It is easy to document analytics performed with R and retrace your steps. (Simply by copying and pasting your steps into the R console.) This makes R a much safer and more robust environment to interrogate data than e.g. Excel or Tableau, where if you realize you made an error eight steps back, retracing your steps can be difficult and time consuming.
- It is blindingly fast to both perform complicated analytics and generate beautiful visualizations. The syntax is incredibly concise - what you can achieve in one line of R can take hours of working through menus and options on other systems.
- It helps you think through what you do with your data in a more rigorous way. R forces you to define your data types much more specifically than either Excel or BI tools like Tableau. That rigour is helpful - it means you do things faster, and you’re less liable to make mistakes.
- It is great at statistics. Traditional BI tools and Excel suck at statistics. Sure, you can calculate means, medians, quartiles etc. But actually pulling these together into meaningful distribution plots, or plotting how they change over time, is a pain in these tools, which are much better at counting and summing data. R is simply much better at statistics
- It is a great tool for modeling data: it is straightforward to create models in R, and compare those models to your actual data sets, either formally through calculation, or by plotting your model(s) against your data set(s)
- It has an enormous library of packages available for performing just about any type of analytics imaginable
- It is free
- It works on all platforms
In spite of all the benefits listed above, people often struggle with R to start off with. The command line is a difficult place to start, and R has some idiosyncrasies that need to be understood.
In this guide, we cover these by working through practical examples with Snowplow data, paying special attention to those key concepts and idiosyncrasies.
If you have not installed R yet, instructions on doing so can be found here.
This guide assumes you have installed R, along with the RPostgreSQL package required to get R to talk to PostgreSQL databases, including Amazon Redshift. If you have not installed the package, you can do so by executing the following at the R prompt:
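The command in question is R's standard package installer:

```r
# Install the RPostgreSQL driver package from CRAN
install.packages("RPostgreSQL")
```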
R will ask you to select a mirror to download the package from. Select a local mirror and the package should install automatically.
Now we need to tell R to use the package:
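Loading an installed package into the current session is done with library():

```r
# Load the RPostgreSQL package so its functions are available
library("RPostgreSQL")
```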
The library lets us execute a SQL query against our data in Redshift, and pulls the result of the query into R as a data frame. (We’ll explain a bit about data frames shortly.) The query we want to execute, to find out how many unique visitors we’ve received to our site by day, is:
```sql
SELECT
  to_char(collector_tstamp, 'YYYY-MM-DD') AS "date",
  count(distinct(domain_userid)) AS uniques
FROM "public"."events"
GROUP BY "date"
ORDER BY "date";
```
To execute the query in R, first we have to set up a connection to our Redshift database, by executing the following commands:
```r
drv <- dbDriver("PostgreSQL")
con <- dbConnect(drv,
                 host="<<ENTER HOST DETAILS HERE>>",
                 port="<<ENTER PORT DETAILS HERE>>",
                 dbname="<<ENTER DB NAME HERE>>",
                 user="<<ENTER USERNAME HERE>>",
                 password="<<ENTER PASSWORD HERE>>")
```
(For details about where to find your host, port, dbname and username, see the [setup guide](install-r).)
Now we can execute the query, and assign the result to a variable called uniquesByDay, by executing the following:
```r
uniquesByDay <- dbGetQuery(con, "
  SELECT
    to_char(collector_tstamp, 'YYYY-MM-DD') AS \"date\",
    count(distinct(domain_userid)) AS uniques
  FROM \"public\".\"events\"
  WHERE collector_tstamp > '2013-02-22'
  GROUP BY \"date\"
  ORDER BY \"date\"
")
```
Note how double inverted commas have to be escaped with a backslash.
We can have a look at the uniquesByDay variable by simply entering its name at the command line:
R responds by dumping the contents of the variable to the screen. We see that the data is in the form of a table with three columns: the first column is the row number, the second is the date column, and the third is the uniques column. We can view just the first six rows of data by entering head(uniquesByDay) at the prompt. We've pasted the contents of the screen, including what R returns, in our case:
```
> head(uniquesByDay)
        date uniques
1 2013-02-22      78
2 2013-02-23      70
3 2013-02-24      76
4 2013-02-25     125
5 2013-02-26     130
6 2013-02-27      88
>
```
We can find out more about our uniquesByDay object by typing summary(uniquesByDay) at the prompt:
```
> summary(uniquesByDay)
     date              uniques
 Length:124         Min.   : 16.00
 Class :character   1st Qu.: 75.75
 Mode  :character   Median :116.00
                    Mean   :114.73
                    3rd Qu.:150.25
                    Max.   :274.00
>
```
R tells us that our data contains two columns: date and uniques. It tells us what type each column is: date is of type character, while uniques is a numeric field, for which it gives us some basic statistical information, so we can get a sense of the range of values. Note, however, that the data in the date column really corresponds to dates, not plain text. We can update the data type for this column:
```r
> uniquesByDay$date <- as.Date(uniquesByDay$date)
```
When we now look at the summary for our data frame, the type of data in the date column has changed:
```
> summary(uniquesByDay)
      date               uniques
 Min.   :2013-02-22   Min.   : 16.00
 1st Qu.:2013-03-24   1st Qu.: 75.75
 Median :2013-04-24   Median :116.00
 Mean   :2013-04-24   Mean   :114.73
 3rd Qu.:2013-05-25   3rd Qu.:150.25
 Max.   :2013-06-25   Max.   :274.00
>
```
We can get further information on the structure of the data frame by executing the str() function on it:
```
> str(uniquesByDay)
'data.frame':	124 obs. of  2 variables:
 $ date   : Date, format: "2013-02-22" "2013-02-23" ...
 $ uniques: num  78 70 76 125 130 88 84 69 35 75 ...
```
This confirms that the type of the first column has been set to a date format.
There are a number of built-in and additional libraries for plotting data visualizations in R. We recommend starting out with the absolutely excellent ggplot2.
First, install the ggplot2 package, if you have not done so previously:
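ggplot2 installs like any other CRAN package:

```r
# Install ggplot2 from CRAN
install.packages("ggplot2")
```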
Now that it’s installed, you can load the library:
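This uses the same library() call as before:

```r
# Load ggplot2 into the current session
library("ggplot2")
```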
Now we can plot the number of uniques over time, using the qplot (quick plot) command:
```r
> qplot(date, uniques, data=uniquesByDay, geom="line")
```
The plot appears in a new window:
Here you can find part 2 of this guide.