Marketing channels + sessions using Google Analytics cookie data from com_google_analytics_cookies_1

Trying to analyse marketing channels can be difficult using the Snowplow Analytics event model.

When it comes to marketing channel data, the draw back to the Snowplow Analytics event model is the channel data from mkt_* and refr_* fields in atomic.events is not persistent. This means you only get it at the first event per domain_sessionidx (for brevity, let’s call it a session) and this session is based on the arbitrary time out of 30 minutes.

What you end up with is a challenging query as you use sub queries, joins and windows functions to make sense of it all. Even then, if you don’t have campaign tracking set up (UTM’s), you end up with poor output in the fields for paid traffic like Adwords (GCLID is in the referring query) but the source is Google, no differentiation on paid versus organic in atomic.events.

The good news is, for users of Snowplow Analytics that still run Google Analytics Classic, the cookie data from GA is stored client side (unlike Universal which is all done server side) and you can use the auto contexts feature of Snowplow to store all the cookie data.

contexts: {
    ...
    gaCookies: true, //if you want auto GA cookie tracking
    ...
}

Setting the above to true gives you the table in Redshift called com_google_analytics_cookies_1 (assumes you have set the SQL it won’t just happen)

com_google_analytics_cookies_1

  • For users of Universal analytics, at best you get the cookie ID which is available in the field __ga. Somewhat useful but not really…
  • If you are running classic GA, the cookie data is stored in full in the __utmz field.

Looks like:

84217100.1462425347.68.11.utmcsr=xxx.co.id|utmccn=(referral)|utmcmd=referral|utmcct=/xxx/xxx

Unfortunately, this is a string but you can use the query below to create fields for each input type.

SQL for GA campaign data

select __utmz, 
root_id,
split_part(__utmz,'.',1) as "domain_hash",
split_part(__utmz,'.',2) as "cookie_timestamp",
split_part(__utmz,'.',3) as "session_number",
split_part(__utmz,'.',4) as "campaign_number",
split_part(regexp_substr(__utmz,'utmcsr=[^|]*'),'=',2) as "utmscr",
split_part(regexp_substr(__utmz,'utmgclid=[^|]*'),'=',2) as "utmgclid",
split_part(regexp_substr(__utmz,'utmcid=[^|]*'),'=',2) as "utmcid",
split_part(regexp_substr(__utmz,'utmgclsrc=[^|]*'),'=',2) as "utmgclsrc",
split_part(regexp_substr(__utmz,'utmdclid=[^|]*'),'=',2) as "utmdclid",
split_part(regexp_substr(__utmz,'utmdsid=[^|]*'),'=',2) as "utmdsid",
split_part(regexp_substr(__utmz,'utmccn=[^|]*'),'=',2) as "utmccn",
split_part(regexp_substr(__utmz,'utmcmd=[^|]*'),'=',2) as "utmcmd",
split_part(regexp_substr(__utmz,'utmctr=[^|]*'),'=',2) as "utmctr",
split_part(regexp_substr(__utmz,'utmcct=[^|]*'),'=',2) as "utmcct"
from "atomic".com_google_analytics_cookies_1
where root_tstamp > current_date - integer '1'
limit 1000

###Why is this important?

  1. This value is persistent, every event_id has a mapping to the root_id campaign data stored in the table.
  2. For better sessionisation, it uses the correct method used by Google (30 minute time our or source change) providing the session count along with the campaign count.

####A classic business question is asking the analyst to segment transactions by channel.
With the query above, it is a simple join on a.event_id = b.root_id as you tie a persistent value to the event type and you get the transaction BY channel. No more crazy windows functions and sub queries :slight_smile:

Take Away

The Snowplow event model is a flexible system that can ingest all kinds of source data into the event model. With the above, you are looking at pieces of one system to compliment the brilliant event model of Snowplow. What you end up with is a powerful solution that gives you every event, timestamped with channel data from GA to enhance the data model.

SQL-Runner

A handy tip is adding this query into SQL-Runner as a job so you always have up to date data.

1 Like

Thanks so much for sharing this @digdeep!

1 Like

Nice post @digdeep - you may be happy to know some open source projects maintain GA’s classic utmz cookie:

This is just one of them, but I’ve seen others out there too.