First and last touch attribution models in SQL [tutorial]

In order to calculate the return on marketing spend on individual campaigns, digital marketers need to connect revenue events, downstream in a user journey, with marketing touch events, upstream in a user journey. This connection is necessary so that the cost of those associated with the marketing campaign that drove those marketing touches can be connected to profit associated with the conversion events later on.

Different attribution models involve applying different logic to connecting those marketing touch events with subsequent revenue events. In this blog post, we will document the analytic approach and the corresponding SQL statements to perform the most basic types of attribution: first and last touch. Our example SQL will focus on the web use case. The same undelying analytic approach can, however, be applied to any channel.

Note that all the SQL given below is Redshift compatible. It can be easier to do attribution analysis with other dialects of SQL that support complicated data types (arrays and objects) in particular.

Identifying the different marketing touches

First we need to identify all our marketing touch events. We can generate a table with all of these as follows:

create table derived.marketing_touches as (
  select
    domain_userid,
    derived_tstamp,
    event_id, 
    mkt_medium,
    mkt_source,
    mkt_term,
    mkt_content,
    mkt_campaign,
    refr_medium,
    refr_source,
    refr_term
  from atomic.events
  where refr_medium != 'internal'
    and refr_medium is not null
  order by 1,2);

The above table includes a line of data per marketing touch, ordered by user (as identified via the first party cookie id domain_userid) and time (as identified by the derived_tstamp).

Identifying the different revenue events

Now lets create a table with all our different revenue events. What these look like will depend on your own particular event schema - for this example we’ll assume that revenue events are standard Snowplow transaction events. It should be straightforward to modify / update the below SQL with your own set of revenue events.

create table derived.revenue_events as (
  select
    domain_userid,
    derived_tstamp,
    event_id,
    tr_total
  from atomic.events
  where event_name = 'transaction'
  order by 1,2);

Connecting marketing touches with revenue events: first touch attribution model

Now that we have our marketing touches and our revenue events, we need to join them together.

How we do the join is firstly a question of business logic: what type of attribution model do we want to apply? The simplest model is a first touch model - this credits all the value associated with the revenue event to the first marketing touch for each user.

To do this, we create a new derived.first_marketing_touch table, that records only the first marketing touch for each user. This is a subset of the marketing touches recorded in the derived.marketing_touches table.

with first_touch_tstamps as (
  select
    domain_userid,
    min(derived_tstamp) as first_touch_tstamp
  from derived.marketing_touches
  group by 1,
  order by 1
)
create table derived.first_marketing_touch as (
  select
    m.domain_userid,
    m.derived_tstamp,
    m.event_id, 
    m.mkt_medium,
    m.mkt_source,
    m.mkt_term,
    m.mkt_content,
    m.mkt_campaign,
    m.refr_medium,
    m.refr_source,
    m.refr_term
  from derived.marketing_touches m
  join first_touch_tstamps f            
    -- only return first touch tstamps
  on m.domain_userid = f.domain_userid
  and m.derived_tstamp = f.first_touch_tstamp
);

Now it is trivial to join our derived.first_marketing_touch table with our derived.revenue_events table:

select
  f.*,
  r.tr_total
from derived.first_marketing_touch f
right join derived.revenue_events r    
  -- right join in case there is no marketing touch event to join to the revenue event
on f.domain_userid = r.domain_userid

Bingo! We have a table with a line of data for each revenue event, and all the marketing data associated with the corresponding first touch event for that user.

Note that the above is especially straightforward because:

It is easy to identify the first marketing touch for each user. (It’s simply the one wiht the earlierst timestamp.)
By the time we do the join we have a table with a maximum of one marketing touch event per user ID, so we do not have to worry about generating a cartesian product when performing the join.
Because neither of the above is true when we are applying a last click attribution model, the SQL gets a bit more complicated.

Connecting marketing touches with revenue events: last touch attribution model

In a last touch attribution model, we want to credit all the value associated with each revenue event to the most recent marketing touch that occurred prior to that event.

There are a number of ways to do this in SQL - I think the following is the most straightforward, but welcome any suggestions at alternatives that are clearer / more performant.

First, we need to identify for each revenue event what is the corresponding marketing touch event that we wish to connect. To do this, we first union our marketing touches and revenue event tables into a single table that contains both the marketing touches and the revenue events. For performance reasons, we only include a subset of the columns in our marketing touches table.

create table derived.marketing_touches_and_revenue_events as (
  select
    domain_userid,
    derived_tstamp,
    event_id as marketing_event_id,
    null as revenue_event_id,
    'marketing touch' as event_type,
    null as revenue
  from derived.marketing_touches
  union
  select
    domain_userid,
    derived_tstamp,
    null as marketing_event_id,
    event_id as revenue_event_id
    'revenue event' as event_type,
    tr_total as revenue
  from derived.revenue_events
);

The above table includes a line for every marketing touch event and every revenue events. Marketing touch events have a marketing_event_id set and revenue events have a revenue_event_id set.

We need to aggregate over this table so that we set the marketing_event_id for each revenue event. This will be the event_id of the most recent marketing touch event prior to the revenue event. We’ll then be able to use the event ID to join back with our marketing touches table, to pull all the metadata associated with that marketing touch to the revenue event.

To do that, we use a window function to identify the most recent marketing touch event prior to the revenue event:

select
  domain_userid,
  derived_tstamp,
  last_value(marketing_event_id ignore nulls) over (
    partition by domain_userid
    order by derived_tstamp
    rows between unbounded preceding and current row
  ) as marketing_event_id,
  revenue_event_id,
  event_type,
  tr_total
from derived.marketing_touches_and_revenue_events

The window function is doing a lot of work for us, so it is worth explaining what’s going on before we use the above query to generate our final result set:

  • First, it partitions our marketing touch and revenue events by user ID
  • Then it orders the event stream by time
  • Then for each event, it fetches the most recent not null marketing_event_id value. Note that this will be applied to every row in the table i.e. marketing touch events and transaction events.

Where it is applied to marketing events, the most recent marketing event ID will be the marketing event ID for the current event. That doesn’t matter (we’re going to filter these events out of the event stream in the next step). The important thing is that for revenue events, it will correctly fetch the most recent marketing event ID. (Because the marketing event ID for the current row will be null, so will be ignored.)

Now we apply the above window function to generate our final result set:

with last_touch_event_ids_calculated as (
  select
    domain_userid,
    derived_tstamp,
    last_value(marketing_event_id ignore nulls) over (
      partition by domain_userid
      order by derived_tstamp
      rows between unbounded preceding and current row
    ) as last_marketing_event_id,
    revenue_event_id,
    event_type,
    tr_total
  from derived.marketing_touches_and_revenue_events  
)
select
  r.domain_userid,
  r.derived_tstamp as revenue_event_tstamp,
  r.last_marketing_event_id
  r.revenue_event_id,
  r.tr_total,
  m.mkt_medium,
  m.mkt_source,
  m.mkt_term,
  m.mkt_content,
  m.mkt_campaign,
  m.refr_medium,
  m.refr_source,
  m.refr_term
from last_touch_event_ids_calculated r
right join derived.marketing_touches m
on r.last_marketing_event_id = m.event_id  
  -- only perform the join for the last touch event
where r.event_type = 'revenue event'    
  -- only fetch revenue events from the last_touch_event_ids_calculated table

The above query will generate a result set with one line of data per revenue event, and each line including the marketing data associated with the last channel that each user engaged with prior before the revenue event occurred.

4 Likes

Hey @yali , just found this post, good overview. We’re actually in the process of implementing this , but we are aiming for multi-touch attribution. Going to copy/paste an email I sent to support :slight_smile:

I’m hitting an issue where activity from specific domain_userid value have NO records where refr_medium != ‘internal’ . Can this occur? My impression is that, at the very least, all sessions from a domain_userid should be direct (no traffic source). I can possibly see this occurring if a user deletes cookings mid-session, but the amount of instances this is occurring too high.

Background:

I’m in the process of implementing attribution models, similar to what Yali posted here: First and last touch attribution models in SQL [tutorial]

My approach is different in that we are are aiming for multi-touch attribution: everything in between first and last touch (relative to an order event). I’ve worked through most of the logic and hitting issues with what I think may be due to the method of identity stitching, borrowed from to this post: Identifying users (identity stitching)

High level process:

  1. Create mapping list of user_id (from user context data) and all associated domain_userid
  2. Create marketing touch points - Exclude ‘event=page_ping’ and Internal IP addresses
  3. Join marketing touch points with user_id / domain_userid mapping
  4. De-dup records that have the same referring/marketing
  5. Join order data on user_id

At this point, there are marketing touch points from Step 2 where a domain_userid only has 'refr=‘internal’ records

On the marketing touch points SQL from this post:

where refr_medium != ‘internal’
and refr_medium is not null

Would this exclude Direct traffic? I had thought that refr_medium != ‘internal’ was sufficient to exclude events/clicks

Hi @crizposadas,

If a user comes directly to your website then the refr_medium field will be null. So you will be excluding direct traffic if you include the

where refr_medium is not null

condition in your statement.

If you remove that condition, does it resolve the issue? (I.e. are those users that have come to your website directly?)

Hey @yali,

I don’t think this would match null values (in order to include Direct traffic):

refr_medium != ‘internal’

Using this appears to work:

(refr_medium is NULL OR refr_medium != ‘internal’)

Hi @crizposadas,

That’s indeed correct. Filtering on = 'internal' or != 'internal' will leave out NULL values (and therefore exclude direct traffic).

Christophe

Thanks @christophe and @yali, updating the where clause fixed the issue. As an aside, domain_userid can also be null for mobile application events, so I included filtering to exclude those records.

Has anyone experience writing queries that build up multi-touch/linear/time-decay based attribution models?

I’d be interested in looking at some of these and seeing how we can manipulate these to suit our clients’ users.

JP

Hi @yali,

first, thank you for awesome tutorial.

while reading last touch attribution model part, i got a question.

Q. why do not use same strategy for getting last_touch_event_ids_calculated like first touch attribution model? i think(as newbie), it looks like can apply same strategy(first of all… postgresql have no ‘ignore nulls’ :frowning: )

now, I knew why but how implement ‘ignore nulls’ in postgresql in last touch attribution model


Anyway, put why for anyone who has same question.

first touch is always unique, but last touch can be multiple in a domain_userid.

for example, assume we have these events
domain_userid1, marketing_touch1
domain_userid1, revenue1
domain_userid1, marketing_touch2
domain_userid1, marketing_touch3
domain_userid1, revenue2

if you use the strategy(in first touch attribution model) to get last touch for ‘revenue1’, you may got ‘marketing_touch3’ not ‘marketing_touch1’ because get last marketing_touch in domain_userid(first touch attribution model strategy)

Hi Yali, this is great. Would you be able to know how to write this where it attributes only sales events within a certain date range of the marketing event. eg, I only want to attribute sales events within 7 days of the last marketing event.

thanks!