Measuring page load times with the performance timing context [tutorial]


#1

We all want our websites to be fast, especially now that mobile has become the dominant platform. Research by Google has shown that speed does indeed matter. When pages take a long time to load, users will get frustrated and leave. However, measuring and troubleshooting page load times is often easier said than done. The application design and browser version matter, but so do things that are harder to control, such as unexpected server load, poor network conditions, and unique client-side environments.

With Snowplow, it’s possible to measure how a website performs in the real world. Our JavaScript tracker uses the Navigation Timing API to capture 23 attributes that break the loading process down into a set of milestones. This data is stored in the Performance Timing context, which is sent with all events that happen on the page.

In this post, I’ll explain how to use Navigation Timing data.

Understanding the Navigation Timing API

The Navigation Timing API is supported by all major browsers and records the time when certain milestones in the navigation and load process occur. The various milestones are illustrated in this graph:

Each attribute captures when a navigation milestone (e.g. when was the page requested?) or page load milestone (e.g. when did the DOM start loading?) occurred. Note that:

  • time is measured in milliseconds since midnight of 1 January 1970 (UTC);
  • attributes are set to zero until the corresponding milestone has occurred;
  • browsers might do additional internal processing between milestones.
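As a sketch of how these attributes combine into durations, consider the snippet below. The timing values are made up for illustration; in the browser you would read them from `window.performance.timing` directly. Note the guard against zero values, since a zero attribute means the milestone has not occurred yet.

```javascript
// Hypothetical snapshot of window.performance.timing (ms since midnight 1 Jan 1970 UTC).
const timing = {
  navigationStart: 1500000000000,
  fetchStart: 1500000000005,
  domainLookupStart: 1500000000010,
  domainLookupEnd: 1500000000030,
  connectStart: 1500000000030,
  connectEnd: 1500000000070,
  requestStart: 1500000000071,
  responseStart: 1500000000150,
  responseEnd: 1500000000200,
  domInteractive: 1500000000400,
  domComplete: 1500000000800,
  loadEventEnd: 1500000000820
};

// Derive an interval between two milestones, returning null if either
// milestone has not occurred yet (attributes are zero until then).
function interval(start, end) {
  return (start > 0 && end > 0) ? end - start : null;
}

const dnsLookupMs = interval(timing.domainLookupStart, timing.domainLookupEnd); // 20
const networkMs   = interval(timing.fetchStart, timing.responseEnd);            // 195
const totalMs     = interval(timing.navigationStart, timing.loadEventEnd);      // 820

console.log({ dnsLookupMs, networkMs, totalMs });
```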

Enabling the Performance Timing context

The JavaScript tracker has the option to store Navigation Timing data in a predefined context. You will need to enable the Performance Timing context, and we also recommend enabling the Web Page context:

'contexts': {
  'webPage': true,
  'performanceTiming': true
}

Make sure to also create the corresponding tables in Redshift.
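For context, the `contexts` object above is passed as part of the tracker initialisation. A minimal sketch is shown below; the queue function stub stands in for what the standard sp.js loader snippet creates, and the collector endpoint and `appId` are placeholders for your own values.

```javascript
// Stand-in for the queue function that the standard sp.js loader snippet
// defines ("snowplow_name_here" is whatever name you chose in that snippet).
function snowplow_name_here() {
  (snowplow_name_here.q = snowplow_name_here.q || []).push(arguments);
}

// Hypothetical initialisation; 'collector.example.com' and 'my-site'
// are placeholders.
snowplow_name_here('newTracker', 'sp', 'collector.example.com', {
  appId: 'my-site',
  contexts: {
    webPage: true,
    performanceTiming: true
  }
});

snowplow_name_here('trackPageView');
```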

If the Performance Timing context is enabled, the JavaScript Tracker will create a context JSON from the window.performance.timing object, along with the Chrome firstPaintTime field (renamed to chromeFirstPaint) if it exists.

The redirectStart, redirectEnd, and secureConnectionStart attributes are set to zero if there is no redirect, or if a secure connection is not requested. Moreover, if you fire a page view event as soon as the page loads, the domComplete, loadEventStart, loadEventEnd, and chromeFirstPaint attributes might still be set to zero. This is because those properties are not known until all scripts are done executing, including sp.js. To circumvent this limitation, one could wrap Snowplow (and other) code in a setTimeout call:

setTimeout(function () {

  // Load Snowplow and call tracking methods here

}, 0);

This will, however, also reduce the number of events that are sent. If a visitor leaves a page before it has finished loading, no events will have been sent and the visit will go unrecorded.

Instead, we can use page ping events to complete the picture. The initial page view event might be missing some attributes if the page was still loading, but the Performance Timing context is updated each time a page ping (or heartbeat) is sent. If a user remains on the page long enough for a page ping to be sent, we will receive the updated attributes. If not, we will still be able to infer where in the loading process the user left.
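That last inference can be sketched as follows: walking the milestones in chronological order, the last non-zero attribute tells us roughly how far the page got before the user left. The field names follow `window.performance.timing`; the sample values are made up for illustration.

```javascript
// Milestones in rough chronological order (a simplified subset).
const milestones = [
  'navigationStart', 'fetchStart', 'responseStart', 'responseEnd',
  'domLoading', 'domInteractive', 'domComplete', 'loadEventEnd'
];

// Return the name of the last milestone that has a non-zero timestamp.
function lastReachedMilestone(timing) {
  let last = null;
  for (const name of milestones) {
    if (timing[name] > 0) last = name;
  }
  return last;
}

// Example: a page view whose load event never fired.
const partialTiming = {
  navigationStart: 1500000000000,
  fetchStart: 1500000000005,
  responseStart: 1500000000150,
  responseEnd: 1500000000200,
  domLoading: 1500000000210,
  domInteractive: 1500000000400,
  domComplete: 0,   // not yet occurred
  loadEventEnd: 0   // not yet occurred
};

console.log(lastReachedMilestone(partialTiming)); // "domInteractive"
```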

Sending page ping events gives us more data to work with, but it also increases overall event volumes and the number of requests that are made to the collector. It’s a trade-off, one that requires careful consideration. If measuring page performance is an important goal, we recommend these settings:

snowplow_name_here('enableActivityTracking', 5, 10);

The first page ping is sent 5 seconds after the page view event. Subsequent page pings are sent at 10-second intervals. If measuring page performance is less important, we recommend increasing these intervals.
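Assuming the schedule described above (first ping after the minimum visit length, then one per heartbeat interval), the ping times work out as in this small sketch:

```javascript
// Compute when page pings fire, in seconds after the page view event,
// given enableActivityTracking(minVisitLength, heartbeat).
function pingTimes(minVisitLength, heartbeat, upTo) {
  const times = [];
  for (let t = minVisitLength; t <= upTo; t += heartbeat) {
    times.push(t);
  }
  return times;
}

// With enableActivityTracking(5, 10), pings in the first minute fire at:
console.log(pingTimes(5, 10, 60)); // [5, 15, 25, 35, 45, 55]
```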

Aggregating the data in Redshift

Let’s start with joining the relevant tables:

WITH basic AS (
  SELECT

    b.id, -- page view UUID requires the webPage context
    a.derived_tstamp, -- requires JS tracker 2.6.0 or later
    a.page_urlhost,
    a.page_urlpath,
    a.event,

    -- dimensions

    a.br_name,
    a.dvce_ismobile,

    -- performance timing

    c.navigation_start,
    c.redirect_start,
    c.redirect_end,
    c.fetch_start,
    c.domain_lookup_start,
    c.domain_lookup_end,
    c.secure_connection_start,
    c.connect_start,
    c.connect_end,
    c.request_start,
    c.response_start,
    c.response_end,
    c.unload_event_start,
    c.unload_event_end,
    c.dom_loading,
    c.dom_interactive,
    c.dom_content_loaded_event_start,
    c.dom_content_loaded_event_end,
    c.dom_complete,
    c.load_event_start,
    c.load_event_end

  FROM atomic.events AS a

  INNER JOIN atomic.com_snowplowanalytics_snowplow_web_page_1 AS b
    ON  a.event_id = b.root_id
    AND a.collector_tstamp = b.root_tstamp

  INNER JOIN atomic.org_w3_performance_timing_1 AS c
    ON  a.event_id = c.root_id
    AND a.collector_tstamp = c.root_tstamp

  WHERE a.event IN ('page_view','page_ping')
    AND a.br_type IN ('Browser', 'Browser (mobile)') -- exclude bots

    -- remove unexpected values (affects about 1% of rows)

    AND c.navigation_start IS NOT NULL AND c.navigation_start > 0
    AND c.redirect_start IS NOT NULL -- zero is OK
    AND c.redirect_end IS NOT NULL -- zero is OK
    AND c.fetch_start IS NOT NULL AND c.fetch_start > 0
    AND c.domain_lookup_start IS NOT NULL AND c.domain_lookup_start > 0
    AND c.domain_lookup_end IS NOT NULL AND c.domain_lookup_end > 0
    AND c.secure_connection_start IS NOT NULL AND c.secure_connection_start > 0
     -- connect_start is either 0 or NULL
    AND c.connect_end IS NOT NULL AND c.connect_end > 0
    AND c.request_start IS NOT NULL AND c.request_start > 0
    AND c.response_start IS NOT NULL AND c.response_start > 0
    AND c.response_end IS NOT NULL AND c.response_end > 0 AND DATEDIFF(d, a.derived_tstamp, (TIMESTAMP 'epoch' + c.response_end/1000 * INTERVAL '1 second ')) < 365
    AND c.unload_event_start IS NOT NULL AND DATEDIFF(d, a.derived_tstamp, (TIMESTAMP 'epoch' + c.unload_event_start/1000 * INTERVAL '1 second ')) < 365 -- zero is OK
    AND c.unload_event_end IS NOT NULL AND DATEDIFF(d, a.derived_tstamp, (TIMESTAMP 'epoch' + c.unload_event_end/1000 * INTERVAL '1 second ')) < 365 -- zero is OK
    AND c.dom_loading IS NOT NULL AND c.dom_loading > 0
    AND c.dom_interactive IS NOT NULL AND c.dom_interactive > 0
    AND c.dom_content_loaded_event_start IS NOT NULL AND c.dom_content_loaded_event_start > 0
    AND c.dom_content_loaded_event_end IS NOT NULL AND c.dom_content_loaded_event_end > 0
    AND c.dom_complete IS NOT NULL -- zero is OK
    AND c.load_event_start IS NOT NULL -- zero is OK
    AND c.load_event_end IS NOT NULL -- zero is OK

  ORDER BY 1,2)

It’s recommended to add more dimensions. Examples include the ISP, location, and various browser- and device-related dimensions. For all available dimensions and measures, check out our canonical event model.

The next step is aggregating page view and page ping events (both are micro events) into page views (a macro event). This distinction is explained in more detail in this blog post on event data modeling.

DROP TABLE IF EXISTS derived.performance_timing;
CREATE TABLE derived.performance_timing
  DISTKEY(id)
  SORTKEY(tstamp)
AS (

  WITH basic AS (...)

  SELECT

    id,
    MIN(derived_tstamp) AS tstamp,
    page_urlhost,
    page_urlpath,

    SUM(CASE WHEN event = 'page_view' THEN 1 ELSE 0 END) AS pv_count,
    SUM(CASE WHEN event = 'page_ping' THEN 1 ELSE 0 END) AS pp_count,

    CASE
      WHEN DATEDIFF(s, MIN(derived_tstamp), MAX(derived_tstamp)) < 10
      THEN ROUND(DATEDIFF(s, MIN(derived_tstamp), MAX(derived_tstamp))/5)*5
      ELSE ROUND(DATEDIFF(s, MIN(derived_tstamp), MAX(derived_tstamp))/10)*10
    END AS time_on_page,

    br_name,
    dvce_ismobile,

    -- select the first non-zero value

    MIN(NULLIF(navigation_start, 0)) AS navigation_start,
    MIN(NULLIF(redirect_start, 0)) AS redirect_start,
    MIN(NULLIF(redirect_end, 0)) AS redirect_end,
    MIN(NULLIF(fetch_start, 0)) AS fetch_start,
    MIN(NULLIF(domain_lookup_start, 0)) AS domain_lookup_start,
    MIN(NULLIF(domain_lookup_end, 0)) AS domain_lookup_end,
    MIN(NULLIF(secure_connection_start, 0)) AS secure_connection_start,
    MIN(NULLIF(connect_start, 0)) AS connect_start,
    MIN(NULLIF(connect_end, 0)) AS connect_end,
    MIN(NULLIF(request_start, 0)) AS request_start,
    MIN(NULLIF(response_start, 0)) AS response_start,
    MIN(NULLIF(response_end, 0)) AS response_end,
    MIN(NULLIF(unload_event_start, 0)) AS unload_event_start,
    MIN(NULLIF(unload_event_end, 0)) AS unload_event_end,
    MIN(NULLIF(dom_loading, 0)) AS dom_loading,
    MIN(NULLIF(dom_interactive, 0)) AS dom_interactive,
    MIN(NULLIF(dom_content_loaded_event_start, 0)) AS dom_content_loaded_event_start,
    MIN(NULLIF(dom_content_loaded_event_end, 0)) AS dom_content_loaded_event_end,
    MIN(NULLIF(dom_complete, 0)) AS dom_complete,
    MIN(NULLIF(load_event_start, 0)) AS load_event_start,
    MIN(NULLIF(load_event_end, 0)) AS load_event_end

  FROM basic

  GROUP BY 1,3,4,8,9
  ORDER BY 2

);

Note that:

  • unload_event_start often occurs before response_end (in ~ 30% of page views)
  • dom_loading sometimes starts before unload_event_end (in ~ 10% of page views)

It’s also not guaranteed that other milestone timestamps will be in the correct order. The tracker captures the data it gets from the browser, but, in rare cases, a later milestone will have an earlier timestamp. This affects few page views (< 1%), so it’s enough to just be aware that this can happen.
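As a sketch, one could check records for out-of-order milestones before trusting the timings. The helper below (attribute names follow `window.performance.timing`; the sample values are illustrative) verifies that the non-zero milestones are non-decreasing in the expected order:

```javascript
// Expected chronological order (a simplified subset).
const ORDERED = [
  'navigationStart', 'fetchStart', 'responseEnd',
  'unloadEventEnd', 'domLoading', 'domComplete'
];

// Check that non-zero milestones never decrease; zero means
// "not occurred" and is skipped.
function isMonotonic(timing) {
  let prev = 0;
  for (const name of ORDERED) {
    const t = timing[name];
    if (t > 0) {
      if (t < prev) return false;
      prev = t;
    }
  }
  return true;
}

// Example of the dom_loading-before-unload_event_end case noted above:
const suspect = {
  navigationStart: 1500000000000,
  fetchStart: 1500000000005,
  responseEnd: 1500000000200,
  unloadEventEnd: 1500000000260,
  domLoading: 1500000000210, // earlier than unloadEventEnd
  domComplete: 1500000000800
};

console.log(isMonotonic(suspect)); // false
```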

The estimates given above are all for the Snowplow website.

Exploring the data

Let’s run some queries!

Percentage of page views that completed loading

SELECT

  ROUND(SUM(CASE WHEN dom_complete > 0 THEN 1 ELSE 0 END)/COUNT(*)::FLOAT, 4) AS "all", -- quoted: ALL is a reserved word
  ROUND(SUM(CASE WHEN dom_complete > 0 AND time_on_page >= 5 THEN 1 ELSE 0 END)/SUM(CASE WHEN time_on_page >= 5 THEN 1 ELSE 0 END)::FLOAT, 4) AS at_least_5s,
  ROUND(SUM(CASE WHEN dom_complete > 0 AND time_on_page >= 10 THEN 1 ELSE 0 END)/SUM(CASE WHEN time_on_page >= 10 THEN 1 ELSE 0 END)::FLOAT, 4) AS at_least_10s

FROM derived.performance_timing

For the Snowplow website, this returns the following results:

  • all pages: 70.62%
  • at least 5 seconds: 99.25%
  • at least 10 seconds: 99.36%

Looking at all page views, we find that the DOM completed loading in 70% of page views. If we restrict to page views where at least one page ping was sent (i.e. the user was on the page for at least 5 seconds), this number goes up to 99%.

An alternative would be to use load_event_end rather than dom_complete. This should return similar results.

Network latency

SELECT

  response_end-fetch_start AS milliseconds,
  SUM(CASE WHEN dvce_ismobile THEN 1 ELSE 0 END) AS mobile,
  SUM(CASE WHEN dvce_ismobile THEN 0 ELSE 1 END) AS not_mobile,
  COUNT(*) AS total

FROM derived.performance_timing

GROUP BY 1
HAVING response_end-fetch_start >= 0
ORDER BY 1
LIMIT 1000

Pages that take the longest to load (on average)

SELECT

  page_urlpath,
  AVG(load_event_end-response_end) AS avg_load_time_ms,
  COUNT(*)

FROM derived.performance_timing

GROUP BY 1
ORDER BY 2 DESC

Capturing domComplete attribute in pageload context
#2

Hi, at Google I/O 2017 there was a very interesting session about measuring individual users’ web performance using the new performance APIs in browsers: https://youtu.be/6Ljq-Jn-EgU?t=1259
Worth looking into.