Tracking how SQL data models perform [tutorial]


#1

When running a data model in SQL using SQL Runner, it’s useful to know how long each step takes on average, and how this varies with the number of events. It’s possible to use STL_QUERY to get this information, but the STL tables don’t keep data for more than a couple of days, and querytxt is not the nicest field to parse.

An alternative approach is to track time as part of the SQL data model. Below is one possible implementation.

Step 0: create a schema and table to store the cleaned up timing data

CREATE SCHEMA IF NOT EXISTS datamodeling;

CREATE TABLE datamodeling.queries (
  min_tstamp timestamp encode lzo,
  max_tstamp timestamp encode lzo,
  model_name varchar(255) encode lzo,
  step_name varchar(255) encode lzo,
  tstamp timestamp encode lzo,
  duration integer encode lzo
)
DISTSTYLE KEY
DISTKEY (step_name)
SORTKEY (tstamp);

Step 1: before the data model runs, create a table to track the time

CREATE SCHEMA IF NOT EXISTS scratch;

DROP TABLE IF EXISTS scratch.queries;
CREATE TABLE scratch.queries (
  model_name varchar(255) encode lzo,
  step_name varchar(255) encode lzo,
  tstamp timestamp encode lzo
)
DISTSTYLE KEY
DISTKEY (model_name)
SORTKEY (tstamp);

INSERT INTO scratch.queries (SELECT 'start', '', GETDATE()); -- track time a first time

Step 2: after each step (or query), track the time

...

INSERT INTO scratch.queries (SELECT 'model_1', 'step 1', GETDATE()); -- track time

...

INSERT INTO scratch.queries (SELECT 'model_1', 'step 2', GETDATE()); -- track time

...

INSERT INTO scratch.queries (SELECT 'model_2', 'step 1', GETDATE()); -- track time

Step 3: when the data model is done, insert the results in the final table

INSERT INTO datamodeling.queries (

  SELECT
    min_tstamp,
    max_tstamp,
    model_name,
    step_name,
    tstamp,
    EXTRACT(EPOCH FROM (tstamp - previous_tstamp)) AS duration
  FROM (
    SELECT
      FIRST_VALUE(tstamp) OVER (ORDER BY tstamp ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS min_tstamp,
      LAST_VALUE(tstamp) OVER (ORDER BY tstamp ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS max_tstamp,
      model_name,
      step_name,
      tstamp,
      LAG(tstamp, 1) OVER (ORDER BY tstamp) AS previous_tstamp
    FROM scratch.queries
    ORDER BY tstamp
  )
  WHERE model_name != 'start'
  ORDER BY tstamp

);

DROP TABLE IF EXISTS scratch.queries;

Step 4: see the results

SELECT * FROM datamodeling.queries ORDER BY 1,5

Example output:

min_tstamp		max_tstamp		model_name	step_name	tstamp			duration
2016-06-08 22:26:52	2016-06-08 22:27:57	model_1		step 1		2016-06-08 22:27:20	28
2016-06-08 22:26:52	2016-06-08 22:27:57	model_1		step 2		2016-06-08 22:27:51	31
2016-06-08 22:26:52	2016-06-08 22:27:57	model_2		step 1		2016-06-08 22:27:57	6

It’s possible to extend the SQL to capture additional information, such as the total number of events in atomic.events at the moment, or the number of events in the most recent batch.

Example

Below is an example from a Snowplow user. Each bar represents one run, and time is measured in minutes.

This particular user started out with recomputing a set of derived tables using all historical data each time the pipeline ran, as most Snowplow users do. As one would expect, the time it took to run the model scaled roughly linearly with the number of events in atomic.events. This became a problem after a couple of very successful days, when the total number of events more than tripled.

We worked with this Snowplow user to make their data model incremental. This happened in 2 stages - the first brought the time down to 40 minutes per run, the second to 10 minutes per run. This user now has close to 2 billion events in their Redshift cluster, and their derived tables are updated in a flat 10 minutes per run.

For more information on incremental data modeling, check out this post:

http://discourse.snowplowanalytics.com/t/making-sql-data-models-incremental-to-improve-performance-tutorial/319