When running a data model in SQL using SQL Runner, it’s useful to know how long each step takes on average, and how this varies with the number of events. It’s possible to use STL_QUERY to get this information, but the STL tables don’t keep data for more than a couple of days, and querytxt is not the nicest field to parse.
An alternative approach is to track time as part of the SQL data model. Below is one possible implementation.
Step 0: create a schema and table to store the cleaned-up timing data
CREATE SCHEMA IF NOT EXISTS datamodeling;
CREATE TABLE datamodeling.queries (
min_tstamp timestamp encode lzo,
max_tstamp timestamp encode lzo,
model_name varchar(255) encode lzo,
step_name varchar(255) encode lzo,
tstamp timestamp encode lzo,
duration integer encode lzo
)
DISTSTYLE KEY
DISTKEY (step_name)
SORTKEY (tstamp);
Step 1: before the data model runs, create a table to track the time
CREATE SCHEMA IF NOT EXISTS scratch;
DROP TABLE IF EXISTS scratch.queries;
CREATE TABLE scratch.queries (
model_name varchar(255) encode lzo,
step_name varchar(255) encode lzo,
tstamp timestamp encode lzo
)
DISTSTYLE KEY
DISTKEY (model_name)
SORTKEY (tstamp);
INSERT INTO scratch.queries (SELECT 'start', '', GETDATE()); -- record the start time
Step 2: after each step (or query), track the time
...
INSERT INTO scratch.queries (SELECT 'model_1', 'step 1', GETDATE()); -- track time
...
INSERT INTO scratch.queries (SELECT 'model_1', 'step 2', GETDATE()); -- track time
...
INSERT INTO scratch.queries (SELECT 'model_2', 'step 1', GETDATE()); -- track time
Step 3: when the data model is done, insert the results into the final table
INSERT INTO datamodeling.queries (
SELECT
min_tstamp,
max_tstamp,
model_name,
step_name,
tstamp,
EXTRACT(EPOCH FROM (tstamp - previous_tstamp)) AS duration
FROM (
SELECT
FIRST_VALUE(tstamp) OVER (ORDER BY tstamp ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS min_tstamp,
LAST_VALUE(tstamp) OVER (ORDER BY tstamp ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS max_tstamp,
model_name,
step_name,
tstamp,
LAG(tstamp, 1) OVER (ORDER BY tstamp) AS previous_tstamp
FROM scratch.queries
ORDER BY tstamp
)
WHERE model_name != 'start'
ORDER BY tstamp
);
DROP TABLE IF EXISTS scratch.queries;
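The window-function logic in Step 3 can be sanity-checked outside Redshift. Below is a small standalone sketch that replays it in SQLite (via Python’s sqlite3 module), whose LAG semantics are close enough for a demo; strftime('%s', …) stands in for Redshift’s EXTRACT(EPOCH FROM …). The sample timestamps are the ones from the example output in Step 4.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE queries (model_name TEXT, step_name TEXT, tstamp TEXT)")
rows = [
    ("start",   "",       "2016-06-08 22:26:52"),
    ("model_1", "step 1", "2016-06-08 22:27:20"),
    ("model_1", "step 2", "2016-06-08 22:27:51"),
    ("model_2", "step 1", "2016-06-08 22:27:57"),
]
conn.executemany("INSERT INTO queries VALUES (?, ?, ?)", rows)

# Same shape as the Step 3 query: LAG fetches the previous checkpoint, and
# the difference between consecutive checkpoints is the step duration in
# seconds. The 'start' row only serves as the baseline for the first step.
result = conn.execute("""
    SELECT model_name, step_name,
           CAST(strftime('%s', tstamp) AS INTEGER)
         - CAST(strftime('%s', previous_tstamp) AS INTEGER) AS duration
    FROM (
        SELECT model_name, step_name, tstamp,
               LAG(tstamp) OVER (ORDER BY tstamp) AS previous_tstamp
        FROM queries
    )
    WHERE model_name != 'start'
    ORDER BY tstamp
""").fetchall()
print(result)
```

Running this reproduces the durations in the example output: 28, 31 and 6 seconds.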
Step 4: see the results
SELECT * FROM datamodeling.queries ORDER BY 1,5;
Example output:
min_tstamp max_tstamp model_name step_name tstamp duration
2016-06-08 22:26:52 2016-06-08 22:27:57 model_1 step 1 2016-06-08 22:27:20 28
2016-06-08 22:26:52 2016-06-08 22:27:57 model_1 step 2 2016-06-08 22:27:51 31
2016-06-08 22:26:52 2016-06-08 22:27:57 model_2 step 1 2016-06-08 22:27:57 6
It’s possible to extend the SQL to capture additional information, such as the total number of events in atomic.events at that point in time, or the number of events in the most recent batch.
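As a sketch of that extension, each checkpoint could record the table size alongside the timestamp. The extra event_count column and the stand-in events table below are assumptions for illustration (SQLite again stands in for Redshift and atomic.events):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical extended tracking table: one extra event_count column.
conn.execute(
    "CREATE TABLE queries (model_name TEXT, step_name TEXT, tstamp TEXT, event_count INTEGER)"
)
conn.execute("CREATE TABLE events (event_id INTEGER)")  # stand-in for atomic.events
conn.executemany("INSERT INTO events VALUES (?)", [(i,) for i in range(1000)])

# Each checkpoint now records the event count at that moment, so step
# duration can later be plotted against event volume.
conn.execute("""
    INSERT INTO queries
    SELECT 'model_1', 'step 1', datetime('now'), (SELECT COUNT(*) FROM events)
""")
count = conn.execute("SELECT event_count FROM queries").fetchone()[0]
print(count)  # 1000
```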
Example
Below is an example from a Snowplow user. Each bar represents one run, and time is measured in minutes.
This particular user started out recomputing a set of derived tables from all historical data each time the pipeline ran, as most Snowplow users do. As one would expect, the time it took to run the model scaled roughly linearly with the number of events in atomic.events. This became a problem after a couple of very successful days, when the total number of events more than tripled.
We worked with this Snowplow user to make their data model incremental. This happened in two stages: the first brought the time down to 40 minutes per run, the second to 10 minutes per run. This user now has close to 2 billion events in their Redshift cluster, and their derived tables are updated in a flat 10 minutes per run.
For more information on incremental data modeling, check out this post: