Debugging a Serializable isolation violation in Redshift (ERROR: 1023) [tutorial]

Sometimes you might get an error, for example during a Snowplow load into Redshift, which looks like this:

Serializable isolation violation on table and the code ERROR: 1023

This usually means that two or more concurrent transactions are reading and writing the same tables in a way that cannot be reconciled with any serial (one-after-the-other) ordering of those transactions, so Redshift aborts one of them to preserve serializable isolation.

This is a brief tutorial to help you diagnose these issues so that you can fix them.
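If the conflict is happening right now, you can also get a live view of which sessions are holding locks on which tables before digging into the historical system tables. Here is a minimal sketch using Redshift's STV_LOCKS system view (requires superuser access; available columns may vary by Redshift version):

```sql
-- Show which sessions currently hold table locks
SELECT table_id,
       lock_owner_pid,        -- process holding the lock
       lock_status,           -- e.g. 'Holding write lock'
       lock_owner_start_ts    -- when the owning transaction began
FROM stv_locks
ORDER BY lock_owner_start_ts;
```

Long-held write locks on the tables named in the error message are a strong hint about which process is blocking which.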

Identifying the conflicting transactions

Luckily the error message mentions the transaction IDs which are in conflict with each other. Here is an example of such an error message:

ERROR: 1023 Detail: Serializable isolation violation on table - 123456,
transactions forming the cycle are: 3605921, 3605926 (pid:15192)
/opt/snowplow-r77-great-auk/snowplow-storage-loader!/storage-loader/lib/snowplow-storage-loader/redshift_loader.rb:88:in
`load_events_and_shredded_types' file:/opt/snowplow-r77-great-auk/snowplow-storage-loader!/storage-loader/bin/snowplow-storage-loader:54:in
`(root)' org/jruby/RubyArray.java:1613:in `each' file:/opt/snowplow-r77-great-auk/snowplow-storage-loader!/storage-loader/bin/snowplow-storage-loader:51:in `(root)' org/jruby/RubyKernel.java:1091:in `load'`

The two processes that are causing the violation are mentioned in this part of the error message:

transactions forming the cycle are: 3605921, 3605926

The first step for debugging is to identify the SQL queries which make up these transactions. This will help you understand what is happening and what you can do to fix it.

Reviewing each transaction

For each process execute this query:

SELECT trim(querytxt) AS query FROM stl_query WHERE xid = 'Add transaction ID here';

So, for the first transaction the query becomes this:

SELECT trim(querytxt) AS query FROM stl_query WHERE xid = '3605921';

Let’s assume the results of the query look something like this:

    query 
    ------------------------------------------------------------------
     -- (a) list all root_id that occur more than once in the target table CREATE TABLE duplicates.tmp_com_company_share_1_id DISTKEY (root_id) SORTKEY (root_id) AS (SELECT root_id FROM (SELECT root_id, COUNT(*) AS count FROM atomic.com_company_share_1 GROUP BY 1) WHERE count > 1); 
     padb_fetch_sample: select count(*) from tmp_com_company_share_1_id 
     padb_fetch_sample: select * from tmp_com_company_share_1_id 
     -- (b) create a new table with these events and deduplicate as much as possible using GROUP BY CREATE TABLE duplicates.tmp_com_company_share_1 DISTKEY (root_id) SORTKEY (root_id) AS ( SELECT schema_vendor, schema_name, schema_format, schema_version, root_id, MIN(root_tstamp), -- keep the earliest event ref_root, ref_tree, ref_parent, action, network, target, label, result_score FROM atomic.com_company_share_1 WHERE root_id IN (SELECT root_id FROM duplicates.tmp_com_company_share_1_id) GROUP BY 1,2,3,4,5,7,8,9,10,11,12,13,14 ); 
     padb_fetch_sample: select count(*) from tmp_com_company_share_1 
     padb_fetch_sample: select * from tmp_com_company_share_1 
     padb_fetch_sample: select * from tmp_com_company_share_1 
     DELETE FROM atomic.com_company_share_1 WHERE root_id IN (SELECT root_id FROM duplicates.tmp_com_company_share_1_id); 
     INSERT INTO atomic.com_company_share_1 (SELECT * FROM duplicates.tmp_com_company_share_1); 
    (9 rows)

Interesting - it looks like this transaction is part of some kind of SQL data modeling process, in this case deduplicating events. In any case it's certainly not part of our Snowplow load, because it performs no COPY statements into atomic.events and its associated tables.

Let’s have a look at the query for the second process:

SELECT trim(querytxt) AS query FROM stl_query WHERE xid = '3605926';

Let’s assume the result is something like this:

    query 
    ------------------------------------------------------------------
     Count rows in target table 
     COPY atomic.events FROM 's3://snowplow-company-enrichment-output/shredded/good/run=2016-07-14-09-02-37/atomic-events' CREDENTIALS '' REGION AS 'us-east-1' DELIMITER '\\t' MAXERROR 1000 EMPTYASNULL FILLRECORD TRUNCATECOLUMNS TIMEFORMAT 'auto' ACCEPTINVCHARS 
     Count rows in target table 
     COPY atomic.com_company_post_context_1 FROM 's3://snowplow-company-enrichment-output/shredded/good/run=2016-07-14-09-02-37/com.company/post_c...' CREDENTIALS '' JSON AS 's3://iglu-company-jsonpaths/jsonpaths/com.company/post_context_1.json' REGION AS 'us-east-1' MAXERROR 1000 TRUNCATECOLUMNS TIMEFORMAT 'auto' ACCEPTINVCHARS 
     Count rows in target table 
     COPY atomic.com_company_event_1 FROM 's3://snowplow-company-enrichment-output/shredded/good/run=2016-07-14-09-02-37/com.company/event_1...' CREDENTIALS '' JSON AS 's3://iglu-company-jsonpaths/jsonpaths/com.company/event_1.json' REGION AS 'us-east-1' MAXERROR 1000 TRUNCATECOLUMNS TIMEFORMAT 'auto' ACCEPTINVCHARS 
     Count rows in target table 
     COPY atomic.com_company_event_output_1 FROM 's3://snowplow-company-enrichment-output/shredded/good/run=2016-07-14-09-02-37/com.company/event_r...' CREDENTIALS '' JSON AS 's3://iglu-company-jsonpaths/jsonpaths/com.company/event_output_1.json' REGION AS 'us-east-1' MAXERROR 1000 TRUNCATECOLUMNS TIMEFORMAT 'auto' ACCEPTINVCHARS 
     Count rows in target table 
     COPY atomic.com_company_share_1 FROM 's3://snowplow-company-enrichment-output/shredded/good/run=2016-07-14-09-02-37/com.company/share/...' CREDENTIALS '' JSON AS 's3://iglu-company-jsonpaths/jsonpaths/com.company/share_1.json' REGION AS 'us-east-1' MAXERROR 1000 TRUNCATECOLUMNS TIMEFORMAT 'auto' ACCEPTINVCHARS 
     Undoing 1 transactions on table 108281 with current xid 3605926 : 3605926 
     Undoing 1 transactions on table 108267 with current xid 3605926 : 3605926 
     Undoing 1 transactions on table 108271 with current xid 3605926 : 3605926 
     Undoing 1 transactions on table 108273 with current xid 3605926 : 3605926 
    (14 rows)

Right - here we can clearly see the COPY statements into atomic.events and other Snowplow-specific tables: this is the Snowplow load process, and the data modeling transaction identified above is the one interfering with it. The "Undoing ... transactions" entries show Redshift rolling back this transaction's changes after aborting it.
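Note that the error message and the "Undoing ... transactions" entries refer to tables by numeric ID (e.g. 123456 or 108281) rather than by name. You can resolve an ID to a table name with a query like the following, using Redshift's SVV_TABLE_INFO system view (substitute the table ID from your own error message):

```sql
-- Translate a numeric table ID from the error message into schema.table
SELECT "schema", "table"
FROM svv_table_info
WHERE table_id = 108281;
```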

Note that if you require more detail on each transaction, try this query (using the xid from the previous example):

SELECT xid, starttime, endtime, aborted, userid, trim(querytxt) AS query FROM stl_query WHERE xid = '3605926';
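Redshift also logs every serializability conflict it detects to the STL_TR_CONFLICT system table, which records which transaction was aborted, when, and on which table. Querying it (as a superuser) is a quick cross-check on the picture assembled above:

```sql
-- One row per aborted transaction; xact_id matches the xid values above
SELECT xact_id, process_id, xact_start_ts, abort_time, table_id
FROM stl_tr_conflict
WHERE xact_id = 3605926;
```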

Fixing the conflict

What we see here is that a SQL data modeling process is trying to run while the Snowplow load is also running, leading to the Serializable isolation violation error.

The solution in this case is to connect your load process and SQL data modeling into a single sequential flow, so that they never compete for the same tables. An easy way to do this is to create a Makefile which runs the Snowplow EmrEtlRunner, then StorageLoader, and finally the SQL data modeling step using SQL Runner.
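As a sketch, such a Makefile might chain the three steps with dependencies, so that each step only starts after the previous one succeeds (the paths, config file names and playbook name here are hypothetical; adjust them to your own deployment):

```makefile
# Run the pipeline sequentially: each target depends on the previous one,
# so the SQL data modeling never overlaps with the Redshift load.
.PHONY: all enrich load model

all: model

enrich:
	./snowplow-emr-etl-runner --config config/config.yml

load: enrich
	./snowplow-storage-loader --config config/config.yml

model: load
	./sql-runner -playbook playbooks/data-modeling.yml
```

Running `make` then executes enrich, load and model in strict order, which removes the concurrency that caused the isolation violation.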

For more on this approach, check out our blog post Orchestrating batch processing pipelines with cron and make.