Merging Columns with a JOIN


#1

Hi all,

I have two separate queries that generate tables that have the following structure:

user_id | 2017_spend | 2017_decile

and

user_id | 2016_spend | 2016_decile

where the decile columns is a 10% percentile grouping, generated using Redshift’s RANK() and NTILE window functions.

Now, I want to join these two into a single table that has every unique User ID in the first column, and their spend and decile information in the following 4 columns. This seems a straightforward task, but I have not been able to get this work as desired. Every JOIN that I do (LEFT, LEFT OUTER, FULL OUTER, INNER) results in a loss of data. Because if someone was a customer in 2016 but not in 2017 (or vice versa), then there will be no match and the JOIN will exclude them. And as I understand, a UNION will simply stack the two tables on top of each other, resulting in duplicate user_id’s.

Am I missing something basic? (It feels like I am) .

Any help let me know!


#2

Hi Jrpeck1989,

My instinct is that a FULL OUTER JOIN should do what you’re trying to do, but perhaps there’s some reason I can’t think of that it doesn’t.

An inelegant but potentially effective solution could be to UNION the tables then group by user_id and use MIN/MAX to aggregate to one row per user.

This assumes you’ve only got one value per user for each of the four columns.

WITH step1 AS(

SELECT 

  user_id,
  2016_spend,
  2016_decile,
  null AS 2017_spend,
  null AS 2017_decile

FROM table_1

UNION

SELECT 

  user_id,
  null AS 2016_spend,
  null AS 2016_decile,
  2017_spend,
  2017_decile

FROM table_2
)

  SELECT 

    user_id,
    MIN(2016_spend),
    MIN(2016_decile),
    MIN(2017_spend),
    MIN(2017_decile)

  FROM step1

Hope this helps!


#3

yup that’s what i do as well


#4

If the problem is duplication then you have about a million ways of dealing with it

WITH duplicates_from_various_sets (
SELECT user from 2017_spend
UNION
SELECT user from 2016_spend
) SELECT distinct user from duplicates_from_various_sets

WITH mock as (
SELECT DISTINCT 1::int2 flag, invoice_id from 2017_spend
UNION DISTINCT
SELECT DISTINCT 2::int2 flag, invoice_id from 2016_spend
)
SELECT invoice_id,
bit_or(flag) – if you need to know which sets was invoice found in 1 - first, 2 - second, 3 - both

You can use this as a skeleton to put all your left outer joins onto… i.e.

– sum(1) should be a faster de-duplicator than DISTINCT, at least in PADB it was
WITH superset_of_invoices as (
SELECT invoice_id, bit_or(flag) as appears_in from (
SELECT 1::int2 flag, invoice_id, sum(1) from 2017_spend group by 1,2
UNION
SELECT 2::int2 flag, invoice_id, sum(1) from 2016_spend group by 1,2
) pile_of_invoices
) SELECT invoice_id,
appears_in,
COALESCE(2017_spend.amount,0) as 2017_spend_amount ,
COALESCE(2017_spend.amount,0) as 2016_spend_amount
FROM superset_of_invoices,
LEFT OUTER JOIN 2017_spend using(invoice_id)
LEFT OUTER JOIN 2016_spend using(invoice_id)

… Many ways to skin this cat. Check what’s faster and cheaper, i guess?


#5

The FULL OUTER JOIN should do what your after without any loss of data (as it will include rows from both tables).

Let’s cover the three possibilities:

  1. Customer 1 occurs in both 2016 and 2017
  2. Customer 2 occurs in 2016 only
  3. Customer 3 occurs in 2017 only

we can mock this out simply with the following dummy values

INSERT INTO test.c_2016 VALUES (1, 1, 1);
INSERT INTO test.c_2016 VALUES(2, 1, 1);
INSERT INTO test.c_2017 VALUES (1, 2, 2);
INSERT INTO test.c_2017 VALUES(3, 1, 1);

we expect to yield 3 rows in a table that looks something similar to

user_id spend_2016 decile_2016 spend_2017 decile_2017
1 1 1 2 2
2 1 1 null null
3 null null 1 1

To get this result we can join the two tables using a FULL OUTER JOIN on user_id which gives the expected result.

SELECT
  COALESCE(c_2016.user_id, c_2017.user_id) AS user_id,
  c_2016.spend_2016,
  c_2016.decile_2016,
  c_2017.spend_2017,
  c_2017.decile_2017
FROM
  test.c_2016
  FULL OUTER JOIN
  test.c_2017 ON
  c_2016.user_id = c_2017.user_id
  ORDER BY user_id ASC