Merging Columns with a JOIN

jrpeck1989 · November 10, 2017, 3:43pm

Hi all,

I have two separate queries that generate tables that have the following structure:

user_id | 2017_spend | 2017_decile

and

user_id | 2016_spend | 2016_decile

where the decile columns is a 10% percentile grouping, generated using Redshift’s RANK() and NTILE window functions.

Now, I want to join these two into a single table that has every unique User ID in the first column, and their spend and decile information in the following 4 columns. This seems a straightforward task, but I have not been able to get this work as desired. Every JOIN that I do (LEFT, LEFT OUTER, FULL OUTER, INNER) results in a loss of data. Because if someone was a customer in 2016 but not in 2017 (or vice versa), then there will be no match and the JOIN will exclude them. And as I understand, a UNION will simply stack the two tables on top of each other, resulting in duplicate user_id’s.

Am I missing something basic? (It feels like I am) .

Any help let me know!

Colm · November 10, 2017, 4:15pm

Hi Jrpeck1989,

My instinct is that a FULL OUTER JOIN should do what you’re trying to do, but perhaps there’s some reason I can’t think of that it doesn’t.

An inelegant but potentially effective solution could be to UNION the tables then group by user_id and use MIN/MAX to aggregate to one row per user.

This assumes you’ve only got one value per user for each of the four columns.

WITH step1 AS(

SELECT 

  user_id,
  2016_spend,
  2016_decile,
  null AS 2017_spend,
  null AS 2017_decile

FROM table_1

UNION

SELECT 

  user_id,
  null AS 2016_spend,
  null AS 2016_decile,
  2017_spend,
  2017_decile

FROM table_2
)

  SELECT 

    user_id,
    MIN(2016_spend),
    MIN(2016_decile),
    MIN(2017_spend),
    MIN(2017_decile)

  FROM step1

Hope this helps!

mjensen · November 10, 2017, 8:57pm

yup that’s what i do as well

dashirov · November 11, 2017, 12:35am

If the problem is duplication then you have about a million ways of dealing with it

WITH duplicates_from_various_sets (
SELECT user from 2017_spend
UNION
SELECT user from 2016_spend
) SELECT distinct user from duplicates_from_various_sets

WITH mock as (
SELECT DISTINCT 1::int2 flag, invoice_id from 2017_spend
UNION DISTINCT
SELECT DISTINCT 2::int2 flag, invoice_id from 2016_spend
)
SELECT invoice_id,
bit_or(flag) – if you need to know which sets was invoice found in 1 - first, 2 - second, 3 - both

You can use this as a skeleton to put all your left outer joins onto… i.e.

– sum(1) should be a faster de-duplicator than DISTINCT, at least in PADB it was
WITH superset_of_invoices as (
SELECT invoice_id, bit_or(flag) as appears_in from (
SELECT 1::int2 flag, invoice_id, sum(1) from 2017_spend group by 1,2
UNION
SELECT 2::int2 flag, invoice_id, sum(1) from 2016_spend group by 1,2
) pile_of_invoices
) SELECT invoice_id,
appears_in,
COALESCE(2017_spend.amount,0) as 2017_spend_amount ,
COALESCE(2017_spend.amount,0) as 2016_spend_amount
FROM superset_of_invoices,
LEFT OUTER JOIN 2017_spend using(invoice_id)
LEFT OUTER JOIN 2016_spend using(invoice_id)

… Many ways to skin this cat. Check what’s faster and cheaper, i guess?

mike · November 11, 2017, 1:45am

The FULL OUTER JOIN should do what your after without any loss of data (as it will include rows from both tables).

Let’s cover the three possibilities:

Customer 1 occurs in both 2016 and 2017
Customer 2 occurs in 2016 only
Customer 3 occurs in 2017 only

we can mock this out simply with the following dummy values

INSERT INTO test.c_2016 VALUES (1, 1, 1);
INSERT INTO test.c_2016 VALUES(2, 1, 1);
INSERT INTO test.c_2017 VALUES (1, 2, 2);
INSERT INTO test.c_2017 VALUES(3, 1, 1);

we expect to yield 3 rows in a table that looks something similar to

user_id	spend_2016	decile_2016	spend_2017	decile_2017
1	1	1	2	2
2	1	1	null	null
3	null	null	1	1

To get this result we can join the two tables using a FULL OUTER JOIN on user_id which gives the expected result.

SELECT
  COALESCE(c_2016.user_id, c_2017.user_id) AS user_id,
  c_2016.spend_2016,
  c_2016.decile_2016,
  c_2017.spend_2017,
  c_2017.decile_2017
FROM
  test.c_2016
  FULL OUTER JOIN
  test.c_2017 ON
  c_2016.user_id = c_2017.user_id
  ORDER BY user_id ASC

Topic		Replies	Views
Aggregating Tables to the Most Recent Values Redshift	3	6045	June 28, 2017
Making Redshift Use Merge Joins Instead of Slower Hash Joins For engineers	8	1280	April 8, 2024
Merging context versions in Bigquery For data modelers & consumers	0	941	May 17, 2020
More user table options in our snowplow normalize dbt package v0.2.2! New releases	0	615	March 13, 2023
Dealing with duplicate event IDs for joining contexts to atomic.events Redshift	3	3046	May 3, 2016

Merging Columns with a JOIN

Related Topics