Cookie ID field types and lengths in Redshift schema


#1

I’ve always wondered about the lengths of the cookie ID fields/types in Redshift:

domain_userid::varchar(128)
network_userid::varchar(128)
refr_domain_userid::varchar(128)
domain_sessionid::char(128)

Given most cookie IDs are 36 character UUIDs, wouldn’t it be more efficient to use CHAR(36) on them? Querying our dataset, these fields all max out at 36 chars. I’m guessing they’re this long in case custom cookie IDs are passed, however I haven’t seen a reference for this in the trackers I’ve used.

Also, why is domain_sessionid set to char(128) while the rest are all varchars?


#2

@robkingston, you might have an old (“pre-historical”) definition of the table. The latest version of the events table you can migrate to could be found here (this is a migration script from version 8 to 9): https://github.com/snowplow/snowplow/blob/master/4-storage/redshift-storage/sql/migrate_0.8.0_to_0.9.0.sql. The length in there is 36.

Though I can see discrepancies between the migrations. Version 10 is back to 128. We would need to look into that. Thanks for raising.


#3

These columns were bumped to accomodate pseudononymisation of these fields.

If you’re not using those features of the pipeline (which encrypt the raw values and in the process make them longer) there is no real benefit to having them as 128 and instead 36 is more appropriate.


#4

@ihor - nope, I’m running the latest version of the schema.

@mike That’s right, I remember - I only needed this for IP pseudonymisation at the time, so ignored the other widened fields.

Ah and varchar is needed for the other fields because of the old domain user IDs of a smaller length.


#5

Yep - a few of the older tracker versions that generated the shorter length user ids are still out there in the wild.