Geo_region field is 3 chars instead of 2 in stream enrich


#1

upgraded to latest stream enrich (snowplow-stream-enrich-kinesis-0.19.1.jar) but still running on older snowplow-kinesis-s3-0.4.1. anyone know why we are getting 3 chars for geo_region instead of 2 when upgrading? i checked the latest atomic.events master schema and still says geo region should be 2 chars. batch runs fine and shows up as geo_region 2 when done.

userid | 118
slice | 4
tbl | 5451829
starttime | 2018-10-01 12:29:28.002229
session | 4902
query | 27746842
filename | s3://ga-snowplow-production/snowplow-enriched-intraday/in/2018-10-01-49588598374129293789694283554165351853830598362795081826-49588598374129293789694283554599356223072250236514599010.gz
line_number | 1
colname | geo_region
type | char
col_length | 2
position | 320
raw_line | WebSiteOld web 2018-10-01 12:28:46.320 2018-10-01 12:19:52.750 2018-10-01 12:19:52.629 unstruct dcf0f00e-5915-49b2-8ead-e248f7ae0cba cf-hosted-v2.8.2 js-2.8.2 ssc-0.14.0-kinesis stream-enrich-0.19.1-common-0.35.0 XX.XX.XX.XX 455090429 bb3254c1-e05e-4f6c-a9d6-0a20154396da 59 c3fbe5b1-c579-4dcf-8e05-6ab0b561a8d5 RU MOW Moscow 102421 55.7522 37.6156 Moscow https://generalassemb.ly/education/user-experience-design https://generalassemb.ly/education/user-experience-design-immersive https generalassemb.ly 443 /education/user-experience-design https generalassemb.ly 443 /education/user-experience-design-immersive {“schema”:“iglu:com.snowplowanalytics.snowplow/contexts/jsonschema/1-0-0”,“data”:[{“schema”:“iglu:com.google.analytics/cookies/jsonschema/1-0-0”,“data”:{"__utma":“63617721.1790239682.1537304111.1537378654.1537378654.1”,"__utmc":“63617721”,"__utmv":“63617721.|1=Login Status=guest=1^2=Metro=london=1”,"__utmz":“63617721.1537378654.1.1.utmcsr=(direct)|utmccn=(direct)|utmcmd=(none)”,"_ga":"
raw_field_value | MOW
err_code | 1204
err_reason | Char length exceeds DDL length

using new ip_lookups enrichment

[root@ip-XX-XX-XX-XX enrichments]# more ip_lookups.json
{
“schema”: “iglu:com.snowplowanalytics.snowplow/ip_lookups/jsonschema/2-0-0”,

    "data": {

            "name": "ip_lookups",
            "vendor": "com.snowplowanalytics.snowplow",
            "enabled": true,
            "parameters": {
                    "geo": {
                            "database": "GeoLite2-City.mmdb",
                            "uri": "http://snowplow-hosted-assets.s3.amazonaws.com/third-party/maxmind"
                    }
            }
    }

}


#2

for now i just extended geo_region to char(3) and works


#3

I created a Github issue for this a little while ago here and the corresponding DDL changes and migration scripts in a pull request here. If you’re using RDB loader note that you’ll need to increase the DDL length and patch RDB loader as well.


#4

thanks,


#5

@mike’s PR was very helpful in getting to grips with a bug in our code we found only last Friday! thanks :slight_smile:

After looking through ISO_3166-2 on Wikipedia I believe the region codes can be at least one alphanumeric character. It’s not obvious if there are any upper limits to the length. I think the use of a CHAR for the field is incorrect and it should at least be VARCHAR(3) although I’d be tempted to use something bigger but haven’t purchasing the standard to really understand.

Is the ISO reference in the git issue incorrect? It’s labelled as 8166-2 but I think it should be 3166-2.

We have seen numerics for Sri Lanka e.g. 1,2,3,etc. (country code LK) and also it looks like in Greece at least there are multiple ways of defining the same place. The numbers are not a problem with the CHAR type.
| geo_city || geo_region || geo_region_name || geo_country ||
| Athens | 35 | Attiki | GR |
| Athens | I | Attica | GR |

Thanks
Gareth


#6

Having just reminded myself of the CHAR type’s definition there’s not much to be said for it vs VARCHAR at this length of string as it will just pad the single character values on disk. Sorry still trying to interpret everything I’ve seen in our data and reading. Silly mistake. It’s worth baring in mind that the region is at least 1 character though rather than 2 or 3 characters for any analysis.


#7

Yeah - there shouldn’t be too much of a difference here once compression has been applied to the column.

Good catch - I’ve updated that to reflect ISO 3166 not ISO 8166.

I think CHAR(3) should be fine. The standard refers to a “string of up to three alphanumeric characters” so I think that should be sufficient.


#8

I agree I just re-read the link I sent and it does state up to three alphanumeric characters. I definitely read somewhere at least one character but it must have been about something else. Sorry for the confusion and thanks for the info and PR.


#9

No worries. Always good to have another set of eyes on important stuff like this!