[AWS Lake Loader Iceberg] Additional "_recovered_" custom context columns

Hi guys,

we are using the Lake Loader to store our events on S3 in Iceberg format every 3 minutes (using AWS Glue; we adapted your solution slightly to make this work).

What we noticed is that the table schema contains duplicated custom context columns with "recovered" in their names. These carry newer custom context versions, but we don’t know where they come from.

Example:

Lake Loader Output:

[io-compute-0] INFO com.snowplowanalytics.snowplow.lakes.processing.Processing - Non atomic columns: [contexts_com_snowplowanalytics_snowplow_web_page_1,contexts_de_mycompany_video_context_3,contexts_de_mycompany_video_context_3_recovered_3_0_2_6978f4f7,contexts_de_mycompany_video_context_3_recovered_3_1_0_4000cbbd,contexts_de_mycompany_video_context_3_recovered_3_1_1_96f4c485,contexts_de_mycompany_video_context_3_recovered_3_1_2_6f06a2f4]

In addition to the contexts_de_mycompany_video_context_3 custom context column, this also creates extra columns such as contexts_de_mycompany_video_context_3_recovered_3_1_0_4000cbbd. The schema of these two Iceberg columns is exactly the same, so I don’t understand why the Lake Loader is splitting the traffic here.

Interestingly, some schema versions are stored in the actual column, while others end up in new “recovered” columns.

How does this happen? Is the schema wrong here? If so, shouldn’t these events end up in the bad stream rather than creating new columns?

Btw: we are running 6 Lake Loaders in parallel. The custom context version that causes the additional column looks like this:

{
    "$schema": "http://iglucentral.com/schemas/com.snowplowanalytics.self-desc/schema/jsonschema/1-0-0#",
    "description": "Context for MyCompany video tracking",
    "self": {
        "vendor": "de.mycompany",
        "name": "video_context",
        "version": "3-1-2",
        "format": "jsonschema"
    },
    "type": "object",
    "properties": {
       "video_xymatic_id": {
            "type": "string",
            "description": "Unique ID of the video",
            "maxLength": 256
        },
        "video_name_vms": {
            "type": "string",
            "description": "Human readable name of the video",
            "maxLength": 2048
        },
        "video_interaction": {
            "type": "string",
            "description": "How was the video started? (e.g. 'autostart' | 'clicktoplay')",
            "maxLength": 64
        },
        "video_status": {
            "type": "string",
            "description": "What is the status of the video? (e.g. 'vod')",
            "maxLength": 64
        },
        "video_play_type": {
            "type": "string",
            "description": "Play type of the video (e.g. 'firstplay')",
            "maxLength": 64
        },
        "video_sound": {
            "type": "boolean",
            "description": "The video sound is available. Has to be a boolean, i.e. true or false"
        },
        "video_player_version": {
            "type": "string",
            "description": "Version of the video player",
            "maxLength": 64
        },
        "video_player_type": {
            "type": "string",
            "description": "Type of video player ('widget' | 'standard')",
            "maxLength": 64
        },
        "video_publish_date": {
            "type": "string",
            "description": "Publish date of the video",
            "pattern": "^$|^\\d{4}\\-(0?[1-9]|1[012])\\-(0?[1-9]|[12][0-9]|3[01])$",
            "maxLength": 10
        },
        "video_update_date": {
            "type": "string",
            "description": "Update date of the video",
            "pattern": "^$|^\\d{4}\\-(0?[1-9]|1[012])\\-(0?[1-9]|[12][0-9]|3[01])$",
            "maxLength": 10
        },
        "video_salesforce_partner_id": {
            "type": "string",
            "description": "Id of the salesforce partner",
            "maxLength": 128
        },
        "video_creator_job_id": {
            "type": "string",
            "description": "Job Team Id of the video creator",
            "maxLength": 64
        },
        "video_autoplay_setting_page": {
            "type": "string",
            "description": "Autoplay Setting in CMS active on current page (1 if true)",
            "maxLength": 1
        },
        "video_ab_test_id": {
            "type": "string",
            "description": "A/B test id of the video",
            "maxLength": 128
        },
        "video_player_template_name": {
          "type": "string",
          "description": "Name of player integration",
          "maxLength": 256
        }
    },
    "additionalProperties": true
}

Thank you for any advice :slight_smile:

Cheers, Christoph

Hi @capchriscap,

This behavior is explained in detail here: How schema definitions translate to the warehouse | Snowplow Documentation (the Lake Loader handles this the same way as the Databricks loader).
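The short version: when a new schema version introduces a change that the loader considers breaking against what is already in the table, it cannot merge the new data into the existing column, so it writes those events to a separate recovery column instead (rather than the bad stream, because the events themselves are still valid against their schema). The column suffix encodes the offending schema version plus a short hash of that schema's definition. A little sketch of my own (not Lake Loader code) that pulls such a name apart:

object RecoveredColumnName {
  // e.g. contexts_de_mycompany_video_context_3_recovered_3_1_2_6f06a2f4
  //      = base column + schema version 3-1-2 + short hash of that schema
  private val Pattern = """(.+)_recovered_(\d+)_(\d+)_(\d+)_([0-9a-f]+)""".r

  def parse(column: String): Option[(String, String, String)] =
    column match {
      case Pattern(base, model, revision, addition, hash) =>
        Some((base, s"$model-$revision-$addition", hash))
      case _ => None
    }
}

// RecoveredColumnName.parse(
//   "contexts_de_mycompany_video_context_3_recovered_3_1_2_6f06a2f4")
// => Some((contexts_de_mycompany_video_context_3, 3-1-2, 6f06a2f4))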

By the way, we have just released (but not announced yet) Iceberg+Glue support in Lake Loader 0.3.0 :wink: See here: Lake Loader configuration reference | Snowplow Documentation

Perfect, thanks @stanch , this really helps! :slight_smile:

I assume that our string enlargement was seen as a breaking change (max size from 64 to 128). For now we have corrected it using views, thanks a lot for the help!
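For anyone finding this later, our view is roughly of this shape (a simplified sketch with placeholder table and view names; the COALESCE works for us because, as mentioned above, the main and recovered columns have the identical schema):

// Rough sketch of the workaround view (names are placeholders). COALESCE falls
// back to the recovered column on rows where the main column is null; add the
// other _recovered_ variants as extra arguments if you need them too.
spark.sql("""
  CREATE OR REPLACE TEMP VIEW events_video_context AS
  SELECT
    *,
    COALESCE(
      contexts_de_mycompany_video_context_3,
      contexts_de_mycompany_video_context_3_recovered_3_1_2_6f06a2f4
    ) AS video_context_merged
  FROM glue_catalog.snowplow.events
""")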

Thanks for the hint, we will check it out. At first glance I saw that the Iceberg version is still 1.3.x (not the latest 1.4.x), which killed the application every few hours because the assumed-role permissions were no longer valid. But it might also be due to our Spark/Iceberg configuration. Thanks! :slight_smile:

Hi @capchriscap, Lake Loader 0.3.0 still uses Iceberg 1.3.x, and you are correct that it is not the latest version. I avoided upgrading the Iceberg version because… <<insert complicated reasons here, unrelated to current topic>>. But I would like to get Iceberg upgraded in one of the next releases.

Can you explain a bit more about the permissions issues you saw? Is it because you use catalog options like client.factory = org.apache.iceberg.aws.AssumeRoleAwsClientFactory?
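For context, I mean a setup along these lines (purely illustrative; the catalog name, role ARN and region are placeholders):

import org.apache.spark.sql.SparkSession

// Illustrative Iceberg catalog options, set via Spark configuration.
val spark = SparkSession.builder()
  .appName("iceberg-glue-assume-role")
  .config("spark.sql.catalog.glue_catalog", "org.apache.iceberg.spark.SparkCatalog")
  .config("spark.sql.catalog.glue_catalog.catalog-impl", "org.apache.iceberg.aws.glue.GlueCatalog")
  .config("spark.sql.catalog.glue_catalog.io-impl", "org.apache.iceberg.aws.s3.S3FileIO")
  // This factory makes Iceberg create its AWS clients with assumed-role credentials
  .config("spark.sql.catalog.glue_catalog.client.factory", "org.apache.iceberg.aws.AssumeRoleAwsClientFactory")
  .config("spark.sql.catalog.glue_catalog.client.assume-role.arn", "arn:aws:iam::123456789012:role/lake-loader")
  .config("spark.sql.catalog.glue_catalog.client.assume-role.region", "eu-central-1")
  .getOrCreate()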

I assume that our string enlargement was seen as a breaking change (max size from 64 to 128).

I don’t think that is the reason. The Lake Loader should treat string enlargements as a permitted change. There must have been some other breaking change somewhere, but it’s hard for me to guess without seeing the schema.
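To make the distinction concrete with a toy example (made-up fields, nothing to do with your real schema): a bigger maxLength keeps the same warehouse type and merges into the existing column, whereas a changed type would force a recovery column.

// Toy illustration only (made-up fields, not Snowplow code): a type change
// between versions is breaking; a maxLength increase is not, because only
// the type affects the warehouse column.
val v311 = Map("video_status" -> "string", "video_sound" -> "boolean")
val v312 = Map("video_status" -> "string", "video_sound" -> "string") // type changed

val breaking = v311.keySet.intersect(v312.keySet).filter(f => v311(f) != v312(f))
println(s"Fields that would force a recovery column: $breaking") // Set(video_sound)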

Hi @istreeter,
we got the error described in java.lang.IllegalStateException: Connection pool shut down when refreshing table metadata on s3 · Issue #8601 · apache/iceberg (github.com).
However, it might be that the newly updated factory also does the job; we will test it, thanks :slight_smile:

Mhm… interesting. As we will move to the BDP soon anyway, I would rather pick this topic up over there and work on a fix together with you guys then. Thanks for your help! :slight_smile: