Mobile Context to S3 Loader

Hello Everyone.

I’m trying to set up Android Tracker > Kinesis Collector > Stream Enrich > S3 Loader > Query using Athena.

The data in S3 after S3 Loader step contains the following data in the cx field that is sent from the Android Event Emitter.

{ schema={ "schema": "iglu:com.snowplowanalytics.snowplow/contexts/jsonschema/1-0-1", "data": [ { "schema": "iglu:com.snowplowanalytics.snowplow/client_session/jsonschema/1-0-1", "data": { "sessionIndex": 2, "storageMechanism": "SQLITE", "firstEventId": "9cd0bb70-2561-4d3f-b0bd-8af4af365e63", "sessionId": "d8d4204c-2310-4735-87dc-4a3890e90411", "previousSessionId": "5a0cd403-5fcd-41ea-96d0-56d76b544b0b", "userId": "7e586e25-dbc5-479e-95a4-1461fa5af5d2" } }, { "schema": "iglu:com.snowplowanalytics.snowplow/mobile_context/jsonschema/1-0-1", "data": { "networkTechnology": "LTE", "carrier": "IND airtel", "osVersion": "9", "osType": "android", "androidIdfa": "6ad90f15-282d-4cfb-80d0-b84144f005e5", "deviceModel": "Redmi Note 7 Pro", "deviceManufacturer": "Xiaomi", "networkType": "mobile" } } ] }, data=null }

However, after Enrichment and S3 load, the OS Version and network fields are not available separately in the Loaded S3 tsv files.

Any idea what should be done to get this data?

Also, in general, how do I load extra data (other than atomic.event data) into S3 after the load step when processing streaming data using Kinesis Collector > Stream Enrich > S3 Loader?

Thanks in advance.

Hi @Gaurav_Toshniwal, there’s something strange about the data you’ve shared. I don’t think our Android tracker can produce this result under normal circumstances.

Could you please share the code that instruments the tracking?

Also, which version of the Android tracker are you using?

@Colm here’s the code:

    fun initSnowplowTracker() {
        // Create an Emitter
        val e1 = Emitter.EmitterBuilder("sp.*****.com", this)
                .security(RequestSecurity.HTTPS)
                .build()

        // Make and return the Tracker object
        Tracker.init(Tracker.TrackerBuilder(e1, "namespace", "appname", this)
                .sessionContext(true)
                .lifecycleEvents(true)
                .level(LogLevel.VERBOSE)
                .mobileContext(true)
                .backgroundTimeout(900)
                .build()
        )
    }

@Colm just to mention, the data that I shared was Base64 encoded. I shared the decoded version of that data.
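For anyone wanting to reproduce the decoding step, here is a minimal Kotlin sketch. The function name `decodeCx` is made up for illustration, and it assumes the `cx` value uses the URL-safe Base64 alphabet; standard-alphabet input is normalised first so either form should decode.

```kotlin
import java.util.Base64

// Decode a Base64-encoded `cx` payload into its JSON text.
// Assumption: the value is URL-safe Base64 ('-' and '_'); '+' and '/' are
// mapped onto that alphabet so standard Base64 input is accepted as well.
fun decodeCx(cx: String): String {
    val normalised = cx.trim().replace('+', '-').replace('/', '_')
    val bytes = Base64.getUrlDecoder().decode(normalised)
    return String(bytes, Charsets.UTF_8)
}
```

Calling `decodeCx` on the raw `cx` parameter should give back the contexts JSON as text, which can then be pretty-printed or parsed with whatever JSON library the project already uses.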

Athena isn’t a destination at the moment, but if the data is well-formed JSON sitting on S3 then you should be able to query it.

Are you shredding the data or only enriching and writing TSV out?

If it’s the former you’ll want to make sure that you’re querying the JSON data rather than the TSV data. If it’s the latter then you should be able to access this data in TSV format using the derived_contexts column.

Thanks for your response @mike
We are only enriching the data. From what I can see in the documentation, there’s only a Relational Database Shredder available, which is a five step process:

  1. Reads Snowplow enriched events from S3
  2. Extracts any unstructured event JSONs and context JSONs found
  3. Validates that these JSONs conform to schema
  4. Adds metadata to these JSONs to track their origins
  5. Writes these JSONs out to nested folders dependent on their schema
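As a rough illustration of steps 2 and 5 above, the sketch below splits an Iglu schema URI (`iglu:vendor/name/format/version`) into its parts and derives a folder path from them. The names `IgluKey`, `parseIgluUri` and `folderFor` are hypothetical, and the folder layout is an assumption made for the example, not necessarily what the shredder writes.

```kotlin
// The four parts of an Iglu schema URI: iglu:vendor/name/format/version
data class IgluKey(val vendor: String, val name: String, val format: String, val version: String)

// Parse an Iglu URI; returns null when the string does not have the expected shape.
fun parseIgluUri(uri: String): IgluKey? {
    if (!uri.startsWith("iglu:")) return null
    val parts = uri.removePrefix("iglu:").split("/")
    if (parts.size != 4 || parts.any { it.isEmpty() }) return null
    return IgluKey(parts[0], parts[1], parts[2], parts[3])
}

// Hypothetical nested-folder layout, for illustration only;
// the real shredder output layout may differ.
fun folderFor(key: IgluKey): String =
    "vendor=${key.vendor}/name=${key.name}/format=${key.format}/version=${key.version}"
```

With the mobile context schema from this thread, `parseIgluUri` yields vendor `com.snowplowanalytics.snowplow`, name `mobile_context`, format `jsonschema` and version `1-0-1`.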

But because we are using Stream Enrich and the S3 Loader, which enrich the data and put it back into Kinesis, I’m not sure how to configure step 1 of the shredding process.

cc @Colm

If you are planning on loading data into a database like Redshift, RDB Shredder is helpful, but it’s not a requirement if you are only using Athena. If you are querying just the enriched data on S3 from the Kinesis stream, derived_contexts should be around field 123 in the file, and it will hold an array of context objects including mobile_context and client_session.
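Since the exact field position is only approximate, one way to check a sample line is to scan every tab-separated field for the contexts wrapper schema instead of hard-coding an index. This is a minimal sketch: `findContextFields` is a made-up helper, and it assumes each enriched event is a single tab-separated line with the JSON stored as plain text in its field.

```kotlin
// Prefix of the contexts wrapper schema seen in this thread.
val CONTEXTS_SCHEMA = "iglu:com.snowplowanalytics.snowplow/contexts/jsonschema/"

// Return (1-based field number, raw JSON text) for every TSV field
// that looks like a contexts wrapper.
fun findContextFields(tsvLine: String): List<Pair<Int, String>> =
    tsvLine.split('\t')
        .mapIndexed { index, value -> (index + 1) to value }
        .filter { (_, value) ->
            value.trimStart().startsWith("{") && value.contains(CONTEXTS_SCHEMA)
        }
```

Running it over one line of an enriched file reports which field numbers actually carry context arrays in your data, so you can confirm the column before writing the Athena query.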


@Gaurav_Toshniwal what I find confusing here is that the JSON you’ve shared doesn’t look like a legitimate self-describing JSON. It doesn’t strike me as something the tracker can produce (at least, if the tracker were producing it, I would have expected us to have a lot of bug reports about it).

Reformatting it to be more readable, here’s what you’ve shared:

{ schema=

{
  "schema": "iglu:com.snowplowanalytics.snowplow/contexts/jsonschema/1-0-1",
  "data": [
    {
      "schema": "iglu:com.snowplowanalytics.snowplow/client_session/jsonschema/1-0-1",
      "data": {
        "sessionIndex": 2,
        "storageMechanism": "SQLITE",
        "firstEventId": "9cd0bb70-2561-4d3f-b0bd-8af4af365e63",
        "sessionId": "d8d4204c-2310-4735-87dc-4a3890e90411",
        "previousSessionId": "5a0cd403-5fcd-41ea-96d0-56d76b544b0b",
        "userId": "7e586e25-dbc5-479e-95a4-1461fa5af5d2"
      }
    },
    {
      "schema": "iglu:com.snowplowanalytics.snowplow/mobile_context/jsonschema/1-0-1",
      "data": {
        "networkTechnology": "LTE",
        "carrier": "IND airtel",
        "osVersion": "9",
        "osType": "android",
        "androidIdfa": "6ad90f15-282d-4cfb-80d0-b84144f005e5",
        "deviceModel": "Redmi Note 7 Pro",
        "deviceManufacturer": "Xiaomi",
        "networkType": "mobile"
      }
    }
  ]
}
, data=null }

That should fail validation, and shouldn’t successfully come through the enrichment process. It should land in bad rows. A legitimate context array would be just this part:

{
  "schema": "iglu:com.snowplowanalytics.snowplow/contexts/jsonschema/1-0-1",
  "data": [
    {
      "schema": "iglu:com.snowplowanalytics.snowplow/client_session/jsonschema/1-0-1",
      "data": {
        "sessionIndex": 2,
        "storageMechanism": "SQLITE",
        "firstEventId": "9cd0bb70-2561-4d3f-b0bd-8af4af365e63",
        "sessionId": "d8d4204c-2310-4735-87dc-4a3890e90411",
        "previousSessionId": "5a0cd403-5fcd-41ea-96d0-56d76b544b0b",
        "userId": "7e586e25-dbc5-479e-95a4-1461fa5af5d2"
      }
    },
    {
      "schema": "iglu:com.snowplowanalytics.snowplow/mobile_context/jsonschema/1-0-1",
      "data": {
        "networkTechnology": "LTE",
        "carrier": "IND airtel",
        "osVersion": "9",
        "osType": "android",
        "androidIdfa": "6ad90f15-282d-4cfb-80d0-b84144f005e5",
        "deviceModel": "Redmi Note 7 Pro",
        "deviceManufacturer": "Xiaomi",
        "networkType": "mobile"
      }
    }
  ]
}

i.e. of the format {"schema": "iglu:com.snowplowanalytics.snowplow/contexts/jsonschema/1-0-1", "data": [{"schema": ..., "data": ...},{"schema": ..., "data": ...}]}

Indeed, {schema=..., data=...} is not valid JSON.
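One possible explanation, offered as a guess and not a diagnosis: on the JVM, a map's `toString()` prints entries as `{key=value, ...}`, which is the same shape as the string above. The sketch below only reproduces that rendering; `debugRendering` is a hypothetical helper and the map is not the tracker's actual internal type.

```kotlin
// Render a map holding a JSON string under "schema" and null under "data".
// Illustrative only: this stands in for whatever object was being inspected.
fun debugRendering(innerJson: String): String {
    val wrapper: Map<String, Any?> = linkedMapOf("schema" to innerJson, "data" to null)
    // Map.toString() prints {key=value, key=value}, which is not JSON.
    return wrapper.toString()
}
```

If that is what happened, the string is a debug rendering of an in-memory object and not what was actually sent over the wire.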

I suspect Mike might be on to something, in that the issue may lie in how you’re querying the data via Athena. There’s a tutorial on using Athena to query enriched data here. Note that it’s slightly out of date: if your data is not in run= subfolders, then you’d need to remove PARTITIONED BY(run STRING) from the table definition.

Perhaps following that guide will unearth a different result - if it doesn’t, do let us know.

Thanks for sharing all of this @Colm. Let me check based on this and get back to you.

You’re correct @Colm, the final data that I can see in S3 (and Athena is connected to S3) is in the format you mentioned. The structure I had shared is what I saw while debugging the Emitter request object on the Android side.

Also, the other link about querying data in Athena is quite helpful.

Thanks a lot.
