How to read enriched page view data

Hi!

I want to read data from Kafka and use my own storage loader, but an enriched event doesn't seem to have a fixed length. If I treat \t as the separator between columns, one time I get 97 columns for a page view and another time I get 100. How can I read the data correctly?

For example, this is one enriched event stored in Kafka:

“digikala_tracker_test\tweb\t2017-08-29 06:02:18.479\t2017-08-29 06:02:16.413\t2017-08-29 06:01:24.641\tpage_view\t71a9a496-3203-4bef-9855-bbf7d7882193\t\tcf\tjs-2.6.2\tssc-0.9.0-kafka\tkinesis-0.10.0-common-0.24.0\t\t172.16.159.x\t678449323\tac0cd031-89a8-45e5-afc0-08b49c8581e9\t1\t8f48d8db-02aa-485c-9a9e-b5fca069ad9d\t\t\t\t\t\t\t\t\t\t\t\thttp://localhost:8080/home\t\t\thttp\tlocalhost\t8080\t/home\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\tMozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.101 Safari/537.36\tChrome\tChrome\t60.0.3112.101\tBrowser\tWEBKIT\ten-US\t1\t1\t0\t0\t0\t0\t0\t0\t0\t1\t24\t1920\t346\tWindows 8.1\tWindows\tMicrosoft Corporation\tAsia/Tehran\tComputer\t0\t1920\t1080\tUTF-8\t1903\t842\t\t\t\t\t\t\t\t\t\t\t\t2017-08-29 06:01:24.643\t\t\t{“schema”:“iglu:com.snowplowanalytics.snowplow/contexts/jsonschema/1-0-1”,“data”:[{“schema”:“iglu:com.snowplowanalytics.snowplow/ua_parser_context/jsonschema/1-0-0”,“data”:{“useragentFamily”:“Chrome”,“useragentMajor”:“60”,“useragentMinor”:“0”,“useragentPatch”:“3112”,“useragentVersion”:“Chrome 60.0.3112”,“osFamily”:“Windows”,“osMajor”:null,“osMinor”:null,“osPatch”:null,“osPatchMinor”:null,“osVersion”:“Windows”,“deviceFamily”:“Other”}},{“schema”:“iglu:org.ietf/http_cookie/jsonschema/1-0-0”,“data”:{“name”:“sp”,“value”:“8f48d8db-02aa-485c-9a9e-b5fca069ad9d”}}]}\t0187d810-852e-405c-a4e2-02002de696df\t2017-08-29 06:02:16.411\tcom.snowplowanalytics.snowplow\tpage_view\tjsonschema\t1-0-0\t6a98c8fb5beb3af6249c36dae5afbad4\t”

In this case I have 97 columns, but other times I get a different number of columns!
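One possible cause of a varying column count, offered only as a guess since the loader code is not shown, is the way the line is split. Many fields in an enriched event are empty, so trimming whitespace or splitting on any whitespace changes how many columns come out. The sketch below uses a made-up six-field row (not a real Snowplow event) to show the difference:

```python
# Toy row, NOT a real enriched event: 6 tab-separated fields,
# several of them empty, including the last two.
row = "page_view\t\tcf\tMozilla/5.0 (Windows NT 6.3; Win64)\t\t\n"

# Splitting on the tab character only keeps every position, empty or not.
fields = row.rstrip("\n").split("\t")

# Stripping all surrounding whitespace first also strips the trailing tabs,
# so the trailing empty fields silently disappear.
stripped = row.strip().split("\t")

# Splitting on any whitespace drops empty fields and also breaks the
# user agent string apart at its spaces.
whitespace = row.split()

print(len(fields), len(stripped), len(whitespace))
```

If the count still varies when splitting strictly on the tab character, the cause lies elsewhere and this sketch does not apply.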

Another problem is that there is no correct schema for the enriched data. I mean there is no correct sequence of labels for the data:

user_ipaddress : 172.16.132.155
user_fingerprint : 2902689200
network_userid : 7a0798b3-f250-474d-84bc-45ccf714353f
domain_sessionidx : 1
domain_sessionid : 3422a751-5069-4cc6-b1fa-3f7346102bf4
DntKnw : null
DntKnw : null
DntKnw : null
DntKnw : null
DntKnw : null
DntKnw : null
page_url : http://172.16.159.149:8080/home

DntKnw means I don't know what it is! Is there any documentation for this problem?

Thanks

@mirfarzam,

You could use one of the Analytics SDKs specifically designed to let you work with Snowplow enriched events in your event processing, data modeling and machine-learning jobs. You can use these SDKs with Apache Spark, AWS Lambda, Apache Flink, Scalding, Apache Samza and other Scala/Python/.NET-compatible data processing frameworks.

That is, the SDKs are currently available in Scala, Python and .NET. Here's the wiki link: https://github.com/snowplow/snowplow/wiki/Snowplow-Analytics-SDK
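To give a rough idea of what labelling the columns involves, here is a minimal standard-library sketch that does not use any of the SDKs. It assumes the first 18 columns follow the order of the Snowplow canonical event model; that order should be checked against the documentation for the enrichment version in use. `LEADING_FIELDS` and `label_leading_fields` are hypothetical names made up for this example, and the sample line is just the first 18 values of the event posted above.

```python
# Assumed order of the first 18 enriched event fields (verify against the
# Snowplow canonical event model; the full field list is much longer).
LEADING_FIELDS = [
    "app_id", "platform", "etl_tstamp", "collector_tstamp",
    "dvce_created_tstamp", "event", "event_id", "txn_id",
    "name_tracker", "v_tracker", "v_collector", "v_etl",
    "user_id", "user_ipaddress", "user_fingerprint",
    "domain_userid", "domain_sessionidx", "network_userid",
]


def label_leading_fields(line):
    """Hypothetical helper: map the leading TSV columns to field names."""
    # Split on tabs only, so empty fields keep their position.
    values = line.rstrip("\n").split("\t")
    if len(values) < len(LEADING_FIELDS):
        raise ValueError("line has fewer columns than expected")
    # Empty strings become None; values beyond the known names are ignored.
    return {name: (value or None) for name, value in zip(LEADING_FIELDS, values)}


# First 18 values of the event posted in the question.
sample = "\t".join([
    "digikala_tracker_test", "web", "2017-08-29 06:02:18.479",
    "2017-08-29 06:02:16.413", "2017-08-29 06:01:24.641", "page_view",
    "71a9a496-3203-4bef-9855-bbf7d7882193", "", "cf", "js-2.6.2",
    "ssc-0.9.0-kafka", "kinesis-0.10.0-common-0.24.0", "", "172.16.159.x",
    "678449323", "ac0cd031-89a8-45e5-afc0-08b49c8581e9", "1",
    "8f48d8db-02aa-485c-9a9e-b5fca069ad9d",
])

event = label_leading_fields(sample)
for name, value in event.items():
    print(name, ":", value)
```

The SDKs go further than this sketch: they know the complete field list and handle the JSON columns (contexts and unstructured events), so they remain the safer choice for a real loader.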

These are great tools, thank you for your help!
I will work with the Scala SDK!