Unable to push stream enrich data into Redshift using "copy"



I’m currently using Stream Enrich to read data from the collector in real time and pipe the enriched output to a file. However, when I try to load the enriched events into Redshift via COPY, it errors out because the tab-separated values in the events file are not aligned with the events table.
Could you tell me if this is expected?

The Stream Enrich version I’m using is snowplow-stream-enrich-0.10.0.

Here’s a screenshot of the “stl_load_errors” table in Redshift -

Here’s one of the events generated by Stream Enrich that errors out -

Everest QA      web     2017-05-08 22:03:56.338 2017-05-08 22:03:53.995 2017-05-08 22:03:53.570 transaction     67cb2fec-35fa-4e53-9530-cea6dfb0e366            cf      js-2.7.0        ssc-0.9.0-kinesis       kinesis-0.10.0-common-0.24.0    test@test.com      66.234.199.33   3258001896      93036876-3a0c-4436-a47d-e0934b4f7ecb    4       543da39b-c8cb-4347-8195-49bfebfe352f    US      CA      Pittsburg       94565   38.005096       -121.8387       California                                      https://ccstore-z01a.oracleoutsourcing.com/confirmation/7f822ae9-2a3b-4704-b184-ee265ca24943            https://ccstore-z01a.oracleoutsourcing.com/pirana-descender/product/checkout?url=pirana-descender/product/1462  https   ccstore-z01a.oracleoutsourcing.com      80      /confirmation/7f822ae9-2a3b-4704-b184-ee265ca24943                      https   ccstore-z01a.oracleoutsourcing.com      80      /pirana-descender/product/checkout      url=pirana-descender/product/1462               internal                                                                {"schema":"iglu:com.snowplowtLoadedEventEnd":1494280925555,"domComplete":1494280925555,"loadEventStart":1494280925555,"loadEventEnd":1494280925556,"chromeFirstPaint":1494280925736}}]}                                                     o100516 Everest - Outdoor Gears and Apparel     37.95   0       10      Walnut Creek    CA      US                                                                                      Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.133 Safari/537.36       Chrome  Chrome  57.0.2987.133   Browser WEBKIT  en-US   1       1       0       0       0       0       0       0       0       1       24      848     681     Mac OS X        Mac OS X        Apple Inc.      America/Los_Angeles     Computer        0       1680    1050    UTF-8   833     1305    USD                                                     America/Los_Angeles                             2017-05-08 22:03:53.960                 {"schema":"iglu:com.snowplowanalytics.snowplow/contexts/jsonschema/1-0-1","data":[{"schema":"iglu:com.snowplowanalytics.snowplow/ua_parser_context/jsonschema/1-0-0","data":{"useragentFamily":"Chrome","useragentMajor":"57","useragentMinor":"0","useragentPatch":"2987","useragentVersion":"Chrome 57.0.2987","osFamily":"Mac OS X","osMajor":"10","osMinor":"12","osPatch":"4","osPatchMinor":null,"osVersion":"Mac OS X 10.12.4","deviceFamily":"Other"}},{"schema":"iglu:org.ietf/http_cookie/jsonschema/1-0-0","data":{"name":"sp","value":"543da39b-c8cb-4347-8195-49bfebfe352f"}}]}    a08d055a-9063-4f29-865e-2bbc9714388b    2017-05-08 22:03:53.605 com.snowplowanalytics.snowplow  transaction     jsonschema      1-0-0   66b6557b0252cd64f3bea029de4a8a9b



Hi @amitkhanal,

Yes, this is expected in the situation you describe.

If you look at the column raw_field_value in, e.g., the last line, you see that the value is WEBKIT.
The err_reason (reason for the error) is Unknown boolean format, i.e. the column expects a Boolean, and WEBKIT is not a Boolean.
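For reference, you can pull the same diagnostics straight from the system table instead of eyeballing a screenshot. A minimal query sketch (stl_load_errors column names per Redshift's system tables; adjust the LIMIT to taste):

```sql
-- Most recent load errors: which column choked on which value, and why.
SELECT starttime, filename, line_number, colname, type,
       raw_field_value, err_reason
FROM stl_load_errors
ORDER BY starttime DESC
LIMIT 10;
```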


Thanks for the info @leon.

So does the EMR process use a different version of enrich, or is the enriched data further modified so that it can be processed by StorageLoader?

Is there any way for Stream Enrich to output data aligned with the events table?


Ah - this error occurs because you are trying to pipe the enriched events directly into the Redshift atomic.events table. The enriched events go through some (fairly minimal) additional processing in Hadoop Shred to prepare them for Redshift, which explains the column offset issue.
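To see the offset concretely, you can compare the number of tab-separated fields in an enriched line against the column count of your atomic.events table. A minimal sketch - the sample line and file name below are made up for illustration; a real enriched event has over 100 fields, including the inline JSON columns (contexts, unstruct_event, derived_contexts) that Hadoop Shred extracts into separate tables before the load:

```shell
# Build a tiny stand-in for an enriched event line (5 fields here;
# a real event has 100+, with JSON columns inline).
printf 'app_id\tweb\t{"schema":"iglu:..."}\tv_tracker\tv_collector\n' > enriched-sample.tsv

# Count tab-separated fields per line. Compare this number against
# your table's column count, e.g.:
#   SELECT count(*) FROM information_schema.columns
#   WHERE table_schema = 'atomic' AND table_name = 'events';
awk -F'\t' '{ print NF }' enriched-sample.tsv
```

If the counts differ, COPY will map values to the wrong columns, which is exactly how a string like WEBKIT ends up in a Boolean column.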


Thanks @alex, it makes sense now.

Could you point me to the additional processing logic? We’re trying to load events into Redshift in near real time, and I’d like to extract that logic if possible.


@amitkhanal, as Alex pointed out, you would have to shred the data before loading it into Redshift. In other words, run EmrEtlRunner with the --skip staging,enrich option, leaving the shred step in place.
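For illustration, the invocation might look like the sketch below (flags as in the EmrEtlRunner releases of that era - check the usage text of your version; the config and resolver paths are placeholders):

```shell
./snowplow-emr-etl-runner \
  --config config/config.yml \
  --resolver config/iglu_resolver.json \
  --skip staging,enrich
```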

An alternative solution is to keep separate enrichment processes for the real-time and batch “branches” of a Lambda architecture, as per the diagram in How to setup a Lambda architecture for Snowplow.