Data mismatch between Stream Enrich and EmrEtlRunner


#1

Hi,

We are using two different pipelines for event tracking. One is for real-time data processing (Stream Enrich to Kinesis to Elasticsearch) and the other is for batch processing of events (EmrEtlRunner to Redshift).

The number of events saved in each of them does not match. Somehow there are more events in Redshift than in Elasticsearch. Also, there are some events that are present in Elasticsearch but not in Redshift. I have checked all the bad Kinesis streams and they are (almost) empty.

Is there any other way of debugging this problem?


#3

Hey @ramandamodar,

What is the percentage of “lost” events in ES?

Is there anything specific about the events (if you can identify them) that are only present in Redshift? For example, they could be some subset of unstructured events, or events from a specific client-side environment.

Also, could you share the tracker’s version and initialization code? Right now I suspect either the Tracker-Collector part of the pipeline or just an inaccuracy in counting.
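
To identify the events that are present in only one store, one option is to export the event_id column from both Redshift and Elasticsearch and diff the two lists. Below is a minimal sketch of that comparison. It assumes the IDs have already been exported into plain arrays; the function name diffEventIds and the shape of its result are hypothetical, not part of Snowplow.

```javascript
// Hypothetical helper: compare event_id lists exported from each store.
// redshiftIds / elasticsearchIds are plain arrays of event_id strings.
function diffEventIds(redshiftIds, elasticsearchIds) {
  const rs = new Set(redshiftIds);
  const es = new Set(elasticsearchIds);

  // IDs seen in one store but not the other.
  const onlyInRedshift = [...rs].filter((id) => !es.has(id));
  const onlyInElasticsearch = [...es].filter((id) => !rs.has(id));

  return {
    onlyInRedshift,
    onlyInElasticsearch,
    // Rows that repeat an event_id inflate a naive COUNT(*) comparison.
    duplicateRowsInRedshift: redshiftIds.length - rs.size,
    duplicateDocsInElasticsearch: elasticsearchIds.length - es.size,
    // Share of distinct Redshift events that are missing from Elasticsearch.
    missingFromEsPct: rs.size === 0 ? 0 : (onlyInRedshift.length / rs.size) * 100,
  };
}
```

Comparing distinct IDs rather than raw row counts separates genuinely missing events from a counting artifact such as repeated event_id values in one of the stores.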


#4

Hi @anton,

3 to 4%

There is a count mismatch for all types of events.

JavaScript Tracker Version - 2.8.0

window.snowplow('newTracker', 'cf', 'track.popxo.com', { // Initialise a tracker
  appId: 'popxo-web',
  cookieDomain: 'www.popxo.com'
});

#5

I was going through the logs of the Kinesis LZO S3 Sink service and found these errors:

{"log":"INFO: Unable to execute HTTP request: kinesis.us-east-1.amazonaws.com failed to respond\n","stream":"stderr","time":"2018-01-15T09:42:20.820932269Z"}

and

{"log":"INFO: Unable to execute HTTP request: Socket Closed\n","stream":"stderr","time":"2018-01-15T09:42:44.033350048Z"}

Could this be related to the data mismatch problem that I am facing?


#7

Hi @anton

Redshift has many events where user_ipaddress is 2. None of these events are present in Elasticsearch.
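
To check how much of the gap such a value accounts for, the Redshift-only events can be tallied by a single column. The sketch below assumes those rows have already been fetched as plain objects keyed by column name; countByField is a hypothetical helper, not a Snowplow API.

```javascript
// Hypothetical helper: tally events by the value of one column,
// returning [value, count] pairs sorted with the most frequent first.
function countByField(events, field) {
  const counts = new Map();
  for (const event of events) {
    const key = String(event[field]);
    counts.set(key, (counts.get(key) || 0) + 1);
  }
  return [...counts.entries()].sort((a, b) => b[1] - a[1]);
}

// Usage with made-up rows: which user_ipaddress values dominate?
const redshiftOnly = [
  { event_id: "e1", user_ipaddress: "2" },
  { event_id: "e2", user_ipaddress: "2" },
  { event_id: "e3", user_ipaddress: "203.0.113.7" },
];
console.log(countByField(redshiftOnly, "user_ipaddress"));
```

If one value such as 2 makes up most of the Redshift-only rows, that points to a specific class of events rather than random loss.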