Process bad rows from Elasticsearch and form them into good rows


#1

Hey,

We have some bad rows in Elasticsearch + Kibana. Can we somehow export them and turn them into “good rows”? I don’t know how to export them, but if we could, it would be easy to write a script that transforms them and sends them through the pipeline again.


#2

Hi @tclass,

Have you had a look at this? https://github.com/snowplow/snowplow/wiki/Hadoop-Event-Recovery

Christophe


#3

@christophe thanks, I’m almost done with this and am just testing the JS on a small number of events. Is there a way to log from inside the JavaScript, to debug on the cluster?


#4

I’m afraid I don’t know - perhaps someone else can help?


#5

I’m not aware of a way to debug on the cluster itself, but I prefer debugging locally, which makes things a bit simpler (you don’t have to wait for an EMR cluster to spin up).

There are a few options here, but I tend to go with 1 as it’s the quickest:

  1. Download Rhino, the JavaScript implementation that the EMR cluster runs, and use its CLI tool to debug your function locally. You can also import files/strings and test the output of your processing functions against them. The advantage is that this is quick and closely emulates what the cluster executes; the drawback is that it is missing some of the built-in helper functions that Snowplow already defines. Use the 1.7R4 release of Rhino, as this is what Hadoop Event Recovery uses.

  2. Use sbt console and the Snowplow repository itself (the example below is courtesy of @alex).

$ git clone git@github.com:snowplow/snowplow.git
$ cd snowplow
$ vagrant up && vagrant ssh
vagrant@snowplow:~$ cd /vagrant/3-enrich/scala-common-enrich
vagrant@snowplow:/vagrant/3-enrich/scala-common-enrich$ sbt console
[info] Loading project definition from /vagrant/3-enrich/scala-common-enrich/project
[info] Set current project to snowplow-common-enrich (in build file:/vagrant/3-enrich/scala-common-enrich/)
[info] Starting scala interpreter...
[info]
Welcome to Scala version 2.10.1 (Java HotSpot(TM) 64-Bit Server VM, Java 1.7.0_80).
Type in expressions to have them evaluated.
Type :help for more information.

scala> import org.mozilla.javascript._
import org.mozilla.javascript._

scala> val script = """java.lang.System.out.println("Hello from Rhino");"""
script: String = java.lang.System.out.println("Hello from Rhino");

scala> val cx = Context.enter()
cx: org.mozilla.javascript.Context = org.mozilla.javascript.Context@6303a766

scala> val compiled = cx.compileString(script, "user-defined-script", 0, null)
compiled: org.mozilla.javascript.Script = org.mozilla.javascript.gen.user_defined_script_1@3ada96a2

scala> val scope = cx.initStandardObjects
scope: org.mozilla.javascript.ScriptableObject = [object Object]

scala> compiled.exec(cx, scope)
Hello from Rhino
res0: Object = org.mozilla.javascript.Undefined@4336f90d

This has the advantage of more closely resembling the environment your JavaScript executes in, as well as giving you access to the built-in helper functions that Snowplow provides.
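
For option 1, here is a minimal sketch of a script you could load in the Rhino shell to exercise a recovery function against a sample row before starting a cluster. Everything in it is an assumption for illustration: the function name `recover`, the `splitTsv` / `joinTsv` stand-ins (needed because the Snowplow built-in helpers are not available locally), the sample line, the column index and the error text. Rename the function to whatever entry point the recovery job expects, and use real bad rows from your own data. It sticks to ES5 syntax so that it should also run on an old Rhino release.

```javascript
// Pick a logger that works in the Rhino shell (print) and in Node (console.log).
var log = (typeof print === 'function')
  ? print
  : function (msg) { console.log(msg); };

// Hypothetical stand-ins for TSV helpers; they are not part of any Snowplow API.
function splitTsv(line) { return line.split('\t'); }
function joinTsv(fields) { return fields.join('\t'); }

// Hypothetical recovery function: takes the raw line and its error messages,
// returns the repaired line, or null when the row should be left alone.
function recover(event, errors) {
  // Only touch rows whose errors mention the (made-up) broken schema name.
  var relevant = false;
  for (var i = 0; i < errors.length; i++) {
    if (errors[i].indexOf('bad_schema_name') !== -1) {
      relevant = true;
    }
  }
  if (!relevant) {
    return null;
  }

  var fields = splitTsv(event);
  // Made-up fix: correct the schema name in the third column.
  fields[2] = fields[2].replace('bad_schema_name', 'good_schema_name');
  return joinTsv(fields);
}

// Made-up sample input; replace with a line exported from your bad rows.
var sampleLine = ['2016-01-01', 'GET', 'e=ue&schema=bad_schema_name'].join('\t');
var sampleErrors = ['Could not find schema bad_schema_name'];

log(recover(sampleLine, sampleErrors));                 // repaired line
log(String(recover(sampleLine, ['some other error']))); // null, row is skipped
```

Logging through a small `log` wrapper keeps the debug output in one place, so it is easy to silence or redirect once the function behaves as expected.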


#6

@christophe @mike Hey, thanks guys. I got it working yesterday; I had to start the cluster multiple times to test it :smiley: but in the end it worked, and with the example from Mike it will be even faster next time. Big thanks!

