Sending bad rows to Elasticsearch


#1

Hi there,

I’m trying to send bad rows to my Elasticsearch cluster, and I found in the EMR logs (containers/application_*) that the EMR cluster is trying to balance requests across all my ES data nodes:

ERROR [main] org.elasticsearch.hadoop.rest.NetworkClient: Node [10.10.10.14:9200] failed (Connection timed out); selected next node [10.10.10.13:9200]

Is there any way to suppress this behaviour, so that it connects only to the host I’m supplying in my runner config? I want a single proxy in front of the cluster.


#2

Hi @kazgurs1,

A few questions:

  1. What version of Elasticsearch are you running at the moment?
  2. What does your Snowplow configuration look like for sending data to ES? Are you specifying an IP or a hostname here?
  3. If you’re running Elasticsearch yourself, how are es.nodes.client.only, es.nodes.data.only and es.nodes.wan.only set in your Elasticsearch configuration?

#3

Hi Mike,

Thanks for getting back to me.
I’m using Elasticsearch 2.4.1. Oh shoot, I think I got it. I left all the es_nodes settings at their defaults in my runner config. es.nodes.client.only is false by default, so I need to set it to ‘true’ to stop querying my data nodes, right? Thanks so much for pointing me in the right direction.

EDIT: as per https://github.com/snowplow/snowplow/blob/master/3-enrich/emr-etl-runner/lib/snowplow-emr-etl-runner/emr_job.rb#L449, I see that es.nodes.wan.only is the only es-hadoop setting the runner allows me to modify. Will try enabling that one.

EDIT2: yes! Enabling es_nodes_wan_only helped. Thank you for the support.
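
For anyone landing here later: with es.nodes.wan.only enabled, es-hadoop skips node discovery and talks only to the host(s) it was given, which is why the failover to the other data nodes stopped. Below is a rough sketch of what the target looks like in the runner config. Only es_nodes_wan_only comes from this thread; the surrounding key names, the target name, and the host value are assumptions for illustration and may differ in your Snowplow release, so check them against your own config.yml.

```yaml
# Sketch of an Elasticsearch target in the EmrEtlRunner config.yml.
# Only es_nodes_wan_only is confirmed in this thread; the other keys and
# all values are illustrative assumptions and may differ between releases.
storage:
  targets:
    - name: "Bad rows"               # hypothetical target name
      type: elasticsearch
      host: "es-proxy.example.com"   # hypothetical: the single host/proxy to connect to
      port: 9200
      es_nodes_wan_only: true        # connect only to the declared host, no node discovery
```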