ElasticSearch proxy

seanhall · May 4, 2016, 12:04am

I am attempting to write bad rows from Hadoop to ElasticSearch, as suggested in https://github.com/snowplow/snowplow/wiki/Common-configuration#elasticsearch. I am not using the AWS ElasticSearch service but a self-contained ES cluster on EC2. I can write to it successfully from that instance via curl. However, I am again running into a proxy issue when trying to write to it from inside the EMR cluster and step. (I assume it’s a proxy issue; the step never resolves and just hangs, and every other step of the Snowplow pipeline for me has been subject to a proxy.)

My question is: At what point exactly does the EMR runner make the http request to ElasticSearch to index these bad records? I’m trying to pinpoint it so I can test adding the proxy, but it’s not clear from either the runner lib or the Scala src.

Thank you!

Sean

fred · May 4, 2016, 8:21am

Hi Sean,

The EmrEtlRunner doesn’t directly communicate with Elasticsearch. Instead, for each bad rows bucket, it adds a step to the EMR jobflow to index that bucket. So it’s the Hadoop cluster which communicates directly with Elasticsearch. You can see these steps in the EMR console - they happen after the enrich and shred steps are finished.

Hope that helps,
Fred

seanhall · May 5, 2016, 1:28am

I thought that might be the case going in, and added an EMR bootstrap step up front in the config.yml to run a short shell script to set those proxy environment variables, but the ElasticSearch sink steps hang for 20 minutes or more on a very small dataset, so I assume a connection error somewhere between EMR and ElasticSearch persists. Is there a point in the Hadoop config wherein that proxy should be set? Or at the least is there an opportunity to add some debugging to those job steps to get some feedback while hanging?

I’m currently trying to add the below to the additional_info field of the emr config.yml to add it to the Configurations section of the cluster:

“{“classification”:“hadoop-env”,“properties”:{},“configurations”:[{“classification”:“export”,“properties”:{“HTTP_PROXY”:“http://pbcld-proxy.nordstrom.net:3128”}}]}”

fred · May 6, 2016, 8:28am

If your Elasticsearch cluster is behind a proxy, can you just update the host and port in the config file to point at that proxy?

You could try setting es_nodes_wan_only to true in the configuration file - I know you said you weren’t using Amazon Elasticsearch Service but it’s worth a try.

Have you looked at the cluster logs? There may be useful error information printed, especially in the syslog files.

Regards,
Fred

Topic		Replies	Views
Sending bad rows to Elasticsearch AWS batch pipeline (Legacy)	2	1259	April 28, 2017
Second job for importing bad rows Troubleshooting	1	1352	June 9, 2016
Trouble sending bad rows to amazon elasticsearch service (EsHadoopInvalidRequest) AWS batch pipeline (Legacy)	4	2963	August 1, 2017
Process bad rows from Elasticsearch and form them into good rows Troubleshooting	5	2899	May 16, 2017
Cluster: Snowplow ETLTerminated with errorsShut down as step failed Duplicate	2	2156	October 10, 2017

ElasticSearch proxy

Related Topics