Scala Stream Collector on Elastic Beanstalk - how to configure and run?


#1

Hi all,

I am setting up the pipeline for the first time on AWS and would really appreciate any help to cover the missing documentation from the setup guide. I promise that once I have this running, I’ll post my detailed setup here as reference!

I am trying to get the Scala collector set up on Elastic Beanstalk (EB). From my understanding this should be fairly straightforward. I used a few very helpful support topics and the official setup guide for the Clojure collector as a reference.

However, I am not able to get the collector running. Here are some steps I followed:

  1. Tested on my local dev machine (127.0.0.1 on port 8080) with the sink set to stdout. This worked.
  2. Created an EB environment of the web server type (with Java as the configuration) and ELB enabled.
  3. Uploaded a zip file containing the Procfile, the collector jar and the config file. For testing, I kept the settings in the config file the same as for local testing (i.e. interface: 127.0.0.1, port: 8080, sink: stdout). The Procfile contents were: web: ./snowplow-stream-collector --config test.conf
  4. Note: I will eventually set this up with Kinesis, but I wanted to test it with stdout first. Could this be an issue as well?

The EB starts up successfully. Accessing the EB URL gives an nginx 502 error. I am assuming this error comes from port 80, which EB serves automatically because of the web server configuration. Trying to access <eb_url:8080> refuses the connection completely. I cannot telnet to or stat this port on the server at all.
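
For anyone debugging the same symptom, the telnet-style check above can be scripted. This is only a minimal sketch using the Python standard library; the function name port_is_open is made up for illustration, and a False result cannot distinguish a closed port from one blocked by a security group.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and DNS failures alike.
        return False
```

Calling port_is_open with the EB URL (or the EC2 public DNS) and ports 80 and 8080 shows which of the two is reachable from outside.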

I accessed the EC2 instance being used and checked the processes. I can see the line java -jar ./snowplow-stream-collector --config ih.conf so the process is running. But I am not sure why it is inaccessible. Is it a firewall issue (the EB is not inside a VPC)?

Would appreciate any help…thanks all!

UPDATE: While trying to figure this out, I tried the above with 2 Kinesis streams (good and bad) and got the same result, i.e. it worked when testing locally, but not on EB. Thanks!

UPDATE 2: OK…after some more debugging, it seems the AWS setup was partly at fault. The EB created a default security group for this application, which of course did not allow inbound traffic on 8080! So I added a rule for 8080, and I was able to access the server! But weirdly I could not use the EB URL to do so; I had to use the EC2 IP address (or the Amazon public DNS name) to access it. That doesn’t sound right to me…Also, eventually I would like to run the collector on port 80. How can I do that for this application and override the nginx process that runs on it by default? Thanks a million!
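
The security group change described above could also be scripted instead of clicked through in the console. The following is a sketch only, assuming boto3 is installed and AWS credentials are configured; the helper names ingress_rule and open_port are hypothetical, and the default of 0.0.0.0/0 simply mirrors the "any source" rule described in this thread.

```python
def ingress_rule(port: int, cidr: str = "0.0.0.0/0") -> dict:
    """Build one TCP ingress permission in the shape boto3's EC2 client expects."""
    return {
        "IpProtocol": "tcp",
        "FromPort": port,
        "ToPort": port,
        "IpRanges": [{"CidrIp": cidr}],
    }


def open_port(group_id: str, port: int, cidr: str = "0.0.0.0/0") -> None:
    """Add the ingress rule to a security group (needs boto3 and AWS credentials)."""
    import boto3  # imported lazily so ingress_rule() works without boto3 installed

    ec2 = boto3.client("ec2")
    ec2.authorize_security_group_ingress(
        GroupId=group_id,
        IpPermissions=[ingress_rule(port, cidr)],
    )
```

The group_id argument would be the ID of the security group that EB created for the instance.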

UPDATE 2: Now, after a few retries, I can use the EB URL (e.g. *.us-east-1.elasticbeanstalk.com) to access the box on port 8080. But as above, if I configure the Scala collector’s conf to use port 80, it doesn’t work, since the box’s default web server (nginx) takes control. How can I disable this so I can run the Scala collector on the standard port 80?
Hope someone can help as it really can’t be all that difficult :slight_smile:

UPDATE3: Sorry for using the forum as a rather verbose running log of my activities, but I am hoping the resolution will help others struggling at this stage as well. Leaving the previous EB as is, I tried the same approach by building in an ELB/ASG with the EB (since this is highly recommended by the Snowplow team anyhow). Now I cannot expose port 8080 to the outside anymore. I can still use the EC2 box’s IP and get an OK back for port 80, but that defeats the purpose of the ELB. So it’s a bit of a catch-22 here. I can work with 8080, but the ELB doesn’t allow connecting to it. But if I switch to port 80 (ideal) or possibly HTTPS (443), then the default nginx takes over the requests.

Would appreciate any help or guidance…thanks very much!


#2

OK…I decided to answer this myself in case it helps others. The issue was essentially on the AWS implementation side. Here’s what I finally did to get the Scala collector working on port 8080 through an ELB inside a VPC:

  1. Create an EB environment with an ELB (and inside a VPC in my case). I used a public subnet in the VPC for the ELB and the instance.

  2. Set the health check url to /health

  3. Once started, I had to update the Load Balancer settings: set listener for port 80 to instance port 8080.

  4. I updated the Load Balancer’s security group settings (inbound and outbound) to allow port 8080 for any source.

  5. I updated the instance’s security group settings to allow inbound 8080 traffic from any source.

Finally, with these edits, I was able to connect to the collector on the standard HTTP port through the ELB! I will eventually set it up to work on HTTPS instead, but I assume the process will be similar.
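
A quick way to confirm the result is to request the same /health path that the ELB health check uses. This is a minimal sketch with the Python standard library; the function name is_healthy is made up for illustration, and it assumes that a healthy collector answers /health with HTTP 200.

```python
import http.client
import urllib.request


def is_healthy(base_url: str, path: str = "/health", timeout: float = 5.0) -> bool:
    """Return True if GET base_url + path answers with HTTP 200."""
    url = base_url.rstrip("/") + path
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (OSError, http.client.HTTPException):
        # URLError and HTTPError (4xx/5xx) are OSError subclasses, so
        # connection failures, timeouts and a 502 from nginx all land here.
        return False
```

Pointing base_url at the ELB on port 80 and then directly at the instance on port 8080 shows whether the listener mapping and the collector itself both respond.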

Considering the number of manual steps required above, I wonder if there’s a way to automate them so we don’t have to go through these steps for every ELB setup.
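
One possible direction, as an untested sketch: the listener mapping (step 3) and the health check URL (step 2) correspond to Elastic Beanstalk configuration options, which can be applied through boto3. This assumes a Classic Load Balancer, boto3 with configured credentials, and hypothetical helper names (collector_option_settings, apply_settings). Whether this also covers the security group changes in steps 4 and 5 is not confirmed in this thread.

```python
def collector_option_settings(instance_port: int = 8080,
                              health_path: str = "/health") -> list:
    """Build EB option settings: default listener (port 80) -> instance_port, plus health check URL."""
    return [
        {
            # Default port-80 listener of a Classic Load Balancer.
            "Namespace": "aws:elb:listener",
            "OptionName": "InstancePort",
            "Value": str(instance_port),
        },
        {
            "Namespace": "aws:elasticbeanstalk:application",
            "OptionName": "Application Healthcheck URL",
            "Value": health_path,
        },
    ]


def apply_settings(environment_name: str) -> None:
    """Push the settings to an existing environment (needs boto3 and AWS credentials)."""
    import boto3  # imported lazily so the builder above works without boto3

    eb = boto3.client("elasticbeanstalk")
    eb.update_environment(
        EnvironmentName=environment_name,
        OptionSettings=collector_option_settings(),
    )
```

The same option settings could presumably be kept in a configuration file shipped with the application bundle, so that every new environment starts with them.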

I would love to hear comments from others…while I resume further digging into Snowplow…and I am sure I will have more questions to post!

Thanks very much!


#3

Thanks for sharing your experiences @kjain!