Setting up the real-time pipeline on AWS

Maybe this will be helpful: I couldn’t get my collector application to work on AWS Elastic Beanstalk (autoscaling) until I exposed port 8080 in my collector config (and also in my Dockerfile). I’m not sure whether the collector jar hard-codes 8080 as the listening port somewhere or whether I missed something related to nginx or Docker, but other ports did not work for me.
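
In case it helps anyone else, here is a rough sketch of the Docker side of what I mean; the base image, jar name, and config path below are placeholders rather than my actual files:

```dockerfile
# Hypothetical Dockerfile for the Scala Stream Collector.
# Base image, jar name, and config path are placeholders.
FROM openjdk:8-jre

COPY snowplow-stream-collector.jar /opt/collector/collector.jar
COPY collector.conf /opt/collector/collector.conf

# On the single-container Docker platform, Elastic Beanstalk maps its nginx
# proxy to the first EXPOSEd port, so this must match the port set in the
# collector config (8080 here).
EXPOSE 8080

CMD ["java", "-jar", "/opt/collector/collector.jar", "--config", "/opt/collector/collector.conf"]
```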

As I understand it, the load balancer listens on 80 (HTTP) and 443 (HTTPS) by default, then forwards to the instances via port 80, which the nginx proxy listens on. Ultimately, I didn’t need to change the default inbound/outbound rules on the ELB or EC2 security groups. It was just a matter of setting the exposed port of the application to 8080 in my configs.
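
If you deploy with a Dockerrun.aws.json instead of building the image from a Dockerfile in the bundle, the container port has to be declared there instead; a minimal sketch, with the image name made up:

```json
{
  "AWSEBDockerrunVersion": "1",
  "Image": {
    "Name": "myregistry/snowplow-stream-collector:latest",
    "Update": "true"
  },
  "Ports": [
    {
      "ContainerPort": "8080"
    }
  ]
}
```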

There are cryptic references to this issue in a few places around the Snowplow Discourse, but no clear answers. Maybe the Snowplow team can help shed some light.

Thank you for your reply, Travis; this worked! I changed my Snowplow config to port 8080. At first it failed again when I set the config’s interface to the DNS name of the ELB. I changed the interface to the actual server IP with port 8080 and now get a “Successfully bound to x.x.x.x/8080” message. This could definitely be documented better; however, I am glad it is now mentioned explicitly in the Discourse community for folks in the future.
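
For anyone landing here later, the relevant bit of the collector’s HOCON config ends up looking roughly like this (everything else omitted). Binding to 0.0.0.0 should also work and avoids hard-coding the instance IP; the ELB’s DNS name fails because it is not a local interface the collector can bind to:

```hocon
collector {
  # Bind address: the instance's own IP or 0.0.0.0, not the ELB's DNS name.
  interface = "0.0.0.0"
  # Must match the port exposed in the Dockerfile / EB proxy mapping.
  port = 8080
}
```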

Thanks again for the support.


I wrote a Lambda function to copy Kinesis stream data to Kinesis Firehose, then configured Firehose to sink into Redshift using the AWS console UI. The setup broke when I moved Redshift into a private subnet inside a VPC: Firehose only works with the Redshift port exposed to the public, and there’s no SSH port-forwarding option in it yet.
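
For reference, the Lambda was roughly along these lines (the delivery stream name is a placeholder); it just decodes the Kinesis records and forwards them to Firehose in batches:

```python
import base64

import boto3

firehose = boto3.client("firehose")

# Placeholder name for the Firehose delivery stream pointed at Redshift.
DELIVERY_STREAM = "snowplow-enriched-to-redshift"


def handler(event, context):
    """Forward records from a Kinesis-triggered event to Kinesis Firehose."""
    records = [
        # Kinesis event payloads arrive base64-encoded; Firehose wants raw bytes.
        {"Data": base64.b64decode(record["kinesis"]["data"])}
        for record in event["Records"]
    ]
    # put_record_batch accepts at most 500 records per call, so chunk the list.
    for i in range(0, len(records), 500):
        firehose.put_record_batch(
            DeliveryStreamName=DELIVERY_STREAM,
            Records=records[i:i + 500],
        )
```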

For a long-term, secure solution, do not repeat my mistake. You can write your own Kinesis-to-Redshift sink (a standalone service that reads from the Kinesis stream, buffers, and writes into Redshift every minute or so), or better yet use Snowplow components. One of these components is the Kinesis-to-S3 sink. As it creates new S3 objects, discover them, create a manifest, load them into Redshift, then use the manifest to move the loaded objects away into an archive. Rinse and repeat.
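
If it helps, here is a rough sketch of that discover/manifest/load loop. This is not the actual Snowplow loader; the bucket, prefix, IAM role, and connection details are all placeholders, and the archiving step is left out:

```python
import json

import boto3
import psycopg2  # any Postgres driver that can reach Redshift works

BUCKET = "my-snowplow-enriched"          # placeholder: bucket the Kinesis S3 sink writes to
PREFIX = "enriched/"                     # placeholder: prefix for newly sunk objects
MANIFEST_KEY = "manifests/latest.json"   # where the COPY manifest is written
IAM_ROLE = "arn:aws:iam::123456789012:role/RedshiftCopyRole"  # placeholder

s3 = boto3.client("s3")


def build_manifest():
    """List objects written by the Kinesis S3 sink and write a COPY manifest (pagination omitted)."""
    keys = [
        obj["Key"]
        for obj in s3.list_objects_v2(Bucket=BUCKET, Prefix=PREFIX).get("Contents", [])
    ]
    manifest = {
        "entries": [{"url": f"s3://{BUCKET}/{key}", "mandatory": True} for key in keys]
    }
    s3.put_object(Bucket=BUCKET, Key=MANIFEST_KEY, Body=json.dumps(manifest))
    return keys


def load_into_redshift():
    """Run a manifest-based COPY from inside the VPC; moving the loaded keys to an archive comes after this."""
    conn = psycopg2.connect(
        host="redshift.internal.example.com",  # private endpoint, reachable inside the VPC
        port=5439, dbname="snowplow", user="loader", password="...",
    )
    with conn, conn.cursor() as cur:
        cur.execute(
            f"""
            COPY atomic.events
            FROM 's3://{BUCKET}/{MANIFEST_KEY}'
            IAM_ROLE '{IAM_ROLE}'
            MANIFEST
            DELIMITER '\\t';
            """
        )
```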


I might be late to the party, but we have successfully deployed the collector and stream enricher using Docker with Elastic Beanstalk. I am happy to answer questions regarding our setup.

Hi @kaushikbhanu, I would appreciate it if you could provide some steps to deploy the collector using Docker with EB. I am confused about how to do it exactly. Thank you in advance.