Setting up the real-time pipeline on AWS


#21

Maybe this will be helpful: I couldn’t get my collector application to work on AWS Elastic Beanstalk (autoscaling) until I exposed port 8080 in my collector config (and also in my Dockerfile). I’m not sure whether the collector jar hard-codes 8080 as the listening port somewhere or whether I missed something related to nginx or Docker, but other ports did not work for me.

As I understand it, the load balancer listens on 80 (HTTP) and 443 (HTTPS) by default, then forwards to the instances on port 80, where the nginx proxy should be listening. Ultimately, I didn’t need to change the default inbound/outbound rules on the ELB or EC2 security groups. It was just a matter of setting the application’s exposed port to 8080 in my configs.
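For anyone debugging the same thing, one quick way to confirm which port the collector is actually listening on is to probe it from the instance (or from inside the container). This is only a sketch: `port_is_open` is a hypothetical helper, not part of Snowplow, and the host and port in the usage line are assumptions based on the setup described above.

```python
import socket


def port_is_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and name resolution failures.
        return False
```

Running `port_is_open("127.0.0.1", 8080)` on the instance should return `True` once the collector has bound to 8080. If it returns `False` there, the problem is in the collector or Docker config rather than in the ELB or security groups.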

There are cryptic references to this issue in a few places around the Snowplow Discourse, but no clear answers. Maybe the Snowplow team can help shed some light.


#22

Thank you for your reply Travis, this worked! I changed my Snowplow config to 8080. At first, it failed again when I set the config interface to the DNS name of the ELB. I then changed the interface to the actual server IP with port 8080 and received a “Successfully bound to x.x.x.x/8080” message. This could definitely be documented better, but I am glad it is now mentioned explicitly in the Discourse community for folks in the future.
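A likely explanation for the ELB DNS failure (my assumption, not something confirmed in this thread) is that a process can only bind to an address that belongs to one of the machine’s own network interfaces, and the ELB’s DNS name resolves to the load balancer’s addresses instead. The sketch below shows the general idea; `can_bind` is a hypothetical helper for illustration only.

```python
import socket


def can_bind(host, port):
    """Return True if this machine can bind a TCP socket to host:port."""
    try:
        # Resolve the host the same way a server would before binding.
        infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    except socket.gaierror:
        return False
    family, socktype, proto, _, sockaddr = infos[0]
    with socket.socket(family, socktype, proto) as sock:
        try:
            sock.bind(sockaddr)
            return True
        except OSError:
            # Typically raised when the address is not assigned to a local
            # interface, or when the port is already in use.
            return False
```

On the instance, `can_bind("0.0.0.0", 8080)` or `can_bind` with the instance’s private IP should succeed when the port is free, while passing an address that is not local to the machine will normally fail.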

Thanks again for the support.


#23

I wrote a Lambda function to copy Kinesis stream data to Kinesis Firehose, then configured Firehose to sink to Redshift using the AWS console UI. The setup broke when I moved Redshift into a private subnet inside a VPC. Firehose only works when the Redshift port is exposed to the public. There’s no SSH port forwarding option in it yet.
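For reference, a Lambda of that kind can be quite small. The following is a minimal sketch, not the code I actually ran: the `DELIVERY_STREAM` environment variable and the function names are hypothetical, and the Firehose client is passed in so the forwarding logic can be exercised without AWS.

```python
import base64
import os

# Firehose PutRecordBatch accepts at most 500 records per call.
MAX_BATCH = 500


def decode_kinesis_records(event):
    """Return the raw payload bytes of each Kinesis record in a Lambda event."""
    return [
        base64.b64decode(record["kinesis"]["data"])
        for record in event.get("Records", [])
    ]


def forward_records(event, firehose_client, stream_name):
    """Send every record in the event to Firehose; return (sent, failed) counts."""
    payloads = decode_kinesis_records(event)
    failed = 0
    for start in range(0, len(payloads), MAX_BATCH):
        batch = [{"Data": p} for p in payloads[start:start + MAX_BATCH]]
        response = firehose_client.put_record_batch(
            DeliveryStreamName=stream_name, Records=batch
        )
        failed += response.get("FailedPutCount", 0)
    return len(payloads), failed


def handler(event, context):
    """Lambda entry point; DELIVERY_STREAM is an assumed environment variable."""
    import boto3  # imported here so the helpers above work without boto3

    client = boto3.client("firehose")
    sent, failed = forward_records(event, client, os.environ["DELIVERY_STREAM"])
    if failed:
        # Raising makes Lambda retry the whole batch from the Kinesis stream.
        raise RuntimeError(f"{failed} of {sent} records were rejected by Firehose")
    return {"sent": sent}
```

Note that Firehose concatenates records as-is, so depending on the COPY options you may need to append a newline to each payload before sending it.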

For a long-term secure solution, do not repeat my mistake. You can write your own Kinesis-to-Redshift sink (a standalone service that reads from the Kinesis stream, buffers, and writes into Redshift every minute or so) or, better yet, use the Snowplow components. One of these components is the Kinesis-to-S3 sink. As it creates new S3 objects, discover them, create a manifest, load them into Redshift, then use the manifest to move the loaded objects into an archive. Rinse and repeat.
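The manifest step in that loop could look roughly like this. It is a sketch under assumptions: the bucket name and the `enriched/` and `archive/` prefixes are made up for illustration, and listing the objects, running the Redshift `COPY ... MANIFEST`, and moving the files are left out.

```python
import json


def build_manifest(bucket, keys):
    """Build a Redshift COPY manifest (JSON text) for the given S3 object keys."""
    entries = [
        # "mandatory": True makes COPY fail if a listed object is missing.
        {"url": f"s3://{bucket}/{key}", "mandatory": True}
        for key in sorted(keys)
    ]
    return json.dumps({"entries": entries}, indent=2)


def archive_key(key, src_prefix="enriched/", dst_prefix="archive/"):
    """Map a loaded object's key to its archive location (prefixes are assumed)."""
    if not key.startswith(src_prefix):
        raise ValueError(f"{key!r} is not under {src_prefix!r}")
    return dst_prefix + key[len(src_prefix):]
```

After a successful load, the same manifest doubles as the list of objects to move: for each entry, copy the object to `archive_key(key)` and delete the original, so the next run only discovers new files.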


#24

I might be late to the party, but we have successfully deployed the collector and stream enricher using Docker with Elastic Beanstalk … I am happy to answer questions regarding our setup.