Setting up the real-time pipeline on AWS

arjenbrink · November 23, 2016, 8:10pm

Hi all,
I’m new to the Snowplow setup and not an expert developer, but I’m really keen on setting up a real-time streaming pipeline.
I started by setting up a basic CloudFront collector and did some tracking via JavaScript; this worked fine by just following the guide. However, I’m having some trouble setting up the Scala Stream Collector :(
I downloaded the zip with the collector, enrich and sink files, but to be honest I don’t know what to do with them. Which Amazon service should I use? I already set up an EC2 instance, but that’s about where I get stuck. Should I start an Elastic Beanstalk environment or … ?

Any pointers are greatly appreciated!!

Thanks in advance
Arjen

alex · November 23, 2016, 10:08pm

You can run the individual workers on EC2 boxes just fine. It’s worth putting each type of worker in its own ASG, and the collectors in an ASG with an ELB attached.

vivricanopy · November 24, 2016, 2:35am

If I can add my 2c -

I was in the same boat just a couple of months ago. Yes, a good place to start is Elastic Beanstalk. It requires a bit more legwork and more concepts to learn but it’s going to be worth it in the medium term (comes with ASGs out of the box, as @alex recommended). Note where your resources are located - it’d make sense to put them in a VPC. Again, more concepts - but more options for you later. Good luck! AWS really does give you ropes to hang yourself with and for someone without devops expertise it’s a jungle like any other - be patient and give yourself time.

arjenbrink · November 24, 2016, 9:09am

Thanks a lot for your quick replies @alex and @vivricanopy, this gives me some pointers again :)
One more question to prevent me from taking a wrong turn here: should I set up a worker environment or a web server environment?
Thanks a lot in advance!

vivricanopy · November 24, 2016, 5:45pm

@arjenbrink it’s going to be a worker environment for the enricher - as it doesn’t listen on HTTP - and a web server environment for the collector.

arjenbrink · November 24, 2016, 8:53pm

@vivricanopy thanks again, I got the collector launched and the config file set :) But since I started the EC2 instance via Beanstalk, I can’t/shouldn’t run it by logging into the EC2 instance, right? How should I do this?
Thanks a billion again :D

vivricanopy · November 25, 2016, 4:01pm

No worries man! So… one way to do it is to have a Docker container run it; another would be setting up something like a daemon service directly on the EC2 instance. Be careful though - with EB, you can’t restore a terminated environment afaik, so make sure everything you do is scripted and saved in git.

lionelport · November 26, 2016, 1:24am

You don’t need Docker. You can choose Elastic Beanstalk with standalone Java. The only thing you need is a Procfile that says which command to run to start the collector.

arjenbrink · November 26, 2016, 12:36pm

All right, not using Docker sounds great (one less factor in the puzzle). @lionelport, I currently have a Tomcat with Java instance in place, would that work too?

lionelport · November 26, 2016, 11:24pm

You don’t need Tomcat as the collector binary has a web server bundled. The easiest way, if you’re doing it manually, is to bundle the collector binary, Procfile and config file in one zip and upload it through the AWS console or the eb command.

The Procfile will contain a single line like…

web: ./snowplow-collector -Dhttp.port=5000 --config my.conf

(Note: syntax is from memory)
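
The bundling step described above can be sketched in Python. This is only a minimal sketch under assumptions, not an official deployment procedure: the function name `build_bundle` and the file names in the usage line are hypothetical placeholders, and the only thing taken from this thread is that the Procfile, the collector binary and the config file go together at the top level of one zip.

```python
import zipfile
from pathlib import Path


def build_bundle(src_dir, bundle_path, names):
    """Zip the given files from src_dir into bundle_path, all at the archive root.

    Raises FileNotFoundError if any expected file is missing, so an
    incomplete bundle is caught before it is uploaded.
    """
    src = Path(src_dir)
    missing = [name for name in names if not (src / name).is_file()]
    if missing:
        raise FileNotFoundError(f"missing from {src}: {', '.join(missing)}")

    with zipfile.ZipFile(bundle_path, "w", zipfile.ZIP_DEFLATED) as bundle:
        for name in names:
            # arcname keeps each file at the top level of the zip
            # instead of nesting it under the source directory.
            bundle.write(src / name, arcname=name)

    # Re-open the archive and report what actually went in.
    with zipfile.ZipFile(bundle_path) as bundle:
        return bundle.namelist()
```

For example, `build_bundle(".", "collector-bundle.zip", ["Procfile", "snowplow-collector", "my.conf"])` would produce a zip that can be uploaded through the console; the names here are placeholders for whatever your files are actually called.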

arjenbrink · November 27, 2016, 11:04am

@lionelport thanks for this! I have been trying to get this working but I keep getting the health status ‘Degraded’.

I have three files:

snowplow-stream-collector-0.9.0.jar
Procfile
my.config

The Procfile only contains ‘web: ./snowplow-stream-collector-0.9.0 -Dhttp.port=5000 --config my.config’

I zipped the three files and then started an instance via EB (a single instance, to keep it simple).

Is there any way I can get more information on why the health status is ‘Degraded’?

Thanks again!
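
One thing that may be worth checking in a bundle like the one listed above: the file is named `snowplow-stream-collector-0.9.0.jar`, while the Procfile command runs `./snowplow-stream-collector-0.9.0` without the `.jar` extension. Whether that mismatch is what caused the degraded health is not confirmed anywhere in this thread. The sketch below is a hypothetical helper (`missing_procfile_targets` is not part of any Snowplow or AWS tool) that flags Procfile commands pointing at a relative path that is not in the bundle.

```python
import shlex


def missing_procfile_targets(procfile_text, bundle_files):
    """Return the './...' executables named in a Procfile that are not in bundle_files."""
    missing = []
    for line in procfile_text.splitlines():
        line = line.strip()
        if not line or ":" not in line:
            continue  # skip blank or malformed lines
        # A Procfile line has the form "<process name>: <command>".
        _, command = line.split(":", 1)
        tokens = shlex.split(command)
        if not tokens:
            continue
        executable = tokens[0]
        # Only relative './name' paths can be checked against the bundle contents.
        if executable.startswith("./") and executable[2:] not in bundle_files:
            missing.append(executable)
    return missing
```

Called with the Procfile line and the three file names from the post above, it reports `./snowplow-stream-collector-0.9.0` as missing; with a file of exactly that name in the bundle it returns an empty list.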

arjenbrink · November 28, 2016, 6:25pm

I am running it locally now, and it seems to get pretty far: the Kinesis streams are found and active, but after that I get the error below. This doesn’t really tell me anything. Does anybody have a clue how I can fix this?

Much appreciated!!

[ForkJoinPool-2-worker-1] ERROR c.s.s.c.scalastream.ScalaCollector$ - Failure binding to port
java.lang.RuntimeException: CommandFailed(Bind(Actor[akka://scala-stream-collector/user/handler#-1455739769],/xx.xxx.xxx.xx:80,100,List(),None))

vivricanopy · November 28, 2016, 8:54pm

Which component is failing? Is it the enricher? If so, then it can’t find the collector to do the health check (have you set it up? maybe try without first). I’ve set up the collector to accept from 0.0.0.0 on port 80. Another probable cause is an availability-zone/security-group/subnet misconfiguration.
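
The log above shows a bind to a specific address on port 80 failing. A bind like that can be tried outside the collector with a plain socket, which helps separate the common causes: no permission for a port below 1024, a port already taken, or an address that is not assigned to any local interface (an EC2 public IP, for instance, is not configured on the instance itself). This is a minimal diagnostic sketch; `try_bind` is a hypothetical helper, and it says nothing about which of these causes applied in this thread.

```python
import errno
import socket


def try_bind(host, port):
    """Try to bind a TCP socket to (host, port); return 'ok' or a short reason."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        sock.bind((host, port))
        return "ok"
    except PermissionError:
        # Typical when a non-root process asks for a port below 1024.
        return "permission denied"
    except OSError as exc:
        if exc.errno == errno.EADDRINUSE:
            return "port already in use"
        if exc.errno == errno.EADDRNOTAVAIL:
            return "address not assigned to a local interface"
        return f"failed: {exc}"
    finally:
        sock.close()
```

Running `try_bind("0.0.0.0", 80)` and `try_bind("<the address from your config>", 80)` on the machine in question, as the same user that starts the collector, should show which case you are in.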

arjenbrink · November 28, 2016, 9:34pm

@vivricanopy if I run it locally at 0.0.0.0 port 8000 I get no errors, and if I start an EB environment with health reporting set to basic it seems to be working fine. However, if I enable enhanced health reporting it fails, though I haven’t configured anything other than selecting that option.

(Nothing other than the collector and the bad and good Kinesis streams is set up currently, so no enricher or anything else.)

Feels like I’m getting closer :D

vivricanopy · November 29, 2016, 4:47pm

Well, I think you’re on the right track; the “enhanced health reporting” may be sending logs to S3, so you need to configure an access policy for it - I don’t really know what’s needed but the truth is out there. Also, make sure the role you run it under (maybe the service role? I dunno) has enough permissions - maybe S3 permissions, maybe CloudWatch permissions, maybe something else entirely; look at the log outputs, see what’s fishy. In any case, you don’t need it to experiment and to create a proof of concept.

arjenbrink · November 29, 2016, 6:27pm

@vivricanopy if I don’t need it to keep going, I’ll gladly skip it for now. I have been looking at setting up the tracker already to see whether things are already working, but I can’t find an example of a tracker for the real-time pipeline yet. Where do I find it?

For a Clojure collector there is this:

window.snowplow('newTracker', 'mycljcoll', 'snowplow-coll.acme.com', { // Initialise a tracker
  appId: '{{MY-SITE-ID}}',
  cookieDomain: '{{MY-COOKIE-DOMAIN}}'
});

But what URI should I use for my real-time collector?

vivricanopy · November 29, 2016, 9:51pm

The URI of your collector EB app, of course. There’s a guide on the website afaik, and there’s also a “no-js” template generator for your tracker - you can try it as a simple template.
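
Before wiring up the JavaScript tracker, a quick way to test the collector URI is to request its pixel endpoint directly. The sketch below builds such a URL and optionally sends it. It rests on assumptions: the host name in the usage line is a made-up placeholder for your own EB environment URL, the function names are hypothetical, and the `/i` path with the `e`, `p`, `aid` and `url` parameters reflects my understanding of the Snowplow tracker protocol rather than anything stated in this thread.

```python
from urllib.parse import urlencode
from urllib.request import urlopen


def pixel_url(collector_host, app_id, page_url):
    """Build a GET URL for a minimal page-view event against the collector."""
    params = {
        "e": "pv",       # event type: page view
        "p": "web",      # platform
        "aid": app_id,   # application id
        "url": page_url, # page the event claims to come from
    }
    return f"http://{collector_host}/i?{urlencode(params)}"


def send_test_event(collector_host, app_id, page_url, timeout=5):
    """Send the test event and return the HTTP status code of the response."""
    with urlopen(pixel_url(collector_host, app_id, page_url), timeout=timeout) as response:
        return response.status
```

For example, `send_test_event("my-collector.example.elasticbeanstalk.com", "my-site", "http://example.com/")` sends one test event; a successful response suggests the collector is reachable at that host, and the same host is what goes into the tracker’s `newTracker` call.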

arjenbrink · November 29, 2016, 10:10pm

Sorry for the not-so-smart questions.
Just to be sure I’m not testing on the wrong URI: I took the URL that is displayed at the top of the EB interface. That’s what I should use, right?

kjain · January 25, 2017, 6:48pm

Hi Arjen

I was wondering if you were able to get this running successfully? I too am in a similar boat and would really appreciate any help with setting up Scala through EB.
btw… I had to remove the option -Dhttp.port=5000 as it threw errors in the EB event log. I wonder if something changed in the Scala collector jar since Nov?

Thanks!

tedrem · March 14, 2017, 9:19pm

Hi Kjain/Arjen,

Were you able to identify the root cause of the bind failure? I am experiencing a similar issue and I am thinking it is related to my AWS environment but I am not sure how to address the issue. Any information would be appreciated.

Thanks!
