Setting up the real-time pipeline on AWS

Hi all,
I’m new to the Snowplow setup and not an expert developer, but I’m really keen on setting up a real-time streaming pipeline.
I started by setting up a basic CloudFront collector and did some tracking via JavaScript; this worked fine by just following the guide. However, I’m having some trouble setting up the Scala Stream Collector :(
I downloaded the zip with the collector, enrich and sink files, but to be honest I don’t know what to do with them. Which AWS service should I use? I already set up an EC2 instance, but that’s about where I get stuck. Should I start an Elastic Beanstalk environment, or … ?

Any pointers are greatly appreciated!!

Thanks in advance
Arjen


You can run the individual workers on EC2 boxes just fine. It’s worth putting each type of worker in its own ASG, and the collectors in an ASG with an ELB attached.
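
For illustration, a minimal AWS CLI sketch of that layout; the resource names, launch configuration and availability zone below are placeholders, not anything from this thread:

# classic ELB fronting the collector ASG
aws elb create-load-balancer --load-balancer-name snowplow-collector-elb \
  --listeners "Protocol=HTTP,LoadBalancerPort=80,InstanceProtocol=HTTP,InstancePort=80" \
  --availability-zones us-east-1a
# collectors in their own ASG, attached to the ELB
aws autoscaling create-auto-scaling-group --auto-scaling-group-name snowplow-collectors \
  --launch-configuration-name collector-lc --min-size 2 --max-size 4 \
  --availability-zones us-east-1a --load-balancer-names snowplow-collector-elb

The enricher and sink workers would each get their own create-auto-scaling-group call, without the ELB.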


If I can add my 2c:

I was in the same boat just a couple of months ago. Yes, a good place to start is Elastic Beanstalk. It requires a bit more legwork and more concepts to learn, but it’s going to be worth it in the medium term (it comes with ASGs out of the box, as @alex recommended). Note where your resources are located; it’d make sense to put them in a VPC. Again, more concepts, but more options for you later. Good luck! AWS really does give you enough rope to hang yourself with, and for someone without devops expertise it’s a jungle like any other, so be patient and give yourself time.
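
If it helps, a rough sketch of that VPC step with the AWS CLI (the CIDR ranges are arbitrary examples, and create-subnet needs the VPC ID returned by the first call):

# carve out a private network for the pipeline's resources
aws ec2 create-vpc --cidr-block 10.0.0.0/16
aws ec2 create-subnet --vpc-id vpc-0abc123 --cidr-block 10.0.1.0/24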


Thanks a lot for your quick replies @alex and @vivricanopy, this gives me some pointers again :)
One more question to prevent me from taking a wrong turn here: should I set up a worker environment or a web server environment?
Thanks a lot in advance!

@arjenbrink it’s going to be a worker for the enricher, as it doesn’t listen on HTTP, and a web server for the collector.
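
For example, with the EB CLI (the environment names here are placeholders; the web server tier is the default, and --tier worker selects the worker tier):

eb create collector-env                # web server environment for the collector
eb create enricher-env --tier worker   # worker environment for the enricher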

@vivricanopy thanks again, I got the collector launched and the config file set :) But since I started the EC2 instance via Beanstalk, I can’t/shouldn’t run it by accessing the EC2 instance directly, right? How should I do this?
Thanks a billion again :D

No worries man! So… one way to do it is to have a Docker container run it; another would be setting up something like a daemon service directly on the EC2 instance. Be careful though: with EB, you can’t restore a terminated environment AFAIK, so make sure everything you do is scripted and saved in git.

You don’t need Docker. You can choose Elastic Beanstalk with standalone Java. The only thing you need is a Procfile that says what command to run to start the collector.

Alright, not using Docker sounds great (one less factor in the puzzle). @lionelport, I currently have a Tomcat with Java instance in place; would that work too?

You don’t need Tomcat, as the collector binary has a web server bundled. The easiest way, if you’re doing it manually, is to bundle the collector binary, Procfile and config file in one zip and upload it through the AWS console or the eb command.

The Procfile will contain a single line like…

web: ./snowplow-collector -Dhttp.port=5000 --config my.conf

(Note: syntax is from memory)
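
To make that concrete, here's a sketch of the bundling step plus a Procfile variant that invokes the jar explicitly; the file names are taken from later in this thread, and the exact invocation is an assumption rather than something I've checked against the docs:

# bundle everything EB needs into one deployable zip
zip collector-deploy.zip snowplow-stream-collector-0.9.0.jar Procfile my.conf

# Procfile contents (jar invocation is an assumption):
web: java -Dhttp.port=5000 -jar snowplow-stream-collector-0.9.0.jar --config my.conf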


@lionelport thanks for this! I have been trying to get this working, but I keep getting the health status ‘Degraded’.

I have three files:

  • snowplow-stream-collector-0.9.0.jar
  • Procfile
  • my.config

The Procfile only contains ‘web: ./snowplow-stream-collector-0.9.0 -Dhttp.port=5000 --config my.config’

I zipped the three files and then started an instance via EB (keeping it simple with a single instance).

Is there any way I can get more information on why the health gets the status ‘Degraded’?

Thanks again!
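
In case it helps anyone: assuming you have the EB CLI configured against the environment, these commands are the usual way to dig into health problems:

eb logs               # pull the most recent log bundle from the instance
eb health --refresh   # live view of the health status and its causes

(eb health only shows full detail with enhanced health reporting enabled; with basic health, the console’s Events tab and the logs are the main sources.)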

I am running it locally now, and it seems to get pretty far; the Kinesis streams are found and active, but after that I get the error below. This doesn’t really say anything to me. Does anybody have a clue how I can fix this?

Much appreciated!!

[ForkJoinPool-2-worker-1] ERROR c.s.s.c.scalastream.ScalaCollector$ - Failure binding to port
java.lang.RuntimeException: CommandFailed(Bind(Actor[akka://scala-stream-collector/user/handler#-1455739769],/xx.xxx.xxx.xx:80,100,List(),None))

Which component is failing? Is it the enricher? If so, it can’t find the collector to do the health check (have you set it up? maybe try without it first). I’ve set up the collector to accept from 0.0.0.0 on port 80. Another probable cause is availability-zone/security-group/subnet misconfiguration.
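
For reference, the relevant fragment of the collector’s config would look something like the sketch below (not the full file; note that binding to ports under 1024, such as 80, needs root privileges on the instance, which is one common cause of a CommandFailed(Bind(...)) error like the one above):

collector {
  interface = "0.0.0.0"   # accept connections on all interfaces
  port = 8000             # an unprivileged port; 80 would require root
  # stream names, sink settings etc. omitted
}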


@vivricanopy if I run it locally at 0.0.0.0 port 8000 I get no errors, and if I start an EB environment with health reporting set to basic it seems to be working fine. However, if I use the enhanced health reporting it fails, but I haven’t configured anything other than just selecting that option.

(Nothing other than the collector and the bad and good Kinesis streams is set up currently, so no enricher or anything else.)

Feels like I’m getting closer :D

Well, I think you’re on the right track; the “enhanced health reporting” may be sending logs to S3, so you need to configure an access policy for it. I don’t really know what’s needed, but the truth is out there. Also, make sure the role you run it under (maybe the service role? I don’t know) has enough permissions: maybe S3 permissions, maybe CloudWatch permissions, maybe something else entirely. Look at the log outputs and see what’s fishy. In any case, you don’t need it to experiment and to create a proof of concept.
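
One hedged pointer, an assumption on my part rather than something I’ve verified for this setup: AWS ships a managed policy for enhanced health that can be attached to the Beanstalk service role, e.g.:

aws iam attach-role-policy \
  --role-name aws-elasticbeanstalk-service-role \
  --policy-arn arn:aws:iam::aws:policy/AWSElasticBeanstalkEnhancedHealth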


@vivricanopy if I don’t need it to keep going, I’ll gladly skip it for now. I have been looking at setting up the tracker already, to see if things aren’t already working. But I can’t find an example of a tracker for the real-time pipeline yet. Where do I find it?

For a Clojure collector there is this:

window.snowplow('newTracker', 'mycljcoll', 'snowplow-coll.acme.com', { // Initialise a tracker
  appId: '{{MY-SITE-ID}}',
  cookieDomain: '{{MY-COOKIE-DOMAIN}}'
});

But what URI should I use for my real-time collector?

The URI of your collector EB app, of course. There’s a guide on the website AFAIK, and there’s also a “no-js” template generator for your tracker; you can try it as a simple template.
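
So, adapting the snippet above, something like this; the hostname is a placeholder for whatever your EB environment shows, and the tracker name is arbitrary:

window.snowplow('newTracker', 'rt-collector', 'my-collector-env.us-east-1.elasticbeanstalk.com', {
  appId: '{{MY-SITE-ID}}',
  cookieDomain: '{{MY-COOKIE-DOMAIN}}'
});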

Sorry for the not-so-smart questions :yum:
Just to be sure I’m not testing on the wrong URI: I took the URL that is displayed at the top of the EB interface. That’s what I should use, right?

Hi Arjen

I was wondering if you were able to get this running successfully? I too am in a similar boat and would really appreciate any help with setting up the Scala collector through EB.
BTW, I had to remove the option -Dhttp.port=5000 as it threw errors in the EB event log. I wonder if something changed in the Scala collector jar since Nov?

Thanks!

Hi Kjain/Arjen,

Were you able to identify the root cause of the bind failure? I am experiencing a similar issue and I think it is related to my AWS environment, but I am not sure how to address it. Any information would be appreciated.

Thanks!