Clojure Collector bottleneck


#1

Hi guys,

We’re using Amazon Elastic Beanstalk with a Clojure-based collector.
We used to have an M1.small instance, but since the number of tracking events has been growing, we replaced it with an M4.large EC2 instance.
The current instance has 450 Mbps of bandwidth with Enhanced Networking enabled by default, plus 8GB RAM and 2 vCPUs.

The work done by the collector seems relatively simple: log the event, then either redirect to the indicated URL (for a click action) or serve a pixel (for an open action).

The problem? We’re facing a huge bottleneck.

We’ve compiled some benchmarks that illustrate the high TTFB (Time To First Byte) when clicking on a tracking URL:

M1.small - 58.02ms (2.2 req/sec)
M4.large - 1.05s (9.2 req/sec)
M1.small - 56.13ms (5.4 req/sec)
M4.large - 7.60s (17.0 req/sec)
M1.small - 88.27ms (3.2 req/sec)
M4.large - 7.77s (13.7 req/sec)
M4.large - 110.33ms (5.5 req/sec)

It is impractical to click on a link and then have to wait 7 seconds until something finally happens!
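As a rough sanity check on the figures above, Little's Law (requests in flight ≈ request rate × time in system) can be applied to each sample. This is only a sketch: it treats TTFB as the time a request spends in the system and assumes each sample was taken at a roughly steady rate, neither of which is confirmed. `in_flight` is a hypothetical helper name.

```python
# Benchmark samples from the list above: (instance, TTFB in seconds, requests per second).
BENCHMARKS = [
    ("M1.small", 0.05802, 2.2),
    ("M4.large", 1.05, 9.2),
    ("M1.small", 0.05613, 5.4),
    ("M4.large", 7.60, 17.0),
    ("M1.small", 0.08827, 3.2),
    ("M4.large", 7.77, 13.7),
    ("M4.large", 0.11033, 5.5),
]


def in_flight(rate_per_sec, ttfb_seconds):
    """Little's Law estimate of concurrent requests (assumes TTFB ~ time in system)."""
    return rate_per_sec * ttfb_seconds


for instance, ttfb, rate in BENCHMARKS:
    estimate = in_flight(rate, ttfb)
    print(f"{instance:9s} {ttfb:7.3f}s x {rate:5.1f} req/s = {estimate:6.1f} requests in flight")
```

If those assumptions hold, the fast samples correspond to less than one request in flight, while the slow M4.large samples correspond to more than a hundred requests waiting at once, which would look more like queueing than slow per-request work.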

Can anyone help us work around this situation?

TIA,


#2

In case it helps, I’ve attached the monitoring for the past 8 hours:

At first sight everything looks normal…
Those spikes on Network Out occur every hour when the logs are deployed to S3.

Also some info regarding response rates from /status:

   "ring.responses.rate":{
      "type":"meter",
      "rates":{
         "1":4.609870456257155,
         "5":4.770831762632824,
         "15":4.8355326453594305
      }
   },
   "ring.responses.rate.2xx":{
      "type":"meter",
      "rates":{
         "1":3.539407799370498,
         "5":3.7155430291082774,
         "15":3.7829888548645725
      }
   },
   "ring.responses.rate.3xx":{
      "type":"meter",
      "rates":{
         "1":1.0256593339035567,
         "5":1.0292773068203749,
         "15":1.018990987053793
      }
   }
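To read these figures, here is a small Python sketch that loads the fragment above and compares the 2xx and 3xx rates with the total. It assumes the keys "1", "5" and "15" are 1-, 5- and 15-minute moving averages in responses per second, and that 2xx responses are mostly pixels and 3xx responses are mostly click redirects; neither is confirmed by the output itself. `unaccounted_rate` is a hypothetical helper name.

```python
import json

# The /status fragment quoted above, wrapped in braces so that it parses as JSON.
STATUS_JSON = """
{
  "ring.responses.rate": {
    "type": "meter",
    "rates": {"1": 4.609870456257155, "5": 4.770831762632824, "15": 4.8355326453594305}
  },
  "ring.responses.rate.2xx": {
    "type": "meter",
    "rates": {"1": 3.539407799370498, "5": 3.7155430291082774, "15": 3.7829888548645725}
  },
  "ring.responses.rate.3xx": {
    "type": "meter",
    "rates": {"1": 1.0256593339035567, "5": 1.0292773068203749, "15": 1.018990987053793}
  }
}
"""


def unaccounted_rate(status, window):
    """Total response rate minus the 2xx and 3xx rates for one window."""
    total = status["ring.responses.rate"]["rates"][window]
    ok = status["ring.responses.rate.2xx"]["rates"][window]
    redirects = status["ring.responses.rate.3xx"]["rates"][window]
    return total - ok - redirects


status = json.loads(STATUS_JSON)
for window in ("1", "5", "15"):
    total = status["ring.responses.rate"]["rates"][window]
    other = unaccounted_rate(status, window)
    print(f"window {window:>2}: total {total:.2f}, not 2xx/3xx {other:.3f}")
```

Whatever remains after subtracting 2xx and 3xx would be responses in other status classes, so a small remainder suggests the collector is mostly answering successfully at this rate.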

#4

Hi @T_P,

A few follow-up questions to see if we can get to the bottom of this:

  1. What Trackers are you using?
  2. Are you using GET or POST requests? If POST how many are being bundled?
  3. With the m4.large server what type of EBS volume have you attached? Standard, gp2, io1?
  4. Is the latency something you are seeing in the Elastic Beanstalk UI or are these time measurements at your host? Does sending the request via cURL at the CLI result in the same roundtrip time?
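For point 4, a minimal Python sketch for timing a request from a script, as a rough counterpart to the cURL check. `measure_ttfb` is a hypothetical helper and `collector.example.com` is a placeholder host, not a real endpoint. The timer covers DNS lookup, TCP connect and the wait for the response headers, so it approximates TTFB rather than matching any particular tool's definition exactly.

```python
import http.client
import time


def measure_ttfb(host, path="/", port=80, timeout=10.0):
    """Return (status, seconds) from opening the connection until the headers arrive.

    http.client does not follow redirects, so a click URL reports the time
    taken to receive the redirect response itself.
    """
    conn = http.client.HTTPConnection(host, port, timeout=timeout)
    try:
        start = time.perf_counter()
        conn.connect()                 # DNS lookup + TCP handshake
        conn.request("GET", path)
        response = conn.getresponse()  # returns once status line and headers are read
        elapsed = time.perf_counter() - start
        response.read()                # drain the body before closing
        return response.status, elapsed
    finally:
        conn.close()


# Example call (placeholder host and shortened query string, both assumptions):
# status, seconds = measure_ttfb("collector.example.com", "/r/tp2?e=ue&p=web&tv=custom")
# print(status, f"{seconds * 1000:.1f} ms")
```

Running it a few times in a row, from both your own machine and a host in the same region as the collector, should help separate network latency from time spent in the collector.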

#5

Hi @josh

  1. We’re using Pixel Tracker, currently tracking pixels and clicks (example below)
  2. GET
  3. 100GB gp2
  4. Benchmarks were collected via HTTP (location Portugal, EBS AZ Ireland); the response rates were collected at EBS using /status. Via either cURL or HTTP, the results show a very similar TTFB.

Request example:
trck.domain.com/r/tp2?e=ue&ue_pr={"schema"%3A"iglu%3Acom.snowplowanalytics.snowplow%2Funstruct_event%2Fjsonschema%2F1-0-0"%2C"data"%3A{"schema"%3A"iglu%3Acom.XYZ%2Fclick%2Fjsonschema%2F1-0-4"%2C"data"%3A{"cid"%3A"8633"%2C"eid"%3A"31238"%2C"uid"%3A"12345"%2C"geo"%3A"PT"}}}&tv=custom&p=web&u=https%3A%2F%2Fwww.XYZ.com%2Fhome%2C24682%2CNL%2CNL%2C54219.html


#6

Hi @T_P,

Thanks for that … all of that looks fine but there are a few other things we can have a look at.

  1. Could you share your collector endpoint so that I can test it as well for TTFB (just to remove any localised latency issues)
  2. Could you share the exact configuration for both the m1.small and m4.large environments:
     • RAM allocated, instance count etc
  3. Are you running the Collectors in a Private Subnet behind a NAT Gateway / Instance?
  4. How many requests per second is the collector seeing at the Load Balancer currently and what is the average reported Latency for the Load Balancer?

#7

Thank you for your help, @josh

  1. Take a look: trck.eu-west-1.elasticbeanstalk.com. Currently with ±5 req/sec the latency isn’t fully experienced since we opted to disable the tracking of most events.
  2. At the moment only the m4.large is in use: 1 instance with 8GB RAM, 2 vCPUs and dedicated 450 Mbps bandwidth.
  3. No NAT Gateway, but using a VPC with a subnet.
  4. Not using a Load Balancer at the moment.

Instance metrics from the past 24 hours:

Volume metrics from the same 24 hours period:

For reference, on 19/09 at 16h00 we saw a TTFB of around 6.60s (8.5 req/sec).


#8

Hi @josh, any ideas based on the metrics above?


#9

Hi @T_P,

Requests to the endpoint you supplied above return in just a few milliseconds, so I’m not sure where the issue might be.

Is the traffic pattern from the pixel tracker very spiky? Are you sending sudden influxes of data to the collector? If you do manage to overwhelm the server then response times can rise quite sharply.

“At the moment only m4.large is used that consists in 1 instance with 8GB RAM, 2 vCPUs, dedicated 450 Mbps bandwidth.”

I was asking more about the Elastic Beanstalk configuration. How much RAM have you allocated to the Clojure Collector server itself?