Unstable Iglu server response times

Hi!

We tried to set up the Iglu server based on the secure quick start example. However, we have noticed that the response time of API calls can be extremely slow from time to time, and we suspect this might be the cause of the issue we’re seeing with the RDB Loader.

The health call says everything is OK, but about 50 % of the API calls take forever to complete.

curl -kso /dev/null iglu-lb-<ACCOUNT>.<REGION>.elb.amazonaws.com/api/meta/health -w "==============\n\n
| dnslookup: %{time_namelookup}\n
| connect: %{time_connect}\n
| appconnect: %{time_appconnect}\n
| pretransfer: %{time_pretransfer}\n
| starttransfer: %{time_starttransfer}\n
| total: %{time_total}\n
| size: %{size_download}\n
| HTTPCode=%{http_code}\n\n"

| dnslookup: 0.001543
| connect: 75.232440
| appconnect: 0.000000
| pretransfer: 75.232495
| starttransfer: 75.275799
| total: 75.275937
| size: 2
| HTTPCode=200

time curl iglu-lb-<ACCOUNT>.<REGION>.elb.amazonaws.com/api/schemas/com.snowplowanalytics.snowplow/ua_parser_context/jsonschema/1-0-0 -X GET -H "apikey: <READ KEY>"

0.01s user 0.01s system 0% cpu 1:15.38 total

Moments earlier, the connect time was only 0.039454 seconds. Is this expected?
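To put a number on the "about 50 %", the connect time can be sampled repeatedly and counted against a threshold. This is only a rough sketch: `sample_connect` and `count_slow` are made-up helper names, the 1-second threshold in the example is an arbitrary assumption, and the URL is the same placeholder as above.

```shell
# sample_connect URL N: print the TCP connect time (in seconds) of N requests,
# one per line, using curl's time_connect write-out variable.
# (Hypothetical helper name, not part of any tool.)
sample_connect() {
  url=$1
  n=$2
  i=0
  while [ "$i" -lt "$n" ]; do
    curl -kso /dev/null -w '%{time_connect}\n' "$url"
    i=$((i + 1))
  done
}

# count_slow THRESHOLD: read one time per line on stdin and print
# "<slow>/<total>", where "slow" means strictly greater than THRESHOLD seconds.
count_slow() {
  awk -v t="$1" '$1 + 0 > t + 0 { slow++ } END { printf "%d/%d\n", slow, NR }'
}

# Example (20 samples, 1 s threshold, both chosen arbitrarily):
#   sample_connect "iglu-lb-<ACCOUNT>.<REGION>.elb.amazonaws.com/api/meta/health" 20 | count_slow 1
```

A ratio that stays close to one half over many samples would suggest the slowness is tied to something that alternates between requests, rather than to load on the server.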

We have the server and database in two private subnets, and the load balancer in two public subnets of the same VPC. We tried changing the EC2 instance to t3.medium and the RDS instance to db.t3.medium, but that did not help.

This is an unusually long response time.

Is this for all endpoints or just a subset of endpoints?

I’d be tempted to work backwards from the connection (to the load balancer, then to the EC2 instance, and then to the RDS instance) to try to determine where the latency is being introduced, and also to look through the CloudWatch metrics for these services.
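As a possible starting point for the first hop: in the timings above, nearly all of the 75 seconds is spent in connect, before any request reaches the server. A load balancer hostname typically resolves to more than one IP address, and curl’s `--resolve` option can pin a request to one of them, so each address can be timed on its own. A minimal sketch, assuming `dig` is installed; `filter_ipv4` and `time_per_ip` are made-up names, and port 80 is an assumption based on the plain-HTTP calls above.

```shell
# filter_ipv4: keep only the lines on stdin that look like IPv4 addresses
# (dig +short can also print CNAME targets). Hypothetical helper name.
filter_ipv4() {
  grep -E '^[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+$' || true
}

# time_per_ip HOST PATH [PORT]: send one request to every address that HOST
# resolves to and print the connect and total time for each address.
time_per_ip() {
  host=$1
  path=$2
  port=${3:-80}   # assumed default, matching the plain-HTTP calls above
  dig +short "$host" | filter_ipv4 | while read -r ip; do
    printf '%s ' "$ip"
    curl -so /dev/null --resolve "$host:$port:$ip" \
      -w 'connect=%{time_connect} total=%{time_total}\n' \
      "http://$host:$port$path"
  done
}

# Example:
#   time_per_ip "iglu-lb-<ACCOUNT>.<REGION>.elb.amazonaws.com" /api/meta/health
```

If one address is consistently slow to connect while another is fast, that would point towards the subnet or routing behind that address rather than the Iglu server or the database.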

It is all the endpoints I’ve tried. What is so strange is that it works every now and then; it isn’t consistently slow.

I now tried adding one subnet per availability zone, but there is no obvious improvement.

The CloudWatch metrics in the Monitoring tab for the load balancer, Iglu server and Iglu database look calm.

Can you tell me a VPC setup that is verified to work? How many public and private subnets? What about availability zones?

Or maybe it is the security groups?

iglu-server security group
Inbound

  • type: SSH, protocol: TCP, port: 22, source: 0.0.0.0/0
  • type: Custom TCP, protocol: TCP, port: 8080, source: sg-08ad553c562946650 / iglu-lb

Outbound

  • type: HTTPS, protocol: TCP, port: 443, destination: 0.0.0.0/0
  • type: HTTP, protocol: TCP, port: 80, destination: 0.0.0.0/0
  • type: PostgreSQL, protocol: TCP, port: 5432, destination: sg-0f0cc0f43ab50c185 / iglu-rds
  • type: Custom UDP, protocol: UDP, port: 123, destination: 0.0.0.0/0

iglu-lb security group
Inbound

  • type: HTTPS, protocol: TCP, port: 443, source: 0.0.0.0/0
  • type: HTTP, protocol: TCP, port: 80, source: 0.0.0.0/0

Outbound

  • type: Custom TCP, protocol: TCP, port: 8080, destination: sg-008224009d024b58a / iglu-server

Public / private subnets should be fine, as should any availability zones. I’d dig further into CloudWatch, as there should be some indicator of where the slow response is coming from if those logs are being written out. The security groups shouldn’t have a material impact on response latency.

I actually got a reply from AWS customer support on how to debug this, but it turns out the problem simply disappeared when we switched from the new VPC set up for this purpose to an older one someone else had set up.

I’m not entirely sure how the two VPCs and their surrounding settings differ, but at least this solves the issue for us.


Yeah - that’s odd. Thanks for the update!
