ELB_5XX 504 Errors with Scala Collector


#1

Howdy folks. I’ve been attempting to troubleshoot intermittent (less than one percent of traffic) ELB_5XX errors on our ELB that sits in front of our scala collector ASG. Presently, we have two collectors (c4.large), one each in an AZ, and the ELB performs cross-AZ load balancing. The collector consumes events produced by the javascript tracker. We are not processing (what I would consider) large volumes of requests. The maximum requests per hour we’ve seen is around 6000, and typically hovers around 2500 requests per hour for our busy periods. Regardless of traffic levels, we tend to observe one ELB_5XX per hour (averaged; there isn’t a regular point in time where the error occurs).

We are noticing on our backend that we are occasionally missing certain snowplow data that our own tracking says should be present. I suspect that the ELB_5XX errors are the likely source of this missing data for us. As an example, our own conversion data might indicate that a given interaction was associated with some domain_sessionid or userid value, and yet atomic.events is either empty for those IDs, or at least missing our custom snowplow conversion event (typically the last event to be fired in the funnel).

After enabling ELB access logs to further investigate, every 5XX is actually a 504, and the client interaction typically looks like a couple requests with a 504 littered in somewhere:

2017-03-28T15:30:46.477991Z snowplow-collector-elb 24.93.XXX.XXX:53085 172.31.XXX.XXX:8080 0.000045 0.001724 0.000039 200 200 0 0 "OPTIONS http://snowplow-collector-elb-XXX.us-west-2.elb.amazonaws.com:80/com.snowplowanalytics.snowplow/tp2 HTTP/1.1" "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; GTB7.3; .NET CLR 1.1.4322; .NET CLR 2.0.50727; .NET CLR 3.0.04506.30; .NET CLR 3.0.04506.648)" - -
2017-03-28T15:31:00.717323Z snowplow-collector-elb 24.93.XXX.XXX:53085 - -1 -1 -1 504 0 0 0 "POST http://snowplow-collector-elb-XXX.us-west-2.elb.amazonaws.com:80/com.snowplowanalytics.snowplow/tp2 HTTP/1.1" "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; Trident/4.0; GTB7.3; .NET CLR 1.1.4322; .NET CLR 2.0.50727; .NET CLR 3.0.4506.2152; .NET CLR 3.5.30729)" - -

Sometimes, there will be a few successful POSTs, but one fails. Sometimes it’s the last POST from the IP that fails. Sometimes, like above, it’s only one POST, and it fails.

Correlating the ELB_5XX metric and the ELB latency metric, each 504 corresponds with a spike in the MAX latency that meets the ELB’s timeout value (currently 45 seconds), which typically (to me) would indicate a problem with the backend application.

===========================================================

Amazon’s own documentation mentions two possibilities for these 504s:

(https://docs.aws.amazon.com/elasticloadbalancing/latest/classic/ts-elb-error-message.html#ts-elb-errorcodes-http504)

  • Cause 1: The application takes longer to respond than the configured idle timeout.
  • Cause 2: Registered instances closing the connection to Elastic Load Balancing.

I have spent a fair amount of time attempting to find any explanation for these 504s, and the paired latency spikes with the ELB_5XX errors led me to believe that Cause 1 must be the issue.

I first…

  • correlated ELB, Kinesis, and EC2 logs in CW under the hypothesis that perhaps we’re exceeding throughput on the kinesis stream and the collector is stalling until it can write to the stream again. Dynamic scaling of the shards is something I’ve put off given our low volume. No correlations though.
  • correlated the 504s with the stdout and stderr of the collector (currently with log level at INFO). No indication of errors.
  • correlated the 504s with the instance itself. Maybe the app is getting bounced? some other resource constraint on the instance? Negative.

I also looked into Cause 2, and the default timeouts for spray-can and the ELB were initially the same for our deployment (60 seconds). AWS documentation indicated that the ELB should be responsible for closing connections to the backend, and so I set the ELB timeout to 45 seconds. Maybe we were hitting edge cases where the collector app tore down the connection while the ELB thought it was still valid? Granted, that wouldn’t be paired with a latency spike (it’d be an immediate rejection from the backend that got sent to the client, I’d think), but still… Alas, no. Still receiving 504s.

Then I decided to dive deeper, and enabled flow logs on the relevant ENIs for the collectors and the ELB, and performed a full tcpdump on each of the collector instance’s eth0 interfaces (minus SSH traffic). The flow logs proved not to be particularly helpful, since it seems that the ENIs associated with the load balancer in each AZ are for its outside interfaces (where the ELB access logs already provide more information; and as an aside, the flow log rows seem too broad to be helpful troubleshooting this particular issue since they combine and summarize IP traffic into capture windows).

===========================================================

However, the tcpdumps seem the most helpful. They are showing me… no POST attempt at all. I’ll filter on an X-Forwarded-For for the problem IP in the ELB access logs, and I’ll see the same traffic on the collector, minus the POST that the ELB says timed out. It’s as if the POST never makes it to the collector. I’ll look at the associated tcp stream and not see any other issues with it. Sometimes other POSTs (for other client IPs) will come in afterwards too on that same stream, before the LB eventually FINs the stream from inactivity. Checking the other collector instance doesn’t show any X-Forwarded-For traffic for the problem IP either, so it’s not as if it just got routed to the other collector. Also, checking all traffic on the other instance from the timeframe the successful tracker-collector interactions were happening doesn’t show anything out of order either. Everything appears nominal.

At this point, I checked for dropped packets on the collector instances’ interfaces. Nothing. tcpdump dropping packets from capture? Ugh, yes, over the past five days or so a few were dropped. Could I be so unlucky as to have every last one of those dropped packets be the “missing” POSTs and not have any other helpful logs from previous correlation efforts? Doubtful. Maybe MTU fragmentation was to blame? I enabled ICMP traffic at least, thinking that perhaps the client was attempting to renegotiate MTU and couldn’t due to the SGs on the ELB and collector not permitting ICMP traffic. No change.

===========================================================

At this point, I’m kind of at a loss as to what might be causing these ELB 504s. My best guess is some sort of network wonkiness between the ELBs’ inside interfaces and the collector interfaces. Has anyone else run into this issue before? Is there something I missed in my troubleshooting? Do I need to pay AWS for the Developer-level support plan for a month so that they can take a look at things? If it turns out I’ve missed something, that’s one thing. If it turns out I have to pay AWS to tell them about something that turns out to be on their end… well, that’s a more sour pill.


#2

Hi @aspensmonster - this is an incredible diagnostic, huge thanks for setting it out so clearly and in so much detail.

As you say, I think you have isolated this to a networking issue between the ELB and the collectors. There is a lot of magic in the AWS ELBs - AWS support have a lot more access to the inner workings of those ELBs than is publically viewable. So yes I think in this case you need to upgrade your support level, take this excellent diagnostic to them and ask them what is happening…


#3

After working with AWS support, I’ve learned more about the tools I’m using at least :slight_smile: Turns out the problem existed between my chair and keyboard in this case.

I was utilizing Wireshark to analyze the captured packets from the collector instances, and utilizing a display filter:

http.x_forwarded_for == 127.0.0.20

for a given client IP address that received a 504. When going through the results, I’d always be short one POST, leading to my conclusion that the ELB was dropping events.

“Roxana C” at AWS looked at the same pcap capture file and figured out what I was doing wrong. There is a setting called

“Allow subdissector to reassemble TCP streams”

for the TCP protocol preferences. This was enabled on my wireshark install, and since the problematic POST didn’t have any associated data to accompany it (no POST body), the subdissector presumably couldn’t reassemble the event, leaving it out entirely from the displayed packets.

When disabling this setting, I was then able to see that the POSTs that 504’d each only had the HTTP POST headers, but no corresponding body.

No.   Time                       Source       Destination  Protocol Length Info 
20035 2017-04-03 17:47:50.902227 172.31.34.53 172.31.29.39 HTTP 602 POST /com.snowplowanalytics.snowplow/tp2 HTTP/1.1 
20036 2017-04-03 17:47:50.902242 172.31.29.39 172.31.34.53 TCP 66 8080 → 36122 [ACK] Seq=1 Ack=537 Win=28032 Len=0 TSval=934513497 TSecr=2977449947 
20351 2017-04-03 17:48:35.249291 172.31.34.53 172.31.29.39 TCP 66 36122 → 8080 [FIN, ACK] Seq=537 Ack=1 Win=18176 Len=0 TSval=2977461034 TSecr=934513497 
20353 2017-04-03 17:48:35.249414 172.31.29.39 172.31.34.53 TCP 66 8080 → 36122 [FIN, ACK] Seq=1 Ack=538 Win=28032 Len=0 TSval=934557844 TSecr=2977461034 
20357 2017-04-03 17:48:35.250199 172.31.34.53 172.31.29.39 TCP 66 36122 → 8080 [ACK] Seq=538 Ack=2 Win=18176 Len=0 TSval=2977461034 TSecr=934557844

Above, we received POST headers, but no corresponding body. 45 seconds later, the ELB terminated the connection and reported a 504 as expected.

Basically, it looks like in these cases, the javascript tracker attempted to POST event payloads, but was only able to send the headers and not the subsequent body. I’m guessing that these are just edge cases where tabs or browsers get closed as payloads are being POSTed to the collector. As an example, our snowplow conversion event doesn’t fire until our own backend submission is completed, so perhaps the user is leaving the page right after the submission completes and just as the snowplow conversion event begins to fire? That explanation would at least jive with the fact that most of our 504s are on the last event seen in the ELB access logs for the given client IP address (i.e., no further traffic was received from the IP after the 504).

====================================================================

In these cases --where the collector receives a POST header but no corresponding body-- should the “proper” outcome be an ELB 504 returned to the client, with the ELB tearing down the connection to the collector backend after hitting the idle timeout waiting for the collector to respond? It feels like the wrong response here, since it’s not really the server timing out after receiving a request, but rather the client never sending a full request to act upon in the first place. Having the collector report a 408 back to the client makes more sense (to me at least). Is there a spray-can configuration that can send back 408s (or some other code) when an HTTP request is not fully received after N seconds (where N is presumably less than the ELB idle timeout)? Something like nginx’s client_body_timeout configuration parameter?

More generally, what approaches are there for dealing with the case where the javascript tracker might not send off all events?


#4

Any update on the issue. I am continously facing this issue on my load balancer.