Performance issues under high load

Recently at work I came across a familiar situation after migrating a high-load application from Adobe ColdFusion Enterprise Edition to the open source Lucee CFML engine.


After the migration we found ourselves having to tear up the book and start again. Four or five years of meticulous fine-tuning of caching, request timeout and queuing settings were thrown out the window, and we started from scratch because our entire infrastructure was effectively replaced. The only constants are our codebase and storage layers.

What did we do?

  • Upgraded the operating system from RHEL 6.8 to RHEL 7.4 (which swaps SysV init for systemd)
  • Replaced Adobe ColdFusion 9 with Lucee 5.23
  • Upgraded Apache from 2.2 to 2.4 and subsequently migrated the (very elaborate) configuration
  • Split our cluster of 12 servers into 8 frontend servers and 4 servers dedicated to API requests

How does it perform after migration?

In low-load scenarios the throughput seems much better and vertical scalability seems easier. The big issue is horizontal scalability and load spreading :-(
A typical server has 8 GB of memory and 2 CPU cores in a shared virtualized environment.

During brief periods of overload, all requests to a server were timing out. This is a catastrophic failure, where the server completely stops serving.
We could add more capacity to prevent overload, but we don't want to throw hardware at the problem; we also want to fix the server itself.
Instead of exploding, it should answer as many requests as possible, while rejecting the rest. Catastrophic failures like this usually happen because the server is processing too many requests at the same time. Typically, the server runs out of memory or requests take too long, causing whatever is making the request to give up. The solution is to limit the maximum number of concurrent requests, so some fraction of requests fail, but the others are processed. The challenge is finding the "right" limit. It can vary hugely depending on the software and the hardware. It also changes as the system evolves, so today's correct value is tomorrow's performance bottleneck.

 
The Plan

Limit the number of concurrent requests in all servers to avoid cascading failures. We need the limit to be well below the "catastrophic" failure point for our service, but setting it too low will limit the throughput of our system. It is a very fine balance, and if I'm forced to guess, I start with a limit set to some multiple of the number of CPUs, e.g. 3× the number of CPUs if you assume that your requests spend about a third of their time executing code and two-thirds of their time waiting (e.g. on other network requests or on disk). However, it really is best to determine the limit by forcing your service to explode, because the "right" limit can vary enormously. If you want to optimize things, you can get lower "tail" latency by adding a queue and using a lower concurrent request limit. However, this requires tuning, so I think a simple concurrent limit is good enough in most cases. If we really want the lowest latency, a FIFO queue with a "drop head" policy should decrease latency in overload scenarios.
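
To make the idea concrete, here is a minimal sketch of what a hard concurrency cap could look like on the Java servlet stack that Lucee runs on. It is an illustration only, not our production configuration: the filter name, the 503 response and the limit of 6 permits (3× our 2 CPU cores) are assumptions.

import java.io.IOException;
import java.util.concurrent.Semaphore;
import javax.servlet.*;
import javax.servlet.http.HttpServletResponse;

// Hypothetical filter: caps in-flight requests and rejects the overflow cheaply.
public class ConcurrencyLimitFilter implements Filter {
    private static final int MAX_CONCURRENT = 6;   // assumed: 3 x 2 CPU cores
    private final Semaphore permits = new Semaphore(MAX_CONCURRENT);

    public void init(FilterConfig config) {}
    public void destroy() {}

    public void doFilter(ServletRequest req, ServletResponse res, FilterChain chain)
            throws IOException, ServletException {
        if (!permits.tryAcquire()) {
            // Rejecting costs almost nothing: no application code runs for this request.
            ((HttpServletResponse) res).sendError(503, "Server busy, please retry");
            return;
        }
        try {
            chain.doFilter(req, res);   // normal processing
        } finally {
            permits.release();          // always free the slot
        }
    }
}

The same effect can often be had from the container's own thread pool and connector settings; the important property is that the rejection path does almost no work.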


Why limit requests?

To start, let's keep it simple, since it helps to understand the basics. Imagine we have a server with a single CPU that processes requests by doing 10 ms of computation. Ideally, this server can process 100 requests per second. If requests arrive at a fixed rate less than or equal to 100 requests per second, then the server responds to every request after 10 ms. However, what happens if requests start arriving at 101 requests per second? Well, after one second the server has only been able to process 100 requests, so it is now processing 2 requests at the same time. The operating system will attempt to share the CPU, so each request now takes 20 ms. The server still responds to 100 requests per second, but the latency has increased. As the overload continues, the server processes more and more concurrent requests, which increases the latency further. Eventually there are an effectively infinite number of requests in the server, and it takes an effectively infinite amount of time to respond.
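
To put rough numbers on that, here is a back-of-the-envelope sketch of my own (assuming perfectly fair CPU sharing and a constant 101 requests per second): the backlog grows by about one request per second, and latency grows with it.

// Tiny model: arrivals at 101 req/s against a capacity of 100 req/s, so the
// backlog grows by roughly one request per second and latency grows with it.
public class OverloadBacklog {
    public static void main(String[] args) {
        double serviceMs = 10.0;                  // 10 ms of CPU per request
        double arrivalRate = 101.0;               // requests per second
        double capacity = 1000.0 / serviceMs;     // 100 requests per second
        for (int second = 10; second <= 60; second += 10) {
            double backlog = (arrivalRate - capacity) * second;  // excess requests in flight
            double latencyMs = (backlog + 1) * serviceMs;        // each shares the CPU
            System.out.printf("after %2ds: ~%3.0f concurrent, ~%4.0f ms latency%n",
                    second, backlog, latencyMs);
        }
    }
}

The exact figures don't matter; the point is that latency keeps climbing for as long as the overload lasts.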


This is why servers without a limit on the number of requests they process concurrently tend to fail catastrophically. The real world doesn't work with infinite latency: rising latency causes clients to retry, click links again and reload pages over and over, which adds even more load to the system. In the worst cases, the server hits some resource limit. A very common one is running out of memory, which causes it to get killed and restarted. That shifts load to the other servers, causing them to run out of memory in turn, which takes out the entire cluster that was carefully replicated for "reliability."


The solution seems easy, right? Limit the number of requests being processed at the same time, and reject the others. Rejecting a request should be much cheaper than processing one, so the server should be able to respond with 10-1000× more rejections per second than "real" responses. It is probably still possible to cause a catastrophic failure, but for most applications the rate of rejections per second the server can sustain is so far beyond what the system normally processes that it is effectively infinite.


Setting concurrent request limits

Now that we know why we need a concurrent request limit, how do we determine the specific value to use? The CPU-bound server we have been talking about is relatively straightforward: if we limit it to one concurrent request, then each request gets processed in exactly 10 ms, and the server can process 100 requests per second if they arrive perfectly spaced out. However, in reality requests arrive in bursts. In that case, the server will reject requests that arrive close together, and will sit idle while waiting for the next request. If instead we allow more concurrent requests, then in "busy" periods processing time goes up. In the worst case, if we accept too many requests, we run out of memory and cause a catastrophic failure again.


The real world is much more complex. Requests vary enormously in their resource consumption. For example, some might have cached results, others may read data from disk/DB, and some might use multiple CPUs. The "right" number of requests in flight to maximize throughput and minimize latency depends on the mix of requests coming in. The bottlenecks also change along with the code, data and the physical systems running the software.

An "ideal" solution probably requires monitoring critical resources, and admitting requests only when there is excess capacity. However, setting a fixed limit per server is far simpler, and gets most of the benefit since most applications tend to have pretty "regular" workloads. The critical part is that we must pick a limit that is well below the "catastrophic" limit for your server.

The best way of figuring that out is to make the server fail in a controlled experiment, such as a stress test with JMeter, but doing that against the live production system is not appropriate. It is also possible to "guess and check" by observing the system in production, if we are willing to let the server break a few times while we get it wrong. Otherwise, in terms of getting good performance, it is probably best to set the limit fairly high. Most systems do a reasonable job of sharing resources between concurrent requests, so this will maximize throughput at the expense of worst-case latency. In many systems, rejecting requests is much worse than a bit of extra latency.
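
As a rough illustration of such a controlled experiment (and no substitute for a proper JMeter test plan), even a simple ramp test like the sketch below will show where latency and errors start to take off. The URL, the step sizes and the request counts are made up for the example.

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.*;

// Ramp up the number of concurrent callers and watch where latency/errors blow up.
// Requires Java 11+ for java.net.http.HttpClient.
public class RampTest {
    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(
                URI.create("http://test-server.example/healthcheck")).build();  // assumed URL

        for (int concurrency = 2; concurrency <= 64; concurrency *= 2) {
            ExecutorService pool = Executors.newFixedThreadPool(concurrency);
            List<Future<Long>> results = new ArrayList<>();
            for (int i = 0; i < concurrency * 50; i++) {         // 50 requests per caller
                results.add(pool.submit(() -> {
                    long start = System.nanoTime();
                    client.send(request, HttpResponse.BodyHandlers.ofString());
                    return (System.nanoTime() - start) / 1_000_000;   // milliseconds
                }));
            }
            long worst = 0, failures = 0;
            for (Future<Long> f : results) {
                try { worst = Math.max(worst, f.get()); } catch (ExecutionException e) { failures++; }
            }
            pool.shutdown();
            System.out.printf("concurrency %2d: worst %d ms, %d failures%n",
                    concurrency, worst, failures);
        }
    }
}

Run it against a test environment that mirrors production as closely as possible; the concurrency level where the numbers blow up is the ceiling to stay well below.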


Queuing versus concurrent requests

Now that we have limited the number of requests in flight, what should we do when we reach the limit? The simplest answer is to reject the requests, returning some sort of error. However, for systems like ours where efficiency is important, we can do better. The key observation is that as long as the server can respond to a request in an "acceptable" amount of time, it should process it. The challenge is that the definition of "acceptable" can vary widely. For backend tasks, "acceptable" might be within an hour. For a request a human is waiting on, a few seconds is probably the absolute maximum. For a request that is part of a much larger distributed system, it might need to respond within 50 ms for the overall task to complete in a reasonable time. Letting requests wait in a queue allows the server to accept bursts of work and stay busy during future "quiet" periods. This increases utilization and decreases the probability of rejecting a request, at the expense of some increase in latency.


We can achieve a similar effect by just allowing the server to process more requests concurrently. However, this leads to worse latency than a correctly tuned request limit and queue. Consider our single CPU server that processes requests in 10 ms. Imagine a burst of 5 requests arrives at the same time. If we process all 5 at the same time, then the entire burst completes after 50 ms. If instead we accept one request at a time and queue the rest, the requests finish after 10, 20, 30, 40, and 50 ms. The worst case is the same, but the distribution is much better with the queue. My conclusion is that an ideal server will have a concurrent request limit that maximizes throughput, plus a small queue to absorb "bursts" of requests.
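
In Java terms (again just a sketch of the idea, not how Lucee wires this internally), that combination of a fixed concurrency limit, a small FIFO queue and rejection of the overflow maps directly onto a bounded thread pool; the pool size and queue size below are illustrative guesses.

import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class LimitedExecutor {
    public static ThreadPoolExecutor create() {
        int workers = 6;    // concurrent request limit (assumed, as above)
        int queueSize = 4;  // small buffer to absorb bursts
        return new ThreadPoolExecutor(
                workers, workers,                       // fixed-size pool
                0L, TimeUnit.MILLISECONDS,
                new ArrayBlockingQueue<>(queueSize),    // bounded FIFO queue ("drop tail")
                new ThreadPoolExecutor.AbortPolicy());  // reject with an exception when full
    }

    public static void main(String[] args) {
        ThreadPoolExecutor pool = create();
        for (int i = 0; i < 20; i++) {
            final int id = i;
            try {
                pool.execute(() -> {
                    try { Thread.sleep(10); } catch (InterruptedException ignored) {}
                    System.out.println("processed request " + id);
                });
            } catch (RejectedExecutionException e) {
                System.out.println("rejected request " + id);   // would map to a 503
            }
        }
        pool.shutdown();
    }
}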
The most common queue policy seems to be "first in, first out" (FIFO), where arriving requests are dropped when the queue is full (sometimes called "drop tail"). This is probably not the best policy for overloaded servers. Imagine our server is processing a burst of requests, with a request arriving every 5 milliseconds, and it has a queue of 3 requests. At steady state, 50% of the requests are dropped immediately, and the others are processed with 40 ms latency (30 ms queue, 10 ms processing). If instead we drop the oldest request in the queue (the one at the head of the queue instead of the tail), then 50% of requests are dropped after waiting 15 ms in the queue, but the remainder are processed after only 25 ms (15 ms queue, 10 ms processing), since the oldest request was dropped. This provides better latency for the successfully processed requests, at the cost of rejecting requests later. In my opinion this is probably the better policy for most applications. However, implementing it is a bit more complicated, since you need the ability to reject a request after it has been queued, which has to be supported inside the application server.
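
For what it's worth, a drop-head admission policy is not hard to express on its own; the sketch below is a hypothetical helper (not something Lucee exposes, as far as I know), where the evicted request is handed back to the caller to be rejected with a 503.

import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Hypothetical "drop head" admission queue: when full, evict the oldest
// waiting request so the newest arrival can be queued instead.
public class DropHeadQueue<T> {
    private final BlockingQueue<T> queue;

    public DropHeadQueue(int capacity) {
        queue = new ArrayBlockingQueue<>(capacity);
    }

    // Returns the evicted (oldest) item if the queue was full, or null otherwise.
    // The caller is responsible for rejecting the evicted request (e.g. with a 503).
    public synchronized T offerDroppingHead(T item) {
        T evicted = null;
        while (!queue.offer(item)) {
            evicted = queue.poll();   // drop the request at the head of the queue
        }
        return evicted;
    }

    // Worker threads block here for the next request to process.
    public T take() throws InterruptedException {
        return queue.take();
    }
}

The evicted request has already been accepted and has sat in the queue, so rejecting it is slightly more expensive than a straight drop-tail rejection; the pay-off is lower latency for the requests that do get processed.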


The table below shows the arrival time, completion time, and latency for the first 11 requests that arrive, under both queue policies (assuming requests keep arriving every 5 ms after the eleventh). The server has a processing time of 10 ms and permits 3 requests to be queued. Requests arrive every 5 ms. The server has reached steady state by the end of this trace, and every later request will either be dropped or processed with the same latencies shown.

Request  Arrives   Drop tail (reject the new arrival)    Drop head (evict the oldest queued)
   1       0 ms    completes at 10 ms (latency 10 ms)    completes at 10 ms (latency 10 ms)
   2       5 ms    completes at 20 ms (latency 15 ms)    completes at 20 ms (latency 15 ms)
   3      10 ms    completes at 30 ms (latency 20 ms)    completes at 30 ms (latency 20 ms)
   4      15 ms    completes at 40 ms (latency 25 ms)    completes at 40 ms (latency 25 ms)
   5      20 ms    completes at 50 ms (latency 30 ms)    dropped at 35 ms (after 15 ms)
   6      25 ms    completes at 60 ms (latency 35 ms)    completes at 50 ms (latency 25 ms)
   7      30 ms    completes at 70 ms (latency 40 ms)    dropped at 45 ms (after 15 ms)
   8      35 ms    rejected on arrival                   completes at 60 ms (latency 25 ms)
   9      40 ms    completes at 80 ms (latency 40 ms)    dropped at 55 ms (after 15 ms)
  10      45 ms    rejected on arrival                   completes at 70 ms (latency 25 ms)
  11      50 ms    completes at 90 ms (latency 40 ms)    dropped at 65 ms (after 15 ms)
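
The numbers above come from a small deterministic simulation I put together to check the arithmetic; it is an illustration of the two policies, not code from our servers.

import java.util.ArrayDeque;
import java.util.Deque;

// One worker with 10 ms service time, a queue of 3, arrivals every 5 ms,
// comparing "drop tail" (reject the new arrival) with "drop head" (evict the oldest).
public class QueuePolicySim {
    public static void main(String[] args) {
        simulate(false);   // drop tail
        simulate(true);    // drop head
    }

    static void simulate(boolean dropHead) {
        System.out.println(dropHead ? "-- drop head --" : "-- drop tail --");
        final int serviceMs = 10, arrivalGapMs = 5, queueCap = 3, reportUpTo = 11;
        Deque<int[]> queue = new ArrayDeque<>();   // each entry: {id, arrivalTime}
        int[] inService = null;
        int serviceEndsAt = 0;
        int nextId = 1;

        // Step through time in 5 ms ticks. Arrivals continue past request 11 so the
        // system reaches steady state, but only the first 11 requests are reported.
        for (int t = 0; t <= 200; t += arrivalGapMs) {
            if (inService != null && t >= serviceEndsAt) {   // finish the current request
                report(inService, serviceEndsAt, "completed", reportUpTo);
                inService = null;
            }
            if (inService == null && !queue.isEmpty()) {     // start the next queued request
                inService = queue.pollFirst();
                serviceEndsAt = t + serviceMs;
            }
            int[] arrival = {nextId++, t};                   // a new request arrives
            if (inService == null) {
                inService = arrival;
                serviceEndsAt = t + serviceMs;
            } else if (queue.size() < queueCap) {
                queue.addLast(arrival);
            } else if (dropHead) {
                report(queue.pollFirst(), t, "dropped", reportUpTo);  // evict the oldest
                queue.addLast(arrival);
            } else {
                report(arrival, t, "dropped", reportUpTo);            // reject the newest
            }
        }
    }

    static void report(int[] req, int atTime, String what, int maxId) {
        if (req[0] <= maxId) {
            System.out.printf("request %2d: arrived %3d ms, %s at %3d ms (%d ms after arrival)%n",
                    req[0], req[1], what, atTime, atTime - req[1]);
        }
    }
}

It prints one line per request for each policy, which is where the completion times and latencies in the table come from.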


