Connection management and DoS protections in ShimmerCat
Software is fragile, and the Internet is a hostile place where script kiddies, hackers and bots do their best to take a website down, often in very sophisticated ways. While there is no 100% guarantee that all these incidents can be avoided, there are well-known ways around the most common pitfalls, some of which we have implemented in ShimmerCat. A few of these measures are a must-know for anybody intending to handle heavy traffic with the accelerator, and others may cause unexpected effects from time to time. Thus, we are documenting them here.
Ensure that the soft-limit of the number of open connections for ShimmerCat processes is at least 8192, but go much higher if possible.
Due to the lack of sane defaults, connection limits are the most important topic in this document. They are part of the limits mechanism in Linux, they are on by default on all modern Linux distributions, and as we will see below, their default settings are inadequate for a modern website.
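To see which limits a process actually inherits, the following minimal sketch reads the soft and hard file-descriptor limits of the current process with Python's standard `resource` module and compares the soft limit against the 8192 minimum recommended above. The helper names are made up for this illustration and are not part of ShimmerCat.

```python
import resource

# Minimum soft limit recommended in this document.
RECOMMENDED_SOFT_LIMIT = 8192


def descriptor_limits():
    """Return the (soft, hard) RLIMIT_NOFILE pair for the current process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


def soft_limit_is_adequate(soft, recommended=RECOMMENDED_SOFT_LIMIT):
    """True if the soft limit is unlimited or at least the recommended value."""
    return soft == resource.RLIM_INFINITY or soft >= recommended


if __name__ == "__main__":
    soft, hard = descriptor_limits()
    print(f"soft={soft} hard={hard} adequate={soft_limit_is_adequate(soft)}")
```

Running the check from the same shell or service unit that launches ShimmerCat is what matters, since limits are inherited from the parent process.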
A bit of history on ShimmerCat connection management
With the default configuration, a Linux process cannot open more than 1024 file descriptors. One file descriptor is needed for each network connection, and one for each open file on disk, among other uses.
Before qs-2590, we worked with the default limit of 1024 file descriptors, and we had a mechanism called Tidal to aggressively reduce the number of file descriptors in use once a worker reached 250 visitor connections. Our rationale was that, to handle more visitors, one could simply increase the number of workers using the --worker-count option in ShimmerCat.
The first limitation with that approach is that ShimmerCat uses Redis to coordinate tasks and hold locks for static assets, and a greater number of worker processes results in more connections to Redis. So instead of the ShimmerCat workers themselves running out of file descriptors, Redis is the one that runs out.
The second limitation is that when an HTTP/2 connection is used to fetch a page on a site, each asset of that page has to be located on the filesystem or fetched via the network, and each such transfer takes a file descriptor of its own. If the assets are small, live in the local filesystem, and all transfers complete quickly, the file descriptor for an asset is in use for only a few milliseconds, after which it is available for re-use.
So, our initial approach worked more or less well for a few years, until we started working with loaded sites with remote backends. In those, a file descriptor associated with a transfer can be open for much longer, and the limit of 250 visitor connections becomes too high, since each of those connections can have dozens of file descriptors associated with asset transfers. As a result, the number of I/O errors due to file-descriptor exhaustion increases.
Connection handling from qs-2590 onwards
Thus, from qs-2590 connections are handled in a different way. Now, ShimmerCat finds out the soft file descriptor limit, and divides that number into several portions. First, a fixed-size "reserved" portion of size 200 is booked for ancillary short-lived file operations which are not directly related to server traffic. Of what is left, a very small proportion is allocated for visitor connections, and the rest is put in separate pools for static asset transfers and application backend requests.
The exact number of connections, together with a warning if the numbers are low, is logged early in the life of a ShimmerCat process:
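The partitioning just described can be sketched as follows. Only the reserved size of 200 comes from this document; the `visitor_share` and `static_share` fractions are hypothetical values picked for illustration, since the real proportions are not stated here, so the numbers this sketch produces will not match what ShimmerCat logs.

```python
# Fixed-size reserved portion, as stated in this document.
RESERVED = 200


def partition_descriptors(soft_limit, visitor_share=0.02, static_share=0.75):
    """Split a soft descriptor limit into reserved, visitor, static and proxied pools.

    visitor_share and static_share are hypothetical fractions, not
    ShimmerCat's real proportions.
    """
    reserved = min(RESERVED, soft_limit)
    available = soft_limit - reserved
    # A very small proportion of what is left goes to visitor connections.
    visitors = int(available * visitor_share)
    # The remainder is divided between static asset transfers and backend requests.
    static_files = int((available - visitors) * static_share)
    proxied = available - visitors - static_files
    return {
        "reserved": reserved,
        "visitors": visitors,
        "static_files": static_files,
        "proxied": proxied,
    }


if __name__ == "__main__":
    print(partition_descriptors(8192))
```

Whatever the real proportions are, the point of the design is that one kind of traffic cannot starve the others of descriptors.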
.info. 07:44:20,2018-10-17 M40023 Descriptor limits derived from rlimit : 12 visitor connections, 687 static files, 187 proxied connections
.attent. 07:44:20,2018-10-17 M20003 Know this! Your file descriptor limit is too low, and because of your scrimpiness, kittens will wail and visitors will queue
Furthermore, whenever usage of any of the connection pools reaches its limit, further operations that need file descriptors from that pool are queued.
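Queueing at the pool limit can be illustrated with a counting semaphore. This is a minimal sketch using Python's `asyncio`, not ShimmerCat's implementation; the `DescriptorPool` class and the `demo` function are hypothetical names introduced for the example.

```python
import asyncio


class DescriptorPool:
    """Bounded pool: at most `limit` operations run at once, the rest wait."""

    def __init__(self, limit):
        self._slots = asyncio.Semaphore(limit)
        self.in_use = 0
        self.peak = 0  # highest number of simultaneous operations observed

    async def run(self, operation):
        # Callers beyond the limit are suspended here until a slot frees up.
        async with self._slots:
            self.in_use += 1
            self.peak = max(self.peak, self.in_use)
            try:
                return await operation()
            finally:
                self.in_use -= 1


async def demo(limit=3, jobs=10):
    """Run `jobs` simulated transfers through a pool of size `limit`."""
    pool = DescriptorPool(limit)

    async def transfer():
        await asyncio.sleep(0.01)  # stands in for a file or network transfer

    await asyncio.gather(*(pool.run(transfer) for _ in range(jobs)))
    return pool.peak


if __name__ == "__main__":
    print("peak simultaneous transfers:", asyncio.run(demo()))
```

Queued operations are delayed rather than failed, which is why a limit that is too low shows up as latency instead of I/O errors.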
Whenever handling real traffic, ShimmerCat employs transient workers, the number of which can be controlled with the --worker-count option:
shimmercat internet --worker-count 5 ...
Each worker can handle the number of simultaneous connections posted in message M40023. For example, if M40023 announces 12 visitor connections and there are five workers, a total of 60 connections from browsers and other user agents can be open at once.
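As a worked example of that arithmetic, the sketch below extracts the three pool sizes from an M40023 line like the one shown earlier and multiplies the visitor figure by the worker count. The regular expression is an assumption based solely on the sample log line in this document; the real message format may vary between versions.

```python
import re

# Pattern inferred from the sample M40023 line in this document (an assumption).
M40023_PATTERN = re.compile(
    r"M40023 .*?(\d+) visitor connections, (\d+) static files, "
    r"(\d+) proxied connections"
)


def parse_m40023(line):
    """Return the per-worker pool sizes from an M40023 line, or None."""
    match = M40023_PATTERN.search(line)
    if match is None:
        return None
    visitors, static_files, proxied = (int(group) for group in match.groups())
    return {"visitors": visitors, "static_files": static_files, "proxied": proxied}


def total_visitor_connections(line, worker_count):
    """Visitor connections across all workers: per-worker limit times workers."""
    limits = parse_m40023(line)
    if limits is None:
        raise ValueError("not an M40023 line")
    return limits["visitors"] * worker_count


if __name__ == "__main__":
    sample = (
        ".info. 07:44:20,2018-10-17 M40023 Descriptor limits derived from "
        "rlimit : 12 visitor connections, 687 static files, 187 proxied connections"
    )
    print(total_visitor_connections(sample, worker_count=5))
```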
The qualifier "transient" applies because, by design, these workers are short-lived, with optimal lifetimes of between one and two hours. We have made the workers transient for two reasons:
- To make it possible to refresh the configuration of the accelerator as frequently as needed, and
- to decrease any consequences from cumulative degradation ... we have a stringent test suite that helps us prevent issues we have experienced in the past, but it is still good to have a fallback.
Each worker is composed of two actual processes in a so-called lane, with an active and a waiting process in each lane. Initially, the active member binds the configured server ports using SO_REUSEPORT. At a marked time which is independent for each lane, the active member of the pair enters programmed death (apoptosis), sending the bound listening sockets to the waiting process, which thus transitions to active.
To minimize disruptions, the process which is being terminated uses a grace period of a few seconds, during which network connections are progressively culled as they become idle. The grace period duration is configurable (it's the entry lazyClientsWait: ... in the tweaks.yaml file inside the scratch folder), and should be adjusted if you want to, e.g., let very long file transfers go through the server.
After the grace period, the apoptotic process ends and the master replaces it with a new one, which enters the waiting state. From there the cycle repeats.
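The handover of a listening socket between the two members of a lane can be sketched with standard Unix primitives. This is an illustration under stated assumptions, not ShimmerCat's code: it requires Linux and Python 3.9 or later (for `socket.send_fds` and `socket.recv_fds`), the function names are hypothetical, and both ends of the control channel live in one process here purely to keep the example self-contained.

```python
import socket


def bind_reuseport(host="127.0.0.1", port=0):
    """Create a listening TCP socket with SO_REUSEPORT set (port 0 = any free port)."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEPORT, 1)
    listener.bind((host, port))
    listener.listen(16)
    return listener


def hand_over(listener, channel):
    """Active side: send the listening descriptor over a Unix socket, then drop it."""
    socket.send_fds(channel, [b"L"], [listener.fileno()])
    listener.close()


def take_over(channel):
    """Waiting side: receive the descriptor and wrap it in a socket object."""
    _data, fds, _flags, _addr = socket.recv_fds(channel, 1, 1)
    return socket.socket(fileno=fds[0])


if __name__ == "__main__":
    # In a real lane, the two ends would belong to two different processes.
    active_end, waiting_end = socket.socketpair()
    listener = bind_reuseport()
    print("active process listens on", listener.getsockname())
    hand_over(listener, active_end)
    inherited = take_over(waiting_end)
    print("waiting process now listens on", inherited.getsockname())
```

Because the descriptor is passed rather than the port being re-bound, the listening socket and its accept queue stay alive across the switch, so no incoming connection is refused while one process replaces the other.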
In-transit TLS handshakes
TLS handshakes are expensive, and a simple way to DDoS a server is to initiate a lot of them together. For that reason, ShimmerCat has for several years allowed only a fixed number of TLS handshakes per IP address to be simultaneously active, timing them out if they take more than 3 seconds.
However, a timeout of 3 seconds is a bit too short for mobile visitors accessing a site via a GPRS connection, e.g. from the Scandinavian Mountains. From qs-2590, we have increased the default timeout to 60 seconds.
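A per-address cap with a timeout can be sketched as below. The 60-second default is taken from this document; the cap of 4 handshakes per address is a hypothetical value, since the actual number is not stated here, and the `HandshakeLimiter` class is an illustration rather than ShimmerCat's implementation.

```python
import time


class HandshakeLimiter:
    """Track in-transit TLS handshakes per IP address."""

    def __init__(self, max_per_ip=4, timeout=60.0):
        self.max_per_ip = max_per_ip  # hypothetical cap, not ShimmerCat's value
        self.timeout = timeout        # default timeout from qs-2590 onwards
        self._started = {}            # ip -> list of handshake start times

    def _live(self, ip, now):
        """Forget handshakes older than the timeout and return the rest."""
        live = [t for t in self._started.get(ip, []) if now - t < self.timeout]
        if live:
            self._started[ip] = live
        else:
            self._started.pop(ip, None)
        return live

    def try_begin(self, ip, now=None):
        """Return a token if a new handshake may start, or None if over the cap."""
        now = time.monotonic() if now is None else now
        if len(self._live(ip, now)) >= self.max_per_ip:
            return None
        self._started.setdefault(ip, []).append(now)
        return now

    def finish(self, ip, token):
        """Release the slot of a handshake that completed or failed."""
        started = self._started.get(ip, [])
        if token in started:
            started.remove(token)
            if not started:
                del self._started[ip]
```

A real server would also abort the timed-out handshake itself; this sketch only frees its slot so that the address can start a new one.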