At ADITION technologies, we develop an adserver that processes more than one billion requests per day. In order to use our data center resources most efficiently, we want to scale up the number of requests we can handle per node before scaling out the number of nodes in the cluster. In this article, we give an overview of how we achieved better scalability by improving the concurrency inside the adserver. This allowed us to reduce the number of servers by a factor of six and significantly cut the cluster’s TCO.
Ad-request data flow
Online marketing is undergoing drastic changes. Historically, a website would trigger a request for an ad from the user’s browser, and the adserver responded directly with a banner. For each delivery, a fixed amount of money was paid.

Nowadays, more and more ad traffic is auctioned in real-time to dynamically find an optimal price. The user’s browser sends a request to an auctioning platform and multiple adservers can place a bid for the impression. The adserver delivers the ad only if its bid wins the auction.

The big challenge is to keep the response times low because there are hard time limits for the auction. Thus, parallelizing the adserver must not increase the latency while enabling it to handle more requests per second.
Back-end architecture
Our back-end infrastructure consists of heterogeneous services that run on roughly 350 servers. The entry point for the user’s browser is the load balancer, which distributes the traffic over the adservers. In total, our system handles 250-300k requests/sec.

Customers book their campaigns and banners through our API, either directly via SOAP or through our front-end web interface. These business objects are stored in a relational database. When business objects are created or changed, the adservers are notified to reload the specific object. A fast Aerospike key-value store is used to hold user data and profile information.
While delivering ads, we store roughly 2 TB of logging data per day. The data is forwarded into our BI pipeline, where it is processed so that customers can later access their performance data for billing, profile generation and data analysis.
Ad selection
The adserver has a large code base. It was not feasible to develop a highly parallel replacement from scratch. Instead, parallelizing the existing adserver code was an iterative process over a period of 1.5 years. In the beginning, a deep understanding of the internal control and data flow was necessary.
The essential task of our adserver is to find the best matching ad-banner for a request. This works as follows: (1) A website that wants to monetize its content contains placeholders for ads, which we call content units. When the website is loaded, the content unit fires a request to our adserver. (2) The ads are organized in order, campaign and banner objects, which allow assigning properties and restrictions for the delivery. For example, you can restrict the delivery of an ad to female users with an estimated age between 25 and 45. (3) The request is enriched with user data to give the ad selection a context. (4) The filters are evaluated to select a fitting banner, which is then rendered and sent back.

Most of the complexity and execution time is consumed by the filter evaluation. The filters can be as simple as restricting the delivery to a specific set of content units or ensuring that the ad fits in terms of image size. But they can also be more advanced, such as restricting the delivery frequency for an individual user or checking properties of the user’s profile information. Many filters can be combined to create a powerful rule set to control and steer the delivery of ads. Before the parallelization efforts described in this article, the filter evaluation was a bottleneck and led to full load on one CPU core, whilst others were almost idling. This prevented us from fully utilizing modern CPUs, which have an ever-increasing number of cores.
The filter evaluation uses customer business objects as input state. These include banners, campaigns and filters. Customers constantly update these objects to adjust the delivery at run-time, which happens concurrently with the ad selection evaluating the same filters. Thus, for a parallel ad selection, the access to the business objects must be synchronized. In the beginning, the synchronization was based on a global lock and there was no locking scheme that assigned locks to individual variables.
Thread management and profiling
For the implementation and compilation of the adserver code, we are able to use the latest GCC and Clang compiler versions and can benefit from the great improvements of C++11 and C++17. For maintenance reasons, we tried to use the standard library as much as possible. For specific concurrency problems we found solid solutions in Folly, the open-source C++ library developed and used at Facebook.
Thread management
We developed our thread management based on std::thread and maintain a global registry where all threads are registered. This is used for statistics output and to enable per-thread profiling via the Linux perf_event interface, e.g. to measure the IPC and see whether a thread is doing work when scheduled on the CPU. The intention of our custom threading objects is to maintain an overview of how many threads are necessary to handle the workload. Before unifying this, a mixture of pthread, boost::thread_group and std::async was in use; unifying the thread handling was of great help to maintain this overview.
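As an illustration, the core of such a global registry could look like the following sketch. This is a simplified example based on std::thread (the names ThreadRegistry, add, remove and snapshot are made up for this article), and it omits the perf_event and statistics hooks of the real implementation:

#include <mutex>
#include <string>
#include <thread>
#include <unordered_map>
#include <vector>

// Global registry of all running threads, keyed by std::thread::id.
class ThreadRegistry {
public:
  static ThreadRegistry& instance() {
    static ThreadRegistry registry;
    return registry;
  }

  void add(std::thread::id id, std::string name) {
    std::lock_guard<std::mutex> guard(mutex_);
    names_[id] = std::move(name);
  }

  void remove(std::thread::id id) {
    std::lock_guard<std::mutex> guard(mutex_);
    names_.erase(id);
  }

  // Used for statistics output: which threads currently exist and what they do.
  std::vector<std::string> snapshot() const {
    std::lock_guard<std::mutex> guard(mutex_);
    std::vector<std::string> result;
    for (const auto& entry : names_) {
      result.push_back(entry.second);
    }
    return result;
  }

private:
  mutable std::mutex mutex_;
  std::unordered_map<std::thread::id, std::string> names_;
};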

One class that we use quite extensively is the ThreadPool class, which is based on a ThreadGroup and the folly::MPMCQueue. This queue shows very good performance and has the benefit of integrated thread de-scheduling using a futex when the queue is empty. Under normal load, the current queue size should be zero because there are sufficient threads to process the workload. If the queue size is constantly larger than zero or growing, this indicates too much backpressure and a larger thread pool is necessary. In case the queue runs full, the adserver notifies the load balancer and turns into a throttled state, e.g. it starts to discard requests.
We also apply folly::setThreadName to set each thread’s name, which can be shown in htop for a quick overview of the threads in the adserver process.
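To illustrate the overall shape, the following sketch combines these pieces: a thread pool whose workers block on a folly::MPMCQueue and name themselves via folly::setThreadName. It is a simplified example, not our production ThreadPool; the names tryPost and backlog and the "adfinder" thread-name prefix are made up for this article. folly::Function is used as the task type because it is nothrow move constructible, which suits MPMCQueue:

#include <folly/Function.h>
#include <folly/MPMCQueue.h>
#include <folly/system/ThreadName.h>
#include <cstddef>
#include <string>
#include <thread>
#include <vector>

class ThreadPool {
public:
  ThreadPool(size_t numThreads, size_t queueCapacity) : queue_(queueCapacity) {
    for (size_t i = 0; i < numThreads; ++i) {
      workers_.emplace_back([this, i] {
        folly::setThreadName("adfinder-" + std::to_string(i));  // visible in htop
        folly::Function<void()> task;
        for (;;) {
          queue_.blockingRead(task);  // futex-based wait while the queue is empty
          if (!task) {
            return;  // an empty task serves as the shutdown sentinel
          }
          task();
        }
      });
    }
  }

  // Non-blocking enqueue; returns false when the queue is full so that the
  // caller can throttle, e.g. discard the request.
  bool tryPost(folly::Function<void()> task) {
    return queue_.write(std::move(task));
  }

  // Approximate backlog; a value that is constantly above zero indicates
  // backpressure and a thread pool that is too small.
  auto backlog() const {
    return queue_.size();
  }

  ~ThreadPool() {
    for (size_t i = 0; i < workers_.size(); ++i) {
      queue_.blockingWrite(folly::Function<void()>{});  // one sentinel per worker
    }
    for (auto& worker : workers_) {
      worker.join();
    }
  }

private:
  folly::MPMCQueue<folly::Function<void()>> queue_;
  std::vector<std::thread> workers_;
};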
Thread profiling
We investigate the on-CPU time using FlameGraphs. This is especially interesting if the IPC is larger than 1 and we are not (memory) stalled. To create a FlameGraph, stack frames must be collected. We do this by statically instrumenting the code at each function entry, combined with sample-based recording. We can enable all compiler optimizations and still get full information about the function signature. This mixture of static instrumentation and statistical sampling allows us to collect realistic profile data from production traffic without increasing the latency of our services too much, which would cause a loss of money.

A flame graph is created by overlaying the sampled stack traces. Each bar corresponds to a function call. Longer bars represent more samples inside this function, in other words more CPU time. The height of the flames encodes the depth of the stack traces. We observed in the graph that the evaluateFilter function and its call chain are responsible for most of the load and are worth optimizing.
Prevailing synchronization
After introducing parallelism based on our custom thread management, we shifted our focus towards synchronization to allow concurrent access of the threads to the data. In the case of lock-based synchronization, it is important to assign locks to data structures. Relying on (undocumented) locking schemes as in the code below is error-prone. Probably every programmer has seen the mystical variable lock_ in large classes and wondered which members it protects:
class RequestHandler {
  RequestQueue requestQueue_;
  SharedMutex lock_;  // which members does lock_ protect? Nothing documents it.

  void process(const Request& request) {
    SharedMutex::WriteHolder lock(lock_);  // exclusive lock for the write access
    requestQueue_.push_back(request);
  }
};
We found folly::Synchronized<ProtectMe> very handy to enforce synchronized access to the class ProtectMe. It acts as a wrapper around an object and reveals the pointer only after acquiring the corresponding lock; below, wlock() is used for single operations like pushing an element and withRLock() for compound operations like iterating over all elements. If we try to perform the operations directly on the protected class, the compiler produces an error, which is trivial to fix compared to investigating and resolving a race condition.
class RequestHandler {
  folly::Synchronized<RequestQueue> requestQueue_;

  void process(const Request& request) {
    // single operation: wlock() grants access only while the write lock is held
    requestQueue_.wlock()->push_back(request);
  }

  void load() {
    // compound operation: the read lock is held for the whole iteration
    requestQueue_.withRLock([](const auto& rlockedRequestQueue) {
      for (const auto& request : rlockedRequestQueue) {
        request.dummy();
      }
    });
  }
};
Synchronization strategies
The design space for synchronizing access to critical sections is very large. In order to pick the right strategy, it is important to know what is happening inside the critical section. Without any claim of completeness, we will look into different properties of critical sections and show which synchronization strategies we had good experiences with.

Locking strategy
Non-blocking synchronization is based on read-modify-write instructions, e.g. the compare-and-swap or fetch-and-increment operation. The individual steps of the operation are executed atomically, i.e., they appear as a single indivisible step. Threads never have to wait on a lock and always continue processing instructions. A basic abstraction is provided by std::atomic, which has a generic interface for many CPU platforms but is still hard to get right. Still, atomic operations can be seen as the smallest critical section possible. We use atomic variables for shared counters, flags and to replace some shared_ptr objects.
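As a small illustration, counters and flags kept in std::atomic look roughly like this; the variable and function names are made up for this article:

#include <atomic>
#include <cstdint>

std::atomic<uint64_t> requestsServed{0};  // shared counter
std::atomic<bool> throttled{false};       // shared flag

void onRequest() {
  // fetch-and-increment: a single indivisible read-modify-write step
  requestsServed.fetch_add(1, std::memory_order_relaxed);
}

bool tryEnterThrottledState() {
  // compare-and-swap: exactly one thread flips the flag and wins
  bool expected = false;
  return throttled.compare_exchange_strong(expected, true);
}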
It is more convenient for programmers to use non-blocking open-source data structures that are maintained by experts. An example is folly::AtomicHashMap. The benefit of such data structures is that they are highly concurrent and robust against failed operations of individual threads.
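The following sketch shows a typical use of folly::AtomicHashMap. Note that it requires integer keys, is sized up front with an estimate and reserves a few sentinel key values; the map name and its meaning are made up for this article:

#include <folly/AtomicHashMap.h>
#include <cstdint>
#include <utility>

// Concurrent map for per-banner delivery counts, sized with an estimate.
folly::AtomicHashMap<int64_t, int64_t> deliveriesPerBanner(1 << 20);

void recordDelivery(int64_t bannerId, int64_t count) {
  auto result = deliveriesPerBanner.insert(std::make_pair(bannerId, count));
  if (!result.second) {
    // The key was already present; result.first points at the existing entry.
  }
}

int64_t deliveries(int64_t bannerId) {
  auto it = deliveriesPerBanner.find(bannerId);
  return it == deliveriesPerBanner.end() ? 0 : it->second;
}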
Blocking synchronization is quite the opposite because threads must wait for one another upon contention for a critical section. For this, locks are used, implemented either by a futex that preempts the thread for longer wait periods or by a spinlock that busy-loops for short waits. Locks are easier to read in the code but hard to debug at run-time, e.g. when a deadlock occurs because of different lock acquisition orders. We use locks in our code to protect critical sections that access and modify small portions of large data structures.
Access pattern
Accesses to critical sections can be optimized based on their pattern, i.e., the ratio of read vs. write accesses. If the update rate is small, we use reader-writer locks, which allow parallel reader accesses but only a single exclusive writer at a time. Examples are std::shared_mutex or folly::SharedMutex. Some locks allow upgrading a reader to write permission so that the write can be done at the end of the critical section, keeping the time in exclusive mode short.
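A minimal sketch of this read-mostly pattern with std::shared_mutex; the Campaign and CampaignStore names are made up for this article, and readers copy the object out while holding the shared lock:

#include <cstdint>
#include <optional>
#include <shared_mutex>
#include <unordered_map>

struct Campaign { /* targeting, budget, ... */ };

class CampaignStore {
public:
  // Many concurrent readers during ad selection.
  std::optional<Campaign> find(int64_t id) const {
    std::shared_lock<std::shared_mutex> readLock(mutex_);
    auto it = campaigns_.find(id);
    if (it == campaigns_.end()) {
      return std::nullopt;
    }
    return it->second;  // copy out while the read lock is still held
  }

  // Rare exclusive writer when a customer changes a campaign.
  void upsert(int64_t id, Campaign campaign) {
    std::unique_lock<std::shared_mutex> writeLock(mutex_);
    campaigns_[id] = std::move(campaign);
  }

private:
  mutable std::shared_mutex mutex_;
  std::unordered_map<int64_t, Campaign> campaigns_;
};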
For workloads that have rare updates, but where an update changes a large portion of a data structure, we use read-copy-update. The writer creates a new version of the data structure by applying its changes and then replaces the old version atomically. The readers can continuously access the data, and the old version lives on until the last reader decrements the reference count of its shared_ptr. All synchronization is done on the level of the atomic shared_ptr. The cost of copying the entire data structure for each update is quite high, so this only pays off if a mutex-protected writer would otherwise hold the lock for a long time.
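A sketch of this pattern with an atomically swapped shared_ptr, using the std::atomic_load/std::atomic_store overloads for shared_ptr available since C++11; the TargetingRules type and the single-writer assumption are illustrative:

#include <memory>
#include <vector>

struct TargetingRules {
  std::vector<int> filterIds;
};

class TargetingStore {
public:
  // Readers grab a snapshot; the old version stays alive until the last
  // reader releases its shared_ptr.
  std::shared_ptr<const TargetingRules> snapshot() const {
    return std::atomic_load(&current_);
  }

  // The writer copies, modifies and atomically publishes the new version.
  // Concurrent writers would additionally have to be serialized, e.g. with a mutex.
  void addFilter(int filterId) {
    auto oldRules = std::atomic_load(&current_);
    auto newRules = std::make_shared<TargetingRules>(*oldRules);  // copy
    newRules->filterIds.push_back(filterId);                      // modify
    std::atomic_store(&current_,
                      std::shared_ptr<const TargetingRules>(std::move(newRules)));
  }

private:
  std::shared_ptr<const TargetingRules> current_ =
      std::make_shared<TargetingRules>();
};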

Granularity
Global locks can be a bottleneck if the critical section is large and threads have to wait. It is an iterative process to decrease the size of a critical section. The locks must be pushed down the call-chain to the data they protect so that a lock is ideally responsible only for one variable. We try to use existing data structures with fine-granular locking schemes if possible, e.g., folly::ConcurrentSkipList or folly::ConcurrentHashMap. As an alternative, one can partition the key-space into separate containers and use distinct locks for each container.
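The partitioning alternative can look roughly like the following sketch; the shard count and the ShardedMap name are made up for this article:

#include <array>
#include <cstdint>
#include <mutex>
#include <unordered_map>

// The key space is split over NumShards containers, each with its own lock,
// so that threads working on different shards do not contend.
template <typename Value, size_t NumShards = 16>
class ShardedMap {
public:
  void put(uint64_t key, Value value) {
    Shard& shard = shardFor(key);
    std::lock_guard<std::mutex> guard(shard.mutex);
    shard.map[key] = std::move(value);
  }

  bool get(uint64_t key, Value& out) const {
    const Shard& shard = shardFor(key);
    std::lock_guard<std::mutex> guard(shard.mutex);
    auto it = shard.map.find(key);
    if (it == shard.map.end()) {
      return false;
    }
    out = it->second;
    return true;
  }

private:
  struct Shard {
    mutable std::mutex mutex;
    std::unordered_map<uint64_t, Value> map;
  };

  Shard& shardFor(uint64_t key) { return shards_[key % NumShards]; }
  const Shard& shardFor(uint64_t key) const { return shards_[key % NumShards]; }

  std::array<Shard, NumShards> shards_;
};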
Progress and fairness
The two biggest risks regarding progress are deadlocks and starvation. Deadlocks can happen if data has to be moved between two partitions or data structures and two threads try to acquire the involved locks in opposite orders. Deadlocks are hard to debug; they can be prevented by always acquiring locks in the same global order or by using std::lock (or std::scoped_lock), which locks multiple mutexes with a built-in deadlock-avoidance algorithm.
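For example, moving an entry between two such partitions can be written with std::scoped_lock (C++17), which acquires both mutexes with a deadlock-avoidance algorithm no matter in which order callers pass them; the Shard type is made up for this article:

#include <cstdint>
#include <mutex>
#include <unordered_map>

struct Shard {
  std::mutex mutex;
  std::unordered_map<uint64_t, int> map;
};

void moveEntry(Shard& from, Shard& to, uint64_t key) {
  std::scoped_lock bothLocks(from.mutex, to.mutex);  // safe regardless of argument order
  auto it = from.map.find(key);
  if (it == from.map.end()) {
    return;
  }
  to.map[key] = it->second;
  from.map.erase(it);
}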
Starvation can happen with reader-writer locks under high contention: as readers can enter the critical section concurrently, writers can be locked out as long as consecutive readers are present. In our experience, acquiring the write lock should take precedence over readers because writes are rare and short.
Fairness can be achieved by a fair lock acquisition order. This is a typical property of FIFO queue locks. In practice it is difficult to reason about fairness and progress guarantees over a large code base. While the low-level data structures can be wait-free, there can still be loops or composability problems on higher levels of abstraction that hinder the progress.
Performance improvements
After many iterations of refactoring and applying the above-mentioned techniques, we were obviously interested in the performance gains. We ran the adserver on old G5 servers with 8 cores and newer G8 servers with 10 cores. This adserver farm had 51 G5 servers in total. The old version of the adserver could only run single-threaded due to its global lock, with the ad finder thread producing a load of 55 %. The new parallel version with a thread pool of 4 ad finder threads reduced the load per core to 15-29 %. On the G8 servers we increased the traffic by sending more requests to the server. With a traffic weight of 5 (9 % of all requests), the load per core is 24-40 %. With 8 ad finder threads and a traffic weight of 10 (16 % of all requests), the load per core was 20-57 %, which is still acceptable. We must maintain a headroom of roughly 50 % for peak loads and deployments. This farm was replaced by 8 G8 servers, freeing up significant rack space.
server | threading | cores | traffic weight | total CPU load | ad finder CPU load
G5     | single    | 8     | 1 (2 %)        | 17 %           | 55 % (1 thread)
G5     | multi     | 8     | 1 (2 %)        | 16 %           | 15-29 % (4 threads)
G8     | multi     | 20    | 5 (9 %)        | 12 %           | 24-40 % (4 threads)
G8     | multi     | 20    | 10 (16 %)      | 24 %           | 20-57 % (8 threads)
The graphs show the results for the single-threaded adserver on a G5 server (green) and the parallel adserver on a G8 server (yellow). Over time, the traffic weight was increased from 1 to 5 and then 10, as in the table. The campaign find time, i.e., the time to search for an ad, remained almost constant at around 1.5 ms and was slightly faster on the G8 servers due to larger caches and higher CPU and bus speeds. The campaign find count increased accordingly with each step in weight, up to 1500 requests/sec.
