Chip Level Multiprocessing
 What is CMT?
The advent of Symmetric Multi-Processing (SMP) added a new degree of scalability to computer systems. Rather than deriving additional performance from an incrementally faster microprocessor, an SMP system leverages multiple processors to obtain large gains in total system performance. Parallelism in software allows multiple jobs to execute concurrently on the system, increasing system throughput accordingly. Given sufficient software parallelism, these systems have proved to scale to several hundred processors.
More recently, a similar phenomenon is occurring at the chip level. Rather than pursue diminishing returns by increasing individual processor performance, manufacturers are producing chips with multiple processor cores on a single die. For example, the AMD Opteron processor now utilizes two entire processor cores per die, providing almost double the performance of a single core chip.
 Chip Level Multiprocessing (CMP)
The AMD Opteron is a good example of a chip level multiprocessor (CMP) design. The chip has two complete processor cores, sharing a bus to memory. As shown in the figure (in green), each core is almost identical to its single-core predecessor; the second core is a complete duplication of the first, including its pipeline and caches. From a performance perspective, the chip behaves much like a dual processor SMP machine, albeit with some potential contention for memory bandwidth through the shared path to memory. From a software perspective, the chip looks almost indistinguishable from a dual processor system. Software threads are scheduled onto the processor cores by the operating system – at least two threads are required to keep both cores busy.
 Simultaneous Multi-Threading
Additionally, as [Kunle] points out, processor designers have found that since most microprocessors spend a significant amount of time idly waiting for memory, software parallelism can be leveraged to hide memory latency. Since a memory stall typically takes on the order of 100 processor cycles, a processor pipeline sits idle for a significant fraction of the time.
By providing additional sets of registers per processor pipeline, multiple software jobs can be multiplexed onto the pipeline – a technique known as simultaneous multi-threading (SMT). A thread is switched onto the pipeline when another blocks or waits on memory, thus allowing the pipeline to be utilized potentially to its maximum.
SMT has a high return on performance in relation to additional transistor count. For example, a 50% performance gain may be realized by adding just 10% more transistors with an SMT approach, in contrast to CMP, where adding a processor core can almost double the number of transistors.
On a multi-threaded microprocessor, each hardware thread appears to the operating system as an individual processor. Software threads are scheduled by the operating system onto individual hardware threads, and hardware threads are scheduled onto the pipeline by a hardware scheduler. The number of software threads required to keep the core busy varies from 1 to many, depending on the ratio of memory stalls to compute events. Hardware designers typically pick an appropriate number of hardware threads per processor pipeline, according to the expected workload characteristics.
 Chip Level Multithreading (CMT)
A CMT design is a combination of the CMP and SMT approaches: there are multiple cores per chip and multiple threads per core. The combination of CMP and SMT is optimal, since SMT can be used to maximize the performance of each core, and through CMP additional cores can fit within one chip.
The Sun Niagara processor is an example of a CMT approach. It uses 8 cores, each with 4 hardware threads.
Each core has four threads sharing a pipeline, a level 1 instruction cache, a data cache and a memory management unit (MMU). Since level 1 caches are typically very small and on the order of ten times faster to access than level 2, there are often benefits to co-locating similar software threads within a core so that their code can be shared in the faster caches. Smart operating system algorithms, and potentially some knowledge of the application, may be required to make optimizations such as these.
These new Chip Level Multiprocessors (CMPs) are bringing what was once a large multiprocessor system down to the chip level. A low end 4-chip dual core Opteron machine presents itself to software as an eight processor system, and in the case of the Sun Niagara processor with 8 cores and 4 threads per core, a single chip presents itself to software as a 32-processor system. As a result, the ability of system and application software to exploit multiple processors or threads simultaneously is becoming more important than ever. As CMP hardware progresses, software is required to scale accordingly to fully exploit the parallelism of the chip.
Thus, bringing this degree of parallelism down to the chip level represents a significant change to the way we think about scaling. Since the cost of a CMP system is close to that of recent low-end uni-processor systems, it's inevitable that even the cheapest desktops and servers will be highly threaded. Techniques used to scale application and system software on large enterprise-level SMP systems will now frequently be leveraged to provide scalability even for single chip systems. This change in the degree of scaling at the low end should inform the way we architect applications, which operating system we choose, and the techniques we use to deploy applications.
 Is CMT Just a cost effective SMP?
A very simplistic view of a CMP system is that it appears to software as an SMP system with the number of processors equal to the number of threads in the chip, each with slightly reduced processing capability. Since each hardware thread is sharing the resources of a single processor core, each thread has some fraction of the core's overall performance. Thus, an eight-core chip with thirty-two hardware threads running at 1GHz may be somewhat crudely approximated as an SMP system with 32 x 250MHz processors. The effect on software is often a subtle trade-off of per-thread latency for a significant increase in throughput. For a throughput-oriented workload with many concurrent requests (such as a web server), the marginal increase in response time is virtually negligible, but the increase in system throughput is an order of magnitude over a non-CMP processor of the same clock speed.
There are however more subtle differences between a CMP system and an SMP system. If threads or cores within a CMP processor share important resources, then some threads may impact the performance of other threads. For example, when multiple threads share a single core and therefore share first level memory caches, the performance of a given thread may vary depending on what the other threads of the same core are doing with the first thread's data in the cache. Yet in another similar case, a thread may actually gain if the other threads are constructively sharing the cache, since useful data may be brought into the cache by threads other than the first. We cover this in more detail as we explore some of the potential operating system optimizations.
 How should I Scale my Software?
The performance of system software ideally scales proportionally with the number of processors in the system. There are however factors that limit the speedup.
Amdahl's Law states that the speedup of a parallel algorithm is effectively limited by the number of operations which must be performed sequentially, i.e. its serial fraction. If 10% of a parallel program involves serial code, the maximum speedup that can be attained is approximately 3 using 4 processors (about 75% of linear), rising to only about 4.7 when the processor count increases to 8 (only 59% of linear). Amdahl's Law tells us that the serial fraction places a severe constraint on the speedup as the number of processors increases.
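The arithmetic above can be sketched directly from Amdahl's formula, speedup = 1 / (s + (1 - s)/N) for serial fraction s and N processors:

```python
def amdahl_speedup(serial_fraction, n_processors):
    """Maximum speedup of a workload whose serial fraction cannot be parallelized."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / n_processors)

# A program that is 10% serial:
print(round(amdahl_speedup(0.10, 4), 2))    # 3.08 on 4 processors (~77% of linear)
print(round(amdahl_speedup(0.10, 8), 2))    # 4.71 on 8 processors (~59% of linear)
print(round(amdahl_speedup(0.10, 10**6), 2))  # 10.0 -- the 1/s upper bound
```

Note that no matter how many processors are added, the speedup never exceeds 1 divided by the serial fraction.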
In addition, software typically incurs overheads due to communication and distribution of work to multiple processors, resulting in a scaling curve where the performance peaks and then begins to degrade. Since most operating systems and applications contain a certain amount of sequential code, a possible conclusion of Amdahl's Law is that it is not cost effective to build systems with large numbers of processors, because sufficient speedup will never be produced. However, over the last decade, there has been rigorous focus on reducing the serial fraction within systems hardware architectures, operating systems, middleware and application software. Today, there are clear examples showing that it is possible to scale system software and applications to on the order of 100 processors on an SMP system. The figure below shows the results for a series of scaling benchmarks that were performed using database workloads on a large SMP configuration. These application benchmarks were performed on a single system image, by measuring throughput as the number of processors was increased.
 Intra-machine or Inter-machine Scale?
Software scalability for these large SMP machines has historically been obtained through rigorous focus on intra-machine scalability, within one large instance of the application on a single operating system. A good example is a one tier enterprise application such as SAP. The original version of SAP utilized a single large monolithic application server. The application instance obtains its parallelism from the many concurrent requests from users. Provided there are no major serialization points between the users, the application will naturally scale. The focus on scaling these applications has been to remove these serialization points within the applications.
More recently, due to the economics of low end systems, the focus has been on inter-machine scaling, to leverage low cost commodity 1-2 processor servers. Some applications can be made to scale without requiring large expensive SMP systems by running multiple instances in parallel on separate 1-2 processor systems, resulting in good overall throughput. Applications can be designed to scale this way by moving all shared state to a shared backend service, such as a database. Typically, many 1-2 processor systems are configured as mid tier application servers, communicating with an intra-machine-scaled database system. The shift of focus to 1-2 processor hardware has however removed a great deal of the pressure to design intra-machine scalability into the software.
The compelling features of CMP (low power, extreme density and high throughput) match this space well, mandating a revised focus on intra-machine scalability.
 What is the Impact of CMP on Application Developers?
The most significant impact for application developers is the requirement to scale – the minimum scaling requirement has been raised from 1-4 processors to 32 today, and will likely increase again in the near future.
 How do we build Scalable Applications?
Engineering scalable code is challenging, but the performance wins are huge. The data in the scaling curves for Oracle and DB2 in the figure above show the rewards from a great deal of performance tuning and optimization for scaling. As per Amdahl's law, scaling software requires minimization of the serial fraction of the workload. In many commercial systems, natural parallelism comes from the many concurrent users of the system.
The simple first order scaling bottlenecks (those with a large serial fraction) typically come from contention for shared resources, such as those listed below:
- Networks or interconnects: Bandwidth limitations on interconnects between portions of the system – for example, an ingress network on the web servers, tier 1-2 networks for SQL traffic, or a storage network (SAN).
- CPU/Memory: Queueing for CPU or waiting for page faults as a result of resource starvation.
- I/O Throughput: Insufficient capacity for disk I/O operations or bandwidth.
The more interesting problems result from intrinsic application design problems, which manifest from serial operations within the application or the operating environment. These are often much harder to identify without good observability tools, because rather than showing up as an easily detected overloaded resource (such as running out of CPU), they often exhibit growing amounts of idle resource as load is increased.
Here's a common example. We were recently asked to help with a scaling problem observed with a large online ecommerce system. The application consisted of thousands of users committing payment transactions from a web application, and as load increased the latency became unacceptable. The application was running on a large SMP system and database, both of which were known to scale well. The difficult part was that there was no clear indicator of where in the system the problem lay – as load was increased, the system CPU resources became more idle. It turned out that there was a single table at the center of all the updates; the locking strategy for that table became the significant serial fraction of the workload, and user transactions were simply blocking, waiting for updates to the table. The solution was to break up the table such that concurrent inserts could occur, thus reducing the serial fraction and increasing scalability.
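The same "break it up" fix can be sketched in miniature. The class below is hypothetical, not the actual system from the example: it stripes one logically shared counter across several independently locked partitions, so concurrent writers to different stripes no longer serialize on a single lock:

```python
import threading

class StripedCounter:
    """Stand-in for the partitioned table: one lock per stripe instead of
    one global lock, so updates to different stripes proceed in parallel."""
    def __init__(self, n_stripes=16):
        self.locks = [threading.Lock() for _ in range(n_stripes)]
        self.counts = [0] * n_stripes

    def add(self, key, amount=1):
        stripe = hash(key) % len(self.counts)  # each key maps to one stripe
        with self.locks[stripe]:               # only same-stripe writers contend
            self.counts[stripe] += amount

    def total(self):
        return sum(self.counts)

counter = StripedCounter()

def worker(tid):
    for j in range(250):
        counter.add((tid, j))

threads = [threading.Thread(target=worker, args=(i,)) for i in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter.total())  # 2000
```

The serial fraction shrinks from "every update" to "every update that happens to hit the same stripe", which is the essence of the table break-up described above.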
For CMP, we need to pay attention to the bottlenecks that might limit scaling within one application instance, since we now need to scale to tens of threads, going to on the order of one hundred in the near future.
 Writing Scalable Low Level Code
Many middleware applications (such as databases, application servers or transaction systems) require special attention in order to scale. Here are a few of the common techniques that may serve as a general guideline.
- Scalable Algorithms: Many algorithms become less efficient as the size of the problem set increases. For example, an algorithm that searches for an object using a linear list will increase the amount of CPU required as the size of the list increases, potentially at a super-linear rate. Picking good algorithms that optimize for the common case is of key importance.
- Locking: Locking strategies have significant impact on scalability. As concurrency increases, the number of threads attempting to lock an object or region increases, resulting in compounding contention as the lock becomes “hotter”. In modern systems, an optimal approach is to provide fine-grained locking using a lock per object, where possible. There are also several approaches to making the reader side of code lock-free at the expense of some memory wastage or increased writer-side cost.
- Cache Line Sharing: Multiprocessor and CMP systems use hardware coherency algorithms to keep data consistent between different pipelines, which can have a significant effect on scaling. For example, a latency penalty may result if one processor updates a memory object within its cache which is also accessed from another processor. The cache location will be invalidated by the cache coherency hardware protocol, which ensures only one version of the data exists. In a CMP system, multiple threads typically access a single first level cache, so co-locating data that will be accessed within a single core may be appropriate.
- Pools of Worker Threads: A good approach for concurrency is to use a pool of worker threads: a general purpose, multithreaded engine which can be used to process an aggregate set of work events. Using this model, an application gives discrete units of work to the engine, and lets the engine process them in parallel. The worker pool provides a flexible mechanism to balance the work events across multiple processors or hardware threads. The operating system can automatically tune the concurrency of the application to match the topology of the underlying hardware architecture.
- Memory Allocators: Memory allocators pose a significant problem to scaling. Almost all code needs to allocate and free data structures, and typically does so via a central system-provided memory allocator. Unfortunately, very few memory allocators scale well. Some of the few that do include Hoard, Solaris 10 “libumem” and “slab”, and SmartHeap. It's worth paying attention to more than just one dimension of scalability: different allocators have different properties in light of the nature of allocation/deallocation requests.
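The worker-pool model described above can be sketched with Python's standard `concurrent.futures` module. The `handle` function is a hypothetical unit of work; a real engine would perform I/O or computation there:

```python
from concurrent.futures import ThreadPoolExecutor
import os

def handle(work_item):
    """Hypothetical discrete unit of work handed to the engine."""
    return work_item * work_item

# Size the pool to the hardware topology -- on a CMT chip, os.cpu_count()
# reports the number of hardware threads, not physical cores.
n_workers = os.cpu_count() or 4

with ThreadPoolExecutor(max_workers=n_workers) as pool:
    # The pool balances the work items across the available workers.
    results = list(pool.map(handle, range(10)))

print(results)  # [0, 1, 4, 9, 16, 25, 36, 49, 64, 81]
```

The application only describes the work; how many items run concurrently is a deployment-time decision that can track the width of the underlying chip.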
 Conduct Scalability Experiments Early and Often
Time has shown that the most efficient way of driving out scaling issues from an application is to perform scaling studies. Given the infinite space in which optimizations can be made, it is important to follow a methodology to prioritize the most important issues.
Modeling techniques can be used to mathematically predict response times and potential scaling bottlenecks from complex systems. They are often used for predicting the performance of hardware, to assist with design trade off analysis. Modeling software however requires intimate knowledge of the software algorithms, code paths and system service latencies. The time taken to construct a model and validate all assumptions is often at odds with running scaling tests.
A well designed set of scaling experiments is key to understanding the performance characteristics of an application, and with proper observability instrumentation, makes it easy to pinpoint where key issues lie. Scalability prediction and analysis should be done as early as possible in the development cycle. It's often much harder to retrofit scalability improvements to an existing architecture; consider scalability part of the application architecture and design.
Key items to include in scalability experiments are:
- Throughput versus # threads/processors: Does the throughput scale close to linearly as the amount of resource applied increases?
- Throughput versus resource consumed: i.e. CPU, Network I/O, and Disk I/O per transaction. Does the amount of resource consumed per unit of work increase as scale increases?
- Latency versus Throughput: Does the latency of a transaction increase as the throughput of a system increases? A system that provides linear throughput scalability might not be useful in the real world if the transaction response times are too long.
- Statistics: Measure code path length in both number of instructions and cycles
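The first experiment in the list, throughput versus thread count, reduces to a simple calculation once the measurements are in hand. The helper below is a sketch; the thread counts and transactions-per-second figures are hypothetical, not measured results:

```python
def scaling_efficiency(baseline_tps, measurements):
    """Compare measured throughput at each thread count against ideal
    linear scaling from the single-thread baseline.
    `measurements` maps thread count -> transactions per second."""
    report = {}
    for threads, tps in sorted(measurements.items()):
        ideal = baseline_tps * threads          # perfect linear scaling
        report[threads] = round(tps / ideal, 2)  # fraction of linear achieved
    return report

measured = {1: 1000, 8: 7200, 16: 12800, 32: 20500}  # hypothetical run
print(scaling_efficiency(1000, measured))
# {1: 1.0, 8: 0.9, 16: 0.8, 32: 0.64}
```

A curve that falls away from 1.0 like this one points back to Amdahl's Law: the growing gap from linear is the serial fraction (plus communication overhead) making itself visible.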
 Observability is the Primary Means to Scalable Software
Effective tools are the most significant factor in improving application scalability. Being able to identify the root cause of a scaling issue quickly is paramount. The objective of looking for scaling issues is to easily pinpoint the most significant sources of serialization.
We want the tools to help us identify what type of issue is causing the serialization – the two classic cases being starvation due to escalating resource requirements as load increases, or increasing idle time as load increases. Ideally, the tools should help identify the source of the scaling issue rather than merely pointing to the object of contention. This not only helps with identifying what the contention point is, but may also reveal offending code which is over-utilizing a resource. Often, once the source is identified, many obvious optimizations become apparent.
Consider tools which can provide the capability to:
- Locate key sources of wait-time: Tools which can identify contended resources, attribute who is causing the resource utilization, and identify how much effect the contention is having on overall performance.
- Identify hot synchronization locks: How much wall clock and CPU time is serialized in locking objects, and which code is responsible.
- Identify non-scalable algorithms: Which functions or classes become more expensive as the scale of the application increases.
- Make it clear where the problem lies: either in the application code, which you can affect, or at a contention point in vendor-supplied middleware or the operating system. Even though the contention point may lie in vendor code, it may result from how that code is being called, which can be addressed by optimizing the higher-level code.