Everyone who uses a computer has an opinion about its performance. Unfortunately, those opinions are often based on oversimplified ideas about the dynamics of program execution. Uninformed intuition can lead to expensive wrong guesses about the capacity of a system and the solutions to the perceived performance problems.
This chapter describes the dynamics of program execution and provides a conceptual framework for evaluating system performance. It contains the following major sections:
Using words like "speed" and "fast" to describe contemporary computers, while condoned by precedent, is an extreme oversimplification. There was a time when one could read a program, calculate the sum of the instruction times, and confidently predict how long it would take the computer to run that program. Thousands of programmers and engineers have spent the last 30 years making such straightforward calculations impossible, or at least meaningless.
Today's computers are more powerful than their ancestors, not just because they use integrated circuits instead of vacuum tubes and have far shorter cycle times, but because of innumerable hardware and software architectural inventions. Each advance in integrated-circuit density brings an advance in computer performance, not just because it allows the same logic to work in a smaller space with a faster system clock, but because it gives engineers more space in which to implement clever ideas. In short, computers have gained capacity by becoming more complex as well as quicker.
The complexity of modern computers and their operating systems is matched by the complexity of the environment in which they operate. In addition to the execution of individual programs, today's computer has to deal with varying numbers of unpredictably timed interrupts from I/O and communications devices. To the extent that the engineers' clever ideas were based on an assumption of a single program running in a standalone machine, they may be partly defeated by the randomness of the real world. To the extent that those ideas were intended to deal with randomness, they may win back some of the loss. The wins and losses change from program to program and from moment to moment.
The net of all these hardware and software wins and losses is the performance of the system. The "speed" of the system is the rate at which it can handle a specific sequence of demands. If the demands mesh well with the system's hardware and software architectures, we can say, "The system runs this workload fast." We can't say, "The system is fast"--or at least we shouldn't.
As you can see, an accurate and complete definition of the system's workload is critical to predicting or understanding its performance. A difference in workload can cause far more variation in the measured performance of a system than differences in CPU clock speed or RAM size. The workload definition must include not only the type and rate of requests to the system but also the exact software packages and in-house application programs to be executed.
Whenever possible, current users of existing applications should be observed to get authentic, real-world measurements of the rates at which users interact with their workstations or terminals.
Make sure that you include the work that your system is doing "under the covers." For example, if your system contains file systems that are NFS-mounted and frequently accessed by other systems, handling those accesses is probably a significant fraction of the overall workload, even though your system is not officially a "server."
A benchmark is a workload that has been standardized to allow comparisons among dissimilar systems. Any benchmark that has been in existence long enough to become "industry-standard" has been studied exhaustively by systems developers. Operating systems, compilers, and in some cases hardware, have been tuned to run the benchmark with lightning speed.
Unfortunately, few real workloads duplicate the exact algorithms and environment of a benchmark. Even those industry-standard benchmarks that were originally derived from real applications may have been simplified and homogenized to make them portable to a wide variety of hardware platforms. The environment in which they run has been constrained in the interests of reproducible measurements.
Bluntly, reasoning of the form "System A is rated at 50% more MegaThings than System B, so System A should run my program 50% faster than System B" may be a tempting shortcut, but it is wrong. There is no benchmark with such universal applicability. The only valid use for industry-standard benchmarks is to narrow the field of candidate systems that will be subjected to a serious evaluation. There is no substitute for developing a clear understanding of your workload and its performance in systems under consideration.
After defining the workload that the system will have to process, you can choose performance criteria and set performance objectives based on those criteria. The main overall performance criteria of computer systems are response time and throughput. Response time is the time from the initiation of an operation until the initiator has enough information to resume work, while throughput is the number of workload operations that can be accomplished per unit of time. The relationship between these metrics is complex. In some cases you may have to trade off one against the other. In other situations, a single change can improve both.
In planning for or tuning any system, you should have clear objectives for both response time and throughput when processing the specified workload. Otherwise you risk spending analysis time and resource dollars improving an aspect of system performance that is of secondary importance.
Normally, an application programmer thinks of the running program as an uninterrupted sequence of instructions that perform a specified function. Great amounts of inventiveness and effort have been expended on the operating system and hardware to ensure that programmers are not distracted from this idealized view by "irrelevant" space, speed, and multiprogramming/multiprocessing considerations. If the programmer is seduced by this comfortable illusion, the resulting program may be unnecessarily expensive to run--and may not meet its performance objectives.
To think clearly about the performance characteristics of a workload, we need a dynamic, rather than a static, model of program execution, as shown in the figure "Program Execution Hierarchy."
To run, a program must make its way up both the hardware and operating-system hierarchies, more or less in parallel. Each element in the hardware hierarchy is scarcer and more expensive than the element below it. Not only does the program have to contend with other programs for each resource, but the transition from one level to the next also takes time. To understand the dynamics of program execution, we need to have a basic understanding of each of the levels.
Usually, the time required to move from one hardware level to another consists primarily of the latency of the lower level--the time from the issuing of a request to the receipt of the first data.
By far the slowest operation for a running program (other than waiting on a human keystroke) is obtaining code or data from a disk, an operation whose time is dominated by seek and rotational latency.
Disk operations can have many causes besides explicit read or write requests in the program. System tuning activities frequently turn out to be hunts for unnecessary disk I/O.
RAM is fast compared to disk, but much more expensive per byte. Operating systems try to keep in RAM the code and data that are currently in use, spilling any excess onto disk (or never bringing them into RAM in the first place).
RAM is not necessarily fast compared to the processor. In the RS/6000, a RAM latency of several processor cycles occurs between the time the hardware recognizes the need for a RAM access and the time the data or instruction is available to the processor.
If the access is to a page of virtual memory that has been spilled to disk (or has not been brought in yet), a page fault occurs, and the execution of the program is suspended until the page has been read in from disk.
One of the ways programmers are insulated from the physical limitations of the system is the implementation of virtual memory. The programmer designs and codes the program as though the memory were very large, and the system takes responsibility for translating the program's virtual addresses for instructions and data into the real addresses that are needed to get the instructions and data from RAM. Since this address-translation process is time-consuming, the system keeps the real addresses of recently accessed virtual-memory pages in a cache called the translation lookaside buffer (TLB). As long as the running program continues to access a small set of program and data pages, the full virtual-to-real page-address translation does not need to be redone for each RAM access. When the program tries to access a virtual-memory page that does not have a TLB entry (a TLB miss), dozens of processor cycles (the TLB-miss latency) are usually required to perform the address translation.
To minimize the number of times the program has to experience the RAM latency, the RS/6000 incorporates caches for instructions and data. If the required instruction or data is already in the cache (a cache hit), it is available to the processor on the next cycle (that is, no delay occurs); otherwise (a cache miss), the RAM latency occurs.
In some systems there are two levels of cache, usually called L1 and L2. If a particular storage reference results in an L1 miss, L2 is checked. If L2 generates a miss, then the reference goes to RAM.
In the RS/6000, the cache sizes and structures vary by model, but the principles of using them efficiently are identical. Appendix C, "Cache and Addressing Considerations", contains a more detailed discussion of cache and TLB architectures for the benefit of the curious and those who envision very low-level program tuning.
The RS/6000's pipelined, superscalar architecture makes possible, under certain circumstances, the simultaneous processing of multiple instructions. Large sets of general-purpose registers and floating-point registers make it possible to keep considerable amounts of the program's data in registers, rather than continually storing and reloading.
The RS/6000 optimizing compilers are designed to take maximum advantage of these capabilities. The compilers' optimization functions should always be used when generating production programs, however small. The Optimization and Tuning Guide for XL Fortran, XL C and XL C++ describes the ways in which programs can be tuned for maximum performance.
To run, a program must also progress through a series of steps in the software hierarchy.
When a user requests the execution of a program, AIX performs a number of operations to transform the executable program on disk to a running program. First, the directories in the user's current PATH environment variable must be scanned to find the correct copy of the program. Then, the system loader (not to be confused with ld, the binder) must resolve any external references from the program to shared libraries.
To represent the user's request, the operating system creates a process, which is a set of resources, such as a private virtual address segment, required by any running program.
In AIX Version 4, the operating system also automatically creates a single thread within that process. A thread is the current execution state of a single instance of a program. In AIX Version 4, access to the processor and other resources is allocated on a thread basis, rather than a process basis. Multiple threads can be created within a process by the application program. Those threads share the resources owned by the process within which they are running.
Finally, the system branches to the entry point of the program. If the program page that contains the entry point is not already in memory (as it might be if the program had been recently compiled, executed, or copied), the resulting page-fault interrupt causes the page to be read.
The mechanism for notifying the operating system that an external event has taken place is to interrupt the currently running thread and transfer control to an interrupt handler. Before the interrupt handler can run, enough of the hardware state must be saved to ensure that the system can restore the context of the thread after interrupt handling is complete. Newly invoked interrupt handlers experience all of the delays of moving up the hardware hierarchy (except page faults). Unless the interrupt handler was run very recently (or the intervening programs were very economical), it is unlikely that any of its code or data remains in the TLBs or the caches.
When the interrupted thread is dispatched again, its execution context (such as register contents) is logically restored, so that it functions correctly. However, the contents of the TLBs and caches must be reconstructed on the basis of the program's subsequent demands. Thus, both the interrupt handler and the interrupted thread can experience significant cache-miss and TLB-miss delays as a result of the interrupt.
Whenever an executing program makes a request that cannot be satisfied immediately, such as an I/O operation (either explicit or as the result of a page fault), that thread is put in a Wait state until the request is complete. Normally, this results in another set of TLB and cache latencies, in addition to the time required for the request itself.
When a thread is dispatchable, but not actually running, it is accomplishing nothing useful. Worse, other threads that are running may cause the thread's cache lines (the areas of the cache that contain the instructions and/or data of this thread--see Appendix C, "Cache and Addressing Considerations") to be re-used and real memory pages to be reclaimed, resulting in even more delays when the thread is finally dispatched.
The scheduler chooses the thread that has the strongest claim to the use of the processor. (The considerations that affect that choice are discussed in "Performance Overview of the AIX CPU Scheduler".) When the thread is dispatched, the logical state of the processor is restored to that in effect when the thread was interrupted.
Most of the machine instructions in an RS/6000 are capable of executing in a single processor cycle, if no TLB or cache miss occurs. In contrast, if a program branches rapidly to different areas of the executable and/or accesses data from a large number of different areas, causing high TLB and cache miss rates, the average number of processor cycles per instruction executed might be much greater than one. The program is said to exhibit poor "locality of reference." It might be using the minimum number of instructions necessary to do its job, but consuming an unnecessarily large number of cycles. In part because of this poor correlation between number of instructions and number of cycles, sitting down with a program listing to calculate "path length" no longer yields a time value directly. While a shorter path is usually faster than a longer path, the speed ratio can be very different from the path-length ratio.
The XL compilers rearrange code in very sophisticated ways to minimize the number of cycles required for the execution of the program. The programmer seeking maximum performance should be primarily concerned with ensuring that the compiler has all the information necessary to optimize effectively, rather than trying to second-guess the compiler's optimization techniques. (See "Effective Use of Preprocessors and the XL Compilers".) The real measure of optimization effectiveness is the performance of an authentic workload.
It's not enough to create the most efficient possible individual programs. In many cases, the actual programs being run were created outside of the control of the person who is responsible for meeting the organization's performance objectives. Further, most of the levels of the hierarchy we have just described are managed by one or more parts of AIX. In any case, once the application programs have been acquired or implemented as efficiently as possible, further improvement in the overall performance of the system becomes a matter of system tuning. The main components that are subject to system-level tuning are:
Fixed Disk
   The Logical Volume Manager (LVM) controls the placement of file systems and paging spaces on the disk, which can significantly affect the amount of seek latency the system experiences. The disk device drivers control the order in which I/O requests are acted on.

Real Memory
   The Virtual Memory Manager (VMM) controls the pool of free real-memory frames and determines when and from whom to steal frames to replenish the pool.

Running Thread
   The scheduler determines which dispatchable entity should receive control next. (In AIX Version 4, the dispatchable entity changes from a process to a thread. See "AIX Version 4 Thread Support".)

Communications I/O
   Depending on the type of workload and the type of communications link, it may be necessary to tune one or more of the communications device drivers, TCP/IP, or NFS.
Workloads tend to fall naturally into a small number of classes. The types that follow are sometimes used to categorize systems. However, since a single system often is called upon to process multiple classes, "workload" seems more apt in the context of performance.
When a single system is processing workloads of more than one type, there must be a clear understanding between the users and the performance analyst as to the relative priorities of the possibly conflicting performance objectives of the different workloads.