Inference from the Kind of Performance Problem Reported

AIX Versions 3.2 and 4 Performance Tuning Guide

Inference from the Kind of Performance Problem Reported

When a performance problem is reported, the kind of performance problem will often help the performance analyst to narrow the list of possible culprits.

This topic includes the following major sections:

"A Particular Program Runs Slowly"
"Everything Runs Slowly at a Particular Time of Day"
"Everything Runs Slowly at Unpredictable Times"
"Everything an Individual User Runs is Slow"
"A Number of LAN-Connected Systems Slow Down Simultaneously"

If everything that uses a particular device or service slows down at times, refer to the topic that covers that device or service:

A Particular Program Runs Slowly

This may seem to be the trivial case, but there are still questions to be asked:

Has the program always run slowly?
If the program has just started running slowly, a recent change may be the cause.
Has the source code been changed or a new version installed?
If so, check with the programmer or vendor.
Has something in the environment changed?
If a file used by the program (including its own executable) has been moved, it may now be experiencing LAN delays that weren't there before; or files may be contending for a single disk accessor that were on different disks before.

If the system administrator has changed system-tuning parameters, the program may be subject to constraints that it didn't experience before. For example, if the schedtune -r command has been used to change the way priority is calculated, programs that used to run rather quickly in the background may now be slowed down, while foreground programs have speeded up.
Is the program written in the awk, csh, or some other interpretive language?
While they allow programs to be written quickly, interpretive languages have the problem that they are not optimized by a compiler. Also, it is easy in a language like awk to request an extremely compute- or I/O-intensive operation with a few characters. It is often worthwhile to perform a desk check or informal peer review of such programs with the emphasis on the number of iterations implied by each operation.
Does the program always run at the same speed, or is it sometimes faster?
The AIX file system uses some of system memory to hold pages of files for future reference. If a disk-limited program is run twice in quick succession, it will normally run faster the second time than the first. Similar phenomena may be observed with programs that use NFS and DFS. This can also occur with large programs, such as compilers. The program's algorithm may not be disk-limited, but the time needed to load a large executable may make the first execution of the program much longer than subsequent ones.
If the program has always run slowly, or has slowed down without any obvious change in its environment, we need to look at its dependency on resources.
"Identifying the Performance-Limiting Resource" describes techniques for finding the bottleneck.

Everything Runs Slowly at a Particular Time of Day

Most people have experienced the rush-hour slowdown that occurs because a large number of people in the organization habitually use the system at one or more particular times each day. This phenomenon is not always simply due to a concentration of load. Sometimes it is an indication of an imbalance that is (at present) only a problem when the load is high. There are also other sources of periodicity in the system that should be considered.

If you run iostat and netstat for a period that spans the time of the slowdown (or have previously captured data from your monitoring mechanism), are some disks much more heavily used than others? Is the CPU Idle percentage consistently near zero? Is the number of packets sent or received unusually high?
If the disks are unbalanced, look at "Monitoring and Tuning Disk I/O".

If the CPU is saturated, use ps to identify the programs being run during this period. The script given in "Performance Monitoring Using iostat, netstat, vmstat" simplifies the search for the CPU hogs.

If the slowdown is counter-intuitive, such as paralysis during lunch time, look for a pathological program such as a graphic Xlock or game program. Some versions of Xlock are known to use huge amounts of CPU time to display graphic patterns on an idle display. It is also possible that someone is running a program that is a known CPU burner and is trying to run it at the least intrusive time.
Unless your /var/adm/cron/cron.allow file is null, you may want to check the contents of the /var/adm/cron/crontab directory for expensive operations. For example, users have been known to request an hourly copy of all of their home directory files to an NFS-mounted backup directory.

If you find that the problem stems from conflict between foreground activity and long-running, CPU-intensive programs that are, or should be, run in the background, you should consider using schedtune -r -d to give the foreground higher priority. See "Tuning the Process-Priority-Value Calculation with schedtune".

Everything Runs Slowly at Unpredictable Times

The best tool for this situation is an overload detector, such as xmperf's filtd program (a component of PTX). filtd can be set up to execute shell scripts or collect specific information when a particular condition is detected. You can construct a similar, but more specialized, mechanism using shell scripts containing vmstat, netstat, and ps.

If the problem is local to a single system in a distributed environment, there is probably a pathological program at work, or perhaps two that intersect randomly.

Everything an Individual User Runs Is Slow

Sometimes a system seems to "pick on" an individual.

Quantify the problem. Ask the user which commands are used frequently, and run them with the time command, as in the following example:
```
$ time cp .profile testjunk
real    0m0.08s
user    0m0.00s
sys     0m0.01s
```
Then run them under a satisfactory userid. Is there a difference in the reported real time?
A program should not show much CPU time (user+sys) difference from run to run, but may show a real time difference because of more or slower I/O. Are the user's files on an NFS-mounted directory? On a disk that has high activity for other reasons?
Check the user's .profile file for strange $PATH specifications. For example, if you always search a couple of NFS-mounted directories (fruitlessly) before searching /usr/bin, everything will take longer.

A Number of LAN-Connected Systems Slow Down Simultaneously

There are some common problems that arise in the transition from independent systems to distributed systems. They usually result from the need to get a new configuration running as soon as possible, or from a lack of awareness of the cost of certain functions. In addition to tuning the LAN configuration in terms of MTUs and mbufs (see the Monitoring and Tuning Communications I/O chapter), we should look for LAN-specific pathologies or nonoptimal situations that may have evolved through a sequence of individually reasonable decisions.

Some types of software or firmware bugs can sporadically saturate the LAN with broadcast or other packets.
When a broadcast storm occurs, even systems that are not actively using the network can be slowed by the incessant interrupts and by the CPU resource consumed in receiving and processing the packets. These bugs are better detected and localized with LAN analysis devices than with normal AIX performance tools.
Do you have two LANs connected via an AIX system?
Using an AIX system as a router consumes large amounts of CPU time to process and copy packets. It is also subject to interference from other work being processed by the AIX system. Dedicated hardware routers and bridges are usually a more cost-effective and robust solution to the need to connect LANs.
Is there a clearly defensible purpose for each NFS mount?
At some stages in the development of distributed configurations, NFS mounts are used to give users on new systems access to their home directories on their original systems. This simplifies the initial transition, but imposes a continuing data communication cost. It is not unknown to have users on system A interacting primarily with data on system B and vice versa.

Access to files via NFS imposes a considerable cost in LAN traffic, client and server CPU time, and end-user response time. The general principle should be that user and data should normally be on the same system. The exceptions are those situations in which there is an overriding concern that justifies the extra expense and time of remote data. Some examples are a need to centralize data for more reliable backup and control, or a need to ensure that all users are working with the most current version of a program.

If these and other needs dictate a significant level of NFS client-server interchange, it is better to dedicate a system to the role of server than to have a number of systems that are part-server, part-client.
Have programs been ported correctly (and justifiably) to use remote procedure calls (RPCs)?
The simplest method of porting a program into a distributed environment is to replace program calls with RPCs on a 1:1 basis. Unfortunately, the disparity in performance between local program calls and RPCs is even greater than the disparity between local disk I/O and NFS I/O. Assuming that the RPCs are really necessary, they should be batched whenever possible.

Everything That Uses a Particular Service or Device Slows Down at Times

Make sure you have followed the configuration recommendations in the appropriate subsystem manual and/or the recommendations in the appropriate "Monitoring and Tuning" chapter of this book.