Using tprof to Analyze Programs for CPU Use

AIX Versions 3.2 and 4 Performance Tuning Guide

Using tprof to Analyze Programs for CPU Use

The typical program execution is a variable mixture of application code, library subroutines, and kernel services. Frequently, a program that has not yet been tuned is found to expend most of its CPU cycles in a few statements or subroutines. Quite often, these "hot spots " are a surprise to the implementor--they can be considered performance bugs. Our tool of choice for pinpointing the hot spots in a program is the trace-driven profiler--tprof. tprof can profile any program produced by one of the XL compilers: XL C, XL C++, and XL FORTRAN.

In AIX Version 4, the tprof program is packaged as part of the Performance Toolbox for AIX. To determine whether tprof is available, use:

lslpp -lI perfagent.tools

If this package has been installed, tprof is available.

The raw data for tprof is obtained via the Trace facility (see "Performance Analysis with the Trace Facility"). When a program is profiled, the Trace facility is activated and instructed to collect data from the trace hook (hook ID 234) that records the contents of the Instruction Address Register when a system-clock interrupt occurs--100 times a second. Several other trace hooks are also activated to allow tprof to track process and dispatch activity. The trace records are not written to a disk file; they are written to a pipe that is read by a program that builds a table of the unique program addresses that have been encountered and the number of times each one occurred. When the workload being profiled is complete, the table of addresses and their occurrence counts is written to disk. The data-reduction component of tprof then correlates the instruction addresses that were encountered with the ranges of addresses occupied by the various programs and reports the distribution of address occurrences ("ticks") across the programs involved in the workload.

The distribution of ticks is roughly proportional to the CPU time spent in each program (10 milliseconds per tick). Once the high-use programs have been identified, the programmer can take action to restructure their hot spots or minimize their use.

A (Synthetic) Cautionary Example

The following C program initializes each byte of a large array of integers to 0x01, increments each int by a random constant, and prints out a randomly selected int. The program does nothing useful, but it is representative of programs that process large arrays.

/*  Array Incrementer -- Version 1  */
#include <stdlib.h>
#define Asize 1024
#define RowDim InnerIndex
#define ColDim OuterIndex
main()
{
  int Increment;
  int OuterIndex;
  int InnerIndex;
  int big [Asize][Asize];
  /* initialize every byte of the array to 0x01 */
  for(OuterIndex=0; OuterIndex<Asize; OuterIndex++)
  {
    for (InnerIndex=0; InnerIndex<Asize; InnerIndex++)
      big[RowDim][ColDim] = 0x01010101;
  }
  Increment = rand();
  /* increment every element in the array */
  for(OuterIndex=0; OuterIndex<Asize; OuterIndex++)
  {
    for (InnerIndex=0; InnerIndex<Asize; InnerIndex++)
    {
      big[RowDim][ColDim] += Increment;
      if (big[RowDim][ColDim] < 0)
       printf("Negative number. %d\n",big[RowDim][ColDim]);
    }
  }
  printf("Version 1 Check Num: %d\n",
        big[rand()%Asize][rand()%Asize]);
  return(0);
}

The program was compiled with the command:

$ xlc -g version1.c -o version1

The -g parameter causes the XL C compiler to generate the object module with symbolic debugging information for use by tprof. Although tprof can profile optimized modules, we have omitted the -O parameter to make the line numbers that tprof uses more precise. When the XL C compiler is optimizing, it often does enough rearrangement of code to make tprof's output harder to interpret. On the test system, this program runs in about 5.97 seconds of elapsed time, of which more than 5.9 seconds is user CPU time. Clearly the program meets its objective of being CPU-limited.

If we tprof the program with the following command:

$ tprof -p version1 -x version1

we get a file called __version1.all (shown below) that reports how many CPU ticks each of the programs involved in the execution consumed.

          Process      PID    Total   Kernel     User   Shared    Other
          =======      ===    =====   ======     ====   ======    =====
         version1 33570480      793       30      763        0        0
              bsh 33566383        8        8        0        0        0
        /etc/init        1        6        0        6        0        0
       /etc/syncd     3059        5        5        0        0        0
            tprof     5038        4        2        2        0        0
          rlogind    11345        2        2        0        0        0
          PID.771      771        1        1        0        0        0
            tprof    11940        1        1        0        0        0
            tprof    11951        1        1        0        0        0
            tprof    13987        1        1        0        0        0
              bsh    16048        1        1        0        0        0
          =======      ===    =====   ======     ====   ======    =====
            Total               823       52      771        0        0
          Process     FREQ    Total   Kernel     User   Shared    Other
          =======      ===    =====   ======     ====   ======    =====
         version1        1      793       30      763        0        0
              bsh        2        9        9        0        0        0
        /etc/init        1        6        0        6        0        0
       /etc/syncd        1        5        5        0        0        0
            tprof        4        7        5        2        0        0
          rlogind        1        2        2        0        0        0
          PID.771        1        1        1        0        0        0
          =======      ===    =====   ======     ====   ======    =====
            Total       11      823       52      771        0        0
   Total Ticks For version1(    USER) =    763
           Subroutine   Ticks     %            Source   Address  Bytes
        =============  ======  ======         =======   =======  =====
                .main     763    92.7      version1.c       632    560

The first section of the report shows the number of ticks consumed by, or on behalf of, each process. version1 used 763 ticks itself, and 30 ticks occurred in the kernel on behalf of version1's process. Two processes running the Bourne shell were involved in the execution of version1. Four processes were running tprof-related code. The init process, the sync daemon, an rlogin process, and one other process accounted for 14 ticks.

Remember that the program associated with a given numerical process ID changes with each exec call. If one application program execs another, both program names will appear in the tprof output associated with the same process ID.

The second section of the report summarizes the results by program, regardless of process ID. It shows the number (FREQ) of different processes that ran each program at some point.

The third section breaks down the user ticks associated with the executable program being profiled. It reports the number of ticks used by each function in the executable, and the percentage of the total run's CPU ticks (823) that each function's ticks represent.

Up to this point, none of the tprof processing has required access to the specially compiled version of the program. We could have done the preceding analysis on a program for which we did not have access to the source code.

It is clear from this report that the preponderance (92.7%) of CPU consumption is in the program itself, not in the kernel nor in library subroutines that the program uses. We have to look at the program itself more closely.

Since we compiled version1.c with the -g option, the object file contains information that relates offsets in the program text to lines of source code. Consequently, tprof created an annotated version of the source file version1.c, called __t.version1.c, based on the offsets and line number information in the object module. The first column is simply the line number. The second column is the number of times the trace hook reported that the Timer interrupt occurred while the system was executing one of the instructions associated with that line.

Ticks Profile for main in version1.c
     
   Line   Ticks   Source
      
    14     34       for(OuterIndex=0; OuterIndex<Asize; OuterIndex++)
    15      -       {
    16     40         for (InnerIndex=0; InnerIndex<Asize; InnerIndex++)
    17    261           big[RowDim][ColDim] = 0x01010101;
    18      -       }
    19      -       Increment = rand();
    20      -
    21      -       /* increment every element in the array */
    22     70       for(OuterIndex=0; OuterIndex<Asize; OuterIndex++)
    23      -       {
    24      -         for (InnerIndex=0; InnerIndex<Asize; InnerIndex++)
    25      -         {
    26     69           big[RowDim][ColDim] += Increment;
    27     50           if (big[RowDim][ColDim] < 0)
    28    239            printf("Negative number.%d\n",
                                 big[RowDim][ColDim]);
    29      -         }
    30      -       }
    31      -       printf("Version 1 Check Num: %d\n",
    32      -             big[rand()%Asize][rand()%Asize]);
    33      -       return(0);
    34      -     }
       
 763 Total Ticks for main in version1.c

This shows clearly that the largest numbers of ticks are associated with accessing elements of the array big, so we should be able to improve performance significantly by concentrating on the inner for loops. The first (initialization) for loop is a case of brute-force programming. It is very inefficient to initialize arrays one element at a time. If we were setting the array to 0, we should have used bzero. Since we are setting each byte to a specific character, we will use memset to replace the first for loop. (The very efficient bzero and memset functions, like the str functions, are written in assembler language and use hardware instructions that have no direct equivalent in the C language.)

We have to access the array one element at a time to increment the values, but we should ensure that the pattern of memory reference is to consecutive addresses, to maximize cache use. In this case, we have the row dimension changing faster than the column dimension. Since C arrays are arranged in row-major order, we are skipping over a complete row with each successive memory reference. Since the rows are 1024 ints long (4096 bytes), we are changing pages on every reference. The size of the array greatly exceeds both the data cache and data TLB capacities, so we have written a program for maximum cache and TLB thrashing. To fix this problem, we simply transpose the #defines to reverse the values of RowDim and ColDim.

The unoptimized form of the resulting program (version2.c) consumes about 2.7 CPU seconds, compared with 7.9 CPU seconds for version1.

The following file, __t.version2.c, is the result of a tprof run against the unoptimized form:

Ticks Profile for main in version2.c
    
   Line   Ticks   Source
     
    15      -       memset(big,0x01,sizeof(big));
    16      -       Increment = rand();
    17      -
    18      -       /* increment in memory order */
    19     60       for(OuterIndex=0; OuterIndex<Asize; OuterIndex++)
    20      -       {
    21      -         for (InnerIndex=0; InnerIndex<Asize; InnerIndex++)
    22      -         {
    23     67          big[RowDim][ColDim] += Increment;
    24     60          if (big[RowDim][ColDim] < 0)
    25     43          printf("Negative number. %d\n",big[RowDim][ColDim]);
    26      -         }
    27      -       }
    28      -       printf("Version 2 Check Num: %d\n",
    29      -                   big[rand()%Asize][rand()%Asize]);
    30      -       return(0);
    31      -     }
     
 230 Total Ticks for main in version2.c

By being aware of its CPU use pattern, we have improved the CPU speed of this program by a factor of almost three--for the unoptimized case. When we compile version1.c and version2.c with optimization and compare their performance, the "before and after" improvement due to our changes is a factor of 7.

In many cases, most of a program's CPU use will occur in the library subroutines it uses rather than in the program itself. If we take version2.c and remove the conditional test on line 24 and the printf entry on line 28, to create a version3.c that reads as follows:

#include <string.h>
#include <stdlib.h>
#define Asize 256
#define RowDim OuterIndex
#define ColDim InnerIndex

main()
{ 
  int Increment;
  int OuterIndex;
  int InnerIndex;
  int big [Asize][Asize];
    
  /* Initialize every byte to 0x01 */
  memset(big,0x01,sizeof(big));
  Increment = rand();
  /* increment in memory order */
  for(OuterIndex=0; OuterIndex<Asize; OuterIndex++)
  {
    for (InnerIndex=0; InnerIndex<Asize; InnerIndex++)
    {
      big[RowDim][ColDim] += Increment;
      printf("RowDim=%d, ColDim=%d, Number=%d\n",
              RowDim, ColDim, big[RowDim][ColDim]);
    }
  }
  return(0);
}

the execution time becomes dominated by the printf statement. The command:

$ tprof -v -s -k -p version3 -x version3 >/dev/null

produces a __version3.all that includes profiling data for the kernel and the shared subroutine library libc.a (the only shared library this program uses):

          Process      PID    Total   Kernel     User   Shared    Other
          =======      ===    =====   ======     ====   ======    =====
         version3 33568373      818       30       19      769        0
              bsh 33567348        5        5        0        0        0
            tprof    15987        3        1        2        0        0
            tprof     7784        1        1        0        0        0
            tprof    12905        1        1        0        0        0
              bsh    13941        1        1        0        0        0
          =======      ===    =====   ======     ====   ======    =====
            Total               829       39       21      769        0
    
          Process     FREQ    Total   Kernel     User   Shared    Other
          =======      ===    =====   ======     ====   ======    =====
         version3        1      818       30       19      769        0
              bsh        2        6        6        0        0        0
            tprof        3        5        3        2        0        0
          =======      ===    =====   ======     ====   ======    =====
            Total        6      829       39       21      769        0
     
   Total Ticks For version3(    USER) =     19
     
           Subroutine   Ticks     %            Source   Address  Bytes
        =============  ======  ======         =======   =======  =====
                .main      11     1.3      version3.c       632    320
              .printf       8     1.0         glink.s      1112     36
     
   Total Ticks For version3(  KERNEL) =     30

           Subroutine   Ticks     %            Source   Address  Bytes
        =============  ======  ======         =======   =======  =====
             .sc_flih       7     0.8           low.s     13832   1244
            .i_enable       5     0.6           low.s     21760    256
            .vmcopyin       3     0.4        vmmove.c    414280    668
         .xix_setattr       2     0.2     xix_sattr.c    819368    672
          .isreadonly       2     0.2        disubs.c    689016     60
               .lockl       2     0.2         lockl.s     29300    208
            .v_pagein       1     0.1    v_getsubs1.c    372288   1044
             .curtime       1     0.1         clock.s     27656     76
             .trchook       1     0.1          noname     48168    856
               .vmvcs       1     0.1         vmvcs.s     29744   2304
           .spec_rdwr       1     0.1    spec_vnops.c    629596    240
                .rdwr       1     0.1          rdwr.c    658460    492
               .imark       1     0.1         isubs.c    672024    184
               .nodev       1     0.1     devsw_pin.c    135864     32
           .ld_findfp       1     0.1      ld_libld.c    736084    240
    
   Total Ticks For version3( SH-LIBs) =    769
    
        Shared Object   Ticks     %            Source   Address  Bytes
        =============  ======  ======         =======   =======  =====
         libc.a/shr.o     769    92.0        /usr/lib    794624 724772
     
   Profile: /usr/lib/libc.a  shr.o
      
   Total Ticks For version3(/usr/lib/libc.a) =    769
      
           Subroutine   Ticks     %            Source   Address  Bytes
        =============  ======  ======         =======   =======  =====
             ._doprnt     476    56.9        doprnt.c     36616   7052
              .fwrite     205    24.5        fwrite.c     50748    744
              .strchr      41     4.9        strchr.s     31896    196
              .printf      18     2.2        printf.c    313796    144
             ._moveeq      16     1.9        memcmp.s     36192    184
              .strlen      10     1.2      strerror.c     46800    124
              .isatty       1     0.1        isatty.c     62932    112
             ._xwrite       1     0.1        flsbuf.c      4240    280
             .__ioctl       1     0.1         ioctl.c     57576    240

This confirms that most of the ticks are being used by the shared libraries--libc.a, in this case. The profile of libc.a shows that most of those ticks are being consumed by the _doprnt subroutine.

_doprnt is the processing module for printf, sprintf, etc. With a simple change, we have increased the run time from 2.7 seconds to 8.6 seconds, and our formatted printing now consumes about 60% of the CPU time. This makes it clear why formatting should be used judiciously. _doprnt performance is also affected by the locale. See Appendix I, "National Language Support--Locale vs Speed." These tests were run in the C locale--the most efficient.