AIX Versions 3.2 and 4 Performance Tuning Guide

mbuf Pool Performance Tuning

Note: This section applies primarily to AIX Version 3.2.5. The mbuf allocation mechanism in AIX Version 4 is substantially different. In AIX Version 4, you can set the maximum amount of memory that will be used by the network allocator in the same way you set it in AIX Version 3.2.5--with the no command and the thewall parameter. All other tuning options that were available in AIX Version 3.2.5 have been removed from AIX Version 4 because the AIX Version 4 mbuf allocation mechanism is much more self-tuning.

The network subsystem uses a memory management facility that revolves around a data structure called an mbuf. Mbufs are mostly used to store data for incoming and outgoing network traffic. Having mbuf pools of the right size can have a very positive effect on network performance. If the mbuf pools are configured improperly, both network and system performance can suffer. The AIX operating system offers the capability for run-time mbuf pool configuration. With this convenience comes the responsibility for knowing when the pools need adjusting and how much they should be adjusted. The sections that follow give an overview of the mbuf management facility, describe when to tune the mbuf pools, and explain how to tune them.

Overview of the mbuf Management Facility

The mbuf management facility controls two pools of buffers: a pool of small buffers (256 bytes each), which are simply called mbufs, and a pool of large buffers (4096 bytes each), which are usually called mbuf clusters or just clusters. The pools are created from system memory by making an allocation request to the Virtual Memory Manager (VMM). The pools consist of pinned pieces of virtual memory; this means that they always reside in physical memory and are never paged out. The result is that the real memory available for paging in application programs and data has been decreased by the amount that the mbuf pools have been increased. This is a nontrivial cost that must always be taken into account when considering an increase in the size of the mbuf pools.

The initial size of the mbuf pools is system-dependent. There is a minimum number of (small) mbufs and clusters allocated for each system, but these minimums are increased by an amount that depends on the specific system configuration. One factor affecting how much they are increased is the number of communications adapters in the system. The default pool sizes are initially configured to handle small- to medium-size network loads (network traffic of 100-500 packets/second). The pool sizes dynamically increase as network loads increase. The cluster pool shrinks if network loads decrease (the mbuf pool is never reduced). To optimize network performance, the administrator should balance mbuf pool sizes with network loads (packets/second). If the network load is particularly oriented towards UDP traffic (as it would be on an NFS server, for example), the size of the mbuf pool should be two times the packets/second rate, because UDP traffic consumes an extra small mbuf per packet.
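
As a rough worked example, the rule of thumb above can be applied as follows; the 1500 packets/second figure is simply the load assumed by the example script shown later under "How to Tune the mbuf Pools", and the variable names are illustrative only:

#!/bin/ksh
# Hypothetical peak load observed with netstat -I (packets/second).
RATE=1500
# For UDP-heavy traffic (such as an NFS server), allow roughly two
# small mbufs per packet, per the guideline above.
let TARGET=RATE*2
echo "suggested small mbuf pool size: $TARGET"    # prints 3000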

To provide an efficient mbuf allocation service, an attempt is made to maintain a minimum number of free buffers in the pools at all times. The lowmbuf and lowclust network parameters (which can be manipulated using the no command) are used to define these lower limits.

The lowmbuf parameter controls the minimum number of free buffers for the mbuf pool. The lowclust parameter controls the minimum number of free buffers for the cluster pool. When the number of free buffers in a pool drops below the lowmbuf or lowclust threshold, the pool is expanded by some amount. The expansion of the mbuf pools is not done immediately, but is scheduled to be done by a kernel service named netm. When the netm kernel service is dispatched, the pools are expanded to meet the minimum requirements of lowclust and lowmbuf. Having a kernel process do this work is required by the structure of the VMM.
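
For example, these thresholds can be displayed, and changed by the root user, with the no command; the value in the last line is purely illustrative:

no -o lowmbuf           # display the current lowmbuf threshold
no -o lowclust          # display the current lowclust threshold
no -o lowclust=40       # example value only: raise the free-cluster minimum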

An additional function that the netm kernel service provides is to limit the growth of the cluster pool. The mb_cl_hiwat network parameter defines this maximum value.

The mb_cl_hiwat parameter controls the maximum number of free buffers the cluster pool can contain. When the number of free clusters in the pool exceeds mb_cl_hiwat, netm is scheduled to release some of the clusters back to the VMM.

The netm kernel process runs at a very favored priority (fixed 37). Because of this, excessive netm dispatching can cause not only poor network performance but also poor system performance, because of contention with other system and user processes. Improperly configured pools can result in netm "thrashing" due to conflicting network traffic needs and improperly tuned thresholds. The netm dispatching can be minimized by properly configuring the mbuf pools to match system and networking needs.

The last network parameter that is used by the mbuf management facility is thewall.

The thewall parameter controls the maximum amount of RAM (in kilobytes) that the mbuf management facility can allocate from the VMM. This parameter is used to prevent the facility from unbalancing VMM resources, which would result in poor system performance.
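
For reference, both limits can be queried with the no command, and thewall can be raised if the pools need room to grow; the 10240 (10MB) value below is only an example, matching the script shown later under "How to Tune the mbuf Pools":

no -o mb_cl_hiwat       # maximum free clusters before netm shrinks the cluster pool
no -o thewall           # maximum RAM (in KB) the mbuf facility may allocate
no -o thewall=10240     # example only: allow the facility up to 10MB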

When to Tune the mbuf Pools

When and how much to tune the mbuf pools is directly related to the network load to which a given machine is being subjected. A server machine that is supporting many clients is a good candidate for having the mbuf pools tuned to optimize network performance. It is important for the system administrator to understand the networking load for a given system.

By using the netstat command you can get a rough idea of the network load in packets/second. For example:

netstat -I tr0 2

reports the input and output traffic, at 2-second intervals, both for the tr0 adapter and for all LAN adapters on the system. The output below shows the activity caused by a large ftp transfer:

$ netstat -I tr0 2
   input    (tr0)     output            input   (Total)    output
 packets  errs  packets  errs colls   packets  errs  packets  errs colls 
   20615   227     3345     0     0     20905   227     3635     0     0
      17     0        1     0     0        17     0        1     0     0
     174     0      320     0     0       174     0      320     0     0
     248     0      443     0     0       248     0      443     0     0
     210     0      404     0     0       210     0      404     0     0
     239     0      461     0     0       239     0      461     0     0
     253     1      454     0     0       253     1      454     0     0
     246     0      467     0     0       246     0      467     0     0
      99     1      145     0     0        99     1      145     0     0
      13     0        1     0     0        13     0        1     0     0

The netstat command also has a flag, -m, that gives detailed information about the use and availability of the mbufs and clusters:

253 mbufs in use:
        50 mbufs allocated to data
        1 mbufs allocated to packet headers
        76 mbufs allocated to socket structures
        100 mbufs allocated to protocol control blocks
        10 mbufs allocated to routing table entries
        14 mbufs allocated to socket names and addresses
        2 mbufs allocated to interface addresses
16/64 mapped pages in use
319 Kbytes allocated to network (39% in use)
0 requests for memory denied
0 requests for memory delayed
0 calls to protocol drain routines

The line 16/64 mapped pages in use indicates that there are 64 pinned clusters, of which 16 are currently in use.
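
Since each cluster occupies one 4096-byte page, those 64 mapped pages represent 64 x 4096 bytes = 256KB of pinned memory held by the cluster pool.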

This report can be compared to the existing system parameters by issuing a no -a command. The following lines from the report are of interest:

   lowclust = 29
    lowmbuf = 88
    thewall = 2048
mb_cl_hiwat = 58               

It is clear that on the test system, the 319 Kbytes allocated to network is considerably short of the thewall value of 2048KB, and the 48 free clusters (64 mapped pages minus the 16 in use) are below the mb_cl_hiwat limit of 58.

The requests for memory denied counter is maintained by the mbuf management facility and is incremented each time a request for an mbuf allocation cannot be satisfied. Normally the requests for memory denied value is 0. If a system experiences a high burst of network traffic, the default configured mbuf pools may not be sufficient to meet the demand of the incoming burst, and the counter is incremented once for each mbuf allocation request that fails. The count usually reaches the thousands, because of the large number of packets arriving in a short interval. The requests for memory denied statistic corresponds to dropped packets on the network. Dropped network packets mean retransmissions, resulting in degraded network performance. If the requests for memory denied value is greater than zero, it may be appropriate to tune the mbuf parameters--see the following section, "How to Tune the mbuf Pools."

The Kbytes allocated to network statistic is maintained by the mbuf management facility and represents the current amount of system memory that has been allocated to both mbuf pools. The upper bound on this statistic, set by thewall, is used to prevent the mbuf management facility from consuming too much of a system's physical memory. The default value of thewall limits the mbuf management facility to 2048KB (as shown in the report generated by the no -a command). If the Kbytes allocated to network value approaches thewall, it may be appropriate to tune the mbuf parameters. See "How to Tune the mbuf Pools", below.
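
A quick way to check both of these indicators is to extract the relevant lines from the netstat -m report and compare them with the current thewall setting (a minimal sketch; the grep patterns simply match the report format shown above):

netstat -m | grep "requests for memory denied"    # should normally report 0
netstat -m | grep "allocated to network"          # compare with the thewall value
no -o thewall                                     # current limit in kilobytes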

There are cases where the above indicators suggest that the mbuf pools may need to be expanded, when in fact there is a system problem that should be corrected first. For example:

An mbuf memory leak is a situation in which some kernel or kernel-extension code has neglected to release an mbuf resource and has destroyed the pointer to its memory location, thereby losing the address of the mbuf forever. If this occurs repeatedly, eventually all the mbuf resources will be used up. If the netstat mbuf statistics show a gradual increase in usage that never decreases or high mbuf usage on a relatively idle system, there may be an mbuf memory leak. Developers of kernel extensions that use mbufs should always include checks for memory leaks in their testing.

It is also possible to have a large number of mbufs queued at the socket layer because of an application defect. Normally an application program reads data from the socket, causing the mbufs to be returned to the mbuf management facility. An administrator can monitor the statistics generated by the netstat -m command and look for high mbuf usage while there is no expected network traffic. The administrator can also view the current list of running processes (by entering ps -ef) and scan for applications that use the network subsystem and are consuming large amounts of CPU time. If this behavior is observed, the suspected application defect should be isolated and fixed.
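
A minimal monitoring sketch along these lines (the 60-second interval is an arbitrary choice) records mbuf usage over time, so that a count that only ever grows on an otherwise idle system stands out:

#!/bin/ksh
# Log overall mbuf usage once a minute; usage that climbs steadily with
# no corresponding network load suggests a leak or an application that
# is not reading its sockets.
while true
do
    date
    netstat -m | grep "mbufs in use"
    sleep 60
done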

How to Tune the mbuf Pools

With an understanding of how the mbuf pools are organized and managed, tuning the mbuf pools is simple in the AIX operating system and can be done at run time. The no command can be used by the root user to modify the mbuf pool parameters. Some guidelines are:

When raising lowmbuf or lowclust, make sure thewall is large enough to accommodate the expanded pools; otherwise the pools cannot grow to the requested levels.

Because the pools are pinned memory, any increase in pool size reduces the real memory available for paging; after expanding the pools, verify that paging rates have not increased as a result.

Keep mb_cl_hiwat comfortably above lowclust. If the two thresholds are close together, netm repeatedly expands and then shrinks the cluster pool, which is the "thrashing" behavior described earlier.

Size the pools to the observed network load in packets/second; for UDP-heavy loads (such as an NFS server), allow roughly two small mbufs per packet.

The following is an example shell script that might be run at the end of /etc/rc.net to tune the mbuf pools for an NFS server that experiences a network traffic load of approximately 1500 packets/sec.

#!/bin/ksh
# echo "Tuning mbuf pools..."
# set maximum amount of memory to allow for allocation (10MB)
no -o thewall=10240
   
# set minimum number of small mbufs
no -o lowmbuf=3000
   
# generate network traffic to force small mbuf pool expansion
ping 127.0.0.1  1000 1 >/dev/null
   
# set minimum number of small mbufs back to default to prevent netm from
# running unnecessarily
no -d lowmbuf
   
# set maximum number of free clusters the pool may hold before
# netm releases clusters back to the VMM (1500 clusters, about 6MB)
no -o mb_cl_hiwat=1500
   
# gradually expand cluster pool
N=10
while [ $N -lt 1500 ]
do
  no -o lowclust=$N
  ping 127.0.0.1 1000 1 >/dev/null
  let N=N+10
done
   
# set minimum number of clusters back to default to prevent netm 
# from running unnecessarily
no -d lowclust

You can use netstat -m after running the above script to verify the size of the pool of clusters (which the netstat command calls mapped pages). To verify the size of the pool of mbufs you can use the crash command to examine a kernel data structure, mbstat (see the /usr/include/sys/mbuf.h file); the first word of the mbstat structure contains the size of the mbuf pool. On AIX Version 4, or on AIX Version 3.2 with PTF U437500 installed, the od mbstat subcommand displays that first word directly, and the dialog would be similar to the following:

$ crash
> od mbstat
001f7008: 00000180
> quit

The size of the mbuf pool is therefore 0x180, or 384 mbufs.

If you are using AIX Version 3.2 without PTF U437500 installed, od mbstat displays the kernel address of the mbstat structure; you must then enter od <kernel address> to dump the first word. The dialog would be similar to the following:

$ crash
> od mbstat
000e2be0: 001f7008
> od 1f7008
001f7008: 00000180
> quit

The size of the mbuf pool is therefore 0x180, or 384 mbufs.

