Network Lock Manager

AIX Version 4.3 System Management Guide: Communications and Networks

The network lock manager is a facility that works in cooperation with the Network File System (NFS) to provide a System V style of advisory file and record locking over the network. The network lock manager (rpc.lockd) and the network status monitor (rpc.statd) are network-service daemons. The rpc.statd daemon is a user-level process, while the rpc.lockd daemon is implemented as a set of kernel threads (similar to the NFS server). Both daemons are essential to the kernel's ability to provide fundamental network services.

Note: Mandatory or enforced locks are not supported over NFS.

Network Lock Manager Architecture

The network lock manager contains both server and client functions. The client functions are responsible for processing requests from the applications and sending requests to the network lock manager at the server. The server functions are responsible for accepting lock requests from clients and generating the appropriate locking calls at the server. The server will then respond to the client's locking request.

In contrast to NFS, which is stateless, the network lock manager has an implicit state. In other words, the network lock manager must remember certain information about a client, that is, whether the client currently has a lock. The network status monitor, rpc.statd, implements a simple protocol that allows the network lock manager to monitor the status of other machines on the network. By having accurate status information, the network lock manager can maintain a consistent state within the stateless NFS environment.

Network File Locking Process

When an application wants to obtain a lock on a local file, it sends its request to the kernel using the lockf, fcntl, or flock subroutine. The kernel then processes the lock request. However, if an application on an NFS client makes a lock request for a remote file, the Network Lock Manager client will generate a Remote Procedure Call (RPC) to the server to handle the request.

When the client receives an initial remote lock request, it registers interest in the server with the client's rpc.statd daemon. The same is true for the network lock manager at the server: on the initial request from a client, the server's network lock manager registers interest in that client with the server's local network status monitor.

Crash Recovery Process

Each machine's rpc.statd daemon notifies every other machine's rpc.statd daemon of its activities. When a machine's rpc.statd daemon receives notice that another machine crashed or recovered, it notifies its rpc.lockd daemon.

If a server crashes, clients with locked files must be able to recover their locks. If a client crashes, its servers must hold the client locks while it recovers. Additionally, to preserve the overall transparency of NFS, the crash recovery must occur without requiring the intervention of the applications themselves.

The crash recovery procedure is simple. If the failure of a client is detected, the server releases the failed client's locks, on the assumption that the client application will request locks again as needed. If the crash and recovery of a server is detected, the client lock manager retransmits all lock requests previously granted by the server. This retransmitted information is used by the server to reconstruct its locking state during a grace period. (The grace period, 45 seconds by default, is a time period within which a server allows clients to reclaim their locks.)

The rpc.statd daemon uses the host names kept in /etc/sm and /etc/sm.bak to keep track of which hosts must be informed when the machine needs to recover operations.
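
To see which hosts the status monitor is currently tracking, you can list these two directories. The following is a minimal sketch, not part of the AIX tooling: the function name list_monitored_hosts is hypothetical, and it assumes that each monitored host is recorded as one entry named after the host (which is consistent with the rm command shown under "Troubleshooting the Network Lock Manager").

```shell
# List the host entries recorded in a status-monitor directory.
# Assumption: one entry per monitored host, named after the host.
list_monitored_hosts() {
    dir=$1
    [ -d "$dir" ] || return 1    # fail if the directory does not exist
    ls -1 "$dir"
}

# Typical use on a live system (paths taken from the text above):
#   list_monitored_hosts /etc/sm
#   list_monitored_hosts /etc/sm.bak
```

Entries in /etc/sm.bak are the hosts that rpc.statd still needs to notify after a restart.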

Starting the Network Lock Manager

By default, the /etc/rc.nfs script starts the rpc.lockd and rpc.statd daemons along with the other NFS daemons. If NFS is already running, you can verify that the rpc.lockd and rpc.statd daemons are running by following the instructions in "Get the Current Status of the NFS Daemons". The status of these two daemons should be active. If the rpc.lockd and rpc.statd daemons are not active, and therefore not running, do the following:

Using your favorite text editor, open the /etc/rc.nfs file.

Search for the following lines:

```
if [ -x /usr/sbin/rpc.statd ]; then
       startsrc -s rpc.statd
fi
if [ -x /usr/sbin/rpc.lockd ]; then
       startsrc -s rpc.lockd
fi
```

If there is a pound sign (#) at the beginning of any of these lines, delete it, then save and exit the file. Then start the rpc.statd and rpc.lockd daemons by following the instructions in "Start the NFS Daemons".
Note: Sequence is important. Always start the statd daemon first.
If NFS is running and the entries in the /etc/rc.nfs file are correct, stop and restart the rpc.statd and rpc.lockd daemons by following the instructions in "Stop the NFS Daemons" and "Start the NFS Daemons".
Note: Sequence is important. Always start the statd daemon first.

If the rpc.statd and rpc.lockd daemons are still not running, see "Troubleshooting the Network Lock Manager."
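
The check for commented-out entries described in the steps above can also be done without opening an editor. The following sketch is one possible way to do it; the function name find_disabled_lock_daemons is hypothetical. It prints, with line numbers, every commented line that mentions rpc.statd or rpc.lockd. Note that a comment that merely mentions one of the daemons in passing will also be reported, so review the output before editing the file.

```shell
# Print commented-out lines that mention rpc.statd or rpc.lockd.
# The file defaults to /etc/rc.nfs; pass another path to inspect a copy.
# Returns non-zero when no such lines are found.
find_disabled_lock_daemons() {
    rcfile=${1:-/etc/rc.nfs}
    grep -n -E '^[[:space:]]*#.*rpc\.(statd|lockd)' "$rcfile"
}

# Example:
#   find_disabled_lock_daemons            # checks /etc/rc.nfs
#   find_disabled_lock_daemons /tmp/rc.nfs.copy
```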

Troubleshooting the Network Lock Manager

If you receive a message on a client similar to:

clnttcp_create: RPC: Remote System error - Connection refused
rpc.statd:cannot talk to statd at {server}

then the machine believes that another machine needs to be informed that it may have to take recovery measures. When a machine reboots, or when the rpc.lockd and rpc.statd daemons are stopped and restarted, machine names are moved from /etc/sm to /etc/sm.bak, and the rpc.statd daemon tries to inform each machine that has an entry in /etc/sm.bak that recovery procedures are needed.

If the rpc.statd daemon can reach the machine, its entry in /etc/sm.bak is removed. If the rpc.statd daemon cannot reach the machine, it keeps trying at regular intervals. Each time the machine fails to respond, the timeout generates the above message. In the interest of locking integrity, the daemon will continue to try; however, this can have an adverse effect on locking performance. The handling is different, depending on whether the target machine is just unresponsive or semi-permanently taken out of production. To eliminate the message:

Verify that the statd and lockd daemons on the server are running by following the instructions in "Get the Current Status of the NFS Daemons". (The status of these two daemons should be active.)
If these daemons are not running, start the rpc.statd and rpc.lockd daemons on the server by following the instructions in "Start the NFS Daemons".
Note: Sequence is important. Always start the statd daemon first.

After you have restarted the daemons, remember that there is a grace period. During this time, the lockd daemons allow reclaim requests to come from other clients that previously held locks with the server, so you will not get a new lock immediately after starting the daemons.
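
The status check in the first step above can be scripted by reading the output of the lssrc command. The following is a sketch under stated assumptions: the function name subsystem_is_active is hypothetical, and it assumes that lssrc prints one row per subsystem with the subsystem name in the first column and the status in the last column.

```shell
# Read lssrc output on standard input and succeed only if the named
# subsystem reports "active" in the last (Status) column.
subsystem_is_active() {
    awk -v name="$1" '
        $1 == name && $NF == "active" { found = 1 }
        END { exit !found }
    '
}

# On an AIX system:
#   lssrc -s rpc.statd | subsystem_is_active rpc.statd && echo "statd is active"
#   lssrc -s rpc.lockd | subsystem_is_active rpc.lockd && echo "lockd is active"
```

Remember that an active status immediately after a restart does not mean new locks will be granted at once, because the grace period still applies.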

Alternatively, you can eliminate the message by doing the following:

Stop the rpc.statd and rpc.lockd daemons on the client by following the instructions in "Stop the NFS Daemons".
On the client, remove the target machine entry from /etc/sm.bak by entering:
```
rm /etc/sm.bak/TargetMachineName
```
This action will keep the target machine from being aware that it may need to participate in locking recovery, so it should only be used when it can be determined that the machine does not have any applications running that are participating in network locking with the affected machine.
Start the rpc.statd and rpc.lockd daemons on the client by following the instructions in "Start the NFS Daemons".
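
Because the rm command above acts on a system directory, it may be worth wrapping it in a small guard. The following sketch is illustrative only; the function name remove_monitor_entry is hypothetical. It rejects empty names and names that could point outside the directory, and it reports an error if no entry exists for the host.

```shell
# Remove one host entry from a status-monitor backup directory.
# Rejects empty names, "." and "..", and names containing "/".
remove_monitor_entry() {
    dir=$1
    host=$2
    case $host in
        ''|.|..|*/*) echo "invalid host name: '$host'" >&2; return 2 ;;
    esac
    [ -e "$dir/$host" ] || { echo "no entry for $host in $dir" >&2; return 1; }
    rm "$dir/$host"
}

# Example, run on the client while rpc.statd and rpc.lockd are stopped:
#   remove_monitor_entry /etc/sm.bak TargetMachineName
```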

If you are unable to obtain a lock from a client, do the following:

Use the ping command to verify that the client and server can reach and recognize each other. If the machines are both running and the network is intact, check the host names listed in the /etc/hosts file for each machine. Host names must exactly match between server and client for machine recognition. If a name server is being used for host name resolution, make sure the host information is exactly the same as that in the /etc/hosts file.
Verify that the rpc.lockd and rpc.statd daemons are running on both the client and the server by following the instructions in "Get the Current Status of the NFS Daemons". The status of these two daemons should be active.
If they are not active, start the rpc.statd and rpc.lockd daemons by following the instructions in "Start the NFS Daemons".
If they are active, you may need to reset them on both clients and servers. To do this, stop all the applications that are requesting locks.
Next, stop the rpc.statd and rpc.lockd daemons on both the client and the server by following the instructions in "Stop the NFS Daemons".
Now, restart the rpc.statd and rpc.lockd daemons, first on the server and then on the client, by following the instructions in "Start the NFS Daemons".
Note: Sequence is important. Always start the statd daemon first.
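
For the host name comparison in the first step above, it can help to print exactly which names each machine associates with a given address and compare the results side by side. The following sketch assumes the usual hosts file layout (an address followed by one or more names, with # starting a comment); the function name names_for_address is hypothetical.

```shell
# Print the names listed for an address in a hosts-format file,
# one per line, ignoring comments. The file defaults to /etc/hosts.
names_for_address() {
    addr=$1
    file=${2:-/etc/hosts}
    awk -v addr="$addr" '
        { sub(/#.*/, "") }    # drop comments before splitting fields
        $1 == addr { for (i = 2; i <= NF; i++) print $i }
    ' "$file"
}

# Run the same lookup on the client and on the server, then compare:
#   names_for_address ServerAddress
```

If the two machines print different names for the same address, correct the /etc/hosts files (or the name server data) so that they match exactly.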

If the procedure does not alleviate the locking problem, run the lockd daemon in debug mode, by doing the following:

Stop the rpc.statd and rpc.lockd daemons on both the client and the server by following the instructions in "Stop the NFS Daemons".
Start the rpc.statd daemon on the client and server by following the instructions in "Start the NFS Daemons".
Start the rpc.lockd daemon on the client and server by entering:
```
/usr/sbin/rpc.lockd -d1
```
When invoked with the -d1 flag, the lockd daemon provides diagnostic messages to standard output. At first, there will be a number of messages dealing with the grace period; wait for them to time out. After the grace period has timed out on both the server and any clients, run the application that is having lock problems and verify that a lock request is transmitted from client to server and server to client.
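
The debug procedure above can be collected into one function so that the order of operations is not left to memory. This is a sketch, not a supported tool: the names run and start_lockd_debug are hypothetical, the use of stopsrc -s to stop each daemon is an assumption standing in for the instructions in "Stop the NFS Daemons", and the order in which the two daemons are stopped is likewise an assumption. Setting DRY_RUN=1 prints the commands instead of running them.

```shell
# Run a command, or only print it when DRY_RUN=1.
run() {
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Restart the lock daemons with rpc.lockd in debug mode.
# rpc.statd is started before rpc.lockd, as the notes above require.
start_lockd_debug() {
    run stopsrc -s rpc.lockd        # assumed stop command and order
    run stopsrc -s rpc.statd
    run startsrc -s rpc.statd       # always start statd first
    run /usr/sbin/rpc.lockd -d1     # diagnostics go to standard output
}

# Preview the sequence without changing anything:
#   DRY_RUN=1 start_lockd_debug
```

Run the function on both the client and the server, then wait for the grace-period messages to stop before retrying the application.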