Client-side Daemon

Overview

The client-side daemon (CSD) is responsible for those aspects of client-side operation for which kernel implementation is not necessary or desirable.

The CSD is responsible for tracking the location of each page of data stored on NMS servers, and for noticing when a page has dropped below the desired degree of replication because of a server crash. In that case, a thread performs a pagein/pageout operation to restore the required degree of replication.

Interfaces

The CSD uses the following interfaces with other system components:

Data Structures

The CSD maintains the following principal data structures:

Server Operation

Initialization

The CSD performs the following initialization actions when it starts:

  1. It opens a connection to the syslog facility, used for logging status and debugging messages.
  2. It registers itself with the CSKM by performing an open() call on the control interface.
  3. It obtains the list of available NMS servers, their 16-bit host IDs, their Ethernet addresses, and their Myrinet addresses. In the initial prototype, this list will be read from a file on the disk. Using ioctl() calls on the control interface, the CSD informs the CSKM about each of the servers. Initially, the servers are marked by the CSD and CSKM as "down".
  4. It initializes its 64-bit global session ID to the current local time of day.
  5. It joins a multicast group, with a well-known address, used for communicating general system statistics between NMS hosts.
  6. It creates a stream socket for accepting administrative control connections, and binds it to a well-known port.
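
Parts of the initialization steps above can be sketched as follows. This is only an illustration under stated assumptions: the server list file format (one "host_id ethernet_addr myrinet_addr" entry per line), the microsecond resolution of the session ID, and all function and class names are hypothetical and do not come from the NMS design. The CSKM open() and ioctl() calls are omitted because the control interface is not specified here.

```python
import socket
import struct
import time
from dataclasses import dataclass


@dataclass
class ServerEntry:
    host_id: int          # 16-bit host ID
    ethernet_addr: str    # kept as an opaque string (format is an assumption)
    myrinet_addr: str     # kept as an opaque string (format is an assumption)
    state: str = "down"   # servers are initially marked "down"


def parse_server_list(text):
    """Parse a hypothetical 'host_id ethernet_addr myrinet_addr' file."""
    servers = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()   # allow '#' comments
        if not line:
            continue
        host_id_str, eth, myri = line.split()
        host_id = int(host_id_str)
        if not 0 <= host_id <= 0xFFFF:
            raise ValueError(f"host ID {host_id} does not fit in 16 bits")
        servers.append(ServerEntry(host_id, eth, myri))
    return servers


def new_session_id(now=None):
    """Derive a 64-bit session ID from the time of day (assumed microseconds)."""
    if now is None:
        now = time.time()
    return int(now * 1_000_000) & 0xFFFFFFFFFFFFFFFF


def join_stats_group(group, port):
    """Join an IPv4 multicast group on a UDP socket (step 5)."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    sock.bind(("", port))
    mreq = struct.pack("4s4s", socket.inet_aton(group),
                       socket.inet_aton("0.0.0.0"))
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP, mreq)
    return sock


def open_admin_listener(port, host=""):
    """Create and bind the administrative stream socket (step 6)."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    sock.bind((host, port))
    sock.listen(5)
    return sock
```

The multicast group address and both port numbers are passed as parameters because the well-known values are not given in this document.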

Main Loop

After initialization, the CSD enters its main loop, in which the following processing is performed:

Shutdown

The crash recovery mechanisms permit the CSD to shut down at any time without warning. Under normal circumstances, however, the CSD will send a "client shutdown" message over the TCP connection to each server with which it is in contact. Once this message has been sent, the client closes the TCP connection with the server. When the server receives the client shutdown message, it closes its end of the TCP connection, and then proceeds as if the client had crashed.

If the CSD receives a "server shutdown" message over the TCP connection to a server, the CSD marks that server as "shutting down" and begins "down server" processing and "restore replication" processing to restore the proper degree of replication of any data stored on that server. The only difference between this procedure and the one that would occur if the TCP connection with the server had been lost is that in this case the CSD regards the server that is shutting down as a viable candidate for pagein (but not pageout) of the pages to be replicated. Once the degree of replication has been restored for all pages stored on the server that is shutting down, the CSD issues a "server shutdown acknowledge" message and closes its side of the TCP connection.
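
The pagein/pageout distinction for a server that is shutting down can be expressed as a pair of predicates on the server state. The sketch below is illustrative only; the enum, the function names, and the "up" state label are assumptions (the document itself names only the "down" and "shutting down" markings).

```python
from enum import Enum


class ServerState(Enum):
    DOWN = "down"
    UP = "up"                        # assumed name for a connected server
    SHUTTING_DOWN = "shutting down"


def can_page_in_from(state):
    """A shutting-down server still holds valid data, so it may be read."""
    return state in (ServerState.UP, ServerState.SHUTTING_DOWN)


def can_page_out_to(state):
    """New replicas are only placed on servers that are fully up."""
    return state is ServerState.UP
```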

Detailed Descriptions

Status Tracking

The CSD maintains TCP connections over Ethernet with the SSD at each server host with which it communicates. These connections are used to establish a context for communication, and to identify and handle failures or reinitialization of the NMS system on the server hosts.

As part of its main loop, the CSD attempts to contact each server it knows about, with which it does not currently have a TCP connection, and to establish such a connection. When such a connection is established, the SO_KEEPALIVE option is set, so that if the connection is broken, the CSD will be notified the next time it tries to transmit over the connection. Periodic heartbeat messages are used to ensure that this notification occurs reasonably promptly.
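
A minimal sketch of this connection pass is shown below. The function names, the dictionary-based bookkeeping, and the connect timeout are assumptions made for illustration; the heartbeat messages and the "up server" processing that would follow a successful connect are not shown.

```python
import socket


def connect_to_server(addr, port, timeout=2.0):
    """Try to connect; return a socket with SO_KEEPALIVE set, or None."""
    try:
        sock = socket.create_connection((addr, port), timeout=timeout)
    except OSError:
        return None                  # server stays "down"; retry next pass
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    return sock


def connect_missing(servers, connections, port):
    """One main-loop pass: contact every known server lacking a connection.

    servers:     host ID -> address (hypothetical bookkeeping)
    connections: host ID -> connected socket, updated in place
    """
    for host_id, addr in servers.items():
        if host_id not in connections:
            sock = connect_to_server(addr, port)
            if sock is not None:
                connections[host_id] = sock
```

A blocking connect with a short timeout is used here for simplicity; a real daemon would more likely use non-blocking connects so that one unreachable server does not stall the main loop.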

Down Server Processing

When a TCP connection with a server is broken, "down server" processing is performed. This processing includes the following:

Up Server Processing

When a new TCP connection with a server is established, "up server" processing is performed. This processing includes the following:

Restore Replication Processing

The CSD maintains a re-replication list, consisting of pages currently known to have an inadequate degree of replication. The degree of replication of a page is decreased when the connection to one of the servers storing the page is broken. When this occurs, the data on that server is regarded as lost, and the page is added to the re-replication list.

The re-replication list is processed as a separate task within the CSD. As each page in the list is treated, an ioctl() is first issued to request the CSKM to page in the page, and then once the page has arrived, an ioctl() is issued to request the CSKM to page it out again. As a result, the page ends up stored by a new replica group having the full desired degree of replication.
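
The bookkeeping described in this section might look like the sketch below. The class and method names are hypothetical, and the page_in and page_out callbacks are stand-ins for the ioctl() requests to the CSKM, whose actual interface is not specified here. The sketch assumes the pageout request reports the new replica group.

```python
from collections import deque


class ReplicationTracker:
    def __init__(self, desired_degree):
        self.desired_degree = desired_degree
        self.locations = {}            # page ID -> set of server host IDs
        self.rereplication = deque()   # pages below the desired degree
        self._queued = set()           # avoids queueing a page twice

    def server_down(self, host_id):
        """Treat data on host_id as lost and queue under-replicated pages."""
        for page, servers in self.locations.items():
            if host_id in servers:
                servers.discard(host_id)
                if (len(servers) < self.desired_degree
                        and page not in self._queued):
                    self._queued.add(page)
                    self.rereplication.append(page)

    def process_one(self, page_in, page_out):
        """Re-replicate one page: page it in, then page it out again."""
        if not self.rereplication:
            return False
        page = self.rereplication.popleft()
        self._queued.discard(page)
        page_in(page)                              # stand-in for CSKM ioctl()
        self.locations[page] = set(page_out(page)) # new replica group
        return True
```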

The re-replication task is subject to a throttle, which keeps re-replication processing from saturating the system and preventing useful work from taking place.
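
The throttling mechanism is not specified in this document. One common choice is a token bucket, sketched below purely as an assumption: each re-replicated page consumes one token, and tokens are refilled at a fixed rate up to a burst limit. The class name and parameters are hypothetical.

```python
import time


class Throttle:
    def __init__(self, rate, burst, clock=time.monotonic):
        self.rate = rate               # tokens (pages) added per second
        self.burst = burst             # maximum number of stored tokens
        self.clock = clock             # injectable for testing
        self.tokens = float(burst)
        self.last = clock()

    def try_acquire(self):
        """Return True if one more page may be re-replicated now."""
        now = self.clock()
        self.tokens = min(self.burst,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```

The re-replication task would call try_acquire() before treating each page and wait briefly when it returns False, leaving the remaining capacity for ordinary paging traffic.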

Administrative Commands

The CSD maintains a socket on which it accepts TCP connections from system administrators. A simple interactive command language is supported over these connections. This permits system administrators, and perhaps system administration scripts, to contact the CSD in order to change system parameters or get status information.

When a connection is made to the administrative socket, the CSD responds by issuing a single greeting line that identifies it, followed by a prompt. The user then issues a one-line command, after which the CSD issues a response of zero or more lines, followed by a prompt. Commands supported are:
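
The greeting/prompt/response exchange can be sketched as a loop over file-like objects. Everything concrete in this sketch is an assumption: the greeting text, the prompt string, the handling of unknown commands, and the handler table are hypothetical, and no actual CSD command names are implied.

```python
import io

GREETING = "NMS client-side daemon"   # hypothetical greeting text
PROMPT = "> "                          # hypothetical prompt string


def serve_admin_session(rfile, wfile, commands):
    """Run one administrative session.

    commands maps a command word to a handler that takes the remaining
    words and returns a list of zero or more response lines.
    """
    wfile.write(GREETING + "\n")
    wfile.write(PROMPT)
    wfile.flush()
    for line in rfile:                 # one command per line
        words = line.split()
        if words:
            handler = commands.get(words[0])
            if handler is None:
                wfile.write(f"unknown command: {words[0]}\n")
            else:
                for out in handler(words[1:]):
                    wfile.write(out + "\n")
        wfile.write(PROMPT)            # prompt follows every response
        wfile.flush()
```

For an accepted TCP connection conn, the two file objects could be obtained with conn.makefile("r") and conn.makefile("w").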


Last modified: Tue Jul 23 09:16:07 EDT 2002