Client-side kernel module (CSKM)

Ashish Raniwala & Ganesh Venkitachalam

Overview
This module is responsible for most aspects of NMS operation on the client side. It implements the block and character pseudo-device driver interfaces, handles client page faults that require NMS service, and handles page-out requests originating from the kernel swap daemon. It also generates the event streams consumed by the CSD and APDs, and it manages a small cache of data pages recently pushed by servers in response to prepaging requests from APDs.

    The module is implemented entirely as a pluggable device driver, because it is required to be portable across different operating systems.

Interfaces & Functionality Specifications

This module implements the following interfaces:

In addition, this module will contain the following internal functionality:


Issues

  • Mmap & Kswapd. The kswapd does not pass the wait parameter to the dev_swapout function in the device driver. So, to sleep or not to sleep, that is the question. If we always sleep, it might be bad for system performance. If we always return, kswapd might free up the page before it is safely stored on some server.
  • Initialization & configuration

          The CSKM driver exports a file_operations struct to the kernel when it registers with the kernel.
    The block device is registered with register_blkdev() and the char device with register_chrdev(). After registration, the driver sets blk_dev[NMS_MAJOR].request_fn = nms_request_fn; the kernel queues requests in the request queue and calls this function. Note that the char device does not have a request function (we need only open and ioctls anyway). Refer to cskm.h for the actual char device operations.

            static struct file_operations remotemem_fops = {
                 NULL,                    /* lseek */
                 block_read,              /* read - general block-dev read */
                 block_write,             /* write - general block-dev write */
                 NULL,                    /* readdir */
                 NULL,                    /* select */
                 NULL,                    /* ioctl */
                 remotemem_mmap,          /* mmap */
                 remotemem_open,          /* open */
                 NULL,                    /* release */
                 NULL,                    /* fsync */
                 NULL,                    /* fasync */
                 NULL,                    /* check_media_change */
                 NULL                     /* revalidate */
            };
     

       Virtual Unit Interface to the CSD, client applications, and active processing daemons.

       When a process does an mmap() on the device, we install the following  vm_operations_struct in the corresponding vm_area_struct.

        static struct vm_operations_struct remotemem_vmops = {
                 NULL,                    /* no special open */
                 NULL,                    /* no special close */
                 remotemem_unmap,         /* unmap - we need to sync/explicitly free the pages */
                 NULL,                    /* no special protect */
                 NULL,                    /* sync */
                 NULL,                    /* advise */
                 remotemem_nopage,        /* nopage, do we need this?? */
                 NULL,                    /* wppage, we definitely won't need this :-) */
                 remotemem_swapout,       /* swapout */
                 remotemem_swapin,        /* swapin */
        };
     

      CSKM event stream interface to the APD/CSD.

    The client starts operation as follows. The NMS master daemon opens the char device associated with remote memory, with minor number zero. At this time, the CSKM initialises itself. If the CSKM is built as a loadable kernel module, an insmod must be done before this step. The CSKM also registers a function with the HSN driver. This function is called at interrupt level on the arrival of each packet.

    Each process that wants to use remote memory loops through the char devices in /dev, trying to find a free device. Once it finds a free char device, it opens that device. Then it does an ioctl on the char device and specifies the size of the backing store that it wants. At this time, the virtual_unit structure in the kernel (one is associated with each minor number) is initialised with the size of the backing store, and a replica group bitmap large enough to hold that many entries is allocated.
    Now the process can do an mmap on the associated block device. If the device is to be used as a swap device, a swapon is done on this device instead of an mmap.
     

    Data Structures  (may want to keep cskm.h along-side)

    Prepaging-Cache
        The cache temporarily holds prepaged pages. It implements a simple FIFO replacement policy.
        It supports three operations:
               cskm_cache_insert(int freeable, phys_page_num, page_id) - This operation puts a page into the cache. The page is associated with the key 'page_id'. If there is not enough space available in the cache, this operation evicts the oldest page in the cache. A page that is not marked as freeable is not considered for replacement, i.e. it stays in the cache until it is explicitly freed.
               cskm_cache_lookup(page_id) - This operation gets a page which was earlier put in the cache. It returns the physical page number. It also deletes the page from the cache if the page is a "deletable" (i.e. freeable) page.
               cskm_cache_delete(int page_id) - This operation frees the page indicated by page_id.

        The organization of this cache should be similar to the buffer cache, i.e. using both a hash table (for fast lookups) and a linked list (for the FIFO policy). Note that the pages in the cache are not fixed; only the maximum number of pages in the cache is fixed. So, if somebody tries to insert more than MAX_CSKM_CACHE_PAGES pages, the cache itself frees the oldest entry. The cache insert function is called from interrupt level by the per-packet-arrival processing function.

        The cache also holds the pages fetched from a server for re-replication. That is why we need pages which "stay" in the cache, and the explicit free function in the cache. Maintaining coherence is easier if we have only the cache, and not a separate replication pool of pages.
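The three cache operations can be sketched in user-space C as follows. This is a minimal sketch under stated assumptions, not the CSKM code: the 64-bit page_id type, the CSKM_NO_PAGE return value, the tiny capacity, and the use of malloc() are all illustrative choices (a kernel version called from interrupt level would need a preallocated pool and locking). Re-inserting a page_id that is already cached is not handled.

```c
#include <stdint.h>
#include <stdlib.h>

#define MAX_CSKM_CACHE_PAGES 4     /* tiny capacity, for illustration only */
#define CSKM_HASH_BUCKETS    8
#define CSKM_NO_PAGE         (-1L) /* hypothetical "not found" value */

struct cache_entry {
    uint64_t page_id;
    long phys_page_num;
    int freeable;
    struct cache_entry *hash_next;              /* chain within one bucket */
    struct cache_entry *fifo_prev, *fifo_next;  /* FIFO list, oldest first */
};

static struct cache_entry *buckets[CSKM_HASH_BUCKETS];
static struct cache_entry *fifo_head, *fifo_tail;
static int cache_count;

static unsigned bucket_of(uint64_t page_id)
{
    return (unsigned)(page_id % CSKM_HASH_BUCKETS);
}

static struct cache_entry *find_entry(uint64_t page_id)
{
    struct cache_entry *e = buckets[bucket_of(page_id)];
    while (e && e->page_id != page_id)
        e = e->hash_next;
    return e;
}

/* Remove an entry from both the hash chain and the FIFO list. */
static void remove_entry(struct cache_entry *e)
{
    struct cache_entry **pp = &buckets[bucket_of(e->page_id)];
    while (*pp != e)
        pp = &(*pp)->hash_next;
    *pp = e->hash_next;
    if (e->fifo_prev) e->fifo_prev->fifo_next = e->fifo_next;
    else fifo_head = e->fifo_next;
    if (e->fifo_next) e->fifo_next->fifo_prev = e->fifo_prev;
    else fifo_tail = e->fifo_prev;
    cache_count--;
    free(e);
}

/* Returns 0 on success, -1 if no memory or every cached page is pinned. */
int cskm_cache_insert(int freeable, long phys_page_num, uint64_t page_id)
{
    if (cache_count >= MAX_CSKM_CACHE_PAGES) {
        struct cache_entry *victim = fifo_head;   /* oldest freeable page */
        while (victim && !victim->freeable)
            victim = victim->fifo_next;
        if (!victim)
            return -1;
        remove_entry(victim);
    }
    struct cache_entry *e = malloc(sizeof *e);
    if (!e)
        return -1;
    e->page_id = page_id;
    e->phys_page_num = phys_page_num;
    e->freeable = freeable;
    e->hash_next = buckets[bucket_of(page_id)];
    buckets[bucket_of(page_id)] = e;
    e->fifo_next = NULL;
    e->fifo_prev = fifo_tail;
    if (fifo_tail) fifo_tail->fifo_next = e;
    else fifo_head = e;
    fifo_tail = e;
    cache_count++;
    return 0;
}

/* Returns the physical page number; a freeable page leaves the cache. */
long cskm_cache_lookup(uint64_t page_id)
{
    struct cache_entry *e = find_entry(page_id);
    if (!e)
        return CSKM_NO_PAGE;
    long phys = e->phys_page_num;
    if (e->freeable)
        remove_entry(e);
    return phys;
}

void cskm_cache_delete(uint64_t page_id)
{
    struct cache_entry *e = find_entry(page_id);
    if (e)
        remove_entry(e);
}
```

In this sketch a page inserted with freeable set to 0 (for example, one fetched for re-replication) survives both eviction and lookup, and only cskm_cache_delete() removes it.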

    PageId
        Pages on the NMS system are identified using globally valid, 8-byte (64-bit) addresses.
            (1) The first two bytes of the address will identify an NMS server system,
            (2) the next 2 bytes will be the virtual unit which this page falls under, and
            (3) the remaining 4 bytes will be the offset into the virtual unit of a chunk of data whose size is equal to the native page-size of the client machine owning the page. Also, the page should be page-aligned in the virtual unit.
         All these semantics are only used by the clients. Servers do not know the details of the page-id.
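A minimal sketch of how such a page-id might be packed and unpacked in C. The helper names are hypothetical, and two details are assumptions rather than statements from this design: the "first" two bytes are taken to be the most significant bits of the 64-bit value, and the 4-byte offset is treated as a byte offset that must be a multiple of the client page size.

```c
#include <stdint.h>

typedef uint64_t nms_page_id;   /* hypothetical type name */

/* Layout assumed here: [system id:16][virtual unit:16][offset:32]. */
static inline nms_page_id nms_page_id_make(uint16_t system_id,
                                           uint16_t vunit,
                                           uint32_t offset)
{
    return ((nms_page_id)system_id << 48) |
           ((nms_page_id)vunit << 32) |
           (nms_page_id)offset;
}

static inline uint16_t nms_page_id_system(nms_page_id id)
{
    return (uint16_t)(id >> 48);
}

static inline uint16_t nms_page_id_vunit(nms_page_id id)
{
    return (uint16_t)(id >> 32);
}

static inline uint32_t nms_page_id_offset(nms_page_id id)
{
    return (uint32_t)id;
}

/* The page must be page-aligned within its virtual unit. */
static inline int nms_offset_is_page_aligned(uint32_t offset,
                                             uint32_t page_size)
{
    return page_size != 0 && offset % page_size == 0;
}
```

Because servers treat the page-id as opaque, only the client needs helpers like these.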

    EventQ
       Events to be notified to the user-level daemons (CSD/SSD/APD) are stored in a ring-buffer (implemented as a list) in the kernel. There is a buffer which maintains all the events which at least one user-level entity is looking at. A pointer is maintained for each such user-level entity. These pointers denote the events yet to be read by the corresponding user-level daemon.

        The C-structures look like
            struct EventQ {
                int event_type;
                unsigned long time_stamp;
                unsigned int length;   /* length of the buffer actually used */
                unsigned int vm_unit;
                unsigned char buffer[MAX_BUF];
                struct EventQ *next;
            };

        The buffer field is decoded based on the event_type and length.  It contains additional event-dependent information about the event. A comprehensive list of event-types is placed here.

    All the controlling ioctls for a block device are actually issued to the corresponding char device. The char device should not be closed before the block device is closed. All config info and event info is communicated between the CSKM and some user process (like the CSD) via ioctls. The device on which an ioctl is done decides whether the info pertains to the whole system or just to one virtual unit. For example, to notify the kernel that a particular server is down, an ioctl is issued on the NMS master device (minor number zero). To get the size of a particular virtual unit, an ioctl is done on that virtual unit's minor number.

    Any process that wants to listen to events does an ioctl on the minor char device associated with a virtual unit, and passes the kernel a list of the events it is interested in. The kernel creates a structure for this process which contains the events it is interested in, and its current_evt_pointer into the buffer. The process picks up the events with ioctls. Note that there is one ring-buffer per virtual unit. The events are noted below, with the device to which the event will be delivered. Master device indicates events pertaining to the client system as a whole. Virtual unit indicates events pertaining to only one particular virtual unit.
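The per-listener read pointer can be sketched as follows. This is only an illustration under stated assumptions: it uses a fixed array ring instead of the list mentioned above, it assumes event types are small integers (0 to 31) so that a listener's interest list fits in a bit mask, and the names event_ring, listener, evq_post, evq_listen and evq_next are hypothetical. In the real module these operations would sit behind ioctls.

```c
#include <stdint.h>

#define EVQ_SLOTS 8     /* ring capacity, for illustration only */
#define MAX_BUF   16

struct event {
    int event_type;
    unsigned long time_stamp;
    unsigned int length;            /* bytes of buffer actually used */
    unsigned int vm_unit;
    unsigned char buffer[MAX_BUF];
};

struct event_ring {
    struct event slot[EVQ_SLOTS];
    unsigned long head;             /* total number of events posted so far */
};

struct listener {
    uint32_t interest_mask;         /* bit n set: wants event type n */
    unsigned long next;             /* sequence number of next event to read */
};

void evq_post(struct event_ring *r, const struct event *ev)
{
    r->slot[r->head % EVQ_SLOTS] = *ev;   /* overwrites the oldest slot */
    r->head++;
}

/* A new listener only sees events posted after it registered. */
void evq_listen(const struct event_ring *r, struct listener *l, uint32_t mask)
{
    l->interest_mask = mask;
    l->next = r->head;
}

/* Copies the next event of interest to *out and returns 1, or returns 0. */
int evq_next(const struct event_ring *r, struct listener *l, struct event *out)
{
    if (r->head - l->next > EVQ_SLOTS)    /* listener fell behind: skip lost events */
        l->next = r->head - EVQ_SLOTS;
    while (l->next != r->head) {
        const struct event *ev = &r->slot[l->next % EVQ_SLOTS];
        l->next++;
        if (ev->event_type >= 0 && ev->event_type < 32 &&
            (l->interest_mask & (UINT32_C(1) << ev->event_type))) {
            *out = *ev;
            return 1;
        }
    }
    return 0;
}
```

Each listener advances independently, so a slow daemon does not hold back a fast one; the cost in this sketch is that a listener that falls more than EVQ_SLOTS events behind silently loses the oldest ones.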

    Server-Info-Table
        This gives information on all the NMS servers currently known to this client. The information includes:
            Status: up, down, non-responsive
            16-bit host ID
            IP address
            Myrinet address
            Current load information
        Note that a similar table is maintained in the CSD. We need a copy here because we need this information for communicating with the servers. The load information in the kernel copy is more current than that in the CSD copy of the table; other information is more up to date in the CSD copy.
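One possible shape for a table entry, as a hedged sketch. The field widths other than the 16-bit host ID are assumptions (an IPv4 address and a 6-byte Myrinet address), the load is taken to be a buffer count, and the function names and MAX_SERVERS limit are hypothetical.

```c
#include <stdint.h>
#include <stddef.h>

enum server_status { SERVER_UP, SERVER_DOWN, SERVER_NON_RESPONSIVE };

struct server_info {
    enum server_status status;
    uint16_t host_id;           /* 16-bit host ID, as listed above */
    uint32_t ip_addr;           /* assumed IPv4 */
    uint8_t  myrinet_addr[6];   /* assumed 6-byte address */
    uint32_t load;              /* e.g. number of buffers in use */
};

#define MAX_SERVERS 32          /* hypothetical limit */
static struct server_info server_table[MAX_SERVERS];
static size_t server_count;

struct server_info *server_find(uint16_t host_id)
{
    for (size_t i = 0; i < server_count; i++)
        if (server_table[i].host_id == host_id)
            return &server_table[i];
    return NULL;
}

/* Returns the (possibly pre-existing) entry, or NULL if the table is full. */
struct server_info *server_add(uint16_t host_id, uint32_t ip_addr)
{
    struct server_info *s = server_find(host_id);
    if (!s) {
        if (server_count == MAX_SERVERS)
            return NULL;
        s = &server_table[server_count++];
        s->host_id = host_id;
        s->load = 0;
    }
    s->status = SERVER_UP;
    s->ip_addr = ip_addr;
    return s;
}

/* Record a load sample, e.g. one piggybacked on a page-out ACK. */
int server_update_load(uint16_t host_id, uint32_t load)
{
    struct server_info *s = server_find(host_id);
    if (!s)
        return -1;
    s->load = load;
    return 0;
}
```

A linear scan is enough for a table this small; the kernel copy would be updated on the ACK path, which is why its load field is the fresher one.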

    Server-Banks
        The concept of server-banks is used to reduce the number of bits required in the PLT, the page location table. For storing a page, the group of servers is always chosen from a single bank; it is never split across banks. We allow 2^8 server-banks, thus consuming 8 bits to represent a bank-id.
        (** Note that we DO NOT have a priority queue mechanism for maintaining the servers with the least load. The reason is that the randomization scheme cannot work with priority queues. That scheme has to figure out the least loaded servers from a set of randomly chosen servers each time. The scheme does not choose random servers from a set of least loaded servers. **)

    Virtual_Unit_Structure
    struct  remotemem_virtual_unit{
            int major;          /* Major number of driver, not needed? */
            int minor;         /* Actual virtual unit number*/
            int pid;           /* pid of owner*/
            int num_pages;     /* Size of memory (in pages) backed by this guy */

            /*Note: Only size is needed. For mmap, we get the virtual address
            from the vm_area_struct*/

            short int *replica_grp_bmap;
    };

    Page-Location-Table (PLT)
        This table stores the location of each swapped-out page. Each entry is 16 bits long, consisting of an 8-bit bank-id followed by an 8-bit bitmap of the actual servers in that bank storing copies of the page. This is the replica_grp_bmap field in the structure above.
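A PLT entry can be manipulated with a few bit operations. The sketch below assumes the bank-id occupies the high byte and the server bitmap the low byte, with bit i standing for the i-th server of the bank; the helper names are hypothetical. An 8-bit bitmap implies at most 8 addressable servers per bank.

```c
#include <stdint.h>

typedef uint16_t plt_entry;     /* one element of replica_grp_bmap */

static inline plt_entry plt_make(uint8_t bank_id, uint8_t server_bitmap)
{
    return (plt_entry)(((unsigned)bank_id << 8) | server_bitmap);
}

static inline uint8_t plt_bank(plt_entry e)
{
    return (uint8_t)(e >> 8);
}

static inline uint8_t plt_servers(plt_entry e)
{
    return (uint8_t)(e & 0xFFu);
}

/* Does server number idx (0..7) of the bank hold a copy of the page? */
static inline int plt_has_server(plt_entry e, unsigned idx)
{
    return idx < 8 && ((e >> idx) & 1u) != 0;
}

/* Number of replicas recorded for the page. */
static inline int plt_replica_count(plt_entry e)
{
    int n = 0;
    for (unsigned bits = plt_servers(e); bits != 0; bits &= bits - 1)
        n++;
    return n;
}
```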

    Message Structure
        This page describes the common structure of the messages exchanged between the CSKM and the SSKM. The page is located here.

    Design

    Page-out operation

        The page-out operation is initiated by the kswapd. As discussed in the earlier sections, there can be two cases; let us call them the mmap case and the swap-device case.

        The kswapd could be doing synchronous or asynchronous i/o. In the mmap case, we do not get this information, so we always do synchronous i/o. In the swap-device case, however, we do get this information, and we use it to decide how to end our processing.

        It should be kept in mind that page-outs are not time-critical operations.

        The major work involved in both cases is the same. The following are the major steps which have to be performed for a page-out operation.

    1.    Construct a page-id for the page.
    2.    Choose a set of servers as the server-replica-group for page-out.
    3.    Multicast the page-out request to these servers (message-structure)
    4.    Set a timeout.
    5.    Mark the PTE as not-present.
    6.    Depending on the model, either return or sleep (sleep in the mmap model and return in the swap-device model.)
    7.    Collect ACKs from the servers.
    8.    Take care of timeouts (in case some server did not acknowledge) by choosing a new server and sending it the page.
    9.    After collecting all the ACKs, free up the page and wakeup the kswapd in the mmap model
    10.    or call the our_end_request() function in the swap-device model (this is the equivalent of ide_end_request).
    11.    Record the 16-bit replica-group-id in the PLT and delete the timer.
        The following discussion relates to the steps of the above algorithm.

    Choosing the server-replica-group for page-out.

        The job of choosing the set of servers is performed using an adaptive scheme in which clients track current server loads and combine randomization with picking the servers with minimum load to select the servers that receive a page-out. Using randomization, the CSKM chooses a set of random servers from a single server bank (the server bank itself could be chosen randomly on each page-out operation). To support k-degree replication, the CSKM chooses the k+x least loaded servers out of the randomly chosen servers. Of these, k servers are used as the replica group. The remaining x servers are the auxiliary servers, to be used should one of the replica-group servers fail.
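The selection step can be sketched in user-space C. This is an illustration under assumptions, not the CSKM routine: the sample is drawn with a partial Fisher-Yates shuffle using rand() (which is not available in a kernel and has a slight modulo bias), the k least loaded of the k+x picks are taken to be the replica group, and the names bank_server and choose_replica_group are hypothetical.

```c
#include <stdlib.h>

#define MAX_BANK_SERVERS 8   /* matches the 8-bit server bitmap of a PLT entry */

struct bank_server {
    int host_id;
    int load;                /* e.g. buffers currently in use */
};

/*
 * Draw sample_size distinct random servers from the bank, then keep the
 * k + x least loaded of them. On success, chosen[0..k-1] are the indices of
 * the replica group, chosen[k..k+x-1] are the auxiliary servers, and the
 * function returns k + x. Returns -1 on invalid arguments.
 */
int choose_replica_group(const struct bank_server *bank, int bank_size,
                         int sample_size, int k, int x, int *chosen)
{
    int idx[MAX_BANK_SERVERS];
    int want = k + x;

    if (bank_size < 1 || bank_size > MAX_BANK_SERVERS || k < 1 || x < 0 ||
        sample_size > bank_size || want > sample_size)
        return -1;

    for (int i = 0; i < bank_size; i++)
        idx[i] = i;

    /* Partial Fisher-Yates shuffle: idx[0..sample_size-1] is the sample. */
    for (int i = 0; i < sample_size; i++) {
        int j = i + rand() % (bank_size - i);
        int tmp = idx[i]; idx[i] = idx[j]; idx[j] = tmp;
    }

    /* Partial selection sort: move the `want` least loaded to the front. */
    for (int i = 0; i < want; i++) {
        int best = i;
        for (int j = i + 1; j < sample_size; j++)
            if (bank[idx[j]].load < bank[idx[best]].load)
                best = j;
        int tmp = idx[i]; idx[i] = idx[best]; idx[best] = tmp;
        chosen[i] = idx[i];
    }
    return want;
}
```

Note that the random draw happens first and the load comparison second, which is why a priority queue of globally least loaded servers is not needed.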

           According to Michael Bender, recent theoretical work has shown good bounds on the balance achievable with this type of scheme. With good balance, we just need to equip servers with sufficient buffers to handle short-term bursts in load. We can establish the required number of buffers empirically. The measurement of server load can then be as simple as sampling the number of free buffers (or perhaps better, the number of buffers currently in use). Load measurements are piggybacked on messages sent to acknowledge receipt of page-out data, or can be triggered by timeouts in periods of system inactivity.

        This brings us to another interesting issue. It seems that there should be a concept of primary server and secondary servers. A primary server should try to keep the page in the main-memory and the secondary servers can swap the page onto the disk. This will help in utilising the main-memory space efficiently. A simple implementation of this would be to mark the secondary copies as such. On the server-side, these pages will be swapped out to the disk first whenever there is a choice.
     

        On receiving the acknowledgements, HSN calls the function IF_ACKNOWLEDGEMENT().

    IF_ACKNOWLEDGEMENT()
        { Mark the DONE-BIT for this server as 1.
           If all the servers are done, then follow steps 9, 10 and 11.
            else return
       }

    IF_TIMEOUT()
        { /** This condition depicts the scenario in which we did not receive ACKs from all the servers.
            The servers from whom we did not receive ACKs may or may not have successfully stored the page. **/
            To be on the safer side, send (by queuing in the network queue) a Page-free request to each of these servers.
           /** It should be noted that this eliminates the need of reliable messaging for the page-out operation. **/
            Next, choose servers from the set of auxiliary servers (to maintain the k-replication invariant) for this virtual unit.
           /** Again, we have the possibility of at-least-k-degree-replication or exactly-k-degree replication for the pages
            This option should be made available to the process using this virtual unit **/
            Repeat all the steps for this server.
        }
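The bookkeeping behind IF_ACKNOWLEDGEMENT() and IF_TIMEOUT() can be sketched with per-server bitmaps. This is a sketch under assumptions: servers are numbered 0 to 7 within the bank, the struct and function names are hypothetical, and actual message transmission is left to the caller, which sends page-free requests to the servers in free_queue and re-sends the page to the servers in the returned bitmap.

```c
#include <stdint.h>

struct pending_pageout {
    uint8_t bank_id;
    uint8_t targets;     /* servers currently expected to store the page */
    uint8_t done;        /* DONE bits: servers that have acknowledged */
    uint8_t aux;         /* auxiliary servers not yet used */
    uint8_t free_queue;  /* servers that must be sent a Page-free request */
};

/* Returns 1 once every target server has acknowledged, else 0. */
int pageout_ack(struct pending_pageout *p, unsigned server)
{
    if (server >= 8)
        return 0;
    uint8_t bit = (uint8_t)(1u << server);
    if (!(p->targets & bit))
        return 0;                    /* stale ACK from a replaced server */
    p->done |= bit;
    return p->done == p->targets;
}

/* Replaces silent servers by auxiliaries; returns the servers to re-send to. */
uint8_t pageout_timeout(struct pending_pageout *p)
{
    uint8_t missing = (uint8_t)(p->targets & ~p->done);
    uint8_t resend = 0;

    p->free_queue |= missing;        /* they may or may not hold the page */
    p->targets = (uint8_t)(p->targets & ~missing);

    for (unsigned s = 0; s < 8 && missing; s++) {
        uint8_t bit = (uint8_t)(1u << s);
        if (p->aux & bit) {
            p->aux = (uint8_t)(p->aux & ~bit);
            p->targets |= bit;
            resend |= bit;
            missing = (uint8_t)(missing & (missing - 1)); /* one replica replaced */
        }
    }
    return resend;
}
```

When pageout_ack() returns 1, the 16-bit value to record in the PLT is the bank-id combined with the final targets bitmap. If the auxiliary set runs out, this sketch simply ends up with fewer than k targets; how to handle that case is not specified here.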
     

    Page-in operation

        The page-in operation is initiated by the page-fault handler. As with page-outs, there can be two cases: the mmap case and the swap-device case.

        Page-ins are time-critical operations.

    The following are the major steps involved in a page-in operation-

    1. Lookup the prepaging-cache.
    2. If (found in cache),
    3.     Insert the page-address in the PTE of the process in the mmap case.
    4.     Or copy the page in the swap-device case.
    5.     Free up the copies of this page on the servers (info obtained using PLT) by sending them the Page-free requests.
    6.     Return freeing up the cache entry.
    7. Else /* if not found in cache */
    8.     Use PLT to obtain a list of servers who are storing this page.
    9.     Pick up the primary server (or one of the servers, but keeping in mind the load balancing issues)
    10.     Form a page-in request (message-structure)
    11.     CRITICAL-SECTION BEGINS  {
    12.     Lookup the prepaging-cache again.
    13.     If (found) abandon the processing, go back to step 1
    14.     If (not found) Send the request to the server.
    15.     Set a timeout
    16.     Enter the pair(pid,pageid) in the waiting_for_table
    17.     }  CRITICAL-SECTION ENDS
    18.     If mmap case sleep. If swap-device case return.  /** There could be some problem here **/
    pagein_ISR()
    {
        If (page_received)
        {
            Lookup in the waiting_for_table.
            If (not found)   /*** This indicates that the request has already been processed by prepaging after the request was sent ***/
                Discard the page and goto END_PAGEIN.
            else
            {
                In the mmap case,
                    Insert the page address into the page table.
                    goto END_PAGEIN.
                In the swap-device case,
                    Copy the page into the page which the kernel allocated for this purpose.
                    goto END_PAGEIN.
            }

        LABEL END_PAGEIN :
            Queue up the Page-free requests to free up the copies of the page on the servers.
            Wakeup the process which was sleeping by unlocking the page.
            Delete the timer.
        }
        else   /**** ISR indicates some error during pagein, e.g. server_load_too_high, server_could_not_process_the_request ***/
        {
            Delete the timer.
            Do the same as the TIMEOUT() routine.
        }
    } /** of ISR **/
     

    TIMEOUT()
    {  Figure out the next server to try out.
        Repeat the above step for this newly picked server.
    }

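       Note that this algorithm does not require a reliable messaging layer underneath it: a lost request or reply is handled by the timeout, and a late or duplicate page is discarded because its waiting_for_table entry is already gone.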
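The waiting_for_table is what makes duplicate or late pages harmless: whichever arrival comes first claims the entry, and any later one finds nothing and is discarded. A minimal user-space sketch follows, assuming a small fixed-size table, one waiter per page, and hypothetical function names. Locking is omitted; in the kernel the register step would run inside the critical section of steps 11 to 17.

```c
#include <stdint.h>

#define WAIT_SLOTS 16      /* hypothetical table size */
#define NO_WAITER  (-1)

struct wait_entry {
    int in_use;
    int pid;
    uint64_t page_id;
};

static struct wait_entry waiting_for_table[WAIT_SLOTS];

/* Step 16: record that pid is waiting for page_id. Returns 0, or -1 if full. */
int wait_register(int pid, uint64_t page_id)
{
    for (int i = 0; i < WAIT_SLOTS; i++) {
        if (!waiting_for_table[i].in_use) {
            waiting_for_table[i].in_use = 1;
            waiting_for_table[i].pid = pid;
            waiting_for_table[i].page_id = page_id;
            return 0;
        }
    }
    return -1;
}

/*
 * Called when a page arrives. Returns the pid to wake up and removes the
 * entry, or NO_WAITER if nobody is waiting, in which case a reply to a
 * page-in request is simply discarded.
 */
int wait_claim(uint64_t page_id)
{
    for (int i = 0; i < WAIT_SLOTS; i++) {
        if (waiting_for_table[i].in_use &&
            waiting_for_table[i].page_id == page_id) {
            waiting_for_table[i].in_use = 0;
            return waiting_for_table[i].pid;
        }
    }
    return NO_WAITER;
}
```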

        Explicit free requests are not performance-critical, either in terms of latency or bandwidth. They can therefore be batched by a client, and forwarded periodically to servers over the conventional network link. It should however be noted that explicit free requests have to be reliably transmitted in order to avoid storage leaks on the NMS servers. Explicit free requests are queued in the request queue which is maintained between the CSKM and the CSD. It is the CSD's responsibility to reliably send these requests to respective servers.
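Batching of explicit free requests might look like the sketch below. The queue layout and function names are assumptions made for illustration. The sketch covers only the queueing and per-server batching; reliable delivery (keeping a batch until the server confirms it) remains the CSD's job and is not shown.

```c
#include <stdint.h>
#include <stddef.h>

#define FREE_QUEUE_MAX 64   /* hypothetical queue capacity */

struct free_req {
    uint16_t host_id;       /* server holding a copy of the page */
    uint64_t page_id;
};

struct free_queue {
    struct free_req req[FREE_QUEUE_MAX];
    size_t count;
};

/* CSKM side: queue one Page-free request. Returns 0, or -1 if full. */
int free_queue_push(struct free_queue *q, uint16_t host_id, uint64_t page_id)
{
    if (q->count == FREE_QUEUE_MAX)
        return -1;
    q->req[q->count].host_id = host_id;
    q->req[q->count].page_id = page_id;
    q->count++;
    return 0;
}

/*
 * CSD side: move up to max queued requests for one server into batch[].
 * Requests for other servers stay queued in their original order.
 * Returns the number of page-ids placed in batch[].
 */
size_t free_queue_take_batch(struct free_queue *q, uint16_t host_id,
                             uint64_t *batch, size_t max)
{
    size_t taken = 0, kept = 0;
    for (size_t i = 0; i < q->count; i++) {
        if (q->req[i].host_id == host_id && taken < max)
            batch[taken++] = q->req[i].page_id;
        else
            q->req[kept++] = q->req[i];
    }
    q->count = kept;
    return taken;
}
```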

    Prepaging operation

       Prepaging (or prefetching) is done to reduce the latency incurred during page-ins. The APDs try to predict the pages on which an application is going to fault, and prepage those pages into client memory. The client side of the prepaging operation is implemented as an ISR. It has to take care of the race conditions discussed under page-in.

        The following are the main steps of the routine for handling prepaged pages:
    cache_ISR()
    {
        If (page is already present in the client)
            Ignore the page.
        else if (someone is waiting for the page)
            /** This means that a request for this page has been sent to the NMS server. We should behave as if we got the reply to that request. Using the waiting_for_table, it is always possible to match up the requesting process. **/
        {
            Do the same processing as in the pagein_ISR().
        }
        else
            Put the page in the prepaging-cache.
    }
         The page-ins should carry priorities, so that the prepaging requests generated by active processing receive less favorable treatment by servers than requests on which a client application is currently blocked. The mechanism for this is simple: just have two priority levels, one for prepaging requests and the other for non-prepaging requests.
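A two-level scheme of this kind could be realised on the server as two FIFO queues, with the demand queue always served first. The following is a minimal sketch with hypothetical names and sizes; it uses strict priority, so a steady stream of demand requests can starve prepaging requests, which may or may not be acceptable.

```c
#include <stdint.h>
#include <stddef.h>

enum pagein_prio { PRIO_DEMAND = 0, PRIO_PREPAGE = 1, PRIO_LEVELS = 2 };

#define PRIO_QUEUE_LEN 16   /* hypothetical per-level capacity */

struct prio_queue {
    uint64_t page_id[PRIO_LEVELS][PRIO_QUEUE_LEN];
    size_t head[PRIO_LEVELS];
    size_t count[PRIO_LEVELS];
};

/* Queue a page-in request at the given level. Returns 0, or -1 on error. */
int prio_push(struct prio_queue *q, enum pagein_prio prio, uint64_t page_id)
{
    if ((int)prio < 0 || (int)prio >= PRIO_LEVELS ||
        q->count[prio] == PRIO_QUEUE_LEN)
        return -1;
    size_t tail = (q->head[prio] + q->count[prio]) % PRIO_QUEUE_LEN;
    q->page_id[prio][tail] = page_id;
    q->count[prio]++;
    return 0;
}

/*
 * Take the next request to serve: demand requests first, FIFO within a
 * level. Returns the priority level served, or -1 if both queues are empty.
 */
int prio_pop(struct prio_queue *q, uint64_t *page_id)
{
    for (int p = 0; p < PRIO_LEVELS; p++) {
        if (q->count[p] > 0) {
            *page_id = q->page_id[p][q->head[p]];
            q->head[p] = (q->head[p] + 1) % PRIO_QUEUE_LEN;
            q->count[p]--;
            return p;
        }
    }
    return -1;
}
```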