Overview
This module is responsible for most aspects of NMS operation on the
client side. It implements the block and character pseudo-device driver
interfaces, it has the responsibility for handling client page faults that
require NMS service, and it handles page-out requests originating from
the kernel swap daemon. It also generates the event streams consumed by
the CSD and APDs, and it manages a small cache consisting of data pages
recently pushed by servers in response to prepaging requests from APDs.
The module is implemented entirely as a pluggable device driver, since it must be portable across different operating systems.
Interfaces & Functionality Specifications
This module implements the following interfaces:
Swapping and mmap need to coexist.
Issues
A worse race is also possible. After a cskm page fault has paged in a page from the server, an APD could "prepage" the same page into the cache. (If the primary server for the page and the APD run on the same machine, it might be possible to avoid this particular race.) The net effect is that the client machine ends up with two copies of the data.
Having two copies of the data is not as bad as it sounds, and might actually be acceptable. The cskm prepage cache is small (say 128 or 256 pages) and acts as a FIFO, so any secondary copy in the cskm cache will be evicted very soon. The situation that Prof. Stark described, i.e., a pageout for a page occurring immediately after a pagein, is unlikely.
The CSKM driver exports a file_operations
struct to the kernel when it registers with the kernel.
The block device is registered with register_blkdev() and
the char device is registered with register_chrdev().
After registration, the driver sets blk_dev[NMS_MAJOR].request_fn
= nms_request_fn; the kernel queues up requests in the request queue
and calls this function. Note that the char device does not have a request
function (we need only open and ioctls anyway). Refer to cskm.h for
the actual char device operations.
static struct file_operations remotemem_fops = {
    NULL,            /* lseek */
    block_read,      /* read - general block-dev read */
    block_write,     /* write - general block-dev write */
    NULL,            /* readdir */
    NULL,            /* select */
    NULL,            /* ioctl */
    remotemem_mmap,  /* mmap */
    remotemem_open,  /* open */
    NULL,            /* release */
    NULL,            /* fsync */
    NULL,            /* fasync */
    NULL,            /* check_media_change */
    NULL             /* revalidate */
};
Virtual Unit Interface to the CSD, client applications, and active processing daemons.
When a process does an mmap() on the device, we install the following vm_operations_struct in the corresponding vm_area_struct.
static struct vm_operations_struct remotemem_vmops = {
    NULL,               /* no special open */
    NULL,               /* no special close */
    remotemem_unmap,    /* unmap - we need to sync/explicitly free the pages */
    NULL,               /* no special protect */
    NULL,               /* sync */
    NULL,               /* advise */
    remotemem_nopage,   /* nopage, do we need this?? */
    NULL,               /* wppage, we definitely won't need this :-) */
    remotemem_swapout,  /* swapout */
    remotemem_swapin,   /* swapin */
};
CSKM event stream interface to the APD/CSD.
The client starts operation in the following way: the NMS master daemon does an open on the char device associated with remote memory, with minor number zero. At this time, the cskm initialises itself. If the cskm is built as a loadable kernel module, an insmod has to be done before this step. The cskm also registers a function with the HSN driver. This function is called from interrupt level on arrival of each packet.
Each process which wants to use remote memory loops through the char
devices in /dev, trying to find a free device. Once it does find a free
char device, it opens that device. Then, it does an ioctl on the char device
and specifies the size of the backing store that it wants. At this time,
the virtual_unit structure
in the kernel (which is associated with each minor number) is initialised
with the size of the backing store, and a replica group bitmap to hold
this many entries will be allocated.
Now the process can do an mmap on the associated block device. If
the device is to be used as a swap device, a swapon would
be done on this device instead of an mmap.
Data Structures (may want to keep cskm.h along-side)
Prepaging-Cache
The cache holds prepaged pages temporarily.
It implements a simple FIFO replacement policy.
It supports three operations -
cskm_cache_insert(int
freeable, phys_page_num, page_id) - This operation puts a page into
the cache. The page is associated with the key 'page_id'. If there is not
enough space available in the cache, this operation evicts the
oldest page lying in the cache. If a page given to the cache is not marked
as freeable, it will not be considered for replacement, i.e., it will
stay in the cache unless explicitly freed.
cskm_cache_lookup(page_id)
- This operation gets a page which was earlier put in the cache. It returns
the physical page number. It also deletes the page from the cache, if the
page is a "deletable" (freeable) page.
cskm_cache_delete(int
page_id) - This frees the page indicated by page_id.
The organization of this cache should be similar to the buffer cache, i.e. using both a hash table (for fast lookups) and a linked list (for the FIFO policy). Note that the set of pages in the cache is not fixed, but the maximum number of pages in the cache is. So, if somebody tries to insert more than MAX_CSKM_CACHE_PAGES pages, the cache itself frees the oldest entry. The cache insert function is called from interrupt level, by the per-packet-arrival-processing function.
The cache also holds the pages fetched from a server for re-replication. That is why we need pages which "stay" in the cache, and the explicit free function. Maintaining coherence is easier if we have only the cache, and not a separate replication pool of pages.
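The three cache operations can be sketched in user-space C as below. This is only an illustration under stated assumptions: a linear scan over a small array stands in for the hash table and linked list, MAX_CSKM_CACHE_PAGES is set to 4 to keep the example short, and the return conventions (NO_PAGE, 0 and -1) and the helper cache_find() are hypothetical rather than taken from cskm.h.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_CSKM_CACHE_PAGES 4      /* tiny for illustration; the text suggests 128 or 256 */
#define NO_PAGE (-1L)               /* hypothetical "not found" value */

struct cache_entry {
    unsigned long long page_id;     /* key */
    long phys_page_num;             /* cached physical page number */
    int freeable;                   /* 0 = stays until cskm_cache_delete() */
    int in_use;
    unsigned long seq;              /* insertion order; smallest is oldest */
};

static struct cache_entry cache[MAX_CSKM_CACHE_PAGES];
static unsigned long next_seq;

/* Linear scan; the design calls for a hash table here. */
static struct cache_entry *cache_find(unsigned long long page_id)
{
    for (int i = 0; i < MAX_CSKM_CACHE_PAGES; i++)
        if (cache[i].in_use && cache[i].page_id == page_id)
            return &cache[i];
    return NULL;
}

/* Returns 0 on success, -1 if every slot holds a non-freeable page. */
int cskm_cache_insert(int freeable, long phys_page_num, unsigned long long page_id)
{
    struct cache_entry *slot = cache_find(page_id);  /* reuse the slot on re-insert */
    struct cache_entry *oldest = NULL;

    assert(phys_page_num >= 0);
    for (int i = 0; !slot && i < MAX_CSKM_CACHE_PAGES; i++) {
        struct cache_entry *e = &cache[i];
        if (!e->in_use)
            slot = e;                                /* an empty slot is available */
        else if (e->freeable && (!oldest || e->seq < oldest->seq))
            oldest = e;                              /* FIFO eviction candidate */
    }
    if (!slot)
        slot = oldest;                               /* evict the oldest freeable page */
    if (!slot)
        return -1;

    slot->page_id = page_id;
    slot->phys_page_num = phys_page_num;
    slot->freeable = freeable;
    slot->in_use = 1;
    slot->seq = next_seq++;
    return 0;
}

/* Returns the physical page number, or NO_PAGE; freeable pages leave the cache. */
long cskm_cache_lookup(unsigned long long page_id)
{
    struct cache_entry *e = cache_find(page_id);
    if (!e)
        return NO_PAGE;
    if (e->freeable)
        e->in_use = 0;
    return e->phys_page_num;
}

/* Explicit free, needed for pages inserted as non-freeable. */
int cskm_cache_delete(unsigned long long page_id)
{
    struct cache_entry *e = cache_find(page_id);
    if (!e)
        return -1;
    e->in_use = 0;
    return 0;
}
```

A real driver would also release the evicted physical page and protect these structures against concurrent access from interrupt level; both are omitted here.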
PageId
Pages on the NMS system are identified using globally
valid, 8-byte (64-bit) addresses.
(1) The first two bytes
of the address will identify an NMS server system,
(2) the next 2 bytes will
be the virtual unit which this page falls under, and
(3) the remaining 4 bytes
will be the offset into the virtual unit of a chunk of data whose size
is equal to the native page-size of the client machine owning the page.
Also, the page should be page-aligned in the virtual unit.
All these semantics are only
used by the clients. Servers do not know the details of the page-id.
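The page-id layout above can be illustrated with a few packing helpers. This is a sketch under assumptions the text does not settle: the "first two bytes" are taken to be the most significant 16 bits, the offset is taken to be a byte offset, and a 4096-byte page size is assumed. The names nms_page_id, NMS_PAGE_SIZE and page_id_*() are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define NMS_PAGE_SIZE 4096u   /* assumed native page size of the client */

typedef uint64_t nms_page_id;

/* Pack server id (16 bits), virtual unit (16 bits) and offset (32 bits). */
static nms_page_id page_id_make(uint16_t server, uint16_t vunit, uint32_t offset)
{
    assert(offset % NMS_PAGE_SIZE == 0);   /* pages are page-aligned in the unit */
    return ((nms_page_id)server << 48) | ((nms_page_id)vunit << 32) | offset;
}

static uint16_t page_id_server(nms_page_id id) { return (uint16_t)(id >> 48); }
static uint16_t page_id_vunit(nms_page_id id)  { return (uint16_t)(id >> 32); }
static uint32_t page_id_offset(nms_page_id id) { return (uint32_t)id; }
```

Since servers treat the page-id as opaque, only client code would need helpers of this kind.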
EventQ
Events to be notified to the
user-level daemons (CSD/SSD/APD) are stored in a ring-buffer (implemented
as a list) in the kernel. There is a buffer which maintains all the events
which at least one user-level entity is looking at. A pointer is maintained
for each such user-level entity. These pointers denote the events yet to
be read by the corresponding user-level daemon.
The C-structures look like
struct EventQ {
    int event_type;
    unsigned long time_stamp;
    unsigned int length;            /* length of the buffer actually used */
    unsigned int vm_unit;
    unsigned char buffer[MAX_BUF];
    struct EventQ *next;
};
The buffer field is decoded based on the event_type and length. It contains additional event-dependent information about the event. A comprehensive list of event-types is placed here.
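The idea of one shared event buffer with a separate read pointer per user-level listener can be sketched as follows. This is an illustration only: it uses a fixed-size array instead of the list-based ring described above, drops the oldest events when a listener falls behind, and represents a listener's interest as a bitmask. EVQ_SLOTS, struct listener, evq_post() and evq_next() are hypothetical names.

```c
#include <assert.h>

#define EVQ_SLOTS 8                 /* hypothetical ring capacity */

struct event {
    int event_type;
    unsigned int vm_unit;
};

struct event_ring {
    struct event slot[EVQ_SLOTS];
    unsigned long head;             /* total number of events posted so far */
};

struct listener {
    unsigned long interest_mask;    /* bit t set = wants events of type t */
    unsigned long next;             /* index of the next event to examine */
};

/* Kernel side: append an event, overwriting the oldest slot when full. */
static void evq_post(struct event_ring *r, int event_type, unsigned int vm_unit)
{
    assert(event_type >= 0 && event_type < 32);
    r->slot[r->head % EVQ_SLOTS].event_type = event_type;
    r->slot[r->head % EVQ_SLOTS].vm_unit = vm_unit;
    r->head++;
}

/* Listener side: copy the next interesting event to *out.
   Returns 1 if an event was delivered, 0 if none is pending. */
static int evq_next(const struct event_ring *r, struct listener *l, struct event *out)
{
    if (r->head - l->next > EVQ_SLOTS)
        l->next = r->head - EVQ_SLOTS;      /* fell behind; skip overwritten events */
    while (l->next != r->head) {
        const struct event *e = &r->slot[l->next % EVQ_SLOTS];
        l->next++;
        if (l->interest_mask & (1UL << e->event_type)) {
            *out = *e;
            return 1;
        }
    }
    return 0;
}
```

Each listener advances only its own pointer, so a slow APD does not hold up the CSD.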
All the controlling ioctls for a block device are actually issued to the corresponding char device. The char device should not be closed before the block device is closed. All config info and event info is communicated between the cskm and a user process (such as the csd) via ioctls. The device on which an ioctl is done decides whether the info pertains to the whole system or to just one virtual unit. For example, to notify the kernel that a particular server is down, an ioctl is issued on the NMS master device (minor number zero). To get the size of a particular virtual unit, an ioctl is done on that virtual unit's minor number.
Any process that wants to listen to events does an ioctl on the minor char device associated with a virtual unit, passing the kernel a list of the events it is interested in. The kernel creates a structure for this process which contains the events it is interested in and its current_evt_pointer into the buffer. The process then picks up events with ioctls. Note that there is one ring-buffer per virtual unit. The events are noted below, with the device to which each event is delivered. Master device indicates events pertaining to the client system as a whole. Virtual unit indicates events pertaining to only one particular virtual unit.
Server-Banks
The concept of server-banks is used to reduce the
number of bits required in the PLT, the page location table. For storing a
page, a group of servers is always chosen from a single bank; it
is never split across banks. We allow 2^8 server-banks, thus
consuming 8 bits to represent a bank-id.
(** Note that we DO NOT have a priority queue mechanism
for maintaining the servers with the least load. The reason is that the randomization
scheme cannot work with priority queues. That scheme has to figure out
the least loaded servers from a set of randomly chosen servers each time.
The scheme does not choose random servers from a set of least loaded servers. **)
Virtual_Unit_Structure
struct remotemem_virtual_unit {
    int major;      /* Major number of driver, not needed? */
    int minor;      /* Actual virtual unit number */
    int pid;        /* pid of owner */
    int num_pages;  /* Size of memory (in pages) backed by this guy */
                    /* Note: Only size is needed. For mmap, we get the
                       virtual address from the vm_area_struct */
    short int *replica_grp_bmap;
};
Page-Location-Table (PLT)
This table stores the location of each swapped-out
page. Each entry is 16 bits long, consisting of an 8-bit bank-id followed
by an 8-bit bitmap of the servers in that bank that store
copies of the page. This is the replica_grp_bmap field in the
structure above.
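A 16-bit PLT entry of this shape can be encoded and decoded with simple shifts and masks. The sketch below assumes the bank-id occupies the high byte and that bit i of the low byte stands for server i within the bank; the type plt_entry and the plt_*() helpers are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

typedef uint16_t plt_entry;   /* one entry per page of the virtual unit */

/* Combine an 8-bit bank-id with an 8-bit bitmap of servers in that bank. */
static plt_entry plt_make(uint8_t bank_id, uint8_t server_bitmap)
{
    return (plt_entry)(((unsigned)bank_id << 8) | server_bitmap);
}

static uint8_t plt_bank(plt_entry e)    { return (uint8_t)(e >> 8); }
static uint8_t plt_servers(plt_entry e) { return (uint8_t)(e & 0xFF); }

/* Does server number idx (0..7) of the bank hold a copy of the page? */
static int plt_has_server(plt_entry e, unsigned idx)
{
    assert(idx < 8);
    return (plt_servers(e) >> idx) & 1;
}

/* Number of replicas currently recorded for the page. */
static int plt_replica_count(plt_entry e)
{
    int n = 0;
    for (unsigned b = plt_servers(e); b != 0; b >>= 1)
        n += (int)(b & 1);
    return n;
}
```

With this layout a bank holds at most 8 servers, which also bounds the replication degree at 8.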
Message Structure
This page describes the common structure of the
messages exchanged between the CSKM and the SSKM. The page is located here.
Design
Page-out operation
Page-out operation is initiated by the kswapd. As discussed in the earlier sections, there can be two cases; let us call them the mmap case and the swap-device case.
It should be kept in mind that the page-outs are not time-critical operations.
The major work involved is the same in both cases. The following are the major steps that have to be performed for a page-out operation.
Choosing the server-replica-group for page-out.
The set of servers is chosen using an adaptive scheme in which clients track current server loads and combine randomization with picking the servers with minimum load to select the servers that receive a page-out. Using randomization, the CSKM chooses a set of random servers from a single server bank (the server bank itself could be chosen randomly on each pageout operation). To support k-degree replication, the CSKM then picks the k+x least loaded servers out of the randomly chosen ones. k of these servers form the replica group. The remaining x servers are auxiliary servers, to be used should one of the replica-group servers fail.
According to Michael Bender, recent theoretical work has shown good bounds on the balance achievable with this type of scheme. With good balance, we just need to equip servers with sufficient buffers to handle short-term bursts in load. We can establish the required number of buffers empirically. The measurement of server load can then be as simple as sampling the number of free buffers (or perhaps better, the number of buffers currently in use). Load measurements are piggybacked on messages sent to acknowledge receipt of page-out data, or can be triggered by timeouts in periods of system inactivity.
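The selection step can be sketched as below: sample a few distinct servers from a bank at random, order them by load, and split the least loaded ones into a replica group and an auxiliary set. The sketch assumes 8 servers per bank (matching the 8-bit server bitmap in the PLT), assumes the k least loaded candidates form the replica group, and uses rand() as a stand-in random source. sample_servers() and split_by_load() are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define SERVERS_PER_BANK 8

/* Partial Fisher-Yates shuffle: fill cand[0..n-1] with n distinct server indices. */
static void sample_servers(int *cand, int n)
{
    int idx[SERVERS_PER_BANK];

    assert(n >= 0 && n <= SERVERS_PER_BANK);
    for (int i = 0; i < SERVERS_PER_BANK; i++)
        idx[i] = i;
    for (int i = 0; i < n; i++) {
        int j = i + rand() % (SERVERS_PER_BANK - i);
        int t = idx[i];
        idx[i] = idx[j];
        idx[j] = t;
        cand[i] = idx[i];
    }
}

/* Sort the n candidates by ascending load, then return two bitmaps:
   the k least loaded as the replica group, the next x as auxiliaries. */
static void split_by_load(int *cand, int n, const unsigned *load,
                          int k, int x, uint8_t *replicas, uint8_t *aux)
{
    assert(k >= 0 && x >= 0 && k + x <= n);
    for (int i = 1; i < n; i++) {            /* insertion sort; n is at most 8 */
        int c = cand[i], j = i - 1;
        while (j >= 0 && load[cand[j]] > load[c]) {
            cand[j + 1] = cand[j];
            j--;
        }
        cand[j + 1] = c;
    }
    *replicas = 0;
    *aux = 0;
    for (int i = 0; i < k; i++)
        *replicas |= (uint8_t)(1u << cand[i]);
    for (int i = k; i < k + x; i++)
        *aux |= (uint8_t)(1u << cand[i]);
}
```

The load[] array would hold the most recent load samples piggybacked on acknowledgements; sorting only the random sample, rather than keeping a global priority queue, is what keeps the randomization effective.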
This brings us to another interesting issue. It seems
that there should be a concept of primary server and secondary servers.
A primary server should try to keep the page in the main-memory and the
secondary servers can swap the page onto the disk. This will help in utilising
the main-memory space efficiently. A simple implementation of this would
be to mark the secondary copies as such. On the server-side, these pages
will be swapped out to the disk first whenever there is a choice.
On receiving the acknowledgements, HSN calls the function IF_ACKNOWLEDGEMENT().
IF_ACKNOWLEDGEMENT()
{
    Mark the DONE-BIT for this server as 1.
    If all the servers are done, then
        follow steps 9, 10, 11.
    else return
}
IF_TIMEOUT()
{
    /** This condition depicts the scenario in which we did not receive
        ACKs from all the servers. The servers from which we did not
        receive ACKs may or may not have successfully stored the page. **/
    To be on the safer side, send (by queuing in the network queue) a
    Page-free request to each of these servers.
    /** It should be noted that this eliminates the need for reliable
        messaging for the page-out operation. **/
    Next, choose servers from the set of auxiliary servers (to maintain
    the k-replication invariant) for this virtual unit.
    /** Again, we have the possibility of at-least-k-degree replication or
        exactly-k-degree replication for the pages. This option should be
        made available to the process using this virtual unit. **/
    Repeat all the steps for this server.
}
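The acknowledgement and timeout handling above can be sketched with per-page bitmaps. This is a sketch under assumptions: servers are numbered 0..7 within the bank, one DONE bit is kept per replica-group server, and a silent server is replaced by the lowest-numbered auxiliary server. struct pageout_state, pageout_ack() and pageout_timeout() are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

struct pageout_state {
    uint8_t replicas;   /* servers the page was sent to */
    uint8_t done;       /* DONE bits: servers that have acknowledged */
    uint8_t aux;        /* auxiliary servers still available */
};

/* Record an ACK from a server. Returns 1 once every replica has acknowledged. */
static int pageout_ack(struct pageout_state *s, int server)
{
    assert(server >= 0 && server < 8);
    s->done |= (uint8_t)((1u << server) & s->replicas);
    return s->done == s->replicas;
}

/* On timeout: returns the bitmap of silent servers, which should each be sent
   a Page-free request, and swaps each one for an auxiliary server.
   *resend_to receives the servers the page must now be sent to. If the
   auxiliaries run out, the group simply shrinks. */
static uint8_t pageout_timeout(struct pageout_state *s, uint8_t *resend_to)
{
    uint8_t silent = s->replicas & (uint8_t)~s->done;

    *resend_to = 0;
    for (int i = 0; i < 8; i++) {
        if (!(silent & (1u << i)))
            continue;
        s->replicas &= (uint8_t)~(1u << i);
        for (int j = 0; j < 8; j++) {
            if (s->aux & (1u << j)) {
                s->aux &= (uint8_t)~(1u << j);
                s->replicas |= (uint8_t)(1u << j);
                *resend_to |= (uint8_t)(1u << j);
                break;
            }
        }
    }
    return silent;
}
```

Because silent servers are always told to free the page, a late or lost ACK cannot leave an untracked copy behind, which is why the page-out path itself does not need reliable messaging.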
Page-in operation
The page-in operation is initiated by the page-fault handler. As with page-outs, there can be two cases -
The following are the major steps involved in a page-in operation-
else
{ In the mmap case,
Insert the page address into the page table.
goto END_PAGEIN.
In the swap-device case,
Copy the page into the page which kernel allocated for this purpose.
goto END_PAGEIN.
}
LABEL END_PAGEIN :
Queue up the Page-free requests to free up the copies of the page on the
servers.
Wakeup the process which was sleeping by unlocking the page.
Delete the timer.
}
else /**** ISR indicates some error during pagein, e.g. server_load_too_high,
server_could_not_process_the_request ***/
Delete the timer.
Do the same as the TIMEOUT() routine.
} /** of ISR **/
TIMEOUT()
{ Figure out the next server to try out.
Repeat the above step for this newly picked server.
}
It should be clear that this algorithm does not need a reliable messaging layer underneath it.
Explicit free requests are not performance-critical, either in terms of latency or bandwidth. They can therefore be batched by a client, and forwarded periodically to servers over the conventional network link. It should however be noted that explicit free requests have to be reliably transmitted in order to avoid storage leaks on the NMS servers. Explicit free requests are queued in the request queue which is maintained between the CSKM and the CSD. It is the CSD's responsibility to reliably send these requests to respective servers.
Prepaging operation
Prepaging (or prefetching) is done to reduce the latency incurred during pageins. The APDs try to predict the pages which an application is going to fault on and prepage those pages to the client memory. The prepaging operation (client-side) is implemented as an ISR. The prepaging has to take care of the race conditions discussed in the pagein.
The following are the main steps of the routine for
handling the prepaged stuff:
cache_ISR()
{
    If (Page is already present in the client)
        Ignore the page.
    else if (someone is waiting for the page)
        /** This means that a request for this page has been sent to the
            NMS server. We should behave as if we got the reply to that
            request. Using the waiting_for_table, it is always possible
            to match up the requesting process **/
    {
        Do the same processing as in the pagein_ISR().
    }
    else
        Put the page in the prepaging-cache.
}
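The three-way decision made when a prepaged page arrives can be written as a small classification function. This is a sketch only: the waiting_for_table is modelled as a fixed array of page-ids, the residency check is passed in as a flag, and the names prepage_action, wait_add() and classify_prepaged_page() are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

enum prepage_action {
    PREPAGE_IGNORE,           /* page is already present in the client */
    PREPAGE_COMPLETE_PAGEIN,  /* a process is blocked on it: treat as the page-in reply */
    PREPAGE_INSERT_CACHE      /* nobody wants it yet: park it in the prepaging cache */
};

#define WAIT_SLOTS 16         /* hypothetical table size */

static uint64_t waiting_for_table[WAIT_SLOTS];
static int waiting_used[WAIT_SLOTS];

/* Record that a page-in request for page_id is outstanding.
   Returns the slot used, or -1 if the table is full. */
static int wait_add(uint64_t page_id)
{
    for (int i = 0; i < WAIT_SLOTS; i++) {
        if (!waiting_used[i]) {
            waiting_used[i] = 1;
            waiting_for_table[i] = page_id;
            return i;
        }
    }
    return -1;
}

static int is_waited_for(uint64_t page_id)
{
    for (int i = 0; i < WAIT_SLOTS; i++)
        if (waiting_used[i] && waiting_for_table[i] == page_id)
            return 1;
    return 0;
}

/* resident: nonzero if the page is already present in the client. */
static enum prepage_action classify_prepaged_page(uint64_t page_id, int resident)
{
    if (resident)
        return PREPAGE_IGNORE;
    if (is_waited_for(page_id))
        return PREPAGE_COMPLETE_PAGEIN;
    return PREPAGE_INSERT_CACHE;
}
```

Checking residency first is what prevents a late prepaged copy from overwriting a page the application may already have modified.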
The pageins should carry priorities, so that
the prepaging requests generated by active processing receive less favorable
treatment by servers than requests on which a client application is currently
blocked. The mechanism for this is simple: just have two priority
levels, one for prepaging requests and the other for non-prepaging requests.