Homework 5: Advanced C Programming

Due December 11, 1997

For this assignment, you will write the program ``Internet Bumbler'' which will be a ``barely functional'' Web browser. In order to accomplish this feat, you will make use of two ``canned'' packages. The first package is the ``curses'' package, which is a standard Unix package, originally developed for System V, that supports the creation of text-mode screen-oriented user interfaces. The second package you will use is a simple set of routines that I wrote to support the interpretation of Uniform Resource Locators (URLs), the sending of of HTTP requests to remote servers, and the interpretation of the resulting responses.

Background Information

HTML

Although the full HTML language specification is somewhat complicated, for this assignment you can take a very simple-minded point of view, which I will now describe.

An HTML file consists of three types of tokens:

whitespace
Whitespace tokens consist of sequences of whitespace characters, which are spaces (ASCII 0x20), tabs (ASCII 0x9), newlines (ASCII 0xA), carriage returns (ASCII 0xD), and form feeds (ASCII 0xE).

markup
Markup tokens start with a ``less than'' sign < and end with a matching ``greater than'' sign >. The general form of what is in between is as follows: < TAG ATTR1=VALUE1 ATTR2=VALUE2 ... > The TAG is an identifier that distinguishes different kinds of markup tokens. The identifiers ATTR1, ATTR2, ... name attributes to which the respective values VALUE1, VALUE2, ... are assigned. Values are strings, which may be enclosed either in single quotes or in double quotes, and which must be so enclosed if the value strings contain spaces. The set of attributes and values that may be present differs for each different HTML tag. Whitespace inside of markup tokens is ignored, except when it occurs between double quotes as part of a value string. HTML tags and attribute names are case-insensitive, so that, for example, <PRE> and <pre> mean the same thing.

text
Text tokens are any sequences of characters that are not whitespace tokens and are not markup tokens.

When a browser reads an HTML document, it formats the text to fit on the user's screen. The markup indicates the important structural elements of the document, which will generally influence the way in which the document is formatted. Markup never appears in the formatted output. Whitespace, including newlines, is also generally ignored by a browser, in the sense that any sequence of whitespace characters is regarded as equivalent to a single whitespace character, and is rendered as a space between words in the output. The only exception to this is when rendering ``preformatted'' text, which occurs betweeen markup tags <pre> and </pre>. In this case, spaces and newlines are output exactly as they appear in the document source.

One of the important features of HTML is that it is very easy for a browser to distinguish markup from text and whitespace, and a browser can simply ignore markup whose tags it does not understand. For this assignment, you only have to handle a very small number of tags. The only markup tag your browser absolutely must handle is:

<A HREF=url ...> ... anchor text ... </A>
This type of construction is called an anchor and it is used to identify a hypertext link in the document. An anchor starts with a markup token having tag ``A'' and it ends with a markup token having tag ``/A''. The ``HREF'' attribute has as its value the URL of the document to which the link points. Note that there might be other attributes besides the ``HREF'', and that the ``HREF'' need not appear first. The ``anchor text'', which occurs after the initial ``<A ...>'' markup token, and which continues up until the matching ``</A>'' markup token, would usually be highlighted by a browser to indicate the presence of a link.

For additional credit, and a better-looking display, you should also attempt to handle the following markup tags:

<H1> ... header text ... </H1>
This indicates a ``level 1'' section heading. The header text should be set off from the rest of the document and possibly highlighted. You may also wish to handle the tags ``<H2> ... </H2>'', ``<H3> ... </H3>'', etc. (up to <H6>) which indicate section headings of lesser importance.

<P>
This tag indicates the beginning of a paragraph, and would usually cause a blank line to be inserted in the output.

<BR>
This tag indicates a line break, and it would usually cause the current line of output to end and a new line to start. If this tag occurs at the beginning of a line of output, a blank line will result.

<PRE> ... preformatted text ...</PRE>
The ``preformatted text'' is output without attempting to fill lines to match the width of the display device. Spaces and newlines are output exactly as they appear in the document source. Markup within the preformatted region is still interpreted as usual.

HTTP

HTTP (HyperText Transport Protocol) is the protocol that is used to transfer documents over the World Wide Web. Again, the official specification is very technical, but you don't really have to know anything much about it for this assignment. I have programmed for you a basic set of routines for interpreting URLs, making HTTP connections over the network, sending HTTP requests, and interpreting the responses to such requests. My code consists of the files http.h, and http.c, located in this directory. The header file http.h contains comments telling you how to use the package. I have also provided a demonstration program, ``retrieve'', which illustrates how to use the package. This program takes a single URL as a command-line argument, retrieves the document using HTTP, and prints out various components of the response, including the document body. The source to this program is in retrieve.c. There is also a Makefile for building the demonstration program.

Here is a brief summary of how to use this package. There are two important data types that are provided by this package: HTTP and URL. An HTTP object is used to control an HTTP connection to a remote site. A URL object contains information from a URL (Web address). A third data type, IPADDR is actually an alias for the system-defined data type struct sockaddr_in, and it is used to hold the IP address of a host on the Internet.

To retrieve a document located at a particular URL, which we assume is given as a string such as http://www.ug.cs.sunysb.edu/, you must first obtain the IP address and port number to contact. This is done as follows:

  1. Use url_parse() to parse the URL string and create a URL object.

  2. Extract the IP address and port number using url_address() and url_port(). Note: the first time url_address() is used on a URL object, a DNS lookup will occur, which will usually occur quickly, but could result in a delay of up to a minute or so if the required nameservers are down or inaccessible.

  3. If desired, extract the "access method" ("http", "ftp", "mailto", etc.) using url_method(), the hostname using url_hostname(), and the path on the remote server using url_path(). This package only supports document retrieval using access method "http".

  4. When finished with the URL, free it with url_free().
Once you have the IP address and the port number, the document can be retrieved as follows:
  1. Use http_open() to create an open HTTP connection to the specified IP address and port.

  2. Use http_request() to issue an HTTP GET request for a particular URL to the server at the other end of an HTTP connection.

  3. [optional] Output additional request headers to the connection using the FILE pointer returned by http_file().

  4. Use http_response() to finish the request and collect the response code and response headers.

  5. Examine the response code and message using http_status(). Query the response headers using http_headers_lookup().

  6. Collect the document making up the body of the response using http_getc().

  7. Close the HTTP connection using http_close().

Curses

The ``curses'' package was originally from System V, but versions are now pretty much universally available on all types of Unix systems. There are actually two versions of curses available on our systems: ``curses'' and ``ncurses''. I recommend using the ``ncurses'' version; the ``curses'' version only supports a subset of the functionality. Documentation on ncurses can be found by running ``man ncurses''. I have also put documentation (for ``curses, but also generally applicable to ``ncurses'') online in Postscript form and in ASCII text form. What this package does for you is to give you a simple model for dealing with the user's screen, so that you don't have to worry about the details of how to output the proper escape sequences to change the screen from one display to another. Basically, what you do with curses is you begin by calling initscr() to let curses take over control of your screen. Then, whenever you want to change what is on the screen, you use the functions provided by the package to construct the new screen view on a ``hidden,'' in-memory view of the screen, and when you are ready you call the function ``refresh().'' This tells the system, in essence, ``make the screen look like that!'' The system handles all the details of figuring out a minimal set of commands to send to the screen that will change the display from what was previously there to the new stuff, and generating the proper escape codes to invoke these commands. Finally, when your program is finished, you call endwin() to restore the screen to normal operation.

Something important to keep in mind is that, while you are using curses to manage your screen, you must not use the standard I/O library functions to read input from the keyboard. Instead, you have to use the special functions (getch(), getstr(), etc.) provided by curses for this purpose. The reason is, curses has to know at all times what is showing on the screen. When the user types something, the echoing of characters will change what is shown on the screen and move the cursor. If you use curses functions to read the typein, curses can track what has happened to the screen. If you use non-curses functions to read the screen, curses will not be aware that the screen has changed and will produce incorrect output later on.

In order to compile your program with ncurses, you should put ``#include <ncurses.h>'' in each of your source files that use the curses routines, and you have to add ``-lncurses'' to the end of the C compiler command line that links your program. If you don't do this, you'll get ``undefined function'' messages for all the curses functions. I have placed here the source to the demo program I discussed in class.

From previous experience, I know that beginners tend to approach curses with the wrong idea. The correct way of programming with curses is to simply think about what the next screen view is that you want to show the user. With this point of view, you just worry about constructing that view and let curses deal with the problem of how to actually get that view on the screen. The wrong way of programming with curses, which will actually make your life much more miserable, is to try to describe a particular sequence of changes to be made to the screen.

Program Specifications

Your program should behave roughly like a very limited version of the ``lynx'' text-mode Web browser that is available on the UG lab systems.

When you start your program, it should take as a command line argument the URL of an initial document to display (the ``home page''). Your program should retrieve this document using HTTP, format it as described below, and then display the result on the screen. The user should be able to use the up and down arrow keys to scroll one line forward and back in the document, and the ``page up'' and ``page down'' keys to scroll one screen forward and back. (See /usr/include/ncurses.h to find definitions of the codes returned by getch() for the various keypad keys such as the arrow keys.)

Your program should properly recognize HTML anchors, and highlight the link text on the screen. The user should be able to use the ``tab'' key to move the cursor from one link in a document to the next, and ``shift tab'' to move the cursor from one link to the previous one. When the cursor is positioned on a link, typing the right arrow key should cause the program to follow that link and display the document that the link points to. Typing the left arrow key should abandon the document currently being displayed and return to the previously displayed document (if any).

When your program retrieves a document using HTTP, you should check the Content-type header to find out what type of document has been returned. Your program should be able to display documents that have Content-type equal to ``text/plain'' or ``text/html''. For documents of other types, your program should produce an error message and refuse to display the document.

Documents of type text/plain should simply be displayed verbatim on the screen. This is the type of document that would be returned, for example if you retrieve this C program. This is probably the easiest display mode to get working first, and I suggest you start with this. If the lines are too long to fit on the screen, they should simply be truncated. It is not necessary to provide a horizontal scrolling capability. However, if the document is too long to fit on the screen, the user should be able to scroll through it as described above. HTML markup, including hyperlinks, should not be processed for text/plain documents.

Most of the documents on the Web are of type text/html. For example, this is what you will get if you retrieve this page. For text/html documents, your browser should identify the text, whitespace, and markup tokens in the document, and it should identify the tag on each of the markup tokens. Your program is required to at least handle HTML anchors (markup tag ``A''). You may also choose to handle other kinds of markup, as discussed above. No matter what markup your program can handle, markup should never be displayed on the screen -- any markup that your program doesn't know what to do with should simply be discarded. When displaying text/html documents, your program should format the text to fit the screen width. That is, your program should ``fill'' each line of output with as many text tokens as will fit before going to the next line. Whitespace tokens should be ignored, except insofar as they serve to delimit text tokens. You do not need to ``adjust'' the spaces between the text tokens to achieve alignment in the right-hand margin.

One interesting HTML tag you may wish to handle is the ``<pre>'' tag. Text in between ``<pre>'' and ``</pre>'' is to be regarded as ``preformatted.'' This means you should output the whitespace tokens verbatim. Markup should still be processed, though. An example of the use of preformatted text can be seen when you get a directory listing from a Web server, such as this one.

Suggestions

To get this program to work, you'll need to be careful about proper organization and modularization, and also to choose the right data structures. I suggest defining a ``token'' data type to represent the three possible kinds of tokens (text, whitespace, and markup), and writing a function ``next_token()'' to read and return the next token from the input stream. This will enable you to read in a document as a sequence of tokens, which you can organize in memory as a linked list (let's call this the token list) You can then have one ``output_format()'' function that scans over the token list and builds a new, doubly linked list of lines to be displayed (let's call this the display list). If you then keep track of the position in the display list that corresponds to the start of the current screen, then you can write just one ``display()'' function that uses curses functions to copy the lines in the ``current window'' from the display list to the screen. With this kind of organization, scrolling the screen is simply a matter of changing the current position in the display list and calling display().

Handling hyperlinks under the above organization requires that you keep track of the correspondence between positions on the screen and tokens in the token list. This can be done by having each line in the display list point back to the token in the token list that corresponds to the start of that line. As you build the display list, you can record within any anchor tokens in the token list the screen positions where the highlighted text starts and ends. Then, when the user selects a link you can go back to the start of that line in the token list, scan forward in the token list to find the proper anchor token, and then you will have the information needed to retrieve the linked document.

A different (and possibly conceptually simpler) way of handling the hyperlinks would be to simply make a list of the positions of all the links on the screen as you build the display list. When the user selects a link, you just scan through this list to find the selected anchor, then use the URL in it to retrieve the linked document.

Note that if you use the above organization, the difference between displaying text/html and text/plain is that for the latter you can just bypass the intermediate token list and build the display list directly from the input. I suggest doing this first, then when you have the display of text/plain working correctly with scrolling, etc. you can go back and introduce the intermediate token list data structure to be able to handle text/html and hyperlinks.

If proper organization and programming technique is used, I don't think this project needs to be extremely long or difficult (probably less than 1000 lines of C, not counting the 500 lines of network code I already wrote for you). However, I expect that there will likely be a number of students who will not be able to finish this project. To maximize your grade, I suggest you organize your work on this project so that you are able to hand in a program that compiles and does something interesting, even if you are not able to finish all the functionality specified above. This implies that you should adopt a kind of incremental approach to building this program, so that when you run out of time you can hand in the most recent thing you had working.


Eugene W. Stark