curses
'' package, which is a standard
Unix package, originally developed for System V, that supports the
creation of text-mode screen-oriented user interfaces.
The second package you will use is a simple set of routines that I wrote
to support the interpretation of Uniform Resource Locators (URLs),
the sending of of HTTP requests to remote servers, and the interpretation
of the resulting responses.
An HTML file consists of three types of tokens:
TAG
is an identifier that distinguishes
different kinds of markup tokens.
The identifiers ATTR1
, ATTR2
, ...
name attributes to which the respective values
VALUE1
, VALUE2
, ...
are assigned. Values are strings, which may be enclosed
either in single quotes or in double quotes,
and which must be so enclosed if the value strings contain spaces.
The set of attributes and values that may be present differs
for each different HTML tag.
Whitespace inside of markup tokens is ignored, except when it
occurs between double quotes as part of a value string.
HTML tags and attribute names are case-insensitive,
so that, for example, <PRE> and <pre> mean the
same thing.
When a browser reads an HTML document, it formats the text to
fit on the user's screen. The markup indicates the important
structural elements of the document, which will generally influence
the way in which the document is formatted. Markup never appears in
the formatted output. Whitespace, including newlines, is also
generally ignored by a browser, in the sense that any sequence of whitespace
characters is regarded as equivalent to a single whitespace character,
and is rendered as a space between words in the output.
The only exception to this is when rendering ``preformatted'' text,
which occurs betweeen markup tags <pre>
and
</pre>
. In this case, spaces and newlines are
output exactly as they appear in the document source.
One of the important features of HTML is that it is very easy for a browser to distinguish markup from text and whitespace, and a browser can simply ignore markup whose tags it does not understand. For this assignment, you only have to handle a very small number of tags. The only markup tag your browser absolutely must handle is:
A
''
and it ends with a markup token having tag ``/A
''.
The ``HREF
'' attribute has as its value
the URL of the document to which the link points.
Note that there might be other attributes besides the
``HREF
'', and that the ``HREF
''
need not appear first.
The ``anchor text'', which occurs after the initial ``<A ...>''
markup token, and which continues up until the matching
``</A>'' markup token, would usually be highlighted
by a browser to indicate the presence of a link.
For additional credit, and a better-looking display, you should also attempt to handle the following markup tags:
header text
should be set off from the rest
of the document and possibly highlighted.
You may also wish to handle the tags
``<H2> ... </H2>'',
``<H3> ... </H3>'', etc. (up to <H6>)
which indicate section headings of lesser importance.
preformatted text
'' is output without attempting
to fill lines to match the width of the display device.
Spaces and newlines are output exactly as they appear in the
document source. Markup within the preformatted region is
still interpreted as usual.
http.h
, and
http.c
,
located in this directory.
The header file
http.h
contains comments telling you how to use the package.
I have also provided a demonstration program, ``retrieve
'',
which illustrates how to use the package.
This program takes a single URL as a command-line argument, retrieves
the document using HTTP, and prints out various components of the response,
including the document body.
The source to this program is in
retrieve.c
.
There is also a
Makefile
for building the demonstration program.
Here is a brief summary of how to use this package.
There are two important data types that are provided by this
package: HTTP
and URL
.
An HTTP
object is used to control an HTTP connection
to a remote site. A URL
object contains information
from a URL (Web address). A third data type, IPADDR
is actually an alias for the system-defined data type
struct sockaddr_in
, and it is used to hold the IP address
of a host on the Internet.
To retrieve a document located at a particular URL, which we assume
is given as a string such as http://www.ug.cs.sunysb.edu/
,
you must first obtain the IP address and port number to contact.
This is done as follows:
url_parse()
to parse the URL string and create
a URL object.
url_address()
and url_port()
.
Note: the first time url_address()
is used
on a URL object, a DNS lookup will occur, which will usually
occur quickly, but could result in a delay of up to a minute
or so if the required nameservers are down or inaccessible.
url_method()
,
the hostname using url_hostname()
, and the path on
the remote server using url_path()
.
This package only supports document retrieval using access method
"http".
url_free()
.
http_open()
to create an open HTTP connection to
the specified IP address and port.
http_request()
to issue an HTTP GET request for a
particular URL to the server at the other end of an HTTP connection.
http_file()
.
http_response()
to finish the request and collect
the response code and response headers.
http_status()
.
Query the response headers using http_headers_lookup()
.
http_getc()
.
http_close()
.
curses
'' package was originally from System V, but versions
are now pretty much universally available on all types of Unix systems.
There are actually two versions of curses
available on our
systems: ``curses
'' and ``ncurses
''.
I recommend using the ``ncurses
'' version;
the ``curses
'' version only supports a subset of the
functionality.
Documentation on ncurses
can be found by running
``man ncurses
''.
I have also put documentation (for ``curses
, but also
generally applicable to ``ncurses
'') online
in Postscript form and
in ASCII text form.
What this package does for you is to give you a simple model for dealing
with the user's screen, so that you don't have to worry about the details
of how to output the proper escape sequences to change the screen from
one display to another. Basically, what you do with curses
is you begin by calling initscr()
to let curses
take over control of your screen.
Then, whenever you want to change what is on the screen, you use the
functions provided by the package to construct the new screen view
on a ``hidden,'' in-memory view of the screen, and when you are ready
you call the function ``refresh()
.'' This tells the system,
in essence, ``make the screen look like that!'' The system handles all
the details of figuring out a minimal set of commands to send to the
screen that will change the display from what was previously there to
the new stuff, and generating the proper escape codes to invoke these
commands. Finally, when your program is finished, you call
endwin()
to restore the screen to normal operation.
Something important to keep in mind is that, while you are using
curses
to manage your screen, you must not use the
standard I/O library functions to read input from the keyboard.
Instead, you have to use the special functions
(getch()
, getstr()
, etc.)
provided by curses
for this purpose.
The reason is, curses
has to know at all times what is
showing on the screen. When the user types something, the echoing of
characters will change what is shown on the screen and move the cursor.
If you use curses
functions to read the typein,
curses
can track what has happened to the screen.
If you use non-curses functions to read the screen, curses
will not be aware that the screen has changed and will produce incorrect
output later on.
In order to compile your program with ncurses
, you should
put ``#include <ncurses.h>
'' in each of your source
files that use the curses
routines, and you have to
add ``-lncurses
'' to the end of the C compiler
command line that links your program. If you don't do this, you'll
get ``undefined function'' messages for all the curses
functions. I have placed here
the source to the demo program I discussed in class.
From previous experience, I know that beginners tend to approach
curses
with the wrong idea. The correct way of programming
with curses
is to simply think about what the next screen view
is that you want to show the user. With this point of view, you just
worry about constructing that view and let curses
deal with
the problem of how to actually get that view on the screen.
The wrong way of programming with curses, which will actually make your life
much more miserable, is to try to describe a particular sequence of changes
to be made to the screen.
lynx
'' text-mode Web browser that is available
on the UG lab systems.
When you start your program, it should take as a command line argument
the URL of an initial document to display (the ``home page'').
Your program should retrieve this document using HTTP, format it as
described below, and then display the result on the screen.
The user should be able to use the up and down arrow keys to scroll one
line forward and back in the document, and the ``page up'' and ``page down''
keys to scroll one screen forward and back.
(See /usr/include/ncurses.h
to find definitions of the
codes returned by getch()
for the various keypad keys
such as the arrow keys.)
Your program should properly recognize HTML anchors, and highlight the link text on the screen. The user should be able to use the ``tab'' key to move the cursor from one link in a document to the next, and ``shift tab'' to move the cursor from one link to the previous one. When the cursor is positioned on a link, typing the right arrow key should cause the program to follow that link and display the document that the link points to. Typing the left arrow key should abandon the document currently being displayed and return to the previously displayed document (if any).
When your program retrieves a document using HTTP, you should check
the Content-type
header to find out what type of document
has been returned. Your program should be able to display documents
that have Content-type
equal to ``text/plain
''
or ``text/html
''. For documents of other types, your program
should produce an error message and refuse to display the document.
Documents of type text/plain
should simply be displayed
verbatim on the screen. This is the type of document that would be
returned, for example if you retrieve
this C program.
This is probably the easiest display mode to
get working first, and I suggest you start with this. If the lines are
too long to fit on the screen, they should simply be truncated.
It is not necessary to provide a horizontal scrolling capability.
However, if the document is too long to fit on the screen, the user should
be able to scroll through it
as described above. HTML markup, including hyperlinks, should not be
processed for text/plain
documents.
Most of the documents on the Web are of type text/html
.
For example, this is what you will get if you retrieve
this page.
For text/html
documents, your browser should identify the
text, whitespace, and markup tokens in the document, and it should identify
the tag on each of the markup tokens. Your program is required to at least
handle HTML anchors (markup tag ``A
'').
You may also choose to handle other kinds of markup, as discussed above.
No matter what markup your program can handle, markup should never be
displayed on the screen -- any markup that your program doesn't know what
to do with should simply be discarded.
When displaying text/html
documents, your program should
format the text to fit the screen width. That is, your program should
``fill'' each line of output with as many text tokens as will fit before
going to the next line. Whitespace tokens should be ignored, except
insofar as they serve to delimit text tokens.
You do not need to ``adjust'' the spaces between the text tokens to achieve
alignment in the right-hand margin.
One interesting HTML tag you may wish to handle is the
``<pre>
'' tag. Text in between
``<pre>
'' and ``</pre>
''
is to be regarded as ``preformatted.'' This means you should
output the whitespace tokens verbatim. Markup should still be
processed, though. An example of the use of preformatted text
can be seen when you get a directory listing from a Web server,
such as this one.
next_token()
'' to read and
return the next token from the input stream.
This will enable you to read in a document as a sequence of tokens,
which you can organize in memory as a linked list (let's call this the
token list)
You can then have one ``output_format()
'' function
that scans over the token list and builds a new,
doubly linked list of lines to be displayed (let's call this the
display list).
If you then keep track of the position in the display list that corresponds
to the start of the current screen, then you can write just one
``display()
'' function that uses curses
functions to copy the lines in the ``current window'' from the display
list to the screen.
With this kind of organization, scrolling the screen is simply a
matter of changing the current position in the display list and
calling display()
. Handling hyperlinks under the above organization requires that you keep track of the correspondence between positions on the screen and tokens in the token list. This can be done by having each line in the display list point back to the token in the token list that corresponds to the start of that line. As you build the display list, you can record within any anchor tokens in the token list the screen positions where the highlighted text starts and ends. Then, when the user selects a link you can go back to the start of that line in the token list, scan forward in the token list to find the proper anchor token, and then you will have the information needed to retrieve the linked document.
A different (and possibly conceptually simpler) way of handling the hyperlinks would be to simply make a list of the positions of all the links on the screen as you build the display list. When the user selects a link, you just scan through this list to find the selected anchor, then use the URL in it to retrieve the linked document.
Note that if you use the above organization, the difference between displaying
text/html
and text/plain
is that for the latter
you can just bypass the intermediate token list and build the display list
directly from the input. I suggest doing this first, then when you have
the display of text/plain
working correctly with scrolling,
etc. you can go back and introduce the intermediate token list
data structure to be able to handle text/html
and hyperlinks.
If proper organization and programming technique is used, I don't think this project needs to be extremely long or difficult (probably less than 1000 lines of C, not counting the 500 lines of network code I already wrote for you). However, I expect that there will likely be a number of students who will not be able to finish this project. To maximize your grade, I suggest you organize your work on this project so that you are able to hand in a program that compiles and does something interesting, even if you are not able to finish all the functionality specified above. This implies that you should adopt a kind of incremental approach to building this program, so that when you run out of time you can hand in the most recent thing you had working.