For this assignment, you are to write a C program words
that
reads a text file, makes a tally of all the different words used in the
file, and prints out a list of the words that appear most frequently,
along with the number of times each word appears. This list should be
printed in descending order of frequency of appearance, so that the
most-frequently-used word is printed first, then the second-most-frequently
used word, etc.
For the purpose of this assignment, a word is any sequence of upper and lower-case letters that appears in the input file preceded immediately either by the beginning of the file or a non-letter and followed immediately either by the end of the file or a non-letter. Except for their function in delimiting the words in the file, your program should ignore non-letter characters such as punctuation, digits, and control characters. Your program should not make any assumption about the maximum length of a word in the input file, but you may regard two words as equal unless they differ somewhere in their first 31 characters. Your program should be case-insensitive; that is, if two words differ only in that one has upper-case letters where the other has lower-case letters, then the words should be regarded as equal.
It should be possible to invoke words
by a command of any one
of the following forms:
words
should read the text file from the standard input, and the output should
contain information about the 20 most-frequently used words.
If invoked with a single command-line argument of the form -nnn
,
where nnn
is a number, words
should read the text
file from the standard input, and produce output concerning the
nnn
most-frequently used words.
If invoked with a list of filenames as arguments, words
should read each of the files in succession, and at the end produce
information about the 20 most-frequently used words.
Finally, if invoked with a list of filenames and a -nnn
argument,
words
should read each of the files and at the end produce
information about the nnn
most-frequently used words.
An example of the output of words -5
would be as follows:
If you find it helpful, you may assume that a file contains at most 100,000 different words. If your program is given a file that contains more than 100,000 different words, it should print a warning message and then ignore any new words after the first 100,000. Note that, even if you do make this assumption, your program should still produce correct output for the first 100,000 different words encountered; thus, if your program is given as input a file containing the word "frog" repeated 150,000 times, it should not print any warning message and it will output the proper count (150,000) for the word "frog".
Once you have written and debugged your program, perform a comparison of
the CPU time taken by your program and by the program I wrote, which is
available
here
(rename words.bin
to words
after downloading).
I have also included
test files of approximately 1K bytes, 10K bytes, 100K bytes, and 1M bytes.
Use the time
command as shown in the following example:
x
and CPU time as the dependent variable y
,
which of the following functions:
y = x
y = x^2
y = exp(x)
Use the handin
program to electronically hand in your work
as assignment HW3. Organize your work into several files as follows:
README
should contain a human-readable report
on what you did, and should describe the contents of the
other files and how to compile and run them.
build
should be a POSIX shell script to compile your
program and produce an executable file words
.
demo
should be a shell script to demonstrate
the operation of your program on some test files.
words.c
(you may choose another name, or split
your program into more than one file), should contain the
source to your program words
.