Homework 3: C Programming Exercise

Due October 30, 1997

For this assignment, you are to write a C program, words, that reads a text file, makes a tally of all the different words used in the file, and prints out a list of the words that appear most frequently, along with the number of times each word appears. This list should be printed in descending order of frequency of appearance, so that the most-frequently-used word is printed first, then the second-most-frequently-used word, and so on.

For the purpose of this assignment, a word is any sequence of upper- and lower-case letters that appears in the input file, preceded immediately either by the beginning of the file or by a non-letter, and followed immediately either by the end of the file or by a non-letter. Except for their function in delimiting the words in the file, your program should ignore non-letter characters such as punctuation, digits, and control characters. Your program should not make any assumption about the maximum length of a word in the input file, but you may regard two words as equal unless they differ somewhere in their first 31 characters. Your program should be case-insensitive; that is, if two words differ only in that one has upper-case letters where the other has lower-case letters, then the words should be regarded as equal.
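One possible way to read words under this definition is sketched below. This is only an illustration, not the required implementation; the function name read_word and the constant SIGNIFICANT are my own choices.

    #include <ctype.h>
    #include <stdio.h>

    #define SIGNIFICANT 31   /* only the first 31 characters distinguish words */

    /*
     * Read the next word from fp into buf (which must hold at least
     * SIGNIFICANT+1 characters).  Letters are folded to lower case, and
     * characters beyond the first SIGNIFICANT are read but discarded, so
     * arbitrarily long words are handled.  Returns 1 if a word was read,
     * 0 at end of file.
     */
    static int read_word(FILE *fp, char *buf)
    {
        int c, len = 0;

        /* Skip non-letter characters (punctuation, digits, control chars). */
        while ((c = getc(fp)) != EOF && !isalpha(c))
            ;
        if (c == EOF)
            return 0;

        /* Collect letters, keeping only the first SIGNIFICANT of them. */
        do {
            if (len < SIGNIFICANT)
                buf[len++] = tolower(c);
        } while ((c = getc(fp)) != EOF && isalpha(c));
        buf[len] = '\0';
        return 1;
    }

Because characters past the 31st are read and discarded rather than stored, the sketch makes no assumption about the maximum word length in the file.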

It should be possible to invoke words by a command of any one of the following forms:

    words
    words -nnn
    words file1 file2 ...
    words -nnn file1 file2 ...

If invoked by the first form (no command-line arguments), words should read the text file from the standard input, and the output should contain information about the 20 most-frequently-used words. If invoked with a single command-line argument of the form -nnn, where nnn is a number, words should read the text file from the standard input and produce output concerning the nnn most-frequently-used words. If invoked with a list of filenames as arguments, words should read each of the files in succession, and at the end produce information about the 20 most-frequently-used words. Finally, if invoked with a list of filenames and a -nnn argument, words should read each of the files and at the end produce information about the nnn most-frequently-used words. An example of the output of words -5 would be as follows:

    a 1573
    the 439
    an 128
    i 23
    if 10
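One possible way to dispatch on these invocation forms is sketched below. The default of 20, the -nnn option, and reading standard input versus named files follow the assignment, but the helper name process_file and the variable names are my own illustrative assumptions.

    #include <stdio.h>
    #include <stdlib.h>

    /* Stand-in for the real routine that extracts and tallies words. */
    static void process_file(FILE *fp)
    {
        int c;
        while ((c = getc(fp)) != EOF)
            ;    /* the real program would call read_word() and tally() here */
    }

    int main(int argc, char *argv[])
    {
        int how_many = 20;   /* default: report the 20 most frequent words */
        int first = 1;       /* index of the first filename argument, if any */
        int i;

        /* An optional leading argument of the form -nnn overrides the default. */
        if (argc > 1 && argv[1][0] == '-') {
            how_many = atoi(argv[1] + 1);
            first = 2;
        }

        if (first >= argc) {
            process_file(stdin);             /* no filenames: read standard input */
        } else {
            for (i = first; i < argc; i++) { /* read each named file in succession */
                FILE *fp = fopen(argv[i], "r");
                if (fp == NULL) {
                    perror(argv[i]);
                    continue;
                }
                process_file(fp);
                fclose(fp);
            }
        }

        /* ... finally, report the how_many most frequent words ... */
        printf("(would report the %d most frequent words here)\n", how_many);
        return 0;
    }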

If you find it helpful, you may assume that a file contains at most 100,000 different words. If your program is given a file that contains more than 100,000 different words, it should print a warning message and then ignore any new words after the first 100,000. Note that, even if you do make this assumption, your program should still produce correct output for the first 100,000 different words encountered; thus, if your program is given as input a file containing the word "frog" repeated 150,000 times, it should not print any warning message, and it should output the proper count (150,000) for the word "frog".
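One way to keep the tally under this optional 100,000-word cap is sketched below. The hash table with linear probing, and names such as tally, MAXWORDS, and TABLESIZE, are my own illustrative choices, not requirements of the assignment.

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    #define MAXWORDS  100000              /* optional cap on different words */
    #define TABLESIZE (2 * MAXWORDS + 1)  /* keep the table at most half full */

    struct entry {
        char word[32];     /* first 31 significant characters plus '\0' */
        long count;
    };

    static struct entry *table[TABLESIZE];
    static int nwords = 0;                /* number of different words so far */
    static int warned = 0;

    /* A simple string hash; the particular function is an arbitrary choice. */
    static unsigned hash(const char *s)
    {
        unsigned h = 0;
        while (*s)
            h = h * 31 + (unsigned char) *s++;
        return h % TABLESIZE;
    }

    /*
     * Record one occurrence of word, which is assumed to have already been
     * folded to lower case and truncated to 31 characters by the input
     * routine.  New words beyond the first MAXWORDS are ignored after one
     * warning, but occurrences of already-known words are always counted.
     */
    static void tally(const char *word)
    {
        unsigned i = hash(word);

        while (table[i] != NULL && strcmp(table[i]->word, word) != 0)
            i = (i + 1) % TABLESIZE;      /* linear probing */
        if (table[i] == NULL) {
            if (nwords >= MAXWORDS) {
                if (!warned) {
                    fprintf(stderr,
                            "warning: more than %d different words; "
                            "further new words will be ignored\n", MAXWORDS);
                    warned = 1;
                }
                return;
            }
            if ((table[i] = malloc(sizeof(struct entry))) == NULL) {
                fprintf(stderr, "out of memory\n");
                exit(1);
            }
            strcpy(table[i]->word, word);
            table[i]->count = 0;
            nwords++;
        }
        table[i]->count++;
    }

Whatever structure you choose, the remaining step is to sort the entries by count (for example with qsort) and print the first nnn of them.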

Once you have written and debugged your program, perform a comparison of the CPU time taken by your program and by the program I wrote, which is available here (rename words.bin to words after downloading). I have also included test files of approximately 1K bytes, 10K bytes, 100K bytes, and 1M bytes. Use the time command as shown in the following example:

    time words testfile.100k > /dev/null

to measure the CPU time taken by your program and my program on each of the test files. Include in a file README an analysis of your measurements. If you were to draw a graph with file size as the independent variable x and CPU time as the dependent variable y, which of the following functions would you say best matches the shape of the graph for your program and for my program? If your program runs slower than mine, can you think of a way in which your program could be improved? Is there any other interesting difference you observed between your program and mine during your tests?

Use the handin program to electronically hand in your work as assignment HW3. Organize your work into several files as follows:

Make sure to submit everything, including any additional test files of your own.


Eugene W. Stark