For this assignment, you are to write a C program
reads a text file, makes a tally of all the different words used in the
file, and prints out a list of the words that appear most frequently,
along with the number of times each word appears. This list should be
printed in descending order of frequency of appearance, so that the
most-frequently-used word is printed first, then the second-most-frequently
used word, etc.
For the purpose of this assignment, a word is any sequence of upper and lower-case letters that appears in the input file preceded immediately either by the beginning of the file or a non-letter and followed immediately either by the end of the file or a non-letter. Except for their function in delimiting the words in the file, your program should ignore non-letter characters such as punctuation, digits, and control characters. Your program should not make any assumption about the maximum length of a word in the input file, but you may regard two words as equal unless they differ somewhere in their first 31 characters. Your program should be case-insensitive; that is, if two words differ only in that one has upper-case letters where the other has lower-case letters, then the words should be regarded as equal.
It should be possible to invoke
words by a command of any one
of the following forms:
wordsshould read the text file from the standard input, and the output should contain information about the 20 most-frequently used words. If invoked with a single command-line argument of the form
nnnis a number,
wordsshould read the text file from the standard input, and produce output concerning the
nnnmost-frequently used words. If invoked with a list of filenames as arguments,
wordsshould read each of the files in succession, and at the end produce information about the 20 most-frequently used words. Finally, if invoked with a list of filenames and a
wordsshould read each of the files and at the end produce information about the
nnnmost-frequently used words. An example of the output of
words -5would be as follows:
If you find it helpful, you may assume that a file contains at most 100,000 different words. If your program is given a file that contains more than 100,000 different words, it should print a warning message and then ignore any new words after the first 100,000. Note that, even if you do make this assumption, your program should still produce correct output for the first 100,000 different words encountered; thus, if your program is given as input a file containing the word "frog" repeated 150,000 times, it should not print any warning message and it will output the proper count (150,000) for the word "frog".
Once you have written and debugged your program, perform a comparison of
the CPU time taken by your program and by the program I wrote, which is
words after downloading).
I have also included
test files of approximately 1K bytes, 10K bytes, 100K bytes, and 1M bytes.
time command as shown in the following example:
xand CPU time as the dependent variable
y, which of the following functions:
y = x
y = x^2
y = exp(x)
handin program to electronically hand in your work
as assignment HW3. Organize your work into several files as follows:
READMEshould contain a human-readable report on what you did, and should describe the contents of the other files and how to compile and run them.
buildshould be a POSIX shell script to compile your program and produce an executable file
demoshould be a shell script to demonstrate the operation of your program on some test files.
words.c(you may choose another name, or split your program into more than one file), should contain the source to your program