SUNY Geneseo Department of Computer Science
CSci 240, Spring 2007
Prof. Doug Baldwin
Due Tuesday, March 20
There are scholars who find statistical analysis of word usage in literary texts to be a source of clues about authorship, dates, etc. For example, it might be possible to make a good guess about who wrote a particular work by comparing the frequency of use of various words in it to the frequencies in contemporary works by known authors, or it might be possible to determine approximately when a particular work was written by comparing its vocabulary to that in widespread use at various times. This exercise asks you to write a program that produces simple statistics about word usage in text documents. Specifically, you are to produce an alphabetized list of all the words used in the document, with the number of times each is used, the total number of words, and the total number of distinct words (see the "Exercise" section below for a more detailed description of the assignment).
You might guess from the phrase "alphabetized list" that sorting will play a central role in your program, and you are right. There are many classical algorithms for sorting, two of which we have studied in some depth so far: insertion sort and selection sort. Both of these algorithms have worst-case (and average-case) execution times of Θ(n²), where n is the number of data items being sorted. Part of what you need to do for this lab is code one of these algorithms in Java, for use in the text-analysis program.
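For reference, insertion sort in its generic, textbook form might be coded in Java roughly as follows. This is a sketch only; the method name, the use of an int array, and the variable names are my own choices, not part of the assignment:

    // Sort the array a into ascending order using insertion sort.
    static void insertionSort(int[] a) {
        for (int i = 1; i < a.length; i++) {
            int key = a[i];        // next item to insert into the sorted prefix a[0..i-1]
            int j = i - 1;
            while (j >= 0 && a[j] > key) {
                a[j + 1] = a[j];   // shift larger items one position to the right
                j--;
            }
            a[j + 1] = key;        // put the item into its proper place
        }
    }

Selection sort would serve equally well; the "Exercise" section below lets you use either.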
Coding the central algorithm is only part, and often only a small part, of what it takes to build a complete program around an algorithm. The algorithm in its generic, textbook form almost certainly needs some translation in order to work with the data types and structures of any particular program. Furthermore, even the translated algorithm is only part of the program, and so some thought has to be devoted to how the algorithm will contribute to solving a larger problem, how the data relevant to that problem can be input to and output from the algorithm, etc. This exercise also requires you to deal with such larger issues of making a textbook algorithm a useful part of a real program.
There are a number of Java library classes that may be useful in this program. I point you to these classes here, but encourage you to read about them in the On-Line Java Library Documentation to learn how to use them in your own programs.
One of the things you need to do is read text from a file and break that text into words. The Scanner class, which is new in Java 1.5, can be extremely useful for this. If you don't want to use Java 1.5 features, the older StreamTokenizer class can also be used.
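For example, a Scanner can be attached to a file and asked for whitespace-delimited tokens one at a time. The sketch below is only an illustration, not required code; the method name and its parameter are my own, and the error handling is minimal:

    import java.io.File;
    import java.io.FileNotFoundException;
    import java.util.Scanner;

    // Read every whitespace-delimited token from the named file and print it.
    // (In the real program you would record each token rather than print it.)
    static void printTokens(String fileName) throws FileNotFoundException {
        Scanner in = new Scanner(new File(fileName));
        while (in.hasNext()) {
            String token = in.next();   // next run of non-space characters
            System.out.println(token);
        }
        in.close();
    }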
A program can interactively read a file name from its user via the JFileChooser class. This is much easier for users than trying to type a full pathname for the file in the console window.
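A typical use of JFileChooser looks something like the sketch below (again, only an illustration; the method name is my own choice):

    import java.io.File;
    import javax.swing.JFileChooser;

    // Ask the user to pick a file with a standard dialog box.  Returns the
    // chosen File, or null if the user cancels the dialog.
    static File askForFile() {
        JFileChooser chooser = new JFileChooser();
        int result = chooser.showOpenDialog(null);   // null: no parent window
        if (result == JFileChooser.APPROVE_OPTION) {
            return chooser.getSelectedFile();
        } else {
            return null;
        }
    }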
You will also need to make a list of words in the document, and be able to sort that list, iterate over it in order, etc. You could use an ordinary array for this, but then you would have to know ahead of time how many words were in the document. The Vector class is an alternative to an array. Vector objects act much like arrays, except that they can expand as items are added to them, so you needn't know in advance how many items will be in a Vector. You were introduced to Vector in Lab 4 on debugging and proofs.
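The handful of Vector operations sketched below are probably the ones you will use most; the class name and variable names are my own choices:

    import java.util.Vector;

    public class VectorDemo {
        public static void main(String[] args) {
            Vector<String> words = new Vector<String>();   // starts empty, grows as needed
            words.add("question");                         // append an item at the end
            words.add("be");
            words.set(0, "to");                            // replace the item at index 0
            System.out.println(words.get(1));              // fetch the item at index 1 ("be")
            System.out.println(words.size());              // number of items currently stored (2)
        }
    }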
I have given you some real works of literature to run your program on (listed in the "Exercise" section below). These come from the Gutenberg Project, an organization that is trying to put much, if not all, of the world's non-copyrighted literature into plain-text, computer-readable form. You can learn more about the Gutenberg Project, and/or download your own favorite literature to analyze, at http://www.gutenberg.org/wiki/Main_Page (if you're typing this yourself, there is an underscore between "Main" and "Page").
Write a program that reads a text document from a file and produces the following output:
- An alphabetized list of all the distinct words in the document, with the number of times each one is used
- The total number of words in the document
- The total number of distinct words in the document
The phrase "distinct word" means an element of the vocabulary of whatever language the document is written in. Each of these elements will generally appear multiple times in the document. The "total number of words" in the document is basically the sum of the number of occurrences of all the distinct words. For example, suppose the document contains exactly the sentence
To be or not to be, that is the question.
This "document" contains 8 distinct words, namely (in alphabetical order) "be," "is," "not," "or," "question," "that," "the," and "to." Of these distinct words, "be" and "to" appear twice, and the others appear once, for 10 words total.
Define a word to be any sequence of non-space characters, without leading or trailing punctuation (e.g., "Sentence" is a word, but "Sentence." needs to have the final "." stripped off before it counts as a word). Characters that act as punctuation may, however, appear inside a word. For example, "isn't" is a single word despite the apostrophe inside it, whereas "'quoted'" needs to have the single-quotation marks stripped off its beginning and end to count as a word, even though those single-quotation marks are the same character as the apostrophe in most computer character sets. Note, also, that numbers, such as "124", count as words under this definition.
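One possible way to do that stripping, sketched below, is to treat letters and digits as word characters and trim everything else from the two ends. This is only one of many reasonable approaches, and the method name is my own:

    // Remove leading and trailing characters that are not letters or digits,
    // leaving interior punctuation (such as the apostrophe in "isn't") alone.
    static String stripPunctuation(String token) {
        int start = 0;
        int end = token.length();
        while (start < end && !Character.isLetterOrDigit(token.charAt(start))) {
            start++;              // skip leading punctuation
        }
        while (end > start && !Character.isLetterOrDigit(token.charAt(end - 1))) {
            end--;                // skip trailing punctuation
        }
        return token.substring(start, end);   // may be empty if the token was all punctuation
    }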
There are many ways to identify and count distinct words in a document, but you must use the following strategy: make a list of all the words in the document, and sort it into alphabetical order. This necessarily leaves all occurrences of each distinct word adjacent in the sorted list, whereupon you can step through the list counting how many times each word appears. Furthermore, you must use either insertion sort or selection sort as your sorting algorithm.
In sorting the list of words, and subsequently counting distinct words, beware of capitalization differences! "Word" and "word" are the same distinct word, even though one begins with a capital "W" and the other with a small "w".
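To illustrate what adapting the textbook sort to this program's data might look like, here is a sketch of insertion sort on a Vector of words, followed by a pass over the sorted list that counts occurrences. It is only a sketch, with my own method and variable names, and the case-insensitive comparisons are one way (not the only way) to handle the capitalization issue noted above:

    import java.util.Vector;

    // Sort the words into alphabetical order, ignoring case, using insertion sort.
    static void insertionSort(Vector<String> words) {
        for (int i = 1; i < words.size(); i++) {
            String key = words.get(i);
            int j = i - 1;
            while (j >= 0 && words.get(j).compareToIgnoreCase(key) > 0) {
                words.set(j + 1, words.get(j));   // shift larger words one place right
                j--;
            }
            words.set(j + 1, key);
        }
    }

    // Walk the sorted list, printing each distinct word with its count,
    // and then print the totals.
    static void reportCounts(Vector<String> words) {
        int distinct = 0;
        int i = 0;
        while (i < words.size()) {
            String word = words.get(i);
            int count = 0;
            while (i < words.size() && words.get(i).equalsIgnoreCase(word)) {
                count++;
                i++;
            }
            distinct++;
            System.out.println(word + ": " + count);
        }
        System.out.println("Total words: " + words.size());
        System.out.println("Distinct words: " + distinct);
    }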
The requirement that you use insertion sort or selection sort deliberately hobbles your program. I expect that these algorithms' Θ(n²) execution times will make your program very slow. However, that is the main point of this exercise: to get you to see for yourself what, if anything, asymptotic execution times like Θ(n²) mean for real programs processing real data.
To help you test your program, I have provided some sample documents for you. These are on the "OutBox" file server, in folder "/ComputerScience/baldwin/CSci240/WordCount". The samples are as follows:
Note that the first two of these files are typical of what a programmer might test a program such as this on, and, in fact, I have provided them precisely because they should help you test your program. However, the last four files are more typical of what an actual user would want to run a program such as this on.
Finally, in addition to counting words in documents, I want you to estimate how long your program would take to process War and Peace, assuming that it contains 564,000 words. You may arrive at this estimate by actually running your program on the "WarAndPeace.txt" file, although you can also extrapolate from smaller files.
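For example (with purely hypothetical numbers): if your program takes 30 seconds on a file of 56,400 words, then, because a Θ(n²) sort's running time grows roughly with the square of the number of words, a file ten times as long should take roughly 10² = 100 times as long, or about 3,000 seconds (around 50 minutes).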
I will grade this exercise in a face-to-face meeting with you. Make an appointment to meet with me at some time convenient for you, as long as that time is before the end of the due date above. Meetings only need to be 15 minutes long. You can make an appointment by signing up on the copy of my schedule on the bulletin board outside my office.
I would like to read and run your program during our meeting. To do this, you can either bring the program to my office on a laptop computer, or we can go to a lab during the meeting.