SUNY Geneseo, Department of Computer Science
CSci 141 , Spring 2005
Prof. Doug Baldwin
Due Monday, May 2
Computer-assisted statistical analysis of literary texts sometimes provides clues to things like authorship or age of a text. For example, a computer program might read a text and gather information on things such as which words are used and how frequently, how often particular groups of words are used in proximity to one another, the lengths of sentences or paragraphs, and similar statistics on word or grammar usage. The hope is that the resulting statistics can help resolve otherwise unsolved puzzles (e.g., perhaps it can be shown that Shakespeare’s plays were really written by someone else based on commonalities in grammatical structure between Shakespeare plays and works by the other author, perhaps the relative ages of two sections of the Bible can be determined based on patterns of word usage, etc.) This assignment asks you to write a program that does simple analysis of word usage in a text file.
You learned in Homework 1 how to read lines of text from files. Much of that knowledge will be helpful here, but now you need to read words, lines of text are less significant. I recommend either of two general approaches to reading words from a text file. Both will require you to use some items from the standard Java library that you may not have used before. Consult the On-Line Documentation for the library for details of these items.
In both cases, you will create a buffered reader connected to the file as you did in Homework 1.
Rather than reading whole lines from a text file, you can read individual
characters. The message that reads one character from a buffered reader is
read
. This message has no parameters, and returns an int
. This return value
is -1 if the end of the buffered reader has been reached, otherwise it is a
positive number that represents a character, and that can be cast to a char
.
For example…
BufferedReader file = new BufferedReader(...);
int i = file.read();
if ( i != -1 ) { // If not end of file...
char c = (char)i;
... c now contains the character from the file ...
As you read characters from the file, you can decide whether or not they are part of a word, and concatenate them onto a string containing the word if they are.
The Character
class contains a number of static methods that are helpful
for working with characters. Two that are likely to be particularly useful
for
this exercise are isLetter
, which takes a char
as its parameter and returns
true if the character is a letter and false otherwise, and toLowerCase
, which
takes a char
as its parameter and, if the character is an upper-case letter,
returns the equivalent lower-case letter (there is also an analogous toUpperCase
method). For example
char c = ...;
if ( Character.isLetter(c) ) {
...
char lower = Character.toLowerCase( c );
...
In text processing, a “token” is, roughly, a smallest-possible
linguistically meaningful fragment of text. For example, a word is a token,
a punctuation symbol is a token, a number is a token, but a letter by itself
(unless it is a one-letter word) is not a token. The standard Java library
provides a very useful, but somewhat complicated, class named StreamTokenizer
that
allows you to read text streams as sequences of tokens.
You create stream tokenizers from Reader
objects whose contents are to be
tokenized. For example, to prepare to read tokens from file “text.txt” you
might use the following Java statements:
BufferedReader in = new BufferedReader( new FileReader( "text.txt" ) );
StreamTokenizer inTokens = new StreamTokenizer( in );
Once you have created a stream tokenizer, you read tokens from it
using the nextToken
message. This message is a little tricky,
in that it doesn’t
directly return the next token from the reader. Instead, it returns an int
whose
value indicates what kind of token was read (a word, a number, etc.), and places
the actual token value in a public member variable of the tokenizer.
Stream tokenizers define constants for the result codes, including
TT_NUMBER
Indicates that that token is a numberTT_WORD
Indicates that the token is a wordTT_EOF
Indicates that the end of the input has been reachedDepending on the token type, the actual value will be in one of the following member variables of the tokenizer:
nval
, If the token is a number (this member is a double
, i.e., it contains
the token’s value as a genuine, immediately usable, number)sval
, If the token is a word (this member is a String
,
which contains the text of the word)For example, here is code that reads tokens from stream tokenizer inTokens
,
adding numbers to a running total and printing words:
int tokenType = inTokens.nextToken(); // Get the first token
while ( tokenType != StreamTokenizer.TT_EOF ) { // Process tokens 'til end of the input
if ( tokenType == StreamTokenizer.TT_NUMBER ) {
total = total + inTokens.nval;
}
else if ( tokenType == StreamTokenizer.TT_WORD ) {
System.out.println( "Read word " + inTokens.sval );
}
tokenType = inTokens.nextToken(); // Go on to the next token
}
Stream tokenizers are extremely customizable. Programmers can tell stream tokenizers which characters to treat as letters, which as digits, which as punctuation, etc. Generally, you do this on a character-by-character (or range of characters) basis, using messages to the stream tokenizer to tell it how to treat each character. Here are a few examples:
inTokens.quoteChar( '"' ); // Treat " as a quotation mark
inTokens.wordChars( 'A', 'Z' ); // Treat the letters A through Z as elements of words
You can also set various boolean flags within a stream tokenizer to turn certain kinds of processing on or off. For example
inTokens.eolIsSignificant( true ); // Treat line breaks as tokens
inTokens.slashSlashComments( true ); // Detect and skip Java-style comments
inTokens.lowerCaseMode( true ); // Convert letters in words to lower case
For details on all of these customizations, see the entry for StreamTokenizer
in the standard Java class library documentation.
Finally, StreamTokenizer
is part of the java.io package, and so will need
to be imported into source files that use it with lines of the form
import java.io.StreamTokenizer;
or
import java.io.*;
Many messages to stream tokenizers throw exceptions, and so methods that use stream tokenizers must either handle those exceptions or declare that they throw the exceptions themselves.
Some of the sample texts I have provided for you to test your programs on are complete works of literature. These come from a project called “Project Gutenberg,” which is a long-running effort to make as many as possible of the world’s uncopyrighted books available electronically for free. Generally the books available from Project Gutenberg are older works on which the copyrights have expired. For more information on Project Gutenberg, see http://www.gutenberg.org/
Design and code a program that reads a text file, and counts the number of times each word that it contains appears. For example, if the file contained the first line of this paragraph (“Design and code a program that reads a text file, and counts the number of times each word that it contains appears”), the program’s output might look something like this:
design 1 and 2 code 1 a 2 program 1 that 2 reads 1 text 1 file 1 counts 1 the 1 number 1 of 1 times 1 each 1 word 1 it 1 contains 1 appears 1
For purposes of this program, define a “word” by the following rules:
It is up to you to decide what algorithms and data structures to use to keep track of which words appear in a file, and how many times. You simply need to produce a progam that is practical to use on real-world texts (e.g., the test files discussed below).
To help you test your programs, I have placed a set of test files on the “Out Boxes” server. You can find these files in the folder named “Texts” within our course folder. These files include the following:
Finally, include as comments in your program a sentence or two about what (if any) relevance asymptotic execution times of algorithms have to “real” programming.
There are many possible extensions to the basic project described above, and you are welcome to incorporate any that you wish. For example, you could extract words from the text in more sophisticated ways, e.g., use more discriminating ways of determining which words are “trivial,” count contractions, numbers, and other things commonly considered words but excluded by the definition above, etc. You could lump variations on a word (e.g., a word and its plural form, all conjugations of the same verb, etc.) together as a single word. You could also output additional statistics on the text, for instance the total number of words, the number of distinct words, etc. If you want a very challenging project, you could try to produce statistics on inter-word relationships (e.g., how many times does a particular pair of words appear, how many times does one word appear close to another, etc.)