SUNY Geneseo, Department of Computer Science

Homework 2— Counting Words

CSci 141 , Spring 2005

Prof. Doug Baldwin

Due Monday, May 2

Purpose

This assignment gives you some experience with, and insight into, how some of the ideas covered in this course apply in more “real-world” settings than other exercises were able to do. In particular, this exercise gives you a problem to solve, but leaves you to use your own judgement about the algorithms or data structures to use — much as a professional software developer relies on their own judgement. The data sets your program must work with are large, just as is the case in many real-world programs. While the amount of code you need to write for this exercise isn’t large, you do have to decide where you can use existing classes, where you need to adapt those classes to the particular problem at hand, and where you have to write code from scratch — again, very much the decisions a modern software developer must make in planning a project.

Background

Computer-assisted statistical analysis of literary texts sometimes provides clues to things like authorship or age of a text. For example, a computer program might read a text and gather information on things such as which words are used and how frequently, how often particular groups of words are used in proximity to one another, the lengths of sentences or paragraphs, and similar statistics on word or grammar usage. The hope is that the resulting statistics can help resolve otherwise unsolved puzzles (e.g., perhaps it can be shown that Shakespeare’s plays were really written by someone else based on commonalities in grammatical structure between Shakespeare plays and works by the other author, perhaps the relative ages of two sections of the Bible can be determined based on patterns of word usage, etc.) This assignment asks you to write a program that does simple analysis of word usage in a text file.

Reading Words from Files

You learned in Homework 1 how to read lines of text from files. Much of that knowledge will be helpful here, but now you need to read words, lines of text are less significant. I recommend either of two general approaches to reading words from a text file. Both will require you to use some items from the standard Java library that you may not have used before. Consult the On-Line Documentation for the library for details of these items.

In both cases, you will create a buffered reader connected to the file as you did in Homework 1.

Approach 1: Read Characters

Rather than reading whole lines from a text file, you can read individual characters. The message that reads one character from a buffered reader is read. This message has no parameters, and returns an int. This return value is -1 if the end of the buffered reader has been reached, otherwise it is a positive number that represents a character, and that can be cast to a char. For example…

    BufferedReader file = new BufferedReader(...);
    int i = file.read();
	 if ( i != -1 ) {   // If not end of file...
	    char c = (char)i;
		... c now contains the character from the file ...

As you read characters from the file, you can decide whether or not they are part of a word, and concatenate them onto a string containing the word if they are.

The Character class contains a number of static methods that are helpful for working with characters. Two that are likely to be particularly useful for this exercise are isLetter, which takes a char as its parameter and returns true if the character is a letter and false otherwise, and toLowerCase, which takes a char as its parameter and, if the character is an upper-case letter, returns the equivalent lower-case letter (there is also an analogous toUpperCase method). For example

    char c = ...;
    if ( Character.isLetter(c) ) {
	   ...
	   char lower = Character.toLowerCase( c );
	   ...

Approach 2: Use a Stream Tokenizer

In text processing, a “token” is, roughly, a smallest-possible linguistically meaningful fragment of text. For example, a word is a token, a punctuation symbol is a token, a number is a token, but a letter by itself (unless it is a one-letter word) is not a token. The standard Java library provides a very useful, but somewhat complicated, class named StreamTokenizer that allows you to read text streams as sequences of tokens.

You create stream tokenizers from Reader objects whose contents are to be tokenized. For example, to prepare to read tokens from file “text.txt” you might use the following Java statements:

    BufferedReader in = new BufferedReader( new FileReader( "text.txt" ) );
    StreamTokenizer inTokens = new StreamTokenizer( in );

Once you have created a stream tokenizer, you read tokens from it using the nextToken message. This message is a little tricky, in that it doesn’t directly return the next token from the reader. Instead, it returns an int whose value indicates what kind of token was read (a word, a number, etc.), and places the actual token value in a public member variable of the tokenizer. Stream tokenizers define constants for the result codes, including

TT_NUMBER Indicates that that token is a number
TT_WORD Indicates that the token is a word
TT_EOF Indicates that the end of the input has been reached

Depending on the token type, the actual value will be in one of the following member variables of the tokenizer:

nval, If the token is a number (this member is a double, i.e., it contains the token’s value as a genuine, immediately usable, number)
sval, If the token is a word (this member is a String, which contains the text of the word)

For example, here is code that reads tokens from stream tokenizer inTokens, adding numbers to a running total and printing words:

    int tokenType = inTokens.nextToken();             // Get the first token
    while ( tokenType != StreamTokenizer.TT_EOF ) {   // Process tokens 'til end of the input
        if ( tokenType == StreamTokenizer.TT_NUMBER ) {
            total = total + inTokens.nval;
        }
        else if ( tokenType == StreamTokenizer.TT_WORD ) {
            System.out.println( "Read word " + inTokens.sval );
        }
        tokenType = inTokens.nextToken();             // Go on to the next token
    }

Stream tokenizers are extremely customizable. Programmers can tell stream tokenizers which characters to treat as letters, which as digits, which as punctuation, etc. Generally, you do this on a character-by-character (or range of characters) basis, using messages to the stream tokenizer to tell it how to treat each character. Here are a few examples:

    inTokens.quoteChar( '"' );       // Treat " as a quotation mark
    inTokens.wordChars( 'A', 'Z' );  // Treat the letters A through Z as elements of words

You can also set various boolean flags within a stream tokenizer to turn certain kinds of processing on or off. For example

    inTokens.eolIsSignificant( true );     // Treat line breaks as tokens
    inTokens.slashSlashComments( true );   // Detect and skip Java-style comments
    inTokens.lowerCaseMode( true );        // Convert letters in words to lower case

For details on all of these customizations, see the entry for StreamTokenizer in the standard Java class library documentation.

Finally, StreamTokenizer is part of the java.io package, and so will need to be imported into source files that use it with lines of the form

import java.io.StreamTokenizer;

import java.io.*;

Many messages to stream tokenizers throw exceptions, and so methods that use stream tokenizers must either handle those exceptions or declare that they throw the exceptions themselves.

Project Gutenberg

Some of the sample texts I have provided for you to test your programs on are complete works of literature. These come from a project called “Project Gutenberg,” which is a long-running effort to make as many as possible of the world’s uncopyrighted books available electronically for free. Generally the books available from Project Gutenberg are older works on which the copyrights have expired. For more information on Project Gutenberg, see http://www.gutenberg.org/

Exercise

Design and code a program that reads a text file, and counts the number of times each word that it contains appears. For example, if the file contained the first line of this paragraph (“Design and code a program that reads a text file, and counts the number of times each word that it contains appears”), the program’s output might look something like this:

    design    1
    and       2
    code      1
    a         2
    program   1
    that      2
    reads     1
    text      1
    file      1
    counts    1
    the       1
    number    1
    of        1
    times     1
    each      1
    word      1
    it        1
    contains  1
    appears   1

For purposes of this program, define a “word” by the following rules:

A word is any group of contiguous letters from the file. In other words, spaces, punctuation, and other non-letter characters separate words, and are not words themselves. This rule also means that numbers written as strings of digits (e.g., “100”) aren’t counted as words, and contractions are counted as two words separated by an apostrophe.
Capitalization alone does not make words different. For example “Design” and “design” should be considered the same word.
It is usual in programs that index or analyze text to ignore “trivial” words such as “a,” “the,” etc. For this exercise, consider any word with 3 or fewer letters in it to be trivial (so the list used an example above would actually not have the entries for “a,” “the,” “of,” or “it”). This definition of “trivial” words also conveniently eliminates the silliest consequences of the suggested handling of contractions.

It is up to you to decide what algorithms and data structures to use to keep track of which words appear in a file, and how many times. You simply need to produce a progam that is practical to use on real-world texts (e.g., the test files discussed below).

To help you test your programs, I have placed a set of test files on the “Out Boxes” server. You can find these files in the folder named “Texts” within our course folder. These files include the following:

test.txt. A short file designed purely to test this exercise. This file contains a list of its own words, and their numbers of occurrences, to help you check the output of your program.
treelab.txt. The plain text of an old CSci 141 lab handout (i.e., all the HTML markup that makes it look nice on the Web has been removed).
Usher.txt. The text of Edgar Alan Poe’s short story “The Fall of the House of Usher.”
ChristmasCarol.txt. The text of Charles Dicken’s “A Christmas Carol.”
MobyDick.txt. The text of the novel by Herman Melville.
WarAndPeace.txt. The text of the novel by Leo Tolstoy.

Finally, include as comments in your program a sentence or two about what (if any) relevance asymptotic execution times of algorithms have to “real” programming.

There are many possible extensions to the basic project described above, and you are welcome to incorporate any that you wish. For example, you could extract words from the text in more sophisticated ways, e.g., use more discriminating ways of determining which words are “trivial,” count contractions, numbers, and other things commonly considered words but excluded by the definition above, etc. You could lump variations on a word (e.g., a word and its plural form, all conjugations of the same verb, etc.) together as a single word. You could also output additional statistics on the text, for instance the total number of words, the number of distinct words, etc. If you want a very challenging project, you could try to produce statistics on inter-word relationships (e.g., how many times does a particular pair of words appear, how many times does one word appear close to another, etc.)

Follow-Up

This exercise is due on Monday, May 2, as indicated above. Turn in a printout of the program you wrote. Along with this printout (or incorporated in it as comments), explain how you tested the program, including what files you tested it on and how long it took to process those files. You do not need to turn in sample outputs from the program (for interesting documents, they will be too long to be practical to turn in).