Laboratory Exercise

Counting Words

Supplemental Material for Baldwin and Scragg, Algorithms and Data Structures: The Science of Computing; Charles River Media, 2004 (Now published by Cengage Learning)

Site Index

Purpose

This exercise provides experience with, and insight into, how some of the ideas covered in this book apply in “real-world” settings. In particular, this exercise poses a problem to solve, but leaves the student to use their own judgement about the algorithms or data structures to use — much as a professional software developer relies on their own judgement. The data sets the program must work with are large, just as is the case in many real-world programs. While the amount of code to write for this exercise isn’t large, students do have to decide where to use existing classes, where to adapt those classes to the particular problem at hand, and where to write code from scratch — again, very much the decisions a modern software developer must make in planning a project.

Prerequisites

Familiarity with two or more of the following sections and topics from Algorithms and Data Structures: The Science of Computing:

Chapter 11, Lists
Chapter 13, Binary trees
Section 14.2, Representing dynamic data structures with arrays
Section 14.3, Hash tables

Background

Computer-assisted statistical analysis of literary texts sometimes provides clues to things like authorship or age of a text. For example, a computer program might read a text and gather information on such things as which words are used and how frequently, how often particular groups of words are used in proximity to one another, the lengths of sentences or paragraphs, and similar statistics on word or grammar usage. The hope is that the resulting statistics can help resolve otherwise unsolved puzzles (e.g., perhaps it can be shown that Shakespeare’s plays were really written by someone else based on commonalities in grammatical structure between Shakespeare plays and works by the other author, perhaps the relative ages of two sections of the Bible can be determined based on patterns of word usage, etc.) This exercise involves writing a program that does simple analysis of word usage in a text file.

Reading Words from Files

In order to analyze word usage in a text file, one has to read words from the file in the first place. A complete explanation of Java’s support for text files is beyond the scope of this document, but a summary of some relevant features is feasible. Keep in mind, however, that this a summary, concentrating on features that might be helpful in this particular exercise. Java can do much more with text than is described here, and there are even other ways of reading words for this exercise.

For more information on text input in Java, consult the On-Line Documentation for the standard Java class library.

Basics

Five things are typically done in order to read text from a file:

The file is opened, and connected to a BufferedReader that will deliver data from the file to the program.
Data is read from the file.
When all required data has been read, the file is closed.
Throughout the process of opening the file, reading it, and closing it, exceptions may be thrown and so must be handled somehow.
Because the classes and exceptions related to input are part of the Java library package “java.io”, any Java source file that reads text files must import that package.

As an example, here is a Java program that opens a file named “textfile.txt”, does some (unspecified) processing of text from it, and then closes the file and stops:

    import java.io.*;
    class InputDemo {
        public static void main( String[] args ) throws Exception {
            BufferedReader in = new BufferedReader( new FileReader( "textfile.txt" ) );
            // ... Code to read and process data from the file goes here ...
            in.close();
        }
    }

This program handles exceptions by propagating them to whatever called main (the phrase throws Exception in the declaration for main asserts that main may propagate exceptions to its caller). The first line of main creates the BufferedReader for file “textfile.txt”. The second line is a placeholder for code to read text from the file; see below for ideas about the code that would go here. Finally, the third line sends the BufferedReader a close message, which closes the file.

Once the file is opened, either of two approaches to reading words from the file seems appropriate for this exercise: read text character by character, and assemble the characters into words, or use an object called a “stream tokenizer” that can break the text in the file into words automatically.

Approach 1: Read Characters

The message that reads one character from a buffered reader is read. This message has no parameters, and returns an int. This return value is -1 if all the characters in the buffered reader have already been read, otherwise it is a positive number that represents a character, and that can be cast to a char. For example…

    BufferedReader file = new BufferedReader(...);
    int i = file.read();
    if ( i != -1 ) {   // If not end of file...
        char c = (char)i;
        // ... c now contains the character from the file ...

As it reads characters from the file, a program can decide whether or not they are part of a word, and concatenate them onto a string containing the word if they are.

The Character class contains a number of static methods that are helpful for working with characters. Two that are likely to be particularly useful for this exercise are isLetter, which takes a char as its parameter and returns true if the character is a letter and false otherwise, and toLowerCase, which takes a char as its parameter and, if the character is an upper-case letter, returns the equivalent lower-case letter (there is also an analogous toUpperCase method). For example

    char c = ...;
    if ( Character.isLetter(c) ) {
        ...
        char lower = Character.toLowerCase( c );
        ...

Approach 2: Use a Stream Tokenizer

In text processing, a “token” is, roughly, a smallest-possible linguistically meaningful fragment of text. For example, a word is a token, a punctuation symbol is a token, a number is a token, but a letter by itself (unless it is a one-letter word) is not a token. The standard Java library provides a very useful, but somewhat complicated, class named StreamTokenizer that reads text streams as sequences of tokens.

Stream tokenizers are created from Reader objects whose contents are to be tokenized. For example, to prepare to read tokens from file “text.txt” one might use the following Java statements:

    BufferedReader in = new BufferedReader( new FileReader( "text.txt" ) );
    StreamTokenizer inTokens = new StreamTokenizer( in );

After creating a stream tokenizer, read tokens from it using the nextToken message. This message is a little tricky, in that it doesn’t directly return the next token from the reader. Instead, it returns an int whose value indicates what kind of token was read (a word, a number, etc.), and places the actual token value in a public member variable of the tokenizer. Stream tokenizers define constants for the result codes, including

TT_NUMBER Indicates that the token is a number
TT_WORD Indicates that the token is a word
TT_EOF Indicates that the end of the input has been reached

Depending on the token type, the actual value will be in one of the following member variables of the tokenizer:

nval, If the token is a number (this member is a double, i.e., it contains the token’s value as a genuine, immediately usable, number)
sval, If the token is a word (this member is a String, which contains the text of the word)

For example, here is code that reads tokens from stream tokenizer inTokens, adding numbers to a running total and printing words:

    int tokenType = inTokens.nextToken();             // Get the first token
    while ( tokenType != StreamTokenizer.TT_EOF ) {   // Process tokens 'til end of the input
        if ( tokenType == StreamTokenizer.TT_NUMBER ) {
            total = total + inTokens.nval;
        }
        else if ( tokenType == StreamTokenizer.TT_WORD ) {
            System.out.println( "Read word " + inTokens.sval );
        }
        tokenType = inTokens.nextToken();             // Go on to the next token
    }

Stream tokenizers are extremely customizable. Programmers can tell stream tokenizers which characters to treat as letters, which as digits, which as punctuation, etc. Generally, this is done on a character-by-character (or range of characters) basis, using messages to the stream tokenizer to tell it how to treat each character. Here are a few examples:

    inTokens.quoteChar( '"' );       // Treat " as a quotation mark
    inTokens.wordChars( 'A', 'Z' );  // Treat the letters A through Z as elements of words

One can also set various boolean flags within a stream tokenizer to turn certain kinds of processing on or off. For example

    inTokens.eolIsSignificant( true );     // Treat line breaks as tokens
    inTokens.slashSlashComments( true );   // Detect and skip Java-style comments
    inTokens.lowerCaseMode( true );        // Convert letters in words to lower case

For details on all of these customizations, see the entry for StreamTokenizer in the standard Java class library documentation.

Finally, StreamTokenizer is part of the java.io package, and so will need to be imported into source files that use it with lines of the form

    import java.io.StreamTokenizer;

    import java.io.*;

Many messages to stream tokenizers throw exceptions, and so methods that use stream tokenizers must either handle those exceptions or declare that they throw the exceptions themselves.

Exercise

Design and code a program that reads a text file, and counts the number of times each word that it contains appears. For example, if the file contained the first line of this paragraph (“Design and code a program that reads a text file, and counts the number of times each word that it contains appears”), the program’s output might look something like this:

    design    1
    and       2
    code      1
    a         2
    program   1
    that      2
    reads     1
    text      1
    file      1
    counts    1
    the       1
    number    1
    of        1
    times     1
    each      1
    word      1
    it        1
    contains  1
    appears   1

For purposes of this program, define a “word” by the following rules:

A word is any group of contiguous letters from the file. In other words, spaces, punctuation, and other non-letter characters separate words, and are not words themselves. This rule also means that numbers written as strings of digits (e.g., “100”) aren’t counted as words, and contractions are counted as two words separated by an apostrophe.
Capitalization alone does not make words different. For example “Design” and “design” should be considered the same word.
It is usual in programs that analyze text to ignore “trivial” words such as “a,” “the,” etc. For this exercise, consider any word with 3 or fewer letters in it to be trivial (so the list used in the example above would actually not have the entries for “a,” “the,” “of,” or “it”). This definition of “trivial” words also conveniently eliminates the silliest consequences of the suggested handling of contractions.

It is up to the student to decide what algorithms and data structures to use to keep track of which words appear in a file, and how many times. The only requirement is to produce a progam that is practical to use on real-world texts.

Final Details

Turn in your program as directed by your instructor.

Revised Aug. 28, 2005 by Doug Baldwin

Site Index