Supplemental Material for Baldwin and Scragg, Algorithms and Data Structures: The Science of Computing; Charles River Media, 2004 (Now published by Cengage Learning)
This exercise provides experience with, and insight into, how some of the ideas covered in this book apply in “real-world” settings. In particular, this exercise poses a problem to solve, but leaves the student to use their own judgement about the algorithms or data structures to use — much as a professional software developer relies on their own judgement. The data sets the program must work with are large, just as is the case in many real-world programs. While the amount of code to write for this exercise isn’t large, students do have to decide where to use existing classes, where to adapt those classes to the particular problem at hand, and where to write code from scratch — again, very much the decisions a modern software developer must make in planning a project.
Familiarity with two or more of the following sections and topics from Algorithms and Data Structures: The Science of Computing:
Computer-assisted statistical analysis of literary texts sometimes provides clues to things like authorship or age of a text. For example, a computer program might read a text and gather information on such things as which words are used and how frequently, how often particular groups of words are used in proximity to one another, the lengths of sentences or paragraphs, and similar statistics on word or grammar usage. The hope is that the resulting statistics can help resolve otherwise unsolved puzzles (e.g., perhaps it can be shown that Shakespeare’s plays were really written by someone else based on commonalities in grammatical structure between Shakespeare plays and works by the other author, perhaps the relative ages of two sections of the Bible can be determined based on patterns of word usage, etc.) This exercise involves writing a program that does simple analysis of word usage in a text file.
In order to analyze word usage in a text file, one has to read words from the file in the first place. A complete explanation of Java’s support for text files is beyond the scope of this document, but a summary of some relevant features is feasible. Keep in mind, however, that this a summary, concentrating on features that might be helpful in this particular exercise. Java can do much more with text than is described here, and there are even other ways of reading words for this exercise.
For more information on text input in Java, consult the On-Line Documentation for the standard Java class library.
Five things are typically done in order to read text from a file:
BufferedReader
that will deliver data from the file to the program.As an example, here is a Java program that opens a file named “textfile.txt”, does some (unspecified) processing of text from it, and then closes the file and stops:
import java.io.*;
class InputDemo {
public static void main( String[] args ) throws Exception {
BufferedReader in = new BufferedReader( new FileReader( "textfile.txt" ) );
// ... Code to read and process data from the file goes here ...
in.close();
}
}
This program handles exceptions by propagating them to whatever called main
(the phrase throws Exception
in the declaration for main
asserts that main
may propagate exceptions to its caller). The first line of main
creates the BufferedReader
for file “textfile.txt”. The second line is a placeholder for code to read text from the file; see below for ideas about the code that would go here. Finally, the third line sends the BufferedReader
a close
message, which closes the file.
Once the file is opened, either of two approaches to reading words from the file seems appropriate for this exercise: read text character by character, and assemble the characters into words, or use an object called a “stream tokenizer” that can break the text in the file into words automatically.
The message that reads one character from a buffered reader is read
. This message has no parameters, and returns an int
. This return value is -1 if all the characters in the buffered reader have already been read, otherwise it is a positive number that represents a character, and that can be cast to a char
. For example…
BufferedReader file = new BufferedReader(...);
int i = file.read();
if ( i != -1 ) { // If not end of file...
char c = (char)i;
// ... c now contains the character from the file ...
As it reads characters from the file, a program can decide whether or not they are part of a word, and concatenate them onto a string containing the word if they are.
The Character
class contains a number of static methods that are helpful for working with characters. Two that are likely to be particularly useful for this exercise are isLetter
, which takes a char
as its parameter and returns true if the character is a letter and false otherwise, and toLowerCase
, which takes a char
as its parameter and, if the character is an upper-case letter, returns the equivalent lower-case letter (there is also an analogous toUpperCase
method). For example
char c = ...;
if ( Character.isLetter(c) ) {
...
char lower = Character.toLowerCase( c );
...
In text processing, a “token” is, roughly, a smallest-possible linguistically meaningful fragment of text. For example, a word is a token, a punctuation symbol is a token, a number is a token, but a letter by itself (unless it is a one-letter word) is not a token. The standard Java library provides a very useful, but somewhat complicated, class named StreamTokenizer
that reads text streams as sequences of tokens.
Stream tokenizers are created from Reader
objects whose contents are to be tokenized. For example, to prepare to read tokens from file “text.txt” one might use the following Java statements:
BufferedReader in = new BufferedReader( new FileReader( "text.txt" ) );
StreamTokenizer inTokens = new StreamTokenizer( in );
After creating a stream tokenizer, read tokens from it using the nextToken
message. This message is a little tricky, in that it doesn’t directly return the next token from the reader. Instead, it returns an int
whose value indicates what kind of token was read (a word, a number, etc.), and places the actual token value in a public member variable of the tokenizer. Stream tokenizers define constants for the result codes, including
TT_NUMBER
Indicates that the token is a numberTT_WORD
Indicates that the token is a wordTT_EOF
Indicates that the end of the input has been reachedDepending on the token type, the actual value will be in one of the following member variables of the tokenizer:
nval
, If the token is a number (this member is a double
, i.e., it contains the token’s value as a genuine, immediately usable, number)sval
, If the token is a word (this member is a String
, which contains the text of the word)For example, here is code that reads tokens from stream tokenizer inTokens
, adding numbers to a running total and printing words:
int tokenType = inTokens.nextToken(); // Get the first token
while ( tokenType != StreamTokenizer.TT_EOF ) { // Process tokens 'til end of the input
if ( tokenType == StreamTokenizer.TT_NUMBER ) {
total = total + inTokens.nval;
}
else if ( tokenType == StreamTokenizer.TT_WORD ) {
System.out.println( "Read word " + inTokens.sval );
}
tokenType = inTokens.nextToken(); // Go on to the next token
}
Stream tokenizers are extremely customizable. Programmers can tell stream tokenizers which characters to treat as letters, which as digits, which as punctuation, etc. Generally, this is done on a character-by-character (or range of characters) basis, using messages to the stream tokenizer to tell it how to treat each character. Here are a few examples:
inTokens.quoteChar( '"' ); // Treat " as a quotation mark
inTokens.wordChars( 'A', 'Z' ); // Treat the letters A through Z as elements of words
One can also set various boolean flags within a stream tokenizer to turn certain kinds of processing on or off. For example
inTokens.eolIsSignificant( true ); // Treat line breaks as tokens
inTokens.slashSlashComments( true ); // Detect and skip Java-style comments
inTokens.lowerCaseMode( true ); // Convert letters in words to lower case
For details on all of these customizations, see the entry for StreamTokenizer
in the standard Java class library documentation.
Finally, StreamTokenizer
is part of the java.io package, and so will need to be imported into source files that use it with lines of the form
import java.io.StreamTokenizer;
or
import java.io.*;
Many messages to stream tokenizers throw exceptions, and so methods that use stream tokenizers must either handle those exceptions or declare that they throw the exceptions themselves.
Design and code a program that reads a text file, and counts the number of times each word that it contains appears. For example, if the file contained the first line of this paragraph (“Design and code a program that reads a text file, and counts the number of times each word that it contains appears”), the program’s output might look something like this:
design 1 and 2 code 1 a 2 program 1 that 2 reads 1 text 1 file 1 counts 1 the 1 number 1 of 1 times 1 each 1 word 1 it 1 contains 1 appears 1
For purposes of this program, define a “word” by the following rules:
It is up to the student to decide what algorithms and data structures to use to keep track of which words appear in a file, and how many times. The only requirement is to produce a progam that is practical to use on real-world texts.
Copyright © 2005 Charles River Media. All rights reserved.
Revised Aug. 28, 2005 by Doug Baldwin