SUNY Geneseo Department of Computer Science
CSci 240, Spring 2007
Prof. Doug Baldwin
Due Tuesday, April 3
The primary purpose of this lab is to give you a tangible comparison between the "fast" (i.e., Θ(nlogn)) and "slow" (i.e., Θ(n2)) sorting algorithms. There is also a secondary goal, to give you further practice making and reporting convincing empirical measurements on programs.
In Lab 7, you used a Θ(n2) sorting algorithm to help you identify the distinct words in a document, and count how many times each appears. This lab asks you to rewrite that program with a more efficient sorting algorithm, and to empirically verify that doing so lets you achieve a particular performance goal.
Reasonable sorting algorithms to use for this exercise are Quicksort and Mergesort, both discussed in Chapter 10 of our textbook.
You will also need to measure the execution time of your program, and compare that time to an asymptotic target. For information on measuring time in a program, see Chapter 4 of our textbook. Asymptotic notation, and ways of testing whether measurements are consistent with an asymptotic expression, are discussed in Section 9.2.
The word-counting program you wrote for Lab 7 is not practical to use on realistically sized documents, largely because it uses a sorting algorithm that is too slow. Now, however, you know about faster sorting algorithms, and so can use them to make the word counter practical. That is your task in this lab. More precisely, you are to change your word counting program from Lab 7 so that...
You must provide measured execution times that empirically support the idea that your program meets requirements 1 and 2 above. Include any analysis of the measured times necessary to support the O( w logw ) claim.
Your program for this lab must retain the same basic interface as the program you wrote for Lab 7. In particular, it must take a text file as input, and must output an alphabetized list of all the distinct words in the document, showing the number of times each word in that list appears in the document.
You must write your own code for whatever sorting your program does. In particular, you may not use a sorting algorithm from the Java class library, or from similar libraries.
While the empirical analysis part of this exercise is phrased as a check that your program has the expected efficiency, not as a formal experiment, a convincing check merits much the same care that a formal experiment does. Thus, for example, you should think about controlling errors while making your measurements, you should repeat measurements, you should make sure that you collect data on enough different document sizes to have a convincing check, etc.
The O( w logw ) requirement is subtle. Note that it uses "O", the asymptotic notation that means "bounded above by...." It is therefore perfectly OK for you to end up with a program that runs in time less than proportional to w logw. You may want to take this fact into account in empirically analyzing your program.
The test documents that I provided for Lab 7 are still available on the "OutBox" file server, in folder "/ComputerScience/baldwin/CSci240/WordCount." However, to get a really convincing test that your program runs in O( w logw ) time, you may want additional test documents. I recommend that you get them from Project Gutenberg, at http://www.gutenberg.org/wiki/Main_Page (this is where I got the test documents I provided). This Web site provides an incredible selection of literature, free, and in plain text form, that can be downloaded and read by your program. Most pre-20th-century western writing, and a good deal of non-western, is available; simply search by title or author for something that you expect to be of a length you need for your experiments, download it, and run your program on it. Your program's output should include an actual count of the number of words in the document, so that you can check that the document is in fact of the length you want.
All of the programs written for Lab 7 will at least need to be revised to use a faster sorting algorithm than they originally used. Some of you will need to make further revisions in order to make your programs run in the required time (the Θ(n2) sorting algorithms in the originals sometimes mask greater-than-nlogn inefficiencies elsewhere in the programs). Anyone who didn't complete a solution to Lab 7 will need to do so -- fortunately, that lab was designed to be fairly straightforward, and so even writing the program from scratch, but with an efficient sorting algorithm, isn't unreasonable for this exercise. See the Lab 7 Handout for some hints on getting started.
Finally, one of the things I discovered in preparing this lab is that textbook algorithms don't always work as well as advertised in specific applications. The empirical data you collect to test whether your program runs in O( w logw ) time can help you tell if you are in such a situation, alerting you to a need to rethink your program. Additional measurements (e.g., how does running time for the sorting alone behave? How does the time to read files behave? etc.) can help you narrow in on the locations of problems.
Hand in a written report describing your final word-counting program, the reasons why you think it achieves the O( w logw ) time bound, and your empirical analysis that it actually does so. I expect that these reports will be 2 to 4 pages long (exclusive of any code you include), although the real requirement is that they concisely say everything that needs to be said to convince a reader that your program performs as required.
Turn in your report so that I receive it before I leave on the due date above.