SUNY Geneseo, Department of Computer Science
CSci 141 , Fall 2003
Prof. Doug Baldwin
Due Monday, December 15
Machine translation, the automatic translation from one language to another, is a long-standing (and not very well met) dream of some computer scientists. It falls within the subfield of computer science known as artificial intelligence, and the branch of artificial intelligence called natural language processing.
One requirement for machine translation is a "dictionary" that translates words in one language into words in another. This is essentially the computer equivalent of dictionaries you may have used in foreign language classes, for example a French-English or Spanish-English dictionary. In algorithmic terms, the dictionary is a collection whose "search" message takes a word in one language as its argument, and that returns an equivalent word in the other language. For example, an English-to-German dictionary might be asked to search for "dog", and would return "Hund". This is the part of a machine translator that you will write for this assignment (albeit in a limited form).
As you probably suspect, word-for-word translation is the least of the problems that a machine translator has to deal with. Some other issues (which you do not need to deal with in this assignment, but may find it interesting to know about) include...
Write a program that reads words in one language from its user, and prints equivalent words in another language. Your program should read words and print translations one at a time in a loop, until the user indicates that they are finished. If the user enters a word whose translation the program doesn't know, the program should give the user an opportunity to provide a translation, which the program should add to its dictionary.
Your program will have to be built around a "dictionary" data structure in which the program looks up words in the input language and finds their translations in the target language. While your program will read words one at a time from its user, and only needs a dictionary big enough to demonstrate the concept (e.g., 10 or 20 words), you should pick a data structure that could scale to more realistic sizes (e.g., on the order of 50,000 to 100,000 words) and more automatic translation (e.g., reading and translating the text of a book or long document from a file without word-by-word human intervention).
You can use any input and target languages you like for this project. However, if you don't have favorite languages in mind, I have provided a simple set of English-German translations in a file named "german.txt". You can Download This File from the Web at http://cs.geneseo.edu/~baldwin/csci141/fall2003/german.txt. This file is a simple text file, with two words on each line -- the first word is English, the second is an equivalent German word. The words are in no particular order.
My main reason for choosing German as the language for the sample translations is that it can be written in the Latin alphabet without any diacritical marks, so it is one of the few languages I can think of other than English that is easy to produce on a standard U.S. computer. If you wish to translate to or from a language that does require non-U.S. characters or diacritical marks, you can do any of the following (in order of ambitiousness):
Your program will need some way to read input from its user. You can use any
of the techniques you have used in past exercises, e.g., java.io
classes that let you read from standard input, simpler input classes from other
libraries, or even a graphical interface.
A minimal but acceptable program for this assignment would be a program that always starts with an empty dictionary, adds words and translations to it as searches fail and the user provides translations, and throws everything it has learned away when it stops. Translations in this minimal program would be single words. More ambitious solutions might provide more sophisticated translations (e.g., multiple alternatives with some explanation of each), save dictionaries in files between runs, etc. Such extensions make the finished program more interesting, and could contribute to your program being "more than I expected" when I grade it.
In addition to the program proper, tell me what you expect the asymptotic time to translate a word to be, in terms of the dictionary size. Explain this statement in enough detail that a person familiar with the material in this course can see why you are correct. This could mean something as simple as a statement of the form "I expect the asymptotic translation time to be such-and-such, because the dictionary is such-and-such a data structure". This is appropriate if your dictionary is a data structure we have studied in class, and we have already derived asymptotic execution times for all the relevant algorithms in class. At the other extreme, you may have to include a complete derivation of your translation time, if you use data structures or algorithms that we have not studied in class. You may also fall in between these extremes, if you use familiar data structures with a mixture of new and familiar algorithms, or you combine data structures in some way.
Turn in a printout of your program. If your program uses separate data files (e.g., to initialize its dictionary), turn in printouts of them too, if they are printable text. If your program uses non-text data files, be sure there is a sufficient description of them in the program's comments that I can understand what information they contain and in what format. Finally, include your statement (and derivation, to the extent necessary) of asymptotic translation time with your program -- it can be in comments in the program, or on a separate sheet of paper.