Geneseo CSci 120 Cell Array Lesson

CSci 120, Spring 2014

Prof. Doug Baldwin

Complete by Friday, April 25

Purpose

Background

Many programs operate on collections of data. Vectors (and matrices) are familiar ways of forming collections in Matlab. However, all of the elements in a vector or matrix must be the same type (for instance, you can’t have a vector in which some elements are numbers and others are strings—either all must be numbers, or all must be strings—and in fact, strings of the same length). Cell arrays are a Matlab data structure that allow you to create things much like vectors and matrices, but with elements of different types.

One of the most popular ways of understanding databases is the so-called “relational model,” from which relational databases are derived. (If you have ever seen or heard of SQL or similar database languages, e.g., mySQL, you have seen or heard of relational databases). This model is based on the mathematical notion of a relation.

Informally, a relation is exactly what it sounds like: something that describes a relationship between two (or more) numbers or other entities. For example, “less than” is a relation on numbers: 1 and 2 are a pair of numbers that have, or are in, this relation, because 1 < 2; 3 and 1 are a pair that are not, because 3 is not less than 1.

The formal definition of “relation” captures many features that showed up informally in the “less than” example. Specifically, a relation is a set of ordered pairs (or triples, quadruples, etc.); pairs of numbers that are in the set have (or are, literally, in) the relation, pairs that are not in the set do not have it. Thus, for example, the less than relation could be described, in part, as the set

The complete set is, of course, infinitely big, but that is not a problem. However, there are still certain pairs that are not in the set, for example (2,1). Notice that the order in which items appear in a pair matters (that’s what it means that they are “ordered” pairs). Thus the pair (1,2) is not the same thing as the pair (2,1), and so it makes sense to say that (1,2) is in the set while (2,1) is not.

Notice that a number can be paired with itself in a relation. This never happens in the “less than” relation, but it often does in other relations. For example, equality is another common relation, whose set representation is in part

While many common relations are sets of pairs, you can also have relations that hold between three values (represented as sets of triples), four values (sets of quadruples), and in general between any number of values. The general term for an ordered group of things is a “tuple.” The tuples in relation can be any size, as long as all tuples in a single relation are the same size. The size of the tuples in a relation is sometimes called the relation’s “arity.” As an example of a relation with arity greater than two, consider the “is the sum of” relation that holds between groups of three numbers, a, b, and c, whenever a = b + c. This relation contains, for example, the tuple (5,3,2) because 5 = 3 + 2. It also contains (5,2,3) and (5,4,1), but not, for instance, (5,4,3). As a set, the relation could be written in part as

Finally, relations don’t have to be only between numbers. Any idea that can be represented clearly enough to put into an ordered tuple can be part of a relation. For example, a relation between objects and their principle colors might include such tuples as

Exercise

The following questions introduce you to the relational data model, and some of the operations that are common in relational databases. These operations are technically operations of the so-called “relational algebra.”

The key idea in the relational data model is that a database is a collection of relations, in the full mathematical sense (i.e., sets of tuples). In relational database terminology, a relation is sometimes called a “table,” a tuple a “record,” and an individual element within a tuple an “attribute.” Other than “attribute,” for which I don’t know a good mathematical synonym, I will generally try to stick to the mathematical terms in this handout.

Question 1

A relation can very nicely be represented by a two-dimensional cell array in Matlab, with each row representing one tuple. I have provided a small database represented in this way. The database is in a file named “database.mat,” which you can download from our “Exercises” page on myCourses. It consists of four relations: three of them (in variables mathContributions, physicsContributions, and csContributions) describe some famous mathematicians, physicists, and computer scientists, respectively, some of their major contributions to mathematics/physics/computer science, and a year (more or less, some are historically unclear) for that contribution. The fourth relation, in variable institutions, describes institutions with which some of the people in the first relations were associated, along with the years the association began and ended.

Load the database into Matlab, and look at the three relations to get a sense for what is in them and how it is represented.

Try to pick a single attribute out of a single tuple in one of the relations. Is this something better done with cell indexing or content indexing? (Be sure you understand the difference; talk with your study group partners to help you understand it.) Why?

A simple but useful operation in relational algebra is projection, which simply amounts to extracting one or more attributes from every tuple in a relation, producing a new relation with just those attributes. How could you do a projection on a relation represented by a cell array? Since projection is supposed to produce a new relation, the result of your projection should be another cell array—should you use cell indexing or content indexing? (And could that question be a hint about how to do projection?) Try your ideas on some of the relations from the database.

Another common relational algebra operation is selection, which amounts to extracting one or more tuples from a relation (again, producing a new relation containing just those tuples). How would you select from a relation represented as a cell array? Assume you know the indices of the tuples you want; don’t worry yet about how you know you want those indices. Try your ideas.

Figure out how to add a tuple to a relation represented as a cell array, and test your idea(s) on some of the relations in the database. I’ve tried to be reasonably accurate about the “facts” in this database, but feel free to rely on your own knowledge, or simple Internet searches, to find information to put in your new tuple(s).

Now try adding an entire relation to the database. The relation can represent anything you want, although it might work better with the existing relations if it has something to do with the history of math or science. As when you added tuples to existing relations, the facts stored in your new relation can come from any source you wish.

Finally, save your modified database. You can save it in a different “.mat” file from the original if you don’t want to risk changing the original.

Question 2

In question 1, you developed a way to select tuples from a relation by their indices. In practical databases, users don’t know the positions (i.e., indices) of tuples within relations, and so more sophisticated forms of selection are necessary. One is to select all tuples in which a specified attribute is equal to a specified value (for example, “select all tuples from mathContributions where the mathematician’s name is ‘Isaac Newton’” or “select all tuples from institutions where the institution is ‘Institute for Advanced Study’”). Write a Matlab function that performs this kind of selection on a relation represented as a cell array. Your function should take a relation, the column index of the attribute to be used in comparisons, and the value to compare that attribute to as parameters; it should return a new relation containing just the tuples from the input relation that have the requested value in the requested attribute.

Devise and write down some examples of selections in this style in your database, to be sure you understand what you have to do, and to provide you with some tests of your function.

(In a more powerful database, this style of selection would be extended to support other kinds of comparison besides equality, and Boolean combinations of comparisons, so that it could handle requests such as “find all the mathematical contributions made before 1900” or “find all the mathematical contributions made by Isaac Newton and after 1680.”)

Question 3

Since relations are sets, all set operations (union, intersection, etc.) can be performed on them. Union is particularly useful when the relations come from a database. For example, the database I gave you might in many cases be more useful if you could take the union of the mathContributions, physicsContributions, and csContributions relations to get a new relation that contained all the contributions.

Write a function that calculates the union of two relations, and returns the resulting relation.

(Recall that the union of two sets is a new set that contains every element that was in either, or both, of the input sets. The only thing that is tricky about doing this in a program is that if the same element—i.e., tuple in this case—appears in both sets, only one copy of it should appear in the result. Matlab has a union function, but I don’t believe it works on cell arrays as they are used in this database.)

Bonus

Figure out how to express the following database searches (or “queries” in database terminology) as compositions of projection, selection, and/or union operations.

Use your functions and ideas from questions 1 through 3 to actually code each of these queries and test it in the database. You may code you ideas about projection in a third function if doing so makes this easier for you.

Follow-Up

In your study group’s first meeting with me following your “Complete By” date above, we will talk about your solutions to the questions. I may also ask you to show me code you wrote, or to demonstrate your ideas. Please bring at least one copy of the database and any necessary “.m” file(s) to your meeting.

I plan to make the next (and last) lesson for this course be an extension of this one, in which you explore a more complicated relational algebra operation called a “join.” Keep your database and code from this lesson handy so you can use them in the next one.

Lesson 11—Cell Arrays, Files, and the Mathematics of Databases