Chapter 4



A. Understanding the Basics



1.  Graphic Organizer



Insert Figure 4-1 about here:  Chapter 4 Graphic Organizer



2.  Chapter Overview

      This chapter describes formal assessments of reading and literacy; that is, the use of standardized and other highly structured tests.  Emphasis is placed upon wise and knowledgeable testing policies, including establishing purposes for assessment, criteria for selecting instruments, and uses and misuses of data obtained from such instruments.  Attention is given to the basic statistical information necessary for understanding standardized test scores.  Several general categories of norm-referenced reading/literacy tests, each of which serves a different function, are described and examples are provided.  These general categories include the following:  reading survey tests, general achievement tests, and reading diagnostic tests.  Some guidelines for the implementation of standardized tests are provided, with a discussion of advantages and disadvantages of this approach to assessment.  The chapter provides several opportunities for the reader to engage in analysis and interpretation of test data.


3.  Understanding Formal Assessment

       Formal assessment generally refers to the use of standardized tests or other tests designed for administration under specified, controlled conditions to measure students' reading ability or general school achievement and aptitude.  Standardized tests are often called norm-referenced because norms are provided from a reference population.  That is, the test publisher has administered the test to a sample of children (the reference population) and provided statistical information about the resulting range of scores (the norms).

      Some standardized tests are developed for individual administration, but most are developed for group administration.  One purpose of such tests is to provide a standard to which individuals' and groups' performance can be compared.   The scores obtained from standardized tests are intended to be quantitative, non-biased, and non-subjective.  Examiner judgment and other qualitative factors that might influence a pupil's score, such as attitude and interest, are minimized in these assessments.  Thus, these tests are perceived to be objective devices (Calfee & Hiebert, 1991; Pearson & Dunning, 1985).

      Standardized testing has its roots in the work of educational and psychological researchers of the early 1900's, such as E. L. Thorndike and C. Spearman.  Scientific objectivity in testing became an important issue in psychological and educational assessment during this period, when researchers focused on eliminating the subjective judgment of the examiner in an attempt to be more impartial and fair to the person being examined.  Thus, fairness in testing has long been equated with objective tests, in which the test score is the primary source of data used to judge a person's ability.

      To ensure their objectivity, standardized tests must be administered according to the instructions in the examiner's manual that accompanies each test.  Examiners are required to read the directions for taking the test exactly as they are stated in the examiner's manual and to adhere to time limitations for completing the test.  This procedure ensures that all test takers will receive the same instructions, participate in completing the same sample items, and have the same amount of time to complete the test. 

      Some individually administered standardized tests have guidelines for establishing a basal level and a ceiling level.  This procedure shortens the time needed for testing by eliminating items that are too easy or too difficult for the child.  Basal and ceiling levels can be established only when the items on a test are arranged from easy to difficult.

      Basal levels are based on data from the norming population, which suggest that students of certain ages can successfully complete items of a given difficulty level.  For example, the standard basal level, or starting point, of a test for children at a certain age might not be item #1 but item #15 or higher.  The examiner is instructed by the manual to make sure that the child is able to answer the first few questions from that starting point successfully.  Most children at that age will be able to do so, but some lower ability students will not.  In that case, the examiner must proceed backwards through the test items until the student responds correctly to a designated number of items (usually 5-7).  This point is considered the basal level for that child, and the assumption is made that all items below this level would be answered correctly.

      Determination of a ceiling level allows the test to be terminated at a point at which the student is making too many errors within a given number of items.  For example, the examiner's manual may instruct that when a student makes five errors out of eight items, testing will be stopped because a ceiling for that student has been reached.  The assumption is made that the remaining items on the test are too difficult for the student.
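
      The ceiling rule can be sketched in a few lines of code.  The following is only an illustration, assuming the hypothetical rule from the example above and reading "five errors out of eight items" as five errors within any eight consecutive items.  The function name and the response record are invented for illustration; actual rules differ from one examiner's manual to another.

```python
def find_ceiling(responses, max_errors=5, window=8):
    """Return the 1-based item number at which testing stops, or None.

    responses: list of booleans in item order (True = correct answer).
    Testing stops as soon as the most recent `window` items contain
    `max_errors` errors (an assumed reading of the manual's rule).
    """
    for i in range(len(responses)):
        recent = responses[max(0, i - window + 1): i + 1]
        errors = sum(1 for correct in recent if not correct)
        if errors >= max_errors:
            return i + 1
    return None  # ceiling never reached; the student attempts every item


# Hypothetical response record: mostly correct early, more errors later.
responses = [True, True, True, False, True, True, False, True,
             False, False, True, False, False, True, True]
ceiling_item = find_ceiling(responses)
print("Testing stops after item", ceiling_item)
```

      Items beyond the stopping point are not administered and are assumed to be too difficult for the student.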

        Objective tests have been widely used from the early 1900's to the present time and play a major part in the assessment of students' academic growth and placement in particular school programs (Pikulski, 1990;  Valencia & Pearson, 1988).  Assignment of students to a reading group, for example, is often based on test results.  Standardized tests are also used as a yardstick in measuring the effectiveness of teachers and school programs (Smith, 1991).  Neill & Medina (1989) estimate that over 105 million standardized tests are administered each year to over 39.8 million students in schools in the U.S.

Characteristics of standardized tests

      Most of the standardized tests used in schools today are group tests with multiple choice answer formats.  They are developed, published, and marketed by commercial companies that determine the test content, construct the items, and gather the norming data from the targeted population.

      Norming.  In the norming process, the test is given to a large group of students who are representative of students who will later use the test.  The publisher determines the population that will serve as the norming group.  Those involved in the norming group, or reference group, set the standards by which other individuals' performance will be measured in future administrations.  The norming sample should represent people from various geographic areas, ages and grade levels, socio-economic levels, and racial and cultural groups, as well as a variety of community types, such as urban, rural, and suburban.  If the norming procedure is inadequate, as when the norming sample is too small or not representative of the target population, the test scores may be misleading.

      To help ensure that test results will be dependable, statisticians have developed several criteria that the tests should satisfy.  Among these criteria are two that all norm-referenced tests should meet:  reliability and  validity.

      Reliability.  Reliability is a measure of consistency and generalizability.  Salvia and Ysseldyke (1991) suggest that three questions be asked when evaluating the reliability of a test:

      1. If the test were re-administered or scored by another examiner, would the pupil's performance yield the same or a very similar score?  This is often called rater reliability.

      2. Are the scores stable over time?  Would the student's score be essentially the same if he or she took the test again next week, next month, or two months from now?  This is often called test-retest reliability.

      3. Can we generalize similar behaviors to other test items?  If the student were tested on a representative sample of items from the test -- such as a sample of letter names and sounds rather than all of the names and sounds on the test -- would that student get a similar score if he or she were tested on a different sample from that same set of items?  For example, would the student get the same or a similar score on the odd numbered items as on the even numbered items?  This is often called split-half reliability.  A similar reliability measure, often called alternate form reliability, is carried out using two versions of the same test.  Publishers often provide different versions, called forms, of the same test for use with the same children at different times.  For example, Form A of the test might be administered in September and Form B in June.

      Reliability is often reported by the test publisher in the form of a reliability coefficient.  This is a statistical measure of the test's reliability.  It is usually in the form of a correlation coefficient which measures consistency:  The closer to 1.0 the measure, the greater the reliability.  In general, the reliability of a test or subtest score is highly related to the number of items used to assess that score.  The more items, the higher the reliability.
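
      As an illustration of how a reliability coefficient might be computed, the sketch below estimates split-half reliability by correlating each student's score on the odd numbered items with his or her score on the even numbered items.  The student data and function names are hypothetical, and publishers typically apply further statistical adjustments that are not shown here.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def split_half(item_scores):
    """item_scores: one list per student, 1 = correct, 0 = incorrect."""
    odd = [sum(s[0::2]) for s in item_scores]   # items 1, 3, 5, ...
    even = [sum(s[1::2]) for s in item_scores]  # items 2, 4, 6, ...
    return pearson(odd, even)


# Hypothetical data: five students, eight items each.
students = [
    [1, 1, 1, 1, 1, 1, 1, 0],
    [1, 1, 1, 0, 1, 1, 0, 0],
    [1, 0, 1, 1, 0, 0, 0, 0],
    [1, 1, 0, 0, 0, 0, 0, 0],
    [1, 1, 1, 1, 1, 1, 1, 1],
]
r = split_half(students)
print(f"Split-half correlation: {r:.2f}")
```

      A coefficient close to 1.0 would suggest that the two halves of the test rank the students in much the same order.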

       Validity.  A test has validity if it actually measures what it claims to measure.  Test publishers often report their efforts to substantiate the validity of a test in terms of a validity coefficient, in similar fashion to the reliability coefficient reported above.  A validity coefficient is usually in the form of a correlation coefficient, measuring the consistency between the test and some other appropriate criterion measure.  Common types of validity include content validity, criterion-related validity, and construct validity.

      Content validity is determined by evaluating factors such as the appropriateness of the questions (the questions should represent the content to be measured) and the completeness of the items sampled (the test items should cover the range of behaviors associated with the topic).

        The content of the test should represent the common curricular requirements of schools where the test is being used.  A frequent complaint of teachers about standardized tests is that the school curriculum and the test content do not match well; that is, the test does not measure what students have been taught. 

      Criterion-related validity is based on how accurately a test score can estimate a current criterion score or how accurately a test score can predict a criterion score at a later time (Salvia & Ysseldyke, 1991).  A criterion score is one that is generally recognized as representing a person's ability in a particular domain.  For example, if the test publisher wanted to demonstrate the criterion-related validity of the reading comprehension passages of a new standardized test, the publisher would find another measure that is generally accepted as an accurate measure of reading comprehension.  Then both assessments, the new test and the existing accepted measure, would be administered to the same students and the results compared.

      Two major types of criterion-related validity are known as concurrent and predictive validity.  Concurrent validity refers to how well the test compares to other established measures administered at the same time.  Predictive validity refers to how well the scores on the test correlate with established measures administered at a later time.  For example, the most useful test of validity for college entrance examinations would be the student's actual later performance in college.  The most useful test of validity for a traditional reading readiness test would be the child's actual later achievement in reading.  In practice, concurrent validity is usually obtained by measuring how well a new test correlates with other established tests and/or with teacher judgment.  Predictive validity is usually determined by measuring how well the scores on the new test correlate with later assessments.

      Construct validity refers to how well the test measures its theoretical construct; that is, the trait it is supposed to assess.  For example, consider a test of vocabulary knowledge.  Of course, such a test should be strongly related to a student's ability in vocabulary.  However, general intelligence might also predict to some degree how many words a student knows, so it is possible that what appears to be a test of vocabulary is in fact a test of general intelligence.  To document the construct validity of the vocabulary test, the test designer might show that the test predicts actual vocabulary knowledge better than does a test of general intelligence.

      Test format can also hinder accurate assessment of an individual's performance if it is inappropriate for measuring the desired trait.  Suppose students must indicate their knowledge of higher level reading by marking tiny spaces in columns on a computer-scored answer sheet.  It is up to the test publisher to demonstrate the construct validity of its test; that is, to show that this artificially designed multiple choice device, which bears little resemblance to the performance of higher level reading tasks in real life, actually does measure the construct of higher level reading.


Understanding Standardized Test Scores

      Teachers, counselors, and administrators often use and report on the results of standardized test scores.  They must be able to interpret these test scores in ways that are meaningful not only to themselves, but to those to whom the results are being reported.  They also must use the results in ways that are consistent with the purposes and limitations of the testing instruments. 

      To make use of a test score, an educator must be able to tell how a student's score compares to the educational performance of other children.  A raw score is the number of points earned on the test, usually the number of correct responses.  By itself, the raw score of a standardized test is usually unhelpful for comparing performance among children.

      For example, a teacher may want to know how each of her third grade readers compares to all the third graders in the school district who took the test, a task that could involve hundreds of raw scores.  There is a need for a more economical way of comparison than by trying to make sense of all the raw scores.  The teacher must also be able to determine how the score on one test compares to the scores on similar tests.  These different tests will have been scored on different scales due to different numbers of questions and different levels of question difficulty:  The raw score on one test could be 35 and on another 500 and on another 72.  In the raw score form, the test results will not be comparable.  For comparisons to be possible, a method must be employed that will allow teachers to make sense of many raw scores resulting from many testing instruments.  This method, called standardization, is accomplished through statistical procedures.  The following sections provide a brief overview of basic statistical tools and standardized scores.

      Measures of central tendency.  The scores in a set of scores, regardless of the scoring scale used, will represent a range from high to low, called a distribution.  Descriptive statistics can be used to economically summarize the data represented in this distribution.  Among the most common descriptive statistics used for reporting test results are the measures of central tendency, which describe the way scores cluster around the middle of the distribution.  Among these measures of central tendency are the mean, the median, and the mode.

      The mean is the arithmetic average of the scores in a set of scores.  To obtain the mean, add all the scores and divide the sum by the number of scores in the set.  The median is the middle score in a set.  To obtain the median score, list all the scores in the set from highest to lowest.  The median score will be the score in the middle of the list:  Half the scores will be at or above the median, and half will be at or below it.  The mode, or modal score, is the score that appears most often in a set.  To obtain the mode, count how many times each score appears in the set.  The score that appears most frequently is the mode.
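
      These three measures can be computed directly with Python's standard `statistics` module.  The sketch below uses a small set of hypothetical scores; only the score values are invented.

```python
from statistics import mean, median, mode

# Hypothetical raw scores for nine students.
scores = [72, 65, 80, 65, 90, 58, 65, 77, 84]

average = mean(scores)      # sum of the scores divided by how many there are
middle = median(scores)     # middle score once the scores are sorted
most_common = mode(scores)  # score that appears most often

print(f"mean = {average:.1f}, median = {middle}, mode = {most_common}")
```

      When a set contains an even number of scores there is no single middle score, and `median` returns the average of the two middle scores.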

      Distribution and Standard Deviation.  Another common descriptive statistic used for reporting test results is the standard deviation, which is related to the normal curve (sometimes called the bell or bell-shaped curve).  One assumption common to standardized tests is that among large groups of people taking the same test, the distribution of scores will include a few very high scores, some high scores, many average scores, some low scores, and a few very low scores.  This symmetrical type of distribution is called a normal curve, or normal distribution.  When it is represented graphically, it resembles a bell shape (see Figure 4-2).


Insert Figure 4-2:  Normal Distribution


      In a graphical depiction of the normal distribution, the mean runs vertically through the center of the bell.  The two portions of the bell on either side of the mean can be divided through statistical procedures into standard deviation  units, which show the spread (or variability) of scores around the mean.  In a normal curve, corresponding portions of the curve, or standard deviation units, contain the same percentage of scores.  That is, the portion of the normal curve up to one standard deviation immediately to the left of the mean (-1) and the portion up to one standard deviation immediately to the right of the mean (+1) will each have the same percentage of scores.  As you move farther away from the mean on each side of this normal curve, the pattern of equal percentages of scores will hold for the remaining pairs of mirror-image standard deviation units. 

      In a normal curve, approximately 68%  of the scores will fall between +1 and -1 standard deviations of the mean, with roughly 34% of the cases in each standard deviation (see Figure 4-2).  Another approximately 28%  of the scores will fall between +1 and +2 and between -1 and -2 standard deviations from the mean, with approximately 14% of the cases in each of these standard deviations.  Another approximately 4% of the scores will fall in the third outermost pairs of standard deviations, with about 2% of the cases in each of these standard deviations. 
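
      The rounded percentages above can be checked against the standard normal distribution.  The following sketch uses `NormalDist` from Python's standard `statistics` module; the helper function and labels are illustrative names only.

```python
from statistics import NormalDist

z = NormalDist()  # standard normal curve: mean 0, standard deviation 1


def pct_between(lo, hi):
    """Percentage of scores falling between lo and hi standard deviations."""
    return 100 * (z.cdf(hi) - z.cdf(lo))


bands = {
    "mean to +1 SD": pct_between(0, 1),
    "+1 to +2 SD": pct_between(1, 2),
    "+2 to +3 SD": pct_between(2, 3),
}
for label, pct in bands.items():
    print(f"{label}: {pct:.1f}%")
```

      Because the curve is symmetrical, the same percentages apply to the corresponding units below the mean.  The computed values are slightly more precise than the rounded figures of 34%, 14%, and 2% used in this chapter, and a very small fraction of scores lies beyond three standard deviations.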

      In an administration of a given test, the larger the testing population, the closer the distribution of results will be to the normal curve described above.  The samples on which standardized tests are normed are almost always large enough to produce nearly perfect normal curves.  Also, most standardized tests are designed so that very few students will achieve a perfect score--a further assurance of a normal curve for the test results.

      The standard deviation is valuable in that it shows how the scores in any test will cluster around the mean.  That is, the standard deviation shows how close the test scores are to the average score and to one another.  For example, let us suppose that the same standardized reading test was given in two different schools.  In both schools, the two sets of test scores happened to yield the same mean.  With only the mean for use, an administrator might conclude that the schools were basically equivalent in achievement. 

      But let us further suppose that, for some reason, the distribution of the scores in the two schools was very different.  By having access to the standard deviation, as well as the mean, the administrator could recognize this difference.  Suppose School A and B both had a mean score of 65.  But School A had a standard deviation of 4, while School B had a standard deviation of 14.  The standard deviation of 14 would indicate that the range of raw scores was very broad for the scores in that set, with a large number of low and very low raw scores and a large number of high and very high raw scores.  Conversely, the small standard deviation of 4 would indicate that the raw scores in School A had a much narrower range, with a much smaller interval between the higher and lower sets of scores.

      This difference in variability, as indicated by the difference in standard deviation, would lead an administrator to draw different conclusions about school performance.  As stated above, in any normal distribution, approximately 68% of the scores will fall within the two standard deviations closest to the mean--the ones on either side of the mean (from -1 to +1).  A student in School B whose raw score is 56 will be within the range in which 68%, over two-thirds, of the students in the school scored.  We can verify this statement by subtracting the student's raw score of 56 from the school mean of 65.  The result, 9, is less than School B's standard deviation of 14.  Thus this student's score is within one standard deviation below the mean, between the mean and -1 standard deviation.  Any student scoring in the same middle range as 68% of the students in the school is probably performing at an average level of achievement.

      On the other hand, any student from the other school, School A, whose raw score is 56 is at much greater risk of inadequate performance.  Even though such a student would be 9 raw score points below the mean, just as in School B, the smaller standard deviation at School A would indicate that this student's score was much further outside the mainstream of School A student performance.  With School A's standard deviation of 4, a score 9 raw score points below the mean would place this student more than 2 standard deviations below the mean.  This student would be performing among the lowest 2% of achievers in the school.
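
      The comparison between the two schools comes down to one small calculation: the distance of the raw score from the mean, divided by the standard deviation.  A brief sketch follows, using the figures from the example above; the function name is illustrative.

```python
def sd_units_from_mean(raw_score, mean, sd):
    """Distance of a raw score from the mean, in standard deviation units."""
    return (raw_score - mean) / sd


# Same raw score (56) and same mean (65), different standard deviations.
school_a = sd_units_from_mean(56, mean=65, sd=4)
school_b = sd_units_from_mean(56, mean=65, sd=14)

print(f"School A: {school_a:.2f} standard deviations from the mean")
print(f"School B: {school_b:.2f} standard deviations from the mean")
```

      A negative result indicates a score below the mean.  The School A result is beyond -2, while the School B result lies between -1 and 0.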

      In actual practice, two schools' standard deviations would infrequently differ by so much, but the point should be taken that mere knowledge of the test mean is inadequate for appropriate interpretation of the test results.  One school might serve a very homogeneous population and obtain smaller variance, while another school might serve a very heterogeneous population and obtain a larger variance. 

      Knowledge of test variance as shown by the standard deviation is also quite important when comparing results of different tests.   Tests usually differ significantly in their variance.  A five point difference in raw scores between two students might be insignificant on a test with a large standard deviation, but be quite important on a test with a smaller variance.

      Standard error of measurement.   All standardized tests are subject to error.  A person taking a test today would probably have a different score if he or she took the test again next week simply because changing conditions affect performance.  To provide for adequate understanding of this possibility of error, test publishers use statistical procedures to determine a range of scores around any given student's score.  This range is called a standard error of measurement (SEM).  The standard error of measurement is defined as the amount of error that is expected for a score on a particular standardized test.  It is viewed as the difference between the score obtained by an actual student and that student's hypothetical "true" score.

      For example, if a student obtains a raw score of 20 and the standardized test standard error of measurement is 5, the student's hypothetical "true" score would be between a possible low of 15 and a possible high of 25.  Tests with high reliability have low SEMs, and tests with poor reliability have high SEMs.

      The standard error of measurement has very practical implications for teachers, especially when using a test to differentiate student performances.  On the test described above, where the SEM is 5, if George obtains a raw score of 59 (indicating that his "true" score lies somewhere between 54 and 64) and Susan a raw score of 57 (indicating that her "true" score lies between 52 and 62, a range that largely overlaps that of George), the teacher would be ill-advised to differentiate instruction for the two students based on only that testing information.  The odds are fairly high that Susan's "true" score is the same as or even higher than George's.
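
      The comparison of George's and Susan's scores can be expressed as a short calculation.  The sketch below builds the band of one SEM on either side of each obtained score and checks whether the two bands overlap; the function names are hypothetical.

```python
def score_band(obtained, sem):
    """Range of one SEM on either side of the obtained score."""
    return (obtained - sem, obtained + sem)


def bands_overlap(band_a, band_b):
    """True if the two score ranges share at least one point."""
    return band_a[0] <= band_b[1] and band_b[0] <= band_a[1]


george = score_band(59, sem=5)  # (54, 64)
susan = score_band(57, sem=5)   # (52, 62)

print("George:", george, "Susan:", susan)
print("Bands overlap:", bands_overlap(george, susan))
```

      When the bands overlap, as they do here, the difference between the two obtained scores should not be treated as a dependable difference in ability.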

      Out-of-level testing.  Out-of-level testing involves the administration of a test that is intended for students at another grade level (Baumann, 1988).  Out-of-level testing is usually reserved for those students who are reading at a level far below their peers.  Administering an out-of-level test can provide a more accurate picture of such a pupil's reading performance.   

      For example, a fifth grade student who reads at the second grade level would quickly become frustrated and discouraged by the standardized reading test designed to be administered to fifth graders.  The difficulties with the test would be so great that the student's answers would show little beyond almost random choice of answers.  Any multiple choice test that has, for example, five answer choices per question practically guarantees students that they will get about 20% of the questions correct, even if all answers are randomly chosen.  The closer a student's obtained raw score comes to that chance level, the less value the results have for the teacher, since it is difficult to determine how much of the score reflects random chance and how much reflects actual performance.

      Standardized test scores are most stable when pupils answer between 1/3 and 3/4 of the items correctly (Roberts, 1988).  Less able students who answer fewer than one-third of the items correctly may have unstable scores that credit them with higher reading levels than they actually possess.
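
      The chance level and the stable range can both be checked with simple arithmetic.  The sketch below assumes a hypothetical 60-item test with five answer choices per item; the function names and the two sample raw scores are invented for illustration.

```python
def chance_score(num_items, choices_per_item):
    """Raw score expected from random guessing on every item."""
    return num_items / choices_per_item


def in_stable_range(raw_score, num_items):
    """True if between 1/3 and 3/4 of the items were answered correctly."""
    proportion = raw_score / num_items
    return 1 / 3 <= proportion <= 3 / 4


# Hypothetical 60-item test with five answer choices per item.
expected_by_guessing = chance_score(60, 5)
low_score_stable = in_stable_range(14, 60)   # barely above the chance level
mid_score_stable = in_stable_range(30, 60)   # half of the items correct

print("Expected by guessing:", expected_by_guessing)
print("Raw score 14 in stable range:", low_score_stable)
print("Raw score 30 in stable range:", mid_score_stable)
```

      A raw score that falls below the stable range and close to the chance level is a signal that an out-of-level test may give a more accurate picture of the pupil's reading.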

      Some tests, such as the Gates-MacGinitie Reading Tests (MacGinitie, 1989), provide teachers with a booklet of out-of-level norms for selected levels of the test.  If Sam, a seventh grader with serious reading problems, was administered the test level designed for third graders, it would be possible to use the out-of-level norms booklet to find scores that compare him to other seventh graders taking that lower level test.

      For many other standardized tests, no out-of-level norming has been carried out by the publisher.  In those cases, to ensure accuracy of interpretation, reporting out-of-level scores requires that the pupil's actual grade level be reported as well as the intended grade level of the test taken.  If Sam, a seventh grader, took the third grade standardized test and scored in the 43rd percentile, his results might be reported as follows:  "Sam, a seventh grader, scored in the 43rd percentile on the test intended for third graders.  His score indicates that he did as well as or better than 43% of the third graders who took the same test, placing him in the average range of third grade readers."

      Teachers should be aware that the value of administering a standardized test to students who are reading well below or well above their grade level is often questioned because the norms do not apply to such students.   Most standardized tests have not been designed for or normed for seriously at-risk students (Fuchs, Fuchs, Benowitz, & Barringer, 1987).


Derived scores

      When students take standardized tests, their answer sheets are often sent to a district central office or to the test publisher for scoring and analysis.  Occasionally, the answer sheets are scored by hand by the teacher, who uses the test manual or microcomputer software to analyze the scores.  In the scoring process, the students' raw scores are transformed into derived scores, which are based upon the raw scores but changed into different formats to facilitate interpretation.  There are several types of derived scores, including percentiles, grade equivalents, and stanines.

      Percentiles (%iles).  Percentile ranks are derived scores that show a student's position relative to the norming sample, using a system in which the distribution of scores is divided into sections, each of which has 1/100 of the scores in the total distribution.  Students with a percentile rank of 50 have scored at the median of the norming group, since 50/100 (that is, 50 percent, or 1/2) of the scores are above and 50/100 (that is, 50 percent or 1/2) of the scores are below.  A student with a percentile rank of 40 scored as well as or better than 40% of the sample, and poorer than 60%. 
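
      A percentile rank of this kind can be computed by counting how many scores in the norming sample fall at or below the student's score.  The sketch below follows the "as well as or better than" definition given above; the norming sample is hypothetical and far smaller than any real one, and publishers may handle tied scores differently.

```python
def percentile_rank(score, norm_scores):
    """Percentage of norming-sample scores at or below the given score."""
    at_or_below = sum(1 for s in norm_scores if s <= score)
    return round(100 * at_or_below / len(norm_scores))


# Hypothetical norming sample of ten raw scores.
norm_sample = [12, 15, 18, 20, 21, 23, 25, 27, 30, 34]

rank = percentile_rank(20, norm_sample)
print("Percentile rank of a raw score of 20:", rank)
```

      Here four of the ten norming scores are at or below 20, so the student scored as well as or better than 40% of the sample.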

      One difficulty with interpreting percentiles is that the intervals between percentile scores are not equal, unlike the standard deviations described above.  The percentile scores will cluster tightly around the mean because this is where most of the raw scores fall.  As a result, a small change in a raw score near the mean can produce a large change in percentile.  The same change in raw score at the extreme ends of the range will produce little change in percentile ranking, as the following illustration shows:

Raw Score:       55      60      65      70      75      80      85

Percentile:       1       2      16      50      84      98      99

A change in a student's raw score from 65 to 70, which is not much of a real difference on a test with 100 questions, will result in a shift from the 16th to the 50th percentile.  The same five-point increase in raw score from 80 to 85 will move the student only from the 98th to the 99th percentile.

      Thus use of percentiles in interpreting test results can easily lead to inaccurate conclusions about average students, in which educationally insignificant differences seem to be blown out of proportion.  On the other hand, educationally significant differences at the extreme high and low ends of the ability scale seem to be minimized.
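
      The raw score illustration above is consistent with a normal distribution that has a mean of 70 and a standard deviation of 5.  Those two values are not stated in the text; they are an assumption made here so that the pattern can be reproduced.  The sketch also assumes the common reporting convention of limiting percentiles to the range 1 to 99.

```python
from statistics import NormalDist

# Assumed norms (inferred, not stated in the chapter): mean 70, SD 5.
norms = NormalDist(mu=70, sigma=5)


def percentile(raw):
    """Percentile for a raw score, limited to the reporting range 1-99."""
    pct = round(100 * norms.cdf(raw))
    return min(99, max(1, pct))


table = {raw: percentile(raw) for raw in range(55, 90, 5)}
for raw, pct in table.items():
    print(f"raw score {raw}: percentile {pct}")
```

      Equal five-point steps in raw score produce very unequal steps in percentile, which is the source of the interpretation problems just described.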

      Grade equivalent scores.  Grade equivalent scores are derived scores that show a student's position relative to the norming sample, using a system which is based on the average scores obtained by students at given grade levels in the sample  population on which the test was standardized.  Usually the grade levels are divided into tenths, such as 2.1, 2.2, 2.3 and so forth.  Grade equivalent scores are widely used among teachers and administrators, in part because they appear to provide direct comparisons of student achievement with difficulty levels of materials.  For example, it would seem that a third grade child who scores at the 6.5 grade equivalent should be assigned reading materials with readability at the sixth grade.  These comparisons are misleading, however, and as a result of the almost insurmountable confusions arising from use of grade equivalent scores, some professional organizations, including the International Reading Association, have criticized their use (Harris & Hodges, 1995). 

      The confusions are particularly serious when test results are reported to parents, who are for the most part unfamiliar with test standardization procedures.  For example, many do not understand that in deriving grade equivalents from a statistically normal standardization sample, a significant portion of children's scores will almost always be in grade levels lower than that of their actual classroom assignment.  (An exception might be, for example, in testing initial consonant recognition with tenth graders.  Virtually all will earn perfect scores, placing the grade equivalent of a perfect raw score at the tenth grade level, so all those students will be on grade level.)  Having one-third of your school's students scoring below grade level--and being utterly unable to do much about that because of the inherent structure of the grade equivalent scoring system--is frustrating to educators and politically damaging.

      Grade equivalent scores are not only confusing to parents, but they can be confusing to educators, as well.  As mentioned above, grade equivalent scores are based upon the average score of pupils at different grade placements.  The grade equivalent score represents how the raw score of one pupil compares with the scores obtained by the average pupils at each grade level in the standardization sample.  Why should not the third-grade student mentioned above, who scored 6.5 grade equivalent on a reading test, be assigned sixth grade level reading assignments?  The answer is simple:  A 6.5 grade equivalent score does not mean that the student is capable of doing sixth grade work.  It is much more likely that this student has been very successful in mastering the third grade work provided in the test items, and so has scored as well as the average sixth grader would on that third grade test.  To use the test score to simply skip the content of the reading curriculum in fourth and fifth grades for this child would be a gross misuse of grade equivalent scores.

      Grade level equivalent scores are established by testing pupils at several grade placements with the same test and finding the average score for each group.  The raw scores are plotted and mathematically extrapolated to find scores above and below the averages found until grade equivalent scores are determined for the level of each published test (Lyman, 1986).  For example, if a reading test was given to a fourth-grade norming group, which produced an average raw score of 40, this would be set as the grade equivalent for grade four.  Similarly, the average achieved by a fifth-grade group might be 50.  Points between are assigned decimals and reported as "months."  A score halfway between 40 and 50 would be 4.5, or equivalent to fifth month of fourth grade.
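
      The interpolation in this example can be sketched in a few lines of Python.  This is only an illustration, not a publisher's actual procedure:  The function name grade_equivalent is hypothetical, the norms table holds only the two averages given above (40 for grade four, 50 for grade five), and straight-line interpolation between adjacent grade averages is assumed.

```python
def grade_equivalent(raw_score, norms):
    """Interpolate a grade equivalent from (grade, average raw score) pairs.

    norms: list of (grade, average_raw_score) tuples, sorted by grade.
    Only scores that fall between two tabled averages are handled here;
    published tests also extrapolate beyond the grades actually tested.
    """
    for (g_lo, s_lo), (g_hi, s_hi) in zip(norms, norms[1:]):
        if s_lo <= raw_score <= s_hi:
            # Fraction of the way from the lower average to the higher one.
            fraction = (raw_score - s_lo) / (s_hi - s_lo)
            return round(g_lo + fraction * (g_hi - g_lo), 1)
    raise ValueError("raw score lies outside the tabled averages")


# Averages from the example in the text: grade 4 -> 40, grade 5 -> 50.
example_norms = [(4, 40), (5, 50)]
print(grade_equivalent(45, example_norms))  # 4.5
```

      Note that the sketch treats every raw score point between the two averages as an equal step in grade placement, which is the same uniform-growth assumption that makes grade equivalents questionable.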

       The concepts underlying grade equivalent scores have to be well understood by both those who report and those who receive the scores. Since such understanding is rare, these scores should not be used by teachers to report a pupil's progress or reading level (Berk, 1981).  Several other caveats about using grade level scores include:

      1. The use of grade level scores assumes that learning progresses in a uniform way during the school year, so that the increase from level 2.1 to 2.2 is the same as from 2.9 to 3.0.  Learning does not occur in such a mathematically precise way.

       2. Since grade level scores are based on average scores, some (in fact, most) children perform better or worse than the average.  This is a difficult concept for many to grasp and is made politically volatile by the grade equivalent reporting system.

      3.  Grade equivalent scores from different test publishers will not necessarily be the same, nor will grade equivalent readability scores of reading materials necessarily match the scores of test publishers.

       Standard scores.  Standard scores are the results of a statistical procedure in which raw scores are transformed so that they have a given mean and a given standard deviation.  They express how far a child's results lie from the mean of the distribution in terms of that given standard deviation (Sattler, 1988, p. 32).  A common standard score is the T score.  It has a mean of 50 and a standard deviation of 10.  Another is the z-score, which has a mean of zero and a standard deviation of 1.  In a normal distribution, nearly all z-scores fall between -3 and +3.
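
      The two transformations can be illustrated with a short Python sketch.  The function names and the norming values (a mean raw score of 40 and a standard deviation of 8) are hypothetical, chosen only to make the arithmetic easy to follow.

```python
def z_score(raw, mean, sd):
    """Number of standard deviations the raw score lies from the mean."""
    return (raw - mean) / sd


def t_score(raw, mean, sd):
    """Rescale the z-score to a mean of 50 and a standard deviation of 10."""
    return 50 + 10 * z_score(raw, mean, sd)


# Hypothetical norming data: mean raw score 40, standard deviation 8.
print(z_score(52, mean=40, sd=8))  # 1.5
print(t_score(52, mean=40, sd=8))  # 65.0
```

      A raw score of 52 lies one and a half standard deviations above the mean, so it is reported as a z-score of 1.5 or, equivalently, a T score of 65.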

      Stanines.   Stanines are also standard scores and are more widely used than T scores or z-scores.  The term is derived from "standard nine."  Stanines range from 1 (the lowest ability level) to 9 (the highest), with a mean of 5 and a standard deviation of 2.  When raw scores are converted to stanines, each stanine from 2 through 8 spans half of a standard deviation, while stanines 1 and 9 take in the remaining scores at either extreme.  Stanine scores, however, are always reported as whole numbers.  Therefore, a student who has a stanine score of 7 (that is, one standard deviation above the mean) actually fell somewhere from .75 to 1.25 standard deviations above the mean.
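
      The banding of stanines can be sketched as follows.  The function name is hypothetical, and the treatment of a score that falls exactly on a band boundary (here assigned to the higher stanine) is an assumption; publishers typically assign stanines from percentile tables rather than from a formula.

```python
import math


def stanine(z):
    """Convert a z-score to a stanine (a whole number from 1 to 9)."""
    band = math.floor(2 * z + 5.5)   # half-SD bands centered on stanine 5
    return max(1, min(9, band))      # stanines 1 and 9 absorb the tails


# z-scores from .75 up to (but not including) 1.25 all report as stanine 7.
for z in (0.75, 1.0, 1.2, 1.25):
    print(z, stanine(z))
```

      The loop shows why a reported stanine of 7 conceals where in its half-standard-deviation band the student actually scored.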

       Normal curve equivalents.  Normal Curve Equivalents (NCEs) are based on the basic concept underlying percentile ranks, of dividing the scale of scores into 1/100 units.  But NCEs offer some advantages over percentiles in terms of clarity of interpretation.  As mentioned above, use of percentile scores can lead to misinterpretation due to the unequal size of the 1/100 units in their scale.  NCEs have been transformed into equal units across their scale.  Like percentiles, they range from 1 to 99, with a mean of 50.  However, all units in the scale, whether near the mean or at either end, are equivalent.  Thus, unlike with percentiles, if two lower-ability students differ by 5 NCEs, and two average students differ by 5 NCEs, the educational significance of those differences is about the same.  Compared with percentiles, NCEs are "more spread out at the ends and less spread out in the middle" (Gronlund & Linn, 1990).
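
      The relationship between percentile ranks and NCEs can be sketched in Python.  The sketch assumes normally distributed scores and uses the conventional NCE standard deviation of 21.06, a value not stated in this chapter; the function name is hypothetical.

```python
from statistics import NormalDist

NCE_SD = 21.06  # standard deviation conventionally used for the NCE scale


def percentile_to_nce(percentile_rank):
    """Convert a percentile rank (1 to 99) to a normal curve equivalent."""
    z = NormalDist().inv_cdf(percentile_rank / 100)  # z-score for that rank
    return 50 + NCE_SD * z


# Equal 5-point percentile gaps correspond to unequal NCE gaps.
for pr in (1, 5, 10, 45, 50, 55, 90, 99):
    print(pr, round(percentile_to_nce(pr), 1))
```

      In the output, the step from the 5th to the 10th percentile covers far more NCE units than the step from the 45th to the 50th, which is the sense in which NCEs are more spread out at the ends of the scale.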

       Many schools now use NCEs for interpretation and reporting of test scores.  They have become popular because they avoid some of the difficulties inherent with use of grade equivalents and percentile ranks.



B.  Examining Specific Factors:

      Formal Assessment Devices for Reading and Literacy


        A wide variety of standardized, norm-referenced tests are used for assessment of reading and literacy.  Because of the variety, categories often overlap.  The following categorization system will describe norm-referenced tests in two major categories:  Achievement survey tests of reading and literacy provide general information of most value when assessing groups of students.  They include both reading survey tests and the general achievement batteries used in many schools.  Diagnostic reading tests provide detailed information about reading skills appropriate for use in designing literacy curricula for individual students. 

      There are three additional varieties of formal assessments, all of which are designed to fulfill either a survey or a diagnostic purpose or both, that will be described:  Criterion-referenced tests, the Degrees of Reading Power (DRP) Tests (College Entrance Examination Board, 1983), and performance-based assessment tests.  Each type of test has a different purpose and format and taps different aspects of reading.  Formal tests of reading readiness and of emergent literacy are described in Chapter 5--"Assessing Emergent Literacy". 


Reading Achievement Survey Tests

      Reading survey tests are screening devices used to evaluate reading development in a general rather than a specific way.  The tests are designed for group administration.  A reading teacher might use such a test for preliminary screening of children for possible inclusion in Chapter 1 remedial programs or for identification of students who might be making slow progress in reading.  Survey tests are efficient in terms of amount of teacher preparation and effort needed to administer the test and student time needed to complete it.

      Description.  Reading survey tests provide information about a student's basic vocabulary and comprehension.  One example of a reading survey test that uses a unique cloze approach to assessing literacy is the Degrees of Reading Power Test (DRP) (College Entrance Examination Board, 1983), which is described in detail later in this chapter.  An example of a more traditional reading survey which has long been used in reading assessment is the Gates-MacGinitie Reading Test (MacGinitie & MacGinitie, 1989).  These multiple choice, survey tests consist of several test levels designed for preschool through grade 12.  On the vocabulary subtest for third grade, for example, students are given a target word and must choose an appropriate match from a list.  On the comprehension subtest, students read a short passage and answer questions about it.  As with many norm-referenced tests, student scores are reported in terms of stanines, NCEs, percentile ranks, and grade equivalents.  Standard scores called Extended Scale Scores are also provided.

      The Nelson-Denny Reading Test (Brown, Fishco, & Hanna, 1993) is a survey test designed specifically for high school and college students.  Both the Stanford Reading Tests (Psychological Corporation, 1989) and the Metropolitan Achievement Tests Reading Survey Tests (Psychological Corporation, 1985) are actually the reading-related subtests of broader scale general achievement tests (described below), printed separately from the larger test battery for purchase by those who are interested specifically in reading assessment.

      Advantages and disadvantages.  Reading survey tests are effective in screening large numbers of pupils in a short time period and in identifying those whose reading behaviors may fall above or below certain levels.  They are best used for preliminary screening.  In preliminary screening, poorer students are tentatively identified for further assessment, which will be designed to verify the preliminary results and provide more detailed information appropriate for instructional decisionmaking.

      Survey tests are not diagnostic in form or content.  The norm-referenced scores that are obtained alert the examiner to those pupils who score very high, at the average, or very low.  It is important to remember that survey tests are global measures of reading, intended to  measure group achievement or a student's general rank within a group  rather than an individual's achievement.

      Reading survey tests are often chosen by reading specialists for use in school reading programs in preference to general achievement tests (see below).  The reading survey tests offer attention directly targeted to the needs of reading programs.  In addition, reading survey tests such as the Gates-MacGinitie (MacGinitie & MacGinitie, 1989) offer out-of-level norms for better interpretation of the reading performance of disabled readers. 

General achievement tests

      General achievement tests (sometimes called batteries instead of tests--indicating that a large group of subtests are being administered) are used by schools to evaluate achievement in a variety of subject areas, such as mathematics, English usage, spelling, reading, study skills, social studies, and science. 

      Achievement tests are general rather than specific and, in the assessment of reading and literacy, function in similar fashion to the reading survey tests described above.  In fact, the reading subtests of such tests have the same function as reading survey tests; the general achievement tests simply include additional subtests on other skill and subject areas.  Like survey tests, the results of general achievement tests are used to assess growth of individuals and to compare the performance of students within and across classes in the same school and in different schools across the nation. 

      These tests are commonly administered by schools as end of the year tests in May.  Due to the time, effort, and expense needed for their administration, many schools administer them every second or even third year.  They are used for screening purposes, to identify those pupils who demonstrate very high or very low scores in comparison with their peers.  They are also used for general progress evaluation of given classes or schools.  Teacher and school accountability for adequate performance is an important factor when the overall results of general achievement tests are released to the public. 

      Description.  The test content is developed by test authors based on common curricula found at all grade levels throughout the country.  Group achievement tests have several different levels which represent the curricular content common to specific grades.  Students are assigned different levels of the test based on their grade in school.

      Some common achievement tests include:  The California Achievement Test (CAT) (Macmillan/McGraw-Hill School Publishing Company, 1992); the Iowa Tests of Basic Skills (ITBS) (Riverside Publishing Company, 1993); the Metropolitan Achievement Tests (MAT) (Psychological Corporation, 1992); the SRA Achievement Series (Science Research Associates, 1978); the Stanford Achievement Test Series (Psychological Corporation, 1991); and the Comprehensive Test of Basic Skills (Macmillan/McGraw-Hill Publishing Company, 1990).  In addition, some states have developed their own general achievement tests for periodic evaluation of their students.

      Advantages and disadvantages.  As with reading survey tests, scores can be used to compare an individual's overall achievement to his or her peers.  A group or class's overall achievement can be compared to other groups or classes in the same school.  School performance can also be compared with other schools.   In tests designed to obtain survey information, however, the teacher does not learn much information of value in dealing with the instructional needs of specific children:  Seldom is information obtained about how a particular pupil answered specific items on the test, why the student didn't respond correctly, and what thinking processes or reading difficulties might have interfered with his or her response.

      In interpreting scores from these tests, it is important to be aware of the nature of each subtest.  Subtests on different achievement test batteries might have similar names, such as inferential comprehension or paragraph comprehension, but the way the skill is assessed may vary from test to test.   It is important to examine the content of the individual tests before making a judgment about students' achievement in the skills associated with reading.

      Test publishers have expanded their reporting services to provide schools with computer printouts showing the scores of all the pupils within a class.  They also provide individual profiles of pupils and include a computerized analysis of individuals.  These analyses  identify general strengths and weaknesses in the various skills and content areas.  While these can be helpful, instructional decisions and diagnostic statements about individual students must be interpreted with caution, since the tests are more appropriate for decisionmaking about groups than about individuals. 

      Individual achievement tests.  As noted above, most general achievement tests are designed for large scale group administration to entire classes or schools.  Individual achievement tests, however, are designed to be administered one-on-one to a student.  These achievement batteries cover a broad range of skill and content areas and are often administered to students who have special schooling needs.  If teachers are concerned for students who might not follow the directions or who might not perform to the best of their ability on a group administered achievement test, an individually administered achievement test is appropriate.  These tests are often used by special education teachers and reading teachers as a means of obtaining a more accurate estimate of at-risk students' growth in reading and other subject areas.

      As with general achievement tests, individual achievement batteries include a variety of subtests such as word recognition, word analysis, reading comprehension, spelling, mathematics and written language usage.  Commonly used individual achievement test batteries include:  The Basic Achievement Skills Individual Screener (BASIS) (Psychological Corporation, 1983); the Diagnostic Achievement Battery (DAB) (Newcomer, 1990); the Kaufman Test of Educational Achievement (KTEA) (American Guidance Service, 1985); the Peabody Individual Achievement Test-Revised (PIAT-R) (Markwardt, 1989); and the Wide Range Achievement Test (WRAT) (Wilkinson, 1993).  The Diagnostic Achievement Test for Adolescents (DATA) (Newcomer & Bryant, 1993) is designed for older students.

      Teacher-student interaction is an advantage of individually administered tests.  Behaviors of the student on certain subtests and items within tests can be noted so that the test is used not only to measure growth but also to identify strengths and weaknesses in functioning.  This diagnostic aspect, however, does not play as key a role as on actual diagnostic tests, described below.  Another advantage to individually administered tests is that the teacher can draw conclusions as to how seriously the student worked on the test and whether its results are valid, a serious problem with at-risk students taking a group test.  In fact, the one-on-one nature of individual testing usually ensures that students do take them seriously.

Diagnostic tests

      Diagnostic reading tests are designed to provide a detailed profile of individuals' reading strengths and weaknesses.  They are more specific than survey and general achievement tests, in that they assess specific reading skills that should be mastered at certain grade levels.  Diagnostic reading tests differ from survey tests in two ways:  They have a larger number of subtests, in order to evaluate a wider array of reading skills, and they have more items within each subtest, in order to provide greater reliability.

       Diagnostic tests can be designed for either group or individual administration.  While all diagnostic reading tests provide a profile of individuals' skill development in reading, their specific skills strengths and weaknesses, group tests can also help identify skill weaknesses characteristic of the group, such as a class or school, that might be contributing to that group's poor reading performance.

      Description.  Diagnostic reading tests include several subtests that tap specific skills thought to be important in reading.  Diagnostic reading tests are based on the assumption that reading is a skills-based process and that teaching to skill weaknesses will ultimately improve the student's reading ability.  Gronlund & Linn (1990) have warned that the difficulty level of diagnostic tests appears to be lower than that of survey tests, because the former are intended for students who are experiencing difficulty in acquiring the skills and abilities needed for reading.

      Like all norm-referenced tests, the scores obtained will reflect how well a particular pupil scores in relation to others who took the same test and are of the same age or grade level.  They provide information about a pupil's reading in a number of different skill areas, such as:  Auditory discrimination, visual discrimination, letter identification, listening, oral reading and fluency, spelling, blending, phoneme/grapheme identification at a variety of levels (such as initial consonants, final consonants, short vowels, long vowels, and so forth), sight words, context analysis, structural analysis, syllabication, vocabulary, comprehension (both oral and silent) at a variety of levels (such as literal, inferential and evaluative), and reading rate.  Group tests do not include oral reading tasks and have a multiple-choice format.

      Individual reading diagnostic tests are often composed largely of tests of oral and silent reading in which graded passages are read with accompanying comprehension assessment.  In this sense, the tests are similar to Informal Reading Inventories (see Chapter 6--"Informal Reading Assessment"), but IRIs are not normed against a standardization sample.  In addition to the oral and silent reading, the diagnostic batteries include subtests on additional skills.  The Diagnostic Reading Scales (Revised ed.) (DRS) (Spache, 1982), the Durrell Analysis of Reading Difficulty (DARD) (Durrell & Catterson, 1980), and the Gates-McKillop-Horowitz Reading Diagnostic Tests (Gates, McKillop & Horowitz, 1981) are examples of formal reading diagnostic batteries.

      The Woodcock Reading Mastery Tests-Revised (Woodcock, 1987) is another example of a diagnostic reading test.  This test includes a battery of six subtests:  Visual-auditory learning, letter identification, word identification, word attack, word comprehension, and passage comprehension.  Three cluster scores are developed from the six subtests:  Readiness, basic skills and reading comprehension.  Conversion of raw scores to standard scores is a complex process for the examiner, involving several steps.  Scores are reported in terms of a Relative Performance Index, which is the range or band of performance indicated by the test's Standard Error of Measurement.  Scores are also converted to percentiles, grade equivalents, and age equivalents (based on the average scores obtained by students at given ages in the sample  population on which the test was standardized).

      Advantages and disadvantages.  Standardized reading diagnostic tests tend to focus on assessing discrete reading skills rather than reading as a holistic process.  Thus, remedial procedures are often based on skill weaknesses, while other reading processes are neglected.

      In creating group diagnostic tests such as the Stanford Diagnostic Reading Tests (Karlsen & Gardner, 1984), test publishers have sacrificed two important means of obtaining information, namely oral reading and teacher-student interaction.  Group reading diagnostic tests lack oral reading tasks.  They offer the teacher few indications of the processes a child has used in selecting answers.  Teachers may sometimes be able to infer the thinking processes used by children, based on their own experiences with reading.  However, students who have difficulty acquiring reading don't necessarily use the same processes as normal readers or adults.  Manzo (1994), for example, argued that disabled readers tend to use reader-based, top-down skills and to rely more on context, because many of their problems involve word recognition.

      Another difficulty with group diagnostic reading tests involves administration time.  These tests attempt to provide a thorough, reliable analysis of skill areas associated with reading by providing a substantial number of subtests, each of which takes a significant amount of time. They are tedious for students to complete and require a long administration time. 

      Publishers of group diagnostic tests may state in their test manuals that the results can be used for individual instructional and program planning.  But care should be taken in making such judgments based on derived scores, without the insights available from one-on-one sessions with the student (Salvia & Ysseldyke, 1991).  Careful analysis of individual patterns of response is necessary for the tests to be used in program planning.    

      Individual diagnostic reading tests require that the examiner have closely studied and practiced administration of the test.  Like group diagnostic tests, they are time-consuming, but at any point in the administration the examiner can decide to administer only those subtests which are considered necessary for understanding the child's reading behaviors.  Thus, time of testing can be shortened without jeopardizing the results of the test.  The grading of the test after administration is also time consuming for the examiner.

      Although individual diagnostic reading tests are considered formal tests, in that administration occurs under closely controlled conditions, the standardization procedures are sometimes not of the magnitude of most norm-referenced, standardized tests designed for group administration.  Norming often involves a small, limited standardization sample, which does not meet the criteria of a test that has been rigorously standardized.  In fact, the tests may not have been normed in any formal sense at all, but rather simply field-tested with populations.

      Specific diagnostic tests.  Some standardized reading tests are designed to closely assess a specific aspect of reading and literacy, rather than provide a battery of subtests for assessment of the range of reading and literacy skills.  Silent reading comprehension tests are such tests, in that only a student's comprehension under silent reading conditions is assessed.  Since they are norm-referenced and standardized, a student's score can be compared to other students of the same age or grade level.

      The Test of Reading Comprehension (TORC), (Brown, Hammill, & Wiederholt, 1987) is a group silent reading comprehension test.  Eight reading comprehension subtests make up this silent reading battery.  Three subtests are in the General Reading Comprehension Core:  General Vocabulary, Syntactic Similarities, and Paragraph Reading.  Diagnostic supplements include:  Mathematics Vocabulary, Social Studies Vocabulary, and Science Vocabulary.  Other subtests include Sentence Sequencing, and Reading the Directions of Schoolwork.   An overall Reading Comprehension Quotient can also be obtained. 

      The TORC test manual indicates that it was constructed according to psycholinguistic theory, with its emphasis on the syntactic and semantic components of comprehension (Smith, 1978).  But most of the subtests are traditional in format and yield little more information than a survey reading test.  The General Vocabulary subtest and the content area vocabulary subtests, for example, have students choose an answer that best matches a list of target words, and the Paragraph Reading subtest has students answer questions based on reading a short paragraph.  In the Syntactic Similarities subtest, however, students examine several sentences, all of which have similar vocabulary in varied syntactic arrangements.  They must choose the two sentences which mean most nearly the same thing.  In a sample exercise, for example, "Sam plays" and "Sam is playing" would be chosen, not "Sam is going to play."  In the Sentence Sequencing subtest, students read several sentences which represent a story, but the sentences are out of order. Students must decide in what order the sentences should be rearranged. 

      The Gray Oral Reading Test (GORT) (Wiederholt & Bryant, 1986) is an individualized diagnostic test of oral reading and comprehension.  Students read paragraphs at different difficulty levels.  A Passage Score is generated from a combination measurement of reading rate and number of miscues (that is, oral reading errors, called deviations from print in the GORT test manual).  A Comprehension Score is based on multiple choice answers to comprehension questions.  The examiner can also carry out a categorization activity to determine patterns of miscues in the student's oral reading.


Criterion-referenced tests

      Criterion-referenced tests assess learning in terms of the kinds of behaviors or skills that have been mastered at a given level.  As noted earlier in the chapter, norm-referenced tests provide information about an individual's performance in relationship to the norming sample.  Criterion-referenced tests, on the other hand, provide information about an individual's performance in relationship to his or her ability to perform a given task.  For example, a first grade child might have scored "at mastery level" (that is, above a specified criterion score selected by the test publisher) on a given task such as letter identification, but "below mastery level" on another task such as sight word identification.

      Criterion-referenced tests of reading are used primarily to measure individual students' ability to perform in reading and literacy skill tasks, not to compare students' reading behavior with a norming group.  Reading is viewed as an accumulation of skills that can be taught and measured.    Determining whether students have "mastered" certain skills associated with reading is an important function of criterion-referenced tests.

      Description.  Generally criterion-referenced tests have components that assess specific instructional objectives of importance to the curriculum.  A first grade test, for example, might have components assessing such skills as letter recognition, sight word identification, initial consonants, final consonants, initial consonant blends, digraphs, short vowel sounds and others.  All of these are key skills learned at the first grade level.

      Assessment of each objective typically occurs in terms of whether the child has achieved mastery.  Mastery does not mean perfect performance of the objective.  Rather, a predetermined criterion score is used to determine whether a child has mastered the objective.  Often that score is between 75% and 80% correct on items pertaining to the specific objective (Lyman, 1980), but this can vary depending upon the constraints of the test publisher, the school district, or even the classroom teacher.
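
      A mastery decision of this kind amounts to comparing the proportion of items answered correctly with the criterion.  The Python sketch below is only illustrative:  The function name, the item data, and the 80% criterion are assumptions, not the scoring rules of any published test.

```python
def mastery_report(item_results, criterion=0.80):
    """Classify each objective as mastered or not.

    item_results: dict mapping objective name -> list of 1/0 item scores.
    criterion: proportion correct required for mastery (an assumed value).
    """
    report = {}
    for objective, items in item_results.items():
        proportion = sum(items) / len(items)
        report[objective] = "mastery" if proportion >= criterion else "below mastery"
    return report


# Hypothetical first-grade results (1 = correct, 0 = incorrect).
results = {
    "letter identification": [1, 1, 1, 1, 0, 1, 1, 1, 1, 1],
    "sight word identification": [1, 0, 1, 0, 0, 1, 1, 0, 1, 0],
}
print(mastery_report(results))
```

      Here the child answers 9 of 10 letter identification items correctly and is reported at mastery, but answers only 5 of 10 sight word items correctly and is reported below mastery.  Changing the criterion argument shows how sensitive such reports are to the cutoff chosen.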

      Criterion-referenced tests can be either formal or informal, depending upon the efforts made in their construction.  At the one extreme, classroom teachers can and do easily devise informal devices which test the skills they are teaching.  At the other extreme, a test publisher can devise a formal criterion-referenced test and norm it against a standardization sample, or vice versa, thereby providing users of the test both with a list of skills which the children have and have not yet mastered and with derived scores such as percentile ranks, grade equivalents, and NCEs.  The reading subtests of the Stanford Achievement Test (Psychological Corporation, 1989), for example, are  traditional norm-referenced tests and provide teachers with a report on derived scores, such as percentile ranks, stanines, and grade equivalents.  Items on the tests have been further classified according to a variety of skills to yield a criterion-referenced report, in which students are reported to be above, at, or below mastery level for each specified skill. 

      Some teachers, unhappy with the limited information yielded by survey reading and general achievement tests, have developed item classification systems of their own, based on normed tests (Fantauzzo, 1995).  The results can yield information similar to that of a formal criterion-referenced test.  For example, errors on individual items in a test of word recognition might be analyzed and classified as to student strategies (see Figure 4-3).


Insert Figure 4-3:  Informal Classification of Student Strategies on Errors in a Formal Word Recognition Test


      Most published criterion-referenced tests lie somewhere in the middle of these two extremes:  They have been field-tested, but they have not been normed.  Because of the relatively few questions for each skill on such tests, the reliability would be very low, too low for use in instructional decisionmaking without significant corroboration from other assessment sources (Calfee & Hiebert, 1990).     

      One example of a standardized criterion-referenced test is the Prescriptive Reading Inventory Reading System (PRI/RS) (CTB/McGraw-Hill, 1980).  This test measures 171 objectives in four major reading skill areas: 1) Oral Language and Oral Comprehension, 2) Word Attack and Usage, 3) Comprehension, and 4) Reading Applications.

      Advantages and disadvantages.  As mentioned above, reliability for scores on the different skills assessed can be very low.  On some tests, for example, only 3 or 4 items are used to assess each skill.  Since reliability of a score is highly related to the number of items used to assess that score, the reports on mastery and non-mastery of skills would be highly unreliable on such a test.  Important instructional decisions should not be made on the basis of unreliable information.
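
      The link between number of items and reliability can be illustrated with the Spearman-Brown formula, a standard psychometric estimate that is introduced here for illustration and is not named in the sources cited above.  The reliability value and test lengths in the sketch are hypothetical.

```python
def spearman_brown(reliability, length_factor):
    """Predicted reliability when a test is lengthened or shortened.

    reliability: reliability of the original test (0 to 1).
    length_factor: new number of items divided by original number of items.
    """
    k = length_factor
    return (k * reliability) / (1 + (k - 1) * reliability)


# Hypothetical: a 40-item subtest with reliability .90 is cut to 4 items.
print(round(spearman_brown(0.90, 4 / 40), 2))  # 0.47
```

      Under these assumptions, a skill score based on only 4 comparable items would have an estimated reliability of about .47, far below the level of the full 40-item subtest.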

      The major advantage of criterion-referenced assessment is the close match between the test and the classroom curriculum.  A norm-referenced report that states that Mary, a second grader, is in the 60th percentile for second graders on word recognition skills is of limited use in planning instruction for Mary.  A reliable criterion-referenced report, on the other hand, might suggest that Mary has demonstrated mastery at the second grade level for recognition of consonant blends, digraphs and for syllabication skills, but not for structural analysis, short and long vowel sounds, and diphthongs.  Such a report would be extremely useful in planning her future instruction.

      Some criterion-referenced tests include complex management systems in which students' skills that are in need of improvement are matched to published resources for teaching these skills.  This can aid teachers in selection of materials.

Degrees of Reading Power Tests (DRP)

      Purpose.  The Degrees of Reading Power Test (College Entrance Examination Board, 1983) was developed to  assess students' comprehension of short selections by offering a substantially different psychological construct for test taking from that typically taken in standardized tests.  These tests were designed to assess comprehension by requiring readers to integrate content knowledge with their semantic and syntactic uses of language.

      Description.  The DRP is a reading achievement survey test.  As such, it is used to evaluate reading development in a general rather than a specific way.

      The format consists of short paragraphs.  Selected words have been deleted and replaced with blank spaces.  For each blank, students are given a multiple choice item with five target words from which to choose.  The students are required to replace each deleted word in the selection with one that fits the content of the selection and meets the syntactic and semantic requirements of the passage.  

      Degrees of Reading Power tests (DRP) are described by Kibby (1981) as, "highly sophisticated, highly developed, formalized, informal reading inventories."  Results are reported in DRP units, a  derived score that is a unique construct of the test publisher based upon field-testing of the test. The publisher also provides a comprehensive list of readability measures of content texts, basal readers, and literature, all reported in DRP units.  According to the publisher, students' DRP unit scores from the test can be used to locate reading material that is a match with the student's instructional level. 

      Advantages and disadvantages.  A major advantage of the Degrees of Reading Power Test is that it does not yield typical norm-based scores, such as grade equivalent and percentile rank scores, that are so readily misinterpreted.  Instead, the test yields raw scores that are converted to DRP scores, which are uninterpretable except in terms of making matches with the list of reading and instructional materials.  This makes availability of the list a necessity for teachers and for anyone to whom the results are reported.

         Another advantage of the DRP is that the format of the test requires readers to use their knowledge of language (grammar, word meanings, sentence meanings, and passage meanings) to guide their responses.  Also, teachers can easily examine patterns of student word choice selections for closer evaluation of the students' strengths and weaknesses.


Performance-based assessments

      Traditional formal testing of reading and literacy attempts to test student learning through indirect means.  To test reading, for example, children fill in dots on a computer-scorable answer sheet after reading short passages and questions that have limited similarity to actual real life reading tasks.  To test writing, children do not write.  Instead, again they fill in dots in answer to questions someone else has written.

      Slowly over the past few decades, educators have increasingly recognized the limitations of such indirect testing, as well as negative effects involved in putting pressure on teachers to "teach to the test" when the test is not representative of desired reading and writing tasks.  This increasing awareness has led to adoption of a variety of alternative assessment strategies, such as teacher observation, portfolio assessment, and performance-based assessment.

      Description.  Performance-based assessment involves the direct assessment (as opposed to indirect assessment in traditional testing) of student knowledge and ability (Berk, 1986; Stiggins, Conklin & Bridgeford, 1986).  It is based on student performance which integrates processes, skills and concepts and allows the teacher to gauge the depth of student knowledge (DeLain, 1995). Performance-based assessments require students to construct their own answers rather than to identify a correct response on a test (Valencia, 1992).  

      In reading, for example, the assessment task is designed to simulate a real-life reading task as closely as possible.  At first, most performance-based assessment was, in essence, instructional assessment or diagnostic teaching carried out by the classroom teacher (see Chapter 8--"Instructional Assessment").  The teacher would design a task especially for the purpose of evaluation, though it would often have important teaching and learning value, as well, since it would be closely related to the classroom curriculum.  Then the teacher would implement the assessment with the children and evaluate their performance on it.

      For example, perhaps the class was studying a unit on spiders, and the teacher wanted to evaluate the children's ability to find important information from science content readings.  The teacher could design a task in which students read a selection about the characteristics of spiders, then listed the important characteristics in a semantic web.  Evaluation would be based upon how well each child succeeded in including the important characteristics in the web.  In observing the children at work during the task, the teacher could glean additional information pertaining to such issues as independence of student functioning, word recognition ability, vocabulary ability, and ability to bring background knowledge to bear on science reading.  The teacher might use an observational checklist to carry out such observational assessment.

      With the rise of interest in performance-based assessment in the late 1980's, test publishers developed their own formal versions of such devices.  Published performance-based assessments provide the teacher with a script for dialogue and guidelines for interacting with pupils throughout the assessment.  Scoring is designed to be carried out by the teacher and usually involves considerably more effort than traditional multiple choice assessments.

      Some, such as the California Achievement Test/5 K/1 Assessment Activities (Macmillan-McGraw Hill, 1992) and the Performance Assessments for the Iowa Test of Basic Skills (Riverside Publishing Company, 1993) are designed as add-ons to traditional tests.  Others, such as GOALS: A Performance-Based Measure of Achievement (Psychological Corporation, 1992) and The Riverside Performance Assessment Series (Riverside Publishing Company, 1993) are designed as stand-alone tests.

      Only a few performance-based assessments have been normed, including the add-on Performance Assessment series by Riverside Publishing and GOALS from the Psychological Corporation.  Raw scores from these assessments can be transformed to scaled scores, percentile ranks, stanines, grade equivalents and normal curve equivalents. 

      Advantages and disadvantages.  Performance-based assessments provide a bridge between formal and informal testing.  These tests offer a unique format in which students can demonstrate their competence by applying what they have learned in a practical situation.  But publishers face a serious problem in developing instruments that are simultaneously sufficiently generic for use across the country and sufficiently specific to fit an individual teacher's curriculum.  In fact, much of the advantage of informal performance-based assessment is lost in the formalization process.  Readings, for example, are no longer directly tied into classroom curriculum, and the teacher's role becomes less fluid and responsive to observed behaviors and more structured by the assessment routines (Wiggins, 1993).  

      Performance-based assessments offer teachers a choice in how to assess their students, but choice requires reflection about one's purpose for testing.  Teachers should ask some questions before using performance-based assessments:  1) Does the assessment reflect your curriculum?  2) Does the assessment reflect your beliefs about what should be learned?  3) Does the assessment have an impact on your instructional decisions? (Valencia, 1992).  In addition, careful attention should be paid to reliability of assessment results, which remains a serious problem with performance-based testing.

       Performance-based testing is not a replacement for the information gleaned from more traditional formal tests.  One's choice of assessment instrument should be based on the information that is needed about a student's learning. 


      Implementation of Standardized Testing

Advantages and disadvantages

      Much of the controversy about standardized tests in our schools stems from three sources:  the misuse and overuse of standardized tests, the misinterpretation of test results, and the varying purposes for assessment among people with differing relationships to the schools.

      First, there is no doubt but that standardized tests have been misused and overused.  Children and teachers spend a very substantial amount of school time on standardized testing.  Yet in many cases, there is little or no direct educational benefit to the children from those tests.  Test results are filed away without any impact on the planning of instruction.  Sometimes the results are irrelevant to classroom curricula, and sometimes the delay in reporting results makes them out-of-date.  In addition, tests are often used as weapons against students and teachers under the guise of accountability--holding students and teachers accountable for the educational progress in the classroom.  The test becomes the enemy, a hurdle to be overcome by any means necessary--educationally valid or not--even if that means teaching to the test by means of rote drill and practice multiple choice exercises instead of authentic reading and writing experiences.

      Second, as noted repeatedly in this chapter, the reporting of test results to students, teachers, parents, administrators, and governmental and educational policy makers is filled with peril.  Numbers and statistics based on supposedly objective assessment can appear deceptively important--conclusive and scientific--when in fact those numbers and statistics are very limited in their ability to summarize the sum total of what goes on in a child's mind or in a classroom.  Standardized testing is a starting point in assessment, not the final conclusion.  In addition, appropriate interpretation of those numbers and statistics requires a clear understanding not only of statistics and educational measurement, but also of the relationship between standardized testing and classroom practice.

      Third, different people have different uses for assessment.  These differences, especially the differences in purpose between teachers and educational policy makers, have led to some of the most heated controversies in recent years (Calfee & Hiebert, 1993).

      For teachers, the major concerns are:

      1. Validity  (Does the assessment match what we are teaching in the classroom, and the way we are teaching it?)

      2.  Suitability  (Do the methods used to carry out the assessment fit our purposes?)

      3.  Availability  (Will the information I need to know about my children be there when I need it?)

      For policy makers, on the other hand, the major concerns are:

      1.  Reliability  (Does the evidence have adequate statistical, scientific backing?)

      2.  Efficiency  (How inexpensively can the assessment be carried out?)

      3.  Aggregability  (Can the information be expressed simply, in a few numbers that can be easily understood and interpreted?)

      Standardized tests have a place in our schools.  Poor use of the tests, however, does far more harm than good.  Standardized reading tests should never be used as the sole criterion for assessing an educational program, whether for a child, a class, a school, or a district.  They should not be overused, nor over-interpreted.  They should not be misused nor misinterpreted (Farr, 1987). 

      Testing reform movement.  Dissatisfaction with the use of standardized tests as the primary tool for assessing student growth, together with a growing movement that seeks to identify the uniqueness and individuality of learning, has led to a move to bring about reforms in testing and assessment.  Some test publishers have developed instruments that seek to compare individuals' performance as well as highlight their unique development.  Performance-based assessments and Degrees of Reading Power tests, as mentioned earlier in this chapter, represent some of the changes in testing due to the reform movement.  In an attempt to add authenticity to the assessment experience, some standardized tests now use literature selections as the basis of the content.

      There has also been a widespread movement for the use of portfolio assessment in schools (see Chapter 7 for a complete discussion).  In their broadest form, student portfolios include samples of several types of assignments, informal tests, and written expressions.  This accumulation of student work provides an additional means to assess individual growth in a holistic way.

Special concerns for formal reading assessment

      Item construction.  As mentioned above, different publishers assess the same theoretical construct (such as reading comprehension or word recognition) in different ways.  Reading teachers should be especially attuned to these differences, as each approach to assessment may tap into different aspects of the theoretical construct.  For example, in Matthew's case, the classroom teacher and reading specialist were puzzled about the disparate reading comprehension scores this remedial third grade student had received.  On the Gates-MacGinitie Reading Test (MacGinitie & MacGinitie, 1989), he had done fairly well, but on an individually administered test, where he had been asked to read a selection and answer questions about it, he had done poorly.  After observation and instructional assessment, the teachers concluded that he had developed comprehension strategies that allowed him to increase his success despite the constraints of his reading difficulties:  He used key words in printed questions to look back in the text in order to answer the questions.  On the Gates-MacGinitie, he was able to employ these strategies fairly successfully.  On the individualized test, he had not had that opportunity.  Without multiple assessments, these teachers would not have gained such insights into Matthew's comprehension strengths and weaknesses.

      Student behaviors during testing.  Group tests perform best when assessing groups of students.  A small number of students exhibiting inappropriate behaviors during the testing session will not necessarily invalidate the results, since the effect on overall group scores may be insignificant and since some degree of similar inappropriate behaviors probably occurred during administration to the original norming sample.  However, such behaviors could very well completely invalidate the assessment of the students involved.  If a teacher were to examine test results of such a student without knowing about the inappropriate behavior, an incorrect assessment would almost certainly result.

      Examiners should pay close attention to student behaviors during a testing situation and make reports as necessary.  Inappropriate testing behavior is most likely among the very population whose scores might be individually examined.  Examiners should be concerned about issues such as:

      Was there any indication that a student was sharing answers?

      Did a student significantly delay start of the test or take a break in the middle to sharpen a pencil or look around the room?

      Did a student exhibit signs of anxiety, such as lots of erasing, noises of frustration, unease?

      Did a student complete the test too quickly, indicating lack of attention to the questions or answers even at the easy level?

      Did a student exhibit signs of sleepiness?


C.  Demonstrating by Example:  Case Studies


Norm-referenced and criterion-referenced testing:  Sid

      Sid is 8 years, 10 months of age.  He attended kindergarten, pre-first grade, first grade and is currently enrolled in 2nd grade.  While a perusal of his scores suggests that he is developing at an average rate, the fact that he attended pre-first grade suggests that he was not progressing at a normal rate in kindergarten.  Reports from his kindergarten teacher indicated that he was immature, did not socialize with his classmates, did not engage in reading and writing activities such as shared reading and journal writing, and could not identify alphabet letters. His teacher was concerned about his oral language development and considered him to be a high risk pupil. 

      Pre-first grade provided Sid opportunities to engage in non-threatening oral and written language activities.  He progressed nicely and was promoted to first grade.  His first grade teacher stated that he made slow but steady progress in learning sight words and sound/symbol relationships of consonants.  He wrote daily in his journal and used invented spelling. 

      His second grade teacher is concerned about his slow progress in reading, especially his difficulty in recalling sight words and inability to use medial vowels to decode.  She decided to examine the scores from the Stanford Achievement Test (Psychological Corporation, 1989) administered in May of Sid's first grade year, to determine if she could identify areas of strengths and weaknesses.

      Examine the  scores from the Stanford Achievement Test  (see Figure 4-4).  Be sure to note the number of items within each test and Sid's raw scores (RS). 

      Read the information for the first derived score, percentile rank.  Recall that the percentile rank represents how well Sid performed compared to the norming group that was used for this test.  For example, the first item, Total Reading, indicates that Sid is in the 30th percentile in this category, which is a composite score from the three subtests listed immediately below it.  This can be interpreted in the following way:  Sid did as well as or better than 30% of the students of the same age and grade level in the norming group that took this test.  Continue by reading the rest of the percentile information.
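
      The "as well as or better than" reading of a percentile rank can be made concrete with a small sketch.  The norming sample below is hypothetical (20 invented raw scores, not data from the Stanford Achievement Test), and publishers may use more refined procedures, such as special handling of tied scores.

```python
def percentile_rank(raw_score: int, norming_scores: list[int]) -> float:
    """Percentage of the norming group scoring at or below raw_score."""
    at_or_below = sum(1 for score in norming_scores if score <= raw_score)
    return 100 * at_or_below / len(norming_scores)


# Hypothetical raw scores from a 20-pupil norming sample (not real test data).
norming_sample = [12, 15, 18, 20, 21, 23, 24, 25, 27, 28,
                  29, 30, 31, 33, 34, 35, 36, 38, 40, 42]

# A pupil with a raw score of 23 did as well as or better than
# 6 of the 20 pupils in this invented sample.
print(percentile_rank(23, norming_sample))
```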

      Recall that stanines consist of nine bands.  Those whose scores fall within stanines 1, 2, and 3 are at the low end of the scale; those whose scores fall within 4, 5, and 6 are average; and those whose scores fall within 7, 8, and 9 are above average.  Sid's stanine score for Total Reading is 4, which places him at the lower end of the average range.  Continue examining Sid's stanine scores for the rest of the subtests on the Stanford Achievement Test.  What information about Sid's reading development can be found from the percentile and stanines?  What would you say about Sid's reading in relation to his classmates?
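
      The relationship between percentile ranks and stanines can be sketched as a simple lookup.  The cut points below follow the conventional 4-7-12-17-20-17-12-7-4 percent split of the normal curve; they are an assumption for illustration, and tables differ slightly at the boundaries, so a publisher's own conversion table takes precedence.

```python
from bisect import bisect_right

# Lowest percentile rank falling in stanines 2 through 9 under the
# conventional split (assumed here; check the publisher's table).
STANINE_LOWER_BOUNDS = [4, 11, 23, 40, 60, 77, 89, 96]


def stanine_from_percentile(percentile_rank: float) -> int:
    """Map a percentile rank (1-99) to a stanine (1-9)."""
    return bisect_right(STANINE_LOWER_BOUNDS, percentile_rank) + 1


# Sid's Total Reading percentile rank of 30 maps to stanine 4.
print(stanine_from_percentile(30))
```

      With these cut points, a percentile rank of 30 falls in stanine 4, which agrees with the Total Reading scores reported for Sid.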

      Compare Sid's scores on his reading subtests to those on his Math, Language, Spelling, Environment and Listening subtests.  What can you say about Sid's performance in these areas?  What area of development seems to be strongest for him, and which is the weakest?


Insert Figure 4-4:  Test Profile for Sid


      Now examine the criterion-referenced test scores that are listed as Content Clusters.  On the Stanford Achievement Test, the items in the norm-referenced subtests are further classified according to skill, allowing the provision of a criterion-referenced analysis.  In this section, Sid's raw scores on specific skill categories are listed and given a rating of Below Average, Average, Above Average. 

      Interpretation.  An examination of Sid's grade equivalent scores in reading suggests that Sid was performing about 3-5 months below grade level at the end of first grade, and his stanine scores placed him in the average to low-average range.  Sid fell below the 50th percentile on all reading measures.

      Sid's scores on language and spelling tests fall within these same parameters.  It can be tentatively concluded from these scores that Sid's language functioning is in the low-average range for a pupil at his grade level, with one major exception in the area of listening.  Note his high-average percentile and stanine scores in this area.

      Sid's math scores show stanines and percentile ranks to be above average for national norms.  These math scores are much higher than his language scores.

      Further analysis of specific skill scores on the criterion-referenced report shows that structural analysis appears to be a weak area for Sid.  His understanding of vowels may be weak.  Even though the test ranks him at the "average" level in this skill, he only answered 6 of 12 items correctly.  Punctuation also is identified as a weak area.  Most of Sid's skill ratings in the language area are in the average range, even those in the area of listening.  The criterion-referenced report does not provide enough information about specific skills to differentiate low-average from middle-average, as might be done in reporting stanines.  Perhaps the criterion-referenced measures do not have sufficient reliability for such differentiation in reporting.

      Sid does better in mathematics tasks than he does in reading.  At this grade level he probably doesn't have to read story problems in verbal form.  His lower performance in reading may affect his mathematics performance in later years, if it is not remediated.

      The results of this test battery indicate that Sid is functioning at the low-average to average range for reading at his grade level.  This data might be misleading, given his advanced age and his year in pre-first grade.  Sid is a pupil who should be carefully monitored to be sure that learning is progressing.  Keep in mind that, from the information provided here, we do not know how skills were tested in this battery, nor do we know the publisher's operational definition for reading, math, language, and so forth.  Until we examine the test, we cannot be sure what the publisher means by structural analysis, reading comprehension, and word reading, nor do we know how they were measured.   We have very little information regarding how and what we should teach Sid.  We are only dealing with comparisons to norming groups.  With only this formal testing information available, the teacher should conduct informal assessments to form an instructional strategy.


Norm-referenced and criterion-referenced testing:  Aileen

      Examine the scores of Aileen, a fifth grade pupil, on the Comprehensive Tests of Basic Skills (CTBS) (Macmillan/McGraw-Hill School Publishing Company, 1990) in Figure 4-5.


Insert Figure 4-5:  Test Profile for Aileen


      Aileen's teacher has been concerned with her general achievement in the fifth grade classroom.  It is October of the school year and Aileen does not seem to be working at the same level as her peers.  Her teacher decided to examine her standardized test scores from the CTBS/4 given in May of fourth grade.  The standardized test report indicated that when compared to the national norm group, her scores in the total battery were below the fiftieth percentile, the national average.  In Total Reading her scores were as good as or better than approximately 15% of the norming group.  Her Total Language scores were as good as or better than 13% of the norming group.  Her Total Math scores were as good as or better than 34% of the norming group.

      An accompanying printed report, provided by the test scoring service, suggested a need to develop her skills in the following:   Using paragraph context to infer word meaning, spelling consonant sounds in words, analyzing and interpreting passages, and interpreting written forms and techniques.  These suggestions are based on a criterion scoring system derived from an analysis of the test items according to subskills.

      The scores labeled "range" indicate the band of percentile rank score reliability, based on the test's standard error of measurement.  For example, her national percentile on reading vocabulary is 27, indicating that she performed as well as or better than 27% of those in the norming population at the fourth grade level.  The 27th percentile is her obtained score.  When the test's statistical error factor is taken into account through the standard error of measurement (SEM), her hypothetical true score should lie within a band from the 19th percentile to the 38th percentile.
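
      The band from the 19th to the 38th percentile is not centered on 27 in percentile units.  One way to see why is to convert the three values to normal curve equivalents (NCEs), an equal-interval scale defined as 50 + 21.06z.  The sketch below assumes normally distributed scores and is offered only as an illustration; it is not the publisher's procedure for deriving the band.

```python
from statistics import NormalDist


def percentile_to_nce(percentile_rank: float) -> float:
    """Convert a percentile rank to a normal curve equivalent (NCE)."""
    z = NormalDist().inv_cdf(percentile_rank / 100)
    return 50 + 21.06 * z


# Aileen's reading vocabulary band as reported in the text.
for label, pr in [("lower bound", 19), ("obtained", 27), ("upper bound", 38)]:
    print(f"{label:12s} percentile {pr:2d} -> NCE {percentile_to_nce(pr):.1f}")
```

      Under these assumptions the band extends roughly 6 NCE points on either side of the obtained score, so a range that looks lopsided in percentile ranks is close to symmetric on an equal-interval scale.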

      Examine Aileen's scores and determine what you can conclude about her reading, language and mathematics abilities from them.   Compare areas of skill strengths and weaknesses.  Look at the normal curve diagram of derived scores in Figure 4-.  Where do Aileen's scores place her on the NCE scales, the percentile rank scales, and stanines?  How would you describe her performance as a fourth grader, and how would you project she will perform in fifth grade?  What other information would you like to have before planning instruction or remediation?


      Interpretation.  Aileen performed well below the 50th percentile in the reading and language subtests.  An examination of the normal curve suggests that she falls from 1 to 1/2 standard deviations below the mean on most reading and language measures.  This suggests that she is well below her peer group on these measures.  Aileen's scores suggest that she is a remedial reader who has serious difficulty handling written language.  Her mathematics scores are much higher than her reading/language scores.  Her high score in Mathematics Computation suggests that she can do basic work with numbers.  But the lower scores in Mathematical Concepts and Applications indicate that she might have difficulty when she reads math problems or works with higher level mathematical concepts.
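
      A percentile rank can be located on the normal curve by converting it to a z-score, the distance from the mean in standard deviation units.  The sketch below applies this to Aileen's three total scores as reported earlier (percentile ranks of 15, 13, and 34).  It assumes normally distributed scores and is an approximation, not a value taken from the publisher's tables.

```python
from statistics import NormalDist


def z_from_percentile(percentile_rank: float) -> float:
    """Standard deviations above (+) or below (-) the mean."""
    return NormalDist().inv_cdf(percentile_rank / 100)


# Percentile ranks for Aileen's total scores, as given in the text.
aileen_percentiles = {"Total Reading": 15, "Total Language": 13, "Total Math": 34}
for area, pr in aileen_percentiles.items():
    print(f"{area}: {z_from_percentile(pr):+.2f} SD from the mean")
```

      Under the normality assumption, the reading and language totals fall slightly more than one standard deviation below the mean, while the mathematics total falls less than half a standard deviation below it.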

      Her teacher can use these test scores to confirm her observations of Aileen's difficulties in the fifth grade classroom.  However, the scores on this test, as on all general achievement tests, are general rather than specific.  From the printed, verbal skills report from the scoring service, we can tentatively conclude that Aileen has reading and written language difficulties in the areas of higher level reading comprehension, use of context, interpreting written forms and techniques, and spelling consonant sounds.  But we are unsure as to the reliability of these reports, and the vaguely described "interpreting written forms and techniques" statement is unhelpful.

      We do not know if Aileen has a word recognition problem involving grapheme-phoneme correspondence, but we might infer that she does, based on the low spelling score.  We do not have enough information to decide whether Aileen has difficulty understanding what she reads because she can't decode the words, because she doesn't understand the author's message, or because of some other reason.  We know, however, that her reading difficulty did not start in fifth grade.  We also know that her teacher should continue to assess Aileen's reading to determine specific strengths and weaknesses.


D.  Applying Through Practice


Selecting tests.  Read the following descriptions and decide what type of assessments you would use to best meet your purposes, and why you would choose them.


      1.  Mr. Slater.  Mr. Slater is the new reading coordinator in a K-12 school with a total population of 1500.  The school had been without a reading coordinator for several years and the principal noted that the standardized test scores in reading had been dropping a few points each year.  Now, he was concerned about provision for improved reading instruction in both the developmental reading/language arts and the remedial reading programs.  As a first step in this process, the principal asked Mr. Slater to identify those pupils who needed extra help in reading.  Given the discussion of standardized tests in this chapter, as well as your own background knowledge about such tests, what type of testing would you institute to determine which pupils were in need of extra reading instruction?


      2. Ms. Kornell.  Ms. Kornell, a reading coordinator, has been concerned about a group of 20 third graders in her school who scored well below grade level on the reading and language sections of the Iowa Test of Basic Skills (Riverside Publishing Company, 1993), May testing.  Although she has general vocabulary and comprehension information, she doesn't have information about their specific skill strengths and weaknesses. She wants to have a plan in place for these pupils when they return to school in the Fall.  What standardized tests would be appropriate for these purposes?


      3. Mark.  Mark, a third grade student, is not making progress in his classroom reading program.  It is January of the school year and his teacher, Mrs. Ives, wants to know his level of reading.  She feels that he might need special reading instruction with the remedial reading teacher or that he might be learning disabled and need a special education placement.  Mrs. Ives asks the reading teacher to administer a reading test to Mark that will provide her with some of the information she needs to proceed with a referral for remedial reading or special education help.  What standardized tests might be appropriate for Mark?


      4. Barry.  Barry is a fifth grade pupil who has been receiving special reading help for the past two years.  His classroom teacher has recently noticed a positive change in his reading, in that word recognition no longer seems to be a problem.  Barry's classroom teacher asked the reading teacher to administer a test to determine Barry's present reading level and areas of skill development that were still needed.  What test(s) would you administer to Barry?


Reporting scores to parents

      Stewart, a third grade student in the school at which you are a reading specialist, took the Metropolitan Achievement Tests (Psychological Corporation, 1992) in May of the school year.  His scores are listed in Figure 4-6.


Insert Figure 4-6

Test Profile for Stewart


Stewart's mother is concerned about his reading and has planned to conference with you this week. She received this printout but doesn't know how to interpret the information.  She is well educated and is insistent on being provided a clear explanation of the various scores.   How will you explain the various scores, and what will you say about Stewart's reading ability based on the test information?  In addition, consider how you will report the scores to Stewart's classroom teacher and to the building principal.


E.  Reviewing What You've Learned

      Formal assessment involves the use of standardized tests or other tests designed for administration under specified, controlled conditions.  Standardized tests have been administered to a norming population to provide statistical information on the test's reliability and validity, as well as on derived scores.  Percentile ranks, normal curve equivalents, grade equivalents, standard scores, and stanines are described in detail.

      Reading/literacy tests are categorized most broadly as reading survey, general achievement, and reading diagnostic.  For students with special needs, some general achievement and diagnostic tests are designed for individual administration.  Some formal tests of reading focus on specific aspects of the process, such as silent or oral reading.

      Criterion-referenced tests have attempted to better match testing to curriculum by providing measurement of student performance on specific skills as compared to a cut-off criterion score that indicates mastery of the skills.  Performance-based assessment is a more recent development that also attempts to better match testing to curriculum by providing tasks that better simulate classroom activities than the traditional multiple choice format.

      While standardized tests provide the user with information about groups' and individuals' performances, they do not provide information about how readers process print or select answers.  Data from standardized tests is mostly quantitative.  Teachers have limited ability to make qualitative inferences about pupils' strengths and weaknesses in strategies or in specific skill areas. 


F.  Further Reading: For your information

Annotated bibliography of major studies

      Farr, R. & Carey, R. (1986).  Reading: What can be measured?  Newark, DE: International Reading Association.

      This monograph has seven chapters related to assessing reading comprehension, word recognition, vocabulary, study skills and rate.  The authors include a chapter on validity and reliability in reading assessment, as well as accountability in assessment. 


      Baumann, J. (1988).  Reading assessment: An instructional decision-making perspective.  Columbus, OH: Merrill Publishing Co.

      This is an excellent reference for both classroom and clinic.   The author provides interesting chapters on interpreting standardized reading tests, evaluating reading tests, and using test data to make instructional decisions.  He includes a chapter about informal reading inventories and a chapter on making instructional decisions from the data.  There are many examples in this book that help a teacher understand the use of tests for reading assessment.


      Kibby, M. (1981).  The Degrees of Reading Power.  Journal of Reading, 24, 416-427.

      The author describes the development of the Degrees of Reading Power (DRP) test and its relationship to cloze testing.  The article gives a detailed report of the instrument's administration, scoring and interpretation, and reliability and validity.


      Baumann, J. & Stevenson, J. (1982).  Understanding standardized reading test scores.  The Reading Teacher, 35, 648-655.

      This article explains how to interpret three kinds of scores obtained from standardized reading tests: grade equivalents, stanines, and percentiles.  The authors provide a detailed explanation of grade equivalent scores, including how they are developed and how interpolation and extrapolation are used to obtain them.  Percentiles and stanines are also explained in terms of their construction and interpretation.  The advantages and limitations of each score are discussed as well.


      Ysseldyke, J. & Marston, D. (1982).  A critical analysis of standardized reading tests.  School Psychology Review, 11, 257-266.

      Although this article was written for school psychologists, it is also relevant for the reading teacher or classroom teacher who is responsible for test selection.  The test analysis is based on a bottom-up model of reading (units, skills, and knowledge), which the authors use as a framework to assess the content of commonly used standardized tests of reading.  Charts are provided with coefficients of criterion validity, reliability, and ratings of norming procedures.  Standardized tests are also rated on their usefulness for screening, placement, instructional planning, pupil evaluation, and program evaluation.  Even if the reader is not in agreement with the reading model or the uses for such tests, it is important to understand that data from standardized tests are often used in such a straightforward manner.


      Valencia, S. & Pearson, P.D. (1988).  Principles for classroom comprehension assessment.  Remedial and Special Education, 9, 26-35.

      The authors base this paper on an interactive model of reading.  They state positive and negative aspects of several types of assessment instruments, such as standardized tests, criterion-referenced tests, and teacher-made tests, and they recommend appropriate uses for each type.  They provide five principles of reading comprehension assessment:  1) reading assessment must acknowledge the complexity of the process, 2) reading assessment should focus on the orchestration of many kinds of knowledge and skills, 3) reading assessment must allow teachers to assess the dynamic quality of the comprehension process, 4) the teacher is the advocate for students in their progress toward expert reading, and 5) teachers must employ a variety of measures for making instructional decisions.


      Farr, R. (1987).  New trends in reading assessment: Better tests, better uses.  Curriculum Review, pp. 21-23.

      Farr presents an interesting article about the dilemma of testing in our schools.  He raises the issue of tests as accountability instruments and discusses the widespread misinterpretation of test scores.  He suggests the need to develop better tests, to use caution in interpreting scores, to develop additional procedures to collect information, and to assess the process of reading rather than the result.