Generally speaking, there are many ways in which test items, as well as tests, can be biased or unfair to individual test takers or groups of test takers. As far as tests are concerned, the possibility of bias can be investigated by ascertaining whether test scores have the same meaning for members of different subgroups of the population (see Chapter 5). The question of test fairness, on the other hand, is a more complex and controversial issue. Whereas there is general agreement that unfair uses of tests must be avoided, exactly what constitutes fairness in testing is a matter of considerable debate (AERA, APA, NCME, 1999, pp. 74-76). Nevertheless, test users have a major responsibility in implementing fair testing practices through a thoughtful consideration of the appropriateness of instruments for their intended purposes and for potential test takers (Chapter 7).
At the level of test items, questions concerning bias and unfairness are more circumscribed and are usually taken up while a test is under development. To this end, test items are analyzed qualitatively and quantitatively throughout the process of test construction. Naturally, the extent to which test items undergo these reviews is related to the intended purpose of a test. Special care is taken to eliminate any possible bias or unfairness in the items of ability tests that are to be used in making decisions that have significant consequences for test takers.
The qualitative evaluation of test items from the point of view of fairness is based on judgmental procedures conducted by panels of demographically heterogeneous individuals who are qualified by virtue of their sensitivity to such matters and, preferably, by their expertise in the areas covered by a test as well. Typically these reviews occur at two stages. During the initial phase of test construction, when items are written or generated, they are examined in order to (a) screen out any stereotypical depictions of any identifiable subgroup of the population, (b) eliminate items whose content may be offensive to members of minority groups, and (c) ensure that diverse subgroups are appropriately represented in the materials contained in an item pool. In this initial review, individuals who are familiar with the linguistic and cultural habits of the specific subgroups likely to be encountered among potential test takers should also identify item content that may work to the benefit or detriment of any specific group, so that it may be revised. The second stage of qualitative item review occurs later in the process of test construction, after the items have been administered and item performance data have been analyzed separately for different subgroups. At this stage, items that show subgroup differences in indexes of difficulty, discrimination, or both are examined to identify the reasons for such differences and are either revised or discarded as warranted.
The quantitative assessment of item bias has sometimes been linked simply to the differences in the relative difficulty of test items for individuals from diverse demographic groups. However, this interpretation of item bias is viewed as naive by testing professionals who do not consider differences in the relative difficulty of an item for different groups to be sufficient evidence that the item is biased (see, e.g., Drasgow, 1987). Instead, from a psychometric standpoint, an item is considered to be biased only if individuals from different groups who have the same standing on a trait differ in the probability of responding to the item in a specified manner. In tests of ability, for instance, bias may be inferred when persons who possess equal levels of ability, but belong to different demographic groups, have different probabilities of success on an item. Thus, in the testing literature, item bias is more properly described as differential item functioning (DIF), a label that more pointedly denotes instances in which the relationship between item performance and the construct assessed by a test differs across two or more groups.
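The psychometric definition just given can be written compactly. The notation below is an assumption introduced here for illustration and does not come from the text: $X$ is the scored response to the item (1 for the specified response, such as a correct answer), $\theta$ is the test taker's standing on the trait, and $G$ is group membership, with $R$ and $F$ standing for two groups being compared.

```latex
% No DIF: at every trait level, the response probability is the same in both groups.
\[
  P(X = 1 \mid \theta,\, G = R) \;=\; P(X = 1 \mid \theta,\, G = F)
  \quad \text{for all } \theta .
\]
% DIF: the two conditional probabilities differ for at least one trait level.
\[
  P(X = 1 \mid \theta,\, G = R) \;\neq\; P(X = 1 \mid \theta,\, G = F)
  \quad \text{for some } \theta .
\]
```

Note that the comparison is conditional on $\theta$. A difference in the unconditional proportions, $P(X = 1 \mid G = R) \neq P(X = 1 \mid G = F)$, is the difference in relative difficulty described above and is not, by itself, evidence of DIF.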
Classical procedures for the quantitative analysis of DIF involve comparisons of the item difficulty and item discrimination statistics for different groups. For example, if a test item has a low correlation with total test score (i.e., poor discrimination) and is more difficult for females than for males, it would obviously be suspect and should be discarded. However, the analysis of DIF by means of simple comparisons of the item-test correlations and p values for different groups is complicated by the fact that groups of various kinds (e.g., sex groups, ethnic groups, or socioeconomic groups) often differ in terms of their average performance and variability, especially on ability tests. When group differences of this kind are found in the distributions of test scores, (a) item difficulty statistics become confounded by valid differences between groups in the ability that a test measures and (b) correlational indexes of item discrimination are affected by the differences in variability within the groups being compared. Because of these complicating factors, traditional item analysis statistics have not proved very helpful in detecting differential item functioning.
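As a rough illustration of the classical statistics discussed in this paragraph, the following Python sketch computes an item's p value (proportion passing) and its item-test correlation for one group at a time. The function names and the data layout (parallel lists of 0/1 item scores and total scores) are assumptions made for this example, and the correlation is left uncorrected, so the item itself is included in the total score.

```python
from statistics import mean, pstdev


def item_p_value(item_scores):
    """Classical item difficulty: proportion of test takers passing the item."""
    return mean(item_scores)


def item_test_correlation(item_scores, total_scores):
    """Pearson (point-biserial) correlation between 0/1 item scores and totals."""
    m_item, m_total = mean(item_scores), mean(total_scores)
    s_item, s_total = pstdev(item_scores), pstdev(total_scores)
    if s_item == 0 or s_total == 0:
        # No variability in one of the variables: the correlation is undefined.
        return float("nan")
    cov = mean(
        (i - m_item) * (t - m_total) for i, t in zip(item_scores, total_scores)
    )
    return cov / (s_item * s_total)


def classical_item_stats_by_group(records):
    """records: iterable of (group_label, item_score, total_score) tuples.

    Returns {group_label: (p_value, item_test_correlation)}.
    """
    by_group = {}
    for group, item, total in records:
        items, totals = by_group.setdefault(group, ([], []))
        items.append(item)
        totals.append(total)
    return {
        g: (item_p_value(items), item_test_correlation(items, totals))
        for g, (items, totals) in by_group.items()
    }
```

Statistics computed this way inherit the problems noted above: a lower p value in one group may reflect a valid difference in the ability being measured, and the correlation shrinks in whichever group has less variability in total scores.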
The proper assessment and study of DIF requires specialized methods, and a number of them have been proposed. One of the most commonly used is the Mantel-Haenszel technique (Holland & Thayer, 1988), which expands on traditional item analytic procedures. In this type of analysis, each of the groups in question (e.g., majority and minority groups) is divided into subgroups based on total test score, and item performance is assessed across comparable subgroups. Although this method is more refined than the simple comparison of item analysis statistics across groups, the Mantel-Haenszel procedure still relies on an internal criterion (total score) that may be insensitive to differences in item functioning across groups, and its ability to detect DIF is substantially dependent on the use of very large groups (Mazor, Clauser, & Hambleton, 1992).
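The core of the Mantel-Haenszel approach can be sketched in a few lines of Python. The sketch below stratifies test takers by total score, builds a 2 x 2 table (group by item outcome) within each stratum, and pools the tables into the Mantel-Haenszel common odds ratio. The group labels "ref" and "focal", the function name, and the record layout are assumptions made for this example, and the sketch omits the significance test that normally accompanies the estimate.

```python
from collections import defaultdict


def mantel_haenszel_odds_ratio(records):
    """Pooled odds ratio for one item across total-score strata.

    records: iterable of (group, total_score, item_correct) tuples, where
    group is "ref" or "focal" (hypothetical labels) and item_correct is 0 or 1.
    Values near 1 suggest little DIF; values above 1 mean that the item
    favors the reference group among test takers with the same total score.
    """
    # Per stratum: [ref correct, ref incorrect, focal correct, focal incorrect]
    tables = defaultdict(lambda: [0, 0, 0, 0])
    for group, total, correct in records:
        col = (0 if group == "ref" else 2) + (0 if correct else 1)
        tables[total][col] += 1

    numerator = denominator = 0.0
    for a, b, c, d in tables.values():
        n = a + b + c + d
        numerator += a * d / n
        denominator += b * c / n

    if denominator == 0:
        # Too little data in the discordant cells to form the ratio.
        return float("nan")
    return numerator / denominator
```

Because each stratum contributes only the test takers who share a given total score, sparse strata carry little information, which is one way to see why the procedure depends on large groups.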
Item response theory provides a much better foundation for investigating DIF than classical test theory methods. In order to establish whether individuals from different groups with equal levels of a latent trait perform differently on an item, it is necessary to locate persons from two or more groups on a common scale of ability. The IRT procedures for accomplishing this goal start by identifying a set of anchor items that show no DIF across the groups of interest. Once this is done, additional items can be evaluated for DIF by comparing the estimates of item parameters and the ICCs obtained separately for each group. If the parameters and ICCs derived from two groups for a given item are substantially the same, it may be safely inferred that the item functions equally well for both groups. Not surprisingly, IRT procedures are becoming the methods of choice for detecting DIF (see Embretson & Reise, 2000, chap. 10, for additional details).
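The comparison of ICCs described above can be illustrated with a short Python sketch. It assumes, purely for the sake of the example, a two-parameter logistic model (the text does not specify a model) and assumes that the item parameters estimated in each group have already been placed on a common scale by means of the anchor items. The function names, the ability range, and the summary index (the average absolute gap between the two curves) are likewise choices made here, not a prescribed procedure.

```python
import math


def icc_2pl(theta, a, b):
    """Probability of success at ability theta under a two-parameter logistic model.

    a: item discrimination; b: item difficulty (assumed already on a common scale).
    """
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


def mean_abs_icc_gap(params_group1, params_group2, lo=-4.0, hi=4.0, steps=161):
    """Average absolute difference between two groups' ICCs over an ability grid.

    params_group1, params_group2: (a, b) estimates for the same item, obtained
    separately in each group. A gap of zero means the curves coincide.
    """
    thetas = [lo + (hi - lo) * k / (steps - 1) for k in range(steps)]
    gaps = [
        abs(icc_2pl(t, *params_group1) - icc_2pl(t, *params_group2))
        for t in thetas
    ]
    return sum(gaps) / len(gaps)


# Hypothetical estimates for one item: same discrimination, but the item is
# harder for group 2 at every ability level.
gap = mean_abs_icc_gap((1.2, 0.0), (1.2, 0.5))
```

How large a gap must be before an item is flagged is a judgment that depends on sample size and the precision of the parameter estimates; the sketch only quantifies how far apart two estimated curves are.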
As noted in Chapter 3, the use of IRT methods in test development and item calibration does not preclude the normative or criterion-referenced interpretation of test scores. In fact, because of its more refined methods for calibrating test items and for assessing measurement error, IRT can enhance the interpretation of test scores. Although IRT cannot provide solutions to all psychological measurement problems, it has already helped to bring about a more disciplined and objective approach to test development in the areas in which it has been applied.
At present, IRT methods are being applied most extensively in developing computerized adaptive tests used in large-scale testing programs, such as the SAT and the ASVAB. Development of tests of this type requires input from individuals with considerable technical expertise in mathematics and computer programming in addition to knowledge of the content area covered by the tests. More limited applications of IRT methods have been in use for some time. For instance, the assessment of item difficulty parameters through IRT methods has become fairly common in the development of ability and achievement batteries, such as the Differential Ability Scales, the Wechsler scales, the Wide Range Achievement tests, and the Woodcock tests. Item-response-theory models are also being used increasingly in the assessment of DIF in cognitive tests. Although IRT methods hold promise in the field of personality testing, their application in this area has been much more limited than in ability testing (Embretson & Reise, 2000, chap. 12).