

Why the SAT Is Full of Crap

by j.w.gibson, 2006

Abstract

Standardized tests such as the SAT and the Graduate Record Examination (GRE) have been used extensively by institutions of higher education to aid in the selection of prospective students. The usefulness of such aptitude tests has been hotly debated by both the testing industry and the educational research community. Because the SAT is heavily loaded with general intelligence (g) type questions, which may be biased in one way or another, reliance on the SAT to screen prospective students may well exclude members of groups who would otherwise be successful in college. Arguments concerning intelligence and the statistical validity of the SAT are discussed.

Group Differences, Standardized Testing, and
Intelligence: The Invalidity of the SAT

The SAT is the nation's oldest and most widely used college entrance exam. Thousands of colleges require it for admissions selection. Students spend over $225 million on the SAT and preparatory materials annually (FairTest, 2001a). Testing companies such as Educational Testing Service (ETS) and the College Board, which publish and administer the SAT and the Graduate Record Examination (GRE), have built a monopoly on the notion that these tests, and standardized tests in general, are good measures of a student's abilities and, furthermore, that they reliably predict future success in college. Considering the effect of a college education on lifetime earning potential, access to and success in college is a highly sought goal for many Americans.

However, research has called into question the reliability and usefulness of standardized tests, especially the SAT, in the selection of qualified applicants. ETS no longer calls the SAT an achievement or aptitude test, even though it has been and continues to be seen as one. Instead it refers to the test simply as the SAT, unsure as to what exactly it does measure (FairTest, 2001b). All types of aptitude and achievement tests are substantially related to intelligence measures, specifically the general factor of intelligence, g (Jensen, 2000). Research is also fairly conclusive that IQ scores differ between groups in the American population. If this is indeed the case, then the reputation of the SAT may be called into question.

College admissions offices seek to select the most able and talented students. If they are led to believe that the SAT adds significant predictive power over High School Grade Point Average (HGPA) and class rank, they are likely to require certain minimum scores as acceptance standards. It seems apparent, then, that admissions boards at some colleges will unnecessarily overlook many well-qualified applicants from groups against which the SAT is biased.

This paper attempts to discredit the utility of standardized tests such as the SAT and GRE on statistical grounds, with specific respect to the variability of cultural experiences that leads to differential scores between groups but does not fully represent students' ability to succeed in higher education. Following Sacks (1999), this paper will demonstrate that standardized tests have a dubious ability to predict future academic success because (1) standardized tests are proxy measures for IQ, which has been shown to be biased against non-Caucasian cultural experiences; (2) standardized test scores are most highly correlated with socioeconomic status; and (3) standardized tests do not add significantly to the predictive power that colleges seek for enrollment decisions. Statistical arguments claiming to demonstrate the validity of the SAT will be dissected. The issue of group differences in IQ and SAT scores will be addressed as it applies to the frequency distribution of SAT scores.

How the IQ Test and SAT Came to Be

To understand the fascination with standardized testing in American society today, it is necessary to present some background on the birth of intelligence testing. Mental measurement began when Charles Spearman adopted Galton's view of intelligence as a "general ability," as opposed to intelligence in different domains (Sacks, 1999), and set out to measure it. In 1904 Spearman published a paper delineating his discovery of g, the "general factor" of intelligence (Sacks, 1999), which, unsurprisingly, happened to be measured largely through the classics of formal education. He was so sure of this general factor that he elevated it to a law, one of the natural scientific discoveries of the century (Sacks, 1999).

Setting out to "prove" his description of g, Spearman set up a series of experiments. Using school children, he measured abilities in a number of different domains such as English, math, French, sensory discrimination, and music. He then utilized factor analysis to extract g from the subject areas and found correlations between English and French and intelligence as high as r = .83 (Sacks, 1999). In further studies Spearman found an almost perfect correlation between the classics of Greek and Latin and general intelligence, r = .99: seemingly irrefutable evidence that g existed as a general factor of intelligence. Unfortunately, Spearman's own cultural and social status biased this understanding of measurement, and that bias remains embedded in the legacy of intelligence testing today.

Alfred Binet and a physician named Theodore Simon followed Spearman's g and developed the first "true" IQ measure, the Binet-Simon Scale (Sacks, 1999). After importation to the U.S., the Binet-Simon Scale was changed slightly by Lewis Terman of Stanford University. The new Stanford-Binet Intelligence Scale was born, and it remains one of the most commonly used intelligence tests in the United States today (Carter, 2002; Sacks, 1999). The Stanford-Binet Intelligence Scale is based on the same style of questions as its predecessor: situations are presented and respondents are required to choose the best possible answer in a multiple-choice-like format. The Binet-Simon Scale attempted to find the most difficult situation a person could solve and then calculated that against chronological age (Sacks, 1999). The Stanford-Binet Scale attempts to measure intelligence in a number of domains: verbal, analytical reasoning, and so on.

The major problem with this strategy for assessing intelligence is that the situations presented reflect the aristocratic cultural experience of the test makers. The bias is buried in the types of questions asked. Sacks (1999) provides an excellent sample question from the original Binet-Simon scale:
"When the house is on fire, what must one do?" The three answer sets provided are as follows. Set 1: "Call the fireman.—Telephone." Set 2: "Save oneself.—Run into the street.—One must run so as not to be burned." Set 3: "One must get away.—One must put out the fire" (p. 23).

Coming up with the right answer inherently requires a healthy amount of the test writers' social and cultural knowledge. How could a poor nomadic farmer respond properly to this question? The correct answer, according to the Binet-Simon scale, is to call the fire department (Sacks, 1999). However, that answer demands knowledge of fire departments, telephones, and other information that is absolutely bound by socio-cultural experience.

Terman did little to effectively remove the cultural bias from the "new" intelligence scale. Indeed, if anything he passed it on to Brigham, an Army psychologist. During and after World War I the armed forces saw an explosion in the need for tests that could adequately sort potential recruits; however, the paranoia of "others" persisted, and these tests were notoriously racist and anti-immigrant (FairTest, 2001). Following his experience with the Army's testing program, Brigham returned to Princeton University's admissions office, where he created the first Scholastic Aptitude Test from the Stanford-Binet Intelligence Scales (Sacks, 1999). Even as long ago as 1991, the test's reach was impressive: "of the more than 3500 four-year universities in the United States, at least 1600 now use the test [SAT] as part of their admissions process" (Crouse & Trusheim, 1991).

Standardized testing has incited debate since the early part of the last century. Even though researchers from a broad range of perspectives have argued against its validity, its usefulness, and its denigrating effect on the growth of educational theory, ETS and others persist in marketing the SAT as a necessary tool for university admissions programs. In a bold address, Richard C. Atkinson, the president of the University of California system, called for several distinct changes in admissions processes, including the abandonment of mandatory SAT score analysis and of formulaic processing of student information (Atkinson, 2001). Perhaps the question of the utility of "aptitude" testing can soon be laid to rest.

Standardized Tests As Proxy Measures of Intelligence

Acceptable definitions of general intelligence are hard to come by. The Encarta World English Dictionary (Microsoft Office, 2004) defines intelligence as "the ability to learn facts and skills and apply them, especially when this ability is highly developed." If this definition is acceptable, then there is a serious disconnect between what the SAT professes to measure and what it actually measures. The SAT began as an aptitude test, meaning that it tries to assess the likelihood of students being able to acquire new sets of knowledge and skills. "Acquire new sets of skills" and "ability to learn facts and skills" are essentially one and the same. The SAT was developed from the Stanford-Binet Intelligence Scale, and so it naturally correlates highly with general measures of IQ (Jensen, 2000) rather than with what students have actually achieved.

Researchers identified differences in IQ scores between races early on. In the classic ranking, Asian Americans come first, followed by Caucasian Americans, then Hispanic and African Americans (Onwuegbuzie & Daley, 2001). Additionally, the difference between Asian and Caucasian Americans is a mere 3 points, while 15 points separate Caucasian and African Americans (Onwuegbuzie & Daley, 2001). What can account for this well-documented phenomenon? Even Alfred Binet cautioned about the "brutal pessimism" that was possible if IQ tests were misread as immutable measures of potential (Sacks, 1999).

But what exactly is this mysterious g that Americans are so obsessed with, and that colleges feel inclined to measure before most students can be accepted? The g factor has been said to permeate all other intellectual abilities and is thought to explain the differences between individuals in performance on any range of mental tests (Gottfredson, 1998). The g factor may well represent some biological construct that helps humans to reason and adapt to complexity.

One definition that comes from an evolutionary perspective is that g is another evolved psychological mechanism, designed specifically to solve evolutionarily novel problems (Kanazawa, 2004). Evolutionary psychologists generally agree that the brain contains numerous domain-specific (function-specific) modules or "decision rules," including language acquisition, mate selection, cheater detection, and face recognition, to name a few (Cosmides & Tooby, n.d.; Barkow et al., 1992; Buss, 2003). If Kanazawa (2004) is right, as his evidence suggests, then the purpose of a general problem-solving mechanism would be to spontaneously reason out solutions to new situations. Kanazawa (2004) goes on to argue that if this is the case, we might expect to see differences in g, as measured on IQ tests, around different parts of the world. Specifically, he predicts that IQ will tend to rise with distance from the ancestral homeland of the human species; that is, IQ will correlate with distance from Africa. The reasoning is that the closer one is to the ancestral homeland, the less likely novel problems are, because of the stability of the environment over time. As the distance and variability between ancestral and new environments increase, the number and type of evolutionarily novel problems are likely to increase substantially, eventually selecting for a noticeable change in the g factor.

In one analysis of the geographic distribution of IQ scores across the world, it was found that "the [mean population] IQs in sub-Saharan Africa are significantly lower than those in the rest of the world (68.8 vs. 89.1), t(183) = 15.88, p < .001" (Kanazawa, 2004, p. 521). The sub-Saharan environment is far more familiar to the human brain than the rest of the world, so on this account it makes sense that IQ would evolve more quickly in environments presenting more novel situations. Some might react against the racial argument that seems implied, but in fact race probably has little or nothing to do with IQ: Kanazawa (2004) also noted that Africans in the Caribbean and South Pacific have significantly higher IQs than those in sub-Saharan Africa (68.8 vs. 80.5), t(68) = 10.12, p < .001. Consider what this means for scores on standardized tests in America. What is it that colleges are actually measuring?

The Problem of Group Differences in Standardized Testing

Psychometricians have questioned the statistical validity of the SAT to predict students' success in college for at least 50 years (Crouse & Trusheim, 1991). The major complaint has centered on the different means and standard deviations of scores for different groups. Most analyses have tended to emphasize the differences between racial groups, most notably between black and white Americans. However, newer evidence suggests that the most striking differences in mean scores lie between socioeconomic classes (Sacks, 1999).
The number of questions answered correctly on the SAT does not correspond directly with the final score, making the measurement scale ordinal. In other words, one cannot find a direct "translation" or equal interval between two different scores. For instance, the difference between 400 and 440, or between 700 and 740, does not represent an equidistant change in ability (Halpern, 2000). This is further complicated by the fact that the SAT is supposed to measure what is learned, yet it clearly correlates more with IQ than with achievement, which prompted an earlier change of name to "aptitude."
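
One way to see the non-equal-interval problem is to translate scores into percentile ranks. The sketch below is purely illustrative: it assumes, only for the sake of the example, that SAT section scores are roughly normally distributed with a mean of 500 and a standard deviation of 110 (assumed values, not figures from the sources cited here).

```python
# Illustrative only: assumed normal score distribution (mean 500, SD 110).
from scipy.stats import norm

MEAN, SD = 500, 110

def percentile(score):
    """Percent of test takers at or below `score` under the assumed model."""
    return 100 * norm.cdf((score - MEAN) / SD)

for low, high in [(400, 440), (700, 740)]:
    print(f"{low} -> {high}: {percentile(low):.1f}th to {percentile(high):.1f}th percentile")
# Under this model, 400 -> 440 moves a student past roughly 11% of test
# takers, while 700 -> 740 moves a student past only about 2%: equal
# score gaps are not equal gaps in measured standing.
```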

What, then, are the greatest objections to standardized testing? First, the predictive power (the predictive validity) of the SAT in helping colleges sort students is exceedingly low. Second, it has been demonstrated time and time again "that identifiable groups in our society have different distributions of test scores, all of them more or less normally distributed but with different means and different standard deviations, and with means of some groups differing by more than one standard deviation" (Jensen, 2000, p. 121).

Colleges use a combination of test scores and Grade Point Average to determine whether to admit or deny students. Testing companies such as ETS claim that SAT scores are useful to colleges because they help to more accurately predict a student's opportunity for success in college. However, the SAT I has been shown again and again to be predictive of two things only: first-year grades (Carter, 2002) and how well one will do on other standardized tests (Sacks, 1999). The correlation coefficient, a measure of the degree of covariation between variables (Howell, 2004), is .2 to .5 at its highest for the SAT (Carter, 2002), although even this is misleading, because the proportion of variance actually explained is the r-squared measure (.04 to .25). In essence, the SAT accounts for somewhere around 14% of the variability in freshman grades; in other words, it cannot account for 86% of the differences in freshman grades. Using class rank and SAT scores together leads to only 1 to 3% fewer errors in prediction than class rank alone (Carter, 2002), making the actual utility of the SAT disappear almost completely.
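
To make the arithmetic concrete: squaring a correlation coefficient gives the proportion of variance explained. The middle value below, r = .375 (simply the midpoint of the reported .2 to .5 range), reproduces the roughly 14%/86% split described above.

```python
# Variance explained (r squared) across the reported correlation range.
for r in (0.2, 0.375, 0.5):
    r2 = r ** 2
    print(f"r = {r:.3f} -> r^2 = {r2:.3f} "
          f"({r2:.0%} of grade variance explained, {1 - r2:.0%} unexplained)")
```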

When colleges use cut-off scores combined with Grade Point Average (GPA) from high school, they are likely to miss a large number of capable students from different groups by selecting only the highest score combinations. For example, the University of California admits only those applicants whose combined GPA and SAT scores place them in the top 12.5% of high school graduates (Jensen, 2000). Group distributions fall in somewhat different but overlapping regions of the overall frequency distribution. When this happens, cut-off scores are likely to adversely affect any group concentrated toward the lower tail of the distribution, with a lower mean score. In effect, this disproportionate "selection" of members of the higher-scoring group biases the test in favor of one group over another.
For instance, imagine that two groups share a reasonably similar frequency distribution and standard deviation (SD), but their means differ by one SD. If the cut-off score were placed at the mean of the higher-scoring group, nearly 50% of that group would surpass the cut-off while only 16% of the other group would (Jensen, 2000). The scenario becomes more severe when the cut-off is placed one SD above the mean of the higher-scoring group, resulting in only about 3% of the lower-scoring group making the cut-off.
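
A minimal sketch of this cut-off arithmetic, assuming two normal distributions with equal SDs and means one SD apart (standardized units chosen only to mirror the scenario above):

```python
# Share of each group clearing a cut-off when group means sit one SD apart.
from scipy.stats import norm

def share_above(cutoff, mean, sd=1.0):
    """Fraction of a normal(mean, sd) group scoring above `cutoff`."""
    return norm.sf((cutoff - mean) / sd)

high_mean, low_mean = 1.0, 0.0  # group means differ by one SD

for label, cutoff in [("cut-off at higher group's mean", 1.0),
                      ("cut-off one SD above that mean", 2.0)]:
    print(f"{label}: higher group {share_above(cutoff, high_mean):.1%}, "
          f"lower group {share_above(cutoff, low_mean):.1%}")
# Prints 50.0% vs. 15.9% for the first cut-off, and 15.9% vs. 2.3% for
# the second, matching (to rounding) the figures cited above.
```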

Studies of the GRE have shown even poorer ability to predict future success when success is measured by the number of articles published by students who graduated from PhD programs. "In chemistry, the correlation of number of articles and book chapters with GRE-Verbal was -.02; with GRE-Quantitative it was -.01; and with GRE-Advanced it was .15…For all historians, these correlations were -.24, -.14, and .00. For all psychologists, the correlations were -.05, -.02, and .02" (Carter, 2002, p. 2). One explanation may be the type of thinking required on standardized tests, which tend to reward rote memorization and other types of superficial thinking. Graduate programs, on the other hand, tend to emphasize highly creative and analytical skills, which are rarely exercised on standardized tests.

If colleges are truly interested in recruiting and training the best students, then these correlation coefficients reveal substantially low predictive power. For admissions processes that already use the achievement-based SAT II tests, adding the SAT I raises the explained variance only from 21.0 to 21.1 percent, a "trivial increment" (Atkinson, 2001).

What Standardized Tests Really Measure

ETS states that the SAT I, the letters of which no longer stand for anything, is not an aptitude or achievement test but reliably predicts first-year college grades only (ETS, 2003, #20395). There is considerable concern about what standardized tests actually measure. Do they measure what has been learned or the ability to learn? General intelligence (g), or IQ, is highly correlated with standardized test performance. Jensen (2000) has explicitly stated that the g factor is the main "active ingredient" in a mental test's practical predictive validity. This applies especially to aptitude tests, which have been constructed in the cultural language of educational experience and not necessarily around aptitude (Sacks, 1999). It is disconcerting that the strongest correlate of SAT scores is family income, followed by parental education level, specifically that of the father (Sacks, 1999). Over the last forty years, SAT scores have been positively correlated with family income (Carter, 2002). For instance, the average SAT score associated with a family income of 30-40 thousand dollars a year is 885, compared with an average score of 1000 for students from families earning over 70 thousand dollars a year (Carter, 2002). Even a former director of minority affairs at ETS has stated that the SAT is a much more reliable predictor for students in the 90th percentile than for those in the 10th percentile (r = .48 vs. r = .17, respectively) (Bridgeman et al., 2004).

The difference in correlation coefficients across income levels presents a difficult problem for both college boards and researchers in designing or utilizing standardized tests. If there is a substantial difference in the ability of the SAT to measure poor students' potential, regardless of their actual potential, then adjustments must be made in the interpretation of SAT scores. In practice, however, this is not necessarily done.

Statistical Problems with Support For SAT Validity

Despite almost 100 years of controversy and academic debate, there is no clear answer to the question of the SAT's validity. Researchers have attempted to prove that there is no reliable evidence of internal bias against different groups on standardized tests. However, the problem with this conclusion may lie in the inadequacy of certain statistical methods for isolating significant relationships.

For instance, after a review of numerous studies of internal bias, one respected researcher concluded that "there is no evidence of internal bias in standardized tests" (Camilli & Shepard, 1987). Most of those studies were conducted using Analysis of Variance (ANOVA). However, what that researcher (see Jensen, 1980) failed to account for was the inadequacy of the ANOVA method for detecting test bias. ANOVA provides an omnibus test of item p-value differences (Camilli & Shepard, 1987). In other words, it indicates differences in population means; it cannot characterize the meaning of those differences (Levine & Stephan, 2005). In one simulated example, bias nearly doubled a group mean difference, from 1.16 to 2.11 standard deviations, and the difference was interpreted as "large" and "real" because the group variance was more than five times as large as the interaction variance (Camilli & Shepard, 1987).
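
The toy simulation below (with invented numbers, not Camilli and Shepard's data) illustrates the structural point: in a two-way group-by-item analysis of item p-values, pervasive bias loads onto the group main effect, where it reads as a "real" group difference, while the group-by-item interaction, the only term that could flag item-level bias, stays comparatively small.

```python
# Toy simulation: pervasive item bias is absorbed into the group main
# effect of a two-way group x item layout. All numbers are invented.
import numpy as np

rng = np.random.default_rng(0)

n_items = 32
difficulty = rng.uniform(0.4, 0.8, n_items)   # item p-values for group A
true_gap = 0.05                               # genuine group difference
bias = rng.uniform(0.05, 0.15, n_items)       # cultural bias on every item

# 2 x 32 table of item p-values, one observation per cell
table = np.vstack([difficulty, difficulty - true_gap - bias])

grand = table.mean()
g_eff = table.mean(axis=1, keepdims=True) - grand  # group main effects
i_eff = table.mean(axis=0, keepdims=True) - grand  # item main effects
interaction = table - grand - g_eff - i_eff        # group x item residual

ss_group = n_items * (g_eff ** 2).sum()
ss_interaction = (interaction ** 2).sum()
print(f"SS_group = {ss_group:.3f}   SS_interaction = {ss_interaction:.3f}")
# SS_group dwarfs SS_interaction: the bias shows up as a large "real"
# mean difference, while the omnibus test finds "no evidence" of
# item-level bias.
```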

The largest difference between groups has historically existed between black and white Americans. On standardized tests containing culturally biased questions, differences between group means shrink when those questions are removed. One study found that when seven [biased] items were removed from a 32-item math test, the mean difference between blacks and whites was reduced from .91 to .81 standard deviations (Camilli & Shepard, 1987). In other words, the simulated examples showed that built-in bias adds more to the "between-group variance than to the interaction variance" (Camilli & Shepard, 1987, p. 97).

The GRE is the big brother of the SAT, and questions about validity plague it as well. Kuncel, Hezlett, and Ones (2001) published a meta-analysis of the GRE's performance in predicting graduate school success. They concluded that, utilizing a combination of subtests (i.e., verbal, quantitative, analytical), the predictive validity coefficients range from .41 to .56. Analyzing over 1,700 independent studies, they conclude that the GRE is generally a valid predictor. Subgroup differences are not discussed in the article. Despite the tremendous body of work acknowledging real differences in scores between groups, Kuncel states, "Both IRT and regression analyses of prediction for different racial and ethnic groups does not demonstrate any predictive differences across groups. A score on the test results in equal prediction across racial and ethnic groups" (Kuncel, 2005, #23486). Sadly, the greatest weakness of the meta-analysis is its failure to treat bias inherent in the original studies: if the studies examined primarily used ANOVA methods, then underlying biases are likely to have been ignored.

Conclusions

Access to higher education has a direct effect on the lives and earning potential of millions of individuals in the United States. Admissions requirements are necessary to ensure that candidates are capable of taking advantage of educational resources. Education is an investment, and colleges have every right to protect the resources they offer. However, using biased means of sorting students is not only unethical; it also undermines colleges' ability to identify potentially successful candidates.

Because all aptitude tests tie into the general intelligence factor g, and because g is not adequately understood, and further, because there are real group differences in measured g, the use of the SAT as an aptitude test should be suspended. The relatively insignificant amount of predictive power that ETS argues the SAT has is not worth the unreliability. College entrance should be granted by actual achievement, something that the SAT I is incapable of measuring.
Statistical arguments supporting the use of the SAT are flawed because they have not adequately dealt with the real differences in group means and standard deviations. These differences leave a disproportionate number of otherwise acceptable applicants below the cut-off values that many colleges are mandated to set. ANOVA and other statistical methods may suggest faulty relationships because of their inability to adequately detect bias in testing. Evolutionary perspectives might generate new avenues of research into the functional development of g and explanations for the measured differences between groups.

Advanced statistical methods will likely be needed to untangle the complex relationship that seems to exist between general intelligence and other aspects of information processing.
If colleges truly seek to admit a wide range of capable students, then they need to analyze very carefully the procedures by which they make selection decisions. In addition to racial group differences on the SAT, there are also class differences. New assessment methods with higher correlation coefficients may yet be developed. For instance, the use of biographical data (biodata) and a situational judgment inventory (SJI) has shown significant improvement in the selection of female and minority applicants, who tend to score less well on traditional tests than their counterparts (Oswald et al., 2004).


References
Atkinson, R. C. (2001). A California perspective on the SAT. Paper presented at Rethinking the SAT: The Future of Standardized Testing in University Admissions, Santa Barbara, CA.
Barkow, J. H., Cosmides, L., & Tooby, J. (1992). The adapted mind: Evolutionary psychology and the generation of culture. New York: Oxford University Press.
Bridgeman, B., Pollack, J., & Burton, N. (2004). An intuitive approach to predicting college success. Paper presented at the annual meeting of the American Educational Research Association, San Diego, CA.
Buss, D. M. (2003). Sex, marriage, and religion: What adaptive problems do religious phenomena solve? Psychological Inquiry, 13, 201-238.
Camilli, G., & Shepard, L. A. (1987). The inadequacy of ANOVA for detecting test bias. Journal of Educational Statistics, 12, 87-99.
Carter, C. (2002). The case against standardized tests. Retrieved from http://www.homestead.com/testcritic/files/Standardized_Tests.html
Cosmides, L., & Tooby, J. (n.d.). Evolutionary psychology: A primer. Retrieved from http://www.psych.ucsb.edu/research/cep/primer.html
Crouse, J., & Trusheim, D. (1991). How colleges can correctly determine selection benefits from the SAT. Harvard Educational Review, 61, 125-147.
FairTest. (2001a). SAT I: A faulty instrument for predicting college success. Retrieved June 19, 2005, from http://www.fairtest.org
FairTest. (2001b). The SAT: Questions and answers. Retrieved June 19, 2005, from http://www.fairtest.org
Gottfredson, L. S. (1998). The general intelligence factor. Retrieved June 20, 2005, from http://www.psych.utoronto.ca/~reingold/courses/intelligence/cache/1198gottfred.html
Howell, D. C. (2004). Fundamental statistics for the behavioral sciences. Belmont, CA: Brooks/Cole-Thomson Learning.
Jensen, A. R. (2000). Testing: The dilemma of group differences. Psychology, Public Policy, and Law, 6, 121-127.
Kanazawa, S. (2004). General intelligence as a domain-specific adaptation. Psychological Review, 111, 512-523.
Kuncel, N. R., Hezlett, S. A., & Ones, D. S. (2001). A comprehensive meta-analysis of the predictive validity of the Graduate Record Examinations: Implications for graduate student selection and performance. Psychological Bulletin, 127, 162-181.
Levine, D. M., & Stephan, D. F. (2005). Even you can learn statistics. Upper Saddle River, NJ: Pearson Education.
Onwuegbuzie, A. J., & Daley, C. E. (2001). Racial differences in IQ revisited: A synthesis of nearly a century of research. Journal of Black Psychology, 27, 209-220.
Oswald, F. L., et al. (2004). Developing a biodata measure and situational judgment inventory as predictors of college student performance. Journal of Applied Psychology, 89, 187-207.
Sacks, P. (1999). Standardized minds: The high price of America's testing culture and what we can do to change it. Cambridge, MA: Perseus Books.