Chapter 3 Testing Applications

If my future were determined just by my performance on a standardized test, I wouldn’t be here.
— Michelle Obama

Standardized tests are often used to inform high-stakes decisions, including decisions that limit access to educational programs, career paths, and other valuable opportunities. Ideally, test scores supplement other relevant information in these decisions, from sources such as interviews, recommendations, observations, and portfolios of work. In many situations, standardized test scores are considered essential because they provide a common objective measure of knowledge, achievement, and skills.

Unfortunately, standardized test scores can be misused when the information they provide is inaccurate or when they have an undue influence on these decisions. A number of studies and reports refer to bias in standardized tests and an over-reliance on scores in high-stakes decision making (e.g., Sternberg and Williams 1997; Santelices and Wilson 2010). As noted in the quote above, some people do not perform well on standardized tests, despite having what it takes to succeed.

This chapter gives an overview of the various types of tests, both standardized and unstandardized, that are used to support decision-making in education and psychology. Chapter 2 referred to a test’s purpose as the starting point for determining its quality or effectiveness. In this chapter we’ll compare types of tests in terms of their purposes, examine how these purposes are associated with certain features of a test, and look again at how the quality or validity of a test score can affect the effectiveness of score interpretations.

Learning objectives

  1. Provide examples of how testing supports low-stakes and high-stakes decision-making in education and psychology.
  2. Describe the general purpose of aptitude testing and some common applications.
  3. Identify the distinctive features of aptitude tests and the main benefits and limitations in using aptitude tests to inform decision-making.
  4. Describe the general purpose of standardized achievement testing and some common applications.
  5. Identify the distinctive features of standardized achievement tests and the main benefits and limitations in using standardized achievement tests to inform decision-making.
  6. Compare and contrast different types of tests and test uses and identify examples of each, including summative, formative, mastery, and performance.
  7. Summarize how technology can be used to improve the testing process.

3.1 Tests and decision-making

As mentioned in Chapter 2, tests are designed for different purposes, for example, to inform decisions impacting accountability, admissions, employment, graduation, licensing, and placement (see Table 2.1). Test results can also impact decisions regarding treatment in mental health and counseling settings, interventions in special education, and policy and legal issues.

3.1.1 Educational decisions

Educational tests support decision-making in both low-stakes and high-stakes situations. These terms refer to the consequences and impact of test results for those involved, where low-stakes testing has minimal impact and high-stakes testing can have a large or lasting impact. Low-stakes tests address decision-making for instructional planning and student placement. The myIGDI testing program described in Chapter 2 involves lower-stakes decisions regarding the selection of instructional interventions to support student learning. PISA could be considered a low-stakes test, at least for the student, as scores do not impact decisions made at the student level. Teacher-made classroom assessments and other tools for measuring student growth are also considered lower-stakes tests, to the extent that they are not used in isolation to determine future learning opportunities.

Let’s look at one of the oldest and best-known standardized tests as an example of high-stakes decision-making. Development of the first college admissions test began in the late 1800s, when a group of universities in the US came together to form the College Entrance Examination Board (now called the College Board). In 1901, this group administered the original version of what would later be called the SAT. This original version consisted only of essay questions in select subject areas such as Latin, history, and physics. Some of these questions resemble ones we might find in standardized tests today. For example, from the physics section:

A steamer is moving eastward at the rate of 240 meters per minute. A man runs northward across her deck at the rate of 180 meters per minute. Show by a drawing his actual path and compute his actual velocity in centimeters per second.
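For readers curious about the answer this item expects, here is a minimal worked computation (the solution below is not part of the original exam materials). Because the two motions are perpendicular, the man’s actual path is the diagonal of the rectangle formed by the eastward and northward displacements, and his actual speed is the Pythagorean combination of the two given rates, converted to centimeters per second.

```python
import math

# Worked computation for the 1901 physics item (not part of the original
# exam materials). The steamer's and the man's motions are perpendicular,
# so the resultant speed is the Pythagorean combination of the two rates.
steamer = 240.0  # meters per minute, eastward
man = 180.0      # meters per minute, northward

resultant_m_per_min = math.hypot(steamer, man)       # 300 m/min
resultant_cm_per_s = resultant_m_per_min * 100 / 60  # 500 cm/s

print(resultant_m_per_min, resultant_cm_per_s)  # 300.0 500.0
```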

The original test was intended only for limited use within the College Board. However, in 1926, the SAT was redesigned to appeal to institutions across the US. The 1926 version included nine content areas: analogies, antonyms, arithmetic, artificial classification, language, logical inference, number series, reading, and word definitions. It was based almost entirely on multiple-choice questions. For additional details, see sat.collegeboard.org.

The College Board notes that the SAT was initially intended to be a universal measure of preparation for college. It was the first test to be utilized across multiple institutions, and it provided the only common metric with which to evaluate applicants. In this way, the authors assert it helped to level the playing field for applicants of diverse socio-economic backgrounds, to “democratize access to higher education for all students” (College Board 2012, 3). For example, those who may have otherwise received preferential treatment because of connections with alumni could be compared directly to applicants without legacy connections.

Since it was formally standardized in 1926, the SAT has become the most widely used college admissions exam, with over 1.5 million administrations annually (as of 2015). The test itself has changed substantially over the years, though its stated purpose remains the same (College Board 2012):

Today the SAT serves as both a measure of students’ college readiness and as a valid and reliable predictor of college outcomes. Developed with input from teachers and educators, the SAT covers core content areas presented as part of a rigorous high school curriculum and deemed critical for success in college: critical reading, mathematics, and writing. The SAT measures knowledge and skills that are considered important by both high school teachers and college faculty.

As we’ll see in Chapter 9, test developers such as the College Board are responsible for backing up claims like these with validity evidence. Colleges must also evaluate whether or not these claims are met, and, if not, whether admission decisions can be made without a standardized test. Colleges are responsible for choosing how much weight test scores have in admissions decisions, and whether or not minimum cutoff scores are used.

Criticism of the SAT, primarily regarding potential bias in item content and scoring (e.g., Santelices and Wilson 2010) and impact on equity in admissions (e.g., Geiser 2017), has led a number of colleges to drop it as an admissions requirement. These colleges base admissions decisions on other information, such as in-person interviews with applicants (Miller and Stassun 2014).

A 2009 survey of 246 colleges in the US found that 73% used the SAT in admissions decisions (Briggs 2009). Of those colleges using the SAT, 78% reported using scores holistically, that is, as supporting information within the portfolio of work contained in an application. On the other hand, 31% reported using SAT scores to rank applicants quantitatively, and 21% further reported having defined cutoff scores below which applicants would be disqualified from admission.

Another controversial high-stakes use of educational testing involves accountability decisions within the US education system. For reviews of these issues in the context of the No Child Left Behind Act of 2001 (NCLB; Public Law 107-110), see Hursh (2005) and Linn, Baker, and Betebenner (2002). Abedi (2004) discusses validity implications of NCLB for English language learners.

3.1.2 Psychological decisions

Decision-making in psychological testing can also be associated with low and high stakes. As in educational testing, the level of stakes depends on the potential for impact, whether positive or negative. We should also clarify that impacts may occur at different levels of reporting. A test may have minimal impact at the individual level, where decisions are not being made based on test results, but the same test may have serious implications when results are aggregated across individuals. For example, consider a test used to gauge changes in symptoms related to post-traumatic stress over the course of a treatment program. If the test is used not to evaluate and change treatment for individual participants, but to evaluate the effectiveness of the treatment overall, the stakes may be high for the developers of the treatment, where funding and publicity may depend on a positive result.

For another example of decision-making in the context of psychology, we’ll look at one of the most widely used standardized personality tests. In the 1930s and 1940s, two researchers at the University of Minnesota pioneered an empirical and atheoretical method for developing personality and pathology scales. This method involved administering hundreds of short items to people with known diagnoses. Items measuring specific personality traits and pathologies were identified based on consistent patterns of response for individuals with those traits and pathologies. For example, individuals diagnosed with depression responded in similar ways across certain items. Regardless of the content of the items (hence the atheoretical nature of the method), if individuals responded in consistent ways, the items were assumed to measure depression. Table 3.1 contains names and descriptions for the original clinical scales of the Minnesota Multiphasic Personality Inventory (MMPI; for details, see wikipedia.org).

Table 3.1: Original Clinical Scales of the MMPI

Scale | What is measured | Items
Hypochondriasis | Concern with bodily symptoms | 32
Depression | Depressive symptoms | 57
Hysteria | Awareness of problems and vulnerabilities | 60
Psychopathic Deviate | Conflict, struggle, anger, respect for rules | 50
Masculinity/Femininity | Stereotypical interests and behaviors | 56
Paranoia | Level of trust, suspiciousness, sensitivity | 40
Psychasthenia | Worry, anxiety, tension, doubts, obsessiveness | 48
Schizophrenia | Odd thinking and social alienation | 78
Hypomania | Level of excitability | 46
Social Introversion | People orientation | 69
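
To make the empirical keying method described above more concrete, here is a minimal sketch in Python. The item labels, endorsement rates, and cutoff are hypothetical, and the actual development of the MMPI involved much larger item pools and more elaborate statistical criteria; the sketch only illustrates the core idea of retaining items that separate a diagnosed group from a reference group, regardless of what the items appear to measure.

```python
# Hypothetical empirical keying: retain items whose endorsement rates differ
# substantially between a diagnosed group and a reference group. All values
# and the cutoff are invented for illustration.
# Each tuple holds the proportion answering "true": (diagnosed, reference).
endorsement = {
    "item_01": (0.82, 0.35),
    "item_02": (0.40, 0.38),
    "item_03": (0.15, 0.61),
    "item_04": (0.74, 0.30),
}

cutoff = 0.30  # minimum difference in endorsement rates required to keep an item

keyed_items = [
    item for item, (diagnosed, reference) in endorsement.items()
    if abs(diagnosed - reference) >= cutoff
]

print(keyed_items)  # ['item_01', 'item_03', 'item_04']
```

The method is atheoretical in exactly this sense: an item is retained because it discriminates between groups, not because its content appears related to the construct.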

The MMPI and other personality and psychopathology measures are used to support decisions in a variety of clinical settings, including diagnosis as well as treatment and therapy planning and evaluation. They are also used in personnel selection, where certain personality traits have been shown to correlate strongly with certain aspects of job performance. However, because of its emphasis on pathology and its implications regarding mental illness, in the US the MMPI can be used in employment decisions only for high-risk or security-related positions, such as police officer and firefighter. Measures such as the MMPI can also be used in forensics, in criminal investigations, and in court (Pope, Butcher, and Seelen 2006). Given their impact on mental health outcomes, career choices, and legal proceedings, these could all be considered high-stakes decisions.

3.1.3 Stakes and test quality

The stakes associated with a test will provide a useful frame of reference as we evaluate the quality of a test for a given purpose. In lower-stakes situations, we are more lenient when a test is shorter, less reliable, or subject to more measurement error. We are more understanding when norming results are based on smaller sample sizes if we know that a test will have minimal impact on test takers or other stakeholders. In contrast, in higher-stakes situations, our standards are more demanding. We expect stronger reliabilities, larger norming samples, and more robust psychometric evidence in support of results that could have a lasting impact. To address these distinctions, guidelines for interpreting test results in later chapters will sometimes differ depending on the stakes associated with a test.

The stakes of a test also have implications for the test development process and for the validity of results. Higher stakes can be expected to produce more interest in test scores among test takers and other stakeholders, such as teachers, counselors, or administrators. When test results can influence career trajectories or the standing of a program, there may be an incentive to compromise results through socially desirable responding or cheating. These issues are also discussed in future chapters.

Learning check: Consider the decisions in your education or career that have been informed by test results. Did these decisions involve low or high stakes? How did the stakes involved influence your participation in the test?

3.2 Test types and features

Over the past hundred years, numerous terms have been introduced to distinguish tests from one another based on features of the tests themselves and what they are intended to measure. Educational tests have been described as measuring constructs that are the focus of instruction and learning, whereas psychological tests measure constructs that are not; thus, educational and psychological tests differ primarily in the constructs they measure. Related distinctions are made between cognitive and affective tests, and between achievement and aptitude tests. Other distinctions include summative versus formative, mastery versus growth, and knowledge versus performance.

3.2.1 Cognitive and affective

Cognitive testing refers to testing where the construct of interest is a cognitive ability. Cognition includes mental processes related to knowledge, comprehension, language acquisition and production, memory, reasoning, problem solving, and decision-making. Intelligence tests, achievement tests, and aptitude tests are all considered cognitive tests because they assess constructs involving cognitive abilities and processing. Other examples of cognitive tests include educational admissions tests and licensure and certification tests.

In affective testing, the construct of interest relates to psychological attributes that do not directly involve cognitive ability or processing. Affective constructs include personality traits, psychopathologies, interests, attitudes, and perceptions, as discussed below. Note that cognitive measures are often used in both educational and psychological settings, whereas affective measures are more common in psychological settings.

3.2.2 Achievement and aptitude

Achievement and aptitude describe two related forms of cognitive tests. Both types of tests measure similar cognitive abilities and processes, but typically for slightly different purposes. Achievement tests are intended to describe learning and growth, for example, in order to identify how much content students have mastered in a unit of study. Accountability tests required by NCLB are achievement tests built on the educational curricula for states in the US. State curricula are divided into what are called learning standards or curriculum standards. These standards operationalize the curriculum in terms of what proficient students should know or be able to do.

In contrast to achievement tests, aptitude tests are typically intended to measure cognitive abilities that are predictive of future performance. This future performance could be measured in terms of the same or a similar cognitive ability, or in terms of performance on other constructs and in other situations. For example, intelligence or IQ tests are used to identify individuals with developmental and learning disabilities and to predict job performance (e.g., Carter 2002). The Stanford-Binet Intelligence Scales, originally developed in the early 1900s, were the first standardized aptitude tests. Others include the Wechsler Scales and the Woodcock-Johnson Psycho-Educational Battery.

Intelligence tests and related measures of cognitive functioning have traditionally been used in the US to identify students in need of special education services. However, an over-reliance on test scores in these screening and placement decisions has led to criticism of the practice. A federal report (US Department of Education 2002, 25) concluded that,

Eliminating IQ tests from the identification process would help shift the emphasis in special education away from the current focus, which is on determining whether students are eligible for services, towards providing students the interventions they need to successfully learn.

The myIGDI testing program discussed in Chapter 2 is one example of a special education screening and placement measure that focuses on intervention to support student learning. Note that the emphasis on student learning in this context has resulted in tests that measure both aptitude and achievement, as they predict future performance while also describing student learning. Thus, tests for screening and placement of students with disabilities are now designed for multiple purposes.

Tests that distinctly measure either achievement or aptitude usually differ in content and scope as well as purpose. Achievement tests are designed around a well-defined content domain using a test outline (discussed further in Chapter 4). The test outline presents the content areas and learning standards or objectives needed to represent the underlying construct. An achievement test includes test questions that map directly to these content areas and objectives. As a result, a given question on an achievement test should have high face validity, that is, it should be clear what content the question is intended to measure. Furthermore, a correct response to such a question should depend directly on an individual’s learning of that content.

On the other hand, aptitude tests, which are not intended to assess learning or mastery of a specific content domain, need not be restricted to specific content. They are still designed using a test outline. However, this outline captures the abilities and skills that are most related to or predictive of the construct, rather than content areas and learning objectives. Aptitude tests typically measure abilities and skills that generalize to other constructs and outcomes. As a result, the content of an aptitude question is not as relevant as the cognitive reasoning or processes used to respond correctly. Aptitude questions may not have high face validity, that is, it may not be clear what they are intended to measure and how the resulting scores will be used.

Learning check: Compare and contrast achievement with aptitude testing using your own examples. What features of tests, in your experience, help differentiate between the two?

3.2.3 Other distinctions

In the 1970s and 1980s, researchers in the areas of education and school psychology began to highlight a need for educational assessments tied directly to the curriculum and instructional objectives of the classroom in which they would be used. The term performance assessment was introduced to describe a more authentic form of assessment requiring students to demonstrate skills and competencies relevant to key outcomes of instruction (Mehrens 1992; Stiggins 1987). Various types of performance assessment were developed specifically as alternatives to the summative tests that were then often created outside the classroom. For example, curriculum-based measurement (CBM), developed in the early 1980s (Deno 1985), involved brief, performance-based measures used to monitor student progress in core subject areas such as reading, writing, and math. Reading CBM assessed students’ oral reading fluency in the basal texts for the course; the content of the reading assessments came directly from the readings that students would encounter during the academic year. These assessments produced scores that could be used to model growth, and predict performance on end-of-year tests (Deno et al. 2001; Fuchs and Fuchs 1999).
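
As a simple illustration of how repeated CBM scores can be used to model growth, the sketch below fits a least-squares slope to hypothetical weekly oral reading fluency scores for one student. The scores, number of weeks, and end-of-year projection are invented for illustration and do not come from any published CBM norms.

```python
# Hypothetical weekly oral reading fluency scores (words read correctly per
# minute, or wcpm) for one student across eight weeks of progress monitoring.
weeks = list(range(1, 9))
wcpm = [42, 45, 44, 49, 52, 51, 56, 58]

# Ordinary least-squares slope: estimated wcpm gained per week.
n = len(weeks)
mean_x = sum(weeks) / n
mean_y = sum(wcpm) / n
slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(weeks, wcpm))
         / sum((x - mean_x) ** 2 for x in weeks))
intercept = mean_y - slope * mean_x

print(round(slope, 2))                # growth rate, roughly 2.3 wcpm per week
print(round(intercept + slope * 36))  # rough projection to week 36
```

Summarizing growth as a slope in words correct per minute per week allows the score series to be compared against expected growth rates or used to project performance later in the year.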

Although CBM and other forms of performance assessment remain popular today, the term formative assessment is now commonly used as a general label for the classroom-focused alternative to the traditional summative or end-of-year test. The main distinction between formative and summative is in the purpose or intended use of the resulting test score. Formative assessments are described as measuring incrementally, where the purpose is to directly encourage student growth (Black and Wiliam 1998). They can be spread across multiple administrations, or used in conjunction with other versions of an assessment, so as to monitor and promote progress. Thus, formative assessments are designed to inform teaching and form learning. They seek to answer the question, “how are we doing?” Wiliam and Black (1996) further assert that in order to be formative, an assessment must provide information that is used to address a need or a gap in the student’s understanding; in other words, beyond an intent, there must be an attempt to apply the results. Note that it is less appropriate to label a test as formative, and preferable to instead label the process or use of scores as formative.

On the other hand, summative assessments, or assessments used summatively, measure conclusively, usually at a single time point, where the intention is to describe current status. Summative assessment encourages growth only indirectly. In contrast to formative assessment, it is designed to sum or summarize, to answer the question, “how did we do?” Cizek (2010) describes a summative assessment as one administered at the end of an instructional unit, the purpose of which is “primarily to categorize the performance of a student or system.” Wiliam and Black (1996) go further by saying that summative assessments cannot, by definition, be formative.

Despite debate over what specifically constitutes a formative assessment (e.g., Bennett 2011), numerous studies have documented at least some positive impact resulting from the use of assessments that inform instruction during the school year (Black and Wiliam 1998). Formative assessments have become a key component in many educational assessment systems (Militello, Schweid, and Sireci 2010).

3.3 Technology in testing

Developments in computing technology have facilitated improvements in the way we now create, administer, and score educational and psychological tests. Some major improvements include:

  • computer administration, which enables a wider variety of innovative item types, for example, ones that present video prompts or collect responses via touch screen or voice recognition;
  • computerized adaptive testing, where items or tasks are selected algorithmically and tailored to the needs of each test taker based on past performance; and
  • automated scoring, including machine scoring for selected-response items and computerized essay grading with natural language processing procedures.

The costs of these improvements have tended to limit their application to large-scale and high-stakes testing programs, for example, in admissions, certification/licensure, and state testing. Consider the GRE (ets.org/gre), an admissions test for graduate programs. In the GRE, an adaptive algorithm evaluates test taker performance on an initial set of items and then presents more difficult items to test takers with higher scores and less difficult items to test takers with lower scores. This adaptiveness, discussed further in Chapter 7, makes testing more efficient, as test takers respond to more items that are targeted to their level on the construct and fewer items that are not. The GRE also employs computerized essay scoring, where each essay is scored by a human rater as well as a computer algorithm. When a discrepancy arises between the two scores, an additional human rater provides a third score for verification.
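
The sketch below illustrates the basic logic of adaptive item selection in a highly simplified form. It is not the GRE’s actual algorithm, which relies on item response theory (see Chapter 7); the item bank, response simulation, and one-step-up, one-step-down rule here are all hypothetical.

```python
import random

# A toy adaptive test: step to a harder item after a correct response and to
# an easier item after an incorrect one. Item difficulties are on an
# arbitrary 0-1 scale, ordered from easiest to hardest.
item_bank = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]

def simulate_adaptive_test(true_ability, n_items=6):
    index = len(item_bank) // 2  # start near the middle of the bank
    administered = []
    for _ in range(n_items):
        difficulty = item_bank[index]
        administered.append(difficulty)
        # A response is more likely correct when ability exceeds difficulty.
        correct = random.random() < 1 / (1 + 10 ** (difficulty - true_ability))
        index = min(index + 1, len(item_bank) - 1) if correct else max(index - 1, 0)
    return administered

print(simulate_adaptive_test(true_ability=0.8))  # difficulties drift toward the test taker's level
```

Operational adaptive tests replace this simple step rule with model-based ability estimates and item information, but the effect is the same one described above: test takers spend most of their time on items targeted to their level on the construct.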

Online platforms have also incorporated computing technologies to support lower-stakes formative assessment that integrates with learning resources, for example, in cognitive tutoring systems (e.g., Ritter et al. 2007) and other web applications (e.g., khanacademy.org). These platforms employ complex algorithms, often following cognitive and psychometric models, that account for user interactions with online content in addition to assessment performance so as to adapt future content to user needs.

As will be discussed in Chapter 9, we should incorporate new technology into test development, administration, or scoring only if doing so can be shown to improve the testing process, or at least not detract from it. In other words, technology should be used to enhance the validity of our results (Huff and Sireci 2001). In many situations, technology meets this aim by allowing for a more authentic and realistic representation of the construct we are trying to measure. An example of this would be a certification exam that assesses the coding skills of software engineers, where computer administration would let us present and evaluate computer code in real time. If the construct itself involves technology, our measurement should improve when we implement that technology during testing. Technology also meets the aim of enhancing validity by removing barriers and increasing access to test content. Computers simplify the provision of accommodations in testing, for example, with instructions that can be read aloud, magnification of small text, and translation into other languages.

Learning check: Use the terms from this section to describe how technology has improved testing in your experience.

3.4 Finding test information

Information for commercially available tests can usually be found by searching online, with test publishers sharing summaries and technical documentation for their tests. Repositories of test information are also available online. ETS provides a searchable database of over 25,000 tests at www.ets.org/test_link. A search for the term “creativity” returns 213 records, including the Adult Playfulness Scale, “a personality measure which assesses the degree to which an individual tends to define an activity in an imaginative, non-serious or metaphoric manner so as to enhance intrinsic enjoyment, involvement, and satisfaction,” and the Fantasy Measure, where “Children complete stories in which the main character is a child under stress of failure.” In addition to a title and abstract, the database includes basic information on publication date and authorship, sometimes with links to publisher websites.

The Buros Center for Testing (buros.org) also publishes a comprehensive database of educational and psychological measures. In addition to descriptive information, it includes peer evaluations of the psychometric properties of tests in what is called the Mental Measurements Yearbook. Buros peer reviews may be available through university library subscriptions, or can be accessed online for a fee.

3.5 Summary

This chapter provides an overview of how different types of tests are designed to inform a variety of decisions in education and psychology. For the most part, tests are designed merely to inform decision-making processes, and test authors are often careful to clarify that no decision should be made based solely on test scores. Online databases provide access to descriptive summaries and peer reviews of tests.

Although a variety of terms are available for describing educational and psychological tests, many tests can be described in multiple ways depending on their use. The myIGDI measure was mentioned as an example of how achievement and aptitude tests can be difficult to distinguish from one another. Summative and formative tests often overlap as well, where a test can be used to both summarize and inform learning. In the end, the purpose of the test should be the main source of information for determining what type of test you are dealing with and what that test is intended to do.

3.5.1 Exercises

  1. Is it appropriate for colleges and graduate programs to have minimum cutoffs when reviewing standardized test scores in the admissions process? How would you recommend that scores from admissions tests be incorporated into admissions decisions?
  2. Describe the challenges involved in using a single test for multiple purposes, such as to measure both achievement and aptitude, or both status and growth, or both formative and summative information.
  3. Would you describe admissions tests like the SAT and GRE as aptitude or achievement tests? Explain your reasoning.
  4. For a test you have taken that relied on technology for administration, summarize the consequences of removing this technology, for example, by administering via paper instead of computer or in-person instead of online.
  5. Conduct an online search for information on one of the tests referenced in this chapter or in Chapter 2. Look for details on the publication date, authors, and accessibility of the test. From the available information, summarize the test using the terms presented in this chapter.

References

Abedi, Jamal. 2004. “The No Child Left behind Act and English Language Learners: Assessment and Accountability Issues.” Educational Researcher 33: 4–14.

Bennett, Randy Elliot. 2011. “Formative Assessment: A Critical Review.” Assessment in Education: Principles, Policy & Practice 18 (1): 5–25.

Black, P., and D. Wiliam. 1998. “Inside the Black Box: Raising Standards Through Classroom Assessment.” Phi Delta Kappan 80: 139–48.

Briggs, D. C. 2009. “Preparation for College Admission Exams.” Arlington, VA: National Association for College Admission Counseling.

Carter, S. D. 2002. “Matching Training Methods and Factors of Cognitive Ability: A Means to Improve Training Outcomes.” Human Resource Development Quarterly 13: 71–88.

Cizek, G. J. 2010. “An Introduction to Formative Assessment.” In Handbook of Formative Assessment, edited by H. L. Andrade and G. J. Cizek, 3–17. New York, NY: Routledge.

College Board. 2012. “The SAT Report on College and Career Readiness: 2012.” New York, NY: College Board.

Deno, S. L. 1985. “Curriculum-based measurement: The emerging alternative.” Exceptional Children 52: 219–32.

Deno, S. L., L. S. Fuchs, D. Marston, and J. Shin. 2001. “Using curriculum-based measurement to establish growth standards for students with learning disabilities.” School Psychology Review 30: 507–24.

Fuchs, L. S., and D. Fuchs. 1999. “Monitoring student progress toward the development of reading competence: A review of three forms of classroom-based assessment.” School Psychology Review 28: 659–71.

Geiser, S. 2017. “Norm-Referenced Tests and Race-Blind Admissions: The Case for Eliminating the SAT and ACT at the University of California.” Center for Studies in Higher Education, University of California, Berkeley.

Huff, Kristen L., and Stephen G. Sireci. 2001. “Validity Issues in Computer-Based Testing.” Educational Measurement: Issues and Practice 20 (3): 16–25.

Hursh, D. 2005. “The Growth of High-Stakes Testing in the USA: Accountability, Markets, and the Decline in Educational Equality.” British Educational Research Journal 31: 605–22.

Linn, R. L., E. L. Baker, and D. W. Betebenner. 2002. “Accountability Systems: Implications of Requirements of the No Child Left Behind Act of 2001.” Educational Researcher 31 (6): 3–16.

Mehrens, W. A. 1992. “Using performance assessment for accountability purposes.” Educational Measurement: Issues and Practice 11: 3–9.

Militello, Matthew, Jason Schweid, and Stephen G. Sireci. 2010. “Formative Assessment Systems: Evaluating the Fit Between School Districts’ Needs and Assessment Systems’ Characteristics.” Educational Assessment, Evaluation and Accountability 22 (1): 29–52.

Miller, C., and K. Stassun. 2014. “A Test That Fails.” Nature 510: 303–4.

Pope, K. S., J. N. Butcher, and J. Seelen. 2006. The MMPI, MMPI-2, & MMPI-A in Court: A Practical Guide for Expert Witnesses and Attorneys. 3rd ed. Washington, DC: American Psychological Association.

Ritter, Steven, John R. Anderson, Kenneth R. Koedinger, and Albert Corbett. 2007. “Cognitive Tutor: Applied Research in Mathematics Education.” Psychonomic Bulletin & Review 14 (2): 249–55.

Santelices, M. V., and M. Wilson. 2010. “Unfair Treatment? The Case of Freedle, the SAT, and the Standardization Approach to Differential Item Functioning.” Harvard Educational Review 80: 106–34.

Sternberg, Robert J., and Wendy M. Williams. 1997. “Does the Graduate Record Examination Predict Meaningful Success in the Graduate Training of Psychology? A Case Study.” American Psychologist 52 (6): 630–41.

Stiggins, R. J. 1987. “The Design and Development of Performance Assessments.” Educational Measurement: Issues and Practice 6: 33–42.

US Department of Education. 2002. “A New Era: Revitalizing Special Education for Children and Their Families.” Washington, DC: US Department of Education.

Wiliam, D., and P. Black. 1996. “Meanings and consequences: A basis for distinguishing formative and summative functions of assessment?” British Educational Research Journal 22: 537–48.