Thoughts on State Assessments [updated]

Update: Jay P. Greene makes a case for norm-referenced tests.

Matt Townsley posed a question to me and blogger Chris Liebig on Twitter last weekend: What are your thoughts about state assessments that are norm-referenced versus criterion-referenced?

My first thought: wow, that’s too much question to respond to 140 characters at a time.

My next thought: there’s a really good question in there, the answer to which ought to be central to every conversation we have around state (accountability) assessments but won’t be. Because the answer to that question depends upon the purpose of the state assessment (how’s that for a foreseeable lawyerly answer). Why are we administering them? What question are we trying to answer?

When we know what the question is, we will know whether norm-referenced or criterion-referenced is the better choice.

It seems to me, though, that we have a very real problem with a while-we’re-at-it mentality when it comes to state assessments. If we are going to be testing for student proficiency, we might as well get a growth measure while we are at it. And results we can use to evaluate teachers. And results we can use to inform instruction. And results we can use to identify gifted students. And results we can use to determine if students are ready for college without remediation. And results that tell us how our students are doing compared to students in other schools, states, and countries. And the tests shouldn’t just measure, but should help students learn. And so on. And the longer the assessments, the more it seems to make sense to do all of these things while we are at it.

However, it is my understanding that assessments should be designed and used for a single purpose. And if that single purpose is to determine proficiency, I think a criterion-referenced assessment would make sense.

This is all assuming, of course, that the standards make sense in terms of reasonable grade-level expectations, which is, really, an enormous conversation in and of itself: what is grade level, and how much of it do you have to be able to do to be proficient?

I have no idea what the specific answers to those questions are, by the way (and I like to think that I have been paying attention), which may be why it seems that more than knowing that our students are proficient as measured against the standards, we (a general we, not necessarily me or Matt) want to know that our students are performing at a higher level than students in other schools, other states, and/or other countries. If we aren’t sure what grade-level proficient means (or that the bar is set high enough), we can at least take comfort in our students ranking higher than others–at least as long as our students aren’t the ones ranked at the bottom, of course. Hence all the anxiety about global competitiveness and wanting to take the same assessments as other states so that we can directly compare scores. In which case, a proficiency cut score on a norm-referenced assessment might make more sense.

That being said, someone needs to do a better job limiting the use of state assessment scores to the purpose for which the state assessments have been designed.


5 thoughts on “Thoughts on State Assessments [updated]”

  1. Chris

    Thanks, Karen. I had a similar reaction to the question: not only, “Why are we administering these tests?” but also, “and how exactly will we use the results to improve our kids’ education?” I’m not saying that there are no possible answers; if anything, there are multiple possible answers, and until we know what they are, it’s hard to make any decisions about the tests. One small example: if the test results are just going to be used to make adjustments in how things are taught the following year, then why pay a premium to get the results back in days rather than weeks?

    What’s the theory? That somehow my kid’s score will be used to benefit her individually? How? Or just that the aggregate scores will enable the school to make adjustments in the future? Again, how? If they do more of X and Y in response to the test scores, won’t they also have to do less of Z? (And if the goal is just information about aggregate performance, why not just use random sampling, thus saving millions of dollars?) Or is it so they can fire teachers whose classes have lower scores, thus somehow benefitting future classes (with inexperienced new hires who have no job security)? I really don’t get what the theory is. I wonder if Matt could post a little about how schools try to translate standardized test scores into concrete benefits for students.

    If the answer to “Why should we spend millions upon millions of dollars on new standardized tests” is just “because parents want to know how their kids are doing,” I find that answer very unsatisfying. Unless the information can help the school improve kids’ education in some concrete way, I don’t think the expense is justified (and even then we’d still have to talk about whether the benefit is worth it). Just satisfying our curiosity isn’t enough.

    I don’t have any faith in the school system’s ability to determine what kids should all know at what ages, especially if it’s done in some centralized, top-down way.

    Finally, to me, it makes sense to allow different states and school districts to pursue different philosophies of education and different goals, based on what each community values in education. Using standardized test results as a measure of how kids are doing ends up dictating a uniform definition of educational success, which undermines any kind of pluralistic approach to education. Maybe our kids get lower math scores because we’ve decided to pace them differently than your state does, or because we’ve decided not to cut recess and lunch like your state did, or because we think it’s important to make room in the schedule for history and the humanities, or because we have a foreign language immersion program, or because our curriculum isn’t driven by teaching to the test, etc. A standardized test score doesn’t measure the benefit of any of those things, or the cost of not doing them, so how is it useful, rather than misleading?

    Wait, what was the question again? 🙂

  2. Matt Townsley (@mctownsley)

    This is going to be a long response!

    I did some investigating to find out what Iowa Testing claims to be the purpose of the Iowa Assessments. The Iowa Assessment Profile Narratives sent home to parents suggest the assessment results “provide information about your student’s achievement level in core subject areas.” They also measure “student achievement and growth.” The purpose of the profile narrative is to “identify strengths and weaknesses, proficiency levels, inform placement decisions and make comparisons.” Other reports noted in the same pdf claim to “determine college readiness, inform instruction and implement response to intervention” (Source:

    Karen and Chris – you’re both on to something — there does not appear to be a clear purpose of the current state assessment.

    A quick look at the Smarter Balanced Assessments shows a similar theme: multiple purposes. “Accurately describe both student achievement and growth of student learning as part of program evaluation and school, district, and state accountability systems;
    Provide valid, reliable, and fair measures of students’ progress toward, and attainment of the knowledge and skills required to be college- and career-ready; and…” (Source:

    My take:
    I don’t believe state assessments can be used in meaningfully formative ways, that is, to inform future instruction. If there’s an illustration of summative measures in the dictionary, it should include a reference to state accountability assessments.

    I also believe a purpose of education is to help as many students as possible be successful. A norm-referenced assessment designed to sort students seems contrary to this belief.

    I also think it could be useful in this conversation to think about the history of state accountability assessments. I can’t confirm the next paragraph without some in-depth research, so take it for what it’s worth.

    Once upon a time, many states did not have standards or state assessments. Enter: No Child Left Behind. Schools and districts were to be labeled things like “in need of assistance” and “persistently low achieving” so that parents could be provided options when their neighborhood school was “failing.” Sanctions, labels, etc. threatened schools to do better…or else. Every state was required to have some sort of accountability assessment to sort and label schools. Many, but not all, states quickly reacted and tightened up their state standards. After all, if the tests are high stakes, we might as well communicate the ideas that will be tested! Iowa was the slowest to react. The Iowa Tests of Basic Skills and Iowa Tests of Educational Development were local and in place, so they became our state assessment. Each Iowa school district was creating its own standards, but schools were being labeled and sorted based on their students’ performance on these assessments. Some districts were spending thousands of dollars each year forming committees to determine their local standards (I was a part of one as a teacher) and trying to purchase textbooks aligned to them. Due to NCLB’s accountability requirements, schools also started to look at the ITBS and ITEDs to determine what their students were being asked to demonstrate, so they could have a better chance of staying off the watch and “in need of assistance” lists. While the topics covered on the ITBS and ITEDs were not state standards written into law anywhere, they served as a quasi-target for school districts across Iowa. In a weird sort of way, it made sense to adopt state standards, because the required accountability measures have to be based on something.

    This brings me back to the original question for Karen about norm-referenced vs. criterion-referenced assessments. My perspective is that if we’re going to have required accountability assessments, they should be criterion-referenced. Let’s find out how well students understand the state standards. A school’s goal could be 100% of students proficient…and because the criterion-referenced tests would not be designed to “sort,” 100% proficiency (an NCLB requirement) could be attainable.

  3. Karen W Post author

    Matt, I’m with you. Let’s do the driver’s license style, minimum competency statewide standardized assessment as a criterion-referenced test. If schools are approaching 100% proficiency, they get left alone to do whatever the community otherwise wants to do with the instructional program.

    But here’s what I’m thinking about–and as I write this, I think it may be a chicken-or-the-egg sort of problem, but we’ll see–how do we know that common core got the grade-level standards right, and how do we know where to set the proficiency cut score (assuming that proficiency isn’t DOK 4 on 100% of the grade-level standards) without some sort of norm reference? That is, if, for example, only 40% of third graders are deemed to be proficient, isn’t it possible the writers of the standards got the third-grade grade-level expectations wrong? Shouldn’t third-grade grade-level work be defined as something that some substantially large percentage of actual third graders can actually do? Isn’t it possible the proficiency cut score was set too high for the same reason? How do we sort those reasons out from schools-need-instructional-improvement reasons?

    Of course, I can see that if one thinks that, with instructional improvements, third grade students could achieve higher rates of proficiency, one might not want a norm-referenced assessment to validate the standards until after those improvements have been made.

    So, what are we to do? Do you have confidence–perhaps based on professional judgment and experience–that the writers of the common core standards got it right (that grade level standards are reasonably attainable for most children enrolled in those grade levels), that those who will choose the proficiency cut scores will get that right too, and we can just carry on with using criterion-referenced scores?

  4. Matt Townsley (@mctownsley)

    Quick caveat: I have limited psychometrics experience (some undergraduate coursework, a few graduate courses in stats, two summers working for educational assessment companies scoring and vetting assessments), so take these comments with this context in mind.

    From my experience, regardless of the assessment author (classroom teacher or test writers at educational companies), there will likely be some sort of validity or reliability concerns. You’re right — chicken and egg.

    One way assessment prompts can be validated without affecting reported results is by using pilot items. If students perform well on a pilot item (one that does not affect the overall score) over time, it can be included in a future form of the assessment. Again, based on my limited knowledge, this is a way items *could* be piloted while keeping a criterion-referenced focus.
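    The pilot-item screening Matt describes can be sketched with classical item statistics: an unscored pilot item is checked for difficulty (the proportion answering correctly) and discrimination (how well it tracks the operational total score) before being promoted to a future form. This is only an illustrative sketch; the function names, thresholds, and data below are hypothetical, not taken from any actual testing program.

    ```python
    # Hypothetical sketch of pilot-item screening using classical item
    # statistics. The thresholds and data are illustrative only.

    def item_difficulty(pilot_responses):
        """Proportion of examinees answering the pilot item correctly (the p-value)."""
        return sum(pilot_responses) / len(pilot_responses)

    def point_biserial(pilot_responses, total_scores):
        """Correlation between the 0/1 pilot item and the operational total score."""
        n = len(pilot_responses)
        mean_item = sum(pilot_responses) / n
        mean_total = sum(total_scores) / n
        cov = sum((i - mean_item) * (t - mean_total)
                  for i, t in zip(pilot_responses, total_scores)) / n
        sd_item = (sum((i - mean_item) ** 2 for i in pilot_responses) / n) ** 0.5
        sd_total = (sum((t - mean_total) ** 2 for t in total_scores) / n) ** 0.5
        return cov / (sd_item * sd_total)

    def promote_pilot_item(pilot_responses, total_scores,
                           min_p=0.30, max_p=0.90, min_rpb=0.20):
        """Flag an item for a future form only if both statistics look healthy."""
        p = item_difficulty(pilot_responses)
        rpb = point_biserial(pilot_responses, total_scores)
        return min_p <= p <= max_p and rpb >= min_rpb

    # Example: 1 = correct, 0 = incorrect on the pilot item; totals are each
    # examinee's score on the operational (counted) items.
    responses = [1, 1, 0, 1, 0, 1, 1, 0, 1, 1]
    totals = [38, 42, 20, 35, 22, 40, 37, 18, 33, 41]
    print(promote_pilot_item(responses, totals))  # → True
    ```

    Because the pilot item is excluded from every reported score, this kind of screening can run inside a criterion-referenced test without turning it into a sorting exercise.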

  5. Chris

    Matt’s argument for criterion-referenced tests is persuasive, but I agree that the problem lies in setting the criteria. I can’t get past the inevitably partial nature of any assessment. A math assessment may be able to tell us whether kids have certain math skills, but it can’t assess the opportunity cost of achieving those skills: What was dropped from the curriculum to reach those scores? Did the teaching techniques have harmful effects in other ways? Were the kids deprived of free play time or a decent lunch period to achieve those scores? Did the teaching achieve short-term success at the cost of creating a long-term aversion to math? Did the school have to start using behavioral control systems that teach authoritarian values? Did the kids also learn that learning is a joyless drudgery to be avoided as soon as they’re free of compulsion?

    The scores tell you nothing about those things. What good is that kind of partial information? If you went to a doctor who used that kind of “empiricism,” you’d soon be dead from the unexamined side effects of his or her prescriptions.

    Norm-referencing won’t help that problem at all. Who knows what scores most fourth graders could achieve if we made their lives miserable enough?
