This is not a blog about educational policy. However, often findings in my field, educational psychology, have a direct bearing on policy debates. In those cases, particularly when the consequences are great, it would be irresponsible for me not to speak out.

Value-added measures of teacher performance are being widely adopted across the country. This adoption is occurring with very little discussion about the validity of these measures. I believe that these measures, at least as conceived today, are invalid.

A measurement can be defined as taking some property in the world and representing it as a number. An invalid measure is one that does not accurately reflect the property it is supposed to represent.

In the past few weeks I have been analyzing data from a research project. The topic is not important for our discussion here; the methodology, however, is. The approach I am using is called a gain score analysis. Participants are assigned to one of two groups, and each group receives a different intervention. For each group we measure our outcome variable at baseline, that is, before treatment. After the intervention we measure our outcome variable again. The gain score is defined as the final measurement minus the baseline measurement, in other words, the magnitude of the change. By focusing on the magnitude of the change we don’t have to worry about the fact that the baseline scores were not identical. We use a statistical test to see if one group gained significantly more than the other.
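To make the arithmetic concrete, here is a minimal sketch of a gain score comparison in Python. The scores are invented for illustration, the function names are hypothetical, and Welch's t statistic is only an assumed choice of test; the analysis described above does not depend on that particular one.

```python
from math import sqrt
from statistics import mean, variance


def gain_scores(baseline, final):
    """Gain score per participant: final measurement minus baseline measurement."""
    if len(baseline) != len(final):
        raise ValueError("baseline and final must have the same length")
    return [f - b for b, f in zip(baseline, final)]


def welch_t(gains_a, gains_b):
    """Welch's t statistic for the difference in mean gain (assumed test choice)."""
    se = sqrt(variance(gains_a) / len(gains_a) + variance(gains_b) / len(gains_b))
    return (mean(gains_a) - mean(gains_b)) / se


# Hypothetical data: two groups, each measured before and after its intervention.
group_a_baseline = [52, 48, 61, 55, 50]
group_a_final = [60, 57, 70, 62, 59]
group_b_baseline = [58, 63, 49, 54, 60]
group_b_final = [61, 67, 52, 58, 62]

gains_a = gain_scores(group_a_baseline, group_a_final)
gains_b = gain_scores(group_b_baseline, group_b_final)
t_stat = welch_t(gains_a, gains_b)

print("mean gain, group A:", mean(gains_a))
print("mean gain, group B:", mean(gains_b))
print("Welch t statistic:", round(t_stat, 2))
```

Note that the two groups start from different baselines, yet only the gains enter the comparison. A real analysis would also report degrees of freedom and a p-value, which this sketch leaves out.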

A value-added measure of teaching is also a gain score analysis. Students’ performance is measured at the beginning of the year and again at year’s end. The difference is the gain score or, as it is called in education, the value added. The average gain score for a group of students is said to be the value added by the teacher.

What is wrong with this approach? After all, it seems to be identical to what my colleagues and I are doing in our research. Unfortunately, there is a crucial difference. In my study the participants were randomly assigned to the two groups. **A gain score analysis cannot be valid if the group assignments are not random.**

If students are not randomly assigned to schools and classrooms, and of course they are not, then value-added measures are invalid for comparisons between teachers.

We know that students learn at different rates. We know this because in research where teaching is held constant, such as in programmed instruction, students complete the material at different rates. Whatever the source of these differences in learning rate, it means that a teacher’s value-added score will, in part, be a function of student characteristics not under the control of the teacher. Thus, any policy based on value-added measures is invalid and, by extension, unfair.
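A small simulation can illustrate the point. This is only a sketch under stated assumptions: every number in it is invented, the function name is hypothetical, and the two simulated teachers are given exactly the same effect on learning. The only thing that varies is whether students are assigned to classrooms at random or sorted by their own learning rate.

```python
import random
from statistics import mean


def simulate_value_added(assign_randomly, n_students=2000, teacher_effect=5.0, seed=0):
    """Return the mean gain (value added) for two equally effective teachers."""
    rng = random.Random(seed)
    # Each student has a learning rate that is not under the teacher's control.
    rates = [rng.gauss(10.0, 3.0) for _ in range(n_students)]
    if assign_randomly:
        rng.shuffle(rates)  # random assignment to classrooms
    else:
        rates.sort()        # slower learners go to teacher A, faster to teacher B
    half = n_students // 2
    class_a, class_b = rates[:half], rates[half:]

    def value_added(class_rates):
        # Gain = student's own rate + identical teacher effect + measurement noise.
        return mean(r + teacher_effect + rng.gauss(0.0, 1.0) for r in class_rates)

    return value_added(class_a), value_added(class_b)


random_a, random_b = simulate_value_added(assign_randomly=True)
sorted_a, sorted_b = simulate_value_added(assign_randomly=False)

print("random assignment gap:    ", round(random_b - random_a, 2))
print("non-random assignment gap:", round(sorted_b - sorted_a, 2))
```

Under random assignment the gap between the two teachers should hover near zero. Under sorted assignment teacher B appears to add far more value, even though both teachers are identical by construction.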

I am not opposed to measurement in education. Indeed, I know that properly used measurement can benefit both students and teachers. But to base policy on a measurement that we know to be invalid is senseless.


There is no doubt that this is clear, but legislatures across the country – and Arne Duncan’s own Department of Education – are demanding the insertion of VAM in teacher evaluation. The question is how to make the reality of statistics (the scary older brother of already scary math) clear to Senator Tweedledum and Assemblyman Tweedledumber so they knock it off.

One can avoid the necessity of going into details of statistics and experimental design (with politicians) in such situations by considering ‘more’ appropriate designs. VAM provides a good lesson in itself of how a good basic idea can turn into a disaster because of refusals (up and down the line) to consider variations in the overall design. For example, let’s consider changing the dependent and independent variables and perhaps the measurement procedure itself.

VAM currently relies on taking indirect measurements of student progress using ‘high stakes’ tests, administered semiannually. Note, first, that these tests measure student performance and not teacher performance. Consequently, they create the need for a ‘formal’ model for getting from the empirical student assessments to teacher assessments. The underlying measure (student educational status) is already replete with the effects of current-year exogenous variables, and this formal model for translating student scores into teacher scores introduces substantial additional exogenous variability. The model chosen by the VAM scholars is the simple difference operator applied between this year’s and last year’s (or last semester’s) student scores. Also, some of the exogenous-variable effects, say, changes in the test questions from one year to the next, may vary across students, since some students may not have taken a course in the same subject the previous semester whereas the majority did. That is, VAM averages the scores over all test questions, some of which may have changed from the base year to the current year for some students, and that effect is not easily measured because the students who advanced to the current class may not all be in classes taught by the same teacher.

Thus, this step muddies the water (increases variance due to exogenous variables) to a much greater degree. Plus, at this point, you will have lost the ability to go back to an individual student to check to see if there might have been something unusual that affected his/her performance. Let’s also remember here that we are dealing with human beings, whose performance on tasks varies greatly depending on the conditions in their lives on the test day as well as all the days between the two tests.

I am sure that Frederick Taylor, the hero of the Industrial Revolution (who invented ‘Time and Motion’ methods for standardizing the time allowed for specific production operations), would approve. “Break down the job into sub-operations, which can be measured with more precision, and then add them together to get total task time,” he would say (poor Lucille Ball). Unfortunately, the learning and teaching processes are not quite that simple to disaggregate, although the ‘reformers’ seem to assume that they are.

OK, let’s consider a redesign. Let’s take a more direct approach where the children rate the teacher directly. Let’s use the children’s assessment of the teacher’s performance and their own assessment of what they thought they learned during the year to determine our measure of teacher effectiveness. Note that we could run such evaluations monthly or after each teaching module. Aside from increasing the number of measurements, which greatly reduces the variance of the final estimates, this also eliminates the nasty problem whereby some students may not have taken a math course or an ELA course in the year preceding the current test year.
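The claim about variance can be sketched numerically. The snippet below assumes the repeated ratings are independent and equally noisy, which repeated ratings from the same children would only approximate, and the 12-point standard deviation and the count of nine ratings are hypothetical.

```python
from math import sqrt


def standard_error(sd_single: float, n_measurements: int) -> float:
    """Standard error of the mean of n independent, equally noisy measurements."""
    if n_measurements < 1:
        raise ValueError("need at least one measurement")
    return sd_single / sqrt(n_measurements)


# Hypothetical: a single rating has a standard deviation of 12 points.
SD_SINGLE = 12.0
se_one = standard_error(SD_SINGLE, 1)
se_nine = standard_error(SD_SINGLE, 9)  # e.g. one rating per month of a school year

print(f"1 rating:  standard error = {se_one:.1f}")
print(f"9 ratings: standard error = {se_nine:.1f}")
```

Under those assumptions, averaging nine ratings cuts the standard error to a third of that of a single rating, which is the same as cutting the variance of the estimate by a factor of nine. Correlated ratings would shrink it by less.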

With this format, the children could be asked to choose all the things they enjoyed about the class, specific things they learned, how well they thought the teacher knew the subject and was able to answer questions, the number of times that the teacher seemed to lose her patience, etc. Indications that they learned about a subject could lead to other questions about that particular subject. One could even lightly sprinkle in some key substance questions, like ‘did your teacher teach you how to solve a quadratic equation’ (followed by a request that they list the steps), that, because it is a questionnaire about the child’s opinion of the teacher, would not be particularly threatening. Another good question for 3rd through 5th graders would be: ‘did you learn about using abstract thinking to solve problems’ or, maybe more appropriately: ‘did she teach you much about how math might be used to solve problems’.

Test scores for the different teachers could be compared with those of other teachers teaching the same subject within and between schools. Furthermore, many of the questions from the RTTT tests could be salvaged as discussion questions, even as examples of how one might ‘reason’ out the correct answer based on how it relates to the other choices given (e.g., these two answers are essentially the same, hence neither of them can be the real answer). These guessing procedures are good logical thought exercises, plus they present a different way of engaging the children to ask questions like “what’s the difference between these two answers?” and “how is this answer the best one?”

This would also only require minor editorial changes, like from ‘VAM’ to ‘CAVAM’ or to ‘CATVAM’.