Simple Validity Statistics for Teachers

My primary area of interest (besides statistics) is personality psychology. If there’s one thing you’ll notice about personality psychologists, it’s that we’re kind of obsessed with questionnaire measurement – and usually rely on some pretty complicated statistics to really be satisfied that a questionnaire is suitable for our purposes. Really though, we’re usually interested in two things:

Are the questionnaires reliable? That is, does the questionnaire produce consistent results under similar conditions?

Are the measurements valid? That is, does the questionnaire actually measure what it’s supposed to measure?

So when I started teaching courses, I started thinking about how I might build assessments that were both reliable AND valid for my students. After all, some research suggests that teachers have a pretty poor track record on developing reliable and valid ways to grade student performance. Besides, many of the assessments I use (e.g., exams, essays) share a lot in common with questionnaires, so many of the same principles should apply. In this post, I’m going to focus on convergent and divergent validity. This will require some knowledge of the correlation coefficient.

Convergent validity means that there is a strong, positive correlation between two measures that ARE supposed to be correlated with each other. If this were a scientific study, you might correlate two questionnaires that are supposed to be related to each other (say, positive affect and life satisfaction). In the context of teaching, you might correlate two assessment tools that are supposedly measuring the same thing (e.g., quizzes and exams). In this case, a large correlation provides evidence for convergent validity. Practically speaking, correlations larger than r = .30 provide acceptable evidence, and correlations greater than r = .50 provide excellent evidence.

Divergent validity means there is a small, or non-existent correlation between two measures that are NOT supposed to be correlated with each other. In a teaching context, you might expect little correlation between exam and oral presentation grades, since they measure different things (e.g., critical thinking versus communication skills). Practically speaking, you would hope for correlations smaller than r = .30 to support divergent validity, with a non-significant correlation being the strongest support.

Below is a sample correlation matrix from a 3000-level course I’ve taught in the past (Research Methods in Clinical Psychology). In this class, students complete two essays and two exams.

Statistics for Teachers table1

N = 39, all correlations significant at p < .05

Assuming that these assessment tools are valid, I’d expect three things:

a) Grades on the two essays will be highly correlated with each other

b) Grades on the two exams will be highly correlated with each other

c) The inter-correlations between exams and essays will be large, but not as large as the correlations between assessments measured in the same way. This is because exams and essays tap overlapping – but still probably discrete – skillsets.

A brief review of the correlation matrix above supports all three contentions, and gives me a bit more confidence in the validity of my assessment tools. If these correlations were a lot lower (< .30) I’d need to investigate if it’s simply a different skillset being measured, or if my measurement was poor.

There are many more ways that teachers can incorporate statistics into their teaching practice, without needing to be a statistics expert, but this is an easy one that anybody can implement.