
How an effect size can simultaneously be both small and large

There have been a bunch of interesting posts on Bluesky lately on effect sizes (and some interesting blog posts), with a few people discussing standardized effect sizes. That got me thinking about a weird little example dataset that has both a very large and a very small effect size, depending on how you define it.

Today’s toy dataset comes from the jamovi sample data. It’s a simple pre-post design (n = 20) where students take an exam at time 1 and then take a second exam at time 2. Grades are measured on a 0-100% scale. The dataset has some unusual properties that highlight differences between effect size measures in a fairly dramatic way.

Below is a plot of the data, with the results analyzed using a paired t-test from the ggstatsplot package.

We can think about the effect size in a few ways:

Unstandardized Effect Size (it’s small): Exam 1 (M = 56.98) had a slightly lower mean score than exam 2 (M = 58.38), for a mean difference of 1.4 points. In context (i.e., percentage points on a test) this is a very small effect size. It’s not so small as to be completely negligible, but given how many students have elected to get the free 1.5% worth of bonus research participation points in my classes, I think it’s fair to say it’s not an amount of points that matters much to students.

Standardized dz (it’s large): In the plot created by the ggstatsplot package, and in lots of other software, the default Cohen’s d for a paired t-test is what Lakens (2013) refers to as “dz”. In this dataset, dz = 1.45, meaning the mean change from exam 1 to exam 2 is 1.45 standard deviations of the difference scores. Though this is convenient in the sense that it is useful for power analyses and has a direct conversion formula from the t-test in the form t/SQRT(N), it’s clearly at odds with the unstandardized effect size, since a dz of 1.45 is incredibly large. It’s large because the denominator of the Cohen’s d formula used (mean difference / standard deviation) is the standard deviation of the difference column (i.e., time1-time2), which happens to be very small relative to the mean difference (i.e., 1.4 / 0.97 = 1.44, off a little just due to rounding).
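To make that arithmetic concrete, here is a minimal sketch in R. The scores are simulated stand-ins (the vectors exam1 and exam2 are hypothetical, not the jamovi data), so the numbers will not match the ones above. The only point is that dz is the mean of the difference scores divided by their standard deviation, and that the same value falls out of t/SQRT(N).

```r
## Simulated stand-in data (hypothetical; NOT the jamovi dataset)
set.seed(1)
exam1 <- rnorm(20, mean = 57, sd = 6.5)
exam2 <- exam1 + rnorm(20, mean = 1.4, sd = 1)   ## small but very consistent gain

diffs <- exam2 - exam1                 ## the "difference column"
dz    <- mean(diffs) / sd(diffs)       ## mean difference / SD of the differences

## Same quantity recovered from the paired t-test: t / sqrt(N)
tt        <- t.test(exam2, exam1, paired = TRUE)
dz_from_t <- unname(tt$statistic) / sqrt(length(diffs))

c(dz = dz, dz_from_t = dz_from_t)      ## the two values should be identical
```

Because the simulated gain is small but nearly the same for everyone, sd(diffs) is tiny and dz comes out large, which is the same mechanism at work in the real data.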

Standardized dav (it’s small): Again, using Lakens’ (2013) notation. We could instead standardize by the standard deviations of the exam 1 and exam 2 scores themselves, as we would with an independent t-test, rather than by the standard deviation of the difference column. If we do this, then dav = 0.22, meaning the two exam means differ by 0.22 standard deviations. This is a very small effect size, and is now in line with the unstandardized effect size. Formula below:

cohens_d <- function(x, y) {
  md  <- abs(mean(x) - mean(y))        ## mean difference (numerator)
  psd <- (var(x) + var(y))/2           ## average of the two variances
  psd <- sqrt(psd)                     ## pooled SD (close to the simple average of the two SDs when they are similar)
  cd  <- md/psd                        ## Cohen's d
  return(cd)
}

res <- cohens_d(x = mydata2$grade_test1, 
                y = mydata2$grade_test2)

Rank Biserial Correlation (it’s large): Ok, well what if we analyzed it with a non-parametric test, like the Wilcoxon signed-rank test? We could calculate an effect size for that with a rank-biserial correlation. Broadly, we can think about this effect size as % of favorable pairs minus % of unfavorable pairs (see Kerby, 2014), where for the signed-rank test each pair is weighted by the rank of its absolute change. The rank-biserial correlation is r = 0.98, or 99% favorable and 1% unfavorable, which is almost as large as is theoretically possible! It’s so large that it initially feels like an error, until we look at the raw data more closely.

There are only 20 observations, and we need to look at each participant’s pattern of results. In 1 of 20 cases, it’s a tie; this gets cut out of the calculations. In 18 of 20 cases, the score on exam 2 is higher. In 1 of 20 cases, the score on exam 2 is lower. Moreover, in this single anomalous case the absolute value of the change (0.04) is one of the smallest scores in the “diff” column, so has one of the smallest ranks (2 of 19). Thus, the rank-biserial correlation tells us that almost every student saw an INCREASE in their grades from test 1 to test 2. So, by this metric, the effect size is extremely large.
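The counting described above can be sketched in a few lines of R. This is a minimal illustration of the simple difference formula applied to signed ranks, not the output of any particular package; rank_biserial_paired() is a hypothetical helper name, and the before/after scores are made up to mimic the pattern in the data (1 tie, 1 small decrease ranked 2 of 19, and 18 increases).

```r
## Matched-pairs rank-biserial correlation via the simple difference formula
## (hypothetical helper, written for illustration)
rank_biserial_paired <- function(x, y) {
  d <- y - x
  d <- d[d != 0]                 ## ties are cut out of the calculation
  r <- rank(abs(d))              ## rank the absolute changes
  fav   <- sum(r[d > 0])         ## rank sum for increases (favorable)
  unfav <- sum(r[d < 0])         ## rank sum for decreases (unfavorable)
  (fav - unfav) / sum(r)         ## share favorable minus share unfavorable
}

## Made-up scores mimicking the pattern described above (NOT the jamovi data)
before <- seq(40, 78, by = 2)                         ## 20 students
change <- c(0, 0.02, -0.04,                           ## tie, smallest gain, the one decrease
            seq(0.5, by = 0.1, length.out = 17))      ## 17 larger gains
after  <- before + change

res_rb <- rank_biserial_paired(before, after)
round(res_rb, 2)                                      ## 0.98
```

With 19 untied pairs the ranks sum to 190; the single decrease holds rank 2, so the favorable share is 188/190 and the unfavorable share is 2/190, giving (188 - 2)/190, or about 0.98.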

Multilevel R2 (it’s both small and large): Let’s leave the paired t-test behind and analyze the data using a linear mixed model now. Reformat the data to long format, and now the outcome is grade and the predictor is test (Exam 1 vs. Exam 2), with random intercepts and fixed slopes (there are only two timepoints, so we can have only random intercepts OR random slopes but not both):


library(lme4)  ## provides lmer()
m1 <- lmer(data = longdata, value ~ name + (1|id))


The slope of 1.41 is the exact same unstandardized mean difference from the paired t-test. What’s new are the marginal R2 and conditional R2 values (see Nakagawa et al., 2012 but also Rights & Sterba, 2019). The marginal R2 looks just at the fixed effects (i.e., the effect of test), which explain a paltry 1.2% of the variance (a small effect). The conditional R2 also incorporates the random intercepts (i.e., individual differences in performance at exam 1). With those included, the conditional R2 is 98.9%, which is again almost as large as is theoretically possible! This makes sense when you look at the plot back at the beginning though: There is a lot of variability between people at exam 1, but only very slight changes from exam 1 to exam 2.
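To see where the two R2 values come from, here is a rough sketch in base R of the variance partition behind them. It uses simulated stand-in scores (hypothetical, not the jamovi data) and simple method-of-moments variance components, which in a balanced two-timepoint design like this one should agree with the REML estimates from a random-intercept model as long as the between-person estimate is positive. Treat it as an illustration of the logic rather than a replacement for the R2 a package would report from the fitted model.

```r
## Simulated stand-in data (hypothetical; NOT the jamovi dataset)
set.seed(1)
n     <- 20
exam1 <- rnorm(n, mean = 57, sd = 6.5)
exam2 <- exam1 + rnorm(n, mean = 1.4, sd = 1)

d <- exam2 - exam1

## Variance components for a balanced two-timepoint, random-intercept design
var_resid <- var(d) / 2            ## within-person (residual) variance
var_id    <- cov(exam1, exam2)     ## between-person (random intercept) variance

## Variance of the fixed-effect predictions: every exam 1 row is predicted at
## the exam 1 mean and every exam 2 row at the exam 2 mean
var_fixed <- var(rep(c(mean(exam1), mean(exam2)), each = n))

total          <- var_fixed + var_id + var_resid
r2_marginal    <- var_fixed / total               ## fixed effects only
r2_conditional <- (var_fixed + var_id) / total    ## fixed effects + random intercepts

round(c(marginal = r2_marginal, conditional = r2_conditional), 3)
```

The marginal R2 is small because the difference between the two exam means is tiny relative to the spread of scores, while the conditional R2 is large because nearly all of the remaining variance is stable between-person variance.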

Conclusion: There you have it: depending on how you define “effect size” in this toy dataset, the effect size is both large and small. The moral of this story is that you wouldn’t be able to explain the results fully with any single effect size measure. In this case, it is notable that almost every participant in this dataset did better on the second exam, within a pretty narrow range of improvement; however, the amount of improvement for any individual student was small.


Simple Validity Statistics for Teachers

My primary area of interest (besides statistics) is personality psychology. If there’s one thing you’ll notice about personality psychologists, it’s that we’re kind of obsessed with questionnaire measurement – and usually rely on some pretty complicated statistics to really be satisfied that a questionnaire is suitable for our purposes. Really though, we’re usually interested in two things:

Are the questionnaires reliable? That is, does the questionnaire produce consistent results under similar conditions?

Are the measurements valid? That is, does the questionnaire actually measure what it’s supposed to measure?

So when I started teaching courses, I started thinking about how I might build assessments that were both reliable AND valid for my students. After all, some research suggests that teachers have a pretty poor track record on developing reliable and valid ways to grade student performance. Besides, many of the assessments I use (e.g., exams, essays) share a lot in common with questionnaires, so many of the same principles should apply. In this post, I’m going to focus on convergent and divergent validity. This will require some knowledge of the correlation coefficient.

Convergent validity means that there is a strong, positive correlation between two measures that ARE supposed to be correlated with each other. If this were a scientific study, you might correlate two questionnaires that are supposed to be related to each other (say, positive affect and life satisfaction). In the context of teaching, you might correlate two assessment tools that are supposedly measuring the same thing (e.g., quizzes and exams). In this case, a large correlation provides evidence for convergent validity. Practically speaking, correlations larger than r = .30 provide acceptable evidence, and correlations greater than r = .50 provide excellent evidence.

Divergent validity means there is a small or non-existent correlation between two measures that are NOT supposed to be correlated with each other. In a teaching context, you might expect little correlation between exam and oral presentation grades, since they measure different things (e.g., critical thinking versus communication skills). Practically speaking, you would hope for correlations smaller than r = .30 to support divergent validity, with a non-significant correlation being the strongest support.

Below is a sample correlation matrix from a 3000-level course I’ve taught in the past (Research Methods in Clinical Psychology). In this class, students complete two essays and two exams.

[Table 1: correlation matrix of grades on the two essays and two exams]

N = 39, all correlations significant at p < .05

Assuming that these assessment tools are valid, I’d expect three things:

a) Grades on the two essays will be highly correlated with each other

b) Grades on the two exams will be highly correlated with each other

c) The inter-correlations between exams and essays will be large, but not as large as the correlations between assessments measured in the same way. This is because exams and essays tap overlapping – but still probably discrete – skillsets.

A brief review of the correlation matrix above supports all three contentions, and gives me a bit more confidence in the validity of my assessment tools. If these correlations were a lot lower (< .30) I’d need to investigate if it’s simply a different skillset being measured, or if my measurement was poor.
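If you want to run this kind of check on your own gradebook, a correlation matrix takes only a couple of lines in R. The sketch below uses simulated grades (the grades data frame and its column names are hypothetical, not my actual class data), built so that essays share an essay-specific skill and exams share an exam-specific skill on top of a common general ability.

```r
## Hypothetical grades for 39 students (simulated; NOT my actual class data)
set.seed(42)
n       <- 39
ability <- rnorm(n)    ## general ability shared by all four assessments
writing <- rnorm(n)    ## skill specific to essays
testing <- rnorm(n)    ## skill specific to exams

grades <- data.frame(
  essay1 = 75 + 4 * (ability + writing + rnorm(n)),
  essay2 = 75 + 4 * (ability + writing + rnorm(n)),
  exam1  = 70 + 4 * (ability + testing + rnorm(n)),
  exam2  = 70 + 4 * (ability + testing + rnorm(n))
)

## Correlation matrix: compare same-format pairs with cross-format pairs
cor_mat <- cor(grades)
round(cor_mat, 2)

## Significance test for a single pair
cor.test(grades$essay1, grades$essay2)
```

With real data, you would replace the simulated data frame with one column per assessment and one row per student, then read the matrix the same way: essay-essay and exam-exam correlations first, then the essay-exam correlations.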

There are many more ways that teachers can incorporate statistics into their teaching practice, without needing to be a statistics expert, but this is an easy one that anybody can implement.