There’s been a bunch of interesting posts on Bluesky lately about effect sizes (and some interesting blog posts too), with a few people discussing standardized effect sizes. That got me thinking about a weird little example dataset that has both a very large and a very small effect size, depending on how you define it.
Today’s toy dataset comes from the jamovi sample data. It’s a simple pre-post design (n = 20) where students take an exam at time 1 and then a second exam at time 2. Grades are measured on a 0-100% scale. The dataset has some unusual properties that highlight differences between effect size measures in a fairly dramatic way.
Below is a plot of the data depicting the results of a paired t-test, created with the ggstatsplot package.
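I won’t reproduce the exact call here, but a plot like this can be made with ggwithinstats() from the ggstatsplot package. A minimal sketch, assuming the data are in long format with one row per student per exam (the name/value columns match the long-format longdata used in the mixed-model section further down):

library(ggstatsplot)

## Within-subjects plot plus a paired t-test for the two exams
## (`name` = which exam, `value` = grade; see the reshape further down)
ggwithinstats(
  data = longdata,
  x    = name,
  y    = value,
  type = "parametric"  ## parametric = paired t-test for two within-subject levels
)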
We can think about the effect size in a few ways:
Unstandardized Effect Size (it’s small): Exam 1 (M = 56.98) had a slightly lower mean score than exam 2 (M = 58.38), for a mean difference of 1.4 points. In context (i.e., percentage points on a test), this is a very small effect. Not so small as to be completely negligible, but given how many students have elected to get the free 1.5% worth of bonus research participation points in my classes, I think it’s fair to say it’s not an amount of points that matters much to students.
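(For what it’s worth, the unstandardized effect is just the difference between the two column means; assuming the wide-format mydata2 data frame used later in the post:)

## Raw (unstandardized) mean difference, in percentage points
mean(mydata2$grade_test2) - mean(mydata2$grade_test1)  ## ~1.4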
Standardized dz (it’s large): In the plot created by the ggstatsplot package (and in lots of other software), the default Cohen’s d for a paired t-test is what Lakens (2013) refers to as “dz”. In this dataset, dz = 1.45, suggesting the two exams differ by 1.45 standard deviations. This is convenient in the sense that it is useful for power analyses and converts directly from the t-test as t/sqrt(N), but it’s clearly at odds with the unstandardized effect size, since a dz of 1.45 is incredibly large. It’s large because the denominator of the Cohen’s d formula being used (mean difference / standard deviation) is the standard deviation of the difference column (i.e., time1 - time2), which happens to be very small relative to the mean difference (i.e., 1.4 / 0.97 = 1.44, off a little just due to rounding).
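As a quick sketch, dz is just the mean of the difference scores divided by the standard deviation of the difference scores (again assuming the wide-format mydata2 data frame with grade_test1 and grade_test2 columns):

## Cohen's dz: mean of the difference scores / SD of the difference scores
diffs <- mydata2$grade_test2 - mydata2$grade_test1
dz <- mean(diffs) / sd(diffs)
dz  ## ~1.45 here

## Equivalently, t / sqrt(N) from the paired t-test
tt <- t.test(mydata2$grade_test2, mydata2$grade_test1, paired = TRUE)
unname(tt$statistic) / sqrt(length(diffs))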
Standardized dav (it’s small): Again using Lakens’ (2013) notation, we could instead compute d with the same formula we would use for an independent t-test, taking the standard deviations of the time 1 and time 2 scores themselves rather than of the difference column. If we do this, then dav = 0.22, meaning the two exams differ by 0.22 standard deviations. This is a very small effect size, and it is now in line with the unstandardized effect size. Formula below:
cohens_d <- function(x, y) {
  md  <- abs(mean(x) - mean(y))     ## mean difference (numerator)
  psd <- sqrt((var(x) + var(y))/2)  ## pooled SD (denominator)
  md / psd                          ## Cohen's d (dav)
}

res <- cohens_d(x = mydata2$grade_test1,
                y = mydata2$grade_test2)
res
Rank-Biserial Correlation (it’s large): OK, what if we analyzed the data with a non-parametric test, like the Wilcoxon signed-rank test? We could calculate an effect size for that with a rank-biserial correlation. Broadly, we can think of this effect size as the proportion of favorable pairs minus the proportion of unfavorable pairs (see Kerby, 2014). Here the rank-biserial correlation is r = 0.98, or 99% favorable pairs and 1% unfavorable pairs, which is almost as large as is theoretically possible! So large that it initially feels like an error, until we look at the raw data more closely.
There are only 20 observations, so we can look at each participant’s pattern of results. In 1 of 20 cases, the two scores are tied; this case gets dropped from the calculation. In 18 of 20 cases, the score on exam 2 is higher. In 1 of 20 cases, the score on exam 2 is lower. Moreover, in this single anomalous case the absolute value of the change (0.04) is one of the smallest in the “diff” column, so it gets one of the smallest ranks (2 of 19). Thus, the rank-biserial correlation tells us that almost every student saw an INCREASE in their grade from test 1 to test 2. So, by this metric, the effect size is extremely large.
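To make that concrete, here’s a sketch of Kerby’s (2014) simple difference formula applied to the difference scores (same mydata2 assumption as above; effectsize::rank_biserial() with paired = TRUE should give the same answer):

## Matched-pairs rank-biserial correlation:
## (favorable rank sum - unfavorable rank sum) / total rank sum, ties dropped
diffs <- mydata2$grade_test2 - mydata2$grade_test1
diffs <- diffs[diffs != 0]          ## drop the single tied case
rnks  <- rank(abs(diffs))           ## rank the absolute differences
r_rb  <- (sum(rnks[diffs > 0]) - sum(rnks[diffs < 0])) / sum(rnks)
r_rb  ## ~0.98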
Multilevel R2 (it’s both small and large): Let’s leave the paired t-test behind and analyze the data with a linear mixed model. Reformat the data to long format; now the outcome is grade, the predictor is test (Exam 1 vs. Exam 2), with random intercepts and a fixed slope (there are only two timepoints, so we can have random intercepts OR random slopes, but not both).
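The reshape might look something like this (a sketch; I’m assuming mydata2 doesn’t already have a student identifier, so I add a row-number id, and pivot_longer()’s default name/value columns are what the model below uses):

library(tidyr)
library(dplyr)

## Wide (one row per student) to long (one row per exam per student)
longdata <- mydata2 %>%
  mutate(id = row_number()) %>%                     ## hypothetical student id
  pivot_longer(cols = c(grade_test1, grade_test2))  ## default `name`/`value` columns

With longdata in hand, the model is: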
library(lme4)
library(sjPlot)

## `name` = which exam (Exam 1 vs. Exam 2), `value` = grade
m1 <- lmer(value ~ name + (1 | id), data = longdata)
tab_model(m1)
The slope of 1.41 is exactly the same unstandardized mean difference as in the paired t-test. What’s new are the marginal R2 and conditional R2 values (see Nakagawa et al., 2012, but also Rights & Sterba, 2019). The marginal R2 looks just at the fixed effects (i.e., the effect of test), which explains a paltry 1.2% of the variance (a small effect). The conditional R2 also incorporates the random intercepts (i.e., individual differences in performance at exam 1). After that, the conditional R2 is 98.9%, which again is almost as large as is theoretically possible! This makes sense when you look back at the plot at the beginning, though: there is a lot of variability between people at exam 1, but only very slight changes from exam 1 to exam 2.
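Those R2 values can be pulled straight out of the fitted model; one option is the performance package (a sketch; MuMIn::r.squaredGLMM() is another route):

library(performance)

## Marginal R2: variance explained by the fixed effect of exam alone
## Conditional R2: fixed effect plus the random intercepts (stable
## between-student differences in exam performance)
r2(m1)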
Conclusion: There you have it: depending on how you define “effect size,” the effect in this toy dataset is both large and small. The moral of this story is that no single effect size measure would fully explain these results. In this case, it is notable that almost every participant did better on the second exam, within a pretty narrow range of improvement; however, the amount of improvement for any individual student was small.