teaching Tutorials

How an effect size can simultaneously be both small and large

There’s been a bunch of interesting posts on Bluesky lately on effect sizes (and some interesting blog posts), with a few people discussing standardized effect sizes which got me thinking about this weird little example dataset that has both a very large and very small effect size, depending on how you define it.

It’s a toy dataset today from the jamovi sample data. It’s a simple pre-post design (n = 20) where students take an exam at time 1 and then take a second exam at time 2. Grades are measured on a 0-100% scale. The dataset has some unusual properties that highlight differences in effect size measures in a sort of dramatic way.

Below is a plot of the data depicting the results analyzed with a paired t-test using the ggstatsplot() package.

We can think about the effect size in a few ways:

Unstandardized Effect Size (it’s small):  Exam 1 (M = 56.98) had a slightly lower mean score than exam 2 (M = 58.38), for a mean difference of 1.4pts. In context (i.e., percentage points on a test) this is a very small effect size. Not so small to be completely negligible, but given how many students have elected to get the free 1.5% worth of bonus research participation points in my classes I think it’s fair to say it’s not an amount of points that matters much to students.

Standardized dz (it’s large): In the plot created by the ggstatsplot() package and lots of other software, the default cohen’s d for a paired t-test is what Lakens (2013) refers to as “dz”. In this dataset, dz = 1.45 suggesting the two groups differ by 1.45 standard deviations. Though this is convenient in the sense that it useful for power analyses and has a direct conversion formula from the t-test in the form t/SQRT(N), it’s clearly at odds with the unstandardized effect size since a dz of 1.45 is incredibly large.  It’s large because the denominator of cohen’s d formula used (mean difference / standard deviation) is the standard deviation of the difference column (i.e., time1-time2) which happens to be very small relative to the mean difference (i.e., 1.4 / 0.97 = 1.44, off a little just due to rounding).

Standardized dav (it’s small):  Again, using Lakens’ (2013) notation. We could use the standard deviations at time 1 and time 2 respectively using the same formula as we would with an independent t-test (i.e., using the standard deviations from time1 and time2 respectively, rather than the difference column). If we do this, then dav = 0.22 or the two groups differ by 0.22 standard deviations. This is a very small effect size, and now is in line with the unstandardized effect size. Formula below:

cohens_d <- function(x, y) {
  md  <- abs(mean(x) - mean(y))        ## mean difference (numerator)
  psd <- (var(x) + var(y))/2           
  psd <- sqrt(psd)                     ##Pooled SD
  cd  <- md/psd                        ## cohen's d

res <- cohens_d(x = mydata2$grade_test1, 
                y= mydata2$grade_test2)

Rank Biserial Correlation (it’s large): Ok, well what if we analyzed it with a non-parametric test, like the Wilcoxon sign-rank test? We could calculate an effect size for that with a rank-biserial correlation. Broadly, we can think about this effect size as % of favorable pairs – % of unfavorable pairs (see Kerby, 2014). The rank-biserial correlation is r = 0.98 or 99% favorable pairs and 1% unfavorable pairs, which is almost as large as is theoretically possible!  So large it initially feels like an error until we look at the raw data more closely.

There are only 20 observations, and we need to look at each participant’s pattern of results. In 1 of 20 cases, it’s a tie; this gets cut out of the calculations. In 18 of 20 cases, the score on exam 2 is higher. In 1 of 20 cases, the score on exam 2 is lower. Moreover, in this single anomalous case the absolute value of the change (0.04) is one of the smallest scores in the “diff” column, so has one of the smallest ranks (2 of 19). Thus, the rank-biserial correlation tells us that almost every student saw an INCREASE in their grades from test 1 to test 2. So, by this metric, the effect size is extremely large.

Multilevel R2 (it’s both small and large): Let’s leave the paired t-test behind and analyze the data using a linear mixed model now. Reformat the data to long format, and now the outcome is grade, the predictor is test (Exam 1 vs. Exam 2) with random intercepts and fixed slopes (there’s only two timepoints, so we can have only random intercepts OR random slopes but not both):


m1 <- lmer(data = longdata, value ~ name + (1|id))


The slope of 1.41 is the exact same unstandardized mean difference from the paired t-test. What’s new are the marginal R2 and conditional R2 values (see Nakagawa et al., 2012 but also Rights & Sterba, 2019) . The marginal R2 looks just at the fixed effects (i.e., the effect of test), which explains a paltry 1.2% of the variance (a small effect). The conditional R2 also incorporates the random intercepts (i.e., individual differences in performance at exam performance at exam 1). After that, the conditional R2 is 98.9% which is again almost as large as is theoretically possible! This makes sense when you look at the plot back at the beginning though: There is a lot of variability between people at exam 1, but only very slight changes from exam 1 to exam 2.

Conclusion: There you have it, depending on how you define “effect size” in this toy dataset, the effect size is both large and small.  The moral of this story is that you wouldn’t be able to explain the results fully with any single effect size measure: In this case, it is notable that almost every participant in this dataset did better on the second exam within a pretty narrow range of improvement; however, the amount of improvement for any individual student was small. 


Reflexivity in Quantitative Research

The Problem of Forking Paths

There is a lot of time to think while at home on sabbatical, and since my role is so often teaching or conducting statistical analyses, I think about statistics pretty often. The problem that has been vexing me lately is related to what Gelman and Loken call the “Garden of Forking Paths.” Simply put, most research hypotheses in the social sciences have a one-to-many relationship with their associated statistical hypotheses. In fact, there are so many different statistical hypotheses that could represent evidence (or absence of evidence) for a given research hypothesis, it is positively breathtaking. For instance, let’s start with a null hypothesis test for a two-group between-subjects comparison of a numerical outcome — arguably one of the most basic kinds of comparisons taught in virtually every introductory statistics course. Just off the top of my head, you could use an student t-test, an unequal-variances t-test, a wilcoxon rank-sum test, a permutation test, a Bayseian t-test, structured means modelling, or a MIMIC model. On top of that, there are decisions to be made about outliers, random responders, covariates, and a dizzying array of possible ways to measure almost any latent construct (e.g., Fried, 2017 summarizes 7 different instruments to measure depression, encompassing 52 different symptoms!). There are literally thousands of plausible ways to formalize any given hypothesis in statistical terms — really, a nearly infinite possible hypothesis space if you break things down into smaller variations.

This is a problem for a post-positivist approach to science. Though these principles go mostly unstated, Bisel & Adame (2017) succinctly summarize the ontology, epistemology and axiology of this worldview:

Post-positivistic research assumes that social reality is out there and has enough stability and patterning to be known. Social reality is conceived as coherent, whole, and singular. […] Post-positivistic research assumes that social reality is measurable and knowable, albeit difficult to access. […] Post-positivistic research assumes that knowledge about social reality is inherently worthwhile to acquire and should be as value neutral as possible in its characterization of that reality.

Bisel & Adame (2017) pg. 1

Most researchers doing quantitative work (myself included) ascribe to some variation of this belief structure either implicitly or explicitly. But the garden of forking paths is a serious problem for these beliefs. First, there’s a problem of ontology: Of the near-infinite possible options a researcher is faced with to formalize a verbal hypothesis in statistical terms, only a small subset of near-equivalent options can be “true” if reality is singular. Then there’s the problem of value neutrality; many analytic choices are not value neutral, and might be made consciously (or unconsciously) to support the analyst’s preexisting beliefs. For instance, when handling the familywise error rate problem, a researcher may a-priori choose to use a procedure that is known to be conservative (e.g., a Bonferroni correction) when they believe that the null hypothesis is actually true. This would bias a study against the alternative hypothesis even if the study were preregistered.

A lot of the scientific reform movement criticizes “p-hacking” where a person essentially exploits these degrees of freedom. That is, some analysts continue to explore the nearly infinite hypothesis space of possibilities for analysis until they find something that supports their argument, then report that as evidence while failing to report all the attempts that did not support their argument. Given publication bias, this very consistently results in an inflation of effect sizes and false positives. It’s been discussed extensively elsewhere, so no need to describe this further other than to reiterate that exploiting this is obviously a terrible way to do science.

Preregistration of hypotheses and a data analysis plan is a useful way to curb these sorts of abuses. With preregistration, the scientist limits the hypothesis space by pre-planning the analysis and design and committing to an approach in advance without exploration. This does a great job at reducing researcher bias and the exploitation problem noted above, and is worth implementing for that alone. Indeed, transparency is widely considered to be a virtue in statistical analysis. But the problem remains that the sheer number of arbitrary choices that need to be made before preregistering means that there is no way to know if those choices were the right choices. That is, if reality is indeed singular, and there are tens of thousands of ways to translate a verbal hypothesis into a statistical format, then the specific subset of statistical formulations that a researcher chooses in any given study (preregistered or not) is in all likelihood, not actually the best summary of reality. That is, preregistration curbs biasing of effects due to motivated reasoning, but does not solve the forking paths problem. This point has been discussed previously by Rubin (2017) and in a more mathematically sophisticated way by Devezer et al. (2021).

Another option that is can be combined with preregistration or used in isolation is multiverse analysis. Here, the researcher chooses 100s or even 1000s of possible analyses within the hypothesis space instead of just a handful of analyses, as would be more typical. Then, the researcher graphically presents all of those analyses, sometimes even producing a sort of average across many different analysis approaches. In this way, the researcher is able to see how much the results depend on idiosyncratic data analysis choices, and which choices have a big impact. There’s much to recommend this technique, though the obvious trepidation for most analysts is that is an exhausting, incredible amount of extra work. The other obvious problem for a post-positivist is that we still don’t know which of these analyses to trust most! It’s better insomuch as the hypothesis space becomes more visible, but given the sheer size of the hypothesis space (or “multiverse”) for any given question it’s still quite possible that the researcher has missed the best summary of reality. Moreover, if the analysis does vary substantially across analysis approaches, it may be difficult to know which approach (if any) is true. You’re only really in a confident place if you try 1000s of analyses, and they all produce results that are essentially equivalent. However, the assumptions of post-positivism (i.e., a singular reality) would mean that this will rarely ever be the case!

The inevitable conclusion then is that we can never be entirely certain which statistical approach is best. Statisticians are usually good at identifying and ruling out bad, untenable approaches (e.g., median splits, responder analyses; Andrew Althouse has an excellent list) but identifying the “best” approach will always be controversial and based on a set of values, beliefs, and biases of the researcher. There is no truly bias-free research, even though we can strive to reduce bias in various ways. Note too that I don’t necessarily mean political bias — analysts vary in attitudes towards various issues such as Type I vs. Type II error importance, whether robustness is more important than clear effect sizes, etc.

Reflexivity in Qualitative Research

This general problem has been known to qualitative researchers for quite some time, and their pragmatic solution is increased transparency through thorough analysis and reporting of one’s own biases. The epistemological differences of qualitative researchers have very frequently meant that statisticians and qualitative analysts don’t see eye-to-eye on a lot of topics, because their fundamental view of reality and how knowledge is created can be very different. Briefly, many qualitative researchers reject key components of post-positivism. Instead of a singular reality, they maintain that there are numerous, equally valid ways to understand the world. They also question whether any knowledge can be “value neutral” because of the inherent limitations of our own perspectives.

Thus, a common recommendation in qualitative research is to include a reflexivity statement. A good basic definition of reflexivity is given by Haynes (2012):

In simple terms, reflexivity is an awareness of the researcher’s role in the practice of research and the way this is influenced by the object of the research, enabling the researcher to acknowledge the way in which he or she affects both the research processes and outcomes. […] In other words, researcher reflexivity involves thinking about how our thinking came to be, how pre-existing understanding is constantly revised in the light of new understandings, and how this in turn affects our research. (Haynes, 2012).

I believe that reflexivity statements could be productively applied in quantitative research to improve transparency and potentially reduce bias by providing necessary context for the researcher’s analytic and design choices. There is some precedent for this, as Gelman and Henig (2015) describe “Honest acknowledgement of the researcher’s position, goals, experiences, and subjective point of view (pg. 978)” as a virtue along more conventional virtues such as transparency and impartiality.

As a general rule, reflexivity statements are virtually never used in quantitative research, even though they are a requirement for publishing qualitative research in many venues. As an illustrative example (and a challenge for myself) I am going to try and write a “reflexivity statement” for one of my prior quantitative papers:

Mackinnon, S. P., Ray, C. M., Firth, S. M., & O’Connor, R. M. (2019). Perfectionism and negative motives for drinking: A 21-day diary study. Journal of Research in Personality, 78, 177-178.

Reflexivity Statement for Mackinnon et al. (2019)

I wrote the grant application over the second half of 2015 and finalized the design and ethics application in August 2016. When I wrote the grant, I was a limited-term employee at Dalhousie with an eye to hopefully getting a tenure track position. While waiting for the grant results, I unexpectedly transitioned into a permanent instructor position at Dalhousie — formally, this position has increased teaching and no research requirement. When I won the grant, I decided to take on the grant project (with my department’s blessing) despite having no institutional requirement to conduct research.

My background in perfectionism was from studying under Simon Sherry, who was in turn influenced heavily by Gord Flett and Paul Hewitt. Succinctly put, this comes with a set of inherited biases that (a) perfectionism is primarily bad for mental health and (b) that perfectionism is, to a great extent, a negative interpersonal phenomenon. This affected my choice to focus on nondisplay of imperfection rather than other facets of perfectionism that some argue are more positive (e.g., high standards).

My background in alcohol research came from a postdoctoral fellowship with Sherry Stewart. Alongside the long history of behaviorist research at Dalhousie University where I received my Ph.D., I tend to think about psychological phenomenon of various sorts in operant conditioning terms. That is, I conceptualize drinking primarily as a function of positive and negative reinforcement. Thus, I focused primarily on negatively reinforcing motives (i.e., drinking to cope and conformity) in this paper and did not consider alternatives to this framework. Indeed, I am so inured to this perspective, I find it difficult to think of other explanatory frameworks that might be equally useful.

I am also in the position of being both the principal investigator and the primary statistical analyst on this paper. As a statistically-minded person, my bias in interpretation is often to think about statistical (rather than conceptual) limitations to research. As an analyst, I have a broad overarching bias towards more parsimonious models. That is, I value models with fewer parameters and add model complexity only when forced. This bias shows in a few places, such as using item parcels instead of items for some measures, deliberately choosing some 3-item scales in the design process, and using fixed slopes so that the model would be as parsimonious as possible. Perhaps most notably, two analyses in the paper were reviewer requests and both involved increased model complexity: Adding alcohol quantity to the model and (b) running a supplementary random slopes model. My own resistance to these components in the text is likely due to my bias for model parsimony.

Concluding Thoughts

That is a lot of text, so I could see a statement like that being an online supplement rather than in the body of the text. But I think that the process, when combined with reading the paper, would help readers understand my own biases and choices so they could gauge for themselves whether that influenced the final results. That is, my own biases shaped which forking paths I chose both in the design and analysis of my own study, and knowing those biases might help readers identify my blind spots. Though reflexivity statements are not a cure-all for psychological science, I think that they are an under-utilized tool that could be productively applied to increase transparency in quantitative studies.


Are Inferential Statistics Really What Psychology Needs?

In statistical analysis, we are generally publishing inferential statistics in psychology. That is, we often make estimates for population parameters and test null hypotheses. I have been thinking about the intersection of equity and statistics for a while now, ever since reading the history of statistics.

Tracing back to Galton, Fisher and Pearson it quickly becomes apparent that these statistical methods were designed specifically to identify group differences. That is, it begins with the assumption that different populations have different population distributions, and statistical methods were devised to set about proving those points without needing to measure data from an entire population. This can be generalized to correlations, where we are now assuming the relationship between X and Y is a single “true” value in the population that can be approximated within certain bounds of uncertainty. It was certainly revolutionary and changed the scope of what psychologists studied – we moved from intensive studies of individuals (e.g., Wundt, Ebbinghaus), to using statistical methods to estimate population parameters.

But it has always gnawed at the back of my mind that these might be the wrong questions, driven by the goal of ranking and stratifying the human population. Properly conducted, inferential statistics help us answer certain questions about populations, but my abiding interest in humans has always been about individual people, not population parameters.

I’ve begun to wonder if the field’s almost obsessive focus on population parameters – and weak tests of differences between them with null hypothesis significance testing – is fundamentally misguided in a lot of areas. There are obviously lots of cases where conventional statistics have generated important knowledge but given that much of my research interest over the years has been on human subjectivity (e.g., well-being, motivation, self-reports of personality, anxiety, depression) I’ve begun to wonder whether conventional statistical methods can really be of much use to truly understand these sorts of things. In no particular order, some problems:

1. Samples of humans willing to participate in research are virtually never random or representative, given the pragmatic constraints of actual data collection. Thus, we are estimating population parameters for specialized sub-populations that are quite often not defined or known. Outside of census research and very well-funded research programs, obtaining anything close to representativeness is typically unachievable.

2. Galton introduced the idea of the “normal” distribution, that most attributes in humans fall on it, with most being average and others falling around the tails. Somewhere along the line, the idea of normally distributed errors became generalized to the idea that TRAITS in humans fall along a normal distribution.  That is, there became a belief that population distributions of most measured human capacities, like personality and intelligence, should be normally distributed. For Galton, this interest is natural and making this argument helped with his aim to rank humans in mental ability.  However, many things we measure in humans are not distributed in this way (e.g., well-being, for instance, typically has negative skewness; alcohol consumption has positive skewness). In fact, if distributions of university grades are any indication, these distributions are also not “normally” distributed in most cases.

3. The focus on “population differences,” while not in of itself racist, it guides research along the lines as to emphasize and reify categories of people. The tools that we have promote categorization of humans into sub-populations, and comparison of those groups (e.g., t-tests, ANOVA). When we focus on Galton, Fisher, and Pearson, the bulk of our statistical methods were deliberately and explicitly designed and created to try and prove that populations differ in their mental abilities and were strongly motivated by eugenics in many cases. It constrains our inquiry to things like: (a) Does group A have a different population mean than Group B?  

4. Can there even truly be a population parameter for some latent constructs, which are themselves co-constructed through human interaction with psychological tests? Moreover, the nature of a lot of personality data is not truly numerical (it’s instead ordinal) so should we really be applying techniques for numerical data here?  

More questions than answers, I suppose. What I want to understand are within-person processes. That is, what is going on in the mind of a single person and can we actually predict their thoughts, feelings, and behavior? The intensive study of individual people has never been overly popular, but I wonder if it is something uniquely psychological that might need methods other than inferential stats.

I also think that people could do a lot richer description of humans using data visualization, descriptive statistics, and qualitative approaches. I’ve always found a detailed analysis of a human life in biography or the idea of a “quantified self” to be fascinating. There’s so much to learn about individual human variation; I think that psychology could really learn a lot by taking a step back and really thoroughly observing and describing the humans that we study.

I might explore some of these ideas more later. For now, I’m mostly just trying to give myself some time to really think and discuss ideas with other people during sabbatical, a precious commodity that gets lost in the regular bustle of day-to-day academics!


Increasing statistical power in psychological research without increasing sample size

Note: I wrote this ages ago back in 2013, and it lives on the now defunct open science collaboration blog. Storing it here on my own website for archiving purposes. Original link:

What is statistical power?

As scientists, we strive to avoid errors. Type I errors are false positives: Finding a relationship when one probably doesn’t exist.  Lots has been said about these kinds of errors, and the field of psychology is sometimes accused of excessive Type I error rates through publication biases, p-hacking, and failing to account for multiple comparisons (Open Science Collaboration, in press).  Type II errors are false negatives: Failing to find a relationship when one probably does exist. Type II errors are related to statistical power. Statistical Power is the probability that the test will reject the null hypothesis when the null hypothesis is false. Many authors suggest a statistical power rate of at least .80. This corresponds to an 80% probability of not committing a Type II error. Accuracy in parameter estimation (APIE) is closely related to statistical power, and it refers to the width of the confidence interval for the effect size (Maxwell et al., 2008). The smaller this width, the more precise your results are. This means that low statistical power not only increases Type II errors, but also Type I errors because underpowered studies have wide confidence intervals. Simply put, underpowered studies are imprecise and are unlikely to replicate (Button et al., 2013).

Studies in psychology are grossly underpowered

Psychological research has been grossly underpowered for a long time. Fifty years ago, Cohen (1962) estimated the statistical power to detect a medium effect size in abnormal psychology was about .48. The situation has improved slightly, but it’s still a serious problem today. For instance, one review suggested only 52% of articles in the applied psychology literature achieved .80 power for a medium effect size (Mone et al., 1996). This is in part because psychologists are studying small effects. One massive review of 322 meta-analyses including 8 million participants suggested that the average effect size in social psychology is relatively small (r = .21). To put this into perspective, you’d need 175 participants to have .80 power for a simple correlation between two variables at this effect size. This gets even worse when we’re studying interaction effects. One review suggests that the average effect size for interaction effects is even smaller (f2 = .009), which means that sample sizes of around 875 people would be needed to achieve .80 power.

What can we do to increase power?

Traditional recommendations for increasing statistical power suggest either (a) Increasing sample size, (b) maximizing effect size or (c) using a more liberal p-value criteria. However, increasing the effect size has no impact on the width of the confidence interval (Maxwell et al., 2008), and using a more liberal p-value comes at the expense of increased Type I error. Thus, most people assume that increasing the sample size is the only consistent way to increase statistical power. This isn’t always feasible due to funding limitations, or because researchers are studying rare populations (e.g., people with autism spectrum disorder). Fortunately, there are other solutions. Below, I list three ways you can increase statistical power without increasing sample size. You might also check out Hansen and Collins (1994) for a lengthier discussion.

Recommendation 1: Decrease the mean square error

Decreasing the mean square error will have the same impact as increasing sample size (if you want to see the math, check out McClelland, 2000). Okay. You’ve probably heard the term “mean square error” before, but the definition might be kind of fuzzy. Basically, your model makes a prediction for what the outcome variable (Y) should be, given certain values of the predictor (X). Naturally, it’s not a perfect prediction because you have measurement error, and because there are other important variables you probably didn’t measure. The mean square error is the difference between what your model predicts, and what the true values of the data actually are.  So, anything that improves the quality of your measurement or accounts for potential confounding variables will reduce the mean square error, and thus improve statistical power. Let’s make this concrete. Here are three specific techniques you can use:

  1. Reduce measurement error by using more reliable measures (i.e., better internal consistency, test-retest reliability, inter-rater reliability, etc.). You’ve probably read that .70 is the “rule-of-thumb” for acceptable reliability. Okay, sure. That’s publishable. But consider this: Let’s say you want to test a correlation coefficient. Assuming both measures have a reliability of .70, your observed correlation will be about 1.43 times smaller than the true population parameter (I got this using Spearman’s correlation attenuation formula).  Because you have a smaller observed effect size, you end up with less statistical power. Why do this to yourself? Reduce measurement error.  
  • Control for confounding variables. With correlational research, this means including control variables that predict the outcome variable, but are relatively uncorrelated with other predictor variables. In experimental designs, this means taking great care to control for as many possible confounds as possible. In both cases, this reduces the mean square error and improves the overall predictive power of the model – and thus, improves statistical power.
  • Use repeated-measures designs. Repeated measures designs reduce the mean square error by partitioning out the variance due to subjects. Depending on the kind of analysis you do, it can also increase the degrees of freedom for the analysis substantially (e.g., some multi-level models). I’m a big fan of repeated measures designs, because they allow researchers to collect a lot of data from fewer participants.

Recommendation 2: Increase the variance of your predictor variable

Another less-known way to increase statistical power is to increase the variance of your predictor variables (X).

  1. In correlational research, use more comprehensive continuous measures. That is, there should be a large possible range of values endorsed by participants. However, the measure should also capture many different aspects of the construct of interest; artificially increasing the range of X by adding redundant items (i.e., simply re-phrasing existing items to ask the same question) will actually hurt the validity of the analysis. Also, avoid dichotomizing your measures (e.g., median splits), because this reduces the variance and typically reduces power.
  • In experimental research, unequally allocating participants to each condition can improve statistical power. For example, say you were designing an experiment with 3 conditions (let’s say low, medium, or high self-esteem). Most of us would equally assign participants to all three groups, right? Well, as it turns out, that reduces statistical power. The optimal design for a linear relationship would be 50% low, 50% high, and omit the medium condition. The optimal design for a quadratic relationship would be 25% low, 50% medium, and 25% high. The proportions vary widely depending on the design and the kind of relationship you expect, but I recommend you check out McClelland (1997) to get more information on efficient experimental designs. You might be surprised.

Recommendation 3: Make sure predictor variables are uncorrelated with each other

A final way to increase statistical power is to increase the proportion of the variance in X not shared with other variables in the model. When predictor variables are correlated with each other, this is known as colinearity. For example, depression and anxiety are positively correlated with each other; including both as simultaneous predictors (say, in multiple regression) means that statistical power will be reduced, especially if one of the two variables actually doesn’t predict the outcome variable. Lots of textbooks suggest that we should only be worried about this when colinearity is extremely high (e.g., correlations around > .70). However, studies have shown that even modest intercorrlations among predictor variables will reduce statistical power (Mason et al., 1991). Bottom line: If you can design a model where your predictor variables are relatively uncorrelated with each other, you can improve statistical power.


Increasing statistical power is one of the rare times where what is good for science, and what is good for your career actually coincides. It increases the accuracy and replicability of results, so it’s good for science. It also increases your likelihood of finding a statistically significant result (assuming the effect actually exists), making it more likely to get something published. You don’t need to torture your data with obsessive re-analysis until you get p < .05.  Instead, put more thought into research design in order to maximize statistical power. Everyone wins, and you can use that time you used to spend sweating over p-values to do something more productive. Like volunteering with the Open Science Collaboration.


Button, K. S., Ioannidis, J. A., Mokrysz, C., Nosek, B. A., Flint, J., Robinson, E. J., & Munafò, M. R. (2013). Power failure: Why small sample size undermines the reliability of neuroscience. Nature Reviews Neuroscience, 14(5), 365-376. doi: 10.1038/nrn3475

Cohen, J. (1962). The statistical power of abnormal-social psychological research: A review. The Journal of Abnormal and Social Psychology, 65, 145-153. doi:10.1037/h0045186

Hansen, W. B., & Collins, L. M. (1994). Seven ways to increase power without increasing N. In L. M. Collins & L. A. Seitz (Eds.), Advances in data analysis for prevention intervention research (NIDA Research Monograph 142, NIH Publication No. 94-3599, pp. 184–195). Rockville, MD: National Institutes of Health.

Mason, C. H., & Perreault, W. D. (1991). Collinearity, power, and interpretation of multiple regression analysis. Journal of Marketing Research, 28, 268-280. doi:10.2307/3172863

Maxwell, S. E., Kelley, K., & Rausch, J. R. (2008). Sample size planning for statistical power and accuracy in parameter estimation. Annual Review of Psychology, 59, 537-563. doi:10.1146/annurev.psych.59.103006.093735

McClelland, G. H. (1997). Optimal design in psychological research. Psychological Methods, 2, 3-19. doi:10.1037/1082-989X.2.1.3

McClelland, G. H. (2000). Increasing statistical power without increasing sample size. American Psychologist, 55, 963-964. doi:10.1037/0003-066X.55.8.963

Mone, M. A., Mueller, G. C., & Mauland, W. (1996). The perceptions and usage of statistical power in applied psychology and management research. Personnel Psychology, 49, 103-120. doi:10.1111/j.1744-6570.1996.tb01793.x

Open Science Collaboration. (in press). The Reproducibility Project: A model of large-scale collaboration for empirical research on reproducibility. In V. Stodden, F. Leisch, & R. Peng (Eds.), Implementing Reproducible Computational Research (A Volume in The R Series).  New York, NY: Taylor & Francis. doi:10.2139/ssrn.2195999


Which Starburst Flavour is Tastiest?


Ever wonder which starburst flavor is the best? Now you can, with the help of SCIENCE. Naturally, pink is the best (personal communication, me). But don’t take my word for it, polls from Buzzfeed and both support pink as the best, followed by red, yellow, and orange, respectively. However, these polls rely on retrospective memory and could benefit from experiments where participants provide numeric ratings of each candy immediately after consuming them. Thus, the present experiment has the following research question: What is the rank order of tastiness for the original 4 starburst candy flavors?


Students from my third year research methods in social psychology course (N = 72) completed this experiment between 2016-2018 as part of a class exercise. Each student was given 4 starburst candies, one of each flavor (pink/strawberry, red/cherry, orange, and yellow/lemon). They were instructed to eat each candy sequentially in the same order: pink, red, yellow, and orange. Students rated each candy on a 7-point scale from 1 (Extremely untasty) to 7 (Extremely tasty). Thus, I used a within-subjects design with 4 conditions and no counterbalancing.


Data were analyzed using a Kruskall-Wallis test, setting alpha at .05. Pairwise tests were conducted using Wilcoxon rank-sum tests, correcting alpha using a Bonferroni correction for familywise error rates (alpha = .008). Results are displayed in the two figures below. There was a clear ranked preference with pink emerging as the tastiest, followed by red, orange and yellow. Pairwise tests found statistically significant differences between these pairs: (a) pink-orange, (b) pink-yellow, and (c) red-yellow. Open data and syntax can be found on my OSF page.

Figure 1. Faceted bar plots comparing starburst candy flavors
Figure 2. Box plots with pairwise comparisons using wilcoxon rank-sum tests. Numbers indicate p-values.


Overall, the results mostly mirror the results of the polls, except that the rank order of orange and yellow are switched. Pink is indeed the superior flavor of original starburst candy flavors. The lack of counterbalancing is a significant limitation, as people may have gotten sick of eating starburst candies by the last one. This seems unlikely, as starburst are delicious. In sum, science supports pink starburst as the tastiest flavor, and yellow as the least tasty flavor.


Video Game Genres by Year: 1980-2016

Video games have been a hobby since childhood so it’s fun to look back at the history of video games. I found a cool dataset that includes all video games with sales of 100,000 or more from 1980-2016. It’s necessarily incomplete since it was scraped from an existing website; this is especially true for the early days of gaming (e.g, it has only 98 Nintendo Entertainment System (NES) games out of the library of ~716). It also misses a lot of the download-only market for PC and has no mobile games. Nonetheless, it’s still an interesting snapshot of video game history. For my part, I’m interested in tracking the popularity of video game genres from 1980-2016. That is, the type of game (e.g., platformer, shooter, role-playing). Starting out, I’d presume that platformers (e.g., Super Mario) would have been most popular back in the NES and Super Nintendo / Sega Genesis era (the halcyon, nostalgia-ridden days of my youth). But let’s see what turns up!

First thing to note is that there is really not nearly as much data here prior to ~1995 (shortly after the Playstation 1 comes out). I ran a simple stacked bar chart first to look at this.

ggplot(newdata, aes(x = Year2, fill = Genre)) +
  geom_bar(position = "stack") +
  scale_fill_ptol() +
  scale_x_continuous(limits = c(1980,2016), breaks = seq(1980, 2016, 5)) +
  labs(x = "", y = "") +
  ggtitle("Video Games by Year and Genre: 1980-2016")

The problem with this chart is that I can’t make any sense of how popular different genres are because of the large differences in counts. Thus, it makes more sense to look at proportions. I decided to go with a stacked density plot this time, and I’m pleased with the result.

ggplot(newdata, aes(x = Year2, group = Genre, fill = Genre)) +
  geom_density(adjust=1.5, position="fill") +
  scale_fill_ptol() +
  scale_x_continuous(limits = c(1980,2016), breaks = seq(1980, 2016, 5)) +
  scale_y_continuous(limits = c(0,1), breaks = seq(0,1,.10)) +
  labs(x = "", y = "") +
  ggtitle("Video Games by Year and Genre: 1980-2016")

To interpret the results, it might help to know the broad eras of video game consoles:

  • (1976–1983): Atari 2600
  • (1983–1987): NES.
  • (1987–1993): SNES/Genesis.
  • (1993–1998): PS1/N64/Saturn.
  • (1998–2005): Dreamcast/PS2/Gamecube/Xbox.
  • (2005–2012): PS3/360/Wii.
  • (2012-Present): PS4/XB1/WiiU.

First, remember that these are only the games that sold more than 100,000 copies, so it’s more an index of popularity than actual games published. With this in mind, a few interesting trends emerge.

  • Action games are most popular in the Atari era (e.g., Frogger), and have picked up again substantially in the present-day era of console gaming.
  • Adventure games have slowly increased in popularity over time.
  • Fighting games hit their peak in the early 90s. Games like Mortal Kombat, Street Fighter, and King of Fighters are popular here.
  • As expected, platformers are huge during the NES and early SNES era. The popularity of Super Mario Bros. spawned a lot of clones.
  • The big spike and then decline of puzzle games is probably an artifact of the data. Puzzle games are probably even more popular now with mobile phones, but those games aren’t included in the dataset. I expect these games were much easier to program, back in the day, which accounts for the early spike (e.g., Tetris).
  • Racing games get popular in the 1995-2005 era. The popularity of game series like Mario Kart spring to mind here.
  • Roleplaying games don’t emerge as top sellers until the NES era, and stay pretty steady after that. I expected to see spike of roleplaying games in the PS1 era, considered by many to be the golden age of JRPGs. But probably doesn’t get picked up in this dataset because many of them are cult classics, and did not sell a lot in the US.
  • Strategy games don’t emerge as top sellers until the SNES era, but remain pretty steady after that.

Overall, very cool data and fun to look through. As usual, data and code available on the blog’s OSF site.


Cloud-based apps for statistics teaching

At the recent 2019 Canadian Psychological Association convention, I gave a workshop on Cloud-Based Statistics Apps that could potentially be useful for teaching. I’m posting the list of applications here in case anyone finds them useful for teaching. In addition, also posting my slides from the talk, where I describe my approach to using these apps when teaching undergrad statistics.

Central Limit Theorem (Means)

This application runs simulations to demonstrate the central limit theorem from different population distributions by sampling up to 1000 times from a population. It shows the population distribution, the first 10 sample distributions, and the sampling distribution. This apps helps enormously with teaching students the central limit theorem. It can sample from normal, uniform, right-skewed, and left-skewed population distributions.

Central Limit Theorem (Proportions)

This app is like the one above, except that it shows the sampling distribution for proportions.

Regression Diagnostics

This app simulates new data for each run and shows students how to review diagnostic plots from linear regression models. It allows students to simulate situations where the normality assumption is likely to be met as well as situations where the assumptions are violated (quadratic relationships, non-constant variance). Helpful for getting students to interpret plots of residuals.

Generic Distribution Calculator

Calculates area under the curve for various distributions, such as z, binomial, t, F and χ2. Useful when teaching students about standardizing distributions (e.g., converting to z-scores) and for conceptual understanding of p-values.

Apps by Bruce Dudek

Note that Bruce Dudek does not post his code for these apps online. However, he may be willing to share code with you if you send an email (I use some of these apps in my own classes!).

Distributions (z, t, F & chi-squared)

Bruce developed a set of distribution apps that improve upon the generic ShinyEd one above in some ways. These apps are primarily useful for looking up p-values, and can replace looking up p-values in conventional critical value tables from old-school statistics textbooks. If the emphasis is entirely on looking up p-values, these apps work better.

Descriptive Statistics and Visualization

This is a cool app that has a few datasets built in, and allows students to calculate descriptive statistics (e.g., means, SDs, range) as well as a few basic data visualizations (e.g., histograms, box plots). Can be useful as a way to introduce students to exploring data with graphs.

Confidence Intervals

This app simulates confidence intervals (up to 100 at a time). It is useful to help students understand the conceptual issues surrounding confidence intervals, and the impact of sample size and confidence limit (e.g., 90% vs 95%) changes the width of the intervals.

RShiny Apps from Other Sources

ANOVA Playground

This app has a LOT going on. It can allow the user to simulate data and run a one-way ANOVA, complete with beautiful data visualizations. It can also allow the user to read in data to run a one-way ANOVA on a pre-specified dataset. Finally, it also allows the user to run individual t-tests as a rudimentary form of post-hoc test. This app is very useful to help students understand and interpret one-way ANOVA.

Other Online Apps that Don’t Use RShiny

The applications below don’t actually use RShiny but are similar in the sense that they can be run in a web browser without having to download software. I’ve found them to be useful in teaching.

Guess the Correlation

A game that shows students a scatterplot and asks them to guess the magnitude. Very fun.


This allows you to add dots to a scatterplot with a point-and-click format, calculates the correlation coefficient and allows you to save the resulting dataset. Insanely useful for teaching.

Rock ‘n Poll

Explains the concept of sampling variability using political polling as an example. Very beautiful viz.

Rossman/Chance Java Applet Collection

This site has been around a long time and uses java and javascript. Many different simulations, some that duplicate those above, some new. They are not as “pretty” as some of the other options, but they are still very useful, and notably have some simulations on probability. Note: The older ones that use java don’t seem to run on all computers; the javascript ones (labeled “js”) seem to work fine.

Kristoffer Magnusson’s d3.js Visualizations

Kristoffer has made some truly excellent data visualizations worth checking out. The ones above cover null hypothesis significance testing, confidence intervals, effect size, t-distributions, and equivalence testing. I honestly can’t hype these enough, they are all top-notch.

Online cloud-based statistics software

Sometimes, you want your students to analyze some data in a similar fashion to how they would if they had conventional software installed (e.g., SPSS, SAS). All of these do all of the calculations online, and remove the need for students to install software which, I have discovered, is a very error-prone process for many students. Useful if you want students to do statistical calculations in a class where learning a particular software is not the goal.


Lollipop Plot: Meta-analysis of Gender Differences in Sexual Behaviour

When I went to the APS conference in San Francisco last year, I got to hear Janet Hyde talk about the gender similarities hypothesis. Broadly, she argues that most gender differences (i.e., men vs. women) in psychological variables tend to be small in size. She used meta-analysis — statistically summarizing the results of lots of published research — as a method of testing her hypothesis. I thought it was fascinating stuff and a great talk, so I wanted to incorporate some of her research into intro psychology. Since I’ve been sprucing up the intro psych section on sex and gender, I thought it would be interesting to include Petersen and Hyde’s (2010) meta-analysis on gender differences in sexual attitudes and behaviour. Broadly, it found that there are a few large sex differences (e.g., men watch more pornography and masturbate more than women), but consistent with her general argument, most differences were very small in size (i.e., smaller than d = .20) and men and women are more similar than different. However, the paper itself doesn’t present the results in a very PowerPoint friendly way, and certainly not intelligible to freshmen students:

Portion of Table 1 from Petersen & Hyde (2010)

So, with that in mind I decided to make a plot up to better visualize the data. The degree of uncertainty displayed by the 95% CI is important … but most in the class don’t actually have any statistics background. So I decided to teach them what “cohen’s d” was first, then create a graph that showed off those numbers. I also taught them that values less than .20 are essentially negligible, so I wanted to highlight that on the plot. I decided on a lollipop plot, which is a bit more appealing than a bar plot.


#Order the variables in Rank Order

mydata$name <- factor(mydata$name, levels = mydata$name[order(mydata$d)])

#Create the Plot

  ggplot(mydata, aes(x=name, y=d, label=d)) + 
    geom_point(stat='identity', fill="black", size=6)  +
    geom_segment(aes(y = 0, 
                     x = name, 
                     yend = d, 
                     xend = name), 
                 color = "black") +
    geom_text(color="white", size=2) +
    labs(title="Gender Differences in Sexual Behaviour", 
         subtitle = "Negative numbers = more common in women; positive numbers = more common in men",
         x = "", y = "cohen's d (effect size)") + 
    ylim(-1, 1) +
    theme_bw() +
    geom_hline(yintercept = 0, linetype = "solid", color = "grey") +
    geom_hline(yintercept = -.20, linetype = "dashed", color = "grey") +
    geom_hline(yintercept = .20, linetype = "dashed", color = "grey") +
Lollipop plot showing meta-analysis results

Overall, I think the plot turned out pretty clear and was a lot more appealing than just putting a table of numbers for a more dynamic presentation. The ordering from largest to smallest helps a lot to guide the eye, and the dotted grey lines show the range of “trivial” effect sizes. So there are some substantial gender differences, but the effect sizes are actually a LOT smaller than you’d expect! For context, a cohen’s d of ~.60 would look like this (using this site to generate this quickly):

So phrased another way, if you selected a random man and a random woman, there is ~67% chance that the man will watch more pornography than the woman (the “common language effect size“). That’s not insubstantial, but also probably a lot smaller than a lot of people would think! Many of the effect sizes are vanishingly small. For a Cohen’s d of 0.20, there’s only a 55% chance that a randomly selected man will be more likely to engage in the behavior than a randomly selected woman. For that matter, men and women are actually more similar than they are different on all of these variables, if you consider the percent overlap among the distributions. Which broadly, is what Dr. Hyde’s point has been all along.

Code and data available on my OSF page.


Open Sex Role Inventory

Did you know about the Open Source Psychometrics project? It has more than two dozen personality tests that are all free to use with a creative commons license and posts large, open access datasets for their validation? Wow. What’s even stranger is that this site has no university affiliation, so far as I can tell and I can’t find any info on the site’s administrator. They’ve collected data from ~300,000 people here, these datasets are massive and an incredible resource, but are also virtually untouched by the academy. It’s downright criminal.

Case in point, the Open Sex Roles Inventory (OSRI). I started looking up data on Bem’s Sex Role inventory to help enrich the gender section of my intro psychology class, when I came across this measure. The open dataset had ~20,000 data points in it and I thought it’d be fun to play around with. I sent an email to the site administrator asking about how to score it and validation info. No email response, but the website was updated four days later with a TON of info and modifications to the scale to make a 20-item, more well-validated short form from the original 44 items. Seriously, who is this wonderful person? These data could easily be published in any personality journal they wanted, but they’re just free for all to use with no glory or publishing prestige. Searching online, I found a single publication using the OSRI, and it was a Ph.D. dissertation that used the data to test out a new method of measurement invariance, for which this particular data was mostly incidental (finding that the 44-item version of the OSRI didn’t have measurement invariance across heterosexual vs. non-heterosexual groups.

Anyway, back to the point. I wanted to have a visualization for my intro psychology class to clarify the distinction between sex (i.e., biological/physiological) and gender (i.e., socio-cultural). Specifically, I wanted to talk about Sandra Bem’s theory published way back in 1974 that gender roles are spectrum with at least two dimensions (masculinity vs. femininity). I’m aware of challenges to this distinction (e.g., that sex/gender are inseparable, and that sex is also a spectrum), but broadly won’t get into that here. In any event, the dataset I have only has sex measured as “Male, Female, Other,” so a categorical model will need to do for now. My goal was to show that there is a sizable minority of people beyond cis-gendered men and women. The dataset is pretty large (~16,000 after missing data was removed), so I decided to use geom_hex() with the viridis color scheme. This crushes my computer, and takes a long time to run but in the end, after a lot of obsessing over details, I am happy with the result. Special thanks to this post on Stack Overflow that helped me solve an graphical glitch that stumped me for a long time.

ggplot(mydata2, aes(x = FemSF, y = MascSF)) +
  geom_hex(aes(colour = ..count..), bins = 22) + 
  theme_classic() +
  facet_grid(. ~ gender) +
  scale_fill_viridis() +
  geom_hline(yintercept = mean(mydata$MascSF)) +
  geom_vline(xintercept = mean(mydata$FemSF)) +
  labs(y = "Masculinity", x = "Femininity" ) +
  geom_text(x=4.25, y=4.25, label="Androgynous", col = "darkslategray2") + 
  geom_text(x=2.0, y=1.75, label="Undifferentiated", col = "darkslategray2") +
  geom_text(x=2.0, y=4.25, label="Masculine", col = "darkslategray2") +
  geom_text(x=4.25, y=1.75, label="Feminine", col = "darkslategray2")
Hexagonal Scatterplot for the OSRI

I like the way this graph came out. It shows that indeed, you can predict self-identified male or female status with the items on this questionnaire. But, it also shows that there’s a lot of variation, with folks that are androgynous (i.e., high in both) or undifferentiated (i.e., low in both) in for males and females. The “Other” gender category looks more similar to the female category than the male category, but a little more in the center overall. When I did some t-tests comparing men vs. women, femininity had a cohen’s d of 1.14 and masculinity had a cohen’s d of 0.82. Those are huge effect sizes! Much larger than the majority of sex differences in the literature.

Out of general interest, I decided to do a few more quick plots to break separate out the heterosexual vs. the non-heterosexual participants (I binned homosexual, bixsexual, asexual and other into one category for comparison in this analysis). Nothing fancy this time around, just some simple density plots:

ggplot(mydata2, aes(x = Fem, fill = gender)) +
  geom_density(alpha = .4) +
  facet_grid(. ~ orientation.d) +
  scale_fill_manual(values = pal) +
  labs(x = "Femininity", fill ="")

ggplot(mydata2, aes(x = MascSF, fill = gender)) +
  geom_density(alpha = .4) +
  facet_grid(. ~ orientation.d) +
  scale_fill_manual(values = pal) +
  labs(x ="Masculinity", fill ="")
Density plots for femininity and masculinity, split by gender and sexual orientation

One thing that is pretty striking about these plots is that the femininity and masculinity scales predict identifying as a heterosexual man or woman REALLY well, with massive effect sizes. Folks identifying as “other” fell somewhere in between, though a little closer to the female profile. However, for the non-heterosexual folks it’s a different story. The sex difference is still there for femininity, but less pronounced, mostly accounted for by men being more feminine. For the masculinity scale though, there’s basically no discrimination at all! Men, women and non-binary folks can’t really be distinguished based on their masculinity scores. I suppose that’s consistent with the measurement invariance issue the this Ph.D. thesis picked up on. Not sure what it means in a theoretical sense, but it’s definitely interesting.

I could stop here, but I wanted to also do a quick factor analysis on the short form data, just to see if the factor structure looked good. Nothing fancy or in-depth, just an exploratory factor analysis extracting two factors, with principal axis factoring, and an oblique rotation. I also fussed around until I could make a nice plot of the factor loadings, using the ggrepel package.

#Factor Analysis

mydata4 <- select(mydata2, 
       Q4, Q26, Q28, Q30, Q34, Q36, Q38, Q40, Q42, Q44,
Q1, Q3, Q5, Q7, Q9, Q17, Q29, Q35, Q39, Q43)

corMatsf <- cor(mydata4)

solutionsf <- fa(r = corMatsf, nfactors = 2, rotate = "oblimin", fm = "pa")

#Made a .csv file manually by copying into excel (can't figure out better solution)

print(solution$loadings, digits=2, cutoff=0)
loadings.sf <- read.csv("efa.loadingsSF2.csv")

#The plot

ggplot(loadings.sf, aes(x = PA1, y = PA2, label = item, col = FemMasc)) +
  geom_label_repel() +
  geom_point(size = 3) +
  scale_y_continuous(limits = c(-.80, .80), breaks = seq(-.80, .80, .10)) +
  scale_x_continuous(limits = c(-.80, .80), breaks = seq(-.80, .80, .10)) +
  geom_hline(yintercept = 0) +
  geom_vline(xintercept = 0) +
  scale_color_manual(values = c("hotpink", "royalblue")) +
  labs(x = "Factor 1 (Masculinity)", y = "Factor 2 (Femininity)", col = "") +

Plot of standardized factor loadings for OSRI

That looks pretty excellent, items are clearly distinguished by their factors and the factors are mostly uncorrelated. So the measurement seems solid. No wonder, given the amount of data that went into validating it! Overall, very cool dataset and measure. Might even use this myself some day in a published study…

If you want to take the test yourself, take a look here:

Data and syntax available on my OSF site. Data were used as part of a CC BY-NC-SA 4.0 license, with attribution the mysterious owner of


Pathfinder Monster Database: AC vs Touch AC

Ok, one more visit to this pathfinder monster database before I’m on to a new dataset. This time, I wanted to take a look at the relationship between Armor Class (i.e., how hard a monster is to hit) and Challenge Rating (i.e., how tough the monster is, according to the game developers). There should be a pretty strong linear relationship between AC and CR.

However, the thing I’m really interested in is the relationship between Touch AC and CR. Because pathfinder is needlessly complicated, monsters have a separate AC for just touching them (or getting hit with certain kinds of attacks like rays or bullets). It’s been my experience that touch AC is always low, and doesn’t seem to change at all as the game progresses. Pragmatically speaking, this means that at a certain point touch attacks basically always hit. Let’s see if that’s the case. I think some run-of-the mill scatterplots will do here, but might as well make them fancy. Three packages needed: (a) ggplot2 for the basic plotting; (b) ggExtra to add density distributions to the plot; (c) cowplot, to stitch the two plots together at the end and (d) ggthemes, in case I want some custom color palettes.


Then, I can make the plots. Made the dots transparent with “alpha = 0.3” to deal with overplotting. Personally, I find it looks cleaner than geom_jitter(); it’s always low-key bothered me that the randomness added by geom_jitter() actually changes the positions of the raw data a little. Feels a little like lying with data to me, sometimes. With the ggmarginal command, I added in the density plots so we can see the overall distributions. Then at the end, I used cowplot to stitch them together side-by-side. Knowing this, I made sure both graphs had the same y-axis lengths to facilitate easier comparison, a key component of the small multiples approach.

p1 <- ggplot(mydata, aes(x = CR, y = AC)) +
  geom_point(size = 2, alpha = 0.3, color = "steelblue") +
  theme_classic() +
  geom_smooth(col = "royalblue", se = FALSE) +
  scale_y_continuous(breaks = seq(0, 70, 5)) +
  scale_x_continuous(breaks = seq(0, 40, 5)) 

p1m <- ggMarginal(p1, type = "density", fill = "steelblue")

p2 <- ggplot(mydata, aes(x = CR, y = AC_Touch)) +
  geom_point(size = 2, alpha = 0.3, color = "steelblue") +
  theme_classic() +
  geom_smooth(col = "royalblue", se = FALSE) +
  scale_y_continuous(limits = c(0, 70), breaks = seq(0, 70, 5)) +
  scale_x_continuous(breaks = seq(0, 40, 5)) +
  labs(y = "Touch AC")

p2m <- ggMarginal(p2, type = "density", fill = "steelblue")

totalplot2 <- cowplot::plot_grid(p1m, p2m, labels = "auto", ncol = 2)
base_aspect_ratio = 2)
Scatterplots of AC and CR

Ok, well there’s a clear (very strong) linear relationship between CR and AC (r = .91). That’s probably not surprising, given that the game designers probably wanted that to happen by design, to keep up with increases to attack bonuses that players get each level. However, touch AC is very weakly related to touch AC (r = .12). In fact, it’s really only mythic level monsters beyond CR 20 that see any increase in Touch AC at all! So for the vast majority of play, there’s no relationship between CR and Touch AC. I got to wondering then, what is the variance in touch AC actually attributed to? A good first guess would be a creature’s size:

ggplot(mydata, aes(x = CR, y = AC_Touch, col = Size)) +
  geom_point(size = 2, alpha = 0.3) +
  theme_classic() +
  geom_smooth(method = "lm", se = FALSE) +
  scale_y_continuous(limits = c(0, 70), breaks = seq(0, 70, 5)) +
  scale_x_continuous(breaks = seq(0, 40, 5)) +
  labs(y = "Touch AC") +

Yup, looks like the majority of variation is due to that. So in summary, touch AC doesn’t really change as monster CR increases, and most of the variation is due to size. It’s a weird mechanic to add to the game, and easily abusable by additions like the gunslinger that always hit touch AC. Far as I can tell though, hasn’t been removed from the second version under playtest. Oh well, add it to the list of weird broken mechanics, I suppose.

Scatterplots of touch AC and CR, split by size

Syntax and data available on the blog’s OSF page.