Increasing statistical power in psychological research without increasing sample size

Note: I wrote this ages ago back in 2013, and it lives on the now defunct open science collaboration blog. Storing it here on my own website for archiving purposes. Original link:

What is statistical power?

As scientists, we strive to avoid errors. Type I errors are false positives: Finding a relationship when one probably doesn’t exist.  Lots has been said about these kinds of errors, and the field of psychology is sometimes accused of excessive Type I error rates through publication biases, p-hacking, and failing to account for multiple comparisons (Open Science Collaboration, in press).  Type II errors are false negatives: Failing to find a relationship when one probably does exist. Type II errors are related to statistical power. Statistical Power is the probability that the test will reject the null hypothesis when the null hypothesis is false. Many authors suggest a statistical power rate of at least .80. This corresponds to an 80% probability of not committing a Type II error. Accuracy in parameter estimation (APIE) is closely related to statistical power, and it refers to the width of the confidence interval for the effect size (Maxwell et al., 2008). The smaller this width, the more precise your results are. This means that low statistical power not only increases Type II errors, but also Type I errors because underpowered studies have wide confidence intervals. Simply put, underpowered studies are imprecise and are unlikely to replicate (Button et al., 2013).

Studies in psychology are grossly underpowered

Psychological research has been grossly underpowered for a long time. Fifty years ago, Cohen (1962) estimated the statistical power to detect a medium effect size in abnormal psychology was about .48. The situation has improved slightly, but it’s still a serious problem today. For instance, one review suggested only 52% of articles in the applied psychology literature achieved .80 power for a medium effect size (Mone et al., 1996). This is in part because psychologists are studying small effects. One massive review of 322 meta-analyses including 8 million participants suggested that the average effect size in social psychology is relatively small (r = .21). To put this into perspective, you’d need 175 participants to have .80 power for a simple correlation between two variables at this effect size. This gets even worse when we’re studying interaction effects. One review suggests that the average effect size for interaction effects is even smaller (f2 = .009), which means that sample sizes of around 875 people would be needed to achieve .80 power.

What can we do to increase power?

Traditional recommendations for increasing statistical power suggest either (a) Increasing sample size, (b) maximizing effect size or (c) using a more liberal p-value criteria. However, increasing the effect size has no impact on the width of the confidence interval (Maxwell et al., 2008), and using a more liberal p-value comes at the expense of increased Type I error. Thus, most people assume that increasing the sample size is the only consistent way to increase statistical power. This isn’t always feasible due to funding limitations, or because researchers are studying rare populations (e.g., people with autism spectrum disorder). Fortunately, there are other solutions. Below, I list three ways you can increase statistical power without increasing sample size. You might also check out Hansen and Collins (1994) for a lengthier discussion.

Recommendation 1: Decrease the mean square error

Decreasing the mean square error will have the same impact as increasing sample size (if you want to see the math, check out McClelland, 2000). Okay. You’ve probably heard the term “mean square error” before, but the definition might be kind of fuzzy. Basically, your model makes a prediction for what the outcome variable (Y) should be, given certain values of the predictor (X). Naturally, it’s not a perfect prediction because you have measurement error, and because there are other important variables you probably didn’t measure. The mean square error is the difference between what your model predicts, and what the true values of the data actually are.  So, anything that improves the quality of your measurement or accounts for potential confounding variables will reduce the mean square error, and thus improve statistical power. Let’s make this concrete. Here are three specific techniques you can use:

  1. Reduce measurement error by using more reliable measures (i.e., better internal consistency, test-retest reliability, inter-rater reliability, etc.). You’ve probably read that .70 is the “rule-of-thumb” for acceptable reliability. Okay, sure. That’s publishable. But consider this: Let’s say you want to test a correlation coefficient. Assuming both measures have a reliability of .70, your observed correlation will be about 1.43 times smaller than the true population parameter (I got this using Spearman’s correlation attenuation formula).  Because you have a smaller observed effect size, you end up with less statistical power. Why do this to yourself? Reduce measurement error.  
  • Control for confounding variables. With correlational research, this means including control variables that predict the outcome variable, but are relatively uncorrelated with other predictor variables. In experimental designs, this means taking great care to control for as many possible confounds as possible. In both cases, this reduces the mean square error and improves the overall predictive power of the model – and thus, improves statistical power.
  • Use repeated-measures designs. Repeated measures designs reduce the mean square error by partitioning out the variance due to subjects. Depending on the kind of analysis you do, it can also increase the degrees of freedom for the analysis substantially (e.g., some multi-level models). I’m a big fan of repeated measures designs, because they allow researchers to collect a lot of data from fewer participants.

Recommendation 2: Increase the variance of your predictor variable

Another less-known way to increase statistical power is to increase the variance of your predictor variables (X).

  1. In correlational research, use more comprehensive continuous measures. That is, there should be a large possible range of values endorsed by participants. However, the measure should also capture many different aspects of the construct of interest; artificially increasing the range of X by adding redundant items (i.e., simply re-phrasing existing items to ask the same question) will actually hurt the validity of the analysis. Also, avoid dichotomizing your measures (e.g., median splits), because this reduces the variance and typically reduces power.
  • In experimental research, unequally allocating participants to each condition can improve statistical power. For example, say you were designing an experiment with 3 conditions (let’s say low, medium, or high self-esteem). Most of us would equally assign participants to all three groups, right? Well, as it turns out, that reduces statistical power. The optimal design for a linear relationship would be 50% low, 50% high, and omit the medium condition. The optimal design for a quadratic relationship would be 25% low, 50% medium, and 25% high. The proportions vary widely depending on the design and the kind of relationship you expect, but I recommend you check out McClelland (1997) to get more information on efficient experimental designs. You might be surprised.

Recommendation 3: Make sure predictor variables are uncorrelated with each other

A final way to increase statistical power is to increase the proportion of the variance in X not shared with other variables in the model. When predictor variables are correlated with each other, this is known as colinearity. For example, depression and anxiety are positively correlated with each other; including both as simultaneous predictors (say, in multiple regression) means that statistical power will be reduced, especially if one of the two variables actually doesn’t predict the outcome variable. Lots of textbooks suggest that we should only be worried about this when colinearity is extremely high (e.g., correlations around > .70). However, studies have shown that even modest intercorrlations among predictor variables will reduce statistical power (Mason et al., 1991). Bottom line: If you can design a model where your predictor variables are relatively uncorrelated with each other, you can improve statistical power.


Increasing statistical power is one of the rare times where what is good for science, and what is good for your career actually coincides. It increases the accuracy and replicability of results, so it’s good for science. It also increases your likelihood of finding a statistically significant result (assuming the effect actually exists), making it more likely to get something published. You don’t need to torture your data with obsessive re-analysis until you get p < .05.  Instead, put more thought into research design in order to maximize statistical power. Everyone wins, and you can use that time you used to spend sweating over p-values to do something more productive. Like volunteering with the Open Science Collaboration.


Button, K. S., Ioannidis, J. A., Mokrysz, C., Nosek, B. A., Flint, J., Robinson, E. J., & Munafò, M. R. (2013). Power failure: Why small sample size undermines the reliability of neuroscience. Nature Reviews Neuroscience, 14(5), 365-376. doi: 10.1038/nrn3475

Cohen, J. (1962). The statistical power of abnormal-social psychological research: A review. The Journal of Abnormal and Social Psychology, 65, 145-153. doi:10.1037/h0045186

Hansen, W. B., & Collins, L. M. (1994). Seven ways to increase power without increasing N. In L. M. Collins & L. A. Seitz (Eds.), Advances in data analysis for prevention intervention research (NIDA Research Monograph 142, NIH Publication No. 94-3599, pp. 184–195). Rockville, MD: National Institutes of Health.

Mason, C. H., & Perreault, W. D. (1991). Collinearity, power, and interpretation of multiple regression analysis. Journal of Marketing Research, 28, 268-280. doi:10.2307/3172863

Maxwell, S. E., Kelley, K., & Rausch, J. R. (2008). Sample size planning for statistical power and accuracy in parameter estimation. Annual Review of Psychology, 59, 537-563. doi:10.1146/annurev.psych.59.103006.093735

McClelland, G. H. (1997). Optimal design in psychological research. Psychological Methods, 2, 3-19. doi:10.1037/1082-989X.2.1.3

McClelland, G. H. (2000). Increasing statistical power without increasing sample size. American Psychologist, 55, 963-964. doi:10.1037/0003-066X.55.8.963

Mone, M. A., Mueller, G. C., & Mauland, W. (1996). The perceptions and usage of statistical power in applied psychology and management research. Personnel Psychology, 49, 103-120. doi:10.1111/j.1744-6570.1996.tb01793.x

Open Science Collaboration. (in press). The Reproducibility Project: A model of large-scale collaboration for empirical research on reproducibility. In V. Stodden, F. Leisch, & R. Peng (Eds.), Implementing Reproducible Computational Research (A Volume in The R Series).  New York, NY: Taylor & Francis. doi:10.2139/ssrn.2195999


Which Starburst Flavour is Tastiest?


Ever wonder which starburst flavor is the best? Now you can, with the help of SCIENCE. Naturally, pink is the best (personal communication, me). But don’t take my word for it, polls from Buzzfeed and both support pink as the best, followed by red, yellow, and orange, respectively. However, these polls rely on retrospective memory and could benefit from experiments where participants provide numeric ratings of each candy immediately after consuming them. Thus, the present experiment has the following research question: What is the rank order of tastiness for the original 4 starburst candy flavors?


Students from my third year research methods in social psychology course (N = 72) completed this experiment between 2016-2018 as part of a class exercise. Each student was given 4 starburst candies, one of each flavor (pink/strawberry, red/cherry, orange, and yellow/lemon). They were instructed to eat each candy sequentially in the same order: pink, red, yellow, and orange. Students rated each candy on a 7-point scale from 1 (Extremely untasty) to 7 (Extremely tasty). Thus, I used a within-subjects design with 4 conditions and no counterbalancing.


Data were analyzed using a Kruskall-Wallis test, setting alpha at .05. Pairwise tests were conducted using Wilcoxon rank-sum tests, correcting alpha using a Bonferroni correction for familywise error rates (alpha = .008). Results are displayed in the two figures below. There was a clear ranked preference with pink emerging as the tastiest, followed by red, orange and yellow. Pairwise tests found statistically significant differences between these pairs: (a) pink-orange, (b) pink-yellow, and (c) red-yellow. Open data and syntax can be found on my OSF page.

Figure 1. Faceted bar plots comparing starburst candy flavors
Figure 2. Box plots with pairwise comparisons using wilcoxon rank-sum tests. Numbers indicate p-values.


Overall, the results mostly mirror the results of the polls, except that the rank order of orange and yellow are switched. Pink is indeed the superior flavor of original starburst candy flavors. The lack of counterbalancing is a significant limitation, as people may have gotten sick of eating starburst candies by the last one. This seems unlikely, as starburst are delicious. In sum, science supports pink starburst as the tastiest flavor, and yellow as the least tasty flavor.


Video Game Genres by Year: 1980-2016

Video games have been a hobby since childhood so it’s fun to look back at the history of video games. I found a cool dataset that includes all video games with sales of 100,000 or more from 1980-2016. It’s necessarily incomplete since it was scraped from an existing website; this is especially true for the early days of gaming (e.g, it has only 98 Nintendo Entertainment System (NES) games out of the library of ~716). It also misses a lot of the download-only market for PC and has no mobile games. Nonetheless, it’s still an interesting snapshot of video game history. For my part, I’m interested in tracking the popularity of video game genres from 1980-2016. That is, the type of game (e.g., platformer, shooter, role-playing). Starting out, I’d presume that platformers (e.g., Super Mario) would have been most popular back in the NES and Super Nintendo / Sega Genesis era (the halcyon, nostalgia-ridden days of my youth). But let’s see what turns up!

First thing to note is that there is really not nearly as much data here prior to ~1995 (shortly after the Playstation 1 comes out). I ran a simple stacked bar chart first to look at this.

ggplot(newdata, aes(x = Year2, fill = Genre)) +
  geom_bar(position = "stack") +
  scale_fill_ptol() +
  scale_x_continuous(limits = c(1980,2016), breaks = seq(1980, 2016, 5)) +
  labs(x = "", y = "") +
  ggtitle("Video Games by Year and Genre: 1980-2016")

The problem with this chart is that I can’t make any sense of how popular different genres are because of the large differences in counts. Thus, it makes more sense to look at proportions. I decided to go with a stacked density plot this time, and I’m pleased with the result.

ggplot(newdata, aes(x = Year2, group = Genre, fill = Genre)) +
  geom_density(adjust=1.5, position="fill") +
  scale_fill_ptol() +
  scale_x_continuous(limits = c(1980,2016), breaks = seq(1980, 2016, 5)) +
  scale_y_continuous(limits = c(0,1), breaks = seq(0,1,.10)) +
  labs(x = "", y = "") +
  ggtitle("Video Games by Year and Genre: 1980-2016")

To interpret the results, it might help to know the broad eras of video game consoles:

  • (1976–1983): Atari 2600
  • (1983–1987): NES.
  • (1987–1993): SNES/Genesis.
  • (1993–1998): PS1/N64/Saturn.
  • (1998–2005): Dreamcast/PS2/Gamecube/Xbox.
  • (2005–2012): PS3/360/Wii.
  • (2012-Present): PS4/XB1/WiiU.

First, remember that these are only the games that sold more than 100,000 copies, so it’s more an index of popularity than actual games published. With this in mind, a few interesting trends emerge.

  • Action games are most popular in the Atari era (e.g., Frogger), and have picked up again substantially in the present-day era of console gaming.
  • Adventure games have slowly increased in popularity over time.
  • Fighting games hit their peak in the early 90s. Games like Mortal Kombat, Street Fighter, and King of Fighters are popular here.
  • As expected, platformers are huge during the NES and early SNES era. The popularity of Super Mario Bros. spawned a lot of clones.
  • The big spike and then decline of puzzle games is probably an artifact of the data. Puzzle games are probably even more popular now with mobile phones, but those games aren’t included in the dataset. I expect these games were much easier to program, back in the day, which accounts for the early spike (e.g., Tetris).
  • Racing games get popular in the 1995-2005 era. The popularity of game series like Mario Kart spring to mind here.
  • Roleplaying games don’t emerge as top sellers until the NES era, and stay pretty steady after that. I expected to see spike of roleplaying games in the PS1 era, considered by many to be the golden age of JRPGs. But probably doesn’t get picked up in this dataset because many of them are cult classics, and did not sell a lot in the US.
  • Strategy games don’t emerge as top sellers until the SNES era, but remain pretty steady after that.

Overall, very cool data and fun to look through. As usual, data and code available on the blog’s OSF site.


Cloud-based apps for statistics teaching

At the recent 2019 Canadian Psychological Association convention, I gave a workshop on Cloud-Based Statistics Apps that could potentially be useful for teaching. I’m posting the list of applications here in case anyone finds them useful for teaching. In addition, also posting my slides from the talk, where I describe my approach to using these apps when teaching undergrad statistics.

Central Limit Theorem (Means)

This application runs simulations to demonstrate the central limit theorem from different population distributions by sampling up to 1000 times from a population. It shows the population distribution, the first 10 sample distributions, and the sampling distribution. This apps helps enormously with teaching students the central limit theorem. It can sample from normal, uniform, right-skewed, and left-skewed population distributions.

Central Limit Theorem (Proportions)

This app is like the one above, except that it shows the sampling distribution for proportions.

Regression Diagnostics

This app simulates new data for each run and shows students how to review diagnostic plots from linear regression models. It allows students to simulate situations where the normality assumption is likely to be met as well as situations where the assumptions are violated (quadratic relationships, non-constant variance). Helpful for getting students to interpret plots of residuals.

Generic Distribution Calculator

Calculates area under the curve for various distributions, such as z, binomial, t, F and χ2. Useful when teaching students about standardizing distributions (e.g., converting to z-scores) and for conceptual understanding of p-values.

Apps by Bruce Dudek

Note that Bruce Dudek does not post his code for these apps online. However, he may be willing to share code with you if you send an email (I use some of these apps in my own classes!).

Distributions (z, t, F & chi-squared)

Bruce developed a set of distribution apps that improve upon the generic ShinyEd one above in some ways. These apps are primarily useful for looking up p-values, and can replace looking up p-values in conventional critical value tables from old-school statistics textbooks. If the emphasis is entirely on looking up p-values, these apps work better.

Descriptive Statistics and Visualization

This is a cool app that has a few datasets built in, and allows students to calculate descriptive statistics (e.g., means, SDs, range) as well as a few basic data visualizations (e.g., histograms, box plots). Can be useful as a way to introduce students to exploring data with graphs.

Confidence Intervals

This app simulates confidence intervals (up to 100 at a time). It is useful to help students understand the conceptual issues surrounding confidence intervals, and the impact of sample size and confidence limit (e.g., 90% vs 95%) changes the width of the intervals.

RShiny Apps from Other Sources

ANOVA Playground

This app has a LOT going on. It can allow the user to simulate data and run a one-way ANOVA, complete with beautiful data visualizations. It can also allow the user to read in data to run a one-way ANOVA on a pre-specified dataset. Finally, it also allows the user to run individual t-tests as a rudimentary form of post-hoc test. This app is very useful to help students understand and interpret one-way ANOVA.

Other Online Apps that Don’t Use RShiny

The applications below don’t actually use RShiny but are similar in the sense that they can be run in a web browser without having to download software. I’ve found them to be useful in teaching.

Guess the Correlation

A game that shows students a scatterplot and asks them to guess the magnitude. Very fun.


This allows you to add dots to a scatterplot with a point-and-click format, calculates the correlation coefficient and allows you to save the resulting dataset. Insanely useful for teaching.

Rock ‘n Poll

Explains the concept of sampling variability using political polling as an example. Very beautiful viz.

Rossman/Chance Java Applet Collection

This site has been around a long time and uses java and javascript. Many different simulations, some that duplicate those above, some new. They are not as “pretty” as some of the other options, but they are still very useful, and notably have some simulations on probability. Note: The older ones that use java don’t seem to run on all computers; the javascript ones (labeled “js”) seem to work fine.

Kristoffer Magnusson’s d3.js Visualizations

Kristoffer has made some truly excellent data visualizations worth checking out. The ones above cover null hypothesis significance testing, confidence intervals, effect size, t-distributions, and equivalence testing. I honestly can’t hype these enough, they are all top-notch.

Online cloud-based statistics software

Sometimes, you want your students to analyze some data in a similar fashion to how they would if they had conventional software installed (e.g., SPSS, SAS). All of these do all of the calculations online, and remove the need for students to install software which, I have discovered, is a very error-prone process for many students. Useful if you want students to do statistical calculations in a class where learning a particular software is not the goal.


Lollipop Plot: Meta-analysis of Gender Differences in Sexual Behaviour

When I went to the APS conference in San Francisco last year, I got to hear Janet Hyde talk about the gender similarities hypothesis. Broadly, she argues that most gender differences (i.e., men vs. women) in psychological variables tend to be small in size. She used meta-analysis — statistically summarizing the results of lots of published research — as a method of testing her hypothesis. I thought it was fascinating stuff and a great talk, so I wanted to incorporate some of her research into intro psychology. Since I’ve been sprucing up the intro psych section on sex and gender, I thought it would be interesting to include Petersen and Hyde’s (2010) meta-analysis on gender differences in sexual attitudes and behaviour. Broadly, it found that there are a few large sex differences (e.g., men watch more pornography and masturbate more than women), but consistent with her general argument, most differences were very small in size (i.e., smaller than d = .20) and men and women are more similar than different. However, the paper itself doesn’t present the results in a very PowerPoint friendly way, and certainly not intelligible to freshmen students:

Portion of Table 1 from Petersen & Hyde (2010)

So, with that in mind I decided to make a plot up to better visualize the data. The degree of uncertainty displayed by the 95% CI is important … but most in the class don’t actually have any statistics background. So I decided to teach them what “cohen’s d” was first, then create a graph that showed off those numbers. I also taught them that values less than .20 are essentially negligible, so I wanted to highlight that on the plot. I decided on a lollipop plot, which is a bit more appealing than a bar plot.


#Order the variables in Rank Order

mydata$name <- factor(mydata$name, levels = mydata$name[order(mydata$d)])

#Create the Plot

  ggplot(mydata, aes(x=name, y=d, label=d)) + 
    geom_point(stat='identity', fill="black", size=6)  +
    geom_segment(aes(y = 0, 
                     x = name, 
                     yend = d, 
                     xend = name), 
                 color = "black") +
    geom_text(color="white", size=2) +
    labs(title="Gender Differences in Sexual Behaviour", 
         subtitle = "Negative numbers = more common in women; positive numbers = more common in men",
         x = "", y = "cohen's d (effect size)") + 
    ylim(-1, 1) +
    theme_bw() +
    geom_hline(yintercept = 0, linetype = "solid", color = "grey") +
    geom_hline(yintercept = -.20, linetype = "dashed", color = "grey") +
    geom_hline(yintercept = .20, linetype = "dashed", color = "grey") +
Lollipop plot showing meta-analysis results

Overall, I think the plot turned out pretty clear and was a lot more appealing than just putting a table of numbers for a more dynamic presentation. The ordering from largest to smallest helps a lot to guide the eye, and the dotted grey lines show the range of “trivial” effect sizes. So there are some substantial gender differences, but the effect sizes are actually a LOT smaller than you’d expect! For context, a cohen’s d of ~.60 would look like this (using this site to generate this quickly):

So phrased another way, if you selected a random man and a random woman, there is ~67% chance that the man will watch more pornography than the woman (the “common language effect size“). That’s not insubstantial, but also probably a lot smaller than a lot of people would think! Many of the effect sizes are vanishingly small. For a Cohen’s d of 0.20, there’s only a 55% chance that a randomly selected man will be more likely to engage in the behavior than a randomly selected woman. For that matter, men and women are actually more similar than they are different on all of these variables, if you consider the percent overlap among the distributions. Which broadly, is what Dr. Hyde’s point has been all along.

Code and data available on my OSF page.


Open Sex Role Inventory

Did you know about the Open Source Psychometrics project? It has more than two dozen personality tests that are all free to use with a creative commons license and posts large, open access datasets for their validation? Wow. What’s even stranger is that this site has no university affiliation, so far as I can tell and I can’t find any info on the site’s administrator. They’ve collected data from ~300,000 people here, these datasets are massive and an incredible resource, but are also virtually untouched by the academy. It’s downright criminal.

Case in point, the Open Sex Roles Inventory (OSRI). I started looking up data on Bem’s Sex Role inventory to help enrich the gender section of my intro psychology class, when I came across this measure. The open dataset had ~20,000 data points in it and I thought it’d be fun to play around with. I sent an email to the site administrator asking about how to score it and validation info. No email response, but the website was updated four days later with a TON of info and modifications to the scale to make a 20-item, more well-validated short form from the original 44 items. Seriously, who is this wonderful person? These data could easily be published in any personality journal they wanted, but they’re just free for all to use with no glory or publishing prestige. Searching online, I found a single publication using the OSRI, and it was a Ph.D. dissertation that used the data to test out a new method of measurement invariance, for which this particular data was mostly incidental (finding that the 44-item version of the OSRI didn’t have measurement invariance across heterosexual vs. non-heterosexual groups.

Anyway, back to the point. I wanted to have a visualization for my intro psychology class to clarify the distinction between sex (i.e., biological/physiological) and gender (i.e., socio-cultural). Specifically, I wanted to talk about Sandra Bem’s theory published way back in 1974 that gender roles are spectrum with at least two dimensions (masculinity vs. femininity). I’m aware of challenges to this distinction (e.g., that sex/gender are inseparable, and that sex is also a spectrum), but broadly won’t get into that here. In any event, the dataset I have only has sex measured as “Male, Female, Other,” so a categorical model will need to do for now. My goal was to show that there is a sizable minority of people beyond cis-gendered men and women. The dataset is pretty large (~16,000 after missing data was removed), so I decided to use geom_hex() with the viridis color scheme. This crushes my computer, and takes a long time to run but in the end, after a lot of obsessing over details, I am happy with the result. Special thanks to this post on Stack Overflow that helped me solve an graphical glitch that stumped me for a long time.

ggplot(mydata2, aes(x = FemSF, y = MascSF)) +
  geom_hex(aes(colour = ..count..), bins = 22) + 
  theme_classic() +
  facet_grid(. ~ gender) +
  scale_fill_viridis() +
  geom_hline(yintercept = mean(mydata$MascSF)) +
  geom_vline(xintercept = mean(mydata$FemSF)) +
  labs(y = "Masculinity", x = "Femininity" ) +
  geom_text(x=4.25, y=4.25, label="Androgynous", col = "darkslategray2") + 
  geom_text(x=2.0, y=1.75, label="Undifferentiated", col = "darkslategray2") +
  geom_text(x=2.0, y=4.25, label="Masculine", col = "darkslategray2") +
  geom_text(x=4.25, y=1.75, label="Feminine", col = "darkslategray2")
Hexagonal Scatterplot for the OSRI

I like the way this graph came out. It shows that indeed, you can predict self-identified male or female status with the items on this questionnaire. But, it also shows that there’s a lot of variation, with folks that are androgynous (i.e., high in both) or undifferentiated (i.e., low in both) in for males and females. The “Other” gender category looks more similar to the female category than the male category, but a little more in the center overall. When I did some t-tests comparing men vs. women, femininity had a cohen’s d of 1.14 and masculinity had a cohen’s d of 0.82. Those are huge effect sizes! Much larger than the majority of sex differences in the literature.

Out of general interest, I decided to do a few more quick plots to break separate out the heterosexual vs. the non-heterosexual participants (I binned homosexual, bixsexual, asexual and other into one category for comparison in this analysis). Nothing fancy this time around, just some simple density plots:

ggplot(mydata2, aes(x = Fem, fill = gender)) +
  geom_density(alpha = .4) +
  facet_grid(. ~ orientation.d) +
  scale_fill_manual(values = pal) +
  labs(x = "Femininity", fill ="")

ggplot(mydata2, aes(x = MascSF, fill = gender)) +
  geom_density(alpha = .4) +
  facet_grid(. ~ orientation.d) +
  scale_fill_manual(values = pal) +
  labs(x ="Masculinity", fill ="")
Density plots for femininity and masculinity, split by gender and sexual orientation

One thing that is pretty striking about these plots is that the femininity and masculinity scales predict identifying as a heterosexual man or woman REALLY well, with massive effect sizes. Folks identifying as “other” fell somewhere in between, though a little closer to the female profile. However, for the non-heterosexual folks it’s a different story. The sex difference is still there for femininity, but less pronounced, mostly accounted for by men being more feminine. For the masculinity scale though, there’s basically no discrimination at all! Men, women and non-binary folks can’t really be distinguished based on their masculinity scores. I suppose that’s consistent with the measurement invariance issue the this Ph.D. thesis picked up on. Not sure what it means in a theoretical sense, but it’s definitely interesting.

I could stop here, but I wanted to also do a quick factor analysis on the short form data, just to see if the factor structure looked good. Nothing fancy or in-depth, just an exploratory factor analysis extracting two factors, with principal axis factoring, and an oblique rotation. I also fussed around until I could make a nice plot of the factor loadings, using the ggrepel package.

#Factor Analysis

mydata4 <- select(mydata2, 
       Q4, Q26, Q28, Q30, Q34, Q36, Q38, Q40, Q42, Q44,
Q1, Q3, Q5, Q7, Q9, Q17, Q29, Q35, Q39, Q43)

corMatsf <- cor(mydata4)

solutionsf <- fa(r = corMatsf, nfactors = 2, rotate = "oblimin", fm = "pa")

#Made a .csv file manually by copying into excel (can't figure out better solution)

print(solution$loadings, digits=2, cutoff=0)
loadings.sf <- read.csv("efa.loadingsSF2.csv")

#The plot

ggplot(loadings.sf, aes(x = PA1, y = PA2, label = item, col = FemMasc)) +
  geom_label_repel() +
  geom_point(size = 3) +
  scale_y_continuous(limits = c(-.80, .80), breaks = seq(-.80, .80, .10)) +
  scale_x_continuous(limits = c(-.80, .80), breaks = seq(-.80, .80, .10)) +
  geom_hline(yintercept = 0) +
  geom_vline(xintercept = 0) +
  scale_color_manual(values = c("hotpink", "royalblue")) +
  labs(x = "Factor 1 (Masculinity)", y = "Factor 2 (Femininity)", col = "") +

Plot of standardized factor loadings for OSRI

That looks pretty excellent, items are clearly distinguished by their factors and the factors are mostly uncorrelated. So the measurement seems solid. No wonder, given the amount of data that went into validating it! Overall, very cool dataset and measure. Might even use this myself some day in a published study…

If you want to take the test yourself, take a look here:

Data and syntax available on my OSF site. Data were used as part of a CC BY-NC-SA 4.0 license, with attribution the mysterious owner of


Pathfinder Monster Database: AC vs Touch AC

Ok, one more visit to this pathfinder monster database before I’m on to a new dataset. This time, I wanted to take a look at the relationship between Armor Class (i.e., how hard a monster is to hit) and Challenge Rating (i.e., how tough the monster is, according to the game developers). There should be a pretty strong linear relationship between AC and CR.

However, the thing I’m really interested in is the relationship between Touch AC and CR. Because pathfinder is needlessly complicated, monsters have a separate AC for just touching them (or getting hit with certain kinds of attacks like rays or bullets). It’s been my experience that touch AC is always low, and doesn’t seem to change at all as the game progresses. Pragmatically speaking, this means that at a certain point touch attacks basically always hit. Let’s see if that’s the case. I think some run-of-the mill scatterplots will do here, but might as well make them fancy. Three packages needed: (a) ggplot2 for the basic plotting; (b) ggExtra to add density distributions to the plot; (c) cowplot, to stitch the two plots together at the end and (d) ggthemes, in case I want some custom color palettes.


Then, I can make the plots. Made the dots transparent with “alpha = 0.3” to deal with overplotting. Personally, I find it looks cleaner than geom_jitter(); it’s always low-key bothered me that the randomness added by geom_jitter() actually changes the positions of the raw data a little. Feels a little like lying with data to me, sometimes. With the ggmarginal command, I added in the density plots so we can see the overall distributions. Then at the end, I used cowplot to stitch them together side-by-side. Knowing this, I made sure both graphs had the same y-axis lengths to facilitate easier comparison, a key component of the small multiples approach.

p1 <- ggplot(mydata, aes(x = CR, y = AC)) +
  geom_point(size = 2, alpha = 0.3, color = "steelblue") +
  theme_classic() +
  geom_smooth(col = "royalblue", se = FALSE) +
  scale_y_continuous(breaks = seq(0, 70, 5)) +
  scale_x_continuous(breaks = seq(0, 40, 5)) 

p1m <- ggMarginal(p1, type = "density", fill = "steelblue")

p2 <- ggplot(mydata, aes(x = CR, y = AC_Touch)) +
  geom_point(size = 2, alpha = 0.3, color = "steelblue") +
  theme_classic() +
  geom_smooth(col = "royalblue", se = FALSE) +
  scale_y_continuous(limits = c(0, 70), breaks = seq(0, 70, 5)) +
  scale_x_continuous(breaks = seq(0, 40, 5)) +
  labs(y = "Touch AC")

p2m <- ggMarginal(p2, type = "density", fill = "steelblue")

totalplot2 <- cowplot::plot_grid(p1m, p2m, labels = "auto", ncol = 2)
base_aspect_ratio = 2)
Scatterplots of AC and CR

Ok, well there’s a clear (very strong) linear relationship between CR and AC (r = .91). That’s probably not surprising, given that the game designers probably wanted that to happen by design, to keep up with increases to attack bonuses that players get each level. However, touch AC is very weakly related to touch AC (r = .12). In fact, it’s really only mythic level monsters beyond CR 20 that see any increase in Touch AC at all! So for the vast majority of play, there’s no relationship between CR and Touch AC. I got to wondering then, what is the variance in touch AC actually attributed to? A good first guess would be a creature’s size:

ggplot(mydata, aes(x = CR, y = AC_Touch, col = Size)) +
  geom_point(size = 2, alpha = 0.3) +
  theme_classic() +
  geom_smooth(method = "lm", se = FALSE) +
  scale_y_continuous(limits = c(0, 70), breaks = seq(0, 70, 5)) +
  scale_x_continuous(breaks = seq(0, 40, 5)) +
  labs(y = "Touch AC") +

Yup, looks like the majority of variation is due to that. So in summary, touch AC doesn’t really change as monster CR increases, and most of the variation is due to size. It’s a weird mechanic to add to the game, and easily abusable by additions like the gunslinger that always hit touch AC. Far as I can tell though, hasn’t been removed from the second version under playtest. Oh well, add it to the list of weird broken mechanics, I suppose.

Scatterplots of touch AC and CR, split by size

Syntax and data available on the blog’s OSF page.


Pathfinder Monster Database Plots

I want to incorporate more R into my classes at Dalhousie. Problem is, I am a pretty bad R coder– I spent much of the past decade or so with SPSS and Mplus. But there’s lots of evidence that R is the future of science. I find that the best way to learn is project-based, so I’m going to start blogging on R code. I’m going to focus on topics that are inherently interesting to me, with a focus on data visualization. If I keep it fun, I’m more likely to stick with it.

So, to start I’m going to analyze data from the Pathfinder Monster Database, a comprehensive database of all 2812 monsters from Paizo’s tabletop roleplaying game, Pathfinder. I’ve played Pathfinder for years now and there are a lot of crunchy numbers in there. Probably why I like it so much! I’m going to look at the relationship between creature type two outcome variables (a) Armor Class (i.e., how hard the creature is to hit) and (b) Challenge rating (i.e., how tough the monster is overall). The goal is to see what creature type is “toughest” overall.

The data needed a little bit of cleaning (e.g., changing “Dragon” to “dragon” for some entries), but it was in good shape overall. I decided to try out ridge plots as the way to visualize the data, since I’ve never used them before. First thing to do is load the necessary libraries into R.


Next, since I want the two plots to be in order from highest to lowest values of AC/CR, I need to use the next bit of code which requires dplyr. This creates two new variables I can use to re-order the y-axis with later. I also created a color palette of 13 random colors, since there are 13 creature types and I didn’t like the default ggplot2 colors here.

<h1>Order variables by AC</h1>
avg <- mydata %>%
group_by(Type) %&gt;%
summarise(mean = mean(AC))

ACorder &lt;- avg$Type[order(avg$mean)]
<h1>Order variables by CR</h1>
avg2 <- mydata %>%
group_by(Type) %&gt;%
summarise(mean2 = mean(CR))

CRorder &lt;- avg2$Type[order(avg2$mean2)]
<h1>Create color palette</h1>
pal &lt;- rainbow(13)

Ok, now I can create the two plots using the geom_density_ridges() function. This needs the ggridges package to function, as base ggplot2 can’t do this.

ggplot(mydata, aes(x = CR, y = Type, fill = Type)) +
geom_density_ridges() +
theme_ridges() +
theme(legend.position = "none") +
scale_y_discrete(limits = CRorder) +
scale_x_continuous(limits = c(0,30), breaks = seq(0, 30, 5)) +
scale_fill_manual(values = pal) +
labs (y = "", x = "Challenge Rating")

ggplot(mydata, aes(x = AC, y = Type, fill = Type)) +
geom_density_ridges() +
theme_ridges() +
theme(legend.position = "none") +
scale_y_discrete(limits = CRorder) +
scale_x_continuous(limits = c(0,50), breaks = seq(0, 50, 5)) +
scale_fill_manual(values = pal) +
labs (y = "", x = "Armor Class")

So, the toughest monster types in Pathfinder are dragons, followed by outsiders. The weakest monster types are vermin and animals. The ranking of toughness by CR and AC are exactly the same, as it turns out. However, the distribution for oozes are way different than everything else: These creature types tend to be really easy to hit, but are still tough because of lots of other abilities and immunities. The positive skew in the distributions for CR are interesting, since it shows that there are generally a LOT more monsters under CR 10, which makes sense given that very few games get to such high levels.

I like ridge plots. They work a lot better than overlapping histograms when there are lots of groups and lots of cases. There was a bit of difficulty with numbers less than 1 for the CR plot (e.g., some CRs are 1/3). Without the “scale_x_continuous(limits = c(0,50)” function, the graph displayed values less than 0, which is outside the range of actual data. I believe that the graph is now bunching all the CRs that are less than 1 (~217 data points) as “0” on the graph above. Overall, a fun first attempt, and neat data to work with.

Datafile and syntax available on the blog’s OSF page.


Generalized Linear Models for Between Subjects Designs

There aren’t many good, easy-to-understand resources on Generalized Linear Models. This is a shame, because they are usually a substantial improvement over more conventional ANOVA analyses, because they can much better account for violations of the normality assumption. Check out some tutorial slides I created here:

They only cover between-subjects designs. Maybe some time I’ll also make one for generalized mixed models, which take the best of GLiM and multilevel models and combine them into one.


Simple Validity Statistics for Teachers

My primary area of interest (besides statistics) is personality psychology. If there’s one thing you’ll notice about personality psychologists, it’s that we’re kind of obsessed with questionnaire measurement – and usually rely on some pretty complicated statistics to really be satisfied that a questionnaire is suitable for our purposes. Really though, we’re usually interested in two things:

Are the questionnaires reliable? That is, does the questionnaire produce consistent results under similar conditions?

Are the measurements valid? That is, does the questionnaire actually measure what it’s supposed to measure?

So when I started teaching courses, I started thinking about how I might build assessments that were both reliable AND valid for my students. After all, some research suggests that teachers have a pretty poor track record on developing reliable and valid ways to grade student performance. Besides, many of the assessments I use (e.g., exams, essays) share a lot in common with questionnaires, so many of the same principles should apply. In this post, I’m going to focus on convergent and divergent validity. This will require some knowledge of the correlation coefficient.

Convergent validity means that there is a strong, positive correlation between two measures that ARE supposed to be correlated with each other. If this were a scientific study, you might correlate two questionnaires that are supposed to be related to each other (say, positive affect and life satisfaction). In the context of teaching, you might correlate two assessment tools that are supposedly measuring the same thing (e.g., quizzes and exams). In this case, a large correlation provides evidence for convergent validity. Practically speaking, correlations larger than r = .30 provide acceptable evidence, and correlations greater than r = .50 provide excellent evidence.

Divergent validity means there is a small, or non-existent correlation between two measures that are NOT supposed to be correlated with each other. In a teaching context, you might expect little correlation between exam and oral presentation grades, since they measure different things (e.g., critical thinking versus communication skills). Practically speaking, you would hope for correlations smaller than r = .30 to support divergent validity, with a non-significant correlation being the strongest support.

Below is a sample correlation matrix from a 3000-level course I’ve taught in the past (Research Methods in Clinical Psychology). In this class, students complete two essays and two exams.

Statistics for Teachers table1

N = 39, all correlations significant at p < .05

Assuming that these assessment tools are valid, I’d expect three things:

a) Grades on the two essays will be highly correlated with each other

b) Grades on the two exams will be highly correlated with each other

c) The inter-correlations between exams and essays will be large, but not as large as the correlations between assessments measured in the same way. This is because exams and essays tap overlapping – but still probably discrete – skillsets.

A brief review of the correlation matrix above supports all three contentions, and gives me a bit more confidence in the validity of my assessment tools. If these correlations were a lot lower (< .30) I’d need to investigate if it’s simply a different skillset being measured, or if my measurement was poor.

There are many more ways that teachers can incorporate statistics into their teaching practice, without needing to be a statistics expert, but this is an easy one that anybody can implement.