
My Open Access Psychology Datasets

Over the years, I have been curating a number of open-access datasets in the field of psychology. In the interest of sharing these more easily with other people, and to keep myself organized, I am collating them all on this page with links and short descriptions, arranged approximately in chronological order. The plan is to continually update this page as I produce more data.

Hopefully these will be of use to educators looking for datasets when teaching applied classes such as statistics or research methods. They might also be useful for students or other researchers looking for real datasets for independent research projects involving secondary analysis.

If you re-use any of these datasets and find them useful, I’d love to hear from you. Feel free to comment below, or send me an email!

Do Men and Women Exhibit Different Preferences for Mates? A Replication of Eastwick and Finkel (2008)

Description: Data include 307 undergraduate participants recruited from the University of Maryland in the Fall 2012 and Spring 2013 semesters. The procedure used a speed-dating methodology in which participants filled out questionnaires about their romantic interests and their assessments of the other participants in the speed-dating session. There were also surveys over the 30 days following the speed-dating session to follow up with participants who did end up dating each other. The design is complicated, so I recommend reading Selterman et al. (2015) if you want more details. The data are comparatively messy, with a lot of attrition and potentially frivolous responses (e.g., some people participated for fun but were not necessarily serious about dating anyone), but they still have lots of potential for assessing questions related to romantic partner selection in speed-dating contexts. They could also be useful in teaching multilevel modelling and complex data structures.

Measures: Predictor variables include gender and assessments of how important physical attractiveness, earning prospects, and a good personality are for selecting a date. Outcomes include a wide array of relationship variables, such as “yessing” (i.e., being willing to go on a date), sexual desire, romantic interest, etc.

Link to Data: https://osf.io/ng6cc/ (look for the file “speed.dating.2.zip”)

Open Science Collaboration (2015). Estimating the reproducibility of psychological science. Science, 349. https://doi.org/10.1126/science.aac4716

Selterman, D. F., Chagnon, E., & Mackinnon, S. P. (2015). Do men and women exhibit different preferences for mates? A replication of Eastwick & Finkel (2008). SAGE Open, 5, 1-14. https://doi.org/10.1177/2158244015605160

The effect of question structure on self-reported drinking: Ascending versus descending order effects

Description: Includes a single set of self-report survey data from a sample of 791 Canadian undergraduates from Dalhousie University. Includes an experimental manipulation where participants were randomly assigned to ascending (lowest-to-highest) or descending (highest-to-lowest) response-order conditions. These data can be used to teach analysis of simple two-condition experimental designs. Moreover, since the experimental manipulation had only a very small impact on participant responses, these data could also be used to examine the correlational relationships between the variables.

Measures: Aside from the experimental manipulation, questionnaires assessed demographics (age, sex, ethnicity), a short-form alcohol use disorders screener (AUDIT-C), measures of drinking motivations (coping, enhancement, social, and conformity), and measures of personality (impulsivity, sensation seeking, anxiety sensitivity, hopelessness). Except for the AUDIT-C, all other measures used a 4pt Likert response format.

Link to Data: https://osf.io/9nq3c/

Mackinnon, S. P. & Firth, S. M. (2018). The effect of question structure on self-reported drinking: Ascending versus descending order effects. Journal of Research in Personality, 73, 21-26. https://doi.org/10.1016/j.jrp.2017.10.004

Perfectionism, Negative Motives for Drinking, and Alcohol-Related Problems: A 21-day Diary Study

Description: Two datasets include self-report data from (a) a 21-day daily diary study (N = 263) and (b) a cross-sectional psychometric study (N = 139) of emerging adult drinkers. Data were collected from two Canadian cities, and represent unique, non-overlapping participants in both datasets. These data can be used to examine research questions related to personality, emotions, and alcohol consumption, including changes from day-to-day in many variables. For more details on the dataset, see the Mackinnon et al. (2021) data paper below.

Measures: Questionnaires assessed perfectionism, reinforcement sensitivity, big five personality traits, alcohol consumption/problems, binge eating, social support, positive and negative affect, social anxiety, and drinking motives.

Link to Data: https://osf.io/gduy4/

Merwin, K. E., Mackinnon, S. P., O’Connor, R. M., & Flett, G. L. (2022). Socially prescribed perfectionism predicts next-day binge eating behaviors over 20-days. Journal of Counseling Psychology, 69, 554-564. https://doi.org/10.1037/cou0000600

Mackinnon, S. P., Curtis, R. & O’Connor, R. (2022). A tutorial in longitudinal measurement invariance and cross-lagged panel models using lavaan. Meta-Psychology, 6, 1-20. https://doi.org/10.15626/MP.2020.2595

Mackinnon, S. P., Ray, C. M., Firth, S. M., & O’Connor, R. M. (2021). Data from “Perfectionism, Negative Motives for Drinking, and Alcohol-Related Problems: A 21-day Diary Study”. Journal of Open Psychology Data, 9(1), 1-6. https://doi.org/10.5334/jopd.44

Kehayes, I.-L. L., & Mackinnon, S. P. (2019). Investigating the Relationship Between Perfectionistic Self-Presentation and Social Anxiety Using Daily Diary Methods: A Replication. Collabra: Psychology, 5(1), 33. http://doi.org/10.1525/collabra.257

Mackinnon, S. P., Ray, C. M., Firth, S. M., & O’Connor, R. M. (2019). Perfectionism and negative motives for drinking: A 21-day diary study. Journal of Research in Personality, 78, 177-178. https://doi.org/10.1016/j.jrp.2018.12.003

Effects of personal force and physical contact on moral decision-making: A replication

Description: Trent was one of my undergraduate honours students back in 2019-2020, and we contributed data to the Psychological Science Accelerator project on moral dilemmas (published in 2022) as part of his honours thesis. However, it quickly became apparent that we would not get access to the raw data in time for Trent to graduate, so we concurrently collected data for our own little replication study on the same topic. Trent eventually moved away to Lakehead University, and this languished in the file drawer until 2024, when I decided to just archive it on PsyArXiv without seeking formal publication.

Measures/Design: This study includes data from 268 undergraduate students in a 3×2 between-within design. Participants read moral dilemma vignettes and rated the moral acceptability of the described action using both a yes/no question and a 1-9 scale. Participants also provided qualitative responses explaining their reasoning and rated the realism and clarity of the vignettes. There were three between-subjects conditions: (a) physical contact and personal force; (b) personal force only; and (c) no physical contact or personal force. Participants read two vignettes, each set in a different context (footbridge vs. speedboat). More details on the experimental manipulation can be found on the OSF page and in the paper below. There are also a few other measured individual difference variables, including questions on income, religion, individualism-collectivism, and the Oxford Utilitarianism Scale.

Link to Data: https://osf.io/47msh/

Lynds, T. M., & Mackinnon, S. P. (2024, August 11). Effects of personal force and physical contact on moral decision-making: A replication. Retrieved from osf.io/preprints/psyarxiv/dca65

Response-Order Effects for Self-report Questionnaires: Exploring the role of Overclaiming Accuracy and Bias

Description: Self-report survey data with a simple 2-group between-subjects experimental design from a sample of 744 Canadian undergraduates from Dalhousie University. The experimental manipulation had participants randomly assigned to ascending (lowest-to-highest) or descending (highest-to-lowest) response-order conditions. The experimental manipulation did not have any substantial impact on survey responses, so these data could be used to examine correlational relationships. The accuracy & bias measures might also be useful when teaching signal detection theory (e.g., false alarms vs. hits).

Measures/Design: Questionnaires that were presented in one of the two (ascending vs. descending) orders included stress, depression, anxiety, satisfaction with life, positive and negative affect, and multidimensional perfectionism. All of these questionnaires were measured using a 5-point scale from 1 (strongly disagree) to 5 (strongly agree). In addition, we assessed demographics (age, sex, ethnicity) and administered the Overclaiming Questionnaire, which asks participants to rate their familiarity with real (e.g., asteroid) and non-real (e.g., plates of parallax) items. Accuracy and bias scores were derived from this questionnaire using signal detection theory.

Link to Data: https://osf.io/aec25/

Mackinnon, S. P. & Wang, M. (2020). Response-order effects for self-report questionnaires: Exploring the role of overclaiming accuracy and bias. Journal of Articles in Support of the Null Hypothesis, 16, 113-125. https://www.jasnh.com/pdf/Vol16-No2-article4.pdf

Remedial Math Refresher for Psychology Students in a (Social) Statistics Course

Description: Not all research studies work out, and this one was never published. Heather Hobson was an interdisciplinary PhD student who was not able to finish her degree due to family health issues. The core idea was that some students had difficulty in introductory statistics courses because of deficits in basic mathematics skills. A set of tutorial videos was created on topics such as order of operations, absolute values, exponents, radicals, comparative values of decimals, and the addition, subtraction, multiplication and division of decimals. We thought that these tutorial videos would help improve performance and reduce anxiety in statistics classrooms. It was a two time point pre-post design within a single statistics classroom, where all participants could voluntarily access the videos and questionnaires. Due to a variety of factors, participation rates were low; out of a possible 249 students, only 19 students completed the video tutorials so we were unable to assess whether or not they were effective. A short summary of results was posted on the OSF page and data were presented at one conference.

Measures: A somewhat larger number of students completed the statistics anxiety questionnaire (n = 91) or math ability quiz (n = 84) during at least one of the measurement occasions, so the data does have some capacity for secondary use. Measured variables include age, gender, university major (psychology vs. neuroscience), where they graduated high school, self-reported GPA, the Statistics Anxiety Rating Scale, grade on a 30-item math quiz, and the final grade in the statistics class (PSYO 2501, Statistical Methods I) that all students were taking.

Link to Data: https://osf.io/67v58/

Mackinnon, S. P., McCaughey, N. J., Hobson, H., & Hill, T. G. (2022). The performance gap in statistics education for Arts vs. Science students. Talk at the 83rd Annual National Convention of the Canadian Psychological Association (CPA), Calgary, Alberta.

The Association of Self-Efficacy, Anxiety Sensitivity, and Self-Critical Perfectionism with Statistics and Math Anxiety

Description: Data come from a cross-sectional survey of 447 graduate and undergraduate students. Approximately 30% of the sample were graduate students recruited from Prolific, with the remainder being undergraduates from the participant pool at Dalhousie. The aim was to validate a short-form measure of statistics anxiety and to serve as a pilot study for some of the studies that follow. It could be useful for teaching correlational studies, factor analysis, or one-way ANOVA (e.g., comparing arts, science, and other faculties).

Measures: A variety of questionnaire measures were assessed, including self-efficacy, perfectionism cognitions, perfectionistic self-presentation, trait perfectionism, anxiety sensitivity, statistics anxiety, and math anxiety. Demographic variables included age, ethnicity, university major, university program (graduate vs. undergraduate), university faculty (arts, science, other), and whether or not they were currently or previously taking a statistics course.

Link to Data: https://osf.io/nzhq6/

McCaughey, N., Hill, T. G., & Mackinnon, S. P. (2022). The association of self-efficacy, anxiety sensitivity, and perfectionism with statistics and math anxiety. Personality Science, 3, 1-13. https://doi.org/10.5964/ps.7091

Perfectionism and Statistics Anxiety: Experimental Data

Description: An experiment with a design similar to a randomized controlled trial (i.e., between-subjects randomization, pre-post measurement of outcomes). Participants were randomized to one of two conditions: (a) an easy statistics quiz or (b) a difficult statistics quiz, with feedback on performance. Study 1 aided in the creation of the experimental manipulation by testing the difficulty of multiple-choice statistics questions in two samples of 100 students. Study 2 used the experimental manipulation to test a vulnerability-stress hypothesis: Is the negative emotional impact of failing a difficult statistics quiz moderated by key personality traits (i.e., perfectionism, anxiety sensitivity)? It would be useful as an example when teaching moderation.

Study 1 Measures: Study 1 primarily measured responses to 30 multiple choice statistics knowledge questions in two samples. There were also demographic measures including sex/gender, language, country/nationality, employment status, student status, university major, whether they took a prior statistics class and how many, and a subjective self-report of how high they think their grades are relative to other students.

Study 2 Measures: Aside from the between-subjects experimental manipulation (easy vs. hard), at pre-test only, we measured demographics, statistics background, anxiety sensitivity, perfectionism, fear of failure, academic self-efficacy, and grade meta-cognitions. At pre- and post-test (i.e., before and after the experimental manipulation) we measured perfectionistic cognitions, perfectionistic self-presentation, statistics anxiety, state self-esteem and state affect.

Link to Data: https://osf.io/dxspb/

Workye, R., Shephard, A., Alexander, S., Cribbie, R. A., Flett, G. L., & Mackinnon, S. P. (in press). Perfectionism, Anxiety Sensitivity, and Negative Reactions Following a Failed Statistics Test: A Vulnerability-Stress Model. Scholarship of Teaching and Learning in Psychology. https://doi.org/10.1037/stl0000363

Risk and Resilience Factors in Juvenile Idiopathic Arthritis

Description: Online survey data were collected from youth with Juvenile Idiopathic Arthritis aged 13-18 years and a caregiver (N = 104 dyads had both members complete all measures). Missing data are complex, and up to 156 unique dyads participated, depending on how partial data are counted. This project was led by my co-supervised graduate student Yvonne Brandelli, and the goal was to examine risk and resilience factors for youth suffering from Juvenile Idiopathic Arthritis. There is lots of opportunity to look at correlational relationships, but also dyadic questions (e.g., comparing proxy reports to self-reports).

Measures: The youth and caregivers completed a very wide array of measures, including basic demographics, type of diagnosis, pain, functional disability, internalizing symptoms, pediatric quality of life, perfectionistic personality, pain catastrophizing, fear of pain, chronic pain acceptance, coping strategies for pain, self-efficacy, optimism, resilience, pain-related beliefs, stigma, and the family environment scale. This includes a mixture of self-report measures and parent proxy measures.

Link to Data: https://osf.io/svn8d/

Brandelli, Y. N., Mackinnon, S. P., Chambers, C. T., Parker, J. A., Huber, A. M., Stinson, J. N., Johnson, S. A., & Wilson, J. P. (2024). Exploring pain adaption in youth with juvenile idiopathic arthritis: Identifying youth and parent resilience resources and mechanisms. Arthritis Care and Research. https://doi.org/10.1002/acr.25439

Brandelli, Y. N., Mackinnon, S. P., Chambers, C. T., Parker, J. A., Huber, A. M., Stinson, J. N., Johnson, S. A., & Wilson, J. P. (accepted). Understanding perfectionism in youth with juvenile idiopathic arthritis and their caregivers. Journal of Pediatric Psychology.

Two-wave Longitudinal Study of Perfectionism, Statistics Anxiety, and Well-Being

Description: This study is a two-wave longitudinal study of university students currently taking a statistics class, with N = 346 participants at Time 1 and N = 244 participants at Time 2 (four months later). The primary research questions were: (a) Do poor grades in a statistics class lead to increased statistics anxiety over time, and (b) does perfectionism moderate this relationship? It was meant as a companion study to my earlier experimental work (Workye et al., 2023) above. My graduate student at the time (Taylor Hill) also added a few additional measures on well-being and flow as part of her dissertation work. Overall, any secondary use will need rigorous handling of missing data.

Measures: A number of self-reported online questionnaires were measured, including: Demographics, experience in statistics courses, anxiety sensitivity, measures of perfectionism, grade meta-cognition, academic self-efficacy, fear of failure, statistics anxiety, generalized anxiety, achievement striving, perceived competence, flow, well-being and self-reported statistics grades.

Link: https://osf.io/qg59u/

Hill, T. G., Loock, J. V., & Mackinnon, S. P. (2024). Focused, Flourishing, but Not in Flow: Achievement Strivers’ Experiences of Competence, Flow, and Well-Being During Personally Expressive Activities. International Journal of Applied Positive Psychology, 9(3), 1655-1684.

Mackinnon, S. P., Alexander, S. M., Chen, R., Cribbie, R. A., Flett, G. L., & Hill, T. H. (under review). Perfectionism, anxiety sensitivity, and statistics anxiety: A test of the vulnerability-stress model using a 2-wave longitudinal study. Submitted to Journal of Research in Personality.

Bridging the Gaps: Comparing Structural Equation Models to Network Analysis Models of Depression, Anxiety, and Perfectionism

Description: A cross-sectional online self-report survey with a reasonably large sample size (N = 737) after redacting participants who did not consent to share data open-access (n = 22). This was a preregistered study broadly interested in using network analysis to examine the relationships among perfectionism, depression, and anxiety. This could be a useful example for teaching factor analysis or other multivariable correlational analyses.

Measures: Measures include demographics, the Depression Anxiety Stress Scale-21, the Revised Frost Multidimensional Perfectionism Scale (short form), the Anxiety Sensitivity Index-3, a 7-item measure of Generalized Anxiety, the Patient Health Questionnaire-9, and the Big Three Perfectionism Scale – Short Form.

Link: https://osf.io/sfwxn/

Kim, A. & Mackinnon, S. P. (2024). Bridging the gaps: Comparing structural equation models to network analysis models of depression, anxiety, and perfectionism. International Journal of Personality Psychology, 10, 77-88. https://doi.org/10.21827/ijpp.10.41687




Reflexivity in Quantitative Research

The Problem of Forking Paths

There is a lot of time to think while at home on sabbatical, and since my role so often involves teaching or conducting statistical analyses, I think about statistics pretty often. The problem that has been vexing me lately is related to what Gelman and Loken call the “Garden of Forking Paths.” Simply put, most research hypotheses in the social sciences have a one-to-many relationship with their associated statistical hypotheses. In fact, there are so many different statistical hypotheses that could represent evidence (or absence of evidence) for a given research hypothesis that it is positively breathtaking. For instance, let’s start with a null hypothesis test for a two-group between-subjects comparison of a numerical outcome — arguably one of the most basic kinds of comparisons taught in virtually every introductory statistics course. Just off the top of my head, you could use a Student’s t-test, an unequal-variances (Welch) t-test, a Wilcoxon rank-sum test, a permutation test, a Bayesian t-test, structured means modelling, or a MIMIC model. On top of that, there are decisions to be made about outliers, random responders, covariates, and a dizzying array of possible ways to measure almost any latent construct (e.g., Fried, 2017 summarizes 7 different instruments to measure depression, encompassing 52 different symptoms!). There are literally thousands of plausible ways to formalize any given hypothesis in statistical terms — really, a nearly infinite hypothesis space if you break things down into smaller variations.
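
To make the size of that space concrete, here is a minimal sketch in base R (using simulated data, not any real study) that runs just a few of the tests listed above on the very same two-group comparison:

set.seed(1)

# Simulated two-group data: numerical outcome, between-subjects grouping
group <- rep(c("a", "b"), each = 50)
y     <- c(rnorm(50, mean = 0), rnorm(50, mean = 0.4))

# Student's t-test (equal variances assumed)
t.test(y ~ group, var.equal = TRUE)

# Welch (unequal-variances) t-test
t.test(y ~ group, var.equal = FALSE)

# Wilcoxon rank-sum test
wilcox.test(y ~ group)

# Simple permutation test on the mean difference
obs_diff  <- diff(tapply(y, group, mean))
perm_diff <- replicate(5000, {
  g <- sample(group)
  diff(tapply(y, g, mean))
})
mean(abs(perm_diff) >= abs(obs_diff))  # two-sided permutation p-value

Each of these is a defensible formalization of the same verbal hypothesis, and they will not always agree.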

This is a problem for a post-positivist approach to science. Though these principles go mostly unstated, Bisel & Adame (2017) succinctly summarize the ontology, epistemology and axiology of this worldview:

Post-positivistic research assumes that social reality is out there and has enough stability and patterning to be known. Social reality is conceived as coherent, whole, and singular. […] Post-positivistic research assumes that social reality is measurable and knowable, albeit difficult to access. […] Post-positivistic research assumes that knowledge about social reality is inherently worthwhile to acquire and should be as value neutral as possible in its characterization of that reality.

Bisel & Adame (2017), p. 1

Most researchers doing quantitative work (myself included) subscribe to some variation of this belief structure, either implicitly or explicitly. But the garden of forking paths is a serious problem for these beliefs. First, there’s a problem of ontology: of the near-infinite possible options a researcher is faced with when formalizing a verbal hypothesis in statistical terms, only a small subset of near-equivalent options can be “true” if reality is singular. Then there’s the problem of value neutrality; many analytic choices are not value neutral, and might be made consciously (or unconsciously) to support the analyst’s preexisting beliefs. For instance, when handling the familywise error rate problem, a researcher may choose a priori to use a procedure that is known to be conservative (e.g., a Bonferroni correction) because they believe that the null hypothesis is actually true. This would bias a study against the alternative hypothesis even if the study were preregistered.

A lot of the scientific reform movement criticizes “p-hacking” where a person essentially exploits these degrees of freedom. That is, some analysts continue to explore the nearly infinite hypothesis space of possibilities for analysis until they find something that supports their argument, then report that as evidence while failing to report all the attempts that did not support their argument. Given publication bias, this very consistently results in an inflation of effect sizes and false positives. It’s been discussed extensively elsewhere, so no need to describe this further other than to reiterate that exploiting this is obviously a terrible way to do science.

Preregistration of hypotheses and a data analysis plan is a useful way to curb these sorts of abuses. With preregistration, the scientist limits the hypothesis space by pre-planning the analysis and design and committing to an approach in advance, without exploration. This does a great job of reducing researcher bias and the exploitation problem noted above, and is worth implementing for that alone. Indeed, transparency is widely considered to be a virtue in statistical analysis. But the problem remains that the sheer number of arbitrary choices that need to be made before preregistering means that there is no way to know whether those choices were the right ones. That is, if reality is indeed singular, and there are tens of thousands of ways to translate a verbal hypothesis into a statistical format, then the specific subset of statistical formulations that a researcher chooses in any given study (preregistered or not) is, in all likelihood, not actually the best summary of reality. Preregistration curbs the biasing of effects due to motivated reasoning, but it does not solve the forking paths problem. This point has been discussed previously by Rubin (2017) and, in a more mathematically sophisticated way, by Devezer et al. (2021).

Another option that can be combined with preregistration or used in isolation is multiverse analysis. Here, the researcher runs hundreds or even thousands of possible analyses within the hypothesis space instead of just a handful, as would be more typical. Then, the researcher graphically presents all of those analyses, sometimes even producing a sort of average across the many different analysis approaches. In this way, the researcher is able to see how much the results depend on idiosyncratic data-analysis choices, and which choices have a big impact. There’s much to recommend this technique, though the obvious trepidation for most analysts is that it is an exhausting, incredible amount of extra work. The other obvious problem for a post-positivist is that we still don’t know which of these analyses to trust most! It’s better insofar as the hypothesis space becomes more visible, but given the sheer size of the hypothesis space (or “multiverse”) for any given question, it’s still quite possible that the researcher has missed the best summary of reality. Moreover, if the results do vary substantially across analysis approaches, it may be difficult to know which approach (if any) is true. You’re only really in a confident place if you try thousands of analyses and they all produce results that are essentially equivalent. However, the assumptions of post-positivism (i.e., a singular reality) would mean that this will rarely ever be the case!
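
As a toy illustration (a minimal sketch with simulated data, not any published multiverse), here is what a tiny multiverse loop might look like in R, crossing just two arbitrary choices for the same two-group comparison:

set.seed(2)
group <- rep(c("a", "b"), each = 100)
y     <- c(rnorm(100, 0), rnorm(100, 0.3))

# Two arbitrary analysis decisions, fully crossed
specs <- expand.grid(outliers = c("keep", "drop"),
                     test     = c("student", "welch", "wilcoxon"),
                     stringsAsFactors = FALSE)

# Run one "universe": optionally drop outliers beyond 3 SD, then run the chosen test
run_spec <- function(outliers, test) {
  keep <- if (outliers == "drop") abs(scale(y)) < 3 else rep(TRUE, length(y))
  yy <- y[keep]
  gg <- group[keep]
  switch(test,
         student  = t.test(yy ~ gg, var.equal = TRUE)$p.value,
         welch    = t.test(yy ~ gg)$p.value,
         wilcoxon = wilcox.test(yy ~ gg)$p.value)
}

specs$p <- mapply(run_spec, specs$outliers, specs$test)
specs  # one row per universe; a real multiverse would have hundreds or thousands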

The inevitable conclusion then is that we can never be entirely certain which statistical approach is best. Statisticians are usually good at identifying and ruling out bad, untenable approaches (e.g., median splits, responder analyses; Andrew Althouse has an excellent list) but identifying the “best” approach will always be controversial and based on a set of values, beliefs, and biases of the researcher. There is no truly bias-free research, even though we can strive to reduce bias in various ways. Note too that I don’t necessarily mean political bias — analysts vary in attitudes towards various issues such as Type I vs. Type II error importance, whether robustness is more important than clear effect sizes, etc.

Reflexivity in Qualitative Research

This general problem has been known to qualitative researchers for quite some time, and their pragmatic solution is increased transparency through thorough analysis and reporting of one’s own biases. The epistemological differences of qualitative researchers have very frequently meant that statisticians and qualitative analysts don’t see eye-to-eye on a lot of topics, because their fundamental view of reality and how knowledge is created can be very different. Briefly, many qualitative researchers reject key components of post-positivism. Instead of a singular reality, they maintain that there are numerous, equally valid ways to understand the world. They also question whether any knowledge can be “value neutral” because of the inherent limitations of our own perspectives.

Thus, a common recommendation in qualitative research is to include a reflexivity statement. A good basic definition of reflexivity is given by Haynes (2012):

In simple terms, reflexivity is an awareness of the researcher’s role in the practice of research and the way this is influenced by the object of the research, enabling the researcher to acknowledge the way in which he or she affects both the research processes and outcomes. […] In other words, researcher reflexivity involves thinking about how our thinking came to be, how pre-existing understanding is constantly revised in the light of new understandings, and how this in turn affects our research. (Haynes, 2012).

I believe that reflexivity statements could be productively applied in quantitative research to improve transparency and potentially reduce bias by providing necessary context for the researcher’s analytic and design choices. There is some precedent for this, as Gelman and Hennig (2017) describe “honest acknowledgement of the researcher’s position, goals, experiences, and subjective point of view” (p. 978) as a virtue alongside more conventional virtues such as transparency and impartiality.

As a general rule, reflexivity statements are virtually never used in quantitative research, even though they are a requirement for publishing qualitative research in many venues. As an illustrative example (and a challenge for myself) I am going to try and write a “reflexivity statement” for one of my prior quantitative papers:

Mackinnon, S. P., Ray, C. M., Firth, S. M., & O’Connor, R. M. (2019). Perfectionism and negative motives for drinking: A 21-day diary study. Journal of Research in Personality, 78, 177-178. https://doi.org/10.1016/j.jrp.2018.12.003

Reflexivity Statement for Mackinnon et al. (2019)

I wrote the grant application over the second half of 2015 and finalized the design and ethics application in August 2016. When I wrote the grant, I was a limited-term employee at Dalhousie with an eye to hopefully getting a tenure track position. While waiting for the grant results, I unexpectedly transitioned into a permanent instructor position at Dalhousie — formally, this position has increased teaching and no research requirement. When I won the grant, I decided to take on the grant project (with my department’s blessing) despite having no institutional requirement to conduct research.

My background in perfectionism was from studying under Simon Sherry, who was in turn influenced heavily by Gord Flett and Paul Hewitt. Succinctly put, this comes with a set of inherited biases that (a) perfectionism is primarily bad for mental health and (b) that perfectionism is, to a great extent, a negative interpersonal phenomenon. This affected my choice to focus on nondisplay of imperfection rather than other facets of perfectionism that some argue are more positive (e.g., high standards).

My background in alcohol research came from a postdoctoral fellowship with Sherry Stewart. Alongside the long history of behaviorist research at Dalhousie University, where I received my Ph.D., I tend to think about psychological phenomena of various sorts in operant conditioning terms. That is, I conceptualize drinking primarily as a function of positive and negative reinforcement. Thus, I focused primarily on negatively reinforcing motives (i.e., drinking to cope and conformity) in this paper and did not consider alternatives to this framework. Indeed, I am so inured to this perspective, I find it difficult to think of other explanatory frameworks that might be equally useful.

I am also in the position of being both the principal investigator and the primary statistical analyst on this paper. As a statistically-minded person, my bias in interpretation is often to think about statistical (rather than conceptual) limitations to research. As an analyst, I have a broad overarching bias towards more parsimonious models. That is, I value models with fewer parameters and add model complexity only when forced. This bias shows in a few places, such as using item parcels instead of items for some measures, deliberately choosing some 3-item scales in the design process, and using fixed slopes so that the model would be as parsimonious as possible. Perhaps most notably, two analyses in the paper were reviewer requests and both involved increased model complexity: (a) adding alcohol quantity to the model and (b) running a supplementary random slopes model. My own resistance to these components in the text is likely due to my bias for model parsimony.

Concluding Thoughts

That is a lot of text, so I could see a statement like that being an online supplement rather than in the body of the text. But I think that the process, when combined with reading the paper, would help readers understand my own biases and choices so they could gauge for themselves whether that influenced the final results. That is, my own biases shaped which forking paths I chose both in the design and analysis of my own study, and knowing those biases might help readers identify my blind spots. Though reflexivity statements are not a cure-all for psychological science, I think that they are an under-utilized tool that could be productively applied to increase transparency in quantitative studies.


Are Inferential Statistics Really What Psychology Needs?

In psychology, we generally publish inferential statistics. That is, we make estimates of population parameters and test null hypotheses. I have been thinking about the intersection of equity and statistics for a while now, ever since reading about the history of statistics.

Tracing back to Galton, Fisher, and Pearson, it quickly becomes apparent that these statistical methods were designed specifically to identify group differences. That is, they begin with the assumption that different populations have different distributions, and statistical methods were devised to set about proving those points without needing to measure data from an entire population. This generalizes to correlations, where we assume the relationship between X and Y is a single “true” value in the population that can be approximated within certain bounds of uncertainty. It was certainly revolutionary and changed the scope of what psychologists studied – we moved from intensive studies of individuals (e.g., Wundt, Ebbinghaus) to using statistical methods to estimate population parameters.

But it has always gnawed at the back of my mind that these might be the wrong questions, driven by the goal of ranking and stratifying the human population. Properly conducted, inferential statistics help us answer certain questions about populations, but my abiding interest in humans has always been about individual people, not population parameters.

I’ve begun to wonder if the field’s almost obsessive focus on population parameters – and weak tests of differences between them with null hypothesis significance testing – is fundamentally misguided in a lot of areas. There are obviously lots of cases where conventional statistics have generated important knowledge. But given that much of my research interest over the years has been in human subjectivity (e.g., well-being, motivation, self-reports of personality, anxiety, depression), I’ve begun to wonder whether conventional statistical methods can really be of much use for truly understanding these sorts of things. In no particular order, some problems:

1. Samples of humans willing to participate in research are virtually never random or representative, given the pragmatic constraints of actual data collection. Thus, we are estimating population parameters for specialized sub-populations that are quite often not defined or known. Outside of census research and very well-funded research programs, obtaining anything close to representativeness is typically unachievable.

2. Galton popularized the idea that most human attributes fall along a “normal” distribution, with most people being average and the rest falling toward the tails. Somewhere along the line, the idea of normally distributed errors became generalized to the idea that TRAITS in humans fall along a normal distribution. That is, there arose a belief that population distributions of most measured human capacities, like personality and intelligence, should be normally distributed. For Galton, this interest was natural: making this argument helped with his aim of ranking humans in mental ability. However, many things we measure in humans are not distributed in this way (well-being, for instance, typically has negative skewness; alcohol consumption has positive skewness). In fact, if distributions of university grades are any indication, these distributions are also not “normally” distributed in most cases.

3. The focus on “population differences,” while not in and of itself racist, guides research in ways that emphasize and reify categories of people. The tools that we have promote categorization of humans into sub-populations and comparison of those groups (e.g., t-tests, ANOVA). Looking back at Galton, Fisher, and Pearson, the bulk of our statistical methods were deliberately and explicitly designed to try to prove that populations differ in their mental abilities, and in many cases were strongly motivated by eugenics. It constrains our inquiry to questions like: Does Group A have a different population mean than Group B?

4. Can there even truly be a population parameter for some latent constructs, which are themselves co-constructed through human interaction with psychological tests? Moreover, much personality data is not truly numerical (it is ordinal at best), so should we really be applying techniques designed for numerical data here?

More questions than answers, I suppose. What I want to understand are within-person processes. That is, what is going on in the mind of a single person and can we actually predict their thoughts, feelings, and behavior? The intensive study of individual people has never been overly popular, but I wonder if it is something uniquely psychological that might need methods other than inferential stats.

I also think that people could do a lot richer description of humans using data visualization, descriptive statistics, and qualitative approaches. I’ve always found a detailed analysis of a human life in biography or the idea of a “quantified self” to be fascinating. There’s so much to learn about individual human variation; I think that psychology could really learn a lot by taking a step back and really thoroughly observing and describing the humans that we study.

I might explore some of these ideas more later. For now, I’m mostly just trying to give myself some time to really think and discuss ideas with other people during sabbatical, a precious commodity that gets lost in the regular bustle of day-to-day academics!


Increasing statistical power in psychological research without increasing sample size

Note: I wrote this ages ago back in 2013, and it lives on the now defunct open science collaboration blog. Storing it here on my own website for archiving purposes. Original link: http://osc.centerforopenscience.org/2013/11/03/Increasing-statistical-power/

What is statistical power?

As scientists, we strive to avoid errors. Type I errors are false positives: finding a relationship when one probably doesn’t exist. Lots has been said about these kinds of errors, and the field of psychology is sometimes accused of excessive Type I error rates through publication bias, p-hacking, and failing to account for multiple comparisons (Open Science Collaboration, in press). Type II errors are false negatives: failing to find a relationship when one probably does exist. Type II errors are related to statistical power. Statistical power is the probability that the test will reject the null hypothesis when the null hypothesis is false. Many authors suggest a statistical power rate of at least .80, which corresponds to an 80% probability of not committing a Type II error. Accuracy in parameter estimation (AIPE) is closely related to statistical power, and it refers to the width of the confidence interval for the effect size (Maxwell et al., 2008). The smaller this width, the more precise your results are. This means that low statistical power is a problem on both fronts: underpowered studies miss real effects, and their wide confidence intervals make any significant results they do find imprecise and less trustworthy. Simply put, underpowered studies are imprecise and are unlikely to replicate (Button et al., 2013).

Studies in psychology are grossly underpowered

Psychological research has been grossly underpowered for a long time. Fifty years ago, Cohen (1962) estimated the statistical power to detect a medium effect size in abnormal psychology research was about .48. The situation has improved slightly, but it’s still a serious problem today. For instance, one review suggested that only 52% of articles in the applied psychology literature achieved .80 power for a medium effect size (Mone et al., 1996). This is in part because psychologists are studying small effects. One massive review of 322 meta-analyses including over 8 million participants suggested that the average effect size in social psychology is relatively small (r = .21). To put this into perspective, you’d need about 175 participants to have .80 power for a simple correlation between two variables at this effect size. This gets even worse when we’re studying interaction effects. One review suggests that the average effect size for interaction effects is even smaller (f2 = .009), which means that sample sizes of around 875 people would be needed to achieve .80 power.
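
If you want to check those sample-size figures yourself, a quick sketch using the pwr package in R (assuming you have it installed) reproduces them approximately:

library(pwr)

# Power of .80 for a simple correlation of r = .21: n is roughly 175
pwr.r.test(r = 0.21, power = 0.80, sig.level = 0.05)

# Power of .80 for an interaction effect of f2 = .009 with one tested predictor;
# v is the error degrees of freedom, so total N is roughly v + u + 1 (about 875)
pwr.f2.test(u = 1, f2 = 0.009, power = 0.80, sig.level = 0.05)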

What can we do to increase power?

Traditional recommendations for increasing statistical power suggest either (a) increasing the sample size, (b) maximizing the effect size, or (c) using a more liberal p-value criterion. However, increasing the effect size has no impact on the width of the confidence interval (Maxwell et al., 2008), and using a more liberal p-value comes at the expense of increased Type I error. Thus, most people assume that increasing the sample size is the only consistent way to increase statistical power. This isn’t always feasible due to funding limitations, or because researchers are studying rare populations (e.g., people with autism spectrum disorder). Fortunately, there are other solutions. Below, I list three ways you can increase statistical power without increasing sample size. You might also check out Hansen and Collins (1994) for a lengthier discussion.

Recommendation 1: Decrease the mean square error

Decreasing the mean square error will have the same impact as increasing sample size (if you want to see the math, check out McClelland, 2000). Okay. You’ve probably heard the term “mean square error” before, but the definition might be kind of fuzzy. Basically, your model makes a prediction for what the outcome variable (Y) should be, given certain values of the predictor (X). Naturally, it’s not a perfect prediction, because you have measurement error and because there are other important variables you probably didn’t measure. The mean square error is the average squared difference between what your model predicts and the values actually observed in the data. So, anything that improves the quality of your measurement or accounts for potential confounding variables will reduce the mean square error and thus improve statistical power. Let’s make this concrete. Here are three specific techniques you can use:

  1. Reduce measurement error by using more reliable measures (i.e., better internal consistency, test-retest reliability, inter-rater reliability, etc.). You’ve probably read that .70 is the “rule-of-thumb” for acceptable reliability. Okay, sure. That’s publishable. But consider this: let’s say you want to test a correlation coefficient. Assuming both measures have a reliability of .70, your observed correlation will be about 1.43 times smaller than the true population parameter (I got this using Spearman’s correlation attenuation formula; a short worked sketch follows this list). Because you have a smaller observed effect size, you end up with less statistical power. Why do this to yourself? Reduce measurement error.
  2. Control for confounding variables. With correlational research, this means including control variables that predict the outcome variable but are relatively uncorrelated with other predictor variables. In experimental designs, this means taking great care to control for as many possible confounds as possible. In both cases, this reduces the mean square error and improves the overall predictive power of the model – and thus improves statistical power.
  3. Use repeated-measures designs. Repeated-measures designs reduce the mean square error by partitioning out the variance due to subjects. Depending on the kind of analysis you do, they can also increase the degrees of freedom for the analysis substantially (e.g., in some multi-level models). I’m a big fan of repeated-measures designs, because they allow researchers to collect a lot of data from fewer participants.
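
Here is the attenuation calculation behind that 1.43 figure, as a minimal sketch in R with a hypothetical true correlation of .30:

# Spearman's attenuation formula: observed r = true r * sqrt(rel_x * rel_y)
true_r <- 0.30   # hypothetical true population correlation
rel_x  <- 0.70   # reliability of measure X
rel_y  <- 0.70   # reliability of measure Y

observed_r <- true_r * sqrt(rel_x * rel_y)
observed_r            # 0.21: attenuated relative to the true value
true_r / observed_r   # roughly 1.43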

Recommendation 2: Increase the variance of your predictor variable

Another less-known way to increase statistical power is to increase the variance of your predictor variables (X).

  1. In correlational research, use more comprehensive continuous measures. That is, there should be a large possible range of values endorsed by participants. However, the measure should also capture many different aspects of the construct of interest; artificially increasing the range of X by adding redundant items (i.e., simply re-phrasing existing items to ask the same question) will actually hurt the validity of the analysis. Also, avoid dichotomizing your measures (e.g., median splits), because this reduces the variance and typically reduces power.
  2. In experimental research, unequally allocating participants to each condition can improve statistical power. For example, say you were designing an experiment with 3 conditions (let’s say low, medium, or high self-esteem). Most of us would equally assign participants to all three groups, right? Well, as it turns out, that reduces statistical power. The optimal design for a linear relationship would be 50% low, 50% high, and omit the medium condition. The optimal design for a quadratic relationship would be 25% low, 50% medium, and 25% high. The proportions vary widely depending on the design and the kind of relationship you expect, but I recommend you check out McClelland (1997) to get more information on efficient experimental designs (a small simulation illustrating the linear case follows this list). You might be surprised.
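
As a rough illustration of the linear case (a simulation sketch with a made-up effect size, not a general proof), you can compare power for equal allocation across three levels against splitting the same total N across only the low and high conditions:

set.seed(3)
n_sims <- 2000

# Power to detect a linear effect of condition (coded -1, 0, 1) on Y
sim_power <- function(ns, levels) {
  sig <- replicate(n_sims, {
    x <- rep(levels, times = ns)
    y <- 0.3 * x + rnorm(sum(ns))   # assumed linear effect of 0.3 per unit
    summary(lm(y ~ x))$coefficients["x", "Pr(>|t|)"] < 0.05
  })
  mean(sig)
}

sim_power(c(30, 30, 30), c(-1, 0, 1))  # equal allocation across 3 conditions
sim_power(c(45, 45),     c(-1, 1))     # 50/50 split, medium condition omitted

The second design should show noticeably higher power with the same 90 participants.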

Recommendation 3: Make sure predictor variables are uncorrelated with each other

A final way to increase statistical power is to increase the proportion of the variance in X not shared with other variables in the model. When predictor variables are correlated with each other, this is known as collinearity. For example, depression and anxiety are positively correlated with each other; including both as simultaneous predictors (say, in multiple regression) means that statistical power will be reduced, especially if one of the two variables actually doesn’t predict the outcome variable. Lots of textbooks suggest that we should only be worried about this when collinearity is extremely high (e.g., correlations above .70). However, studies have shown that even modest intercorrelations among predictor variables will reduce statistical power (Mason & Perreault, 1991). Bottom line: if you can design a model where your predictor variables are relatively uncorrelated with each other, you can improve statistical power.
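
The mechanism is easy to see through the variance inflation factor: the sampling variance of a slope gets multiplied by 1 / (1 - R²), where R² is the overlap between that predictor and the other predictors. A quick sketch with an assumed predictor intercorrelation of .50:

# Variance inflation for a predictor correlated r = .50 with another predictor
r_between_predictors <- 0.50
vif <- 1 / (1 - r_between_predictors^2)
vif         # 1.33: the slope's sampling variance is about 33% larger
sqrt(vif)   # about 1.15: the standard error is about 15% larger, so power drops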

Conclusion

Increasing statistical power is one of the rare times where what is good for science and what is good for your career actually coincide. It increases the accuracy and replicability of results, so it’s good for science. It also increases your likelihood of finding a statistically significant result (assuming the effect actually exists), making it more likely you’ll get something published. You don’t need to torture your data with obsessive re-analysis until you get p < .05. Instead, put more thought into research design in order to maximize statistical power. Everyone wins, and you can use the time you used to spend sweating over p-values to do something more productive. Like volunteering with the Open Science Collaboration.

References

Button, K. S., Ioannidis, J. A., Mokrysz, C., Nosek, B. A., Flint, J., Robinson, E. J., & Munafò, M. R. (2013). Power failure: Why small sample size undermines the reliability of neuroscience. Nature Reviews Neuroscience, 14(5), 365-376. doi: 10.1038/nrn3475

Cohen, J. (1962). The statistical power of abnormal-social psychological research: A review. The Journal of Abnormal and Social Psychology, 65, 145-153. doi:10.1037/h0045186

Hansen, W. B., & Collins, L. M. (1994). Seven ways to increase power without increasing N. In L. M. Collins & L. A. Seitz (Eds.), Advances in data analysis for prevention intervention research (NIDA Research Monograph 142, NIH Publication No. 94-3599, pp. 184–195). Rockville, MD: National Institutes of Health.

Mason, C. H., & Perreault, W. D. (1991). Collinearity, power, and interpretation of multiple regression analysis. Journal of Marketing Research, 28, 268-280. doi:10.2307/3172863

Maxwell, S. E., Kelley, K., & Rausch, J. R. (2008). Sample size planning for statistical power and accuracy in parameter estimation. Annual Review of Psychology, 59, 537-563. doi:10.1146/annurev.psych.59.103006.093735

McClelland, G. H. (1997). Optimal design in psychological research. Psychological Methods, 2, 3-19. doi:10.1037/1082-989X.2.1.3

McClelland, G. H. (2000). Increasing statistical power without increasing sample size. American Psychologist, 55, 963-964. doi:10.1037/0003-066X.55.8.963

Mone, M. A., Mueller, G. C., & Mauland, W. (1996). The perceptions and usage of statistical power in applied psychology and management research. Personnel Psychology, 49, 103-120. doi:10.1111/j.1744-6570.1996.tb01793.x

Open Science Collaboration. (in press). The Reproducibility Project: A model of large-scale collaboration for empirical research on reproducibility. In V. Stodden, F. Leisch, & R. Peng (Eds.), Implementing Reproducible Computational Research (A Volume in The R Series).  New York, NY: Taylor & Francis. doi:10.2139/ssrn.2195999


Which Starburst Flavour is Tastiest?

Introduction

Ever wonder which starburst flavor is the best? Now you can find out, with the help of SCIENCE. Naturally, pink is the best (personal communication, me). But don’t take my word for it: polls from Buzzfeed and Ranker.com both support pink as the best, followed by red, yellow, and orange, respectively. However, these polls rely on retrospective memory and could benefit from experiments where participants provide numeric ratings of each candy immediately after consuming them. Thus, the present experiment has the following research question: What is the rank order of tastiness for the original 4 starburst candy flavors?

Method

Students from my third year research methods in social psychology course (N = 72) completed this experiment between 2016-2018 as part of a class exercise. Each student was given 4 starburst candies, one of each flavor (pink/strawberry, red/cherry, orange, and yellow/lemon). They were instructed to eat each candy sequentially in the same order: pink, red, yellow, and orange. Students rated each candy on a 7-point scale from 1 (Extremely untasty) to 7 (Extremely tasty). Thus, I used a within-subjects design with 4 conditions and no counterbalancing.

Results

Data were analyzed using a Kruskal-Wallis test, setting alpha at .05. Pairwise tests were conducted using Wilcoxon rank-sum tests, correcting alpha using a Bonferroni correction for familywise error rates (alpha = .008). Results are displayed in the two figures below. There was a clear ranked preference, with pink emerging as the tastiest, followed by red, orange, and yellow. Pairwise tests found statistically significant differences between these pairs: (a) pink-orange, (b) pink-yellow, and (c) red-yellow. Open data and syntax can be found on my OSF page.
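
For anyone who wants to run this kind of analysis on their own data, a minimal sketch in base R (using a hypothetical long-format data frame of ratings, not the actual class data) might look like this:

# Hypothetical long-format data: one tastiness rating per flavour per student
ratings <- data.frame(
  flavour   = rep(c("pink", "red", "yellow", "orange"), each = 72),
  tastiness = sample(1:7, 288, replace = TRUE)   # placeholder ratings
)

# Omnibus Kruskal-Wallis test across the four flavours
kruskal.test(tastiness ~ flavour, data = ratings)

# Pairwise Wilcoxon rank-sum tests with a Bonferroni correction
pairwise.wilcox.test(ratings$tastiness, ratings$flavour,
                     p.adjust.method = "bonferroni")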

Figure 1. Faceted bar plots comparing starburst candy flavors
Figure 2. Box plots with pairwise comparisons using Wilcoxon rank-sum tests. Numbers indicate p-values.

Discussion

Overall, the results mostly mirror the results of the polls, except that the rank order of orange and yellow are switched. Pink is indeed the superior flavor of original starburst candy flavors. The lack of counterbalancing is a significant limitation, as people may have gotten sick of eating starburst candies by the last one. This seems unlikely, as starburst are delicious. In sum, science supports pink starburst as the tastiest flavor, and yellow as the least tasty flavor.


Video Game Genres by Year: 1980-2016

Video games have been a hobby of mine since childhood, so it’s fun to look back at the history of video games. I found a cool dataset that includes all video games with sales of 100,000 or more from 1980-2016. It’s necessarily incomplete since it was scraped from an existing website; this is especially true for the early days of gaming (e.g., it has only 98 Nintendo Entertainment System (NES) games out of a library of ~716). It also misses a lot of the download-only market for PC and has no mobile games. Nonetheless, it’s still an interesting snapshot of video game history. For my part, I’m interested in tracking the popularity of video game genres from 1980-2016. That is, the type of game (e.g., platformer, shooter, role-playing). Starting out, I’d presume that platformers (e.g., Super Mario) would have been most popular back in the NES and Super Nintendo / Sega Genesis era (the halcyon, nostalgia-ridden days of my youth). But let’s see what turns up!

The first thing to note is that there is really not nearly as much data here prior to ~1995 (shortly after the PlayStation 1 came out). I ran a simple stacked bar chart first to look at this.

library(ggplot2)   # plotting
library(ggthemes)  # provides scale_fill_ptol()

# Stacked bar chart: count of games released per year, coloured by genre
ggplot(newdata, aes(x = Year2, fill = Genre)) +
  geom_bar(position = "stack") +
  scale_fill_ptol() +
  scale_x_continuous(limits = c(1980, 2016), breaks = seq(1980, 2016, 5)) +
  labs(x = "", y = "") +
  ggtitle("Video Games by Year and Genre: 1980-2016")

The problem with this chart is that I can’t make any sense of how popular different genres are because of the large differences in counts. Thus, it makes more sense to look at proportions. I decided to go with a stacked density plot this time, and I’m pleased with the result.

# Stacked density plot: proportion of games in each genre over time
ggplot(newdata, aes(x = Year2, group = Genre, fill = Genre)) +
  geom_density(adjust = 1.5, position = "fill") +
  scale_fill_ptol() +
  scale_x_continuous(limits = c(1980, 2016), breaks = seq(1980, 2016, 5)) +
  scale_y_continuous(limits = c(0, 1), breaks = seq(0, 1, .10)) +
  labs(x = "", y = "") +
  ggtitle("Video Games by Year and Genre: 1980-2016")

To interpret the results, it might help to know the broad eras of video game consoles:

  • (1976–1983): Atari 2600
  • (1983–1987): NES.
  • (1987–1993): SNES/Genesis.
  • (1993–1998): PS1/N64/Saturn.
  • (1998–2005): Dreamcast/PS2/Gamecube/Xbox.
  • (2005–2012): PS3/360/Wii.
  • (2012-Present): PS4/XB1/WiiU.

First, remember that these are only the games that sold more than 100,000 copies, so it’s more an index of popularity than actual games published. With this in mind, a few interesting trends emerge.

  • Action games are most popular in the Atari era (e.g., Frogger), and have picked up again substantially in the present-day era of console gaming.
  • Adventure games have slowly increased in popularity over time.
  • Fighting games hit their peak in the early 90s. Games like Mortal Kombat, Street Fighter, and King of Fighters are popular here.
  • As expected, platformers are huge during the NES and early SNES era. The popularity of Super Mario Bros. spawned a lot of clones.
  • The big spike and then decline of puzzle games is probably an artifact of the data. Puzzle games are probably even more popular now with mobile phones, but those games aren’t included in the dataset. I expect these games were much easier to program, back in the day, which accounts for the early spike (e.g., Tetris).
  • Racing games get popular in the 1995-2005 era. The popularity of game series like Mario Kart spring to mind here.
  • Roleplaying games don’t emerge as top sellers until the NES era, and stay pretty steady after that. I expected to see a spike of roleplaying games in the PS1 era, considered by many to be the golden age of JRPGs, but that probably doesn’t get picked up in this dataset because many of them were cult classics and did not sell a lot in the US.
  • Strategy games don’t emerge as top sellers until the SNES era, but remain pretty steady after that.

Overall, very cool data and fun to look through. As usual, data and code available on the blog’s OSF site.

Categories
Uncategorized

Cloud-based apps for statistics teaching

At the recent 2019 Canadian Psychological Association convention, I gave a workshop on cloud-based statistics apps that could potentially be useful for teaching. I’m posting the list of applications here in case anyone finds them useful for teaching. I’m also posting my slides from the talk, where I describe my approach to using these apps when teaching undergraduate statistics.


Central Limit Theorem (Means)

https://gallery.shinyapps.io/CLT_mean/

This application runs simulations to demonstrate the central limit theorem from different population distributions by sampling up to 1000 times from a population. It shows the population distribution, the first 10 sample distributions, and the sampling distribution. This app helps enormously with teaching students the central limit theorem. It can sample from normal, uniform, right-skewed, and left-skewed population distributions.
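
If you want to show the same idea without the app, here is a minimal base-R sketch (my own example, not the app’s code): draw many samples from a right-skewed population and look at the distribution of the sample means.

set.seed(123)
# 1000 samples of n = 30 from a right-skewed (exponential) population
sample_means <- replicate(1000, mean(rexp(30, rate = 1)))
hist(sample_means, breaks = 30,
     main = "Sampling distribution of the mean (n = 30)", xlab = "Sample mean")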

Central Limit Theorem (Proportions)

https://gallery.shinyapps.io/CLT_prop/

This app is like the one above, except that it shows the sampling distribution for proportions.

Regression Diagnostics

https://gallery.shinyapps.io/slr_diag/

This app simulates new data for each run and shows students how to review diagnostic plots from linear regression models. It allows students to simulate situations where the normality assumption is likely to be met as well as situations where the assumptions are violated (quadratic relationships, non-constant variance). Helpful for getting students to interpret plots of residuals.
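
For instructors who also use R, here is a rough sketch of the same idea (my own simulated example, not the app’s code): fit a straight line to a curved relationship and inspect the default residual plots.

set.seed(123)
x <- runif(200, -2, 2)
y <- x^2 + rnorm(200, sd = 0.5)   # the true relationship is quadratic
fit <- lm(y ~ x)                  # but we fit a straight line anyway
par(mfrow = c(2, 2))
plot(fit)                         # the residuals-vs-fitted panel shows the tell-tale curve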

Generic Distribution Calculator

https://gallery.shinyapps.io/dist_calc/

Calculates the area under the curve for various distributions, such as z, binomial, t, F, and χ2. Useful when teaching students about standardizing distributions (e.g., converting to z-scores) and for conceptual understanding of p-values.
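
For reference, the same areas can be computed directly in R; the specific values below are just arbitrary examples.

pnorm(1.96, lower.tail = FALSE)          # area above z = 1.96, ~.025
pt(2.06, df = 25, lower.tail = FALSE)    # area above t = 2.06 with 25 df, ~.025
pchisq(3.84, df = 1, lower.tail = FALSE) # area above chi-square = 3.84, ~.05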

Apps by Bruce Dudek

Note that Bruce Dudek does not post his code for these apps online. However, he may be willing to share code with you if you send an email (I use some of these apps in my own classes!).

Distributions (z, t, F & chi-squared)

https://shiny.rit.albany.edu/stat/stdnormal/

Bruce developed a set of distribution apps that improve upon the generic ShinyEd one above in some ways. These apps are primarily useful for looking up p-values, and can replace looking up p-values in conventional critical value tables from old-school statistics textbooks. If the emphasis is entirely on looking up p-values, these apps work better.

Descriptive Statistics and Visualization

https://shiny.rit.albany.edu/stat/describe/

This is a cool app that has a few datasets built in, and allows students to calculate descriptive statistics (e.g., means, SDs, range) as well as create a few basic data visualizations (e.g., histograms, box plots). Can be useful as a way to introduce students to exploring data with graphs.

Confidence Intervals

https://shiny.rit.albany.edu/stat/confidence/

This app simulates confidence intervals (up to 100 at a time). It is useful for helping students understand the conceptual issues surrounding confidence intervals, and how changes to sample size and confidence level (e.g., 90% vs. 95%) affect the width of the intervals.
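
Here is a minimal sketch of what the app is simulating (my own example, not the app’s code): repeatedly compute 95% confidence intervals from samples of a known population and see how often they capture the true mean.

set.seed(123)
covers <- replicate(100, {
  ci <- t.test(rnorm(30, mean = 0, sd = 1))$conf.int
  ci[1] < 0 & ci[2] > 0          # does this interval contain the true mean of 0?
})
mean(covers)                     # proportion of intervals capturing the mean; ~.95 in the long run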

RShiny Apps from Other Sources

ANOVA Playground

https://rna.wlu.edu/shiny-apps/anovaPlayground/

This app has a LOT going on. It lets the user simulate data and run a one-way ANOVA, complete with beautiful data visualizations. It can also read in data to run a one-way ANOVA on a pre-specified dataset. Finally, it allows the user to run individual t-tests as a rudimentary form of post-hoc test. This app is very useful for helping students understand and interpret one-way ANOVA.
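
For comparison, here is a bare-bones version of the same workflow in R, using simulated data (my own example, not the app’s code).

set.seed(123)
dat <- data.frame(group = factor(rep(c("A", "B", "C"), each = 30)),
                  score = c(rnorm(30, 10), rnorm(30, 11), rnorm(30, 12)))
summary(aov(score ~ group, data = dat))                                # one-way ANOVA
pairwise.t.test(dat$score, dat$group, p.adjust.method = "bonferroni")  # crude post-hoc tests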

Other Online Apps that Don’t Use RShiny

The applications below don’t actually use RShiny but are similar in the sense that they can be run in a web browser without having to download software. I’ve found them to be useful in teaching.

Guess the Correlation

http://guessthecorrelation.com/

A game that shows students a scatterplot and asks them to guess the magnitude of the correlation. Very fun.

DrawMyData

http://www.robertgrantstats.co.uk/drawmydata.html

This site lets you add dots to a scatterplot by pointing and clicking, calculates the correlation coefficient, and lets you save the resulting dataset. Insanely useful for teaching.

Rock ‘n Poll

http://rocknpoll.graphics/index.html

Explains the concept of sampling variability using political polling as an example. Very beautiful viz.

Rossman/Chance Java Applet Collection

http://www.rossmanchance.com/applets/

This site has been around a long time and uses Java and JavaScript. It offers many different simulations, some that duplicate those above and some new, notably including several on probability. They are not as “pretty” as some of the other options, but they are still very useful. Note: the older applets that use Java don’t seem to run on all computers; the JavaScript ones (labeled “js”) seem to work fine.

Kristoffer Magnusson’s d3.js Visualizations

https://rpsychologist.com/d3/NHST/

https://rpsychologist.com/d3/CI/

https://rpsychologist.com/d3/cohend/

https://rpsychologist.com/d3/tdist/

https://rpsychologist.com/d3/equivalence/

Kristoffer has made some truly excellent data visualizations worth checking out. The ones above cover null hypothesis significance testing, confidence intervals, effect size, t-distributions, and equivalence testing. I honestly can’t hype these enough; they are all top-notch.

Online cloud-based statistics software

https://www.graphpad.com/quickcalcs/

https://statscloud.app/

http://powerandsamplesize.com/

https://www.socscistatistics.com/tests/

Sometimes you want your students to analyze data much as they would with conventional software installed (e.g., SPSS, SAS). All of these sites do the calculations online and remove the need for students to install software, which, I have discovered, is a very error-prone process for many students. Useful if you want students to do statistical calculations in a class where learning a particular software package is not the goal.

Categories
Uncategorized

Lollipop Plot: Meta-analysis of Gender Differences in Sexual Behaviour

When I went to the APS conference in San Francisco last year, I got to hear Janet Hyde talk about the gender similarities hypothesis. Broadly, she argues that most gender differences (i.e., men vs. women) in psychological variables tend to be small in size. She used meta-analysis — statistically summarizing the results of lots of published research — as a method of testing her hypothesis. I thought it was fascinating stuff and a great talk, so I wanted to incorporate some of her research into intro psychology. Since I’ve been sprucing up the intro psych section on sex and gender, I thought it would be interesting to include Petersen and Hyde’s (2010) meta-analysis on gender differences in sexual attitudes and behaviour. Broadly, it found that there are a few large sex differences (e.g., men watch more pornography and masturbate more than women), but consistent with her general argument, most differences were very small in size (i.e., smaller than d = .20), and men and women are more similar than different. However, the paper itself doesn’t present the results in a very PowerPoint-friendly way, and certainly not in a form intelligible to first-year students:

Portion of Table 1 from Petersen & Hyde (2010)

So, with that in mind I decided to make a plot to better visualize the data. The degree of uncertainty displayed by the 95% CI is important … but most in the class don’t actually have any statistics background. So I decided to teach them what Cohen’s d was first, then create a graph that showed off those numbers. I also taught them that values less than .20 are essentially negligible, so I wanted to highlight that on the plot. I decided on a lollipop plot, which is a bit more appealing than a bar plot.

library(ggplot2)
library(ggthemes)
library(ggExtra)

# Order the variables by effect size so the plot reads largest to smallest

mydata$name <- factor(mydata$name, levels = mydata$name[order(mydata$d)])

# Create the lollipop plot: a point for each d, with a segment back to zero

ggplot(mydata, aes(x = name, y = d, label = d)) +
  geom_point(stat = 'identity', fill = "black", size = 6) +
  geom_segment(aes(y = 0,
                   x = name,
                   yend = d,
                   xend = name),
               color = "black") +
  geom_text(color = "white", size = 2) +
  labs(title = "Gender Differences in Sexual Behaviour",
       subtitle = "Negative numbers = more common in women; positive numbers = more common in men",
       x = "", y = "Cohen's d (effect size)") +
  ylim(-1, 1) +
  theme_bw() +
  geom_hline(yintercept = 0, linetype = "solid", color = "grey") +
  geom_hline(yintercept = -.20, linetype = "dashed", color = "grey") +
  geom_hline(yintercept = .20, linetype = "dashed", color = "grey") +
  coord_flip()
Lollipop plot showing meta-analysis results

Overall, I think the plot turned out pretty clear and was a lot more appealing than just putting up a table of numbers for a more dynamic presentation. The ordering from largest to smallest helps a lot to guide the eye, and the dashed grey lines show the range of “trivial” effect sizes. So there are some substantial gender differences, but the effect sizes are actually a LOT smaller than you’d expect! For context, a Cohen’s d of ~.60 would look like this (using this site to generate it quickly):
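
Something similar can be sketched directly in ggplot2; this is just my own rough approximation of that kind of figure, not the site’s output: two normal distributions whose means differ by d = 0.60.

library(ggplot2)
x <- seq(-4, 4.6, length.out = 500)
overlap <- data.frame(x = rep(x, 2),
                      density = c(dnorm(x, mean = 0), dnorm(x, mean = 0.6)),
                      group = rep(c("Women", "Men"), each = length(x)))
ggplot(overlap, aes(x = x, y = density, fill = group)) +
  geom_area(alpha = .4, position = "identity") +   # overlapping, not stacked
  theme_classic() +
  labs(x = "Standardized score", y = "Density", fill = "")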

So, phrased another way, if you selected a random man and a random woman, there is a ~67% chance that the man will watch more pornography than the woman (the “common language effect size”). That’s not insubstantial, but it’s also probably a lot smaller than many people would think! Many of the effect sizes are vanishingly small: for a Cohen’s d of 0.20, there’s only about a 55% chance that a randomly selected man will engage in the behaviour more than a randomly selected woman. For that matter, men and women are actually more similar than they are different on all of these variables, if you consider the percent overlap between the distributions. Which, broadly, is what Dr. Hyde’s point has been all along.
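
For anyone who wants to verify those numbers, the common language effect size can be computed from d directly, under the usual assumptions of two normal distributions with equal variances: P(higher score in the “positive d” group) = pnorm(d / sqrt(2)).

cles <- function(d) pnorm(d / sqrt(2))  # common language effect size from Cohen's d
cles(0.60)   # ~.66, roughly a two-in-three chance
cles(0.20)   # ~.56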

Code and data available on my OSF page.

Categories
Uncategorized

Open Sex Role Inventory

Did you know about the Open Source Psychometrics Project? It has more than two dozen personality tests that are all free to use under a Creative Commons license, and it posts large, open-access datasets for their validation. Wow. What’s even stranger is that this site has no university affiliation so far as I can tell, and I can’t find any info on the site’s administrator. They’ve collected data from ~300,000 people; these datasets are massive and an incredible resource, but they are also virtually untouched by the academy. It’s downright criminal.

Case in point, the Open Sex Role Inventory (OSRI). I started looking up data on Bem’s Sex Role Inventory to help enrich the gender section of my intro psychology class when I came across this measure. The open dataset had ~20,000 data points in it, and I thought it’d be fun to play around with. I sent an email to the site administrator asking how to score it and about validation info. No email response, but the website was updated four days later with a TON of info and modifications to the scale to make a 20-item, better-validated short form from the original 44 items. Seriously, who is this wonderful person? These data could easily be published in any personality journal they wanted, but they’re just free for all to use with no glory or publishing prestige. Searching online, I found a single publication using the OSRI, and it was a Ph.D. dissertation that used the data to test out a new method of measurement invariance, for which this particular dataset was mostly incidental (finding that the 44-item version of the OSRI didn’t have measurement invariance across heterosexual vs. non-heterosexual groups).

Anyway, back to the point. I wanted to have a visualization for my intro psychology class to clarify the distinction between sex (i.e., biological/physiological) and gender (i.e., socio-cultural). Specifically, I wanted to talk about Sandra Bem’s theory, published way back in 1974, that gender roles are a spectrum with at least two dimensions (masculinity vs. femininity). I’m aware of challenges to this distinction (e.g., that sex/gender are inseparable, and that sex is also a spectrum), but broadly won’t get into that here. In any event, the dataset I have only has sex measured as “Male, Female, Other,” so a categorical model will need to do for now. My goal was to show that there is a sizable minority of people beyond cis-gendered men and women. The dataset is pretty large (~16,000 after missing data were removed), so I decided to use geom_hex() with the viridis color scheme. This crushes my computer and takes a long time to run, but in the end, after a lot of obsessing over details, I am happy with the result. Special thanks to this post on Stack Overflow that helped me solve a graphical glitch that stumped me for a long time.

library(ggplot2)
library(viridis)  # for scale_fill_viridis()

# Hexbin plot of masculinity vs. femininity short-form scores, faceted by gender
ggplot(mydata2, aes(x = FemSF, y = MascSF)) +
  geom_hex(aes(colour = ..count..), bins = 22) +
  theme_classic() +
  facet_grid(. ~ gender) +
  scale_fill_viridis() +
  geom_hline(yintercept = mean(mydata$MascSF)) +  # overall means as reference lines
  geom_vline(xintercept = mean(mydata$FemSF)) +
  labs(y = "Masculinity", x = "Femininity") +
  geom_text(x = 4.25, y = 4.25, label = "Androgynous", col = "darkslategray2") +
  geom_text(x = 2.0, y = 1.75, label = "Undifferentiated", col = "darkslategray2") +
  geom_text(x = 2.0, y = 4.25, label = "Masculine", col = "darkslategray2") +
  geom_text(x = 4.25, y = 1.75, label = "Feminine", col = "darkslategray2")
Hexagonal Scatterplot for the OSRI

I like the way this graph came out. It shows that indeed, you can predict self-identified male or female status with the items on this questionnaire. But it also shows that there’s a lot of variation, with folks that are androgynous (i.e., high in both) or undifferentiated (i.e., low in both) among both males and females. The “Other” gender category looks more similar to the female category than the male category, but sits a little more in the center overall. When I did some t-tests comparing men vs. women, femininity had a Cohen’s d of 1.14 and masculinity had a Cohen’s d of 0.82. Those are huge effect sizes! Much larger than the majority of sex differences in the literature.
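
For anyone curious, here is roughly how those effect sizes could be computed; the column names are from my dataset, but the exact gender labels and the use of the effsize package here are my own illustrative assumptions, not necessarily how I ran the original t-tests.

library(effsize)
men   <- subset(mydata2, gender == "Male")    # assumed coding of the gender variable
women <- subset(mydata2, gender == "Female")
cohen.d(women$FemSF, men$FemSF)    # femininity (reported above as d ~ 1.14)
cohen.d(men$MascSF, women$MascSF)  # masculinity (reported above as d ~ 0.82)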

Out of general interest, I decided to do a few more quick plots to separate out the heterosexual vs. the non-heterosexual participants (I binned homosexual, bisexual, asexual, and other into one category for comparison in this analysis). Nothing fancy this time around, just some simple density plots:

# Density plots by gender, faceted by sexual orientation (heterosexual vs. not).
# Note: 'pal' is a custom colour palette defined earlier in the script (not shown here).
ggplot(mydata2, aes(x = Fem, fill = gender)) +
  geom_density(alpha = .4) +
  facet_grid(. ~ orientation.d) +
  scale_fill_manual(values = pal) +
  labs(x = "Femininity", fill = "")

ggplot(mydata2, aes(x = MascSF, fill = gender)) +
  geom_density(alpha = .4) +
  facet_grid(. ~ orientation.d) +
  scale_fill_manual(values = pal) +
  labs(x = "Masculinity", fill = "")
Density plots for femininity and masculinity, split by gender and sexual orientation

One thing that is pretty striking about these plots is that the femininity and masculinity scales predict identifying as a heterosexual man or woman REALLY well, with massive effect sizes. Folks identifying as “other” fell somewhere in between, though a little closer to the female profile. However, for the non-heterosexual folks it’s a different story. The sex difference is still there for femininity, but less pronounced, mostly accounted for by men being more feminine. For the masculinity scale, though, there’s basically no discrimination at all! Men, women, and non-binary folks can’t really be distinguished based on their masculinity scores. I suppose that’s consistent with the measurement invariance issue the Ph.D. thesis mentioned above picked up on. Not sure what it means in a theoretical sense, but it’s definitely interesting.

I could stop here, but I wanted to also do a quick factor analysis on the short form data, just to see if the factor structure looked good. Nothing fancy or in-depth, just an exploratory factor analysis extracting two factors, with principal axis factoring, and an oblique rotation. I also fussed around until I could make a nice plot of the factor loadings, using the ggrepel package.

#Factor Analysis

library(psych)    # for fa()
library(dplyr)    # for select()
library(ggrepel)  # for geom_label_repel()

# Select the 20 short-form items
mydata4 <- select(mydata2,
                  Q4, Q26, Q28, Q30, Q34, Q36, Q38, Q40, Q42, Q44,
                  Q1, Q3, Q5, Q7, Q9, Q17, Q29, Q35, Q39, Q43)

corMatsf <- cor(mydata4)

# Two-factor EFA with principal axis factoring and an oblique (oblimin) rotation
solutionsf <- fa(r = corMatsf, nfactors = 2, rotate = "oblimin", fm = "pa")

print(solutionsf$loadings, digits = 2, cutoff = 0)

# Made a .csv file of the loadings manually by copying into Excel (can't figure out a better solution)
loadings.sf <- read.csv("efa.loadingsSF2.csv")

#The plot

ggplot(loadings.sf, aes(x = PA1, y = PA2, label = item, col = FemMasc)) +
  geom_label_repel() +
  geom_point(size = 3) +
  scale_y_continuous(limits = c(-.80, .80), breaks = seq(-.80, .80, .10)) +
  scale_x_continuous(limits = c(-.80, .80), breaks = seq(-.80, .80, .10)) +
  geom_hline(yintercept = 0) +
  geom_vline(xintercept = 0) +
  scale_color_manual(values = c("hotpink", "royalblue")) +
  labs(x = "Factor 1 (Masculinity)", y = "Factor 2 (Femininity)", col = "") +
  guides(col = FALSE)

Plot of standardized factor loadings for OSRI

That looks pretty excellent: items are clearly distinguished by their factors, and the factors are mostly uncorrelated. So the measurement seems solid. No wonder, given the amount of data that went into validating it! Overall, a very cool dataset and measure. I might even use this myself some day in a published study…
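
As a quick numeric check on the “mostly uncorrelated” claim, the factor correlation matrix is available directly from the psych::fa() output when an oblique rotation is used (object name as defined in the code above).

solutionsf$Phi   # factor correlation matrix from the oblique solution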

If you want to take the test yourself, take a look here: https://openpsychometrics.org/tests/OSRI/

Data and syntax available on my OSF site. Data were used under a CC BY-NC-SA 4.0 license, with attribution to the mysterious owner of https://openpsychometrics.org.

Categories
Uncategorized

Pathfinder Monster Database: AC vs Touch AC

Ok, one more visit to this Pathfinder monster database before I’m on to a new dataset. This time, I wanted to take a look at the relationship between Armor Class (AC; i.e., how hard a monster is to hit) and Challenge Rating (CR; i.e., how tough the monster is, according to the game developers). There should be a pretty strong linear relationship between AC and CR.

However, the thing I’m really interested in is the relationship between Touch AC and CR. Because Pathfinder is needlessly complicated, monsters have a separate AC for just touching them (or getting hit with certain kinds of attacks like rays or bullets). It’s been my experience that touch AC is always low and doesn’t seem to change at all as the game progresses. Pragmatically speaking, this means that at a certain point touch attacks basically always hit. Let’s see if that’s the case. I think some run-of-the-mill scatterplots will do here, but might as well make them fancy. Four packages are needed: (a) ggplot2, for the basic plotting; (b) ggExtra, to add density distributions to the plot; (c) cowplot, to stitch the two plots together at the end; and (d) ggthemes, in case I want some custom color palettes.

library(ggplot2)
library(ggExtra)
library(cowplot)
library(ggthemes)

Then, I can make the plots. I made the dots transparent with alpha = 0.3 to deal with overplotting. Personally, I find it looks cleaner than geom_jitter(); it’s always low-key bothered me that the randomness added by geom_jitter() actually changes the positions of the raw data a little. Feels a little like lying with data to me, sometimes. With the ggMarginal() command, I added in the density plots so we can see the overall distributions. Then at the end, I used cowplot to stitch the plots together side-by-side. Knowing this, I made sure both graphs had the same y-axis limits to facilitate easier comparison, a key component of the small multiples approach.

# Plot 1: AC vs. CR, with a smoother
p1 <- ggplot(mydata, aes(x = CR, y = AC)) +
  geom_point(size = 2, alpha = 0.3, color = "steelblue") +
  theme_classic() +
  geom_smooth(col = "royalblue", se = FALSE) +
  scale_y_continuous(breaks = seq(0, 70, 5)) +
  scale_x_continuous(breaks = seq(0, 40, 5))

# Add marginal density plots
p1m <- ggMarginal(p1, type = "density", fill = "steelblue")

# Plot 2: Touch AC vs. CR, on the same y-axis scale for easier comparison
p2 <- ggplot(mydata, aes(x = CR, y = AC_Touch)) +
  geom_point(size = 2, alpha = 0.3, color = "steelblue") +
  theme_classic() +
  geom_smooth(col = "royalblue", se = FALSE) +
  scale_y_continuous(limits = c(0, 70), breaks = seq(0, 70, 5)) +
  scale_x_continuous(breaks = seq(0, 40, 5)) +
  labs(y = "Touch AC")

p2m <- ggMarginal(p2, type = "density", fill = "steelblue")

# Stitch the two plots together side-by-side and save to disk
totalplot2 <- cowplot::plot_grid(p1m, p2m, labels = "auto", ncol = 2)

cowplot::save_plot("scatter.pathfinder.png",
                   totalplot2,
                   base_aspect_ratio = 2)
Scatterplots of AC and CR
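
The correlations I discuss below can be checked straight from the data frame; a quick sketch using the same column names (handling missing values with complete.obs is my assumption).

cor(mydata$CR, mydata$AC, use = "complete.obs")        # reported below as r = .91
cor(mydata$CR, mydata$AC_Touch, use = "complete.obs")  # reported below as r = .12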

Ok, well there’s a clear (very strong) linear relationship between CR and AC (r = .91). That’s probably not surprising, given that the game designers probably wanted that to happen by design, to keep up with the increases to attack bonuses that players get each level. However, touch AC is only very weakly related to CR (r = .12). In fact, it’s really only mythic-level monsters beyond CR 20 that see any increase in Touch AC at all! So for the vast majority of play, there’s no relationship between CR and Touch AC. I got to wondering then, what is the variance in touch AC actually attributable to? A good first guess would be a creature’s size:

# Touch AC vs. CR again, now coloured by creature size, with a linear fit per size
ggplot(mydata, aes(x = CR, y = AC_Touch, col = Size)) +
  geom_point(size = 2, alpha = 0.3) +
  theme_classic() +
  geom_smooth(method = "lm", se = FALSE) +
  scale_y_continuous(limits = c(0, 70), breaks = seq(0, 70, 5)) +
  scale_x_continuous(breaks = seq(0, 40, 5)) +
  labs(y = "Touch AC") +
  scale_color_ptol()

Yup, it looks like the majority of the variation is due to size. So in summary, touch AC doesn’t really change as monster CR increases, and most of the variation is due to size. It’s a weird mechanic to add to the game, and it’s easily abusable by additions like the gunslinger, which always hits touch AC. As far as I can tell, though, it hasn’t been removed from the second edition currently under playtest. Oh well, add it to the list of weird broken mechanics, I suppose.

Scatterplots of touch AC and CR, split by size

Syntax and data available on the blog’s OSF page.