The Problem of Forking Paths
There is a lot of time to think while at home on sabbatical, and since my role is so often teaching or conducting statistical analyses, I think about statistics pretty often. The problem that has been vexing me lately is related to what Gelman and Loken call the “Garden of Forking Paths.” Simply put, most research hypotheses in the social sciences have a one-to-many relationship with their associated statistical hypotheses. In fact, there are so many different statistical hypotheses that could represent evidence (or absence of evidence) for a given research hypothesis, it is positively breathtaking. For instance, let’s start with a null hypothesis test for a two-group between-subjects comparison of a numerical outcome — arguably one of the most basic kinds of comparisons taught in virtually every introductory statistics course. Just off the top of my head, you could use a Student t-test, an unequal-variances (Welch) t-test, a Wilcoxon rank-sum test, a permutation test, a Bayesian t-test, structured means modelling, or a MIMIC model. On top of that, there are decisions to be made about outliers, random responders, covariates, and a dizzying array of possible ways to measure almost any latent construct (e.g., Fried, 2017 summarizes 7 different instruments to measure depression, encompassing 52 different symptoms!). There are literally thousands of plausible ways to formalize any given hypothesis in statistical terms — a nearly infinite hypothesis space if you break things down into smaller variations.
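To make the one-to-many mapping concrete, here is a minimal sketch in Python (simulated data, scipy) of three of the many statistical formalizations of the same verbal hypothesis, “group A scores higher than group B.” The data and seed are entirely hypothetical:

```python
import numpy as np
from scipy import stats

# Simulated outcome scores for two hypothetical groups
rng = np.random.default_rng(42)
a = rng.normal(0.3, 1.0, 50)  # group A
b = rng.normal(0.0, 1.2, 50)  # group B

# Three of the many possible statistical formalizations of one verbal hypothesis
p_student = stats.ttest_ind(a, b, equal_var=True).pvalue   # Student t-test
p_welch = stats.ttest_ind(a, b, equal_var=False).pvalue    # Welch (unequal-variances) t-test
p_wilcoxon = stats.mannwhitneyu(a, b).pvalue               # Wilcoxon rank-sum / Mann-Whitney U

# Each test encodes slightly different assumptions, so even on identical
# data the p-values differ; with other analytic forks (outliers, covariates,
# measurement choices) the space of formalizations explodes.
print(p_student, p_welch, p_wilcoxon)
```

Each of these is a defensible translation of the same research hypothesis, yet they answer subtly different statistical questions.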
This is a problem for a post-positivist approach to science. Though these principles go mostly unstated, Bisel & Adame (2017) succinctly summarize the ontology, epistemology and axiology of this worldview:
Post-positivistic research assumes that social reality is out there and has enough stability and patterning to be known. Social reality is conceived as coherent, whole, and singular. […] Post-positivistic research assumes that social reality is measurable and knowable, albeit difficult to access. […] Post-positivistic research assumes that knowledge about social reality is inherently worthwhile to acquire and should be as value neutral as possible in its characterization of that reality. (Bisel & Adame, 2017, p. 1)
Most researchers doing quantitative work (myself included) subscribe to some variation of this belief structure, either implicitly or explicitly. But the garden of forking paths is a serious problem for these beliefs. First, there’s a problem of ontology: Of the near-infinite possible options a researcher is faced with to formalize a verbal hypothesis in statistical terms, only a small subset of near-equivalent options can be “true” if reality is singular. Then there’s the problem of value neutrality: many analytic choices are not value neutral, and might be made consciously (or unconsciously) to support the analyst’s preexisting beliefs. For instance, when handling the familywise error rate problem, a researcher may, a priori, choose a procedure known to be conservative (e.g., a Bonferroni correction) when they believe that the null hypothesis is actually true. This would bias a study against the alternative hypothesis even if the study were preregistered.
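To illustrate how the choice of correction procedure is itself a fork, here is a small sketch (entirely made-up p-values, Python): the conservative Bonferroni correction and the uniformly more powerful Holm step-down procedure can disagree about how many tests survive at α = .05.

```python
pvals = [0.01, 0.013, 0.02, 0.04]  # hypothetical p-values from 4 tests
m = len(pvals)

# Bonferroni: multiply every p-value by the number of tests (capped at 1).
bonferroni = [min(p * m, 1.0) for p in pvals]

# Holm step-down: multiply the i-th smallest p by (m - i), enforcing
# monotonicity so adjusted p-values never decrease.
holm = []
running_max = 0.0
for i, p in enumerate(sorted(pvals)):
    running_max = max(running_max, min(p * (m - i), 1.0))
    holm.append(running_max)

sig_bonf = sum(p < 0.05 for p in bonferroni)  # tests surviving Bonferroni
sig_holm = sum(p < 0.05 for p in holm)        # tests surviving Holm
print(sig_bonf, sig_holm)
```

With these hypothetical numbers, only one test survives Bonferroni while all four survive Holm, so an analyst’s a priori preference between two perfectly legitimate procedures can change the substantive conclusion.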
A lot of the scientific reform movement criticizes “p-hacking,” where a person essentially exploits these degrees of freedom. That is, some analysts continue to explore the nearly infinite hypothesis space of possibilities for analysis until they find something that supports their argument, then report that as evidence while failing to report all the attempts that did not support their argument. Given publication bias, this very consistently results in an inflation of effect sizes and false positives. It’s been discussed extensively elsewhere, so there is no need to describe it further, other than to reiterate that exploiting these degrees of freedom is obviously a terrible way to do science.
Preregistration of hypotheses and a data analysis plan is a useful way to curb these sorts of abuses. With preregistration, the scientist limits the hypothesis space by pre-planning the analysis and design and committing to an approach in advance without exploration. This does a great job at reducing researcher bias and the exploitation problem noted above, and is worth implementing for that alone. Indeed, transparency is widely considered to be a virtue in statistical analysis. But the problem remains that the sheer number of arbitrary choices that need to be made before preregistering means that there is no way to know if those choices were the right choices. That is, if reality is indeed singular, and there are tens of thousands of ways to translate a verbal hypothesis into a statistical format, then the specific subset of statistical formulations that a researcher chooses in any given study (preregistered or not) is, in all likelihood, not actually the best summary of reality. That is, preregistration curbs biasing of effects due to motivated reasoning, but does not solve the forking paths problem. This point has been discussed previously by Rubin (2017) and in a more mathematically sophisticated way by Devezer et al. (2021).
Another option, which can be combined with preregistration or used in isolation, is multiverse analysis. Here, the researcher chooses hundreds or even thousands of possible analyses within the hypothesis space instead of just a handful, as would be more typical. Then, the researcher graphically presents all of those analyses, sometimes even producing a sort of average across many different analysis approaches. In this way, the researcher is able to see how much the results depend on idiosyncratic data analysis choices, and which choices have a big impact. There’s much to recommend this technique, though the obvious trepidation for most analysts is that it is an exhausting, incredible amount of extra work. The other obvious problem for a post-positivist is that we still don’t know which of these analyses to trust most! It’s better insomuch as the hypothesis space becomes more visible, but given the sheer size of the hypothesis space (or “multiverse”) for any given question, it’s still quite possible that the researcher has missed the best summary of reality. Moreover, if the results do vary substantially across analysis approaches, it may be difficult to know which approach (if any) is true. You’re only really in a confident place if you try thousands of analyses and they all produce results that are essentially equivalent. However, the assumptions of post-positivism (i.e., a singular reality) would mean that this will rarely ever be the case!
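A toy sketch of the multiverse combinatorics (simulated data and hypothetical analytic choices, Python/scipy): crossing just three small decisions already yields a dozen specifications, and realistic multiverses multiply far faster.

```python
import itertools
import numpy as np
from scipy import stats

# Simulated predictor and outcome with a modest true association
rng = np.random.default_rng(1)
x = rng.normal(0, 1, 200)
y = 0.2 * x + rng.normal(0, 1, 200)

# Three hypothetical analytic forks, each with a few defensible options
outlier_cutoffs = [np.inf, 3.0, 2.5]     # keep cases with |z| below cutoff
transforms = [lambda v: v, np.tanh]      # raw vs. shrunk outcome
estimators = ["pearson", "spearman"]     # correlation flavor

results = []
for cutoff, tf, est in itertools.product(outlier_cutoffs, transforms, estimators):
    keep = np.abs(stats.zscore(x)) < cutoff
    yy = tf(y[keep])
    if est == "pearson":
        r, p = stats.pearsonr(x[keep], yy)
    else:
        r, p = stats.spearmanr(x[keep], yy)
    results.append((cutoff, est, r, p))

# 3 outlier rules x 2 transforms x 2 estimators = 12 specifications
print(len(results))
```

Even this cartoon version produces 12 analyses from three minor choices; add measurement instruments, covariate sets, and model families, and the multiverse quickly reaches the thousands the technique is named for.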
The inevitable conclusion then is that we can never be entirely certain which statistical approach is best. Statisticians are usually good at identifying and ruling out bad, untenable approaches (e.g., median splits, responder analyses; Andrew Althouse has an excellent list) but identifying the “best” approach will always be controversial and based on the researcher’s values, beliefs, and biases. There is no truly bias-free research, even though we can strive to reduce bias in various ways. Note too that I don’t necessarily mean political bias — analysts vary in their attitudes toward issues such as the relative importance of Type I vs. Type II errors, or whether robustness matters more than clear effect sizes.
Reflexivity in Qualitative Research
This general problem has been known to qualitative researchers for quite some time, and their pragmatic solution is increased transparency through thorough analysis and reporting of one’s own biases. The epistemological differences of qualitative researchers have very frequently meant that statisticians and qualitative analysts don’t see eye-to-eye on a lot of topics, because their fundamental view of reality and how knowledge is created can be very different. Briefly, many qualitative researchers reject key components of post-positivism. Instead of a singular reality, they maintain that there are numerous, equally valid ways to understand the world. They also question whether any knowledge can be “value neutral” because of the inherent limitations of our own perspectives.
Thus, a common recommendation in qualitative research is to include a reflexivity statement. A good basic definition of reflexivity is given by Haynes (2012):
In simple terms, reflexivity is an awareness of the researcher’s role in the practice of research and the way this is influenced by the object of the research, enabling the researcher to acknowledge the way in which he or she affects both the research processes and outcomes. […] In other words, researcher reflexivity involves thinking about how our thinking came to be, how pre-existing understanding is constantly revised in the light of new understandings, and how this in turn affects our research. (Haynes, 2012).
I believe that reflexivity statements could be productively applied in quantitative research to improve transparency and potentially reduce bias by providing necessary context for the researcher’s analytic and design choices. There is some precedent for this, as Gelman and Hennig (2015) describe “Honest acknowledgement of the researcher’s position, goals, experiences, and subjective point of view” (p. 978) as a virtue alongside more conventional virtues such as transparency and impartiality.
As a general rule, reflexivity statements are virtually never used in quantitative research, even though they are a requirement for publishing qualitative research in many venues. As an illustrative example (and a challenge for myself), I am going to try to write a “reflexivity statement” for one of my prior quantitative papers:
Mackinnon, S. P., Ray, C. M., Firth, S. M., & O’Connor, R. M. (2019). Perfectionism and negative motives for drinking: A 21-day diary study. Journal of Research in Personality, 78, 177-178. https://doi.org/10.1016/j.jrp.2018.12.003
Reflexivity Statement for Mackinnon et al. (2019)
I wrote the grant application over the second half of 2015 and finalized the design and ethics application in August 2016. When I wrote the grant, I was a limited-term employee at Dalhousie with an eye to hopefully getting a tenure track position. While waiting for the grant results, I unexpectedly transitioned into a permanent instructor position at Dalhousie — formally, this position has increased teaching and no research requirement. When I won the grant, I decided to take on the grant project (with my department’s blessing) despite having no institutional requirement to conduct research.
My background in perfectionism was from studying under Simon Sherry, who was in turn influenced heavily by Gord Flett and Paul Hewitt. Succinctly put, this comes with a set of inherited biases that (a) perfectionism is primarily bad for mental health and (b) that perfectionism is, to a great extent, a negative interpersonal phenomenon. This affected my choice to focus on nondisplay of imperfection rather than other facets of perfectionism that some argue are more positive (e.g., high standards).
My background in alcohol research came from a postdoctoral fellowship with Sherry Stewart. Alongside the long history of behaviorist research at Dalhousie University, where I received my Ph.D., I tend to think about psychological phenomena of various sorts in operant conditioning terms. That is, I conceptualize drinking primarily as a function of positive and negative reinforcement. Thus, I focused primarily on negatively reinforcing motives (i.e., drinking to cope and conformity) in this paper and did not consider alternatives to this framework. Indeed, I am so inured to this perspective, I find it difficult to think of other explanatory frameworks that might be equally useful.
I am also in the position of being both the principal investigator and the primary statistical analyst on this paper. As a statistically-minded person, my bias in interpretation is often to think about statistical (rather than conceptual) limitations to research. As an analyst, I have a broad overarching bias towards more parsimonious models. That is, I value models with fewer parameters and add model complexity only when forced. This bias shows in a few places, such as using item parcels instead of items for some measures, deliberately choosing some 3-item scales in the design process, and using fixed slopes so that the model would be as parsimonious as possible. Perhaps most notably, two analyses in the paper were reviewer requests and both involved increased model complexity: (a) adding alcohol quantity to the model and (b) running a supplementary random slopes model. My own resistance to these components in the text is likely due to my bias for model parsimony.
That is a lot of text, so I could see a statement like that being an online supplement rather than in the body of the text. But I think that the process, when combined with reading the paper, would help readers understand my own biases and choices so they could gauge for themselves whether that influenced the final results. That is, my own biases shaped which forking paths I chose both in the design and analysis of my own study, and knowing those biases might help readers identify my blind spots. Though reflexivity statements are not a cure-all for psychological science, I think that they are an under-utilized tool that could be productively applied to increase transparency in quantitative studies.