Multicollinearity: Why you should care and what to do about it
Multicollinearity is a problem for statistical analyses. This large, unwieldy word essentially refers to situations where your predictor variables are so highly correlated with one another that they become redundant. Generally speaking, this is a problem because it inflates the standard errors of your coefficients, which increases your Type II error rate (i.e., false negatives). In the most severe cases, multicollinearity can produce really bizarre results that defy logic. For example, the direction of a relationship can sometimes reverse (e.g., a positive relationship becomes negative). If multicollinearity is an issue in your analysis, the results cannot be trusted.
Jeremy Taylor did a nice job explaining multicollinearity in his blog in layman’s terms, and I wanted to expand on it by giving an example using real data.
Sample Data Illustrating Multicollinearity
I’m going to use some archival data I have on hand from 123 university students. Let’s say I had a hypothesis that feeling disconnected from other people leads to increased depression. I measure three variables: loneliness, social disconnection, and depression. I start by looking at the correlations between each of these variables and find that all three are positively related to each other.
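If you’d like to follow along in code, here is a minimal Python sketch that inspects the pairwise correlations. The data below are simulated to mimic this pattern; they are a stand-in for illustration, not my actual archival dataset.

```python
import numpy as np
import pandas as pd

# Simulated stand-in for the archival data (illustration only):
# loneliness and social disconnection are built to be very highly correlated,
# and both are related to depression.
rng = np.random.default_rng(0)
n = 123
loneliness = rng.normal(size=n)
disconnection = 0.9 * loneliness + 0.44 * rng.normal(size=n)  # r with loneliness ~ .9
depression = 0.3 * loneliness + 0.3 * disconnection + rng.normal(size=n)

df = pd.DataFrame({"loneliness": loneliness,
                   "disconnection": disconnection,
                   "depression": depression})

# Pairwise Pearson correlations among the three measures
print(df.corr().round(3))
```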
Okay, looks like loneliness and social disconnection are strongly correlated with depression. However, note also that the correlation between loneliness and social disconnection is absurdly high (r = .903). This suggests that these two variables are redundant: They’re measuring the same thing. Watch what happens if I ignore this, and run a multiple regression with both loneliness and social disconnection as simultaneous predictors of depression. A portion of the output is below:
If these results are interpreted using the p < .05 criterion, we would conclude that neither loneliness nor social disconnection uniquely predicts depression. This is obviously nonsense, since we can see from the correlations above that there is a pretty strong relationship between these variables. Moreover, if I calculate the R² value for this analysis (% of variance explained), I can see that overall, these two variables explain about 37% of the variance in depression, p < .001. This kind of incongruence is a classic sign of multicollinearity, and it can be further diagnosed from the output.
In the above output, you can see a statistic labeled “VIF,” which stands for Variance Inflation Factor. Ideally, this should be close to 1; it is computed as 1/(1 − R²) from regressing that predictor on all the other predictors, so larger values indicate more redundancy among the predictors. I’d love to give you a clear cutoff value for this statistic, but people can’t seem to agree on one. As a rule of thumb, a VIF of 10+ is almost certainly a problem and a VIF of 5+ should be seen as a warning sign. Generally speaking, though, when you encounter a pattern of results like those described above, multicollinearity is a likely culprit.
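To see the whole pattern reproduced in code, here’s a minimal Python sketch using statsmodels. The data are simulated to mimic the situation above (two nearly redundant predictors), so the exact numbers won’t match my output, but the diagnostic logic is the same: check the coefficient p-values, the overall model fit, and the VIFs.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Simulated stand-in data: two nearly redundant predictors and an outcome
rng = np.random.default_rng(0)
n = 123
loneliness = rng.normal(size=n)
disconnection = 0.9 * loneliness + 0.44 * rng.normal(size=n)  # r with loneliness ~ .9
depression = 0.3 * loneliness + 0.3 * disconnection + rng.normal(size=n)

X = sm.add_constant(pd.DataFrame({"loneliness": loneliness,
                                  "disconnection": disconnection}))
model = sm.OLS(depression, X).fit()

# Individual coefficients can look "non-significant" even though the
# overall model clearly explains variance in the outcome.
print(model.summary())                      # coefficients and their p-values
print("R-squared:", round(model.rsquared, 3),
      "| overall model p-value:", round(model.f_pvalue, 5))

# Variance Inflation Factor for each predictor (skipping the intercept column)
for i, name in enumerate(X.columns):
    if name != "const":
        print(name, "VIF =", round(variance_inflation_factor(X.values, i), 2))
```

With predictors this strongly correlated, the VIFs will come out well above the ideal value of 1, even though each coefficient on its own looks unimpressive.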
I have multicollinearity, what do I do?
Here are three common recommendations for handling multicollinearity:
- Remove one of the offending variables from the analysis. So in the example above, I could drop “social disconnection” from my study, because it is redundant with loneliness.
- Combine the offending variables into a single variable. There are a variety of ways to do this, but one simple approach in the example above would be to standardize loneliness and social disconnection, then sum them into a single variable (see the sketch after this list). Other approaches might involve deriving composite scores using factor analysis, or using latent variables in structural equation modelling.
- Recruit more participants. Generally speaking, standard errors get smaller as the number of participants increases, so the problems associated with multicollinearity can often be mitigated when the sample size is large.
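To make the second option concrete, here’s a sketch of the standardize-then-sum approach, again using simulated stand-in data; the composite name “felt_disconnection” is just something I made up for the example.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

# Simulated stand-in data: two nearly redundant predictors and an outcome
rng = np.random.default_rng(0)
n = 123
loneliness = rng.normal(size=n)
disconnection = 0.9 * loneliness + 0.44 * rng.normal(size=n)
depression = 0.3 * loneliness + 0.3 * disconnection + rng.normal(size=n)

df = pd.DataFrame({"loneliness": loneliness,
                   "disconnection": disconnection,
                   "depression": depression})

# Standardize each predictor (z-scores), then sum them into one composite.
predictors = df[["loneliness", "disconnection"]]
z = (predictors - predictors.mean()) / predictors.std()
df["felt_disconnection"] = z.sum(axis=1)   # hypothetical composite variable

# With a single predictor, there is nothing left for it to be collinear with.
model = sm.OLS(df["depression"], sm.add_constant(df["felt_disconnection"])).fit()
print(model.summary())
```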
Overall, many analyses are pretty robust to all but the most severe cases of multicollinearity. However, understanding this basic concept – and how to mitigate it – is certainly an important part of any researcher’s toolbox.