
Pathfinder Monster Database Plots

I want to incorporate more R into my classes at Dalhousie. The problem is, I'm a pretty bad R coder; I spent much of the past decade with SPSS and Mplus. But there's lots of evidence that R is the future of science. I find that the best way to learn is project-based, so I'm going to start blogging R code. I'll focus on topics that are inherently interesting to me, with an emphasis on data visualization. If I keep it fun, I'm more likely to stick with it.

So, to start, I'm going to analyze data from the Pathfinder Monster Database, a comprehensive database of all 2812 monsters from Paizo's tabletop roleplaying game, Pathfinder. I've played Pathfinder for years now and there are a lot of crunchy numbers in there. Probably why I like it so much! I'm going to look at the relationship between creature type and two outcome variables: (a) Armor Class (i.e., how hard the creature is to hit) and (b) Challenge Rating (i.e., how tough the monster is overall). The goal is to see which creature type is "toughest" overall.

The data needed a little bit of cleaning (e.g., changing “Dragon” to “dragon” for some entries), but it was in good shape overall. I decided to try out ridge plots as the way to visualize the data, since I’ve never used them before. First thing to do is load the necessary libraries into R.


library(ggplot2)   # core plotting
library(ggridges)  # ridge plots (geom_density_ridges, theme_ridges)
library(dplyr)     # data manipulation (group_by, summarise, %>%)
library(ggExtra)

Next, since I want the two plots to be in order from highest to lowest values of AC/CR, I need to use the next bit of code which requires dplyr. This creates two new variables I can use to re-order the y-axis with later. I also created a color palette of 13 random colors, since there are 13 creature types and I didn’t like the default ggplot2 colors here.

# Order creature types by mean AC
avg <- mydata %>%
  group_by(Type) %>%
  summarise(mean = mean(AC))

ACorder <- avg$Type[order(avg$mean)]

# Order creature types by mean CR
avg2 <- mydata %>%
  group_by(Type) %>%
  summarise(mean2 = mean(CR))

CRorder <- avg2$Type[order(avg2$mean2)]

# Create a colour palette (13 colours, one per creature type)
pal <- rainbow(13)

OK, now I can create the two plots using the geom_density_ridges() function. This comes from the ggridges package; base ggplot2 can't draw ridge plots on its own.


ggplot(mydata, aes(x = CR, y = Type, fill = Type)) +
  geom_density_ridges() +
  theme_ridges() +
  theme(legend.position = "none") +
  scale_y_discrete(limits = CRorder) +
  scale_x_continuous(limits = c(0, 30), breaks = seq(0, 30, 5)) +
  scale_fill_manual(values = pal) +
  labs(y = "", x = "Challenge Rating")

ggplot(mydata, aes(x = AC, y = Type, fill = Type)) +
  geom_density_ridges() +
  theme_ridges() +
  theme(legend.position = "none") +
  scale_y_discrete(limits = ACorder) +  # ACorder here, so the AC plot is sorted by its own means
  scale_x_continuous(limits = c(0, 50), breaks = seq(0, 50, 5)) +
  scale_fill_manual(values = pal) +
  labs(y = "", x = "Armor Class")

So, the toughest monster types in Pathfinder are dragons, followed by outsiders; the weakest are vermin and animals. As it turns out, the rankings of toughness by CR and AC are exactly the same. However, the distribution for oozes is very different from everything else: these creatures tend to be really easy to hit, but are still tough because of their many other abilities and immunities. The positive skew in the CR distributions is interesting, since it shows that there are generally a LOT more monsters under CR 10, which makes sense given that very few games reach such high levels.

I like ridge plots. They work a lot better than overlapping histograms when there are lots of groups and lots of cases. There was a bit of difficulty with CR values less than 1 (e.g., some CRs are 1/3). Without the scale_x_continuous(limits = c(0, 30)) call, the graph displayed values less than 0, which is outside the range of the actual data. I believe the graph is now bunching all the CRs less than 1 (~217 data points) at "0" on the plot above. Overall, a fun first attempt, and neat data to work with.
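Incidentally, those fractional CRs are easy to convert to numeric before plotting. Here's a minimal sketch, assuming the raw CR column stores fractions as text like "1/3" (the format and helper name are my assumptions, not taken from the actual datafile):

```r
# Hypothetical helper: convert CR values stored as text (e.g. "1/3", "5")
# to numeric, so fractional CRs plot at their true positions instead of 0.
cr_to_numeric <- function(cr) {
  sapply(strsplit(as.character(cr), "/"), function(parts) {
    if (length(parts) == 2) {
      as.numeric(parts[1]) / as.numeric(parts[2])  # fraction like "1/3"
    } else {
      as.numeric(parts[1])                         # plain number like "5"
    }
  })
}

cr_to_numeric(c("1/3", "1/2", "5"))
```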

Datafile and syntax available on the blog’s OSF page.



Generalized Linear Models for Between Subjects Designs

There aren't many good, easy-to-understand resources on Generalized Linear Models. This is a shame, because they are usually a substantial improvement over conventional ANOVA analyses, since they can much better accommodate violations of the normality assumption. Check out some tutorial slides I created here:
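As a quick taste of the idea (not drawn from the slides themselves), here's a minimal sketch of a between-subjects comparison fit as a GLM with a Gamma distribution, for a positively skewed outcome where a normal-theory ANOVA would fit poorly; the data are simulated:

```r
# Simulate a positively skewed outcome for two groups, then compare the
# groups with a Gamma GLM (log link) rather than a normal-theory ANOVA.
set.seed(3)
group <- factor(rep(c("a", "b"), each = 40))
y <- rgamma(80, shape = 2, rate = ifelse(group == "a", 1.0, 0.5))

fit <- glm(y ~ group, family = Gamma(link = "log"))
summary(fit)  # the "groupb" coefficient tests the between-groups difference
```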


They only cover between-subjects designs. Maybe some time I’ll also make one for generalized mixed models, which take the best of GLiM and multilevel models and combine them into one.


Basics of SEM Tutorial

Attached are some slides that I’ve used to teach my PSYO 6003 Multivariate Statistics students the basics of structural equation modelling, which may be of some use to people using it for the first time. Check them out here:



Multicollinearity: Why you should care and what to do about it


Multicollinearity is a problem for statistical analyses. This large, unwieldy word essentially refers to situations where your predictor variables are so highly correlated with one another that they become redundant. Generally speaking, this is a problem because it will increase your Type II error rate (i.e., false negatives). In the most severe cases, multicollinearity can produce really bizarre results that defy logic; for example, the direction of a relationship can reverse (e.g., a positive relationship becomes negative). If multicollinearity is an issue in your analysis, the results cannot be trusted.

Jeremy Taylor did a nice job explaining multicollinearity in his blog in layman's terms, and I wanted to expand on it by giving an example using real data.

Sample Data Illustrating Multicollinearity

I'm going to use some archival data I have on hand from 123 university students. Let's say I had a hypothesis that feeling disconnected from other people leads to increased depression. I measure three variables: loneliness, social disconnection, and depression. I start by looking at the correlations among these variables and find that all three are positively related to each other.

[Table: correlations among loneliness, social disconnection, and depression]

Okay, looks like loneliness and social disconnection are strongly correlated with depression. However, note also that the correlation between loneliness and social disconnection is absurdly high (r = .903). This suggests that these two variables are redundant: They’re measuring the same thing. Watch what happens if I ignore this, and run a multiple regression with both loneliness and social disconnection as simultaneous predictors of depression. A portion of the output is below:

[Table: multiple regression output, with VIF statistics]

If these results are interpreted using the p < .05 criterion, we would conclude that neither loneliness nor social disconnection uniquely predicts depression. This is obviously nonsense, since we can see from the correlations above that there is a pretty strong relationship between these variables. Moreover, if I calculate the R² value for this analysis (% variance explained), I can see that, overall, these two variables explain about 37% of the variance in depression, p < .001. This kind of incongruence is a classic sign of multicollinearity, and can be further diagnosed from the output.

In the above output, you can see a statistic labeled “VIF.” This stands for Variance Inflation Factor. Ideally, this should be close to 1. As it gets larger, it indicates more redundancy among predictors. I’d love to give you a clear cutoff value for this statistic, but people can’t seem to agree on one. As a rule of thumb, a VIF of 10+ is almost certainly a problem and a VIF of 5+ should be seen as a warning sign. Generally speaking, though, when you encounter a pattern of results like those described above, multicollinearity is a likely culprit.
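To make the VIF idea concrete, here's a small sketch using simulated data (not the actual dataset from this post) that computes VIF by hand from its definition, VIF = 1 / (1 − R²), where R² comes from regressing one predictor on the other(s):

```r
# Simulate two highly correlated predictors (r around .9, echoing the
# loneliness/disconnection example) and compute VIF from its definition.
set.seed(1)
n <- 123
loneliness    <- rnorm(n)
disconnection <- 0.9 * loneliness + 0.3 * rnorm(n)

vif_for <- function(x, others) {
  r2 <- summary(lm(x ~ others))$r.squared  # R^2 from predicting x
  1 / (1 - r2)                             # variance inflation factor
}

vif_for(loneliness, disconnection)  # well above the rule-of-thumb warning levels
```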

I have multicollinearity, what do I do?

Here are three common recommendations for handling multicollinearity:

  1. Remove one of the offending variables from the analysis. So in the example above, I could drop “social disconnection” from my study, because it is redundant with loneliness.
  2. Combine the offending variables into a single variable. There are a variety of ways to do this, but one simple way to do this in the above example would be to standardize, then sum loneliness and social disconnection together into a single variable. Other approaches might involve deriving composite scores using factor analysis, or using latent variables in structural equation modelling.
  3. Recruit more participants. Generally speaking, standard errors get smaller as the number of participants increases, so the problems associated with multicollinearity can often be mitigated when the sample size is large.
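For option 2, here's a minimal sketch of the standardize-then-sum approach. The variable names are illustrative and the data are simulated, not the dataset above:

```r
# Z-score each redundant predictor, then sum them into a single composite.
set.seed(2)
x1 <- rnorm(50)
x2 <- x1 + rnorm(50, sd = 0.2)  # nearly redundant with x1

composite <- as.numeric(scale(x1)) + as.numeric(scale(x2))
# The composite can then replace x1 and x2 as a single predictor.
```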

Overall, many analyses are pretty robust to all but the most severe cases of multicollinearity. However, understanding this basic concept, and how to mitigate it, is certainly an important part of any researcher's toolbox.