Bayesian Paired Comparisons for power ranking MLB teams (2019-2024)

Baseball is back! Spring training for the 2025 season started just a few days ago. Over the course of the off-season, here and there, I have been pulling this project together. The goal: power-rank each team.


Standings allow us to rank teams based on who has more wins and fewer losses. However, not all teams play one another over the course of a season. For followers of NCAA football, this is often a source of consternation in the build-up to the playoff selection: do teams that rack up wins against weaker opponents deserve as high a rank as a team with relatively more losses that has played much tougher opponents?


The task of sorting this out seems relatively complicated at first blush. And it kind of is. Thankfully, we have a class of statistical models that does not force us to deliberate in a committee. Instead, we can estimate each team's latent ability (latent meaning we cannot measure it directly, since not every team plays every other team) based on whom they beat and whom they lose to.


A class of statistical models allows us to do just that: paired comparisons. One such model is the Bradley-Terry model.


\(y_i = \operatorname{logit}^{-1}(\alpha_{a}-\alpha_{h})\)

\(\alpha \sim \mathcal{N}(0,1)\)

Where \(y_i\) is equal to 0 if the home team won and 1 if the away team won. \(\alpha_a\) is a parameter for the estimated latent ability of the away team, while \(\alpha_h\) is a parameter for the estimated latent ability of the home team. I then sort the estimated \(\alpha\) parameters in descending order to get an estimated rank for each team.
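
As a concrete illustration, here is a minimal sketch of what this model might look like in Stan, fit via cmdstanpy. The data names (`N`, `T`, `home`, `away`, `y`) and the file name are placeholders of my own, not necessarily what the repository uses.

```python
from pathlib import Path
from cmdstanpy import CmdStanModel

# Minimal Bradley-Terry sketch; variable names are illustrative placeholders.
bradley_terry = """
data {
  int<lower=1> N;                        // number of games
  int<lower=2> T;                        // number of teams
  array[N] int<lower=1, upper=T> home;   // home team index for each game
  array[N] int<lower=1, upper=T> away;   // away team index for each game
  array[N] int<lower=0, upper=1> y;      // 1 if the away team won, 0 otherwise
}
parameters {
  vector[T] alpha;                       // latent team abilities
}
model {
  alpha ~ normal(0, 1);                  // standard normal prior on abilities
  y ~ bernoulli_logit(alpha[away] - alpha[home]);
}
"""

Path("bradley_terry.stan").write_text(bradley_terry)
model = CmdStanModel(stan_file="bradley_terry.stan")
```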



I also run a version of the model that accounts for home-field advantage. To do this, I simply add an intercept term, \(\gamma\), to the model above. This adds a constant shift to the log-odds of the outcome, and that shift serves as the estimate of home-field advantage.


\(y_i = \operatorname{logit}^{-1}(\alpha_{a}-\alpha_{h} + \gamma)\)

\(\alpha \sim \mathcal{N}(0,1)\)

\(\gamma \sim \mathcal{N}(0,1)\)
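
A hedged sketch of how the earlier Stan program might change, continuing the placeholder names from the sketch above; only the blocks that differ are shown.

```python
# Only the blocks that change from the Bradley-Terry sketch above.
home_field_blocks = """
parameters {
  vector[T] alpha;                       // latent team abilities
  real gamma;                            // constant shift on the log-odds scale
}
model {
  alpha ~ normal(0, 1);
  gamma ~ normal(0, 1);
  y ~ bernoulli_logit(alpha[away] - alpha[home] + gamma);
}
"""
```

With the outcome coded as 1 for an away win, a negative estimate of \(\gamma\) would indicate a home-field edge.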

Finally, I run an extension of the Bradley-Terry model, sometimes referred to as the Davidson model, that allows for ties and an ordered set of outcomes. Rather than simply predicting whether a team won, the model now predicts the magnitude of the win. In this final model, \(y_i\) takes one of seven values:

- 1 if the home team won by 5 or more runs,
- 2 if the home team won by 2 to 4 runs,
- 3 if the home team won by 1 run,
- 4 if the home and away teams tied,
- 5 if the away team won by 1 run,
- 6 if the away team won by 2 to 4 runs, and
- 7 if the away team won by 5 or more runs.

One issue with this model is that ties are extremely rare, since games go to extra innings. That gives my ordered logistic regression some problems, because tie outcomes are so infrequent relative to the other possible outcomes.


\(y_i = \operatorname{ordered\_logit}^{-1}(\alpha_{a}-\alpha_{h} + \gamma)\)

\(\alpha \sim \mathcal{N}(0,1)\)

\(\gamma \sim \mathcal{N}(0,1)\)
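
A sketch of the ordered-outcome version, again with placeholder names; the six cutpoints separate the seven margin-of-victory categories.

```python
# Ordered-outcome sketch; y is now the 1-7 margin category described above.
davidson_style = """
data {
  int<lower=1> N;
  int<lower=2> T;
  array[N] int<lower=1, upper=T> home;
  array[N] int<lower=1, upper=T> away;
  array[N] int<lower=1, upper=7> y;      // 1 = home win by 5+, ..., 7 = away win by 5+
}
parameters {
  vector[T] alpha;                       // latent team abilities
  real gamma;                            // home-field term
  ordered[6] cut;                        // cutpoints between the 7 outcome categories
}
model {
  alpha ~ normal(0, 1);
  gamma ~ normal(0, 1);
  y ~ ordered_logistic(alpha[away] - alpha[home] + gamma, cut);
}
"""
```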


I pulled the box scores for the 2019 through 2024 regular seasons from the MLB API. With these data, I fit the three models using cmdstanpy and retain the last 2,000 simulations (or draws) of the estimated rankings. The full repository containing all of this can be found on my GitHub.
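
As a rough sketch of the fitting step (the CSV, the column names, and the sampler settings here are assumptions for illustration, not necessarily what the repository does):

```python
import pandas as pd
from cmdstanpy import CmdStanModel

# Hypothetical game-level data pulled from the MLB API; column names are illustrative.
games = pd.read_csv("games_2019_2024.csv")
teams = sorted(set(games["home_team"]) | set(games["away_team"]))
team_idx = {team: i + 1 for i, team in enumerate(teams)}   # Stan is 1-indexed

stan_data = {
    "N": len(games),
    "T": len(teams),
    "home": games["home_team"].map(team_idx).tolist(),
    "away": games["away_team"].map(team_idx).tolist(),
    "y": (games["away_score"] > games["home_score"]).astype(int).tolist(),
}

# 4 chains x 500 retained draws = 2,000 posterior draws (settings assumed).
model = CmdStanModel(stan_file="bradley_terry.stan")
fit = model.sample(data=stan_data, chains=4, iter_warmup=1000, iter_sampling=500, seed=2025)
alpha_draws = fit.stan_variable("alpha")   # shape: (2000, number of teams)
```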

The results of these models are in the plots below. The dot is the model's median estimated ranking for each team. The bars show the model's uncertainty about that ranking: a 95% interval within which the team's rank is expected to fall.
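
For reference, a small sketch of how medians and 95% intervals like those in the plots can be computed from the retained draws, continuing the placeholder names from the sketches above.

```python
import numpy as np

# alpha_draws: (n_draws, T) posterior draws of team ability from the fit above.
# Rank teams within each draw (1 = best), then summarize across draws.
ranks = (-alpha_draws).argsort(axis=1).argsort(axis=1) + 1
median_rank = np.median(ranks, axis=0)
lower, upper = np.percentile(ranks, [2.5, 97.5], axis=0)

for i, team in enumerate(teams):
    print(f"{team}: median rank {median_rank[i]:.0f}, 95% interval [{lower[i]:.0f}, {upper[i]:.0f}]")
```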





There are a few tweaks I could make to these models:


- Play around with the priors for the parameters a bit more. Though, from my testing, I don't think they would make much of a difference here.

- I could try something like a hierarchical model where I model player ability and sum the abilities of the players on each team. One issue with this approach is that each team's roster tends to remain relatively stable over time. While the pitcher may change from game to game, the rotation of pitchers tends to remain somewhat stable. So it may be something to try, but my a priori expectation is that it won't help the model much.

- There may be more that I am not thinking of, but Spring Training is underway and I am just too excited to sit on this project too much longer!