The True Champions of England

Forfatter

Christoffer Sejling

Dato

April 9, 2025

Again and again and again

The lives of football fans follow a cycle. In September the season starts. You hear the murmuring in the streets. This year is our year! Already in December we start to modify our expectations. It’s only nine points to first place. If our best players wouldn’t have gotten injured we could have been the leaders. Maybe we can make a good run when they return! Then comes March. Top four should be secured. Let’s just get our players fit for the next season. And finally, summer comes. Our young players are more experienced now. We just need to get that striker, and next year, the title will be ours! Unfortunately, it’s Real Madrid that gets that one striker that every team wants, and the story repeats itself.

Second place in 2023, second place in 2024, and poised for second place for the third time in a row in 2025. It is a lot better than a never ending stream of fourth places and the fifth and sixth places of the recent years that Arsenal fans have experienced… But still, there is no Premier League title in sight. Soon, our young and promising players will have peaked or decided to move on to Real Madrid, Barcelona, Manchester City, or Bayern Munich. Then, even the second placings will cease. But! What if Arsenal is not actually the second best team right now? The 0-1-3 point system is, after all, completely arbitrary. What if we break the system, bend the cycle, and look at the match results in a different way? Let’s consider how the Premier League table would look if, instead of using the arbitrary 0-1-3 point system, we use a universal scoring system based on a sound and recognized statistical study design, namely the head-to-head experiment.

Going head to head

Warning! This section contains the mathematical foundation for the forthcoming analysis in some detail.

The head-to-head experiment is a mathematical construction used to formalize and analyze experiments or data samples where we can envision observations result from a match-up or head-to-head between two contestants. Under the head-to-head structure, it is possible to estimate the (relative) power levels, i.e., likelihoods to win, of the contestants in the experiment. While it has a sports-like feeling to it, it can be useful in medical settings too, for example when asking questions of the type which of these two patients appear to have the worst prognosis? The extracted power levels (or severities in the medical setting) can be used to summarize the rankers’ beliefs and knowledge about a topic, which may be based on years of experience and tacit knowledge that can otherwise be very difficult to put a number on.

In mathematical notation we consider \(X_1,\ldots,X_n\) independent objects to be scored, each with common domain \(\mathcal{D}\). Let \(W: \mathcal{D}\times \mathcal{D}\rightarrow \{0,x,1\}\) be a so-called random win-function. Our sample of interest is then of the form \(W_1,\ldots,W_N\), where \(W_i\) is a realization of the random variable \[ W(X_l,X_k) \] with some \(l,k\in \{1,\ldots,n\}\) and \(i \in \{1,\ldots,N\}\). Another way of indexing this sample is to base the numeration on comparisons between the indices \(l\) and \(k\), such that for each comparison we have \(m_{lk}\) observations. The sample then takes the form \[\{W_{lk}^{(j)}\},\] where \(l,k=1,\ldots,n, j = 1,\ldots,m_{lk}\). This setup can be expanded to include a rater index, meaning that the win-decision mechanism can be rater-specific. In that case we observe a set of win-decisions for each rater \(r=1,\ldots,K\) such that the sample takes the form \[\{W_{lk}^{(r,j)}\},\] where \(l,k=1,\ldots,n, r = 1,\ldots,K,j = 1,\ldots,m_{lkr}\). Summing all rater and comparison specific sample sizes, we obtain the full sample size \(N\), \[ \sum_{l>k}\sum_{r=1}^K \sum_{j=1}^{m_{lkr}}M_{lk}^{(r,j)} = N. \] In the case where we include multiple raters in the experiment, we need to handle the intra-rater correlation appropriately.

We are interested in estimating the power level of each contestant \(X\). As in the Bradley-Terry model we parametrize the probability that index \(l\) beats index \(k\) as \[ P\left(W_{lk}^{(r,j)} = 1 \,|\, W_{lk}^{(r,j)} \neq x\right) = \frac{\lambda_l}{\lambda_l + \lambda_k}, \] where \(\lambda_l\) can be understood as the power level of contestant \(X_l\). If we treat a draw as survival until one time unit, the parameters \(\lambda_i, i = 1,\ldots,n,\) can be understood as hazard rates. If we assume that the instances that each contestant wins act as competing events, the likelihood function is \[\begin{align*} l_W(\lambda) = \prod_{l>k}\prod_{r=1}^K \prod_{j=1}^{m_{lkr}}&\left(\frac{W_{lk}^{(r,j)}\lambda_k+ (1-W_{lk}^{(r,j)})\lambda_l}{\lambda_l + \lambda_k}\left(1-\exp\left(-\left(\lambda^{(r)}_l + \lambda^{(r)}_k\right)\right)\right)\right)^{1\left(W_{lk}^{(r,j)}\in\{0,1\}\right)} \\ &\exp\left(-\left(\lambda^{(r)}_l + \lambda^{(r)}_k\right)1\left(W_{lk}^{(r,j)}=x\right)\right). \end{align*}\] We parametrize the rater and team specific power level as \[ \lambda_k^{(r)} = \exp(\zeta_k + \eta_r), \] where \(\zeta_k, k= 1,\ldots, n\) are the contestant-specific parameters and \(\eta_r, r=1,\ldots,K\) are the rater-specific parameters. In this parametrization the random rater effect acts multiplicatively on the contestant specific power levels. Then, \(\exp(\zeta_k)\) is interpreted as the contestant specific power level, while \(\exp(\eta_r)\) is interpreted as the rater specific random multiplicative effect. We formulate the following Bayesian hierarchical model that can be used to fit the data, \[\begin{align*} &\zeta_1,\ldots,\zeta_N \sim \mathcal{N}(0,10), \\ &\eta_1,\ldots,\eta_K \stackrel{\text{sum-to-zero}}{\sim} \mathcal{N}(0,1), \\ &\lambda_k^{(r)} = \exp(\lambda_k + \eta_r), k = 1,\ldots,n, r = 1,\ldots,K, \\ & S_{lk}^{(r,j)} \sim \text{Bernoulli}\left(\exp\left(-\left(\lambda^{(r)}_l + \lambda^{(r)}_k\right)\right)\right), \\ &W_{lk}^{(r,j)}\sim\text{Bernoulli}\left(\frac{\lambda_k^{(r)}}{\lambda_l^{(r)} + \lambda_k^{(r)}}\right), \text{when } S_{lk}^{(r,j)}=0, \end{align*}\] where \(S\) is the indicator of a tie. For our case the raters can for example be the match officials, such that there is a tendency for games officiated by one referee to either go more towards draws, or to tend towards either of the teams winning. In theory, the model could be even more flexible such that rater effect also depends on the specific contestant or team considered, but with the low count of matches that each referee officiates for each team, we would not be able to reliably fit this model.

What’s the score?

We are done with all the thinking, and now we simply need to plug in the numbers and get the result. For this analysis we use as a data sample all the matches played in the Premier League 2023/2024 season including information on who was the head match official for each game. The teams playing the matches are considered the contestants in our model framework, while match officials are considered the raters. We use rstan for the computations with 4 chains, each of 2500 warm-up iterations and 2500 posterior samples. The algorithm used is the No-U-Turn sampler variant of Hamiltonian Monte Carlo. The convergence of the program is satisfactory, meaning that the chains have mixed and effective sample sizes look good. For the sum-to-zero constraint we use a covariance structure that causes this to happen, \[ \eta \sim \mathcal{N}\left(0, \sqrt{1 - K^{-1}}^{-1}\right). \] The prior on the rater-specific offset is chosen narrow (weakly informative) as specified earlier because we believe that raters with the same background can only be so different. With this framework and these settings we obtain the below results showing first estimated power level parameters and then posterior predictive draws of point totals for each team.

Comparing with the actual Premier League Table of the 23/24 season does reveal some interesting changes. Wolverhampton apparently did much batter than their point total reveals. Manchester United goes from 8th to 6th. Tottenham should have gotten the last spot in the Champions League. And then there is the question of who is the real Premier League winner…

Posterior samples of the power level of each team in the Premier League 2023/24 season on the log scale.

Posterior predictive samples of the point totals for each team in the Premier League 2023/24 season.

The Premier League table for the 2023/24 season (source: Flashscore.dk).

We are the champions!

The results are in and the decision is clear. Arsenal are the 2023/2024 Premier League champions… That is, if we can convince the English Football Association that ours is really the best way of assessing who the winner is.

In that respect we should be aware that the above model is only valid under certain independence assumptions, one of these being that the event times, which is to say the time until a win is achieved by each team during any given match, are independent. In various applications such event times do indeed exist. An example of this could be when considering survival times for different patients, in which case they can indeed be said to be independent. In the Premier League setting however, these event times are simply a construct of our minds and do not actually exist. The importance of this violation is hard to judge, but one could imagine that given the general level of a team, matches between teams can be expected to on average fit the relative power levels of the two teams, when no additional information is available. A relevant piece of information could be whether one team in a given match up can be more satisfied with a draw than under normal circumstances, leading to a longer expected event time for both teams in the match up and a potential correlation between our mind-construct event times. This also indicates that with enough conditioning on circumstances, the degree of violation of the assumption can perhaps be lessened. Another independence assumption in the model framework is that matches are independent. This cannot be said to be generally true for football matches played in the same league, even though team managers always claim that they are only focusing on themselves before going into a game. We suspect that the violation is not so critical for the results since such circumstantial influences can go in many directions across a season.

Overall, we see that this method for assessment of power levels cannot be considered perfect. Which is then less perfect, the relative power level assessment or the 0-1-3 point system? A deciding factor for the comparison between the two methods is probably something that does not have to do with the accuracy of either method at all… You cannot easily follow the standings throughout the season with the method that we have used here. While the Bayesian paradigm allows easily for match-week-wise updates, it is not easy to know how results translate to changes in the estimated power levels. Another interesting but probably expected point that we get from the results of this analysis, is that there is substantial overlap in posterior power level distributions between many teams. This means that simulating the same season again and again would likely not lead to the same champion each time. In that regard, Manchester City’s four championships in a row are all the more impressive (or simply very unlikely to happen ever again).

We finish on the note that we should probably not forget that arbitrariness and randomness adds to the thrill of watching the Premier League every year. As for my personal experience with being an Arsenal fan, all I can say is that I am no longer disappointed with last season’s result. I know now that Arsenal was indeed the best team of the 2023/2024 Premier League season, when you look at things in a different way.