Predicting World Cup 2018 with Ordinal Logistic Regression

Eugenie Chung | 7/2/2018

Topics: Regression Analysis, Statistics in the News

A score by the dominant left foot. Great passes into the penalty area. The 4-4-2 or 4-3-3 formation. The controversial yellow and red cards. The curse of penalties. There is something for everyone to enjoy in this month of world-class football.

According to a recent BBC article, England has a 4% chance of winning the 2018 World Cup. This figure comes from computer simulations in which points are awarded for each match according to the probability of a win, draw, or defeat, which is in turn derived from the ranking of each side. While this sounds disappointing, I decided to carry out some analysis of my own using past data.



Editor's note: We put together a quick poll to see who you were predicting to become the 2018 World Cup Champions. Get the analysis on the results here.


It is more or less common knowledge that the average age and average cap of a squad (a cap is an appearance for one's country in an international match) have some impact on its chances of winning, or of moving on to the next stage, in a football tournament. I therefore gathered squad data from the past 20 World Cups and calculated the average age and average cap of the teams that reached the top four positions. I also took into account the location of the tournament and the continent each team comes from. World Cup 2018 is being held in Russia, which should suit teams from European countries, who have little or no time difference to adjust to. Teams from further afield will probably need to get over jet lag and adapt to the cooler temperatures. Will this make a difference? Let’s find out.

Below is a screenshot of some of the data in Minitab worksheet.

 070318-world-cup-01

The dataset consists of the following columns.

Year: the year the tournament took place

Location: the location of the tournament, SA=South America, E=Europe, Other=countries not in Europe or South America

Team: continent of origin of a team, SA=South America, E=Europe, Other=countries not in Europe or South America

Average age: mean age of the squad

Average cap: mean cap of the squad

Position: the final position achieved by the team in the tournament

Of the 20 World Cup tournaments held so far, 10 were held in Europe, 7 in South America, and 3 in countries outside Europe and South America.

Because our response variable is the position achieved, which is discrete and ordered, we will use ordinal logistic regression. Ordinal logistic regression models the relationship between a set of predictors and an ordinal response. In our case, the response is the position obtained in the tournament: 1, 2, 3, or 4.
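For readers who like to see the model written out, one common way to express ordinal logistic regression with a logit link is the cumulative logit form below. The symbols are notation introduced here for illustration rather than taken from the Minitab output: Y is the final position, x is the vector of predictors, the θ values are one constant per cut between positions, and β holds the coefficients.

```latex
\ln\!\left(\frac{P(Y \le k \mid x)}{1 - P(Y \le k \mid x)}\right)
  = \theta_k + x^{\top}\beta ,
  \qquad k = 1, 2, 3
```

With four possible positions there are three such equations, because the cumulative probability for the last position is always 1. The coefficients β are shared across the three equations, and only the constants θ differ.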

To begin the analysis, I go to Stat > Regression > Ordinal Logistic Regression and fill in the dialog box as shown below.

 070318-world-cup-02

Minitab provides three link functions (logit, normit, and gompit), which give a wide range of models.

A link function transforms the probabilities of the levels of a categorical response variable to a continuous scale. Once the transformation is complete, the relationship between the predictors and the transformed response can be modelled as a linear function of the predictors.
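To make the idea of a link function concrete, here is a minimal Python sketch of the three transformations as they are commonly defined. The function names are chosen for this illustration, and the code is not what Minitab runs internally; it only shows how each link maps a probability between 0 and 1 onto an unbounded scale.

```python
import math
from statistics import NormalDist


def logit(p):
    """Log of the odds: ln(p / (1 - p))."""
    return math.log(p / (1.0 - p))


def normit(p):
    """Inverse of the standard normal cumulative distribution (also called probit)."""
    return NormalDist().inv_cdf(p)


def gompit(p):
    """Complementary log-log: ln(-ln(1 - p))."""
    return math.log(-math.log(1.0 - p))


# Compare the three links on a few cumulative probabilities.
for p in (0.1, 0.5, 0.9):
    print(p, round(logit(p), 3), round(normit(p), 3), round(gompit(p), 3))
```

The logit and normit links are symmetric around a probability of 0.5, while the gompit link is not, which is one reason the choice of link can change the fitted model.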

070318-world-cup-03

By default, Minitab treats the largest numeric value as the reference event for the response. However, in this case we want to focus on the factors that affect a team's chances of winning, so I changed the order from 4 3 2 1 to 1 2 3 4. I also adjusted the reference level for the categorical factors, as we want to focus on certain locations.

070318-world-cup-04

One of the key statistics in the results is the odds ratio. It compares the odds of two events. The odds of an event are the probability that the event occurs divided by the probability that the event does not occur.

Odds ratios that are greater than 1 indicate that the first event and the events closer to the first event are more likely. Odds ratios that are less than 1 indicate that the last event and the events that are closer to it are more likely.
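The definitions of odds and odds ratio can be sketched in a few lines of Python. The probabilities used here are made-up values for illustration and do not come from the World Cup data.

```python
def odds(probability):
    """Odds of an event: P(event) / P(not event)."""
    return probability / (1.0 - probability)


def odds_ratio(probability_a, probability_b):
    """Ratio of the odds of event A to the odds of event B."""
    return odds(probability_a) / odds(probability_b)


# Hypothetical probabilities, for illustration only.
p_a = 0.2   # odds of 0.25, i.e. 1 to 4
p_b = 0.5   # odds of 1.0, i.e. even odds

print(odds(p_a), odds(p_b), odds_ratio(p_a, p_b))
```

Note that odds and probabilities are different scales: a probability of 0.5 corresponds to odds of 1, and an odds ratio of 1 means the two events have equal odds.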

Looking at our data, the odds ratio for the average age predictor is 0.98. The “first” event in our case is achieving first position, in other words becoming the champion. Because the odds ratio is less than 1, the team becomes less likely to achieve this position as average age increases. More precisely, for each one-year increase in average age, the odds that the team becomes champion rather than finishing 2nd, 3rd, or 4th decrease by about 2%.

On the other hand, the odds ratio for the average cap predictor is 1.04. This implies that for every unit increase in average cap, the odds of winning the championship are multiplied by 1.04, an increase of about 4%.
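The arithmetic behind these percentages is simple enough to sketch in Python. The helper name below is made up for this illustration; the odds ratios 0.98 and 1.04 are the rounded values quoted in the text, so the results are approximate.

```python
def percent_change_in_odds(odds_ratio, units=1.0):
    """Percentage change in the odds after `units` one-unit increases in a predictor.

    Odds ratios multiply, so a change of several units raises the
    odds ratio to that power.
    """
    return (odds_ratio ** units - 1.0) * 100.0


age_odds_ratio = 0.98   # rounded value quoted in the text
cap_odds_ratio = 1.04   # rounded value quoted in the text

print(percent_change_in_odds(age_odds_ratio))       # about -2 (percent)
print(percent_change_in_odds(cap_odds_ratio))       # about +4 (percent)
print(percent_change_in_odds(cap_odds_ratio, 5))    # five extra caps on average
```

Because the effect compounds, five extra caps on average correspond to a larger change than five times 4%.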

Apart from the odds ratio, we can also assess the coefficient to determine whether a change in the predictor variable makes any of the events more or less likely.  

Positive coefficients make the first event and the events that are closer to it more likely as the predictor increases. Negative coefficients make the last event and the events closer to it more likely as the predictor increases. The coefficient for average age is negative, which means that as the average age of the squad increases, the team becomes less likely to finish in first position. The coefficient for average cap, on the other hand, is positive. Hence, as average cap increases, the team becomes more likely to win the tournament.
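With a logit link, a coefficient and its odds ratio carry the same information: the odds ratio is the exponential of the coefficient. The short sketch below works backwards from the rounded odds ratios quoted in this post, so the coefficients it prints are approximations and not the exact values in the Minitab output.

```python
import math


def coefficient_from_odds_ratio(odds_ratio):
    """Logit-scale coefficient implied by an odds ratio."""
    return math.log(odds_ratio)


def odds_ratio_from_coefficient(coefficient):
    """Odds ratio implied by a logit-scale coefficient."""
    return math.exp(coefficient)


# Rounded odds ratios quoted in the text.
age_coefficient = coefficient_from_odds_ratio(0.98)   # negative
cap_coefficient = coefficient_from_odds_ratio(1.04)   # positive

print(round(age_coefficient, 4), round(cap_coefficient, 4))
```

An odds ratio below 1 always pairs with a negative coefficient, and an odds ratio above 1 with a positive one, which is why the two ways of reading the output agree.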

For categorical predictors, the odds ratio compares the odds of the event occurring at a given level of the predictor with the odds at the reference level. In this example, the reference level is “Other”. Therefore, we can say that for a tournament held in Europe, the odds that a team achieves first place are 1.36 times the odds for a tournament held in other places. This makes sense, as a large proportion of teams in the tournament are from European countries. A small time difference to adjust to means the players can perform better.

Turning to the team predictor, the odds of becoming champion for European teams are about three times those for teams from other areas, while the odds for South American teams are about six times those for teams from other continents.
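Because both team odds ratios are measured against the same reference level, they can be divided to compare the two non-reference levels with each other. The sketch below uses the rounded figures from the text (about 3 and about 6), so the result is only a rough guide.

```python
# Rounded odds ratios from the text, each relative to the reference level "Other".
odds_ratio_europe_vs_other = 3.0
odds_ratio_south_america_vs_other = 6.0

# Dividing two odds ratios that share a reference level cancels the reference
# odds, leaving the odds ratio of South America relative to Europe.
odds_ratio_south_america_vs_europe = (
    odds_ratio_south_america_vs_other / odds_ratio_europe_vs_other
)

print(odds_ratio_south_america_vs_europe)
```

In other words, under this model the odds of a South American team becoming champion are roughly twice those of a European team, all else being equal.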

Having fitted a model, it is crucial to verify if we have obtained a good model. We can refer to the Goodness-of-fit tests table.

070318-world-cup-05

Because the p-values are high, the tests give no evidence of lack of fit, so we can treat the model as an adequate fit. Last but not least, logistic regression can be used to calculate event probabilities. An event probability is the chance that a specific outcome or event occurs. To calculate these, I rerun the analysis and store the event probabilities as shown below.

 070318-world-cup-06

Minitab stores the event probabilities in the worksheet without the need to work out the figures using the coefficients from the regression model. This gives us the probability of achieving different levels of the response for given values of the predictors. The default column names start with EPROB, followed by a number. Because we have four different outcomes, there are four columns of probabilities for each row of data.
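For the curious, here is a rough Python sketch of how four event probabilities could be worked out by hand from a cumulative logit model. The constants and the linear predictor below are invented for illustration and are not the fitted values from this analysis; the function names are also made up for this example.

```python
import math


def logistic(z):
    """Inverse of the logit link."""
    return 1.0 / (1.0 + math.exp(-z))


def event_probabilities(constants, linear_predictor):
    """Return P(position = 1), ..., P(position = K) for a cumulative logit model.

    `constants` must be in increasing order, one per cut between adjacent
    positions. `linear_predictor` is the sum of coefficient * predictor value.
    """
    cumulative = [logistic(c + linear_predictor) for c in constants]
    cumulative.append(1.0)  # the last cumulative probability is always 1
    probabilities = []
    previous = 0.0
    for value in cumulative:
        probabilities.append(value - previous)
        previous = value
    return probabilities


# Hypothetical numbers, NOT taken from the fitted model.
constants = [-1.0, 0.0, 1.0]
probs = event_probabilities(constants, linear_predictor=0.0)
higher = event_probabilities(constants, linear_predictor=0.5)

print([round(p, 3) for p in probs])
print([round(p, 3) for p in higher])
```

Each row of probabilities sums to 1, and a larger linear predictor shifts probability towards first position, which matches the way positive coefficients are interpreted above.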

070318-world-cup-07 

With this useful feature, I can also make predictions for new observations. What’s the probability of England winning? How about the chance of winning for five-time champion Brazil? Using squad data from the FIFA website, I calculated the average age and average cap for these two teams, and these are the results:

 

Country    Average age of squad    Average cap of squad
England    25.56521739             20.95652174
Brazil     28.13043478             29.86956522
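As a quick sanity check on these figures, the sketch below multiplies each average back up to a squad total. It assumes 23 players per squad (the squad size at the 2018 World Cup) and that ages and caps are whole numbers; the totals it prints are implied by the averages above rather than taken from the FIFA data.

```python
SQUAD_SIZE = 23  # assumption: 23-player squads

# Averages as reported in the table.
averages = {
    "England": {"age": 25.56521739, "cap": 20.95652174},
    "Brazil": {"age": 28.13043478, "cap": 29.86956522},
}


def implied_total(average, squad_size=SQUAD_SIZE):
    """Whole-number squad total implied by an average."""
    return round(average * squad_size)


implied_totals = {
    country: {name: implied_total(value) for name, value in stats.items()}
    for country, stats in averages.items()
}

for country, totals in implied_totals.items():
    print(country, totals)
```

Dividing each implied total by 23 reproduces the averages in the table to the precision shown, which suggests the figures are simple means over the squad.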

 

And after putting these figures into the Minitab worksheet, we can rerun the analysis with additional settings as shown.

070318-world-cup-08 

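Based on the above, the model indicates that the probability of England winning is around 0.21, or 21% (much higher than the 4% the BBC article predicts), while Brazil has a probability of 0.45, or 45%. Because the model was fitted only to teams that reached the top four, these figures are best read as the probability of winning given a top-four finish.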

I don't know how far England will go, but I will definitely enjoy the drama unfolding on the pitch!