Probability as a game

Matteo Quartagno
Mar 2, 2020

Uncertainty, Random Variables, Sport and Probability.

When I was 18, I decided to study Math, thinking that it was possible to model everything using equations. During my undergrad studies, though, I realised that this was, at the very least, an optimistic illusion: we definitely live in a haphazard world, where no universal rule can describe associations or relationships without error.

At times I wonder whether this is just due to our limited knowledge: maybe if we had any way to model the behavior of every single quark in the world, we could really predict what will happen anywhere and at any moment with pinpoint accuracy.

But, luckily, this is not the case (yet), and what we are left with is a world that is largely subject to chance (and possibly free will, if you believe in that). Simple deterministic equations are therefore not enough to describe reality: we need to model chance as well.

Mathematicians have developed a tool to do exactly that: Random Variables. While a variable is basically anything that we might want to include in a model of reality, the term random clarifies that the specific value taken by this particular kind of variable is the result of a game of chance. To put it in simple terms, every single value that the variable might take is given a number that tells us how likely that variable is to take that specific value: this is known as a probability distribution.

A sport geek

Now, this is the first post where I am going to use something that I believe to be incredibly useful for explaining concepts in statistics and probability: sport. Ever since I was a kid, I have always loved sport. Sure, I love practising it, but here I am talking more about watching it. My father passed on to me all his passion for skiing, tennis and athletics in particular. I have often wondered why I like sports so much. Aside from my father, the rest of my family, mom and brother, never liked sports on TV, and often asked me what I found so interesting in them. After years, I think I have realised what the reason is and, at the risk of seeming a geek, or even a weirdo, I am going to disclose it: results of sport events are incredibly fun to model, predict and analyse. And this is for a simple reason: they are nothing but realisations of random variables.

Let's start from a simple example: Eliud Kipchoge is about to start running a marathon. This 35-year-old from Kenya is the best distance runner on earth, as he holds the world record, having covered 42.195 km in just 2h01m39s. Roughly the time it takes me to find the will to get dressed to go jogging. How long will it take him to run this specific marathon? We all know that just because he once ran that world-record time, it does not mean he will repeat the exact same time again. There are certain factors that we can use to predict his final time today: which marathon is he about to run? Is it one with a flat route like Berlin, where he set his record, or is it one like Venice, where he will have to run up and down the stairs of several bridges? What's the weather like? Is it hot like in Rio or moderately cold like in London? How many rabbits (i.e. pacesetters) are there? All of these questions can help us predict the final time but, as I said before, we will never be able to know every single variable that we'd need to predict the exact final time with absolute certainty. Did he sleep well? Did he train optimally in the last 6 months? How's his bowel? Did he digest breakfast?

Since we cannot know the answer to all of these questions, what we can do is treat his final time as a random variable. Rather than create a model for his exact final time, we can create one where we give every specific running time a specific probability. There will be a most likely one (generally, but not necessarily, the mean) and some that will be very unlikely (trust me, he's never going to run in 1h30, nor in 3h50). Yet the set of plausible times is still practically infinite, since time can be measured as finely as we like. This is a situation where the random variable we'd like to use should be continuous.

The most famous distribution for a continuous random variable is the normal distribution. One way to visualise a random variable is to plot its probability distribution, which for the normal distribution has the typical bell shape. How should we interpret that? The top of the bell is the most likely value. For Kipchoge's marathon we can conservatively assume that this will be around 2h05m. Then, going further away from this value, all other possible times are progressively less likely, until they become almost impossible below 2h or above 2h10m. Interestingly, the bell is symmetrical around 2h05m: from this it follows that he is exactly as likely to finish around 2h06m as around 2h04m. I might dedicate a whole post to this little miracle of a distribution in the future, but for the moment let's move on.
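To make the bell a bit more concrete, here is a minimal sketch using Python's standard library. The mean of 2h05m is the one assumed above; the standard deviation of 1.5 minutes is purely my own guess, chosen so that times below 2h or above 2h10m come out as almost impossible.

```python
from statistics import NormalDist

# Mean of 2h05m comes from the text; the 1.5-minute spread is an assumption.
MEAN_MINUTES = 125.0
SD_MINUTES = 1.5

finish_time = NormalDist(mu=MEAN_MINUTES, sigma=SD_MINUTES)

# Symmetry: the density one minute below the mean equals the density one minute above.
density_2h04 = finish_time.pdf(124.0)
density_2h06 = finish_time.pdf(126.0)

# For a continuous variable, probabilities belong to intervals, not to single values.
p_within_one_minute = finish_time.cdf(126.0) - finish_time.cdf(124.0)
p_under_2h = finish_time.cdf(120.0)
p_over_2h10 = 1.0 - finish_time.cdf(130.0)

print(f"density at 2h04m: {density_2h04:.4f}, at 2h06m: {density_2h06:.4f}")
print(f"P(between 2h04m and 2h06m): {p_within_one_minute:.3f}")
print(f"P(under 2h): {p_under_2h:.5f}, P(over 2h10m): {p_over_2h10:.5f}")
```

Changing SD_MINUTES shows how much the "almost impossible" tails depend on the spread we are willing to assume.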

Let's play a game

Most sports where time decides winners and losers can be modelled similarly: swimming, running, skiing... But then there are other types of sports, like games: my definition of "game" is a sport where team A faces team B, and what matters is only which of the two teams wins. The outcome of a match cannot take a possibly infinite number of values, like the running time of a marathon, but only 2. We therefore say that it is a binary variable.

Still, we cannot say with certainty who is going to win the match. Hence, if we want to model the result of a match, we need a binary random variable. A Bernoulli random variable is of this kind. Because of how simple the set of possible results is, we only need one number to describe it. This is p, the probability that one team, say A, will win. A basketball game and a volleyball game are good examples of games whose results can be assumed to be a random draw from a Bernoulli distribution. Similarly to the continuous example, rather than including known factors affecting the result in a mathematical model yielding the exact final result without uncertainty, we can relate them to the numbers that describe our random variable, which are known as its parameters. For the Bernoulli random variable, we just said p is the only such number. So, known factors (Which is the home team? What are the starting lineups? Are there any players missing?) can be assumed to affect the probability p, but then some uncertainty remains about the final result.
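As a rough illustration, the sketch below lets one known factor (playing at home) shift p, and then draws match results from the resulting Bernoulli distribution. The logistic formula and the numbers for base_strength and home_bonus are hypothetical choices of mine, not something prescribed by the Bernoulli model itself.

```python
import math
import random

def win_probability(base_strength, home_bonus, is_home):
    """Map known factors to p with a logistic curve (an assumed link, not from the text)."""
    score = base_strength + (home_bonus if is_home else 0.0)
    return 1.0 / (1.0 + math.exp(-score))

def play_match(p_a_wins, rng):
    """One Bernoulli draw: True if team A wins, False otherwise."""
    return rng.random() < p_a_wins

rng = random.Random(42)  # fixed seed so the run is reproducible

# Hypothetical numbers: team A is slightly stronger and gains from playing at home.
p_home = win_probability(base_strength=0.2, home_bonus=0.4, is_home=True)
p_away = win_probability(base_strength=0.2, home_bonus=0.4, is_home=False)

# Even with p known, every single match remains a random draw.
n_matches = 100_000
wins = sum(play_match(p_home, rng) for _ in range(n_matches))
observed_share = wins / n_matches

print(f"p at home: {p_home:.3f}, p away: {p_away:.3f}")
print(f"share of simulated home matches won by A: {observed_share:.3f}")
```

Over many simulated matches the share of wins settles close to p, but no single result can be known in advance.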

There are two other possible distributions that a variable can have, very similar to the Bernoulli one. If we were to repeat a certain game under the same conditions multiple times (e.g. in the NBA play-offs, ignoring the home/away advantage) then we could use a slight modification known as the binomial distribution, which requires specifying not only the probability that A wins each match, but also the number of matches N that will be played. In volleyball and basketball, exactly one team wins every match, but if we were watching a league football game (soccer if you know what the hell 50°F means), then there could be a third possible outcome: a draw. Hence, our variable would not be binary anymore, but rather ternary or, as we usually call it, categorical. A multinomial distribution is one in which the number of possible results is finite and greater than two (3 in our example), and each is given its own probability. If Brazil faces Poland, for example, the probability that Brazil wins might be 50%, a draw could be given a 40% probability and the Poles might be left with the remaining 10% probability to take the whole 3 points home.
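Both ideas can be sketched in a few lines. The first part computes binomial probabilities for a hypothetical best-of-seven series in which A wins each match with probability 0.6 (a number I made up); the second draws football results using the 50/40/10 split from the example.

```python
import math
import random

def binomial_pmf(k, n, p):
    """Probability of exactly k wins in n independent matches, each won with probability p."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)

# Hypothetical series: 7 matches, A wins each with probability 0.6.
N_MATCHES = 7
P_A_WINS = 0.6
pmf = [binomial_pmf(k, N_MATCHES, P_A_WINS) for k in range(N_MATCHES + 1)]

# A real series stops at 4 wins, but the chance of taking it equals the
# chance of winning at least 4 out of 7 if all matches were played.
p_a_wins_at_least_4 = sum(pmf[4:])

# Categorical outcome for one football match, with the probabilities from the text.
OUTCOMES = ["Brazil win", "draw", "Poland win"]
PROBABILITIES = [0.5, 0.4, 0.1]
rng = random.Random(7)  # fixed seed so the run is reproducible
draws = rng.choices(OUTCOMES, weights=PROBABILITIES, k=50_000)
shares = {outcome: draws.count(outcome) / len(draws) for outcome in OUTCOMES}

print(f"P(A wins at least 4 of 7): {p_a_wins_at_least_4:.4f}")
print(shares)
```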

Waiting and counting

Instead of just predicting who will win a match, we might want to guess the exact final result. As usual, even using all the available info, it would be impossible to know exactly how many goals Brazil will score. But what we can do is give a probability to each possible number of goals. What we need is a random variable that counts, like Poisson.

Again we'd have a most likely value (say that Brazil on average will score 2 goals per match, and Poland 1) and we could use known factors to predict how the final result will differ from that average on a specific occasion (is Neymar injured as usual? How much vodka did players drink last night?). However good the info we have, unknown factors will affect the result, and so rather than model the exact final number of goals, we'd model it as a random variable, with the Poisson distribution giving a probability to every possible number of goals each team will score.
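Here is a minimal sketch of that idea, using the averages of 2 and 1 goals from the example. It assumes, for simplicity, that the two teams' goal counts are independent of each other, which is my own simplification and certainly debatable for a real match.

```python
import math

def poisson_pmf(k, mean):
    """Probability of exactly k goals when the average number of goals is `mean`."""
    return math.exp(-mean) * mean**k / math.factorial(k)

BRAZIL_MEAN = 2.0  # average goals per match, from the text
POLAND_MEAN = 1.0
MAX_GOALS = 15     # truncation point; higher scores have negligible probability

# Probability of one exact score, assuming the two counts are independent.
p_score_2_1 = poisson_pmf(2, BRAZIL_MEAN) * poisson_pmf(1, POLAND_MEAN)

# Add up all exact scores to recover the three match outcomes.
p_brazil_win = p_draw = p_poland_win = 0.0
for brazil_goals in range(MAX_GOALS + 1):
    for poland_goals in range(MAX_GOALS + 1):
        prob = poisson_pmf(brazil_goals, BRAZIL_MEAN) * poisson_pmf(poland_goals, POLAND_MEAN)
        if brazil_goals > poland_goals:
            p_brazil_win += prob
        elif brazil_goals == poland_goals:
            p_draw += prob
        else:
            p_poland_win += prob

print(f"P(Brazil 2-1 Poland): {p_score_2_1:.4f}")
print(f"Brazil win: {p_brazil_win:.3f}, draw: {p_draw:.3f}, Poland win: {p_poland_win:.3f}")
```

The three outcome probabilities follow from the assumed averages, so they will not necessarily match the 50/40/10 split used earlier as an illustration.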

Going back to time as an important factor: in some sports, like boxing, what we might want to predict is the time until one boxer knocks the other out. Several distributions can model this, known as time-to-event distributions, the most famous of which is the exponential distribution. Technically, though, with sports like boxing, a single random variable is not enough to model (and possibly predict) the result of a match. Sure, we could do something simple like modelling the probability that one boxer wins the match. We have seen that a Bernoulli distribution would be enough for that. But if we wanted to predict the precise development of the match, we would need to combine at least four random variables: time till A is knocked out, time till B is knocked out and... the points scored by each of the two boxers. For the first two we can use exponential random variables (as they are time-to-event), and for the latter two Poisson (as they are counts). These can be combined so that if either of the two times is less than 12 rounds, the boxer who knocked the other out wins. Otherwise, the winner is the boxer scoring more points.
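The combination rule can be sketched as a small simulation. All four averages below are hypothetical numbers of mine, the four variables are assumed independent, and equal points are treated as a draw, which is one possible convention among several.

```python
import math
import random

ROUNDS = 12
# Hypothetical averages, chosen only for illustration.
MEAN_ROUNDS_TO_KO_A = 40.0  # average rounds until A is knocked out
MEAN_ROUNDS_TO_KO_B = 20.0  # average rounds until B is knocked out
MEAN_POINTS_A = 60.0
MEAN_POINTS_B = 55.0

def sample_poisson(mean, rng):
    """Draw one Poisson count with Knuth's multiplication method."""
    threshold = math.exp(-mean)
    count = 0
    product = rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count

def simulate_fight(rng):
    """Return 'A', 'B' or 'draw' for one simulated fight."""
    time_a_knocked_out = rng.expovariate(1.0 / MEAN_ROUNDS_TO_KO_A)
    time_b_knocked_out = rng.expovariate(1.0 / MEAN_ROUNDS_TO_KO_B)
    if min(time_a_knocked_out, time_b_knocked_out) < ROUNDS:
        # Whoever is knocked out first loses.
        return "B" if time_a_knocked_out < time_b_knocked_out else "A"
    # No knockout within 12 rounds: the points decide.
    points_a = sample_poisson(MEAN_POINTS_A, rng)
    points_b = sample_poisson(MEAN_POINTS_B, rng)
    if points_a == points_b:
        return "draw"
    return "A" if points_a > points_b else "B"

rng = random.Random(2020)  # fixed seed so the run is reproducible
n_fights = 5_000
results = [simulate_fight(rng) for _ in range(n_fights)]
shares = {winner: results.count(winner) / n_fights for winner in ("A", "B", "draw")}
print(shares)
```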

Hunt-and-Ski

This last example shows how random variables can be combined in all sorts of ways, somehow reflecting how things work in the actual world. Most outcomes of simple events can generally be summarised as draws from continuous, binary, categorical, count or time-to-event variables. But then, combining these together, seemingly infinite combinations can be derived. This sometimes generates so-called mixtures of distributions. The last example I am going to give is that of a sport that, as a mountain man, I love very much but that most people outside Norway or Germany are possibly not very familiar with: biathlon. This is a sport that combines two apparently unrelated disciplines, cross-country skiing and rifle shooting. Most sports derive from simple challenges: Who runs faster? Who skis faster down the mountain? Who throws a stone the farthest? Although biathlon may not seem to be such a sport, it actually originated in Scandinavia from a simple challenge: who's the best hunter on skis?

A simple way to model the final time of a biathlete is to combine the two components. Skiing time, similarly to marathon running time, can be modelled with a normal random variable. In an Individual race, for example, every athlete has their own average time to cover 20 km on skis. A normal distribution gives a probability to all times close to that average expected time.

Shooting can be assumed to follow a binomial distribution instead. Remember, this involves modelling the result of several binary events (target hit/not hit). Here, N=20 (the number of targets) and P is the probability that an athlete will hit each target. At the end of each season, data from the previous races can be used to estimate P for each athlete quite well. Still, the individual realisation that will decide the result of each specific race remains random, to the delight of bookmakers.

Finally, in order to combine the two distributions, one minute is added to each athlete's skiing time for every missed target.
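Putting the pieces together, a rough simulation of an Individual race might look like the sketch below. The two athletes and all their numbers (average skiing times, spread, hit probabilities) are invented for illustration, and skiing and shooting are assumed to be independent of each other.

```python
import random

N_TARGETS = 20         # number of targets in an Individual race
PENALTY_MINUTES = 1.0  # one minute added per missed target

def simulate_race(mean_ski_minutes, sd_ski_minutes, hit_probability, rng):
    """Final time = normal skiing time + one minute per missed target (binomial misses)."""
    ski_time = rng.gauss(mean_ski_minutes, sd_ski_minutes)
    misses = sum(rng.random() >= hit_probability for _ in range(N_TARGETS))
    return ski_time + PENALTY_MINUTES * misses

rng = random.Random(20)  # fixed seed so the run is reproducible
n_races = 20_000

# Two hypothetical athletes: a fast skier who misses more, and a slower sharp shooter.
fast_skier = dict(mean_ski_minutes=48.0, sd_ski_minutes=1.0, hit_probability=0.80)
sharp_shooter = dict(mean_ski_minutes=50.0, sd_ski_minutes=1.0, hit_probability=0.95)

fast_times = [simulate_race(rng=rng, **fast_skier) for _ in range(n_races)]
sharp_times = [simulate_race(rng=rng, **sharp_shooter) for _ in range(n_races)]

mean_fast = sum(fast_times) / n_races
mean_sharp = sum(sharp_times) / n_races
share_fast_wins = sum(f < s for f, s in zip(fast_times, sharp_times)) / n_races

print(f"average final time, fast skier: {mean_fast:.2f} min")
print(f"average final time, sharp shooter: {mean_sharp:.2f} min")
print(f"share of races won by the fast skier: {share_fast_wins:.3f}")
```

With these made-up numbers the sharp shooter is faster on average, yet the fast skier still wins a good share of the simulated races: exactly the kind of uncertainty that keeps the sport interesting.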

Chance is fun

So, to sum up, although it is tempting to try to model reality using equations that give us a single definite answer to a certain question (how long will Eliud take to run the marathon?), in most real-life problems we can only consider a limited number of factors that affect the variable we are interested in. We cannot know all the factors affecting the result, and hence we cannot create a model that tells us without doubt the result of the next race. But, if we use our available info properly, we can create a model of the probability of each possible result. This is, if you think about it, what makes life, and sports, more interesting. Imagine a championship where the same team won every time. Something like the Italian Serie A. That wouldn't be fun.
