Polls, operating characteristics, USA and evaluating success
After the recent US elections, one familiar refrain occupied social media platforms: "polls have failed once again!". But did they? How do we establish that? Polls use statistical methods, and it's never easy to evaluate the performance of a statistical technique based on a single outcome.
Disclaimer: I am not an expert in polls myself, so take my words with a pinch of salt, or more generally as comments on drawing inference from data using statistics. Let's try and do a quick Q&A:
Q: Why do polls need statistical methods? A: This question is less obvious than it might seem. In my experience, most lay people think that polls use statistics because they return percentages. Of course, calculating a percentage has little to do with actual statistics, and can be done by pretty much anyone. Statistical methods are instead useful in polls for two main reasons: they help in taking representative samples and in communicating the uncertainty around the results. What do we mean by "taking representative samples"? There is a population, in this case that of American citizens who are eligible to vote. In order to know who will win the elections, we should ask every single member of this population what they are going to do on election day. This is clearly impossible in practice, so the idea is to contact, somehow, a smaller group of people that is representative of the whole population. This gives an estimate of what the final result of the elections will be. What polls have in common with most statistical problems, though, is that communicating their results as a single number (52.12345%) is not really enough. We need to communicate the uncertainty around that number. This is for the same reason that, if you were given a box full of sand and asked to guess how many grains it contains, it would be almost impossible to guess the exact number. Giving an interval for the correct answer would be much less hopeless, and much less dependent on chance alone.
Q: What makes a poll good? A: To keep things simple (perhaps too simple), there are, broadly, three properties of interest:
1) Predictive ability. Did the poll rank the candidates correctly? For US elections, we need to look at the individual states and check whether polls correctly predicted who was going to win their electoral votes.
2) Precision. As we hinted above, it is virtually impossible to guess the exact number of votes a candidate will get. As usual, what we can do is give an interval that is likely to contain the true number. For example, it makes little sense to say "according to our poll, 51% of Americans like robot wars", as this gives way too much certainty to our mean estimate (51%). What we usually give instead is an interval. In survey language, this is usually known as the margin of error. If this is, say, 3%, after our poll we are (usually 95%) confident that the true proportion lies somewhere between 48% and 54%.
The smaller the margin of error, the more information we think we gain from our poll. How do we make a poll more precise? The simplest way is to increase the sample size. Of course, the larger the sample, the more difficult and expensive the poll becomes. In any case, it is not just that the interval should be ever smaller: it should also be correct, in the sense that it appropriately reflects all sources of uncertainty around the true value. (A small numerical sketch of how these intervals are usually computed follows this list.)
3) Representativeness. It is not enough to take a larger and larger sample to get good results. If we asked only Democrats about their voting preferences, we would have an extremely precise estimate, but it would not be representative of the voting preferences of our target population, the whole American electorate. On top of having a small margin of error, we need our estimate to be unbiased. This means that, were we to hypothetically repeat our poll an infinite number of times, the mean of all the repetitions would match the true population value. How do we achieve that? By making sure that every single member of the population has exactly the same probability of being sampled (well, things are a tiny bit more complicated in most actual polls, but there is no time to go into that here).
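To make the margin of error concrete, here is a minimal sketch in Python of how the usual 95% interval for a polled proportion is computed. The numbers are made up purely for illustration, echoing the "robot wars" example above.

```python
import math

def poll_interval(p_hat, n, z=1.96):
    """95% confidence interval for a polled proportion.

    p_hat: proportion observed in the sample (e.g. 0.51)
    n:     number of respondents
    z:     normal quantile (1.96 for 95% confidence)
    """
    margin = z * math.sqrt(p_hat * (1 - p_hat) / n)  # the margin of error
    return p_hat - margin, p_hat + margin

# Hypothetical poll: 51% of 1,000 respondents like robot wars
low, high = poll_interval(0.51, 1000)
print(f"Estimate 51%, 95% CI: ({low:.1%}, {high:.1%})")  # roughly (47.9%, 54.1%)
```

Since the margin shrinks with the square root of the sample size, halving it requires roughly four times as many respondents, which is one reason precision is expensive.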
Q: How do we establish whether polls were correct?
A: Predictive ability might seem, at first sight, the most important property. However, it is an example of one of the biggest enemies of statisticians: dichotomisation. The poll does not simply return a binary answer to the question "who will win?", but much more. So it seems silly to throw away all the remaining information and focus on a single detail: the sign of the difference between the estimated shares. Also, in a lot of cases who will win is quite clear, and there is possibly no need to even run a poll to know it. If we ran a poll asking a sample from the whole British population "is your name Joanna?" and got 40% Yes and 60% No, "No" would correctly be the winning answer, as the vast majority of Britons are not called Joanna, but the poll would nevertheless have been terrible at achieving its goal of representing the whole British population.
So, in order to evaluate whether polls worked, it is also important to verify whether they were representative, and hence unbiased, and whether their margins of error appropriately reflected all sources of uncertainty.
Q: So, we are ready to ask the ten million dollar question: did polls work this time?
A: Answering this question is never easy, mainly for one reason: the only theoretically 100% accurate way of answering it (under the frequentist framework, which sees each experiment as one of an infinite sequence of possible repetitions of the same experiment) would be, for every pollster, to repeat the final poll an infinite number of times and check whether 95% of the time the resulting interval includes the true vote shares. This is of course impossible, as we only have a single repetition of each poll. What we can do is:
1) Evaluate whether, in the long run, polls by that specific pollster on different subjects cover the true value correctly, i.e. 95% of the time (the short simulation after this list shows what this coverage property looks like in practice). However, each poll covers a different event, and any of them could go wrong at some point. If I have always cooked an amazing pumpkin pie these past years, it cannot be excluded that this year I might forget it in the oven and burn it.
2) Aggregate the results of all the final polls from all pollsters. Of course, this is not ideal, as including a few terrible polls could make the good ones look bad. But if our goal is to evaluate whether the general sentiment that polls failed has any foundation, rather than whether a specific poll was well done, then this is perhaps the best we can do.
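Since the frequentist idea of "repeating the poll an infinite number of times" keeps coming up, here is a small simulation sketch in Python (the true vote share and sample size are arbitrary choices for illustration) showing what the 95% coverage property means in practice: if the poll is unbiased and the margin of error is right, roughly 95% of hypothetical repetitions produce an interval containing the truth.

```python
import math
import random

def simulate_coverage(true_p=0.513, n=1000, n_polls=2000, z=1.96):
    """Repeat a hypothetical poll many times and count how often the
    95% interval covers the true proportion."""
    covered = 0
    for _ in range(n_polls):
        # each respondent independently "votes" with probability true_p
        votes = sum(random.random() < true_p for _ in range(n))
        p_hat = votes / n
        margin = z * math.sqrt(p_hat * (1 - p_hat) / n)
        if p_hat - margin <= true_p <= p_hat + margin:
            covered += 1
    return covered / n_polls

print(simulate_coverage())  # should print something close to 0.95
```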
So, let's start with polls on the nation-wide popular vote. These are perhaps the least relevant, but the easiest to evaluate. I took the last 30 polls published in the week leading up to the elections for which the margin of error was available. First, I looked at the confidence interval for the Democratic vote share, ordering the polls from the lowest to the highest predicted proportion and comparing them against the observed result (indicated by the vertical bar at 51.3%).
Results are actually not that bad. Roughly, the intervals are distributed symmetrically around the observed proportion. The mean estimate is quite close to the final result (51.0% vs 51.3%), suggesting no bias. A few more polls than expected failed to cover the right value, though (five, while the expected number was one or two). Given the lack of bias, this might simply be due to slightly too narrow margins of error, possibly not taking into account all sources of variability correctly. For example, these polls were the last carried out before the elections, but some of the interviewees were still contacted about a week before Election Tuesday, and hence the variability in voting preferences over a week apart might not have been accounted for properly. (To be clear, this is pure speculation!) What is much more striking, though, is the corresponding plot for Republicans:
Here, polls were clearly biased (mean across polls = 44.2% vs election result = 46.9%), with most not even covering the correct vote share and none with a mean estimate larger than the observed one. This suggests that the main problem with the polls was estimating the proportion of Trump voters. One caveat is that some of the polls included "undecided" as a possible answer. Hence, we expect both Democrats and Republicans to overperform relative to the polls, by gaining some of the votes of the undecided. We will come back to this point later.
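For transparency, the kind of check behind the plots above is easy to reproduce. Here is a minimal sketch in Python; the poll numbers below are placeholders purely for illustration, not the actual polls I used, and only the 51.3% observed share comes from the discussion above.

```python
# Each poll: (estimated Democratic share, margin of error), both in %.
# Placeholder values for illustration only, not the real 30 polls.
polls = [(50.0, 3.0), (52.0, 2.5), (49.0, 3.5), (51.5, 3.0)]
observed = 51.3  # nation-wide Democratic vote share

misses = [(est, moe) for est, moe in polls
          if not (est - moe <= observed <= est + moe)]
mean_estimate = sum(est for est, _ in polls) / len(polls)

print(f"{len(misses)} of {len(polls)} intervals miss the observed {observed}%")
print(f"mean estimate across polls: {mean_estimate:.1f}%")
```

With roughly 95% intervals, we would expect only one or two misses out of 30 polls; counting the misses and comparing the mean estimate with the observed result is essentially all the plots above do.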
What about state-wide polls? Answers can be very different depending on the specific state. For example, this is what we see with the results of polls in California, aggregating the results for Dems and GOP and sorting by Democratic vote share:
Note that far fewer polls were conducted in California than in the US as a whole, so these last 10 polls span quite a wide period (starting from late Summer), and we should expect their margins of error not to reflect the variability in voting preferences over such a long time frame. Overall, though, the results of the California polls are at least compatible with the final results. What about Florida?
Here, the Democratic vote seems to have been clearly overestimated, something we cannot just put down to the "undecided" making up their minds. Most polls still gave Biden as the winner, even though towards the end it was clear that we were heading towards a 50-50 race. Yet, in the end, the difference between the two candidates was quite substantial, and in favor of the underdog.
Q: So, stop talking, and just answer the question: were polls off or not?
A: It depends on the polls. In certain states (e.g. California) they seemed generally accurate; in others (e.g. Florida) much less so. Nation-wide, the share of Democratic votes was predicted quite accurately, but Republicans clearly overachieved. The problem, of course, is that the outcome of the elections is really decided by a handful of states, and those were often the ones with the biggest errors. Still, Biden's lead in most of them was such that the polls' ability to predict the final winner of the elections was still good.
Q: So what are the possible reasons (some) polls were wrong?
A: The day after the elections, I launched a Twitter poll asking what were the likely reasons for the failure of some of the polls. Of course the irony of a poll about polls does not escape me, and it is also obvious that Twitter polls do not rank very well in terms of methodology. They are not very representative (people interested in what I have to say might not, on average, be poll experts, but you are all fantastic people, the best in the world, I swear!) and they report results as a single estimate, with no uncertainty. As a further note of caution, I am not exactly an influencer, so the sample size is pretty limited: 33 votes only! Nevertheless, below are the results and what I mean by each of the options:
1) "Silent Trump" voters: these are people who were not honest with poll interviewers for whatever reason, but then ended up voting red. Often, this happens when people feel "ashamed" of their voting preference, because media bias tends to give more voice to the opposite view or because of a sense of "guilt" (e.g. someone voting red in a historically super-blue family, or vice-versa). This always happened in Italy with Silvio Berlusconi, who was a controversial figure because of his scandals and multiple trials, consistently under-performed in polls but managed to remain the most influential figure of Italian politics for about 20 years. Trump has many things in common with Berlusconi, and the fact that in both the last elections he got more votes than most predicted suggests this might be one such thing. Another possible explanation that would go under this label is for example Republicans messing up with pollsters because of low trust in their unbiasedeness.
2) "MNAR data": one day (threat follows) I will do a whole post on missing data, a topic I am pretty familiar with, as I did my Ph.D around it! For the moment, let's just say that the MNAR acronym stands for Missing Not At Random. What does it mean?
Inevitably, a lot of the people pollsters try to contact are never reached. Some turn down calls, some refuse to answer the questions, some have changed their phone number, and so on.
If people are no more likely to be missing based on any of their characteristics, then the data are Missing Completely At Random, and we can do pretty much nothing about it and still get good results.
If missing people are systematically different from the observed ones, but the difference can be predicted from their known characteristics (age, gender, job, etc.), then the data are Missing At Random; we have to use more sophisticated analysis methods, but we can recover the true answer.
Finally, if the people who are missing are systematically different from those observed, and this is not explained by the values of some observed data, then we are pretty much doomed: the data are Missing Not At Random, and there is nothing we can do other than play with the numbers, making assumptions such as "what if missing people were systematically this different from observed ones?".
In polls, the kind of bias arising from data being MNAR is often called non-response bias. One can try to use all the available information to mitigate it, making sure analyses are valid under the Missing At Random assumption, but we can never be sure we have used all the necessary information. We can check whether there is a difference in age, gender, job, or place of residence between responders and non-responders, and adjust the analysis accordingly. But we cannot adjust for residual differences between the two groups that are unexplained by the available data. For example, people who do not respond are more likely to be older, live outside the cities, and so on. If polls are done on-line, some people might even lack an internet connection, or be old enough not to know how to use a smartphone! If we have enough data to recover the likely biases that a naive analysis ignoring differences between responders and non-responders would introduce, then fine.
Otherwise, think of the example I gave before of Republicans misleading pollsters. If, instead of misleading them, they simply decided not to answer, then we would have a clear case of MNAR data. Adjusting for age and other characteristics would be useful to mitigate the bias (because, e.g., older people are more likely to vote conservative), but not to eliminate it.
"Undecided" can also be considered missing data in a poll. If all people who answered that they were undecided later chose to vote GOP, data would be missing not at random and we have seen that, apart from key states like Florida, this explanation alone could (potentially) justify differences between polls and election results.
3) "Sampling issues": this is a quite vague definition, that can encompass lots of situations. One example is people that are "hard to reach" not being sampled and represented enough in on-line polls. More generally, it represents any issue related with trying to obtain a sample as representative as possible. Not saying that this has been the case here, but Florida is a state with a lot of immigration from Latin countries, particularly Cuba. If sampling is done according to population characteristics too old to reflect the current demographics, this can bias the results in favor of the more traditional ethnicities compared to the ones of the more recent immigrants. In general, this is quite a more technical possible issue, one that is less easy to investigate for lay people (like myself!) but that is perhaps the most interesting for experts. This is because you can always improve your sampling techniques to do better polls, but you can hardly do anything if interviewees lie to you or do not pick up the phone, other than dress yourself as a seer and guess how much they are going to lie/refuse to respond.
4) "They were not off": as it is probably clear from this blog post, my view is that there was indeed some issue with the polls this time around. A devil's advocate would probably say that polls are explicitly done to reflect the voting preference at a specific point in time, that is different from election day, and so they will not reflect variability satisfactorily from a frequentist perspective. Further, they might argue I did not exclude from my analysis the "undecided", who could considerably move the estimates. But personally I do not find these explanations convincing enough. What I agree with, though, is that it is not a very valid conclusion to state that polls are useless, and should not be looked at anymore. They cannot tell the whole story, otherwise we wouldn't even need an election. But they are an important piece of information, that can be used in conjunction with any other available information in order to predict the outcome of an election. Knowing that they can fail, gives us a warning and tells us we should simply weight them slightly differently next time around, but I am far from convinced we should just abandon them.
A final consideration to conclude: there are, of course, those who think polls are just rigged. I will never understand this reasoning. I know nothing about behavioral science, but as a sample of 1: in a clearly two-party electoral system like the American one, surely, as a voter, I would be more likely to turn out if I knew that my party was trailing, rather than leading, in the polls!