In the ongoing saga of American politics, we the voters have seen some pretty improbable things this election cycle. To many, the starkest instance of the improbable has been Donald Trump’s rise to presumptive nominee for the GOP. But I won’t be dwelling on that situation itself – instead I want to discuss the state of data journalism in the wake of this campaign season.
Nate Silver, the face of political data journalism today, has been under constant fire this election cycle because of his early discounting of Trump’s odds of securing the Republican nomination. In the months leading up to the primaries, Silver gave Trump less than a 10 percent chance of getting the nomination, and that held right up until the first primaries, when Silver acknowledged the need for model corrections.
Two weeks ago, Nate Silver released his mea culpa detailing how and why he gave Donald Trump the odds he did early on in the campaign. To summarize the reasons: between the lack of high-quality historical data, the small sample of similar ‘outsider’ candidates, and the intrinsic difficulty of predicting a single rare event, Silver didn’t feel the data were good enough to develop a model. Thus, the forecasts that Silver and his website, FiveThirtyEight, presented were based on more subjective parameters – by his own admission, punditry superseded their usual methodologies. So if they had relied on their usual techniques with the data that was available, would the predictions have been much different? Perhaps, perhaps not.
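To get an intuition for why a small sample of comparable candidates is such a problem, here is a rough sketch (not FiveThirtyEight’s actual method, and with entirely made-up numbers) of how the uncertainty around an estimated probability balloons when you only have a handful of historical cases to learn from. It uses the Wilson score interval, one standard way to put error bars on a proportion:

```python
import math

def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a binomial proportion (95% by default)."""
    if trials == 0:
        return (0.0, 1.0)
    p_hat = successes / trials
    denom = 1 + z**2 / trials
    center = (p_hat + z**2 / (2 * trials)) / denom
    margin = (z * math.sqrt(p_hat * (1 - p_hat) / trials
                            + z**2 / (4 * trials**2))) / denom
    return (max(0.0, center - margin), min(1.0, center + margin))

# Hypothetical numbers for illustration only: suppose 1 of 5 historically
# comparable 'outsider' candidates won a nomination, versus 100 of 500
# candidates in a much larger (imaginary) reference class.
for wins, candidates in [(1, 5), (100, 500)]:
    lo, hi = wilson_interval(wins, candidates)
    print(f"{wins}/{candidates}: point estimate {wins/candidates:.0%}, "
          f"95% interval roughly {lo:.0%} to {hi:.0%}")
```

Both samples give the same 20% point estimate, but with five cases the plausible range runs from a few percent to well over half, while with five hundred it narrows to a few points either way. When your error bars are that wide, a “model” is barely saying anything at all, which is roughly the bind Silver described.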
The question I want to pose is this: given the complex decisions that analysts make, how can data journalism best serve the public interest? While we would like to think that data is purely objective and above reproach, that’s usually not the case – analyzing data and constructing models is an act of finding a story to tell, and that story necessarily carries all the premises and assumptions of its author. The real kicker is that the author must then distill and contextualize the results of the model, because people are just not good at intuitively grasping probability and statistics.
Now we have a game of telephone, with a story that passes from data to model to analyst to reader, and at every step we expect each layer to carry enough nuance and clarity to relay the message wholly and completely, without bias. The unfortunate truth is that data journalism often offers as much or more opportunity than classical journalism for bias to creep into reports – people have biases, models have biases, and much of the time even the data have biases. And while most in the field seek to minimize those concerns, there’s no getting away from them completely.
So what is to be done about this? When typical punditry is allowed to editorialize and spin at every stage, must data-driven reporting be its antithesis? Statistics are not some arcane tools reserved for an elite few, so perhaps it is naïve to think that the act of applying mathematical formulae makes one more credible. And yet, it does create the expectation of credibility.
While we would like responsibility to rest solely with the writers and analysts who develop these stories, a more cynical position points to a different suggestion: maybe readers should be expected to be critical for the right reasons. Readers need to understand that the occasional incorrect prediction will happen, even given pristine data and well-developed methods. Sometimes the mistakes will be in good faith and sometimes they won’t, but perhaps readers share in the duty to search for the truth: to receive information, ask critical questions, and then decide what to do with the results. Then maybe readers across the country would be a little less quick to vilify a method or a person because of the occasional bad prediction, or because a story in the data was at odds with their political views.
Though maybe that’s just a little too optimistic?