What does it mean to be a scientist inside a political organization? What does it mean to treat data science as... a science? That was how Michael Simon, who leads the data science and engineering teams at the CIA, kicked off a talk he gave at last month's Chief Data Officer Summit in New York.
Simon, stressing that his remarks were his own and did not represent the official view of the CIA, said he was there to shed some light on the CIA's data science program. He said he hoped the information might be relevant to the corporate data chiefs in the audience.
The conference was in early December, more than a month before a visit to CIA headquarters by the new president sparked a national debate about the art and science of estimating crowd size. But even then, there was a note of urgency -- insistence -- in Simon's presentation that stood out from the raft of talks that day by data experts in the financial services, hospitality and other industries.
For data scientists, "the goal is to find the truth -- the truth as best we can possibly know it," said Simon, chief of data science for the mission center for weapons and counterproliferation at the CIA.
Data science is different from advocacy, Simon said. An attorney representing a client starts with a desired outcome, looks for all the evidence that supports it, and presents the case to a judge and jury for adjudication. A data scientist who does science that way is not much of a data scientist.
"Our job is to provide information that the decision-makers need to know, not the information that they want to hear," he said.
Political organizations being what they are, that job is a tall order. Simon presented four features of science that he believes help the team in its pursuit of truth. They are transparency, innovation, the inclusion of multiple perspectives and the quest for causality, as opposed to correlation. Here's a brief summary of these features and how he said they are applied in the CIA's data science program.
No. 1 Transparency
One of the CIA's most prominent intelligence products, Simon said, is the President's Daily Brief, a set of intelligence articles by the intelligence community that assesses some issue or event. The brief is not a "one sentence long" conclusion, Simon said. "It is filled with evidence -- filled with it," and it includes the intelligence community's assessment of the evidence. The agency's level of confidence in the conclusion is based on its level of confidence in the evidence.
"Why is that important? If there is someone who is going to rely on your conclusions, they have to understand the reasoning you have," Simon said. "And, therefore, scientists are in the business of explaining why they got the answer, not just the answer. We hold that strongly."
No. 2 Innovation
The CIA data science program has a long history of innovation, Simon said, adding: "There is not much that I can say in this forum about some of the innovative things we have done. Let's just say there is a long history of trying to estimate the quantity of things -- to be more accurate in our estimates of things."
One of the ways the data science program ensures it's making progress in this effort is to publish its methods. "If you publish it, you are less likely to make mistakes. ... And, of course, we don't want to waste anybody's time; the point is not to reinvent something that is already done."
But, as Simon noted, there is a tension between being innovative and requiring people to publish their work. "I find it rare to see someone publishing or touting their failures."
To counter this tendency to mask failure, Simon advocated that researchers publish their work in phases, documenting in just a few pages what the project is -- the parameters of the first experiment, for example -- and the results. And they should continue doing that as the project progresses.
"I have not figured out a way to get people to publish their failures, but we're getting much closer," he said. "Because if you get to phase three or four or five -- if you have all the history for all the phases -- then the record is there, and someone could follow what happened." He noted this documentation is especially important in a workforce that changes assignments frequently.
No. 3 Multiple perspectives
The need for multiple perspectives in data science is built on the assumption that there is more than one way to look at the world, Simon said.
"The literature is clear: Multiple perspectives outperform single ones. It is true in election forecasting, weather forecasting, in volcano predictions. Some may not like it, because they are strong advocates of their theories -- theories they came to having assessed the evidence," Simon said.
But data science programs -- at the CIA, but also in corporations -- need to encourage what most good scientists know in their hearts, namely that a minute, day, week or years after they present their theories, someone will come up with a better one. "So, in the back of their minds is a little bit of humility." Humility should be baked into data science programs.
Having multiple perspectives also helps prioritize what is important -- and what uncertainties are most important, he said, showing a map of the various data models used by the National Oceanic and Atmospheric Administration to predict the weather. "The red circles they have drawn are the places there is the greatest uncertainty -- where the models disagree the most."
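The idea of treating model disagreement as a measure of uncertainty can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the site names, the three sets of forecast values and the use of standard deviation as the measure of spread are hypothetical, and none of it comes from NOAA or from Simon's talk.

```python
from statistics import mean, pstdev

# Hypothetical temperature forecasts (deg C) from three made-up models
# at four made-up sites. These numbers are illustrative only.
forecasts = {
    "site_a": [21.0, 21.5, 20.8],
    "site_b": [15.0, 19.0, 11.5],
    "site_c": [30.2, 30.0, 30.4],
    "site_d": [8.0, 9.5, 7.0],
}


def ensemble_summary(forecasts):
    """Return {site: (ensemble mean, spread)}.

    Spread is the population standard deviation across the models,
    used here as a simple stand-in for "how much the models disagree".
    """
    return {site: (mean(vals), pstdev(vals)) for site, vals in forecasts.items()}


summary = ensemble_summary(forecasts)

# The site with the largest spread is where the models disagree the most.
most_uncertain = max(summary, key=lambda site: summary[site][1])

for site, (avg, spread) in summary.items():
    print(f"{site}: mean={avg:.2f} spread={spread:.2f}")
print("largest disagreement:", most_uncertain)
```

In this toy data the models nearly agree at some sites and diverge widely at one, and that one is the site a forecaster would flag for closer attention.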
No. 4 The search for causality
Scientists know that correlation does not imply causation. "You don't want to have a spurious correlation where two things are connected in some time or space, but actually a third thing is causing both -- and there is no connection," he said.
Relying on such a correlation is not only bad science; it is also dangerous. "The recipients of the information typically are not just going to sit there and do nothing ... they are going to do something about it. If they are going to do something about it, then you have to know what caused what," he said.
If he tells the president, for example, that there is an 80% chance of this or that occurring, "the president should say, 'What do we do about making that 80% a 90% -- or should we focus on the 20% instead of the 80%?'"
"If I only focus on correlation, I am not helping anybody," Simon said.
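The spurious-correlation scenario Simon describes, in which a third factor drives two otherwise unconnected quantities, can be reproduced in a short simulation. This is a sketch under stated assumptions, not anything presented in the talk: the variables `x`, `y` and `z` are synthetic, the noise levels are arbitrary, and the partial-correlation formula is one standard way to check whether a relationship survives once the third factor is accounted for.

```python
import math
import random


def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((p - mean_a) * (q - mean_b) for p, q in zip(a, b))
    var_a = sum((p - mean_a) ** 2 for p in a)
    var_b = sum((q - mean_b) ** 2 for q in b)
    return cov / math.sqrt(var_a * var_b)


def partial_corr(x, y, z):
    """Correlation between x and y after controlling for z."""
    r_xy, r_xz, r_yz = pearson(x, y), pearson(x, z), pearson(y, z)
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))


# Synthetic data (an assumption for illustration): z is the hidden third
# factor; x and y each depend on z plus independent noise, and neither
# has any direct effect on the other.
rng = random.Random(0)
n = 5000
z = [rng.gauss(0, 1) for _ in range(n)]
x = [zi + rng.gauss(0, 0.5) for zi in z]
y = [zi + rng.gauss(0, 0.5) for zi in z]

raw = pearson(x, y)               # strong, but driven entirely by z
adjusted = partial_corr(x, y, z)  # close to zero once z is controlled for

print(f"raw correlation:      {raw:.3f}")
print(f"controlling for z:    {adjusted:.3f}")
```

The raw correlation is high even though neither variable influences the other, so acting on one would do nothing to change the other. Only after the third factor is taken into account does the lack of a direct connection become visible.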