Data science is a team sport
by Alyson Wilson, professor of statistics and the Associate Vice Chancellor for National Security and Special Research Initiatives at North Carolina State University
“Data” is everywhere – we can’t turn on the television without seeing an ad about how data is going to transform our business or solve a health care mystery. About a decade ago, it was common to say that “a data scientist is someone who is better at statistics than any software engineer and better at software engineering than any statistician.” Over time, a more nuanced definition has emerged that is illustrated using the data science life cycle. Data science is the set of skills that works across the cycle from data generation, collection and processing through data storage and management to data analysis, visualization and interpretation. Many disciplines are involved in this skill set, and when data science is applied to a specific domain question, the team becomes larger.
As a statistician, I was trained with skills more on the right-hand side of the life cycle. Right out of graduate school, I went to work for a five-person company in El Paso, TX, named Cowboy Programming Resources. The company specialized in helping the Army evaluate new or upgraded air defense artillery systems. Our goal was to test how well soldiers could use the systems to accomplish their missions. As a statistician, the questions I faced were different, and in some ways more complex, than the biomedical applications I had researched in school.
We were interested in understanding how the air defense artillery systems would work in combat, which is inherently unpredictable. Soldier reactions and unit dynamics affect the outcomes. Most of the time I felt like I had both too much data and not enough data simultaneously. Our test events might last six weeks, with 400 soldiers in the field. While we had access to every radio message sent within the battalion, we could never test every mission-realistic scenario using every combination of factors and conditions. I could look at every keystroke typed, but we were simulating our key outcome, which was how the battalion performed specific actions under hostile fire.
Assessing mission performance stretched my understanding of, and thinking about, statistics. I had always thought about statistics in an experimental context, where we form a scientific hypothesis, plan data collection, gather data, analyze evidence and make conclusions. While that thinking still held, every part of it stretched. I wanted to understand mission performance, but I couldn’t really test it. I had data, but it wasn’t always about the exact thing I wanted to know. I became very interested in questions that require piecing together many different kinds of information to answer.
In many ways, I was already doing data science – 20 years before that term became popular. I find it useful to think about data science in terms of the 4 V’s: variety, volume, velocity and veracity. I was working on statistical methods to address variety, or how we combine heterogeneous information to solve problems. Within data science, statisticians also work on volume (how to use increasingly large data sets), velocity (how to make inferences from streaming data) and veracity (how to use messy data that might have been collected for something other than your problem).
I find that I am often working with interdisciplinary teams to tackle such questions, integrating distinct expertise to solve complex problems. As a statistician, I don't immediately know what Army missions entail, how to measure equipment degradation, or how a radar fails. But by working on a multidisciplinary team, I partnered with military officers, materials scientists and engineers to understand how these components of a mission work.
After working at Cowboy Programming, I moved to Los Alamos National Laboratory, where I spent much of my time helping to assess the reliability of the US nuclear stockpile. The United States stopped full-system testing nuclear weapons in the mid-1990s, but the national laboratories still provide annual estimates of stockpile reliability. From one point of view, when testing stopped, our sample size went to 0. But from another perspective, we had lots of information: data from historical tests, simulation models, subcomponent function tests, expert knowledge, degradation tests. Again, a multidisciplinary team came together to use many sources of information to answer questions.
Now at North Carolina State University, I am the Principal Investigator for the Laboratory for Analytic Sciences (LAS). LAS is a mission-oriented partnership among academia, industry and government that solves problems of interest to the intelligence community (IC). We say, only half-jokingly, that every business in the world is asking how to use data to gain strategic advantage; of course, this is of interest to the IC as well. LAS was formed because the IC recognized that much of the innovation in the big data space was occurring because businesses were asking these questions. At LAS, we build partnerships to bring together the basic research of academia with the implementation know-how of industry and the complex problems of intelligence and national security. We work on problems as diverse as data triage, or how to find the records you need from among trillions you have stored; machine learning integrity, or how to maintain workflows at scale; and human-machine collaboration, or how to make your computer more of a partner and less of a tool. SAS is a longtime partner with LAS, and our current work focuses on automating the analysis of a corpus of data with heterogeneous media, with the goal of developing a flexible modeling pipeline that can be easily tailored to the needs of the IC analyst.
The problems that we work on at LAS could not be solved without multidisciplinary collaboration. Nontraditional participants help us make sense of the data and information relevant to solving these problems. For example, one of our LAS projects, the “Social Sifter,” identifies key social media accounts that are part of a coordinated effort to spread misinformation. Experts in language, marketing, psychology and statistics helped create the interface and algorithms, which can quickly comb through volumes of online information to find these misinformation spreaders.
Data science is a team sport. As data grows in volume, velocity, variety and veracity, solving complex problems can’t be done in a silo. Multidisciplinary teams are critical to turning data into information. As statisticians, we have a critical role to play.