Medical History Matters in Era of Big Data

By Eric Bock

Dr. Joanna Radin lectures virtually to an NIH audience on Big Data.

Scientists involved in machine learning often don’t know the origins of the datasets they use to write and test algorithms, including where the Pima Indians Diabetes Database (PIDD) came from, explained Dr. Joanna Radin at a recent virtual NLM history talk.

“The history of the PIDD makes political and economic subjectivity visible in ways that are of enormous consequence to practitioners and participants in medical and machine learning,” said Radin, associate professor of the history of medicine and history at Yale University.

Machine learning is a sub-discipline of artificial intelligence that dates to the 1950s. Radin explained: “It focuses primarily on algorithms capable of learning or adapting their parameters based on a set of observed data without having been programmed to do so.”

In 1987, Dr. David Aha and several graduate students at the University of California, Irvine, built an archive of datasets called the UC Irvine Machine Learning Repository. It offered programmers the ability to download large, well-validated datasets needed to test algorithms.

One of the oldest files in the repository was the PIDD. It became a standard for testing how well data-mining algorithms predict diabetic status. What made the PIDD so valuable was that programmers knew it reliably predicted when people would develop diabetes.

The Pima, who refer to themselves as Akimel O’odham, which translates to “River People,” are an indigenous community. Most of the community lives on the Gila River Indian Reservation located in Arizona. In the early 1960s, the institute now known as NIDDK began conducting an epidemiological study of arthritis among the community. The researchers quickly learned that the inhabitants had one of the highest recorded rates of diabetes.

By 1965, Radin said, every resident older than 5 in the study area had been asked to participate in a longitudinal study of diabetes. Estimates suggest that 90 percent of the reservation enrolled in the study. NIH researchers later worked with computer scientists to digitize long-term patient data.

“Medical information collected has been regarded as a valuable resource for improving general knowledge about the disease,” she said.

In the 1990s, Aha, who had been collecting data for the repository, completed a postdoctoral fellowship at Johns Hopkins. It was around this time that a researcher from the Applied Physics Laboratory, located near Johns Hopkins, came into possession of the dataset.

“What surprises me is that we get from a situation where we’re using data collected from indigenous people for diabetes research and, by 2006, the dataset became so standard that it’s being used to teach people how to use R, a popular statistical software package,” Radin said. “Such uses are purely about teaching people about how to use data.”

A few years ago, she spoke to an expert in machine learning who was writing an algorithm for New York City’s electricity provider to predict where fires might spark in the underground power grid and trigger an explosion of manhole covers. To test and optimize the algorithm, the expert fed it several complex, widely available data sets, including the PIDD.

The story of how data from participants in an NIH research study on diabetes came to be used to refine an algorithm that would predict where a manhole cover would erupt “is exemplary of the history of big data,” Radin said.

Today, the use of patient data from all kinds of communities is part of the public health response to Covid-19 outbreaks around the world. Officials are using contact-tracing mobile apps to identify people who might have been exposed to the virus.

While these technologies might help limit spread of the virus, Radin said it’s important to think about their implications, given the history of big data. How, for example, will private companies or governments use mobility data in the present and future?

“Sometimes even the best of intentions are upended by the momentum of the technological systems that people find themselves in,” Radin concluded.

July 24, 2020

Vol. LXXII, No. 15

The NIH Record