NIH Record
Vol. LXIV, No. 15
July 20, 2012

Butte Mines Public Databases for Therapeutic Gold

Dr. Atul Butte spoke at NIH on June 20 in the final Wednesday Afternoon Lecture of the season.

In the “olden days,” scientists first had hypotheses, then went out and collected measurements to test their ideas. But in an era when data is pouring in by the zettabyte (1 sextillion bytes), it makes increasing sense to say, “The data’s there already, we just need to ask intelligent questions of it.”

That’s precisely the approach taken by Dr. Atul Butte of Stanford University, who on June 20 dazzled a Wednesday Afternoon Lecture Series audience with at least three characteristics: an entrepreneurial zeal (he has started or consulted for dozens of companies, leading him to quip at the outset, “Therefore you can’t believe a word I say.”); a rapidity of speech that must certainly have been tutored by the Internet’s blazing speed; and an uncanny faith in the cleverness of high school students to make the most ambitious use of the many Everests of data now piling up around the globe.

Butte repeatedly made the point that smart teens, unafraid of scrounging about the Internet backyard in which they grew up, can out-research the tenured classes, and he has the data to prove it: five high school interns who have passed through his Stanford lab have placed in the top 300 in such prestigious science contests as those sponsored by Intel, Westinghouse and Siemens.


“We are in the middle of a data deluge,” he said. “We already have the measurements in many fields…we have so much data that the new magic is in figuring out ‘What’s the cool question I want to ask of this data?’ That’s 99 percent of the work in my lab now.”

Butte is an unabashed fan of NIH, especially its data repositories at the National Center for Biotechnology Information. “Without NCBI, my research career would not exist,” he said. He is also an alumnus of the summer program here—he was a summer student in DCRT (now CIT) in 1991—and lived on campus at the Cloister from 1993 to 1994 as a participant in the HHMI-NIH Research Scholars Program. He now holds 5 NIH grants and participates on 11 others, from 9 institutes/centers.

“Data-driven science is the next big scientific revolution,” he declared, displaying a prop—a 96-well Affymetrix microarray that can perform such feats as sequencing genomes. But even that chip is already 15 years old, harkening to an era when complete human genomes bore billion-dollar price tags. Butte says we’re now looking at a $33 genome, and more likely free genetic sequencing done by companies to whom it will be worth more to have the data than to bother charging people to get it.

“It’s amazing how much data we have on the Internet,” he enthused, “just from this one high-throughput modality.”

Butte reported that, in the U.S. repository alone, there are 761,000 publicly available microarrays, and another 213,000 in global repositories. “Suffice it to say, it’s growing like crazy. Soon there will be 1 million publicly available microarrays—that’s up from zero in 2002.” The number is now doubling every 2 to 3 years, and is actually just slowing down to the rate of Moore’s Law; it had been tripling for years, Butte said.

A high school science fair entrant can now download more than 31,000 samples of breast cancer data, representing more than 1,000 independent experiments on breast cancer, Butte noted, “and it’s almost as easy as downloading a song on iTunes…That’s more samples than any one researcher will ever have in their lab, and the same is true for hundreds of diseases.”

Butte sees his role as empowering the next generation of scientists with the best questions to ask of that data.

“The entire Framingham Heart Study is now online,” he said. “You can download 14,000 people’s genotypes. You can download 10 or 20 years’ worth of data…Sitting in there might be the next big diagnostic for disease that [researchers] just haven’t thought about looking at yet.”

When Butte’s lab set out to discover and validate a potential serum marker for acute myelogenous leukemia, they could have done old science—put up posters and flyers around the medical center asking for serum and plasma, and fill out lengthy forms—or hit the easy button, Google.

They chose the latter and found a company, ConversantBio, offering exactly what they needed, at a cost of $55 per patient. “So we bought them all, and we validate our markers this way…I love biobanks and biorepositories,” he said.

Validation methods are increasingly commoditized, Butte said, by companies with names like “These companies are competing [for scientists’] business. It’s easier than [shopping at]”

Butte’s team is effectively outsourcing experiments. Often, the data comes so cheaply that he can answer quality concerns simply by ordering the same experiment to be done by two different companies, then comparing. He insists that three of the four steps in the translational research pipeline are now commoditized. However, “nobody’s ever going to outsource asking good questions. That will never go out of style. That’s all we do in my lab, in fact. You can buy all the rest.”

Butte’s home-run example of his signature informatics approach involved a review of 130 experiments focusing on 3 species and 4 tissues, looking for a common element in type 2 diabetes, a global health problem (an estimated one-third of all children born in the U.S. since 2000 will get it). “We still need new therapies for it and we still don’t know how you get it,” he said.

To their surprise, they found a pair of genes associated with low blood sugar that could become therapeutic targets.

“Sitting in public databases are many findings like this,” he enthused. “The kids call it ‘crowdsourcing’ to ask the Internet to help with your project. We could call this ‘retroactive crowdsourcing,’ getting help from scientists from the work they’ve already done.”

Butte decried the notion that “if it’s free, and it’s on the Internet, it must be valueless. Especially at a site like NCBI, this just isn’t true.”

He sees the fields of environmental studies and epidemiology as especially ripe for an informatics approach. If one wants to consider the environmental causes of disease, no fruit hangs lower than NHANES (the National Health and Nutrition Examination Survey, a program of studies designed to assess the health and nutritional status of adults and children in the United States). Butte practically salivated, “All that data publicly available for you to do the kind of science you want to do with it…”

Butte also thinks we’re on the verge of EWAS (environment-wide association studies) that would parallel GWAS, the genome-wide association studies used in genetic surveys. Since whole-patient genomes are on the near horizon (“It will be faster than Jiffy Lube,” he predicted. “By the end of the decade it will cost about $33. You’d pay more than that to park in Bethesda.”), the new challenge for medicine will be “how can I compensate for my genome? The environment will be the new prescription for the physician.”

Butte thinks “risk-o-grams” will become available, combining the contributions of both nature and nurture to one’s likelihood of falling ill. “But we have not yet found the gene for compliance in medicine,” he joked.

In an era when undergraduate students have 69 complete freely available human genomes to interrogate over on, “we can either get with the program,” Butte concluded, “or just be scared of this.”

The entire talk can be viewed at http://videocast.
