“We are in the middle of a data deluge,” he said. “We already have the measurements in many fields…we have so much data that the new magic is in figuring out ‘What’s the cool question I want to ask of this data?’ That’s 99 percent of the work in my lab now.”
Butte is an unabashed fan of NIH, especially its data repositories at the National Center for Biotechnology Information. “Without NCBI, my research career would not exist,” he said. He is also an alumnus of the summer program here—he was a summer student in DCRT (now CIT) in 1991—and lived on campus at the Cloister from 1993 to 1994 as a participant in the HHMI-NIH Research Scholars Program. He now holds five NIH grants and participates on 11 others, from nine institutes and centers.
“Data-driven science is the next big scientific revolution,” he declared, displaying a prop—a 96-well Affymetrix microarray that can perform such feats as interrogating entire genomes. But even that chip is already 15 years old, harking back to an era when complete human genomes bore billion-dollar price tags. Butte says we’re now looking at a $33 genome, and more likely free genetic sequencing done by companies that will find the data worth more than any fee they could charge people for it.
“It’s amazing how much data we have on the Internet,” he enthused, “just from this one high-throughput modality.”
Butte reported that, in the U.S. repository alone, there are 761,000 publicly available microarrays, and another 213,000 in global repositories. “Suffice it to say, it’s growing like crazy. Soon there will be 1 million publicly available microarrays—that’s up from zero in 2002.” The number is now doubling every 2 to 3 years, and is actually just slowing down to the rate of Moore’s Law; it had been tripling for years, Butte said.
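As a back-of-the-envelope check on those growth figures, a short sketch is below. The totals (761,000 plus 213,000 arrays) and the 2-to-3-year doubling interval come from the talk; the steady-exponential-growth assumption and the function name are illustrative, not from the source.

```python
import math

def years_to_reach(current, target, doubling_years):
    """Years until `current` grows to `target`, assuming the count
    doubles every `doubling_years` years (steady exponential growth)."""
    return doubling_years * math.log2(target / current)

# Totals quoted in the talk: U.S. repository plus global repositories.
total = 761_000 + 213_000  # 974,000 publicly available microarrays

# Time to cross the 1-million mark at a 2-year doubling time
# (Moore's-Law pace) versus a 3-year doubling time.
t_fast = years_to_reach(total, 1_000_000, 2)
t_slow = years_to_reach(total, 1_000_000, 3)
print(f"{total:,} arrays today; 1 million in {t_fast:.2f}-{t_slow:.2f} years")
```

Either way, the crossing happens within a few months of the talk, which is consistent with the “soon there will be 1 million” claim.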
A high school science fair entrant can now download more than 31,000 samples of breast cancer data, representing more than 1,000 independent experiments on breast cancer, Butte noted, “and it’s almost as easy as downloading a song on iTunes…That’s more samples than any one researcher will ever have in their lab, and the same is true for hundreds of diseases.”
Butte sees his role as empowering the next generation of scientists with the best questions to ask of that data.
“The entire Framingham Heart Study is now online,” he said. “You can download 14,000 people’s genotypes. You can download 10 or 20 years’ worth of data…Sitting in there might be the next big diagnostic for disease that [researchers] just haven’t thought about looking at yet.”
When Butte’s lab set out to discover and validate a potential serum marker for acute myelogenous leukemia, they could have done old science—put up posters and flyers around the medical center asking for serum and plasma, and filled out lengthy forms—or hit the easy button: Google.
They chose the latter and found a company, ConversantBio, offering exactly what they needed, at a cost of $55 per patient. “So we bought them all, and we validate our markers this way…I love biobanks and biorepositories,” he said.
Validation methods are increasingly commoditized, Butte said, by companies with names like AssayDepot.com. “These companies are competing [for scientists’] business. It’s easier than [shopping at] Amazon.com.”
Butte’s team is effectively outsourcing experiments. Often, the data comes so cheaply that he can answer quality concerns simply by ordering the same experiment to be done by two different companies, then comparing. He insists that three of the four steps in the translational research pipeline are now commoditized. However, “nobody’s ever going to outsource asking good questions. That will never go out of style. That’s all we do in my lab, in fact. You can buy all the rest.”
Butte’s home-run example of his signature informatics approach involved a review of 130 experiments focusing on 3 species and 4 tissues, looking for a common element in type 2 diabetes, a global health problem (an estimated one-third of all children born in the U.S. since 2000 will get it). “We still need new therapies for it and we still don’t know how you get it,” he said.
To their surprise, they found a pair of genes associated with low blood sugar that could become therapeutic targets.
“Sitting in public databases are many findings like this,” he enthused. “The kids call it ‘crowdsourcing’ to ask the Internet to help with your project. We could call this ‘retroactive crowdsourcing,’ getting help from scientists from the work they’ve already done.”
Butte decried the notion that “if it’s free, and it’s on the Internet, it must be valueless. Especially at a site like NCBI, this just isn’t true.”
He sees the fields of environmental studies and epidemiology as especially ripe for an informatics approach. If one wants to consider the environmental causes of disease, no fruit hangs lower than NHANES (the National Health and Nutrition Examination Survey, a program of studies designed to assess the health and nutritional status of adults and children in the United States). Butte practically salivated, “All that data publicly available for you to do the kind of science you want to do with it…”
Butte also thinks we’re on the verge of EWAS—environment-wide association studies—that would be like GWAS genetic surveys. Since whole-patient genomes are on the near horizon (“It will be faster than Jiffy Lube,” he predicted. “By the end of the decade it will cost about $33—you’d pay more than that to park in Bethesda.”), the new challenge for medicine will be “how can I compensate for my genome? The environment will be the new prescription for…”
Butte thinks “risk-o-grams” will become available, combining the contributions of both nature and nurture to one’s likelihood of falling ill. “But we have not yet found the gene for compliance in medicine,” he joked.
In an era when undergraduate students have 69 complete freely available human genomes to interrogate over on CompleteGenomics.com, “we can either get with the program,” Butte concluded, “or just be scared of this.”
The entire talk can be viewed at http://videocast.