NIH Record
Vol. LXVII, No. 10
May 8, 2015

Pi Day
‘Doing the Math’ Adds Up to Big Genomic Discoveries

Dr. Eric Lander speaks at Pi Day 2015 at NIH.

Never underestimate the power of math. It turns out that much of what we know and are learning about the human genome relies on computational mathematics. Equations and algorithms are helping researchers decode evidence across the huge genomes of many species and have even debunked data from earlier scientific experiments. But much more data—from ever-larger population samples—will be needed to truly drive biomedical progress, said Dr. Eric Lander, founding director of the Broad Institute of MIT and Harvard and one of the early architects of the Human Genome Project.

A question that underlies genomic research is: “What’s the math under the hood?” said Lander, who delivered the inaugural NIH Data Science Lecture on Pi Day eve, Mar. 13, in the Porter Neuroscience Center.

When researchers began decoding the genetics of disease 35 years ago, said Lander, “The idea was that somewhere in this massive human genome of 3 billion bases, there would be 1 base that would be wrong and that would explain the problem.” But how would they find that 1 base?

Having all of these data was one thing; deciphering them was another, and that’s where the math came in. “All the nucleotides in the world and all the polymerases…wouldn’t have made a difference but for a lot of the math under the hood,” said Lander. For example, in assembling a genome, “I have zillions of little fragments of DNA and I’ve got to figure out which ones overlap.”

So the biology community turned to math experts, who devised algorithms and graph-based methods to find the overlaps between DNA fragments and to recognize regions of evolutionary conservation.
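The overlap problem Lander describes can be illustrated with a minimal sketch. It assumes error-free fragments and uses a brute-force comparison of every pair; the function names, the `min_len` threshold and the three toy fragments are hypothetical and are not taken from the talk or from any production assembler.

```python
from itertools import permutations


def overlap_length(left, right, min_len=3):
    """Length of the longest suffix of `left` that equals a prefix of `right`.

    Returns 0 when no overlap of at least `min_len` bases exists.
    """
    max_len = min(len(left), len(right))
    # Try the longest candidate first so the first hit is the best one.
    for k in range(max_len, min_len - 1, -1):
        if left.endswith(right[:k]):
            return k
    return 0


def overlap_graph(fragments, min_len=3):
    """Map each ordered pair of fragments to its overlap length (if any)."""
    edges = {}
    for a, b in permutations(fragments, 2):
        k = overlap_length(a, b, min_len)
        if k:
            edges[(a, b)] = k
    return edges


# Toy fragments, invented for illustration only.
fragments = ["ATGGCGT", "GCGTACC", "TACCTGA"]
for (a, b), k in overlap_graph(fragments).items():
    print(f"{a} -> {b}: overlap of {k} bases")
```

Each edge in the resulting graph says that one fragment can be followed by another, which is the kind of structure an assembler then has to trace. Comparing all pairs scales quadratically, so real tools rely on indexing rather than this direct comparison.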

Lander discussed such concepts as Hilbert sets, Ornstein-Uhlenbeck diffusion and the Chen-Stein method of Poisson approximation. Such terms were undoubtedly unfamiliar to some members of an audience that included many high school students; the room was packed with teenagers sitting on the floor and standing along the walls.

When researchers started tracing chromosomes of families and inherited patterns of disease, Lander explained, they had to deal with incomplete data. Members of the team exploited a mathematical trick called the expectation-maximization (E-M) algorithm. They started with a possible solution, estimated the missing data and repeatedly solved the problem until it all added up. “This was essential to genetic linkage analysis,” said Lander. “That’s what’s under the hood.”
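The guess, fill in, re-solve loop can be sketched on a classic textbook problem rather than on linkage analysis itself: estimating ABO blood group allele frequencies when blood type A hides whether a person carries AA or AO. This is an illustrative stand-in, not the method Lander's team used; the function name, the Hardy-Weinberg assumption and the counts are all hypothetical.

```python
def abo_em(n_a, n_b, n_ab, n_o, iterations=200):
    """Estimate A, B and O allele frequencies (p, q, r) by E-M gene counting.

    Inputs are observed blood-type counts. The hidden data are the
    genotypes behind types A (AA or AO) and B (BB or BO).
    """
    total = n_a + n_b + n_ab + n_o
    p = q = r = 1.0 / 3.0  # starting guess: a "possible solution"
    for _ in range(iterations):
        # E-step: split the ambiguous types into expected genotype counts,
        # assuming Hardy-Weinberg proportions under the current estimates.
        n_aa = n_a * p * p / (p * p + 2 * p * r)
        n_ao = n_a - n_aa
        n_bb = n_b * q * q / (q * q + 2 * q * r)
        n_bo = n_b - n_bb
        # M-step: re-estimate frequencies by counting alleles
        # as if the filled-in genotypes had been observed.
        p = (2 * n_aa + n_ao + n_ab) / (2 * total)
        q = (2 * n_bb + n_bo + n_ab) / (2 * total)
        r = 1.0 - p - q
    return p, q, r


# Made-up counts constructed to match p=0.3, q=0.1, r=0.6 exactly.
p, q, r = abo_em(n_a=4500, n_b=1300, n_ab=600, n_o=3600)
print(round(p, 3), round(q, 3), round(r, 3))
```

Starting from equal frequencies, the estimates settle near 0.3, 0.1 and 0.6, the values used to build the toy counts. A fixed iteration count keeps the sketch short; a practical version would stop once the estimates change by less than a tolerance.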

The day-long celebration included a poster session in the Porter Bldg. atrium. Dr. DJ Patil, chief data scientist of the U.S., shows off his Pi wear.

In 2003, scientists completed the 13-year project of sequencing 99 percent of the human genome based on the data and technology available at that time. Soon after, the math showed that some long-held beliefs about the human genome were wrong. For example, researchers discovered that there are thousands fewer protein-coding genes than previously thought, said Lander, and most functional information actually lies in noncoding DNA.

New techniques allow us to track the three-dimensional positions of millions of points in the genome simultaneously. Ultra high-resolution mapping has illuminated how the human genome, which is 6 meters long, folds to fit inside a single cell. Said Lander, “It turns out one can work out the whole folding of the genome by math!”

Genetics: Driving Medical Progress

Recent genomic studies are helping us pinpoint which regions of chromosomes correlate with disease, from heart disease to diabetes to inflammatory bowel syndrome. Thousands of loci are associated with hundreds of different diseases and traits, said Lander.

A study of schizophrenia examined 6,000 patients and found no common genetic markers. When the sample expanded to 20,000 people, 5 loci were found. With 50,000 patients, researchers found 62 genes and, thanks to an international consortium, a sample of 110,000 patients with schizophrenia yielded 108 genome-wide significant results. “The math was what gave them the faith to keep going,” said Lander.
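Why larger samples keep turning up new loci can be illustrated with a simplified power calculation. The sketch below is an assumption-laden approximation, not the consortium's analysis: it treats each variant as a two-sided z-test, uses the conventional genome-wide threshold of 5e-8, and picks a hypothetical variant that explains 0.05 percent of trait variance. The function name and all parameter values are illustrative.

```python
from math import sqrt
from statistics import NormalDist

# Conventional genome-wide significance level; assumed, not from the talk.
GENOME_WIDE_ALPHA = 5e-8


def detection_power(sample_size, variance_explained, alpha=GENOME_WIDE_ALPHA):
    """Approximate chance that one variant reaches significance.

    Models the test statistic as normal with mean
    sqrt(sample_size * variance_explained) and unit variance.
    """
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1.0 - alpha / 2.0)
    shift = sqrt(sample_size * variance_explained)
    upper_tail = 1.0 - std_normal.cdf(z_crit - shift)
    lower_tail = std_normal.cdf(-z_crit - shift)
    return upper_tail + lower_tail


# Sample sizes echo those in the article; the effect size is hypothetical.
for n in (6_000, 20_000, 50_000, 110_000):
    print(n, round(detection_power(n, 0.0005), 3))
```

Under these assumptions the variant is almost never detected at 6,000 samples and almost always detected at 110,000, which is the qualitative pattern the schizophrenia studies followed.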

Studies of type 2 diabetes identified 64 related genes. Today, researchers are identifying dozens of cancer genes by analyzing thousands of malignant tumors. “In each case, we believe the math,” said Lander. “It tells us there’s more over the horizon.”

Collins and Lander pose with high school students, who formed a large part of the audience at NIH’s observance of Pi Day.


Photos: Bill Branson

Big Data to Knowledge

“We have only scratched the surface,” said Lander. “For common genetic disease, for almost all important phenotypes and all important diseases, we still are far from having a complete picture. We’re going to need huge collections to be able to do that.”

The recent Precision Medicine Initiative announced by President Obama seeks to enroll 1 million volunteers to contribute their genetic information to improve diagnosis and treatment of disease. While this is an important start, Lander said society needs an even larger, global patient-driven movement to study every major disease in each ethnic group. He noted, “One big aspect of the future of quantitative biomedicine is going to be these large populations from which we’re going to have to learn. We must turn our health care system into a learning system.”

What would Pi Day be without actual pie? These guests at NIH’s observance found more than mere intellectual nourishment at the celebration.

Another exciting field is single-cell genomics. Math is driving innovation in this area as well; the goal is a complete cell atlas. New technology is making it possible to sequence RNA in individual cells and read it out by massively parallel sequencing.

Currently, cells are classified by structure and function. “What we really want is some utterly unbiased way to know every cell type in the human body,” said Lander. “We want to know every cell state the cell type can be in, how they differ according to location and environment, what every cell transition is about and the lineage history of the cell.”

Learning from populations of cells and from larger populations of people will help unlock the genetic basis of diseases and will ultimately lead to new technologies and therapies. “We’re learning more and more from every sample,” said Lander. “Extracting all of this amazing data from human health care will be an incredibly powerful driving force for medical progress.”

The complete Pi Day program, including Lander’s talk, can be viewed at http://videocast.nih.gov/summary.asp?Live=15906&bhcp=1.

