NIH Record - National Institutes of Health

Human Pangenome Boosts Accuracy, Reflects Diversity

Image: Multicultural, diverse group of people looking up. Photo: rawpixel.com/shutterstock

Genetic differences between people can cause or alter the severity of various diseases and influence the effectiveness of treatments. Scientists identify such genetic variants by comparing an individual’s genome sequence to a standard, which is known as a reference genome.
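The comparison described above can be illustrated with a toy sketch. It assumes two short sequences that are already aligned and of equal length, which real variant-calling software does not assume, and it only finds single-letter substitutions. The function name `find_substitutions` and the example sequences are hypothetical and are not taken from the consortium's work.

```python
def find_substitutions(reference: str, sample: str) -> list[tuple[int, str, str]]:
    """Return (position, reference_base, sample_base) for every mismatch.

    Assumption: both sequences are pre-aligned and equally long.
    Positions are 1-based.
    """
    if len(reference) != len(sample):
        raise ValueError("sequences must be pre-aligned to equal length")
    return [
        (pos, ref_base, sample_base)
        for pos, (ref_base, sample_base) in enumerate(zip(reference, sample), start=1)
        if ref_base != sample_base
    ]


# Made-up sequences for illustration only.
reference = "ACGTTAGC"
sample = "ACGATAGC"
variants = find_substitutions(reference, sample)
print(variants)
```

Larger differences, such as long insertions or deletions, shift the two sequences out of step, so a position-by-position comparison like this one cannot describe them.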

A reference genome is created by assembling parts of the genomes of many different people into a single sequence. The original reference genome was developed by the Human Genome Project two decades ago. It has been continually updated as genome sequencing has become more accurate and more data have become available. But a single reference genome can’t represent the genetic diversity of the human species. In particular, larger genetic variations, known as structural variations, are difficult to identify using a single reference genome.

An NIH-funded consortium has developed a reference “pangenome” that represents more human genetic diversity. The pangenome resembles a transit map, with different lines representing each component genome. The lines overlap where the sequences match and branch out where the sequences diverge. A first draft of the pangenome was published in Nature. Four companion papers were published as well.
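The transit-map picture can be sketched as a small graph in which each node holds a stretch of DNA and each genome is a path through the nodes. This is a simplified illustration, not the consortium's data format or software. The segment sequences, the names `genome_a` to `genome_c`, and the helper functions are all invented for the example.

```python
from collections import defaultdict

# Hypothetical segments (node id -> DNA letters); not real data.
segments = {
    1: "ACGT",
    2: "G",      # variant carried only by genome_a
    3: "T",      # alternative carried by genome_b and genome_c
    4: "CCA",
    5: "TTTGA",  # extra stretch present only in genome_c
    6: "GG",
}

# Each genome is one "transit line": an ordered path through the segments.
paths = {
    "genome_a": [1, 2, 4, 6],
    "genome_b": [1, 3, 4, 6],
    "genome_c": [1, 3, 4, 5, 6],
}


def spell(path):
    """Rebuild a genome's sequence by joining the segments along its path."""
    return "".join(segments[node] for node in path)


def shared_and_branching(paths):
    """Split nodes into those on every path and those on only some paths."""
    usage = defaultdict(set)
    for name, path in paths.items():
        for node in path:
            usage[node].add(name)
    everyone = set(paths)
    shared = sorted(n for n, names in usage.items() if names == everyone)
    branching = sorted(n for n, names in usage.items() if names != everyone)
    return shared, branching


shared, branching = shared_and_branching(paths)
print("lines overlap at nodes:", shared)
print("lines branch at nodes:", branching)
for name, path in paths.items():
    print(name, spell(path))
```

Nodes 1, 4 and 6 are where all three lines overlap. Nodes 2, 3 and 5 are the branches. Node 5 shows how a stretch of sequence carried by only some genomes can be stored without forcing it into a single linear reference.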

To estimate the completeness of the component genomes, the researchers compared them with the first complete human genome sequence, which was released in 2022. On average, the genomes covered more than 99% of the expected sequence, and more than 99% of each genome was accurately assembled.

The pangenome captured nearly all human genome variants that have been identified using the existing reference genome, called GRCh38. But it also went beyond the existing reference in several ways. The researchers found more than 1,100 cases of gene duplication in the pangenome that were missing from GRCh38. The pangenome also contains over 100 million more base pairs (the “letters” of DNA) than GRCh38.

Structural variations can be especially hard to detect using a single reference genome. These involve the deletion, duplication or rearrangement of long DNA stretches. Most of the new base pairs found in the pangenome were in regions that were previously unresolved due to structural variation. The researchers identified previously unknown structural variations at several locations where many such variations are possible. Overall, the average number of structural variations detected with the pangenome was more than double the number detected with GRCh38.

The authors note that the published pangenome is only a first draft. The consortium ultimately hopes to produce a more detailed pangenome that incorporates genomes from 350 people. Having a diverse reference may help ensure that future genomic research can benefit people of all backgrounds.

Adapted from NIH Research Matters

The NIH Record

The NIH Record, founded in 1949, is the biweekly newsletter for employees of the National Institutes of Health.

Published 25 times each year, it comes out on payday Fridays.

Assistant Editor: Eric Bock
Eric.Bock@nih.gov

Staff Writer: Amber Snyder
Amber.Snyder@nih.gov