Tuxedo Junction
Salzberg Uses Latest Tech Tools to Find, Analyze Novel Genes
The global scientific community rejoiced when the first human genome was sequenced in the early 2000s. If we could find every gene in the human body and eventually identify what each one did, we could decode the many mysteries of disease. Now, as we continue to build upon that groundbreaking international research collaboration, is it possible there are more genes we haven’t yet discovered?
“We in the scientific community have been trying to figure out for a very long time how many genes the human genome has,” said Dr. Steven Salzberg, Bloomberg distinguished professor and director, Center for Computational Biology at Johns Hopkins University. He spoke at a recent seminar in Lipsett Amphitheater.
We still don’t know our exact number of genes, the intervals on the genome that get transcribed and provide function to an organism. When the genetic code was first cracked in 1964, researchers wildly overestimated the count, putting it in the millions. Ongoing research has shrunk that estimate to about 20,000 protein-coding genes, and that number continues to evolve.
Even the two major gene databases—NCBI’s RefSeq and the NHGRI-European Molecular Biology Laboratory consortium GENCODE—disagree on the total number of human genes. Salzberg, whose lab develops computational tools to analyze DNA and RNA sequences, wanted to help settle this discrepancy. Three years ago, his lab embarked on a project to rebuild the human genome catalog.
“RNA sequencing data has really transformed our ability to figure out what genes are present in the genome,” and which are functional, Salzberg said.
As his lab began deep sequencing tissues, it was overloaded with reads, many millions per sample. It would take nearly a day to map and assemble the transcripts of each expressed gene.
“You don’t do just one experiment when doing RNA sequencing,” he said. “Typically, you do many experiments and compare the different conditions to one another to see what genes are differentially expressed between healthy and diseased tissues.”
The need for speed led Salzberg’s team to update their “Tuxedo Suite” of computational tools, software that includes Bowtie and TopHat to do RNA sequence alignments and Cufflinks to assemble the RNA reads and quantify the levels of gene expression. Their newer HISAT2 is as accurate as TopHat and 50 times faster, he said. And the new, faster StringTie, which replaced Cufflinks, enabled Salzberg to build a genome catalog.
Last year, Salzberg’s group published a catalog called CHESS (Comprehensive Human Expressed SequenceS). They started with a massive RNA sequence dataset of 900 billion reads, he said, and found 30 million transcript variants across 700,000 locations on the genome.
“Our strategy was to run everything through this new Tuxedo pipeline, align everything with HISAT2 and assemble it with StringTie,” he said. “Then we compared all the assemblies to each other.”
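The align-then-assemble strategy Salzberg describes can be sketched as a short script. This is a minimal illustration, not the lab's actual pipeline: the helper names, file names and thread count are hypothetical, and the samtools sorting step is an assumption added because StringTie expects sorted alignments. The functions only build the command lines; they do not run them.

```python
from typing import List


def tuxedo_commands(sample: str, index: str, reads_1: str, reads_2: str,
                    threads: int = 4) -> List[List[str]]:
    """Build (but do not run) align -> sort -> assemble commands for one sample.

    All file names here are hypothetical placeholders.
    """
    sam = f"{sample}.sam"
    bam = f"{sample}.sorted.bam"
    gtf = f"{sample}.gtf"
    return [
        # HISAT2 spliced alignment; --dta tailors output for transcript assemblers
        ["hisat2", "-p", str(threads), "--dta", "-x", index,
         "-1", reads_1, "-2", reads_2, "-S", sam],
        # Coordinate-sort the alignments (assumed step, using samtools)
        ["samtools", "sort", "-@", str(threads), "-o", bam, sam],
        # StringTie assembles transcripts for this sample into a GTF file
        ["stringtie", bam, "-p", str(threads), "-o", gtf],
    ]


def merge_command(gtf_list_file: str, merged_gtf: str = "merged.gtf") -> List[str]:
    """Build the StringTie command that merges per-sample assemblies.

    gtf_list_file is a text file listing one per-sample GTF path per line.
    """
    return ["stringtie", "--merge", "-o", merged_gtf, gtf_list_file]
```

Each returned list could be handed to `subprocess.run(cmd, check=True)` on a machine with the tools installed. Comparing assemblies across samples, as the quote describes, would start from the merged output.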
Most of the 30 million transcript variants were not genes, said Salzberg, but transcription noise from extraneous RNAs. After comparing their data to what’s in RefSeq and GENCODE, they filtered out all but 1 percent of the transcripts they had found.
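For a sense of scale, the filtering step can be worked out from the rounded figures quoted in the talk. This is rough arithmetic only, not a count reported by the lab; the variable names are illustrative.

```python
total_transcripts = 30_000_000  # transcript variants, rounded figure from the talk
kept_percent = 1                # "all but 1 percent" were filtered out

# Integer arithmetic avoids floating-point rounding surprises
kept = total_transcripts * kept_percent // 100
discarded = total_transcripts - kept

print(f"kept ~{kept:,}, discarded ~{discarded:,}")
# prints: kept ~300,000, discarded ~29,700,000
```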
“In case you’re disturbed by this, thinking: ‘It can’t be that 99 percent of transcription is a waste,’ you’re correct. It’s not a waste,” said Salzberg. In a subsequent calculation, they found that all the transcripts they discarded collectively added up to one-third noise “and two-thirds were parts of the 43,000 that we think are real genes.”
The CHESS catalog includes 224 new protein-coding genes and 2,600 novel non-coding RNAs. Salzberg also found more than 100,000 novel transcripts from known protein-coding genes.
“Something’s happening,” he said. “Something’s getting transcribed, whether or not it gets turned into a protein.”
Meanwhile, Salzberg’s lab is also involved in a project that may have uncovered previously undiscovered stretches of human DNA. We’ve known for decades that every person has about 3 billion base pairs of genetic letters, representing the complete set of DNA in the human body. Is it possible we each might have millions more?
Working with CAAPA, the Consortium on Asthma among African Ancestry Populations in the Americas, Salzberg’s lab sequenced the genomes of 910 people of African descent from across the Americas and the Caribbean, looking for genetic markers for asthma and allergy. For 2 years, they worked to assemble genetic pieces not found in GRCh38, the human reference genome that grew out of the NIH-led sequencing effort.
“We know from genetics that Africans are a pretty diverse population, more diverse than Europeans,” said Salzberg, “and we thought we might find a lot of interesting chunks of DNA that are just missing from GRCh38.”
With this project, Salzberg hopes to help rectify the lack of diversity in that first human reference genome. When the draft of the first sequenced human genome was published in 2001, the original plan was to compile samples from dozens of people. But due to time constraints, 65 percent was based on one person, a man from upstate New York, and the rest was a mosaic from other people sampled.
For the CAAPA project, Salzberg’s team took all the genetic pieces that didn’t map to GRCh38, assembled them and, after removing redundancies, wound up with 296 megabases of sequence, including genes, not annotated in GRCh38. Some of that sequence turned up in hundreds of the 910 people.
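The "removing redundancies" step can be pictured with a toy example. The sketch below is not the team's method: it assumes exact matching only, dropping any assembled piece that appears in full inside a longer kept piece on either DNA strand, whereas real pipelines compare sequences with alignment tools and similarity thresholds. The function names and test sequences are made up for illustration.

```python
from typing import List

# Translation table for complementing DNA bases
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the sequence as read from the opposite DNA strand."""
    return seq.translate(_COMPLEMENT)[::-1]


def remove_redundant(contigs: List[str]) -> List[str]:
    """Drop contigs that are exact substrings of an already-kept contig.

    Both strands are checked. Longest contigs are considered first so that
    shorter duplicates are the ones discarded.
    """
    kept: List[str] = []
    for contig in sorted(contigs, key=len, reverse=True):
        rc = reverse_complement(contig)
        if any(contig in k or rc in k for k in kept):
            continue  # already represented by a kept contig
        kept.append(contig)
    return kept


pieces = ["ACGTTGCA", "CGTT", "AACG", "GGGG", "CCCC"]
print(remove_redundant(pieces))
# prints: ['ACGTTGCA', 'GGGG']
```

Here "CGTT" sits inside the longest piece, "AACG" matches it on the opposite strand, and "CCCC" is the reverse complement of "GGGG", so only two pieces remain.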
“This is the African Pan Genome,” Salzberg said, “the regular genome plus at least another 300 million bases.”
The sequence Salzberg is calling the Pan Genome has many insertions that might also be present in the general population.
“They’re probably not African-specific,” he said. “They are just human sequences that are missing from GRCh38, again pointing to the need for more reference genomes than we have right now.”