(This is the second of a multi-part series exploring how technology enables NIH’s mission.)
The ability to access and share data is a straightforward concept. Each day at NIH, we log on to computers and personal devices to collaborate with coworkers, peers and friends. What if access to data meant that cancer could be more easily treated or prevented?
For researchers who use The Cancer Genome Atlas (TCGA) through the Cancer Genomics Hub (CGHub), the ability to share and access data is allowing for a better understanding of cancer development and leading the way to personalized medicine.
TCGA, with more than 30 cancer types, including 9 rare tumors, catalogs an unprecedented amount of data. With one sample generating more than 300 billion bytes of data, a collection hub was required to coordinate geographically dispersed genome sequencers and genome analysis centers. CGHub serves as a central repository for genomics information for three different National Cancer Institute programs, including TCGA.
Located in San Diego, CGHub is managed by the University of California, Santa Cruz, and provides cancer researchers a secure repository for storing, cataloging and accessing TCGA’s lower-level sequence data set.
Accessing and sharing data in real time is only a part of the challenge. As of 2014, CGHub contained approximately 2 petabytes of data; however, long before then, NIH and UC Santa Cruz were concerned with data integrity, and specifically, backups.
“By physically shipping data, we had tapes backed up in San Francisco, closer to Santa Cruz, but we wanted a backup on the east coast, closer to NCI,” said Mathangi Thiagarajan, CGHub project manager. “In 2012, we shipped the first batch of tapes to NCI-Frederick and copied the data to a tape archive.”
After NCI-Frederick caught up with data archiving, they started to explore the possibilities of transferring data across the network.
“In 2012, we started this project and could barely receive 1 terabyte per day,” Thiagarajan explained. “In 2013, we continued to make incremental progress. By 2014, we could receive up to 14 terabytes per day.”
Thiagarajan attributes this fourteenfold increase to NCI’s efforts to add network streams and to the Center for Information Technology’s initiative to upgrade the NIH-wide network, including the local firewalls in June 2014.
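To put the quoted rates in perspective, the short sketch below converts terabytes per day into a sustained network rate and estimates how long a full copy of the roughly 2-petabyte repository would take. It is only back-of-the-envelope arithmetic: it assumes decimal units (1 terabyte = 10^12 bytes), a perfectly steady transfer with no protocol overhead, and function names that are purely illustrative.

```python
# Rough throughput arithmetic for the figures quoted in the article.
# Assumptions (not from the article): decimal units and a perfectly
# steady transfer with no protocol overhead or downtime.

SECONDS_PER_DAY = 24 * 60 * 60
TB = 10**12  # bytes in a decimal terabyte
PB = 10**15  # bytes in a decimal petabyte


def sustained_gbps(tb_per_day):
    """Average rate in gigabits per second needed to move tb_per_day."""
    bits_per_day = tb_per_day * TB * 8
    return bits_per_day / SECONDS_PER_DAY / 1e9


def days_to_transfer(total_bytes, tb_per_day):
    """Days needed to move total_bytes at a steady tb_per_day."""
    return total_bytes / (tb_per_day * TB)


if __name__ == "__main__":
    for rate in (1, 14):  # 2012 and 2014 daily rates quoted above
        print(
            f"{rate:>2} TB/day ~ {sustained_gbps(rate):.2f} Gbit/s; "
            f"2 PB would take about {days_to_transfer(2 * PB, rate):.0f} days"
        )
```

Under these assumptions, 14 terabytes per day corresponds to a sustained rate of about 1.3 gigabits per second, and copying 2 petabytes would take roughly 143 days instead of about 2,000 days at 1 terabyte per day.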
“Today, the average speed of the stream is obvious to our staff. The data is downloaded faster on each stream. Single stream speeds have improved so much we have cut back on the number of connections to CGHub, which in turn has helped their global performance,” said Thiagarajan.
Although the data at CGHub will move from UC Santa Cruz to NCI’s Genomic Data Commons at the University of Chicago in 2016, the ability to transfer data across the country is an important advance.
“This upgrade was extremely crucial,” said Thiagarajan. “[Among] data providers and data consumers, there will always be competition for bandwidth, so we need to have a good working relationship with our technology teams so they can understand our work and mitigate issues. Knowing we have the capability to do 14 terabytes per day is incredibly useful. New projects and opportunities come up every day.”
Researchers work at an NHGRI-supported large-scale sequencing center.
Photo: Broad Institute of MIT & Harvard