The Golden Age of Public Databases: Speeding Biomedical Discovery
Public databases impact not only how research is done but what kind of research is done in the first place.
The setting: a scientific conference in January 2008. The speaker, Bruce Ponder, MD, PhD, an oncology professor at Cambridge University, is describing a previously unknown link between a particular gene (FGFR2) and breast cancer. A prominent researcher in the audience, the late Judah Folkman, MD, raises his hand to propose a hunch: could another gene (for endostatin) in the same network also be related to breast cancer? The speaker doesn’t know.
After the session, another audience member, Kenneth Buetow, PhD, poses the question to a public database, the National Cancer Institute’s Cancer Biomedical Informatics Grid (caBIG), a web-accessible collection of interoperable software tools and data sources. Voilà! The information highway kicks out a preliminary research result: variants of the endostatin gene are associated with breast cancer and can be protective against the disease.
In years past, such a research result might have taken months or years to obtain through diligent laboratory work. And this example illustrates the exciting potential for web-based shared databases to transform research, says Buetow, who founded the caBIG project in 2003 as associate director for biomedical informatics and information technology at the NCI. “First you can do in silico discovery and hypothesis generation, which drives your experiment,” Buetow says. “Then after your experimental discovery, you can perform in silico validation and extension. Essentially, we can more meaningfully join the beginning and end of an experiment through information technology.”
The impact of public databases on the research process is slowly becoming known—with effects on not only how the work is done but also on what kind of research is done in the first place. Before the golden age of public databases can fully translate into innovative medical advances, however, certain challenges will need to be overcome. But those who work extensively with databases say the benefits will outweigh the costs. “It seems like a no-brainer that a portion of our investment in biomedical research should be in the archiving, annotation and maintenance of the resulting data and knowledge,” says Russ Altman, MD, PhD, professor of bioengineering at Stanford University. “This ensures that we will maximize the availability of previous discoveries, which will in turn help us to maximize new discoveries.”
The time is ripe for a database revolution. High-throughput experiments are generating unprecedented amounts of data, which many researchers believe will be valuable for years to come—and should be shared widely today. To that end, funding agencies such as the National Institutes of Health and journals such as Science and Nature are mandating that certain data be placed in public repositories.
The Molecular Biology Database Collection, maintained by the journal Nucleic Acids Research, this year lists 1,078 databases in its collection—110 more than appeared in 2007. NIH’s National Center for Biotechnology Information (NCBI) maintains more than 40 of these databases, storing molecular, genomic, and scientific literature data. These databases alone see roughly 2 million visitors and 3 terabytes of downloads every day.
Advances in technology have helped fuel this proliferation, says George Moody, research staff scientist in the Harvard/MIT Division of Health Sciences and Technology. He is the architect and caretaker of PhysioNet (http://www.physionet.org), a growing archive of freely accessible collections of digitized physiologic signals and time series measurements and related open source software. “For the types of databases that PhysioNet is mostly concerned with, the instruments that gather the data are almost invariably digital now. Our databases are large, but storage is cheap, and adequate network bandwidth is also cheap. So we can afford to collect them and make them available, and users can download them for little or nothing.”
There are also cultural changes behind these trends, says Atul Butte, MD, PhD, assistant professor of medicine at Stanford University, who uses public web-based databases extensively in his research. Increasingly, science is influenced by new movements in “openness,” he says—open-source software, open-access publishing, and so on. This coincides with an increased culture of sharing what were previously proprietary tools of the biomedical trade, such as reagents and protocols. Sharing data is a natural extension of that movement, he notes.
And successful current-generation databases can thank previous database projects for blazing the path, says Teri Klein, PhD, senior scientist in the department of genetics at Stanford University, and director of The Pharmacogenetics and Pharmacogenomics Knowledge Base (PharmGKB; http://www.pharmgkb.org/), which integrates, aggregates and annotates genotype and phenotype data, pathway information and pharmacogenetics. “The database pioneers have proven their value,” she says. “With better understanding and acceptance of databases comes greater usage.”
One of the oldest and most successful of these pioneering databases is NCBI’s GenBank (http://www.ncbi.nlm.nih.gov/Genbank/), which recently celebrated its 25th anniversary. An annotated collection of all publicly available DNA sequences, GenBank contains nearly 83 billion nucleotide bases in sequences from more than 260,000 different species.
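Sequence records downloaded from GenBank are commonly distributed in FASTA format, a plain-text layout with a `>` header line followed by the sequence itself. As a minimal sketch of how such a download might be handled locally (the two records below are fabricated examples, not real GenBank entries):

```python
# Minimal FASTA parser. GenBank sequence downloads are commonly
# distributed in this plain-text format: a '>' header line, then
# one or more lines of sequence. The records below are invented
# examples, not real GenBank entries.

def parse_fasta(text):
    """Return a dict mapping each record's header line to its full sequence."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:]          # drop the '>' marker
            records[header] = []
        elif header is not None:
            records[header].append(line)
    return {h: "".join(parts) for h, parts in records.items()}

example = """\
>EX000001.1 hypothetical example sequence
ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGA
TAG
>EX000002.1 second hypothetical sequence
GATTACA
"""

records = parse_fasta(example)
for header, seq in records.items():
    print(header, len(seq))
```

A real workflow would fetch such text over the web from NCBI rather than embed it as a string, but the parsing step is the same.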
One of the reasons for GenBank’s success is its partnership with journals, according to research by Nathan Bos, PhD, senior staff research scientist at the Johns Hopkins Applied Physics Laboratory, whose chapter, “Motivation to contribute to collaboratories: A public goods approach,” will soon appear in a book called “Scientific Collaboration on the Internet.” Most genetics journals now require authors to deposit their sequence data into GenBank as a prerequisite for publication. This partnership began in the late 1980s as a way to encourage researchers to deposit data directly, rather than rely on GenBank’s staff to input published sequences by hand.
In fact, this system appears to be the most effective at solving the “public goods” problem, Bos concludes. The social dilemma around public databases is that it is difficult to motivate researchers to freely give away their hard-earned data—even though such sharing is ultimately for the greater good of the entire community and, therefore, beneficial for the researchers themselves. The partnership between journals and GenBank works because it ties rewards and sanctions together for the researchers, Bos says.
Of course, not all databases have such a clear mandate. In some ways, biomedicine has been slow to adopt information technology, Buetow says—even though the same tools have already transformed other sectors.
Yet without having access to integrated data resources, Buetow says, “we in biomedicine are going to hit a wall.” Many biomedical phenomena are complex and need a systems-level approach, for which large shared databases are a natural tool. “We already are increasingly aware, for example, that cancer emerges through complex networks of alterations and we're going to need combinatorial therapies,” he says. “But it’s beyond the capacity of a single human neural network to be able to integrate all that information. We need this complex network of information sources.”
Speed and Synergy
Breakneck speed is one of web-based databases’ biggest attractions. Data tasks that would have previously required researchers’ valuable time—to track down, request, transport, and enter—can now be accomplished with a few clicks, even for a casual visitor.
For genetics researchers, quick and easy verification through databases like GenBank is more than just a luxury, says Nobel laureate Richard Roberts, PhD, director of New England Biolabs, Inc. and of the restriction enzymes database REBASE. “It is possible to check a new sequence against all known sequences within a very short time frame and know you haven’t missed anything. This is very important for avoiding duplication and knowing when your data and inferences truly are new,” he says.
Big research projects can also be accelerated through integrative databases. For example, in 2006, the team of Howard Fine, MD, chief of the Neuro-Oncology Branch at NCI's Center for Cancer Research, published a paper in Cancer Cell showing that stem cell factor (SCF) is critical in the genesis of malignant gliomas, the most common form of brain tumors. They had reached the conclusion through exhaustive in vitro and in vivo studies, Buetow says. But today, he points out, the same conclusion could easily be reached through synthesis of data in the Repository of Molecular Brain Tumor Data (REMBRANDT; http://rembrandt.nci.nih.gov), which Fine launched in 2005 to archive information on gene expression, copy number alterations and clinical information from several thousand patients with malignant gliomas.
In a broader sense, the right databases can speed a researcher’s entire career along, Moody says. “Simply being able to begin a study with suitable data already in hand can mean eliminating the first two years of what would have been a three-year project,” he says. “For a young researcher, graduate student, or researcher seeking to broaden his or her experience, not having to get a grant before beginning to look at the data can mean the difference between doing a project or not.”
The best databases offer benefits beyond simple speed, however. Exploring multiple connections in the data can lead to a unique synthesis of knowledge. For example, a large, multi-national team of scientists recently used the data available in the Cancer Genome Atlas (TCGA; http://cancergenome.nih.gov/), which houses multidimensional molecular cancer data. The team found that the molecular etiology of glioblastoma, the most aggressive kind of brain tumor, was characterized by a combination of factors: gene mutations, copy number changes, epigenetic silencing, and expression alterations. The work was published in Nature in September 2008. “If you could only look at one dimension of that data, and didn’t have the other data accessible in electronic resources, you’d never be able to see this,” Buetow says. “The conclusions are an emergent property of being able to see all the pieces together.”
In some cases, databases can upend the typical research cycle. “A lot of times, a scientist starts with a question, then collects data and answers the question,” Butte says. “But now we can start with public data. Then we figure out a new, useful, and valuable question we can ask and answer. And a completely different question can be asked of the data when it’s put together, beyond the initial question asked. It’s a shift in the scientific method.”
For example, in research currently under publication review, Butte and his team integrated and analyzed publicly available data in the NCBI Gene Expression Omnibus (http://www.ncbi.nlm.nih.gov/geo/) and the Unified Medical Language System (http://www.nlm.nih.gov/research/umls/) to discover a genetic similarity between Duchenne muscular dystrophy and heart attacks. This finding might have clinical value, he says, because while there are no drugs currently developed to treat muscular dystrophy, there are several available to treat heart attacks. If the pathways are similar, the heart attack drugs might be helpful in treating muscular dystrophy. And the conclusions were reached by mining publicly available data.
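At its core, this kind of cross-disease analysis compares disease-associated gene sets drawn from public repositories and looks for unexpected overlap. A toy sketch of that idea (the gene lists and similarity measure here are illustrative inventions, not the team’s actual data or method):

```python
# Toy illustration of cross-disease comparison on public data.
# The gene sets below are invented for illustration; a real
# analysis would derive them from repository datasets (e.g.,
# differentially expressed genes in each disease).

disease_a_genes = {"DMD", "MYH7", "TNNT2", "ACTC1", "NPPA"}   # hypothetical
disease_b_genes = {"MYH7", "TNNT2", "NPPA", "IL6", "CRP"}     # hypothetical

# Genes implicated in both conditions, plus a simple overlap score
shared = disease_a_genes & disease_b_genes
jaccard = len(shared) / len(disease_a_genes | disease_b_genes)

print(sorted(shared))        # → ['MYH7', 'NPPA', 'TNNT2']
print(round(jaccard, 2))     # → 0.43
```

A real study would also test whether the overlap is larger than chance expectation; the point here is only that the raw material—gene sets from two diseases—already sits in public repositories, waiting to be compared.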
Databases can also divert where researchers devote their efforts. “Public data collections free us from the need to recreate them many times over,” Moody says, “and that means that priorities shift to favor collecting novel types of data, making better use of scarce resources rather than replicating existing databases.” Collections can be expanded in ways that balance depth and breadth, he says. “For example, researchers can collect data from populations complementary to those already well-represented, gather multidimensional data sets that can lead to insights about relationships among variables, or else make use of existing data in novel ways.”
A ripple effect in other fields is becoming clear, as the biomedical community taps into quantitative disciplines for help in dealing with the vast amount of data generated. “This has created unprecedented demand for advanced computational tools and interdisciplinary expertise to capture, store, integrate, distribute and analyze data,” Klein says.
The reach of databases extends beyond the laboratory. For example, clinical cardiac arrhythmia analysis has benefited enormously from databases, including the MIT-BIH (Massachusetts Institute of Technology/Beth Israel Hospital) Arrhythmia Database and the American Heart Association Database for Evaluation of Ventricular Arrhythmia Detectors. “Nowadays it is taken for granted that computers can do a reasonably decent job of detecting important cardiac arrhythmias,” Moody says, which is a task with utility in both the clinic and the laboratory. “Without shared annotated databases, we wouldn’t have reliable arrhythmia detectors.”
Databases are changing how people work with each other, too. Perhaps most importantly, shared data let researchers in different centers and countries collaborate in novel ways, Klein says. For example, the International Warfarin Pharmacogenomics Consortium (IWPC), which PharmGKB helped broker in 2007, merged datasets totaling more than 5,000 patients from 11 countries and four continents. Their goal was to develop an algorithm for dosing warfarin, an anticoagulant. This merging of data had many benefits, she says: an increased impetus for data sharing, better quality control, and greater statistical power. To address concerns about ownership, the consortium first made the dataset available only to consortium members, Klein says; the entire dataset will be released upon publication of the manuscript.
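A dosing algorithm of this kind is, at heart, a regression of stable dose on clinical and genetic predictors, fit on the pooled patient data. As a hedged sketch of the fitting step alone (the patients, the single predictor, and the resulting coefficients below are synthetic illustrations, not IWPC data or results):

```python
# Sketch of fitting a dosing model on pooled patient data using
# ordinary least squares. All numbers are synthetic illustrations;
# a real model would use many clinical and genetic predictors.

# (age in years, weekly warfarin dose in mg) -- invented values
patients = [(30, 42.0), (45, 36.0), (55, 33.0), (62, 30.0), (70, 27.0)]

n = len(patients)
mean_age = sum(a for a, _ in patients) / n
mean_dose = sum(d for _, d in patients) / n

# Closed-form least-squares fit for a single predictor
slope = sum((a - mean_age) * (d - mean_dose) for a, d in patients) / \
        sum((a - mean_age) ** 2 for a, _ in patients)
intercept = mean_dose - slope * mean_age

def predict_dose(age):
    """Predicted weekly dose under the toy linear model."""
    return intercept + slope * age

print(round(slope, 3), round(intercept, 1))
```

The statistical-power benefit Klein describes shows up exactly here: pooling 5,000 patients instead of a few hundred shrinks the uncertainty on coefficients like this slope, which is what makes the resulting algorithm trustworthy across populations.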
Being reviewed by many eyes can also increase the value of data, in much the same way that software is improved through an open-source approach. “When many motivated observers examine the same data, their analyses can be compared,” Moody says. “Not only do we learn more about the data as a result of peer review of the data, but we learn more about the analytic methods themselves, about their strengths and weaknesses.”
Sometimes a new database can ignite a new research focus, Moody says. For example, in 2000, Moody and his team created a public, annotated database of polysomnography data and issued an open challenge to the scientific community: find ways to diagnose sleep apnea using a single ECG signal—a cheaper and less intrusive technique than standard polysomnography methods. “What surprised us was that at least a dozen research teams from around the world took up the challenge,” Moody says. Now clinicians can diagnose sleep apnea with commercially available, clinically certified software that uses their methods, he says, and researchers can also use open-source software based on these methods in their own studies. Better yet, Moody says, “because the researchers had worked independently on a common problem with common data, new collaborations among them formed easily.” The challenge is now an annual event with a different topic and data collection each year.
Even beyond collaborations, freely available data now means that a broader universe of people—not just well-funded labs—can join the research process. Researchers in developing countries have easy access to the types of data they could not afford to generate themselves, Buetow says. For example, he points out, only a little more than half of the usage of GenBank stems from the United States. “We are now able to tap into a global biomedical community of innovative thinkers, such as the billions of imaginations that are present in India, China, Latin America and in other places,” he says. “Our capacity to solve problems should grow exponentially.”
Hurdles, Both Technical and Cultural
Still, many technical challenges remain in the widespread adoption of databases. The most prominent is probably noise: inaccuracies in entries and annotations can greatly reduce the value of a dataset. “Some biologists think that there has been a proliferation of databases with low-quality information,” says Altman, whose lab developed PharmGKB. “The quality of annotations and curation is absolutely key for the reliability of the databases.”
In some cases, big (and potentially noisy) repositories are essential—and useful. But it helps if small annotated collections containing the same sort of information also exist. For example, the Broad Institute’s ChemBank and NCBI’s PubChem both house small-molecule structures and screening data. PubChem relies on submission of data and structures from outside sources; ChemBank data are generated and annotated internally. ChemBank also adds value to its data through the storage of other information, such as plate locations, raw screening data, field-based metadata and standard experiment definition. Although PubChem is tremendously useful as a large repository, ChemBank’s annotation and curation offers some distinct advantages, says Paul Clemons, PhD, director of computational chemical biology research at the Broad Institute of Harvard and MIT.
For some researchers, however, the issue of noisy data isn’t a crucial one. Butte says his approach regarding data accuracy is the same as President Reagan’s during the Cold War: “Trust, but verify.” In data mining studies, it’s not hard to throw out data suspected of having errors. Plus, having several labs contributing similar data into a public database will end up increasing their reliability, he says. “Some people feel current data are too noisy. I argue they are good enough. As Voltaire said, ‘Perfection is the enemy of the good.’”
As mountains of data continue to grow, helping researchers reach them in practical ways will become increasingly difficult. Data need to be both accessible and integrated into other data sets, says Mark Ellisman, PhD, professor of neurosciences and bioengineering at the University of California at San Diego and director of the Biomedical Informatics Research Network (BIRN). “We need more effective ways to bring data together on the fly in ways that can be visualized and understood by a researcher,” he wrote in Fall 2005 in The National Academies’ Issues in Science and Technology. One approach is to prescribe the specific metadata entities to be used by all sources, as caBIG does. The other is to develop flexible methods that bridge dissimilar standards, as BIRN does. Both approaches have benefits and can provide fertile areas of research, he says.
Yet many challenges facing database use are cultural, not technical. Adapting to the needs of clinical scientists is one such hurdle. Although results of clinical tests are increasingly being captured in electronic health records, the incorporation of clinical data into large public web-based databases still lags, largely due to privacy concerns and clinical researchers’ unwillingness to share. “We don’t have an equivalent of GenBank for de-identified patient records,” Butte says, but he believes that could change, as he wrote in a perspective in Science in April 2008. For example, he says, although clinicians and hospitals might view clinical data as a trade secret, health care networks can pool de-identified data—thus de-identifying the source of the data as well as the individual records themselves. There are projects currently underway to achieve an integrated clinical database, Butte says, particularly at Informatics for Integrating Biology and the Bedside (i2b2), a National Center for Biomedical Computing based at Partners HealthCare System in Boston, Massachusetts. But, he notes, more effort will be needed to make this database available to the entire research community.
Scientific competition is another cultural obstacle. “Given the way research is funded, many researchers are justifiably hesitant to share their data,” Moody says. “They worry about giving those who compete with them for scarce research funds a look at what they themselves have had to spend some of those scarce funds to develop.” Funding agencies can help change this culture, he says, by visibly rewarding the responsible sharing of data among researchers.
Funding agencies and the scientific community can help boost the role of the informatics field, Buetow says. “We will need to recognize the true scientific benefit of creating, maintaining and using these databases. That can’t be a second-class activity if we want the databases to be of high quality. Hopefully there will be further recognition of the importance of biomedical information as a full partner in the biomedical enterprise,” he says. “That’s beginning to happen.”
Quirks of Funding
In a time of increasing competition for biomedical resources, the question of money looms large. How can funding agencies know the true value of a database?
The simplest method, perhaps, is to examine the citations a database garners in the scientific literature, which provides an indication of its level of use. For example, the annual number of publications based on the MIT-BIH Arrhythmia Database, which has been available since 1980, continues to increase over time. But in general, that’s not enough, Altman says. “The reliability of citations to various databases is very low, and often the citation is to the paper whose results are in the database, and not to the database itself.”
Ironically, relying on citations would punish the most popular databases. “Widely used and well known databases often don't get cited anyway,” Roberts points out. “It becomes assumed that people know what they are and where to find them.” And in general, statistics can be misleading. “It can be difficult for funding agencies to assess the worth of a database using traditional peer review mechanisms,” Roberts says. “Often the study sections or panels that review database grants lack the expertise to provide a critical assessment. I think there should be a special mechanism set up to review all databases and one that mainly uses expert assessments.”
Rather than relying on traditional academic metrics, funding agencies might also do better to turn to commercial metrics when evaluating the worth of collaborative databases, Buetow says. For example, caBIG started when NCI realized that each of its 63 designated cancer centers was independently generating its own information infrastructure. Now with a common infrastructure, a direct return on investment can be calculated, he says, by measuring the difference between the cost of caBIG and the costs of each group collecting data and developing tools individually.
Still, measuring the hypothetical cost of each research team creating its own database isn’t perfect, Moody says. In some cases, it can even substantially undervalue the database. That’s because peer review of shared data leads to better quality data, he says, plus the use of the same data in multiple studies generates objective comparisons and insights—which is added value that wouldn’t otherwise be measured.
Indeed, many researchers cite funding for maintenance as a top challenge facing the future of public web-based databases. “It's easy to get funded to build a database, but it is hard to get funding for maintenance,” Altman says. This is because federal research agencies’ desire for novelty is built into their infrastructure. As a result, he says, it’s hard to compete with exciting new research ideas.
In the early stages of a database, it’s easy to show a connection with a particular research project that moves science forward, Clemons says. “Ironically, once something becomes more useful to more people, it’s really harder to pin down a particular beneficiary and show how their grant really benefited from this activity and how continued support should continue to include a focus on software development.” For example, in its early days, ChemBank was funded by NCI, but as the database became more useful to more types of researchers, showing its sponsors that it was an appropriate investment became more difficult; the database’s benefits were now being distributed throughout the research community and not specifically in the sponsor’s area of interest—cancer.
This instability might hurt some databases more than others. “There is an argument that the NCBI should have all major databases since they are likely to last forever,” Altman says. Databases created by individual research teams are more vulnerable, since they must re-compete for grant funding every five years. “Why would someone put data in there if the existence is not guaranteed?” he says. On the other hand, he acknowledges, a competitive-funding system might be the very situation that fosters innovation in database technology.
One solution would be to have long-term funding competitively but readily available after a research team has already established a useful database, Roberts says. “It is unreasonable to expect a database team to undergo the vagaries of peer review every three years or so. One poor review and absence of funding can wreck the database,” he says, because a suspension of even a year or two in data collection and team continuity severely harms the enterprise. “Since factual databases like GenBank are now critical to modern biology, the government needs to make sure they continue without interruption,” he says. “I think it would be a good idea to centralize both the review and the funding of biological databases.”
New business models might also help, Buetow says. For example, many modern libraries are starting to consider raw databases to be primary information resources in addition to their collections of books. Also, “groups like Google are actively courting the immortalization of key reference datasets” in non-biologic fields, he says, and they’re interested in hosting some raw biomedical datasets.
As the technology and culture of biomedicine continue to change, so too will its practice of storing, sharing, and synthesizing data. Teasing apart the factors driving the evolution may not be simple. “I think large public databases are a symptom of changes in science and they themselves are also changing the face of science,” Klein says.