Tools to Understand the Federal Research Portfolio: From Ontologies to Topic Mapping
Computation helps evaluate the nature of the NIH research portfolio in ways that were previously very difficult.
What biomedical research does the federal government fund? How is it allocated across important diseases? Has that changed over time? Answering these questions at any level of detail is tougher than you might expect.
The National Institutes of Health, for example, award 80,000 grants each year. But when they want to evaluate funding for a particular area of research, “It’s difficult to know: Have you covered what you think you’ve covered?” says Edmund Talley, PhD, Program Director, National Institute of Neurological Disorders and Stroke (NINDS), NIH.
Of course, federal funding agencies do analyze and report to Congress on their research portfolios. At the NIH, for example, the Research, Condition, and Disease Categorization (RCDC) process sorts all NIH grants into 233 categories that the agency is required to report to Congress and the public. This categorization is transparently available at the NIH RePORTER website, an online searchable database of NIH grants. But, Talley says, congressional reporting categories don’t necessarily cover the entire realm of research. And because NIH is currently the only agency using this system, it can’t be used to assess research across funding organizations.
Now researchers have developed two very different yet complementary computational approaches to dig deeply into the federal research portfolio. The first, developed at Stanford, relies on ontologies—structured, hierarchical categorizations of research—to answer specific questions about the federal research portfolio across all funding agencies. The second, the NIH Map Viewer, developed by Talley and a diverse team of computer scientists, uses text mining to cluster topic words from NIH grant abstracts and then visualize the results.
Both tools can help program officers—as well as grant applicants—evaluate the nature of the NIH research portfolio in ways that were previously very difficult, if not impossible.
Ontologies Get Real
Nigam Shah, PhD, assistant professor of medicine at Stanford University School of Medicine, would like to see federal funding agencies categorize their grants using a common ontology, at least for disease research. To make the case for that idea, Yi Liu, a graduate student in Shah’s lab at Stanford, set out to demonstrate that existing ontologies could be used—in an automated way—to discover interesting information and trends in research activity across all federal agencies.
He used a decade of grants data (1997 to 2007) from the Research Crossroads database, which covers 33 different funding institutions, including the NIH, the National Science Foundation, the Health Resources and Services Administration, the Centers for Disease Control and Prevention, the Food and Drug Administration, and others. That database can now be searched on BioPortal, the ontology repository website created by the National Center for Biomedical Ontology. Liu also created a workflow to annotate both the grants data and a decade’s worth of PubMed journal articles associated with US institutions, using terms from the Human Disease Ontology (DO).
Up to this point in Liu’s research, Shah says, “anyone can do this at the BioPortal.” Indeed, a simple search of the Research Crossroads database provides any user with counts of grants in any disease category. But Liu went several steps further, looking at three measures of funding.
First he looked at sponsorship—the level of funding for a particular disease topic relative to the impact factor–weighted count of publications in that topic area. For example, Liu found that drug abuse and Alzheimer’s disease are highly sponsored but are less commonly represented in high-impact journals compared with cancer or heart disease. Liu’s analysis can’t explain this discrepancy—which could have many causes, including how expensive the research is; whether the topic is a new research area; and whether it has been hard to produce results with a significant impact on the disease—but his work makes it easier to spot the differences.
Liu also studied allocation—the level of support for a disease area as a function of mortality rates, which he used as an imperfect surrogate measure of disease burden (other measures are possible). “Allocation looks at sponsorship in the context of the size of the problem,” Shah says. “Are we spending enough? Overspending? Under-spending?” For example, the work showed higher funding for cancer than for heart disease, which has higher mortality rates.
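The two measures reduce to simple ratios. A minimal sketch, with invented figures standing in for Liu’s ontology-annotated grant and PubMed data (function names and numbers are illustrative only):

```python
def sponsorship(funding_dollars, weighted_publications):
    """Funding per unit of impact-factor-weighted publication output."""
    return funding_dollars / weighted_publications

def allocation(funding_dollars, annual_deaths):
    """Funding relative to disease burden (here, mortality)."""
    return funding_dollars / annual_deaths

# Toy figures for illustration only (not real NIH data):
# disease A gets more funding per death than disease B, even though
# disease B has the higher mortality -- the cancer/heart-disease pattern.
fund_a, pubs_a, deaths_a = 5000.0, 120000.0, 550000.0
fund_b, pubs_b, deaths_b = 2000.0, 40000.0, 600000.0

print("sponsorship A vs B:", sponsorship(fund_a, pubs_a), sponsorship(fund_b, pubs_b))
print("allocation  A vs B:", allocation(fund_a, deaths_a), allocation(fund_b, deaths_b))
```

Trend analysis, the third measure, would then track either ratio for a disease year by year.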
And finally, Liu looked at trends across time. He determined whether, for a given disease, funding has reached a plateau, dropped off, or increased over the years—useful information for agencies hoping to make smart funding decisions.
In the end, Liu says, “I was pretty happy about being able to see the big picture from a pretty granular database.”
It’s a proof of concept—a demonstration that the various agencies that fund biomedical research should switch from ad hoc categories to an existing, shared ontology. “We don’t really care which one,” Shah says. “But why not use an ontology based on the Unified Medical Language System (UMLS) that the National Library of Medicine is building and funding?”
Still, using ontologies has its limits, Shah concedes. “If your research interest is one for which there is not a good ontology, then this approach is simply not going to work,” he says. For example, areas such as liberal arts, political science, or even basic research are difficult to classify hierarchically.
As an alternative to ontologies, Talley and his team created the NIH Map Viewer, a tool that uses text mining, topic modeling, and visualization as a way of digging deeply into the NIH portfolio. Such topic maps have two clear advantages over ontologies: They pick up phrasings and words that are not in a pre-classified hierarchy; and they cluster words together based on their shared usage. “In topic mapping, you have a bag of words and you want to learn how it’s organized—to extract structure from it,” Shah says.
NIH Map Viewer was built on earlier work by the team. “We had already produced nice data using abstracts from the Society for Neuroscience annual meeting, but we didn’t know if the method could scale to the entire NIH, or if it would provide a coherent view at both the local and the global level,” Talley says.
The text-mining method they used is called “latent Dirichlet allocation” or LDA. It had been invented a few years before, but hadn’t been tested on many real-world problems, Talley says. “There were a lot of open questions about how to evaluate what makes a good topic.” LDA is a kind of component analysis, and that was a problem: The components don’t have to be meaningful in order to be predictive. Talley needed good topics, not just a good model. “We had to come up with a way to assess topics in an automated way,” Talley says.
In the year since the work was published in Nature Methods in June 2011, the team has continued to tune the parameters of the algorithm so that it now does quite a good job of extracting topics from text. “That’s an accomplishment for us,” he says. “The new topics will be available in the next few months.”
The visualization piece of the project starts from a layout map based on similarities between the grant abstracts. Documents are clustered based on their internal texts rather than by external labels given to them by NIH, Talley says. The baseline map resembles a web with interconnected strands that represent grants with their feet firmly planted in several fields.
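Any similarity-based layout begins with a pairwise document-similarity measure. One plausible choice is cosine similarity over word counts—a hedged sketch only, since the team’s actual metric and layout algorithm aren’t detailed here:

```python
import math
from collections import Counter

def cosine_similarity(text_a, text_b):
    """Cosine similarity between two documents' word-count vectors."""
    a, b = Counter(text_a.split()), Counter(text_b.split())
    dot = sum(a[w] * b[w] for w in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

# Two toy "abstracts" sharing three of four words
print(cosine_similarity("software database web tool",
                        "database web resource software"))  # → 0.75
```

A layout engine would then place high-similarity grants near each other, producing the web-like strands described above.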
On top of this static baseline map, users can query for topics as well as other categories of interest, as described in the following sidebar. The NIH Map Viewer is now available at https://app.nihmaps.org, as well as from a “Links” tab within NIH RePORTER, the NIH portfolio search tool.
“This has been an experiment where we’ve said ‘Let’s get it out there and see where the value is,’” Talley says. “Ultimately, I think this or something like this will be valuable for policy officials. It’s a new way of looking at grants.”
Eventually, Talley hopes to see a system that can provide both the accurate recall of text mining and the clarity of ontologies. This is an area of intense interest, he says. “These are complementary techniques that really need to be merged.”
An NIH Map Viewer Test Case: A topic search for “software”
In the NIH Map Viewer (at https://app.nihmaps.org/nih/browser), when users enter a search term in the topic window, a dropdown menu appears listing several possible topics containing that term. For example, “software” produces two possible bags of words, one of which begins with “software database bioinformatics web tool resource annotation visualization….” After selecting this topic and setting a threshold for recovering only the best topic matches (in this case the default 20 percent was used), a search generates a list of 594 grants, all marked on the map with pushpins (as shown on the following page). Users can change the pushpin coloring to represent institute, funding level, or a number of other categories. At the same time, these categories are displayed as a bar chart in a separate window. An additional window lists similar topics, and users can drill down into a particular topic using the “topic info” button, which opens a separate page. There, users are given a wealth of information—including co-occurring topics, similar topics, and a list of grants—to help them evaluate whether the topic lives up to their expectations. The topic info page for the map shown here, for instance, helps the user ponder, “Is this topic really about software development?”
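The thresholded search amounts to filtering grants by their estimated proportion of the chosen topic. A hypothetical sketch—the grant IDs and topic weights below are invented, and the Map Viewer’s internal data model may differ:

```python
def grants_above_threshold(doc_topic_weights, topic_id, threshold=0.20):
    """Return grant IDs whose weight on topic_id is at least threshold."""
    return [gid for gid, weights in doc_topic_weights.items()
            if weights.get(topic_id, 0.0) >= threshold]

# Invented per-grant topic proportions (e.g., topic 0 = "software database...")
weights = {
    "R01-0001": {0: 0.55, 1: 0.10},
    "R01-0002": {0: 0.05, 1: 0.80},
    "R01-0003": {0: 0.25, 1: 0.30},
}

print(grants_above_threshold(weights, topic_id=0))  # → ['R01-0001', 'R01-0003']
```

Raising the threshold narrows the pushpin set to grants most strongly about the topic, which is what distinguishes this search from a plain keyword match.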
By comparison, a keyword search for “software” in NIH RePORTER produces a list of 3823 grants. “You’re pulling everything and there’s no way to really focus it,” Talley notes.
Users can also come to NIHMaps.org directly from a RePORTER search using the “Links” tab. Each grant from the search is displayed as a pushpin. Zooming in and scrolling over each pushpin identifies each grant by name. If a search produces only a handful of grants, the RePORTER’s list might be adequate, Talley notes. “But when you start talking about hundreds, a list becomes intractable, and you need a way to organize the information,” he says. “Our statistical analysis of this layout algorithm suggests that it’s tuned to perform especially well when you start getting a hundred or more documents, which is where clustering becomes really useful.”
Talley and his colleagues are also continuing to improve the NIH Map Viewer. For example, it’s now possible to save and share a search as a link; and in the bar chart, users can turn different categories on or off. Talley’s team is also generating a map based on similarities between grants and publications that cite NIH grants. The combined map of grants and publications has higher resolution, Talley says. “The overall quantitative performance improves, and we get more clusters in places where we know we want more clusters.” He hopes to release the new map in a few months.