Parsing PubMed

iHOP organizes interconnected information

Text-mining tools such as iHOP (Information Hyperlinked Over Proteins) are doing for biological literature what hyperlinks and search engines do for the Internet: organizing interconnected information in a fast, intuitive, searchable manner. In January 2007, the service began providing daily updates, extending the information network by about 2,000 new papers every day.


With genes and proteins acting as hyperlinks between sentences and abstracts, a large part of the PubMed knowledge base becomes a giant, navigable information network, says Robert Hoffmann, PhD, a postdoctoral fellow at Sloan-Kettering Institute who started the iHOP project while a researcher at the Protein Design Group at the National Center for Biotechnology (CNB) in Madrid, Spain. “The new version provides current information on even more genes and chemical compounds, covering 1,500 organisms ranging from human and chimpanzee to yeast and HIV,” Hoffmann says. He and his colleagues also extended iHOP’s results to include drug interactions, and they have provided new ways to interact with the data, such as displaying “breaking news” found in papers from the past two years.


Freely available online since 2004, iHOP parses millions of PubMed documents and selectively grabs information specific to 80,000 different biological molecules. The program displays a list of relevant sentences snagged from the parsed documents, effectively summarizing the interactions and functions of a given protein or gene. The user can also browse statistical overviews of interaction partners and associated drugs, collect interesting sentences into a logbook, and create graphical representations of the results.
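To make the idea of genes acting as hyperlinks between sentences concrete, here is a minimal Python sketch. It is not iHOP's actual code: the gene lexicon, the document identifiers, the example sentences, and the function names are all hypothetical, and the naive sentence splitter and exact-token matching stand in for far more careful gene-name recognition.

```python
import re
from collections import defaultdict

# Hypothetical gene lexicon: lowercase synonym -> canonical symbol (illustrative only).
GENE_SYNONYMS = {
    "tp53": "TP53",
    "p53": "TP53",
    "mdm2": "MDM2",
    "brca1": "BRCA1",
}


def split_sentences(abstract):
    """Naive splitter: break after '.', '!' or '?' when followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", abstract) if s.strip()]


def genes_in(sentence):
    """Return the canonical symbols of all lexicon genes mentioned in a sentence."""
    tokens = re.findall(r"[A-Za-z0-9]+", sentence.lower())
    return {GENE_SYNONYMS[t] for t in tokens if t in GENE_SYNONYMS}


def build_index(abstracts):
    """Map each gene to the sentences that mention it and to its co-mentioned genes."""
    sentences_by_gene = defaultdict(list)
    partners = defaultdict(set)
    for doc_id, text in abstracts.items():
        for sentence in split_sentences(text):
            found = genes_in(sentence)
            for gene in found:
                sentences_by_gene[gene].append((doc_id, sentence))
                # Genes sharing a sentence act as links between entries.
                partners[gene] |= found - {gene}
    return sentences_by_gene, partners


# Made-up example abstracts with hypothetical document identifiers.
ABSTRACTS = {
    "doc1": "MDM2 binds p53 and promotes its degradation. The effect was dose dependent.",
    "doc2": "BRCA1 interacts with TP53 in response to DNA damage.",
}

sentences_by_gene, partners = build_index(ABSTRACTS)

if __name__ == "__main__":
    for doc_id, sentence in sentences_by_gene["TP53"]:
        print(doc_id, "|", sentence)
    print("Co-mentioned with TP53:", sorted(partners["TP53"]))
```

Looking up one gene returns the sentences that mention it, and each co-mentioned gene can be followed to its own list of sentences, which is the navigation pattern the article describes.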


The computational machinery behind iHOP has continually evolved since the program’s introduction, Hoffmann says.


The most important enhancement this year—daily updating—was also the most technically demanding, requiring the daily processing of about 2,000 new publications. “It is a huge challenge to parse the literature on an ongoing basis, with thousands of new papers per week,” says Chris Sander, PhD, of the Computational Biology Center at Memorial Sloan Kettering Cancer Center. “Robert and our team can now do this as the result of new software running on a multiprocessor machine that is better suited to processing large-scale text data.”


The problem, Hoffmann says, is that most parallel computing pipelines (known as Message Passing Interface frameworks) are designed for repeated number crunching, not the sort of memory-intensive, semantic database processing that text mining requires. So Hoffmann developed his own computational pipeline capable of annotating millions of documents within a few hours on an 80-node cluster, making daily iHOP updates a reality. “We’re now in a good position to make the next move toward annotations of full text sources, as well as the algorithmic exploration of gene networks,” Hoffmann says.


Text-mining tools such as iHOP are great for focusing on pertinent key fragments in the literature, says Russ Altman, MD, PhD, chair of the Department of Bioengineering at Stanford University. “There is so much published that it’s hard to keep track of all the relevant information, especially in journals that end up having unexpectedly relevant material,” Altman says. “iHOP is an example of an approach that helps biologists filter lots of literature.”


iHOP is freely accessible at


