An Automated Supertree
A Model for Extracting Literature-based Knowledge
Too much scientific knowledge is buried in published literature. Case in point: The phylogenetic relationships among microbial species are locked into numerous publications about individual species and their close relatives. And because those publications don’t include machine-readable data, the information is difficult to extract. Thus, efforts to create supertrees (large trees assembled from a combination of many smaller phylogenetic trees) typically involve a handful of graduate students doing a massive cut-and-paste job—connecting trees bracket by bracket on a computer. “It’s mind-numbingly tedious,” says Ross Mounce, PhD, Open Access Grants Manager at the Arcadia Fund.
As a postdoctoral research associate at the University of Cambridge, Mounce set out to create a microbial supertree by using computer vision to extract information from the smaller phylogenetic trees published in a single journal. “We point the program at the file and it will do its best to extract phylogenetic data from the image,” Mounce says. The result is not the best possible tree, he adds, but a proof of concept for a scalable, automated process. “It’s a solvable problem,” he says, and one that has been made easier in the United Kingdom by recent changes to copyright law. As long as a researcher has legitimate access to a published piece of literature (through a university library, for example), “it’s legal to do sophisticated analyses on it without asking permission of the copyright holder,” Mounce says. Without that legal right, it would be nearly impossible to synthesize the literature at scale.
“The future is really exciting, because if you had an ongoing reproducible pipeline, you could have a tree of life that self-updates every day,” Mounce says. The same is true for any piece of scientific knowledge: “You could check back and see a self-updating synthesis of the current evidence on any topic,” he says. “That’s the idea really.”