Untangling Integrative Analysis
How researchers are combining disparate data types and simulating systems that contain many different moving parts
Thirteen years ago, Markus Covert, PhD, read a New York Times article that changed his life. The article quoted a prominent microbiologist who suggested that the ultimate test of one’s understanding of a simple cell wouldn’t be to synthesize an artificial version of the thing, but rather to build a computer model of it—a model that could predict all of the proteins expressed by the cell’s genes, their behaviors and interactions. “I think about that article every day,” says Covert, who was a graduate student at the time and is now an assistant professor of bioengineering at Stanford.
To be fair, he’s done more than just think. In 2012, Covert himself appeared in the Times, garnering widespread attention for having created a computational model of the bacterium Mycoplasma genitalium. Covert’s whole-cell model simulated all of the microorganism’s molecular components and their interactions over the course of its life cycle; accounted for the function of every annotated gene product; and predicted a wide range of behaviors with a high degree of accuracy.
It was also a model of many parts. Twenty-eight, to be exact—28 individual submodels, each describing a different cellular function (ribosome assembly, cell division, DNA repair, etc.). Those submodels are defined by thousands of parameters and compute a comparable number of unknowns, represented by 16 categories of cell variables (chromosome, mass, geometry) that in turn encompass different data types. “Chromosome,” for example, might refer to the degree of chromosomal replication, or the location of every single protein on the chromosome; “mass” might refer to the mass of DNA or of proteins; “geometry” might refer to cell radius or shape. Over the preceding decade, Covert explains, he and his colleagues had come to the conclusion that no single computational approach would suffice to model a whole cell; instead, the task would require “a lot of different approaches”—approaches that would somehow need to be integrated into a unified whole.
That integrative ethos is becoming increasingly common. This is true whether the problem under investigation requires combining disparate data types, such as the ones flowing from next-generation high-throughput sequencing technologies; or simulating systems that contain many different moving parts, each one amenable to different mathematical treatment. And the trend will only intensify as integrative modeling and analysis becomes the modus operandi of biomedical research in general. As Bernhard Palsson, PhD, Covert’s former doctoral advisor at the University of California, San Diego, says, “It’s clear that over the next 10 years, this kind of activity will take center stage in the life sciences.” And it will likely come in many forms.
Integrating Data about Gene Regulation
Mark Gerstein, PhD, and his colleagues at Yale University have been leading contributors to the ENCODE project, which aims to delineate all of the functional elements in the human genome. Using techniques that they originally developed in model organisms such as worms and mice, Gerstein and his team recently employed several different types of ENCODE data to build an integrative model of transcription that can predict gene expression based on the presence of particular regulatory elements. Among other things, ENCODE has established that as much as 18 percent of the human genome, most of which was once considered to be “junk” DNA, helps regulate the 2 to 3 percent that actually codes for proteins.
The team began by building individual models that correlated expression with different kinds of regulators—most notably, the transcription factors and histone modifications that are found at transcription start sites directly upstream from sets of genes, and which exert considerable influence over whether or not those genes are transcribed and therefore expressed. Transcription factors are proteins that activate or repress the flow of genetic information from DNA to messenger RNA. Histones—the spools around which DNA winds within the chromosome—are modified in various ways that also affect gene regulation.
The models used machine-learning methods to look at the values of thousands of these regulators in small regions around the transcription start sites; multiplied them by coefficients in order to weight their relative significance; and added them all together to create accurate predictors of gene expression. “That’s the stuff of integration right there,” says Gerstein, who is professor of biomedical informatics, molecular biophysics and biochemistry and computer science. By comparing the relative impact of the various regulators, Gerstein was able to determine which transcription factors and histone marks were most important to prediction. As reported in a recent paper in Nucleic Acids Research, the team found distinct differences in predictive strength based on location, with transcription factors achieving their highest predictive power in a small region of DNA centered around the transcription start sites, and histone modifications demonstrating high predictive power across a wide region around the genes. As a final step, Gerstein and his colleagues built a model that included both histone modifications and transcription factors, but discovered that integrating the two did not improve accuracy. “They’re actually somewhat redundant; you can’t do better by combining them,” says Gerstein—a surprising result that may help illuminate the basic biology of transcriptional regulation.
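The weighted-sum scheme described above can be sketched in a few lines. This is a toy illustration with invented regulator signals and weights, not the ENCODE pipeline: each gene gets a vector of regulator values measured near its transcription start site, a linear model learns coefficients by least squares, and the magnitude of each fitted coefficient ranks that regulator’s predictive importance.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: signal values for a handful of hypothetical regulators
# (transcription-factor binding, histone marks) near each gene's
# transcription start site. Rows = genes, columns = regulators.
n_genes, n_regulators = 200, 5
X = rng.random((n_genes, n_regulators))

# Assume expression is approximately a weighted sum of the signals.
true_weights = np.array([2.0, -1.0, 0.5, 0.0, 1.5])
y = X @ true_weights + 0.1 * rng.standard_normal(n_genes)

# Fit the weights by least squares -- the "multiply by coefficients
# and add them together" step described in the text.
weights, *_ = np.linalg.lstsq(X, y, rcond=None)

# A large |weight| marks a regulator as important to prediction.
ranking = np.argsort(-np.abs(weights))
print("fitted weights:", np.round(weights, 2))
print("most predictive regulator index:", ranking[0])
```

Comparing fitted coefficients this way is what lets one ask which regulators matter most, and fitting transcription factors and histone marks jointly versus separately is how redundancy between the two can be detected.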
Interestingly, Gerstein doesn’t consider the integrative aspect of the undertaking to have been especially challenging. “In a sense, the integration is carried out in the actual mathematical machinery as it’s put together,” he says, referring to the automated manner in which the machine-learning algorithms go about sorting and multiplying, adding and predicting. Instead, most of the heavy lifting comes earlier: before the data on the various regulators can be fed into the models, they must first be normalized and placed in the same coordinate system, put in the correct format and properly scaled. “There’s a huge amount of upstream work [required] to be able to do this integration,” Gerstein says, adding that the project is “a nice case study” of “the overall process of putting all this information together and making predictions.”
Integrating a Whole Cell Model
The idea that the “integrative” part of an ambitious integrative analysis project should turn out to be fairly straightforward might seem surprising. But it’s hardly uncommon.
For example, yoking together 28 individual submodels representing different biological functions into a single, integrated über-model of M. genitalium might appear to be a Herculean task—especially when many of those functions operate at different time scales, and are computed using mathematical approaches ranging from Boolean logic to stochastic methods. Yet while Covert and his colleagues did indeed describe integration as a “key challenge” in the Cell paper announcing their results, it wasn’t the only one. And in the end, it was amenable to a reasonably simple solution. At least, for the most part.
“We decided that we could assume that at a short timescale, [the submodels] were independent,” Covert says, adding that in this case, “short” meant less than a second. There were exceptions to this rule, most notably in the case of energy, which was in such high demand amongst all the submodels that Covert and his team had to develop a special means of allocating it before anything else could be set in motion. Once that had been worked out, however, Covert and his team could simulate the whole-cell model by proceeding in one-second timesteps, using the same method employed to integrate ordinary differential equations. For each timestep, they collected the latest data computed for every variable and fed it into the 28 different submodels. Each submodel would then return fresh data, which served as the inputs for the next time step. “Integration,” Covert says, “happens at the level of data.”
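The timestep scheme Covert describes can be sketched as a simple loop. This is a minimal illustration, not his actual MATLAB code: the two submodels here (synthesis and degradation of a generic protein pool, with invented rates) stand in for the real 28. Each submodel sees the same snapshot of the shared state, computes its own update independently, and the updates are merged back before the next one-second step.

```python
def synthesis_submodel(state):
    # Hypothetical rule: produce protein in proportion to ribosome count.
    return {"protein": 0.01 * state["ribosomes"]}

def degradation_submodel(state):
    # Hypothetical rule: degrade 2% of the protein pool per second.
    return {"protein": -0.02 * state["protein"]}

state = {"protein": 100.0, "ribosomes": 500.0}
submodels = [synthesis_submodel, degradation_submodel]

for second in range(3600):  # one simulated hour in 1-s timesteps
    # Each submodel gets the same snapshot of the state...
    deltas = [model(dict(state)) for model in submodels]
    # ...and integration happens "at the level of data": the
    # independent updates are summed into the shared variables.
    for delta in deltas:
        for var, dv in delta.items():
            state[var] += dv

print(round(state["protein"], 1))
```

Because each submodel touches only the shared state, swapping one implementation for a competing version, as Covert hopes others will do, requires no change to the loop itself.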
So decoupled, each submodel could even be run serially; though in practice, multiple whole-cell simulations run concurrently on a 128-node computer cluster. Thus far, the team has plowed through thousands of simulations, including hundreds of wild-type cells and hundreds more in which M. genitalium’s 525 genes have been disrupted one by one.
Debugging an integrated simulation of this kind can be hairy, and Covert gives credit to a former Google engineer who helped the team develop automated testing procedures for their tens of thousands of lines of MATLAB code. Still, echoing Gerstein, Covert says that a good deal of the toughest work took place long before anything was integrated. And much of that work involved selecting the most appropriate mathematics for each of the 28 cellular functions, a task that took many years to complete.
Those choices were driven by how well understood each function was, and how much quantitative data was available for it. The most detailed submodels, like the ones for RNA and protein degradation, use stochastic processes to allow for variability. The sparsest rely on Boolean operations. Others still employ flux balance analysis, which analyzes the flow of metabolites through a metabolic network without specifying their actual concentrations. “We really tried to let the process itself, and our understanding of it—together with the data that had been generated with regard to it—be our guide,” Covert says. All of the code is available online, and Covert looks forward to the day when someone writes a competing submodel and then runs the whole-cell model with both versions to see which works best.
Integrating a Multiscale Genome-scale Metabolic Network
Flux balance analysis lies at the heart of many cellular models and plays an important role in multiscale modeling efforts as well. Typically used to investigate metabolism, the method begins with the reconstruction of a genome-scale metabolic network that describes all of the metabolic reactions that are likely to occur in a given cell based on its DNA, and can then be used to model its various metabolic pathways. Palsson, who helped pioneer the approach, refers to such networks as “supply chain models”—albeit ones that map the relationships between all of the metabolites and enzymes that carry out the biochemical reactions necessary to sustain life.
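Flux balance analysis itself reduces to a linear program. The sketch below uses an invented three-reaction network, not a real reconstruction: at steady state every internal metabolite must be produced as fast as it is consumed (S @ v = 0), fluxes are bounded by capacity constraints, and the solver finds the flux distribution that maximizes a chosen objective, here a stand-in “biomass” reaction.

```python
import numpy as np
from scipy.optimize import linprog

# Stoichiometric matrix S for a toy network.
# Rows = metabolites A and B; columns = reactions:
#   r0: uptake       -> A
#   r1: conversion  A -> B
#   r2: biomass     B -> (objective)
S = np.array([
    [1.0, -1.0,  0.0],   # metabolite A
    [0.0,  1.0, -1.0],   # metabolite B
])

# Flux bounds; note that concentrations never appear, only flows.
bounds = [(0, 10), (0, None), (0, None)]  # uptake capped at 10 units

# linprog minimizes, so negate the biomass flux to maximize it.
c = np.array([0.0, 0.0, -1.0])
res = linprog(c, A_eq=S, b_eq=np.zeros(2), bounds=bounds)
print("optimal biomass flux:", res.x[2])
```

Here the uptake cap propagates through the “supply chain,” limiting biomass production to 10 units; in a genome-scale model the same machinery runs over thousands of reactions.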
Recently, Palsson and his colleagues combined a metabolic model for the bacterium Thermotoga maritima with a model of macromolecular expression that describes the synthesis of every single one of the organism’s proteins. (They created the same kind of integrated model for E. coli, as well.) The expression model, which is based on a network that represents the biochemical reactions that drive transcription and translation, simulates the machinery that a cell uses to build its gene products, and therefore accounts for many things that a standard metabolic model ignores. By integrating the two different kinds of models, Palsson and his team vastly expanded the range of cellular phenomena they could compute and predict. “You just wouldn’t believe what we are calculating with this model now,” he says before going on to list regulons (collections of genes all governed by the same regulators); metabolic engineering designs; and a variety of cellular functions. In the future, Palsson would like to add genetic regulation to the modeling mix, using the kind of data Gerstein has been exploring with the ENCODE project.
Plenty of challenges remain. Palsson points out that modeling the kinetics and thermodynamics of the many biochemical reactions that take place within a cell is computationally difficult, and will require algorithmic advances. “We understand a lot of individual events,” he says, “but putting them all together in a coherent whole is tough.”
Integrating Multiscale Models of Tissues
It’s equally difficult to simulate the behavior of a population of cells distributed in three-dimensional space, such as one might find in a bacterial infection or a major organ.
That was precisely the problem tackled in a study recently published in PLoS Computational Biology.
Ron Weiss, PhD, and his colleagues at the Massachusetts Institute of Technology developed a novel combination of computational methods to design and analyze an artificial tissue homeostasis system—one that uses a synthetic gene network to cause stem cells to grow and differentiate into a stable population of insulin-producing beta-cells of the sort found in the pancreas. (Such a system could be used to help treat type 1 diabetes, in which beta-cells are destroyed as the result of autoimmune defects.)
The network comprises several discrete modules assembled from standard genetic circuitry components: toggle switches and oscillators to control population growth; sender-receiver systems to permit intercellular communication. The question, says Weiss, who is a dual associate professor of biological engineering and of electrical engineering and computer science, was whether he and his colleagues would be able to predict the behavior of the entire system once all the modules were connected. “What happens when you take these known modules and try to integrate them into a much more complex system?” he asks.
Weiss and his team designed several different iterations of their system, each one more sophisticated than the last. And they simulated those systems using three different mathematical models that progressively accommodated more and more complexity: one that used ordinary differential equation simulations, and two that used stochastic differential equation simulations to allow for noise and spatial effects. The spatial effects are important because cells that are distributed across space are exposed to different environmental conditions and can’t communicate instantaneously with each other. The noise effects matter because two cells that contain the same genetic circuits can still produce different amounts of a particular protein due to unpredictable fluctuations in gene expression. To his surprise, Weiss found that having some noise in the system was actually helpful. “Normally, in synthetic biology, you think of noise and heterogeneity as being bad things—things that tend to destabilize the system, things that you want to get rid of,” he says. “In our system, the addition of noise actually stabilizes the system and makes it more robust.”
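The first two rungs of that modeling ladder can be illustrated with a single variable. This toy model (invented rates, one abstract “protein level,” no spatial effects) simulates the same production-and-degradation dynamics twice: once as an ordinary differential equation, and once as a stochastic differential equation that adds a noise term via the Euler-Maruyama method; setting the noise amplitude to zero recovers the deterministic case.

```python
import math
import random

ALPHA, BETA = 5.0, 0.5   # hypothetical production and degradation rates
DT, STEPS = 0.01, 10_000  # 100 simulated time units

def simulate(noise=0.0, seed=1):
    rng = random.Random(seed)
    x = 0.0
    for _ in range(STEPS):
        drift = ALPHA - BETA * x
        # Euler-Maruyama step: dx = drift*dt + noise*sqrt(dt)*N(0,1).
        # With noise = 0 this is just the Euler method for the ODE.
        x += drift * DT + noise * math.sqrt(DT) * rng.gauss(0.0, 1.0)
    return x

deterministic = simulate(noise=0.0)
noisy = simulate(noise=1.0)
print(round(deterministic, 2), round(noisy, 2))
```

The deterministic run settles exactly at its steady state, while the noisy run fluctuates around it; whether such fluctuations destabilize a system or, as Weiss found, stabilize it, is precisely what the stochastic simulations are there to reveal.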
Once again, the low-level integration of the modules—i.e., the act of joining them together to form a larger system—was not the most challenging aspect of the project; for the most part, Weiss explains, it involved defining the interfaces between the various modules and then “gluing one module onto another.” The tricky part was figuring out which bits mattered most to overall system performance—especially since the modules affected one another in unexpected and often non-linear ways.
Weiss and his colleagues first examined how certain module behaviors combined to produce optimal system performance, and then applied Bayesian network inference, which graphically represents the probability that different variables may be related to one another, in order to identify the individual behaviors that had the greatest impact. Weiss feels that such methods represent broadly applicable techniques for aiding integration, just as the team’s decision to proceed from simpler systems and models to more detailed and accurate ones offers a general approach towards system design and understanding that could be useful to others.
Integration from Cell to Organism
If Weiss proved that integrated modeling and analysis could bridge the gap between genetic circuitry and heterogeneous populations of cells, Lars Kuepfer, PhD, showed they could do the same for cells and entire organisms.
Working as part of the Virtual Liver Network, a national initiative funded by the German Federal Ministry for Education and Research, Kuepfer and his colleagues in the computational systems biology group at Bayer Technology Services in Leverkusen, Germany, yoked a genome-scale metabolic model of the kind used by Covert and Palsson to a physiologically-based pharmacokinetic (PBPK) model of the sort used to simulate the availability of drugs in tissues throughout the body. More precisely, they integrated a genome-scale network reconstruction of a human hepatocyte into the liver tissue of a PBPK model representing an adult human being. The resulting multiscale model enables the calculation of thousands of cellular reactions within a whole-body framework containing 200 or so ordinary differential equations and several hundred parameters, ranging from anthropometric details like age and height, to physicochemical ones like the solubility and molecular weight of the compounds under investigation.
Kuepfer and his team were then able to introduce changes at the whole-body level—administering a therapeutic agent or a drug overdose, for example, or generating an abnormally high level of some naturally occurring compound—and track the effects at the cellular level, and even feed the ensuing metabolic perturbations back to the whole-body level, thereby revealing how cellular and extracellular mechanisms influence one another. In a series of case studies, they examined the cellular basis of acetaminophen poisoning; probed the workings of allopurinol, a drug used to treat gout; and looked at how variations in individual physiology (such as liver size) can interact with metabolic disorders (such as an impaired ability to eliminate ammonia) to create otherwise inexplicable levels of biomarkers in the blood.
The Integrative Skill Set
Palsson, who has used metabolic network reconstructions to model the interactions between multiple tissue types, says that multiscale, cell-to-whole-body models like Kuepfer’s are going to make “astonishing progress over the next decade or so,” in part because so many diseases, from cancer to psychiatric disorders, have a metabolic component. Kuepfer is already considering his next steps, and would like to integrate patient-specific metabolome data into the whole-body model. The key challenge, he says, is not the integration per se, which involves using the data generated by the PBPK model to constrain the metabolic one, or pumping the output from the metabolic model into the whole-body simulation. Rather, it is in knowing enough to do both kinds of modeling in the first place.
Kuepfer, who did his graduate work in metabolic modeling and now works for a company that uses pharmacokinetic and pharmacodynamic modeling to evaluate drug candidates, has the tools and experience to work both sides of the street. As things stand today, however, most specialists in genome-scale metabolic network reconstruction probably wouldn’t share his familiarity with PBPK modeling—though Kuepfer does expect a growing number to extend their focus beyond the cellular scale in the future. Palsson, meanwhile, points out that metabolic models, though highly scalable and easy to compute, can also be hard to understand. “How to use them and apply them requires a certain skill set that isn’t commonly available,” he says.
And that might be the common take-away from all of these studies. Figuring out how to integrate the various models and data types involved is one thing; but it is not the only thing, nor is it necessarily the hardest thing. Often, the thorniest issues involve overall design, or conceptual clarity, or individual expertise. Integrative modeling and analysis may hold the keys to many complex computational and biological problems. But they will only lead to meaningful results if researchers give careful thought to the individual components they are attempting to combine—and the problems they are trying to solve.