Big Data Analytics In Biomedical Research
Can the complexities of biology be boiled down to Amazon.com-style recommendations? The examples here suggest possible pathways to an intelligent healthcare system with big data at its core.
“We have recommendations for you,” announces the website Amazon.com each time a customer signs in.
This mega-retailer analyzes billions of customers’ purchases—nearly $40 billion worth in 2011 alone—to predict individuals’ future buying habits. And Amazon’s system is constantly learning: With each click of the “Place your order” button, the company’s databank grows, allowing it to both refine its predictions and conduct research to better understand its market.
These days, this sort of “Big Data Analytics” permeates the worlds of commerce, finance, and government. Credit card companies monitor millions of transactions to distinguish fraudulent activity from legitimate purchases; financial analysts crunch market data to identify good investment opportunities; and the Department of Homeland Security tracks Internet and phone traffic to forecast terrorist activity.
Where is Amazon’s equivalent in healthcare and biomedical research? Do we have a “learning healthcare system” that, like Amazon.com, can glean insights from vast quantities of data and push it into the hands of users, including both patients and healthcare providers? Not even close.
It’s a situation that frustrates and inspires Colin Hill, CEO, president, chairman and cofounder of GNS Healthcare, a healthcare analytics company. “When I go to my doctor for some treatment, he’s kind of guessing as to what drug works,” he says. With the data currently being captured and stored, he says, there’s now an opportunity to take a broader view of the problem. “We need to make this system smarter and use data to better determine what interventions work,” he says.
And there is hope, says Jeff Hammerbacher, who formerly led the data team at Facebook and is now chief scientist at Cloudera, a company that provides businesses with a platform for managing and analyzing big data. “I believe that the methods used by Facebook and others—commodity hardware, open source software, ubiquitous instrumentation—will prove just as revolutionary for healthcare as they have for communications and retail,” he says.
Others agree: “We have to create an infrastructure that allows us to harvest big data in an efficient way,” says Felix Frueh, PhD, president of the Medco Research Institute.
Right now, biomedical infrastructure lags well behind the curve. Our healthcare system is dispersed and disjointed; medical records are a bit of a mess; and we don’t yet have the capacity to store and process the crazy amounts of data coming our way from widespread whole-genome sequencing. And then there are privacy issues (see “Privacy in the Era of Electronic Health Information,” a story also in this issue). Moreover, while Amazon can instantly provide up-to-date recommendations at your fingertips, deploying biomedical advances to the clinic can take years.
Despite these infrastructure challenges, some researchers are plunging into biomedical Big Data now, in hopes of extracting new and actionable knowledge. They are doing clinical trials using vast troves of observational health care data; analyzing pharmacy and insurance claims data together to identify adverse drug events; delving into molecular-level data to discover biomarkers that help classify patients based on their response to existing treatments; and pushing their results out to physicians in novel and creative ways.
Perhaps it’s asking too much to expect that the complexities of biology can be boiled down to Amazon.com-style recommendations. Yet the examples described here suggest possible pathways to the dream of an intelligent healthcare system with big data at its core.
DEFINING BIG DATA IN BIOMEDICINE
Big data in biomedicine is coming from two ends, says Hill: the genomics-driven end (genotyping, gene expression, and now next-generation sequencing data); and the payer-provider end (electronic medical records, pharmacy prescription information, insurance records).
On the genomics end, the data deluge is imminent. With next-generation sequencing—a process that greatly simplifies the sequencing of DNA—it is now possible to generate whole genome sequences for large numbers of people at low cost. It’s a bit of a game-changer.
“Raw data-wise, it’s 4 terabytes of data from one person,” says Eric Schadt, PhD, chair of genetics at Mt. Sinai Medical School in New York City. “But now imagine doing this for thousands of people in the course of a month. You’re into petabyte scales of raw data. So how do you manage and organize that scale of information in ways that facilitate downstream analyses?”
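To make the scale concrete, here is a back-of-the-envelope sketch of the arithmetic behind Schadt's estimate. The 4-terabyte figure comes from his quote; the cohort sizes and the decimal unit convention (1 petabyte = 1,000 terabytes) are assumptions chosen only for illustration.

```python
TB_PER_GENOME = 4      # raw data per person, from Schadt's estimate
TB_PER_PB = 1_000      # decimal units (an assumption of this sketch)


def raw_petabytes(people: int, tb_per_person: float = TB_PER_GENOME) -> float:
    """Return the raw sequencing data volume, in petabytes, for a cohort."""
    return people * tb_per_person / TB_PER_PB


if __name__ == "__main__":
    # Hypothetical cohort sizes, chosen only to show how quickly volume grows.
    for people in (1, 250, 1_000, 5_000):
        print(f"{people:>6,} people -> {raw_petabytes(people):g} PB of raw data")
```

Under these assumptions, a cohort of only 250 people already reaches a full petabyte of raw data.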
For now, as we wait for next-gen sequencing to work its magic, genomics data matrices remain long and thin, with typically tens to hundreds of patients but millions or at least tens of thousands of variables, Hill notes.
“But on the payer-provider data side,” Hill says, “we’re dealing now with large longitudinal claims data sets that are both wide and deep.” A data matrix might have hundreds of thousands of patients with many characteristics for each—demographics, treatment histories, outcomes and interventions across time—but typically not yet thousands or millions of molecular characteristics.
To a great degree, the two sides of biomedical big data have yet to converge. Some researchers work with the clinical and pharmaceutical data; others work with the biomolecular and genomics data. “The bottom line is,” says Eric Perakslis, PhD, chief information officer at the U.S. Food and Drug Administration, “the large body of healthcare data out there has yet to be truly enhanced with molecular pathology. And without that you’re really not getting at mechanisms of action or predictive biology.” Where there is data, he says, “It’s almost this random thing: Molecular data is collected at a few time points but that’s it.”
Nevertheless, Schadt believes that a world where these biomolecular and clinical datasets come together may arrive soon. “In maybe ten years’ time,” he says, “all newborns and everyone walking through the door will have his or her genome sequenced and other traits collected and that information will all be crunched in the context of their medical history to assess the state of the individual.”
THE TOOLS OF BIG DATA ANALYTICS
If you have data sets with millions or tens of millions of patients followed as a function of time, standard statistics aren’t sufficient, especially if you are looking for associations among more than two variables, or data layers. “This is not about genome-wide association studies (GWAS),” Hill says. Such studies typically seek to connect genomic signatures with disease conditions—essentially looking at only two layers of data. “When people start doing this from multiple layers of data, that’s where it becomes non-trivial,” Hill says. “That’s where in my mind it gets to big data analytics rather than biostatistics or bioinformatics.”
Many of the tools of big data analytics are already being used in other fields, says Schadt. “We’re almost latecomers to this game but the same sorts of principles applied by Homeland Security or a credit card fraud division are the kinds of approaches we want to apply in the clinical arena.”
The U.S. Department of Homeland Security, for example, examines such things as cell phone and email traffic and credit card purchase history in an attempt to predict the next big national security threat. They want to consider everything together, letting the data speak for itself but looking for patterns in the data that may signify a threat, Schadt says. They achieve this using machine learning in which computers extract patterns and classifiers from a body of data and use them to interpret and predict new data: They know when a prior threat occurred, so they look for features that would have helped them predict it and apply that looking forward. In a clinical setting, that could mean looking at not only which molecular or sequencing data predicts a drug response but also what nurse was on duty in a particular wing during specific hours when an event occurred. “You just want all this information and then crunch it to figure out what features turn out to be important,” Schadt says.
In addition to machine-learning, Hill says, there is a need for approaches that scale up to the interpretation of big data. In his opinion, this means using hypothesis-free probabilistic causal approaches, such as Bayesian network analysis, to get at not only correlations, but cause and effect.
He points to strategies developed by Daphne Koller, PhD, professor of computer science at Stanford University, as an example of what can be done. Much of her work involves the use of Bayesian networks—graphical representations of probability distributions—for machine learning. These methods scale well to large, multi-layered data sets, he says. Hill’s company, GNS Healthcare, has developed its own variation, which they call “reverse engineering and forward simulation” (REFS). “We break the dataset into trillions of little pieces, evaluating little relationships,” he says. Each fragment then has a Bayesian probabilistic score signaling how likely the candidate relationship is as well as the probability of a particular directionality (an indication of possible cause and effect). After scoring all of the possible pair-wise and three-way relationships, REFS grabs the most likely network fragments and assembles them into an ensemble of possible networks that are robust and consistent with the data. That’s the reverse engineered part. Next comes forward simulation to predict outcomes when parts of each network are altered. This procedure allows researchers to score the probability that players in the ensemble of networks are important and to do so in an unbiased way across a large dataset.
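The idea of scoring many small candidate relationships can be illustrated with a toy sketch. This is not REFS, whose internals are not described here. It is a minimal stand-in that assumes binary variables and a uniform Beta(1, 1) prior, and it scores only whether a pairwise relationship is likely to exist. It does not attempt directionality, three-way relationships, ensemble assembly, or forward simulation. All function names and the toy data are hypothetical.

```python
from itertools import combinations
from math import lgamma


def log_marginal(ones: int, zeros: int) -> float:
    """Log marginal likelihood of a binary sequence under a uniform Beta(1, 1) prior."""
    return lgamma(ones + 1) + lgamma(zeros + 1) - lgamma(ones + zeros + 2)


def log_bayes_factor(x: list[int], y: list[int]) -> float:
    """Log evidence that y depends on x, relative to y ignoring x entirely."""
    independent = log_marginal(sum(y), len(y) - sum(y))
    dependent = 0.0
    for value in (0, 1):
        # Score y separately within each level of x.
        subset = [yi for xi, yi in zip(x, y) if xi == value]
        dependent += log_marginal(sum(subset), len(subset) - sum(subset))
    return dependent - independent


def rank_pairs(data: dict[str, list[int]]) -> list[tuple[float, str, str]]:
    """Score every pair of variables and return them, strongest evidence first."""
    scored = [
        (log_bayes_factor(data[a], data[b]), a, b)
        for a, b in combinations(sorted(data), 2)
    ]
    return sorted(scored, reverse=True)


if __name__ == "__main__":
    # Made-up data: 'drug' and 'response' move together, 'noise' does not.
    toy = {
        "drug":     [1, 1, 1, 1, 0, 0, 0, 0, 1, 0, 1, 0],
        "response": [1, 1, 1, 1, 0, 0, 0, 0, 1, 0, 1, 0],
        "noise":    [1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0, 0],
    }
    for score, a, b in rank_pairs(toy):
        print(f"{a:>8} - {b:<8} log Bayes factor = {score:+.2f}")
```

A positive score favors a relationship and a negative score favors independence. A real system would have to go much further, but the sketch shows why the work scales so steeply: the number of candidate pairs alone grows with the square of the number of variables.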
Schadt agrees that such data-driven approaches are essential, and he uses them in his own work. But he says big data analytics covers a vast computational space ranging from bottom-up dynamical systems modeling to top-down probabilistic causal approaches—whatever approach (including hypothesis-driven), he says, “can derive meaningful information to aid us in understanding a disease condition or drug response or whatever the end goal is.” Essentially, he says, it’s not the approach that defines big data analytics, but the goal of extracting knowledge and ultimately understanding from big data.
Perakslis views the problem somewhat differently. “In order to get translational breakthroughs, you have to start out with an intentional design, which starts with intentional sampling,” he says. “And to be honest, I don’t think it works well yet hypothesis free.” In a GWAS, he says, of course you look at everything because you don’t know where to look. But the answers you seek can be lost in the noise.
Perakslis is more interested in broadening the types of data that are brought to bear in focused clinical trials. “Too often, when people make decisions, they are only looking at part of the story,” he says. So, for example, if at the end of a Phase III clinical trial a drug doesn’t produce the degree of success needed for approval, the database should be rich with information to help figure out why and where to go from there. TranSMART, a clinical informatics database that Perakslis helped assemble when he worked at J&J, does just that: It integrates different types of data into one location.
CLINICAL & PHARMACEUTICAL BIG DATA: ALREADY ABUNDANT
These days, for certain large healthcare organizations, large quantities of data simply accrue as an inevitable part of doing business. This is true of most hospitals, health maintenance organizations (HMOs), and pharmacy benefits managers (also known as PBMs). In these settings, “You really are getting at big data on the scales of LinkedIn, Amazon, Google, eBay and Netflix,” says Yael Garten, PhD, a senior data scientist at LinkedIn who received her doctorate in biomedical informatics from Stanford University.
For example, Kaiser Permanente (KP), an HMO, has a 7-terabyte research database culled from electronic medical records, says Joe Terdiman, MD, PhD, director of information technology at Kaiser Permanente Northern California’s Division of Research. That doesn’t include any imaging data or genomics data. This special research database has been pre-cleaned and standardized using SNOMED CT, an ontology of medical terms useful for research. “By cleaning and standardizing the data and making it easily accessible, we hope to do our research faster and more accurately,” Terdiman says.
Medco, a PBM, accumulates longitudinal pharmacy data “because we are who we are and do what we do,” Frueh says. As a large PBM that covers about 65 million lives in the United States, Medco manages the pharmaceutical side of the healthcare industry on behalf of payers. Their clients are health plans and large self-insured employers, state and governmental agencies, as well as Medicare. The company has agreements with some of these clients who provide large sets of medical claims data for research purposes. From the claims data, Medco can extract patient indications, treatments, dates of treatment, and outcomes (for example, whether the patient was hospitalized or not). Putting this multi-layered data together, Medco can search for associations between drug use, patient characteristics, and clinical impact (good, bad or indifferent) in order to determine whether a drug works the way it should.
And at Medco, big data analytics has already reaped dividends by uncovering drug-drug interactions. For example, clopidogrel (Plavix™) is a widely used drug that prevents harmful blood clots that may cause heart attacks or strokes. However, researchers were concerned that certain other drugs—proton-pump inhibitors used to reduce gastric acid production—might interfere with its activation by the body. Using their database, Medco looked for differences in two cohorts: those on one drug and those on the two drugs that potentially interact. The study revealed that patients taking both Plavix and a proton-pump inhibitor had a 50 percent higher chance of cardiovascular events (stroke or heart attack).
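A two-cohort comparison of this kind typically comes down to a relative risk. The sketch below shows that calculation with a standard log-scale 95 percent confidence interval. The patient counts are invented to reproduce a 50 percent higher risk and are not Medco's data. A real observational study would also need to adjust for confounding differences between the cohorts.

```python
from math import exp, log, sqrt


def relative_risk(events_exposed: int, n_exposed: int,
                  events_control: int, n_control: int) -> float:
    """Risk in the exposed cohort divided by risk in the control cohort."""
    return (events_exposed / n_exposed) / (events_control / n_control)


def confidence_interval(events_exposed: int, n_exposed: int,
                        events_control: int, n_control: int,
                        z: float = 1.96) -> tuple[float, float]:
    """Approximate 95% interval for the relative risk, computed on the log scale."""
    rr = relative_risk(events_exposed, n_exposed, events_control, n_control)
    se = sqrt(1 / events_exposed - 1 / n_exposed
              + 1 / events_control - 1 / n_control)
    return exp(log(rr) - z * se), exp(log(rr) + z * se)


if __name__ == "__main__":
    # Hypothetical counts: (cardiovascular events, patients) per cohort.
    both_drugs = (1_500, 10_000)   # clopidogrel plus a proton-pump inhibitor
    one_drug = (1_000, 10_000)     # clopidogrel alone

    rr = relative_risk(*both_drugs, *one_drug)
    low, high = confidence_interval(*both_drugs, *one_drug)
    print(f"relative risk = {rr:.2f} (95% CI {low:.2f} to {high:.2f})")
```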
A similar study showed that antidepressants block the effectiveness of tamoxifen taken to prevent breast cancer recurrence. Patients taking both drugs were twice as likely to experience a recurrence.
“Both of these studies are prototypical of the kinds of questions we can ask in our database where we can correlate pharmacy data with clinical outcome data,” Frueh says.
Though Medco’s outcomes are impressive, they have thus far relied on fairly straightforward statistical and epidemiological methods that were nevertheless quite labor intensive. “The hands-on analytics time to write the SAS code and specify clearly what you need for each hypothesis is very time-consuming,” Frueh says. In addition, the work depends on having a hypothesis to begin with—potentially missing other signals that might exist in the data.
To address this limitation, Medco is currently working with Hill’s GNS Healthcare to determine whether a hypothesis-free approach could yield new insights. So in the Plavix example, rather than starting with the hypothesis that proton-pump inhibitors might interact with drug activation, Frueh says, “We’re letting the technology run wild and seeing what it comes up with.”
Because GNS Healthcare’s REFS platform automates the process, he says, Medco can take the strongest signals from the data and avoid wasting time on hypotheses that don’t lead to anything. Right now they are confirming whether the strongest findings identified by applying the REFS platform to the Plavix database actually hold up to more in-depth analysis.
ADDING GENOMICS TO THE MIX
The REFS platform developed by GNS Healthcare also functions in contexts that include genomic data. For example, in work published in PLoS Computational Biology in March 2011, GNS Healthcare and Biogen identified novel therapeutic intervention points among the one-third of arthritis patients who don’t respond to a commonly used anti-inflammatory treatment regimen (TNF-α blockade). The clinical study sampled blood drawn before and after treatment of 77 patients. The multi-layered data included genomic sequence variations; gene expression data; and 28 standard arthritis scoring measures of drug effectiveness (tender or swollen joints, C-reactive protein, pain, etc.). Although the analysis was entirely data driven, the second-highest-rated intervention point it identified was the known target of the drug. The top-rated intervention point, a new target, is now being studied by Biogen.
“To my knowledge,” Hill says, “this is the first time that a data-driven computational approach (rather than a single biomarker approach) has been applied to do this in a comprehensive way.” And although the number of patients was relatively small, Hill says, the study suggests that researchers can now interrogate computer models of drug and disease biology to better understand cause and effect relationships from the data itself, without reliance on prior biological knowledge.
“If you ask me why we’re doing this,” Hill says, “it’s because it’s going to cure cancer and other diseases and there’s no other way to do it than by using big data analytics…. If you do discovery the way it’s been done until now, it just doesn’t cut it.”
Today, rather than deal with the vastness of genomics data, Schadt says, many researchers distill it down to look only at the hundred or so gene variants they think they know something about. But this will be a mistake in the long run, Schadt says. “We need to derive higher level information from all of that data without reducing dimensionality to the most naïve level. And then we need the ability to connect that information to other large data sources such as all the types of data gathered by a large medical center.”
The eMERGE Network, an NIH-funded collaboration, is taking a running start at doing this by linking electronic medical records data with genomics data across seven sites. Researchers will be able to study cohorts extracted from this “big data” without having to actively recruit and gather samples from a study population.
To a great extent, though, the eMERGE Network is still building its repository and confirming that it can repeat known results. The analytics are only now getting underway.
Kaiser Permanente, like the eMERGE Network, is currently building what will be one of the largest biorepositories anywhere, with genotype data from 100,000 patients. “We hope to reach 500,000,” Terdiman says.
But Kaiser is still sorting through what sort of platform to use for the data. They are looking at Hadoop—an up-and-coming open-source distributed-computing framework for storing and managing big data—as well as other possibilities. “With 100,000 patients genotyped, and each one has 700,000 SNPs, that’s a pretty big matrix,” Terdiman says. And then when you associate that with phenotypic data from the electronic medical record, he points out, “there’s a combinatorial effect of all these variables such that simple or even relatively fast processors might take weeks to do a single analysis.” GWAS programs usually run on small samples, and Terdiman doesn’t yet know how well they will scale to the full genotyped database. “No one, literally, has had the amount of data to do GWAS studies that we have,” he says.
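The size of the matrix Terdiman describes is easy to work out. In the sketch below, the patient and SNP counts come from his quote. The storage density of 2 bits per genotype and the count of 1,000 phenotypes are assumptions made only to show how the numbers multiply.

```python
PATIENTS = 100_000   # from Terdiman's quote
SNPS = 700_000       # from Terdiman's quote


def genotype_cells(patients: int, snps: int) -> int:
    """Number of entries in a patients-by-SNPs genotype matrix."""
    return patients * snps


def storage_gigabytes(cells: int, bits_per_cell: int) -> float:
    """Storage needed for the matrix, in decimal gigabytes."""
    return cells * bits_per_cell / 8 / 1e9


def association_tests(snps: int, phenotypes: int) -> int:
    """Single-SNP tests needed if every SNP is tested against every phenotype."""
    return snps * phenotypes


if __name__ == "__main__":
    cells = genotype_cells(PATIENTS, SNPS)
    print(f"matrix entries: {cells:,}")
    # Assumption: 2 bits per genotype, a compact encoding.
    print(f"storage at 2 bits per entry: {storage_gigabytes(cells, 2):g} GB")
    # Assumption: 1,000 phenotypes drawn from the medical record.
    print(f"SNP-by-phenotype tests: {association_tests(SNPS, 1_000):,}")
```

Storing the genotypes is manageable on these assumptions. The heavier cost is computational: every additional phenotype adds another 700,000 tests, each one run across 100,000 patients.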
Really, says Frueh, the data deluge from whole genome sequencing is just beginning. Frueh would love to tie Medco’s data to genomics biorepositories but there just isn’t enough data yet. Frueh notes that he could possibly partner with labs or organizations that have done large GWAS but, he says, unless you’re asking the same questions as the GWAS, you won’t get a lot of depth in those studies, especially after matching people to the pharmacy database. “You go from large to small numbers very quickly,” he says.
Stephen McHale, CEO of Explorys, a big data bioinformatics company based in Cleveland, Ohio, says that traditional relational data-warehousing technology can’t efficiently handle the 30 billion clinical elements in their dataset. So Explorys implemented the type of data architecture that supports Google, Yahoo and Facebook. “Our technology is all column store and using MapReduce and those kinds of architectures,” he says, referring to approaches that use large numbers of computers to process highly distributable problems across huge datasets. He says that, to his knowledge, it’s a first in the medical space. “We needed that sort of architecture to support this much data.” And with genomics coming their way, it seems even more essential to use these types of architecture, McHale says. Explorys is now working on some pilot initiatives to integrate genomic data with observational data.
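For readers unfamiliar with MapReduce, the pattern can be sketched in a few lines. The example below runs in a single process and is neither Explorys's code nor Hadoop's API. The record shape and the clinical codes are hypothetical. In a real cluster, the map and reduce steps would run in parallel on many machines, with the framework handling the grouping step in between.

```python
from collections import defaultdict
from typing import Iterable, Iterator

# Hypothetical record shape: (patient_id, clinical_code).
Record = tuple[str, str]


def map_phase(record: Record) -> Iterator[tuple[str, int]]:
    """Emit one (code, 1) pair for each clinical element."""
    _patient_id, code = record
    yield code, 1


def shuffle(pairs: Iterable[tuple[str, int]]) -> dict[str, list[int]]:
    """Group emitted values by key, as the framework would between phases."""
    groups: dict[str, list[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(groups)


def reduce_phase(key: str, values: list[int]) -> tuple[str, int]:
    """Collapse all the values for one key into a single count."""
    return key, sum(values)


def run_job(records: Iterable[Record]) -> dict[str, int]:
    """Run map, shuffle, and reduce in sequence on one machine."""
    mapped = (pair for record in records for pair in map_phase(record))
    return dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())


if __name__ == "__main__":
    records = [
        ("p1", "hypertension"),
        ("p2", "hypertension"),
        ("p1", "diabetes"),
    ]
    print(run_job(records))
```

Because each record is mapped independently and each key is reduced independently, the work can be split across as many machines as the data requires. That property is what makes the approach attractive for tens of billions of clinical elements.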
MAKING BIG DATA ACTIONABLE
Extracting knowledge from big data is a huge challenge, but perhaps a greater one is ensuring that big data infrastructure will form the backbone of an effort to push Amazon.com-style recommendations to practitioners and patients.
Garten notes that implementing an Amazon or LinkedIn style recommendation system in biomedicine will be tough. Such systems use machine learning and natural language processing to, in a sense, bucket customers into groups. “But the ability to bucket people together is harder in biomedicine,” Garten says. “The slightest variations can matter a lot in terms of how we metabolize drugs or respond to the environment, so the signal is harder to find.” The stakes are also higher for getting a false result.
But Medco’s experience suggests such bucketing is already possible, at least to some extent. In the Plavix case described above, for instance, Medco was in a position to immediately effect a change: “We can pull a switch and say that each and every pharmacist on our list needs to be told about this,” Frueh says. After implementing the rule, Medco saw a drop of about one-third in co-use of the interacting drugs. “This is one example where the use of big data in this stepwise process has cut down on the time it takes to get changes into clinical practice,” Frueh says.
In another example, Medco was able to use its infrastructure to increase uptake of a genotyping test for warfarin dosing. First, however, they had to show payers that the test was cost-effective. In a clinical trial conducted in collaboration with Mayo Clinic, Medco showed that genotyping reduced the rate of hospitalizations among warfarin-dosed patients by 30 percent. Armed with that information, payers became supportive of Medco reaching out to physicians to suggest they use the genotyping test before prescribing warfarin. Because of Medco’s big data infrastructure, this outreach could be easily accomplished: Each time a physician prescribed warfarin, a message was routed back through the pharmacy to the physician, suggesting use of the test. The result: an increase in uptake of the test from a rate of 0.5 percent or so in the general physician population up to approximately 20 to 30 percent by physicians in the network.
“This has to do with creating an environment and the operational infrastructure to be proactive,” Frueh says. And Frueh suspects that uptake of the test will continue to grow. “We’re probably at the beginning of what we hope will be a hockey-stick shaped uptake of this test.” The lesson: Big data, and the connectedness of big data to the real world, provides the opportunity to take advantage of teachable moments at the point of care.
As we go from data generation to knowledge about what it means, to making that knowledge actionable, Schadt says, “It will impact clinical decisions on every level.”
PLAYING CATCH UP AND THEN SOME
To some extent, big data analytics in biomedicine lags finance and commerce because it hasn’t taken advantage of commercial methods of handling large datasets—like Hadoop and parallelized computing. “These allow data analytics in an industry-level manner,” Garten says. “That’s something that LinkedIn, Amazon and Facebook have already nailed, and bioinformatics is lagging behind those industries.”
Bioinformatics researchers still spend a lot of time structuring and organizing their data, preparing to harvest the insights that are the end goal, says Garten. By contrast, the private sector has completed the phase of structuring and collecting data in an organized fashion and is now investing more and more effort toward producing interesting results and insights. Eventually, Garten says, “the practices and experience from the corporations with large amounts of data (i.e., LinkedIn, Amazon, Google, Yahoo) will propagate back to the academic and research setting, and help accelerate the process of organizing the data.”
At the same time, bioinformatics actually has something to offer the broader world, Garten says. She and others with a bioinformatics background who have moved into other arenas bring to the table an ability to handle messy data that is often incomplete. The expertise in integrating various datasets in creative ways to infer insights from this data, as is done in translational bioinformatics, is useful for extracting business insights in other industries.
Hill also sees biomedical approaches filtering outward. “REFS is data-agnostic,” he says. It can work on genomic data as easily as clinical data—or, for that matter, financial data. Hill’s company recently created a financial spinoff called FINA Technologies. He also spun off Dataspora, which is focused on consumer ecommerce. “We’ve created a technology that goes all the way from unraveling how cancer drugs work to predicting financial markets,” Hill says. “This technology is applicable to how complex systems work in different industries, and there’s something profound about that.”