The Cure for Cancer Is Data: Mountains of Data


A few years ago Eric Schadt met a woman who had cancer. It was an aggressive form of colon cancer that had come on quickly and metastasized to her liver. She was a young war widow from Mississippi, the mother of two girls she was raising alone, and she had only the health care that her husband’s death benefits afforded her: an overburdened oncologist at a military hospital, the lowest rung on the health care ladder. The polar opposite of cutting-edge medicine. To walk into such a facility with stage 4 metastatic disease is to walk back in time to the world of the unmapped human genome, when colon cancer was understood to have a single cause instead of millions of causes resulting in unique variations, when treatment was the same bag of poison, whether you were in Ocean Springs, Mississippi, or Timbuktu. A time without big data, machine learning, or hope.

Schadt had just started the Icahn Institute for Genomics and Multiscale Biology at Mount Sinai Hospital, and when he heard about the woman in Mississippi, he said, simply, “That’s exactly the kind of patient we take.” By that he meant patients for whom the current standard of care would fail, for whom the future of medicine, one in which supercomputers sift through masses of genetic data for patterns that could lead to new treatments and cures, could not arrive fast enough.

Schadt isn’t a cancer specialist or even a medical doctor. He’s a mathematician and a specialist in molecular and computational biology, and he had never had a single patient in his life. Yet through his new lab at Sinai, Schadt would generate a terabyte of data on this woman’s cancer, thousands of times what she could have expected in a conventional medical setting, in the hope of finding new ways to combat it. Toward the end, Schadt would sit at her bedside, distraught. They had become close, and the scientist who had never had patients before was seeing the implications of scientific ambition and failure. She died last year.

Seated at his desk at Mount Sinai, Schadt is direct and disarming. At 51, he wears a short-sleeved polo shirt and shorts everywhere he goes, even to black-tie galas or in New York winters, which gives him the unassailable air of a true eccentric, or a high-school football coach. “For any medical researcher, it’s easier to be bullish when you’re publishing papers or developing drugs, layers removed from the human impact of your work. But living the effect of your work and watching someone slowly die in front of you, well, that’s a deeper humbling than I’d ever experienced before,” Schadt says today.

“We’re on this exponential growth curve, where your mind naturally projects all the way into the future, and you think: We’re going to figure this out,” he says. “In the end, we will know what all these cells are doing, what all these perturbations do. The humbling part is that as we are on this growth curve, we are continually struck by the increasing complexity that is revealed.”

For a decade we’ve been talking about the potential of gene sequencing and personalized medicine, how advances in computer processing power combined with an increasingly intimate understanding of our individual genomes have put us on the threshold of an age of miracles. With enough data, the theory goes, there’s not a disease that isn’t druggable. But as Schadt has learned, it’s not enough to plumb the depths of an individual’s DNA. It requires a universe of data, exabytes’ worth, to detect patterns in a population, apply machine learning, find the network of mutations responsible for disease, and do something about it. The bigger these data sets become, the more accurate and powerful the models and the predictors become.

The problem is getting these exabytes of genetic data. Turns out you can’t just walk up to people, millions of them, and say, “Your data, please.” You must first persuade them that you’ll only do good things with it and won’t let it fall into the wrong hands. (We do like our privacy.) You must then convince the medical centers and genetic companies that collect this data that, rather than hoard it for their own profit, they should share it so the entire research community can attain the economies of scale (the critical mass of data, individual sets eventually numbering in the millions) that Schadt and many others believe is necessary to understand the causes of diseases and engineer new treatments and cures.

Right now, that volume of information is simply not available. But companies ranging from tech behemoths to biomedical startups are racing to solve these issues of scale. And Schadt wants in.

If human biological complexity can be likened to an animated movie, then a hundred years ago we had about one pixel’s worth of understanding of that complexity. With a single pixel, you have no idea what the story is. But with more pixels, hundreds or thousands (or say, 1 percent of the whole in pixels), patterns and themes begin to emerge. The beginning of a narrative.

This was the thinking that compelled Schadt to set up the Icahn Institute in 2011 after a decade of developing drugs for Merck. (At one point, half of Merck’s metabolic drugs, which treat ailments like heart disease, diabetes, and obesity, were derived from Schadt’s research.) In the face of widely held assumptions based on the single-gene model of disease and drug development, he came to believe that genes worked not alone but in vast networks to enable disease to penetrate our natural defenses, and we could understand these networks only through deep bioinformatic spelunking. To explore his complexity model, Schadt arrived at Mount Sinai with $150 million of financier-philanthropist Carl Icahn’s money and built a supercomputer named Minerva in the basement to analyze the thousands of genomes collected at Mount Sinai each year. He hired other quants, including Jeffrey Hammerbacher, who had created Facebook’s first-ever data team. According to an esteemed oncologist at the medical school, “All of a sudden you had all these math nerds running around, people who looked like they should be programming videogames.”

It didn’t take long for Schadt to realize that he was going to need a bigger boat. In 2014 the Icahn Institute started a joint venture with Sage Bionetworks to try to cure rare childhood diseases (cystic fibrosis, sickle cell anemia, Tay-Sachs), 170 in all. They called it the Resilience Project, and researchers set out to find individuals in the population who carried the DNA variants for those diseases but somehow, through some inoculating buffer, didn’t have the disease. In their search for these resilient individuals, Schadt and his team amassed a pool of genetic data from 600,000 people, then the largest such genetic study ever conducted, with data assembled from a dozen sources (23andMe, the Beijing Genomics Institute, and the Broad Institute of MIT and Harvard, most notably). But in searching the 600,000 genomes, the researchers found potentially resilient individuals for only eight of the 170 diseases they were targeting. The study size was too small. By calculating the frequency of the disease-causing mutations in the population, Schadt and his team came to believe that the number of subjects they’d need to be useful wasn’t 600,000; it was more on the order of 10 million. For all the computational power behind the Resilience Project and what seemed like a wealth of data, Schadt still lacked the quantity and quality of patient information required to crack the genetic code behind resilience.
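
The sample-size reasoning in that paragraph comes down to simple arithmetic, and a rough sketch may help make it concrete. The Python below is an illustration only: the carrier frequency, the share of carriers who stay healthy, and the target count are hypothetical placeholders picked so the result lands on the same order of magnitude as the figure quoted above. They are not numbers from the Resilience Project, whose actual analysis was surely more involved.

```python
def expected_resilient(cohort_size: int, carrier_one_in: int, resilient_one_in: int) -> float:
    """Expected number of resilient people in a cohort.

    carrier_one_in:   one person in this many carries the disease-causing variant
    resilient_one_in: one carrier in this many never develops the disease
    Both are hypothetical inputs for illustration.
    """
    return cohort_size / (carrier_one_in * resilient_one_in)


def cohort_needed(target_resilient: int, carrier_one_in: int, resilient_one_in: int) -> int:
    """Cohort size at which we would expect to find `target_resilient` people."""
    return target_resilient * carrier_one_in * resilient_one_in


# Hypothetical values, NOT taken from the study.
CARRIER_ONE_IN = 100_000   # assume 1 in 100,000 people carries the variant
RESILIENT_ONE_IN = 20      # assume 1 in 20 carriers shows no disease
TARGET = 5                 # assume we want about 5 resilient people per disease

found = expected_resilient(600_000, CARRIER_ONE_IN, RESILIENT_ONE_IN)
needed = cohort_needed(TARGET, CARRIER_ONE_IN, RESILIENT_ONE_IN)
print(f"Expected resilient individuals among 600,000 people: {found}")
print(f"Cohort needed to expect {TARGET}: {needed:,}")
```

Under these assumed rates, a 600,000-person pool would be expected to yield less than one resilient individual for such a disease, which is the sense in which a study that size can be too small.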

“We need 100 Mount Sinais to achieve the scale required to recognize the patterns in patient data that guide you to diagnoses and treatments,” Schadt says. “In the five years that I’ve been here, I’ve realized that’s just not going to happen within the medical centers. They’re too isolated from each other, too competitive, and they’re not woven together into a coherent framework that enables the kind of advancements we’re seeing in nearly all other industries.” Since the major medical centers hold an effective monopoly over their patients’ data and have little economic incentive to collaborate with one another in critical research areas, Schadt says, “the disruption is gonna happen outside the medical establishment.”

So that’s what Schadt is aiming to build by establishing his own genetic data company, Sema4. The New York-based venture will focus on acquiring and expanding companies that specialize in genetic testing (think cancer-carrier screenings and noninvasive prenatal tests) in order to collect and share millions of individual data sets. On Sema4’s searchable platform, doctors will have instant access to a world of genomes to help diagnose their patients. Pharmaceutical companies will pay to use the system to find patient populations for clinical trials. And scientists, their current analytic arsenals amplified through ever more powerful computers and machine-learning algorithms, will finally possess enough genetic data to fuel ambitious research.

Though a handful of tech giants are venturing into the life sciences (see “Big Bets on Biodata,” below) and the National Institutes of Health is asking for a million volunteers to create its own massive biobank, Schadt believes that Sema4 and other startups like it (Craig Venter’s Human Longevity and Patrick Soon-Shiong’s Nant-Health chief among them) are the most committed to achieving the optimal scale of genetic data. While these companies will compete with one another to collect ever greater stores of high-quality biodata, Sema4 will stand out by making its genetic library accessible and free of charge to academic medical centers and nonprofit researchers around the world. Should any of Sema4’s competitors need to harvest information from a subset of Schadt’s data populations, he says, they could simply pay to access the Sema4 search platform. Or Sema4 and other companies could join forces to assemble large data sets for ambitious endeavors like the Resilience Project, only bigger.

Big Bets on Biodata

How four tech heavyweights are going all-in on life science.

Gregory Barber


Using machine learning for their Baseline study, Alphabet’s Verily Life Sciences team will pore over genomic, clinical, and imaging data from thousands of healthy volunteers in the hope of better understanding what makes them healthy, knowledge that might help keep people from getting sick in the first place.


In the 1970s, the World Health Organization used IBM hardware to hunt down the last vestiges of smallpox. Today IBM is partnering with hospitals to funnel health data into Watson, its Jeopardy!-winning AI system. The goal is to predict disease, personalize treatment, and even power virtual medical assistants to sift through records and research.


Using Apple’s ResearchKit, scientists can recruit clinical study subjects en masse and collect real-time health data from participants’ iPhones. Last spring the company added CareKit, which lets Apple users share health data directly with their personal doctors.


Microsoft is developing tiny sensors to be worn on the skin that can transmit biometric data to remote health monitors (and, potentially, large-scale data aggregators). The company also just announced its plan to use machine learning and biological data to solve cancer.

Still, Schadt argues, the problem of scale can’t be solved by companies simply pooling their data. It’s about getting the data from the patients themselves. Based on his experience at Mount Sinai, he’s seen a leap in recent years in the number of people who are coming around to his belief that there is more upside than down to having a physician know their genetic predisposition to certain conditions. He says that when he got to Mount Sinai in 2011, the hospital was screening a few thousand genetic samples a year. This year, they could screen up to 150,000, most of them collected from patients in the New York region, and at Sema4, Schadt says, “we intend to scale that up to 500,000 to a million samples a year.”

That growth will occur by buying and expanding existing genetic testing companies all over the country, most of which are now independent from each other but under Sema4 will combine to create a massive network of genetic information governed by a uniform standard of security and consent. Schadt acknowledges that it’s no simple task to ask a person to give up their biodata to an anonymous corporation. Even though billions of public- and private-sector dollars have been spent to modernize and secure existing data networks, breaches and leaks remain a fact of life. At Sema4, patients will be told, in detail, how their data will be encrypted, anonymized, and scrubbed of identifying information (except for an encryption key). Even in the event of a breach, the chance of someone being identified and exposed is exceedingly low.

There is also the issue of informed consent (the patient’s understanding and approval of the whats, hows, whys, and how longs of whatever they’re asked to endure), which impacts both the quality and the quantity of the data being collected. “There are companies today that claim access to millions of patient records,” Schadt explains. “But from the standpoint of what we intend to do, the data is meaningless. It’s often inaccurate, incomplete, and not easily linked across systems. Plus, that data doesn’t typically include access to DNA or to the genomic data generated on their DNA.” To take the example of the Resilience Project, it wasn’t simply that the universe of data was too small; it was also that the 600,000 genomes were governed under a hash of various consenting arrangements. If something vital was discovered, hundreds of thousands of participants could not be recontacted or tracked, making the data useless from a practical research standpoint.

Today, most consent forms are designed to be as quick and uninformative as possible, but rather than make it easier for researchers to get high-quality data, this approach actually makes it harder. Studies have shown that the more informed the consent, the better the information, since patients are more willing to participate in follow-up exams and interviews when they appreciate the purpose of the research. (This also allows scientists to track health and wellness over time.) At Sema4, Schadt is adopting a multistage informational process, which includes a mandatory, must-pass quiz, so it will be clear that patients understand the full scope of what they’re consenting to. This will require more of a patient’s time, but Schadt is betting that as more patients understand, more of them will consent to sharing their genetic information.

With this digital infrastructure in place, Schadt envisions a future in which more and more patients share not only their genomes but also medical and lifestyle information collected by monitoring devices like glucometers, blood-pressure trackers, and inhalers. The hope is that, ultimately, these increasingly sophisticated, increasingly patient-friendly tests will be so comprehensive that a patient’s microbiome can be regularly sequenced, their RNA frequently examined, and their blood cells constantly monitored for signs of trouble.

The virtual monopoly that medical centers like Mount Sinai now exercise over patient data will be smashed, and researchers will finally have the masses of genetic data that the medical breakthroughs of the future require. “Can we do better for human well-being if information is more broadly accessible, where you’re leveraging the mindshare of the entire planet to evolve the models of disease?” Schadt asks. “Absolutely.” This is medicine as math, not guesswork, and every disease, even stage 4 cancer, might one day be druggable.

This exclusive online extra accompanies our special November issue, guest-edited by President Barack Obama.
