Cancer and big data analytics
This feature originally appeared on the new Cornell Research site.
There is no denying that cancer is an incredibly complex disease; a single tumor can have more than 100 billion cells, and each cell can acquire mutations individually. The disease is always changing, evolving, and adapting. To best understand its evolution, clinicians and researchers need to obtain snapshots of the tumor’s genetic makeup. The more frequently such snapshots are acquired, the easier it is to understand how cancer evolves. The measurements underlying these snapshots generate tremendous amounts of information. It’s in this information that Olivier Elemento, Weill Cornell Medicine, wants to identify patterns that will help prevent, diagnose, treat, and ultimately, cure cancer.
To that end, Elemento employs the power of big data analytics and high-performance computing. “My research is fueled by new technologies that allow us to query cancer in ways we weren’t able to before,” says Elemento, who originally trained as an engineer. Early in his graduate studies, he says, he shifted to computational biology after realizing how great the computational needs of cancer research were.
Today, Elemento’s lab focuses on identifying important mutations in the cancer genome, understanding how the cancer genome changes over time, and discovering new potential cancer drugs.
Patient Samples, Big Data Analytics, and Machine Learning
A major thrust of the Elemento lab’s research is sequencing cancer genomes to guide patient diagnosis and treatment.
These efforts produce huge amounts of data because of the sheer quantity of DNA being sequenced. The researchers break a cancer genome into fragments roughly 100 base pairs long and sequence hundreds of millions of these pieces. Custom software running on supercomputers then pieces all of the data back together.
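To give a rough sense of what that reassembly step involves, here is a minimal Python sketch that places short fragments onto a reference sequence by exact matching. The reference and the reads are invented for the example, and production aligners such as BWA or Bowtie use far more sophisticated indexing and tolerate sequencing errors; this is a conceptual illustration only, not the lab’s pipeline.

```python
# Conceptual sketch only: placing short sequencing "reads" onto a reference by
# exact string matching. Real aligners handle billions of reads, sequencing
# errors, and repeats with sophisticated index structures; nothing here is
# drawn from the lab's actual software. The sequences below are invented.

reference = "ACGTACGTTAGCCGATAGCTAGGCTTACGATCGATCGTACGATCGATCG"

# Tiny stand-ins for the roughly 100 base-pair fragments described above.
reads = ["TAGCCGATAG", "GGCTTACGAT", "ACGATCGATC"]

def align_exact(read, ref):
    """Return the 0-based position of the first exact match, or None."""
    pos = ref.find(read)
    return pos if pos != -1 else None

for read in reads:
    pos = align_exact(read, reference)
    if pos is not None:
        print(f"read {read} aligns at reference position {pos}")
    else:
        print(f"read {read} could not be placed")
```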
But a sequenced genome doesn’t provide answers on its own. The challenge is identifying the critical mutations within it.
“Not all of these mutations are equally important,” says Elemento. “We want to find the ones that drive the proliferation of tumor cells, because you can target them and potentially kill tumor cells.”
That’s where additional measurements on patient samples, big data analytics, and machine learning come in. The researchers perform assays that measure the effect of mutations in the genome. One method is to examine changes in the transcriptome—the entire set of genes that are expressed. These assays create enormous amounts of additional information, which is then integrated with the DNA sequencing data.
“Humans have about 25,000 genes, and those genes are expressed at very different levels, and expression levels are perturbed in disease,” Elemento explains. “Normal cells express a pattern relatively conserved across cells in humans. We have to employ sophisticated pattern and machine learning algorithms to identify patterns that are potentially linked to disease.”
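To make that idea concrete, the sketch below uses simulated expression values, not real patient data, to flag genes whose expression looks perturbed in tumor samples relative to normal ones. The sample sizes, effect sizes, and simple per-gene test are assumptions chosen for illustration and do not represent the lab’s actual methods.

```python
# Illustrative sketch with simulated data: flag genes whose expression differs
# between normal and tumor samples. Real analyses involve normalization,
# batch correction, and more careful statistics than this per-gene t-test.
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)
n_genes, n_normal, n_tumor = 1000, 20, 20

# Baseline expression, relatively conserved across normal samples.
normal = rng.normal(loc=5.0, scale=1.0, size=(n_genes, n_normal))
tumor = rng.normal(loc=5.0, scale=1.0, size=(n_genes, n_tumor))

# Perturb a small, randomly chosen set of genes in the tumor samples.
perturbed = rng.choice(n_genes, size=25, replace=False)
tumor[perturbed] += 3.0

# Test each gene for a difference in expression between the two groups.
t_stat, p_val = ttest_ind(tumor, normal, axis=1)
flagged = np.where(p_val < 0.05 / n_genes)[0]  # crude Bonferroni cutoff

print(f"{len(flagged)} genes flagged as potentially disease-linked")
print(f"{len(set(flagged) & set(perturbed))} of them were truly perturbed")
```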
A Diagnostic Model for Thyroid Cancer
Once researchers identify important patterns in specific cancers, they can leverage that information to develop models that help diagnose and treat the disease.
Elemento’s lab has already built a machine-learning model that predicts whether a patient has thyroid cancer by analyzing expression levels of specific genes. Thyroid cancer usually presents in the form of a thyroid nodule, a lump that forms at the base of the neck, and around 5 to 15 percent of these nodules are malignant. Using gene-expression measurements from the nodule, the model is able to predict with greater than 90 percent accuracy—higher than standard diagnostic tools—whether a nodule is malignant or benign. The work was published in Clinical Cancer Research in 2012.
“The only way to get this high accuracy was to use machine-learning algorithms to combine expression levels in a way that was nonlinear,” says Elemento. The model and the technique have now been licensed to a company that is developing a commercial test.
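As a generic illustration of that kind of nonlinear approach, and not a reproduction of the published thyroid model, the sketch below trains a random forest on simulated gene-expression data in which malignancy depends on a nonlinear combination of a few hypothetical genes.

```python
# Illustrative sketch with simulated data: a nonlinear classifier that combines
# gene-expression levels to call a nodule malignant or benign. The genes,
# labels, and algorithm are assumptions for illustration, not the published model.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(1)
n_samples, n_genes = 300, 50

X = rng.normal(size=(n_samples, n_genes))  # simulated expression levels
# Malignancy here depends on a nonlinear combination of a few hypothetical genes.
y = ((X[:, 0] * X[:, 1] + 0.5 * X[:, 2] ** 2) > 0.5).astype(int)

model = RandomForestClassifier(n_estimators=200, random_state=0)
scores = cross_val_score(model, X, y, cv=5)
print(f"cross-validated accuracy: {scores.mean():.2f}")
```

In practice, such a model would be trained and validated on expression measurements from real nodule samples, with far more care taken over feature selection and independent validation.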
A Database of Cancer Genome Mutations Produces a Tumor ID Card
In addition to developing models, Elemento’s lab is building a database of important cancer genome mutations, based on both the lab’s own data and discoveries from the cancer research community at large. It’s a data-intensive project that requires scanning the cancer literature and constant database upkeep. But the potential rewards are large.
Elemento’s lab provides reports to clinicians that show what he calls the “identity card of a tumor.” With the database, they can quickly identify which mutations in a tumor are most important and relay that information, along with an interpretation, to the clinician.
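Conceptually, producing such a report amounts to looking up a tumor’s observed mutations in an annotation table. The toy sketch below uses a handful of invented entries to show the shape of that workflow; the genes and interpretations are examples only and do not reflect the structure or content of the lab’s database.

```python
# Toy sketch: look up a tumor's observed mutations in a small annotation table
# and print a simple report. The entries are invented examples and do not
# reflect the lab's database schema or contents.
mutation_db = {
    ("BRAF", "V600E"): "Known driver; targetable with BRAF inhibitors.",
    ("TP53", "R175H"): "Frequent hotspot mutation; prognostic relevance.",
    ("KRAS", "G12D"): "Common oncogenic driver; limited targeted options.",
}

def tumor_report(patient_id, observed_mutations):
    """Print annotated mutations for one tumor, flagging unannotated ones."""
    print(f"Tumor identity card for patient {patient_id}")
    for gene, variant in observed_mutations:
        interpretation = mutation_db.get((gene, variant), "not yet annotated")
        print(f"  {gene} {variant}: {interpretation}")

tumor_report("EXAMPLE-001", [("BRAF", "V600E"), ("MYC", "amplification")])
```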
The more critical mutations that are added to the database, the more powerful it becomes. “The database is growing very fast,” says Elemento. “We are now thinking about opening it up to the broader community to enable crowdsourcing.” Ideally, Elemento says, many cancer researchers and clinicians would be able to update and access the database, albeit with appropriate levels of supervision. With the field moving so quickly, he finds the community’s ability to help one another the most exciting prospect.
Collaborating to Understand Why Some Patients Relapse and Others Do Not
Collaboration plays a big role in Elemento’s research, and he’s already developed many connections with researchers and clinicians at Weill Cornell Medicine and beyond.
“We have constant interactions with clinicians, and as a scientist it’s fantastic because I can get a lot of feedback,” he says. “It’s tremendously rewarding, since big data research has so much potential for translation to patients.”
Elemento is working with Wayne Tam, a hematopathologist at Weill Cornell Medicine, to identify and validate biomarkers for relapse in lymphoma patients, with funding from the National Cancer Institute. Around 40 percent of lymphoma patients who undergo chemotherapy see their tumors shrink but eventually relapse. Elemento and Tam will work toward understanding why some patients relapse while others do not.
Elemento’s role in the project involves computational analysis of exome-sequencing, transcriptome-sequencing, and DNA methylation profiling data from lymphoma tumors. The goal is to identify a biological signature of lymphoma relapse. Once the biomarkers are identified, the information can be leveraged to build a model that predicts a patient’s likelihood of relapse. In a 2015 paper published in Nature Communications, Elemento and Tam identified promising biomarker candidates of lymphoma relapse using DNA methylation profiling.
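As a simplified illustration of that final modeling step, the sketch below fits a basic classifier to simulated methylation profiles. The marker sites, methylation values, and model choice are hypothetical and are not the biomarkers or methods reported in the 2015 study.

```python
# Illustrative sketch with simulated data: predict relapse from DNA methylation
# profiles. Marker sites, effect sizes, and the model are hypothetical and are
# not the biomarkers reported in the 2015 Nature Communications study.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(2)
n_patients, n_sites = 200, 100

# Methylation "beta values" range from 0 (unmethylated) to 1 (fully methylated).
X = rng.beta(a=2.0, b=2.0, size=(n_patients, n_sites))
# Assume relapse risk tracks methylation at a few hypothetical marker sites.
risk = 4 * X[:, 0] + 3 * X[:, 1] - 3.5
y = (risk + rng.normal(scale=0.5, size=n_patients) > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print(f"held-out relapse prediction accuracy: {model.score(X_test, y_test):.2f}")
```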
Integrating Multiple Sources of Data for Better Individualized Treatment
The future of cancer research and treatment is ever evolving thanks to new technologies. Elemento says he looks forward to integrating multiple data streams, from sequenced genomes to fitness tracking activity, to create even more personalized cancer treatments.
“The idea is to integrate the information to make better treatments for individual patients,” says Elemento. “Genomic information, phenotypic information, and more, to know what drugs to use and how to use the drugs.”
These research efforts will certainly be challenging, since they involve enormous amounts of data processing and pattern recognition. That, however, is exactly what Elemento’s lab specializes in, and what he says makes the field most exciting today. Elemento says that “because of all of this technology, it’s a bright future for being better able to understand and treat cancer.”