
Making sense of big cancer data

by Misha Gajewski | Analysis

7 July 2015



“Cancers contain many more mutations than we thought  – but finding those that actually matter is like finding a needle in a haystack,” Dr Dana Pe’er told the audience at the annual Big Data Analytics Conference held at the British Library this month.

So what has Dr Pe’er, associate biology and computer science professor from Columbia University in the US, been doing to find those needles?

With large-scale ongoing international projects like the Cancer Genome Project and The Cancer Genome Atlas, the amount of genetic data being generated from analysing patients’ tumours is now staggering, creating a “tsunami of data” for researchers.

To try to work out how to make sense of it all, Dr Pe’er’s team has been creating algorithms – a set of rules a computer can follow in order to solve a problem – that can crunch through huge quantities of big data to better understand cancer.

This, she says, is leading to a deeper understanding of the disease itself – and ultimately to further progress in the way the disease is treated.

Finding a needle in a haystack

Back in 2010, Dr Pe’er and her lab developed an algorithm called CONEXIC that could pick out key mutations causing cancer – known as ‘driver’ mutations – and used it to find new mutations that drive skin cancer.

The algorithm combined two types of laboratory data. The first came from experiments measuring changes in the structure of chromosomes (packets of DNA) in cancer cells where whole stretches of DNA get repeated – or deleted. These are known as ‘copy number alterations’.

The second data set came from analysis of the activity of different genes on these chromosomes – so-called ‘expression data’.

By looking at data from hundreds of tumour samples, this algorithm was able to pinpoint areas where genes were not only physically repeated, but also activated.

And this allowed the team to home in on the ‘needle in the haystack’ – the gene faults driving each tumour.

“This predicts it’s a vulnerable gene – the signal pulls out the drivers of cancer,” Dr Pe’er told the audience.

And understanding the genes driving cancer is the first step on the way to developing new, targeted treatments.
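The core of that combined analysis can be sketched in a few lines: flag genes that are both amplified and highly expressed across tumour samples. To be clear, this is not CONEXIC itself – the real algorithm uses far more sophisticated statistical modelling – and the gene names, numbers and thresholds below are invented purely for illustration.

```python
# Simplified illustration of one idea behind driver-gene detection:
# flag genes that are both amplified (extra copies of the DNA) and
# over-expressed (switched on) across many tumour samples.
# All gene names, values and thresholds are made up.

# toy data: per-gene copy number (2 = normal) and expression level,
# averaged over hundreds of hypothetical tumour samples
copy_number = {"GENE_A": 5.1, "GENE_B": 2.0, "GENE_C": 4.8, "GENE_D": 1.9}
expression  = {"GENE_A": 9.2, "GENE_B": 8.7, "GENE_C": 1.3, "GENE_D": 2.2}

AMPLIFIED_ABOVE = 4.0   # copy-number threshold for "repeated"
EXPRESSED_ABOVE = 8.0   # expression threshold for "activated"

def candidate_drivers(copy_number, expression):
    """Return genes that are both amplified and highly expressed."""
    return sorted(
        gene for gene in copy_number
        if copy_number[gene] > AMPLIFIED_ABOVE
        and expression[gene] > EXPRESSED_ABOVE
    )

print(candidate_drivers(copy_number, expression))  # ['GENE_A']
```

Note that GENE_B (expressed but not amplified) and GENE_C (amplified but not expressed) are filtered out – only genes where both signals coincide survive, which is the ‘needle’ the real algorithm pulls out of far noisier data.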

Since then Dr Pe’er and her team have continued to develop even more sophisticated algorithms to analyse ever more complex data.

Visualising big data

One way researchers study blood cancers like leukaemia is to use a technique called flow cytometry – a laser-based technology that can separate out individual cells from the blood and measure their individual characteristics.

Typically, data generated from the flow-cytometer is plotted on a two-dimensional graph, allowing patterns to be spotted. In effect, it creates a 2D ‘map’ of the characteristics of the cells in the sample.

Until recently, only two characteristics of the individual cells were plotted against each other at any one time – leading to an incomplete picture of the disease.

But technology has rapidly advanced, and newer machines can now measure up to 45 characteristics at the same time, from millions of individual cells. And this produces a vast quantity of highly complex data.

So a better way to map and visualise this is crucial to better understanding the complexity of leukaemia.

At the conference Dr Pe’er talked excitedly about a program her team developed back in 2013 called viSNE, which uses an algorithm to visualise that mass of data covering many different aspects of the cells’ biology.

viSNE allows researchers to map and visualise multiple characteristics of individual cells against each other, creating a two-dimensional map from what in reality is multidimensional data.


An example of the 2D ‘maps’ generated by the viSNE algorithm
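viSNE itself builds on t-SNE, a sophisticated non-linear embedding technique. As a much simpler stand-in for the underlying idea – taking each cell’s many measurements and squashing them onto a 2D map so that similar cells land near each other – here is a plain principal component analysis (PCA) projection in pure Python, run on entirely synthetic data.

```python
import random

# A deliberately simple sketch of dimensionality reduction: each "cell"
# is a vector of measurements, and we project to 2D with PCA computed
# via power iteration. viSNE uses t-SNE, not PCA; this is only meant to
# illustrate the high-dimensional-to-2D mapping. All data is synthetic.

random.seed(0)
DIM = 5  # pretend 5 measurements per cell (real machines: up to 45)

def make_cell(centre, spread=0.5):
    return [c + random.uniform(-spread, spread) for c in centre]

# two synthetic populations of cells, e.g. "normal" vs "leukaemic"
cells = [make_cell([0.0] * DIM) for _ in range(30)] + \
        [make_cell([10.0] * DIM) for _ in range(30)]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def covariance(data):
    n, d = len(data), len(data[0])
    m = [sum(row[j] for row in data) / n for j in range(d)]
    centred = [[row[j] - m[j] for j in range(d)] for row in data]
    cov = [[sum(r[i] * r[j] for r in centred) / n for j in range(d)]
           for i in range(d)]
    return cov, centred

def top_eigenvector(matrix, iterations=100):
    # power iteration: repeatedly multiply and renormalise
    v = [1.0 + 0.01 * i for i in range(len(matrix))]
    for _ in range(iterations):
        w = [dot(row, v) for row in matrix]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    return v

def pca_2d(data):
    cov, centred = covariance(data)
    pc1 = top_eigenvector(cov)
    # deflate: remove pc1's contribution, then find the next component
    lam = dot([dot(row, pc1) for row in cov], pc1)
    d = len(cov)
    deflated = [[cov[i][j] - lam * pc1[i] * pc1[j] for j in range(d)]
                for i in range(d)]
    pc2 = top_eigenvector(deflated)
    return [(dot(row, pc1), dot(row, pc2)) for row in centred]

embedding = pca_2d(cells)
# the two cell populations end up well separated along the first axis
```

On this toy data a linear projection is enough to separate the two populations; the reason viSNE needs t-SNE is that real cell populations overlap and curve through the measurement space in ways a straight-line projection cannot untangle.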

When used to process data from both normal cells and leukaemia cells, pre- and post-treatment, it lets the researchers see crucial differences that appear as the disease progresses.

And this, Dr Pe’er says, is opening up some new avenues for understanding how drug resistance develops.

Infiltrating cancer’s social network

But to truly understand cancer it’s not enough to just see the different types of cancer cells in isolation. You need to understand how they interact.

Cancers are made of ‘communities’ of different types of cell, and – just like human communities – the dynamics between them change over time.

Dr Pe’er is trying to decipher and measure those dynamics.

Drawing on algorithms originally used to analyse social networks, her latest work, published in the journal Cell last month, involves an algorithm called PhenoGraph, which combines reams of experimental data to group cancer cells into ‘networks’ based on their similarities.

The study looked at the relationship between how different acute myeloid leukaemia (AML) cells ‘look’ – in terms of the different molecules on their surface  – and how they behave.

Taking data from millions of individual cells, the algorithm grouped cancer cells into different communities.

When they looked at how these communities differed between different patients, they found a unique group of cells, whose presence or absence was related to how well a patient subsequently fared after treatment.
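The network idea can be illustrated very crudely: link cells whose measurements are similar, then read off the connected groups. PhenoGraph actually builds a nearest-neighbour graph and runs proper community detection on it, so the sketch below – with made-up ‘surface marker’ numbers and a simple distance threshold – is a simplified stand-in, not the published method.

```python
# Crude stand-in for graph-based cell grouping: connect cells whose
# (synthetic) surface-marker profiles are close, then treat connected
# components of the graph as "communities". PhenoGraph itself uses a
# nearest-neighbour graph plus community detection; this is simplified.

def distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def communities(cells, link_if_closer_than=2.0):
    """Group cells into connected components of a similarity graph."""
    n = len(cells)
    parent = list(range(n))  # union-find over cell indices

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if distance(cells[i], cells[j]) < link_if_closer_than:
                parent[find(i)] = find(j)  # merge the two groups

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())

# toy surface-marker measurements: two tight clusters and one outlier
cells = [(0.0, 0.1), (0.2, 0.0), (5.0, 5.1), (5.2, 4.9), (20.0, 20.0)]
print(communities(cells))  # [[0, 1], [2, 3], [4]]
```

Scaled up to millions of cells and dozens of markers, the analogous groups are the cell ‘communities’ whose presence or absence the study linked to how patients fared after treatment.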

This sort of analysis, Pe’er thinks, could one day be used to help ‘personalise’ a patient’s treatment.

The challenge now is to take these computer-based discoveries, understand the molecules behind them, and then use this knowledge to improve the way patients with the disease are treated.

Big data, big future

As our understanding of cancer’s complexity is reflected more and more in the vast quantities of data our researchers generate, work like that of Dr Pe’er becomes ever more vital to understand and make sense of it.

One area that could use a little clarity – and something that’s a really important topic in cancer research right now – is immunotherapy.  Recent trials show that drugs that target a patient’s immune system can have profound effects – but they don’t work in everyone. And that’s where Dr Pe’er sees big data truly making a difference.

“We need smart and adaptive drugs like our own immune system,” she said. “I believe the next big push should be big data to personalise immunotherapy.”

So we look forward to the future insights Dr Pe’er’s algorithms may provide.

  • The annual Big Data Conference is sponsored by Winton, an investment management company which bases its trading decisions on scientific methods, and which has a strong interest in data analysis. Winton is funding the bioinformatics laboratories at the Francis Crick Institute, and supporting two group leaders who will carry out life-saving research in the bioinformatics field.

Image: By Henti Smith/Flickr, used under CC-BY-NC-ND 2.0