Doing data the right way…

by Cancer Research UK | In depth

23 April 2025


research data

The correct handling of big data is vital for the next raft of understanding in cancer research. We must get it right – that’s why CRUK has set up a data science community to support our researchers. We hear from three community leads on their data demonstration projects and what they are aiming to achieve…

Professor Mieke Van Hemelrijck

“The goal is to create a seamless infrastructure where researchers can efficiently integrate and analyse data without compromising security or patient confidentiality.”

Professor Mieke Van Hemelrijck, Transforming cancer outcomes through research Team Lead, King’s College London

Cancer research thrives on data, but the challenge has always been making that data usable across studies and institutions.

We know that valuable datasets exist – clinical records, imaging, genomic data – but too often, they remain siloed, locked behind incompatible formats and institutional barriers. Our project seeks to change that by combining the OMOP Common Data Model (CDM), an open community data standard designed to standardise the structure and content of observational data, with privacy-preserving data-sharing techniques such as federated learning.
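
To make the idea of privacy-preserving analysis concrete, here is a minimal sketch of one simple federated pattern: each site computes summary statistics locally and shares only those aggregates, never patient-level rows. It is an illustration under stated assumptions, not the project's actual method. The names `SiteSummary`, `summarise_locally` and `pooled_mean` and the example values are hypothetical, and federated learning proper exchanges model updates rather than simple sums.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SiteSummary:
    """Aggregates a site agrees to release; contains no patient-level rows."""
    n: int
    total: float


def summarise_locally(values: list[float]) -> SiteSummary:
    # Runs inside each institution; the raw values never leave the site.
    return SiteSummary(n=len(values), total=sum(values))


def pooled_mean(summaries: list[SiteSummary]) -> float:
    # Runs at the coordinator, which only ever sees the per-site aggregates.
    n = sum(s.n for s in summaries)
    if n == 0:
        raise ValueError("no observations across sites")
    return sum(s.total for s in summaries) / n


# Hypothetical example: age at diagnosis held at two separate sites.
site_a = [61.0, 70.0, 58.0]
site_b = [49.0, 66.0]

summaries = [summarise_locally(site_a), summarise_locally(site_b)]
pooled = pooled_mean(summaries)
print(f"Pooled mean across {len(summaries)} sites: {pooled:.1f}")
```

The pooled result matches what a single combined dataset would give, yet no site discloses individual records. A real deployment would likely need further safeguards, for example refusing to release aggregates for very small groups.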

The goal is to create a seamless infrastructure where researchers can efficiently integrate and analyse data without compromising security or patient confidentiality. This could be transformative for research into rare cancers, where data scarcity remains a major obstacle.

One of the key insights from our recent survey of the CRUK data community was the varied landscape of data access challenges. While some researchers are eager to share data, they face legal and ethical roadblocks. Others struggle with technical issues – datasets stored in formats that aren’t compatible across platforms. A dual approach is needed – we need standardisation to unify how data is structured, and technological solutions to enable secure, privacy-preserving sharing.

This project is tackling both, which is why I believe it has such potential to reshape how we conduct cancer research.

Adoption is always a hurdle when implementing new data standards. Researchers are often hesitant to change established workflows. That’s why our project includes a strong training and dissemination component. We’re developing educational materials and workshops to ensure that researchers feel confident using the OMOP-CDM and engaging in collaborative data-sharing initiatives.

Feedback from our recent CRUK data community survey highlights a strong appetite for this – 68% of respondents identified improved standardisation and accessibility as the most impactful changes needed for data reuse in cancer research.

Our role is to translate that demand into solutions.

While the promise of standardised data is immense, challenges remain. Technical adoption, legal frameworks, and researcher buy-in will require continued effort. However, with a growing community of supporters and real-world pilot studies demonstrating the benefits, this project is laying the foundation for a new era of collaborative, data-driven cancer research. By bringing together diverse expertise – from clinical database management to secure computing – this initiative is not just about improving data access. It’s about accelerating life-saving discoveries.

Dr Frances Pearl

“We are going to build an online CRUK Data Hub where researchers can search for available cancer datasets funded by CRUK. The Hub will provide details on the datasets and how to access them.”

Dr Frances Pearl, Associate Professor in Bioinformatics, University of Sussex.

Last year the Nobel Prize in Chemistry recognised work on using AI and deep learning to accurately predict the 3D structure of a protein from its amino acid sequence (AlphaFold) and on computational protein design (David Baker). This was only possible because all the data required to train the AI was in the same format, in the same database, and had been collected and stored in this way since 1971 (Protein Data Bank).

For those of us studying cancer, the picture is different. Many multi-omic and imaging studies link their data to patient outcomes or drug responses for large sets of cancer patients, but the data these studies produce are often not very accessible.

CRUK has a strategy to harness the potential of big data to transform cancer research. A key focus is making the most of CRUK’s research data by following the FAIR Data principles (Findable, Accessible, Interoperable, Reusable), which aim to make data easy to find, access, and use.

We are going to build an online CRUK Data Hub where researchers can search for available cancer datasets funded by CRUK. The Hub will provide details on the datasets and how to access them. We will also develop training materials and guidance to help researchers apply for access to these data resources.

The CRUK Data Hub will make it easier for researchers to find and reuse cancer datasets, fostering collaboration and the sharing of data. At the same time, data science and powerful AI tools have progressed rapidly in recent years. Together, these developments offer a unique opportunity to revisit and analyse existing data in exciting new ways.

By connecting old and new datasets, we can uncover deeper insights and make discoveries that could significantly improve patient care. This resource will ultimately lead to more effective research and, most importantly, better outcomes for cancer patients.

Harriet Unsworth

“Our aim is to empower data researchers with the tools to understand, quantify, and where possible, improve the diversity in their data research projects.”

Harriet Unsworth, Digital Cancer Research Team Lead, Cancer Research UK National Biomarker Centre.

The lack of diversity in healthcare and biomedical datasets is a problem for researchers, healthcare systems and citizens.

Data is used across healthcare and medical research for decisions that impact the whole of society: to plan healthcare services, understand the impacts of new treatments, and understand mechanisms of disease. As researchers, we know that if the data we use doesn’t reflect the whole population, then our research outputs may be hard to generalise across all groups of people.

Diversity in data spans several dimensions, such as age, sex, ethnicity and socioeconomic status. A lack of diversity can be seen, for example, in participation in genomic studies, where over 80% of participants are of European genetic ancestry, even though people of European ancestry make up less than 16% of the global population. The lack of representativeness can limit the generalisability of research findings and can reduce the effectiveness of predictive tools and interventions for underrepresented groups.

Important demographic variables are sometimes not recorded at all: 40% of NIHR-funded clinical trials between 2007 and 2017 did not provide ethnicity data for participants. This is a problem: as researchers using the data, we can’t tell whether the research will be relevant to the public as a whole.
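
As a rough illustration of how these two problems might be quantified, the sketch below computes a representation ratio (a group's share of a dataset divided by its share of the reference population) and the fraction of records missing a demographic field. This is one simple approach under stated assumptions, not a method taken from the project. The function names and the five example records are hypothetical, and the shares plugged in are the rounded bounds quoted above.

```python
def representation_ratio(dataset_share: float, population_share: float) -> float:
    """Share of a group in the dataset relative to its share of the population.

    A value above 1 means the group is over-represented; below 1, under-represented.
    """
    if not 0 <= dataset_share <= 1:
        raise ValueError("dataset_share must be between 0 and 1")
    if not 0 < population_share <= 1:
        raise ValueError("population_share must be in (0, 1]")
    return dataset_share / population_share


def missing_rate(records: list[dict], field: str) -> float:
    """Fraction of records where `field` is absent, None, or an empty string."""
    if not records:
        raise ValueError("records must not be empty")
    missing = sum(1 for r in records if r.get(field) in (None, ""))
    return missing / len(records)


# Bounds quoted in the article: over 80% of genomic study participants
# versus less than 16% of the global population.
ratio = representation_ratio(0.80, 0.16)
print(f"European ancestry is over-represented by a factor of at least {ratio:.1f}")

# Hypothetical participant records; two of the five lack ethnicity data.
participants = [
    {"id": 1, "ethnicity": "Asian"},
    {"id": 2, "ethnicity": ""},
    {"id": 3, "ethnicity": "White"},
    {"id": 4},
    {"id": 5, "ethnicity": "Black"},
]
rate = missing_rate(participants, "ethnicity")
print(f"Ethnicity missing for {rate:.0%} of participants")
```

Because the article gives "over 80%" and "less than 16%", the ratio computed from those bounds is a lower limit on the true over-representation.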

CRUK’s research data strategy commits to ensuring that all parts of society benefit equally from cancer data science and aims to champion diversity in research data.

Data Science for Health Equity (DSxHE) is a global community that brings together experts and enthusiasts working at the intersection of data science and health inequalities to ensure that the latest research and innovations improve health equity. In this project the research data community leads will link up with DSxHE to co-develop actionable, scalable solutions with researchers, clinicians, and patient advocates to embed data diversity into cancer research. DSxHE will set up a new data diversity community, in which solutions can be developed and best practice shared.

Outputs from this project will be driven by insight from the community, cancer patient representatives, and data experts. They may include inclusive data collection protocols, training resources to upskill researchers, and methods to assess the diversity in existing health datasets.

Our aim is to empower data researchers with the tools to understand, quantify, and where possible, improve the diversity in their data research projects, to ensure more equitable and generalisable outputs for the whole of society.

Data-driven Cancer Research Conference 2026

The event is for anyone working in cancer research, data science, bioinformatics, computational oncology, computer science, mathematics or statistics, including those working at the interface between these disciplines. It is open to researchers at all career stages from across the world.

Check out the conference.

Register your interest to make sure you don’t miss registration.
