Share and share alike: how FAIR data principles can save the day

After a peer reviewer asked for more data, Dr Philip Dunne’s project on colorectal phenotypes faced calamity. But the discovery of a pre-print article from colorectal cancer organoid expert Dr Chris Tape, together with his very helpful social media posting, saved the day. Here is the inspirational story, as told by each group, showing just why FAIR data use is vital for speeding up cancer research…

The data recipient

Dr Philip Dunne: “The reality from our side was that such a dataset would require methodologies and expertise we didn’t directly have, and to generate it would have come with a substantial cost both in terms of time and money.”

The goal of my research group has been to make data-driven research the driver of biological discovery.

Our projects are led by a team of hybrid early career researchers (ECRs) who are just as comfortable handling, wrangling and interrogating large molecular datasets as they are with hands-on interpretation of wet-lab results.

A key research theme within our group is the development of transcriptional classifiers representing important molecular phenotypes in tumour data in early-stage colorectal cancer. The “standard” approaches to molecular stratification have largely replicated those used in the early 2000s in breast cancer, where gene-level expression data is the template for clustering algorithms to identify patterns within samples, leading to the development of subtypes.

However, if you always do what you’ve always done, you’ll always get what you’ve always got, right? Therefore, we set out to challenge the status quo of using gene-level information as the basis for biological discovery by using phenotypic signature scores, rather than gene-level data. This has led to the development of a series of pathway-derived subtypes (PDS) in colorectal cancer.

This approach identified the well-understood canonical (PDS1), stromal and inflammatory (PDS2) subtypes in CRC. However, we also identified a group of tumours (PDS3) that accounted for around 25% of CRCs tested. This group had all the same mutations and genomic rearrangements as the PDS1 and 2 lesions but appeared to be devoid of almost all phenotypic signalling associated with cancer-related biological hallmarks – traits that were previously thought to be essential for oncogenic development and progression.

Further interrogation of PDS3 revealed insights into previously overlooked biological traits that exist within established tumour datasets and subtyping systems. Even more intriguing was the fact that, when applied to several datasets with clinical outcomes, this seemingly innocuous PDS3 group of tumours displayed the highest relapse rates when diagnosed at stage II/III.

Insurmountable challenge?

At this point, we wrote up our study and submitted it to a journal. So began the painful wait while the peer review process does its thing.

But, after a relatively quick turnaround, our reviews landed back in February 2023. Two of them we felt we could address. One, however, challenged us to go well beyond our initial characterisations and to provide a much deeper understanding of the phenotypes we were seeing. Given the complexity and sometimes uncontrolled nature of sampling in human tumours, we needed to confirm these findings in a much more controlled system.

We were in a position where we were sure that what we were seeing in our patient data was important, yet we needed an equivalent set of independent pre-clinical models to replicate this to ensure that the observations were both real and reproducible. However, the reality from our side was that such a dataset would require methodologies and expertise we didn’t directly have, and to generate it would have come with a substantial cost both in terms of time and money. Cue a sombre weekend while we all reflected on this.

The genuine worry at this point was that, with grants and contracts ending, we wouldn’t be able to fully support the ECRs leading this work to a successful conclusion.

However, in between the numerous conversations across the co-authorship about ways forward, Dr Chris Tape from UCL posted his pre-print on bioRxiv, which defined a series of incredible discoveries around stem cell lineage identities in colorectal cancer models, all of which appeared to be fully available and interoperable.

Our team had admired the work using organoids to look at colorectal cancer cell regulation from the Tape lab over the years, so we lined the pre-print up for journal club and quickly saw how this could be the missing piece we needed in our paper. Some emails to Chris followed but, given how well the data and methods were curated by the Tape team, we could start working on them immediately from the pre-print.

Within a few weeks we had applied our PDS classifiers and phenotypic signatures of interest to this new data, using both our tools on their data and their tools on our data. Bingo! We had validated all our findings and filled the specific gap suggested by the reviewer for the paper.

This all happened ahead of our first call with Chris, using the data, tools and visuals he had made openly available, highlighting the value of data reuse.

All about the ethos

The ethos of FAIR principles has been at the heart of a number of our recent projects, which include the development of user-friendly tools that allow non-programmers to perform bioinformatic analyses on molecular datasets that we produce and/or analyse in our group.

In keeping with this, as part of our paper, we have developed an application to ensure the cohorts and subtypes in our study are accessible and re-usable by any researcher, which we made available while the study was under initial review.

At the time of writing this, our paper is undergoing final editorial review and we have grants submitted based on this work that we hope will enable our ECRs to continue their work of disrupting the computational oncology field.

Although funders have invested significant resources in generating large-scale molecular datasets, the value and impact of these datasets are limited if only a small subset of researchers and clinicians have the necessary computational skills and expertise to analyse and interpret the data effectively. Overall, our team are delighted to see CRUK take the lead in supporting data-driven research (and researchers) and ensuring the FAIR principles are embedded into the ethos of research moving forward.

Philip holds a joint appointment as a Reader in Molecular Pathology within the Patrick G. Johnston Centre for Cancer Research, Queen’s University Belfast and a Group Leader at the CRUK Scotland Institute in Glasgow. His work focusses on investigating mechanisms underlying disease progression in colorectal cancer.

The data donor

Dr Chris Tape: “We shared the data largely because we’re charity funded academics and it’s the right thing to do. Transparency is central to the scientific process and sharing data builds trust with the community.”

2022 was the final year of my CRUK Career Development Fellowship and my lab was in trouble.

Our productivity had been hit hard by COVID and I had almost no money left. I’d spent everything on a few ambitious research projects and if I couldn’t land a major grant within 6 months, my lab would grind to a halt — potentially forever. I was feeling the pressure.

Time to write grants. Big grants. However, in the words of a trusted colleague, to date, my lab “had some nice methods papers, but needed to publish some proper cancer research” if I wanted to improve my chances of getting funding.

Fortunately, the “proper” cancer research was coming. For the past three years we had been studying how oncogenic mutations and stromal cells regulate colorectal cancer (CRC) using high-dimensional single-cell organoid screening. Through analysis of around 4,000 organoid cultures, we discovered a continuous CRC stem cell trajectory spanning from proliferative colonic stem cells (proCSC) to revival colonic stem cells (revCSC). Crucially, we discovered that cancer associated fibroblasts (CAFs) could polarise chemosensitive proCSC towards chemorefractory revCSC — providing a plasticity mechanism for stromal drug protection in CRC.

A question of timing

We thought these findings were important and began to write everything up. The problem, as ever, was timing. Publishing a peer reviewed paper takes at least 9 months and is fraught with uncertainty. Reviewers don’t care if your lab is about to run out of money when asking for new experiments. I had 6 months before bankruptcy so there was only one solution: preprint.

In February 2023 my lab posted the preprint on bioRxiv, I wrote a detailed Twitter/X thread describing our findings, and we made all the scRNA-seq and mass cytometry data freely available. We also provided executable notebooks to recreate every figure in the paper and a video tutorial teaching readers how to make our new custom plots.

We shared the data largely because we’re charity funded academics and it’s the right thing to do. Transparency is central to the scientific process and sharing data builds trust with the community. We also shared the data because I wanted funders to see that my lab generates unique data. If someone else actually used the data that would be great – but we’re just a small academic lab with a limited reputation in CRC research so I presumed no one would be that interested.

To our surprise (and relief) the preprint was very well received. After working on a project for years you often become the worst judge of its value to others. What was once an extremely exciting discovery within the lab can soon become old news to the authors by the time it’s published. We also hadn’t presented the work externally due to COVID, so while we were very proud of the paper internally, we had no idea if anyone else would care. Fortunately, people seemed to like the work.

A few weeks after the preprint went live, I received an email from Dr Philip Dunne. Phil’s lab study CRC using computational approaches and were in the midst of a major paper revision at a journal. Via a completely different approach, Phil’s team had found similar evidence of non-genetic plasticity in CRC patient samples. Their results were very exciting, but potentially a bit noisy and their peer review was on a knife edge.

Human samples are very heterogeneous and therefore very challenging to interpret. Confounding variables lurk around every corner and sometimes you just don’t know what you don’t know. By contrast, our organoid data was highly controlled. We know each organoid’s exact genotype, what culture media it was grown in, and what stromal cues it was exposed to. Our preprint data provided Phil’s lab with an opportunity to test their clinical observations against a highly controlled dataset.

When organoids meet computational research…

I was overjoyed to see our proCSC signature overlay with Phil’s PDS1, our revCSC with their PDS2, and our differentiated cells with their PDS3. It was like we were both feeling the metaphorical elephant from different sides and shouted “elephant” simultaneously. Phil’s team could even reproduce our new 3D landscape plots using the video tutorial. Those plots use SideFX Houdini (more commonly used by Disney and Pixar animation studios) and Phil’s team had them up and running by our first meeting. If we’d not made the data and workflows publicly available, that never would have been possible.

For me, data sharing alongside a preprint is a no-brainer. The field see your results are sound, which buys confidence and respect. Readers can then re-use the data for their own studies, increasing the pace of discovery. If we’d not made our data available until after peer review, Phil’s lab would first have seen it in early 2024, not early 2023, and their work would have been unnecessarily delayed.

Finally, I used the preprint when applying for a CRUK Discovery Programme Foundation Award in early 2023. Fortunately the Discovery Research Committee awarded the grant that summer and my lab was saved. The paper was also accepted at Cell alongside another paper from my lab. If we’d not preprinted and shared our data, I don’t think I’d be able to do “proper” cancer research now.

Chris is Group Leader and Associate Professor at UCL where he leads the Cell-Communication Lab at the UCL Cancer Institute. His lab studies how different cell types collaborate in cancer.

Our position on data sharing includes the expectation that all those involved in CRUK-funded research abide by a culture of sharing research data for re-use across sectors.

We realise, however, that applying this principle varies across the cancer research community. Therefore, we are keen, when we see exemplary behaviour, to highlight these outstanding efforts and the obvious benefits of making data available as widely and freely as possible. Sharing data not only maximises its potential to deliver improved outcomes for patients and the public, but also improves the wider research endeavour.

Chris’s and Philip’s story shows the power that FAIR data (that’s findable, accessible, interoperable, and reusable data) can bring to advancing cancer research. FAIR-er cancer research data is the vision of our Research Data Strategy that we published in summer 2022.

Data re-use is achievable when data is shared with detailed metadata: information about the data that’s being shared, such as how it was collected and what formats it’s in. This is a critical part of the data re-use life cycle, enabling other researchers not only to replicate the data or analyses generated, but also to use the data in other ways, to answer other research questions. It also means sharing the data as soon as possible and not necessarily waiting for publication to do so.

Chris’s lab sharing their pre-print and giving other researchers access to well-curated datasets and code before publication is what enabled Philip’s group to strengthen their manuscript already in review. In the end, data re-use led to high-quality publications from both groups and the pre-print also helped Chris secure funding for his lab.

CRUK is keen to recognise and reward data sharing best practice. Several of our funding committees now include a Data Champion, whose role is to amplify excellence in data sharing and data-enabled research when reviewing proposals.

We are also considering how to acknowledge cancer researchers with a track record of enabling data re-use, and we encourage all researchers to include ways to cite any datasets or code they produce. Continue to watch this space for more news on how to get involved with our cancer research data community.

Alexis Webb – Research Programme Manager for the Research Data Strategy
