
Data science – are we chatting away scientific integrity?

by Cancer Research UK | Analysis

11 November 2025


Data science column

LLMs are impressive – revolutionary even – yet they also pose a threat to the very bedrock of how we do science. How can we make sense of it? Here our data science columnist Bissan Al-Lazikani talks hallucinations, responsibilities and safeguarding science…

It is amusing to observe the love-hate relationship that we, as scientists, appear to be developing with new-generation artificial intelligence, and especially with Large Language Models (LLMs) such as ChatGPT, Gemini and Claude.

LLMs are real-world manifestations of our geekiest childhood Sci Fi fantasies. Remarkably intuitive and easy to use, they appear to ‘understand’ our less-than-perfectly-formed questions. They can summarise and synthesise information for us, making the old practice of dredging through hundreds of pages of PubMed or Google results seem unbearable. They allow us to explore any topic and can respond to us no matter our line of thought.

Let’s face it, they are a huge time and effort saver. Disillusionment rapidly sets in, however, when we experience their faults: fake citations masquerading as real references, erroneous information presented with confidence. How many of us have recoiled when reviewing a paper harbouring the tell-tale junk of AI writing, or when receiving LLM-generated reviewer comments?

We are being swamped by AI-generated content that is endangering the very foundations of objectivity and empirically derived facts. For many of us, our Sci Fi dreams have morphed into nightmares. We become, rightly, suspicious of everything and increasingly discouraged.

Yet not all scientists appear to have the same negative reaction – as evidenced by the growing volume of AI-generated ‘information’ being spread in scientific circles.

Use the tools correctly…

Who is to blame? In my opinion, the blame lies squarely on our shoulders. The misuse of a tool is rarely the fault of the instrument itself but rather that of the wielder. As with any tool, we must understand how it works, what its appropriate uses are, and use it responsibly.

Traditional AI/machine learning algorithms – ubiquitous in science – are heavily constrained to specific inputs and outputs and fully supervised by humans. Foundation models such as LLMs, on the other hand, are given vast, unstructured, uncurated data.

Importantly, LLMs are not trained to be fact-retrieval algorithms. They are generative, probabilistic predictors of the next items in a sequence. They will generate (not retrieve) the most reasonable follow-up in response to prompts. Given they were trained on books, articles and general language constructs generated by humans, this follow-up will frequently be factually correct. But it is not necessarily so, and not by design.
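To make that distinction concrete, here is a minimal sketch of next-token generation, assuming a toy two-prompt ‘model’ whose vocabulary and probabilities are entirely invented for illustration (a real LLM computes a distribution over tens of thousands of tokens with a neural network):

```python
import random

# A toy next-token "model": for each prompt, a probability
# distribution over plausible continuations. All numbers are
# invented for illustration only.
TOY_MODEL = {
    "The capital of France is": {"Paris": 0.95, "Lyon": 0.03, "Nice": 0.02},
    "The citation for this claim is": {
        "Smith et al., 2019": 0.40,  # plausible, may not exist
        "Jones & Lee, 2021": 0.35,   # plausible, may not exist
        "Patel, 2020": 0.25,         # plausible, may not exist
    },
}

def generate(prompt: str) -> str:
    """Sample a continuation in proportion to its probability.

    Nothing here looks anything up: the model emits the most
    plausible-looking continuation, true or not.
    """
    tokens, weights = zip(*TOY_MODEL[prompt].items())
    return random.choices(tokens, weights=weights, k=1)[0]

print(generate("The capital of France is"))        # almost always right
print(generate("The citation for this claim is"))  # plausible, unverified
```

Nothing in this loop consults a database: a confident-looking citation can be emitted whether or not it exists.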

Humans behave much the same way. Complete the following sentence: “We finish each other’s …”. Probabilistically, the most likely word is ‘sentences’, unless you have a 6-year-old in your life, in which case it will more likely be ‘sandwiches’. But perhaps in your specific context you meant ‘stories’.

As with human responses, we can guide LLMs to predict a more appropriate output by giving them context and constraints. In this setting, a fabricated but plausible citation provided by an LLM makes perfect sense. The concept of AI ‘hallucination’ is a human construct that we use to describe our surprise when we misuse AIs and get an answer we did not expect. In the example above, ‘sandwiches’ is not a hallucination, and neither would ‘cupcakes’ be.
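The sandwich example can be put in the same toy terms as the sketch above – again with invented probabilities – to show how added context legitimately reshapes the distribution:

```python
# Toy conditional distributions for "We finish each other's ...".
# All probabilities are invented for illustration.
no_context = {"sentences": 0.75, "sandwiches": 0.15, "stories": 0.10}
with_context = {  # same prompt, prefixed with "Talking to my 6-year-old:"
    "sandwiches": 0.70,
    "sentences": 0.20,
    "cupcakes": 0.10,
}

def most_likely(dist: dict[str, float]) -> str:
    """Return the highest-probability continuation."""
    return max(dist, key=dist.get)

print(most_likely(no_context))    # 'sentences'
print(most_likely(with_context))  # 'sandwiches' -- not a hallucination,
                                  # just a different conditioning context
```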

Responsible use will yield great results

LLMs and other foundation models provide powerful tools in our scientific armoury. We can use them to explore patterns in data that would be hidden in noise. We can use them to generate hypotheses that we can then test with appropriate controls.

Indeed, even in our everyday interactions with them, we must provide such controls. We must always question both ourselves and the responses we are getting: can we verify the information using an alternative method? Can we ask the LLM to provide a link to a primary source and verify that primary source ourselves?
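One lightweight control of this kind, sketched below under the assumption that the citation carries a DOI, is to check the identifier against the public Crossref REST API before trusting it (the helper function and workflow here are ours, purely illustrative; any resolver would do):

```python
import requests

def doi_exists(doi: str) -> bool:
    """Check a DOI against the public Crossref REST API.

    Returns True only if Crossref holds a record for it. A fabricated
    but plausible-looking citation will typically fail this check.
    """
    resp = requests.get(f"https://api.crossref.org/works/{doi}", timeout=10)
    return resp.status_code == 200

# Substitute the DOI quoted in the LLM's answer; this one is a placeholder.
print(doi_exists("10.1000/example-doi"))
```

Even when the DOI resolves, the check only confirms that the record exists; verifying that the paper actually supports the claim still means reading it ourselves.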

If we misuse LLMs and report their predictions as facts, we bear responsibility. If we succumb to our own complacency and rely on AI to replace our experience, scientific method, and critical evaluation, we cannot blame the tools.

Importantly, we must safeguard our science against the erosion of these AIs themselves, as AI-generated data feed future foundation models. As described above, LLMs learn patterns and derive new connections based on the vast training data they were given. When these data are primarily human generated, the patterns AIs learn are real. As more AI-generated data swamp the scientific literature (and the wider world), these models risk erosion and loss of their very value. It is therefore imperative that we safeguard primary scientific outputs and databases reporting empirically derived scientific data.
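This erosion – often called model collapse – has a simple numerical caricature. In the toy simulation below (our illustration, with made-up parameters), each ‘generation’ of a model is fitted only to a small sample drawn from the previous generation, and the spread it can represent tends to decay:

```python
import numpy as np

rng = np.random.default_rng(0)

# Generation 0: "human data" with genuine spread.
mean, std = 0.0, 1.0

for generation in range(1, 31):
    # Each new model sees only a small sample generated
    # by the previous model...
    sample = rng.normal(mean, std, size=20)
    # ...and simply refits its mean and spread to that sample.
    mean, std = sample.mean(), sample.std()
    if generation % 5 == 0:
        print(f"generation {generation:2d}: std = {std:.3f}")

# On average the fitted spread shrinks generation over generation:
# finite samples keep clipping the tails, so diversity erodes --
# a caricature of models trained on their own output.
```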

As scientists, we must develop a rational relationship with these powerful technologies. They are here to stay. If we maintain our role as responsible human scientists and do not surrender our own objectivity and critical thinking, these models will propel our discoveries and innovation beyond our wildest imaginations.

Author

Professor Bissan Al-Lazikani

Bissan is the Director of Therapeutics Data Science and Professor in the Department of Genomic Medicine at the University of Texas MD Anderson Cancer Center.

Data-driven Cancer Research Conference

Want to hear more about how data is shaping cancer research? Join our three-day conference in Edinburgh from 24-26 February 2026.

We’ll be exploring the future of data-enabled cancer research, highlighting the transformational role of emerging tools and technologies in advancing our understanding of cancer.

Register now
