


Data science: has AI solved drug discovery?

by Cancer Research UK | Analysis

21 April 2026


Big data

Spoiler alert… no. At least, not yet. But it is revolutionising the field in incredible ways. Here, Bissan Al-Lazikani gets into where we are, and what we need to do next to deliver on AI’s promise for drug discovery…

This entry is part 3 of 3 in the series Data science

The past decade – and especially the last two years – has seen artificial intelligence and machine learning gain recognition in biomedical science, and in drug discovery in particular.

Yet, based on some headlines, a casual onlooker would be forgiven for thinking that drug discovery scientists are now simply sitting around, awaiting a panacea to be spewed out by their AI models. This is, of course, nonsense.

Data science and machine learning require two main elements: large, sufficiently organised and annotated data sets to ‘learn’ from, and a training framework (designed by a human) that allows the computer to learn from those data and make predictions. Drug discovery – including molecular design – has one of the best foundations of data and machine learning of any field, which is why it’s poised to benefit from AI. Yet we are approaching a major plateau in the benefits we’ll see for drug discovery – one that cannot be overcome by smarter AI alone.
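To make those two elements concrete, here is a toy sketch in Python. Everything in it – the descriptor values, the activity labels, and the nearest-centroid ‘training framework’ – is invented for illustration; real drug discovery models are vastly larger, but the anatomy is the same: annotated data in, a learning procedure, predictions out.

```python
# (1) An annotated data set: each "compound" has two invented
# descriptors (say, molecular weight/100 and logP) plus a label.
training_data = [
    ((3.2, 1.1), "active"),
    ((3.0, 1.4), "active"),
    ((5.1, 4.0), "inactive"),
    ((4.9, 3.7), "inactive"),
]

# (2) A training framework: here, a nearest-centroid classifier.
def train(data):
    sums, counts = {}, {}
    for (x, y), label in data:
        sx, sy = sums.get(label, (0.0, 0.0))
        sums[label] = (sx + x, sy + y)
        counts[label] = counts.get(label, 0) + 1
    # The "model" is just the mean descriptor vector per label.
    return {lbl: (sx / counts[lbl], sy / counts[lbl])
            for lbl, (sx, sy) in sums.items()}

def predict(model, point):
    def dist2(a, b):
        return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2
    # Assign the label whose centroid is closest.
    return min(model, key=lambda lbl: dist2(model[lbl], point))

model = train(training_data)
print(predict(model, (3.1, 1.2)))  # lands near the "active" examples
```

The point of the sketch is the division of labour: the data set carries the knowledge, and the human-designed framework only decides how that knowledge is generalised – which is exactly why a plateau in data quality caps what any framework can deliver.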

We are approaching a major plateau in the benefits we’ll see for drug discovery – one that cannot be overcome by smarter AI alone.

AI and the protein folding problem

A protein’s 3D structure provides immeasurable insight into normal function, pathogenicity, and – for drug discovery scientists – rational drug design. However, structural biology is complex, costly and technically challenging, which means that the ability to predict protein structure is essential.

In 1968, the American molecular biologist Cyrus Levinthal stipulated that a protein’s 3D structure and folding pathway must be encoded within its amino acid sequence. Randomly sampling all conformations to reach a physiologically stable 3D fold would – remarkably – take longer than the age of the universe. Yet proteins fold in milliseconds. With this realisation, the protein folding problem and the Levinthal Paradox emerged. Identifying the rules governing protein folding will have profound implications for biology and drug discovery.
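The scale of Levinthal’s argument is worth seeing in numbers. The figures below – a ~100-residue protein, ~3 conformations per residue, and an impossibly generous sampling rate – are the conventional textbook assumptions for this back-of-envelope calculation, not values from the article:

```python
# Back-of-envelope Levinthal paradox.
# Illustrative assumptions (standard in textbook statements):
residues = 100
conformations = 3 ** residues   # ~5e47 candidate folds
rate = 1e13                     # conformations sampled per second (generous)
seconds = conformations / rate

age_of_universe_s = 4.35e17     # ~13.8 billion years, in seconds
print(f"{conformations:.2e} conformations to search")
print(f"~{seconds / age_of_universe_s:.0e} x the age of the universe")
```

Even at ten trillion samples per second, an exhaustive search outlasts the universe by many orders of magnitude – which is why proteins folding in milliseconds implies they follow rules, not random search.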

And it’s clear data science and AI are having a huge impact here. The 2024 Nobel Prize in chemistry was awarded to the AlphaFold team (along with David Baker and his team, who developed Rosetta, a computational method for protein structure prediction) for the computational prediction of countless hitherto unsolved protein structures. The excitement in the field is still felt two years later.


So, have these teams solved the protein folding problem? Sadly for all of us, no they have not.

Nor, to my knowledge, have these innovators made such a claim. Excited media coverage may be responsible for the gross misunderstanding.

The great achievement is not being able to accurately predict every part of every protein – far from it. Take c-Myc, for example: perhaps the most well-known of oncogenes, even referred to as the grand orchestrator of cancer, yet the AlphaFold 3 server is unable to predict its structure. The achievement of AlphaFold – to date, at least – is that we can uncover structures that are similar to those we already know. Our technologies prior to AlphaFold were incapable of detecting such similarities.

While AlphaFold and similar algorithms are not yet a magic bullet, they can nonetheless be profoundly useful. For example, my own lab’s analysis shows that using the publicly available AlphaFold 2 models has doubled the number of druggable proteins available to the drug discovery community.
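One practical detail behind analyses like this: AlphaFold models report a per-residue confidence score, pLDDT, stored in the B-factor column of the PDB file, and a common first step before any structural analysis is to keep only confidently predicted residues. Below is a minimal sketch of that filtering step; the three ATOM records are invented for illustration, and the cut-off of 70 is the commonly used ‘confident’ threshold.

```python
# Filter residues of an AlphaFold-style model by pLDDT confidence.
# In AlphaFold PDB files, per-residue pLDDT is stored in the
# B-factor field (columns 61-66 of each ATOM record).
# The ATOM lines below are invented for illustration.
PDB_SNIPPET = """\
ATOM      1  CA  MET A   1      11.104  13.207   2.100  1.00 92.50           C
ATOM      2  CA  ALA A   2      12.560  14.100   3.800  1.00 88.10           C
ATOM      3  CA  GLY A   3      14.010  15.900   5.200  1.00 41.30           C
"""

PLDDT_CONFIDENT = 70.0  # widely used threshold for "confident"

def confident_residues(pdb_text, cutoff=PLDDT_CONFIDENT):
    keep = []
    for line in pdb_text.splitlines():
        # One CA atom per residue, so CA records give per-residue pLDDT.
        if line.startswith("ATOM") and line[12:16].strip() == "CA":
            resnum = int(line[22:26])    # residue sequence number
            plddt = float(line[60:66])   # pLDDT in the B-factor field
            if plddt >= cutoff:
                keep.append(resnum)
    return keep

print(confident_residues(PDB_SNIPPET))  # residues 1 and 2 pass; 3 does not
```

Low-pLDDT stretches often correspond to disordered regions – c-Myc being a prime example – which is one reason confidence filtering matters before, say, pocket detection for druggability assessment.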

The unparalleled foundation of deep, well annotated data, coupled to decades of computational-led understanding of patterns in these data, have positioned our field to be a major beneficiary of the AI revolution.

However, something absolutely key to understanding how we can successfully use AI is that its utility is a direct result of findings from ‘real’ experiments. AlphaFold could be developed because, in 1971 – in a moment of real vision – the Protein Data Bank was founded at Brookhaven National Laboratory to standardise and catalogue all future protein 3D structures. At the time, it contained only seven structures; now it boasts almost a quarter of a million, representing over 750,000 distinct protein snapshots. This regulated repository positioned generations of computational scientists to systematically analyse these data for patterns – and it is exactly the kind of repository on which AI can be trained.

What’s needed next?

The reason AI methods cannot predict the structure of proteins like c-Myc is that we lack the experimental data holding the key information the models need in order to learn.

We estimate that, in total, drug discovery research has only tested compounds against one quarter of the human proteome. AI algorithms, therefore, have limited chance of identifying completely novel areas of chemistry required to address some of the remaining three quarters.

So, we need more experimental data. Moving forward, we must invest in key data generation. AI ‘creates’ by interpolating from existing data, and any extrapolation remains within confined boundaries. AI cannot leap into the complete unknown without data ‘stepping stones’. Without these data, it’d be like expecting generative AI to create accurate images of the life forms roaming the exoplanet Kepler-62e based solely on photographs from Earth.

We must also define boundaries of capability and applicability for each algorithm we develop. Overhyping will erode trust. The scientific method, not dogma and theatrics, must dictate our investment decisions and our use of AI moving forward.

Author

Professor Bissan Al-Lazikani

Bissan is the Director of Therapeutics Data Science and Professor in the Department of Genomic Medicine, at the University of Texas MD Anderson Cancer Center.
