Not knowing your expertise, I’ll risk making a fool of myself.
But first, for everyone’s sake: ALPHAFOLD ISN’T AN LLM
That isn’t what AlphaFold is. Under no circumstances is it anywhere close to a “database of every way … proteins can fold.” At all. Construing it as anything other than an attempt to predict or simulate the three-dimensional structure a protein can adopt will mislead people.
A protein does not have one shape. It has multiple. They’re dynamic. They change shape. It’s how proteins represent information. I’d argue that the ability of proteins to change their shape is one of their most important properties. (Analogy: play a song on an instrument with one note. Hard-mode: consider silence a note)
Everything in AlphaFold is a GUESSED MODEL and not reality. Crystal structures and cryo-EM structures are also MODELS, but they’re based on empirical evidence.
AlphaFold is based on statistical evidence. It is evidence, but it is weaker. If the structural database has no example of how a protein might fold, AlphaFold will struggle to predict its structure, unless the protein at least shares some sequence similarity with one that has been solved.
I see this in the AI drug companies, where they treat predictive models the same as 2-angstrom crystal structures, and it pisses me off.
It’s honestly a super subtle difference and only structure heads like me care about it. It matters on the edges. My advisor gave sage advice that I think more people should take to heart. To paraphrase: every experiment has limitations and assumptions baked in. That doesn’t mean its results are invalid or irrelevant, but you need to know what those assumptions are so you know when they are violated or don’t apply.
It’s a spin on the “all models are wrong, but some are useful” line.
/pedagogical soapbox
It’s a great use of AI / machine learning tech. Incredible. Turns out biology reuses structure a LOT. Structure is function in biology, and there are a LOT of shared, essential functions in biology. Their models are actually incredibly accurate at predicting the individual atom placement for side chains (the bit of an amino acid that distinguishes it from other amino acids). Side chains do the chemistry for proteins, so this is highly salient for research broadly. It’s just far from being a “solved problem” like they would have you believe.
The main thing to keep in mind is that AlphaFold is not predicting structures from first principles (that is to say, from the underlying physics and laws of nature). It uses sequence similarity to proteins with solved structures to make probable guesses as to the structure and how it folds. Those guesses rest solely on the structure models we currently have from experimental data.
This works surprisingly well even when the solved proteins don’t have much in common with the amino acid sequence of the protein being predicted. Because structure is function, we can trace the divergence in sequence over time while still being confident the overall shape is the same.
But for things that don’t have enough sequence-diverse examples, or that have no examples at all, AlphaFold regularly just spits them out like a literal ball of spaghetti, because the assumptions it relies on are invalid. There’s not enough statistical evidence for those cases.
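If you want to check for the spaghetti yourself, here is a minimal sketch in Python. It assumes the file is an AlphaFold-style PDB, where the per-residue confidence score (pLDDT, 0 to 100) is written into the B-factor column, and it uses a cutoff of 50, the level below which the AlphaFold database labels a residue “very low” confidence. The function names and the `example_ca_line` helper are mine, made up for illustration, and are not part of any AlphaFold tooling.

```python
def example_ca_line(serial, resnum, plddt):
    """Build one fixed-width PDB ATOM record for a CA atom (illustration only)."""
    return (
        f"{'ATOM':<6}{serial:>5} {'CA':^4} {'MET':>3} {'A'}{resnum:>4}"
        f"{'':4}{0.0:8.3f}{0.0:8.3f}{0.0:8.3f}{1.0:6.2f}{plddt:6.2f}"
    )


def plddt_per_residue(pdb_text):
    """Return {(chain, residue_number): pLDDT}, read from CA atoms only.

    Assumption: the B-factor field (PDB columns 61-66) holds pLDDT,
    as in AlphaFold output files. In an experimental structure this
    field means something else entirely.
    """
    scores = {}
    for line in pdb_text.splitlines():
        if line.startswith("ATOM") and line[12:16].strip() == "CA":
            chain = line[21]
            resnum = int(line[22:26])
            scores[(chain, resnum)] = float(line[60:66])
    return scores


def low_confidence(scores, cutoff=50.0):
    """List residues whose pLDDT falls below the cutoff, sorted by position."""
    return sorted(key for key, value in scores.items() if value < cutoff)


if __name__ == "__main__":
    # Three made-up residues: confident, so-so, and spaghetti.
    pdb = "\n".join(
        example_ca_line(i, i, score)
        for i, score in [(1, 92.5), (2, 65.0), (3, 31.2)]
    )
    print(low_confidence(plddt_per_residue(pdb)))
```

Long runs of residues under the cutoff are the regions to be suspicious of. A low score can mean the region is genuinely disordered or just that there wasn’t enough evidence, and the number alone won’t tell you which.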
I don’t have a handy reference offhand (it’s been a few years), but there is this image from their blog posts ( ref ) that shows what I’m talking about. My understanding is that AlphaFold is an incredibly accurate and effective homology modeling strategy (how can we model structures we don’t have data for, based on similarity to structures we do have).
Disclaimer: I am not an expert in structural prediction, homology models, machine learning, or structural determination via crystallography or cryo-EM. I’m more experienced in consuming their output to understand the structure:function relationship.