The CASP14 Watershed: AlphaFold 2 and the Dawn of End-to-End Attention

Series: protein-structure-prediction

Author: Ersi Ni  |  Date: Dec 08, 2020  |  Log Entry: PhD Thesis Conclusion  |  Focus: DeepMind AlphaFold 2 End-to-End Architecture

It has finally happened. Last week, at the virtual CASP14 conference, the independent assessors announced that DeepMind’s AlphaFold 2 achieved a median GDT_TS of 92.4 across all targets and a staggering 87.0 on the hardest Free Modeling targets. By that measure, single-chain protein structure prediction, the grand challenge of structural biology, is solved.

As I sit here, preparing to write the final chapters of the PhD thesis I started in the autumn of 2017, I feel an overwhelming sense of awe. I have spent the last three years in a front-row seat to scientific history.

Today, in the final entry of this blog series, I want to break down the architectural masterclass that is AlphaFold 2. We will look at why it represents a complete paradigm shift from AlphaFold 1, diving into the Evoformer and Invariant Point Attention (IPA).

The End-to-End Differentiable Paradigm

To understand AlphaFold 2's breakthrough, we have to look at what they threw away.

AlphaFold 1 was a brilliant but fragmented hybrid. A neural network predicted binned 2D inter-residue distances, and a separate step then minimized a potential built from those predictions with gradient descent (L-BFGS) to fold a chain that satisfied them. Gradients never flowed from the final 3D coordinates back to the network weights.
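To make the two-stage idea concrete, here is a minimal sketch of the second stage only. It is not AlphaFold 1's code: the target distance matrix stands in for the network's predicted distances, the potential is a plain squared error rather than one derived from distance distributions, and backtracking gradient descent replaces L-BFGS. All function names are hypothetical.

```python
import numpy as np

def potential_and_grad(coords, target):
    """Squared-error distance potential and its gradient w.r.t. the coordinates."""
    n = len(coords)
    diff = coords[:, None, :] - coords[None, :, :]   # (N, N, 3) pairwise offsets
    dist = np.linalg.norm(diff, axis=-1)             # (N, N) current distances
    off_diag = ~np.eye(n, dtype=bool)
    err = np.where(off_diag, dist - target, 0.0)
    safe = np.where(off_diag, dist, 1.0)             # avoid 0/0 on the diagonal
    value = 0.5 * np.sum(err ** 2)                   # 0.5 undoes (i, j)/(j, i) double counting
    grad = 2.0 * np.sum((err / safe)[:, :, None] * diff, axis=1)
    return value, grad

def fold(target, steps=300, seed=0):
    """Backtracking gradient descent on the coordinates (stand-in for L-BFGS)."""
    rng = np.random.default_rng(seed)
    coords = rng.normal(size=(len(target), 3))
    value, grad = potential_and_grad(coords, target)
    history = [value]
    for _ in range(steps):
        step = 0.1
        while step > 1e-8:                           # halve the step until the potential drops
            trial = coords - step * grad
            trial_value, trial_grad = potential_and_grad(trial, target)
            if trial_value < value:
                coords, value, grad = trial, trial_value, trial_grad
                break
            step *= 0.5
        history.append(value)
    return coords, history

# Toy "predicted" distances: taken from a random 10-residue chain with 3.8 A steps.
rng = np.random.default_rng(1)
bonds = rng.normal(size=(10, 3))
bonds = 3.8 * bonds / np.linalg.norm(bonds, axis=1, keepdims=True)
true_coords = np.cumsum(bonds, axis=0)
target = np.linalg.norm(true_coords[:, None] - true_coords[None, :], axis=-1)

folded, history = fold(target)
print(f"potential: {history[0]:.1f} -> {history[-1]:.4f}")
```

Note what is missing: the optimizer only ever touches the coordinates. Nothing in this loop can tell the distance predictor that its output led to a bad structure.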

AlphaFold 2 replaced this entire process with a single end-to-end differentiable network. There is no potential function. There is no separate folding step. The input is a sequence (together with its alignment and any templates), and the output is a set of physical $X, Y, Z$ atomic coordinates. Crucially, the structural loss is calculated directly on the 3D coordinates (using a metric called Frame Aligned Point Error, or FAPE) and backpropagated all the way through the network to the sequence embeddings.
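A rough NumPy sketch of a FAPE-style loss helps show why it suits this job: every atom position is expressed in every residue's local frame, so the loss does not change when the whole prediction is rotated or shifted. The clamp and scale of 10 Å, the small epsilon, and all function names here are assumptions for illustration, not a reference implementation.

```python
import numpy as np

def random_rotation(rng):
    """Random proper rotation from a QR decomposition (illustrative helper)."""
    q, _ = np.linalg.qr(rng.normal(size=(3, 3)))
    if np.linalg.det(q) < 0:
        q[:, 0] = -q[:, 0]                      # flip one axis to get det = +1
    return q

def to_local(rots, trans, points):
    """Express every point j in every frame i: x_ij = R_i^T (x_j - t_i)."""
    delta = points[None, :, :] - trans[:, None, :]      # (F, P, 3)
    return np.einsum('fba,fpb->fpa', rots, delta)       # applies R_i^T

def fape(pred_rots, pred_trans, pred_points,
         true_rots, true_trans, true_points,
         clamp=10.0, scale=10.0, eps=1e-4):
    """Clamped mean distance between local-frame coordinates (assumed defaults)."""
    local_pred = to_local(pred_rots, pred_trans, pred_points)
    local_true = to_local(true_rots, true_trans, true_points)
    dist = np.sqrt(np.sum((local_pred - local_true) ** 2, axis=-1) + eps)
    return float(np.mean(np.minimum(dist, clamp)) / scale)

rng = np.random.default_rng(0)
n_res = 12
true_rots = np.stack([random_rotation(rng) for _ in range(n_res)])
true_trans = rng.normal(scale=5.0, size=(n_res, 3))
true_points = true_trans.copy()                 # one point per residue, at the frame origin

# A noisy "prediction" of the same structure.
pred_rots = true_rots
pred_trans = true_trans + rng.normal(scale=0.5, size=(n_res, 3))
pred_points = pred_trans.copy()
loss = fape(pred_rots, pred_trans, pred_points, true_rots, true_trans, true_points)

# Move the whole prediction rigidly: the loss should not change.
g_rot, g_shift = random_rotation(rng), rng.normal(scale=20.0, size=3)
loss_moved = fape(g_rot @ pred_rots, pred_trans @ g_rot.T + g_shift,
                  pred_points @ g_rot.T + g_shift,
                  true_rots, true_trans, true_points)
print(loss, loss_moved)
```

Because no global superposition step is needed, a loss of this shape stays differentiable with respect to both the frames and the atom positions.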

Every single weight in the network is tuned specifically to place atoms in 3D space.

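  • 92.4: median GDT_TS across all targets
  • 87.0: median GDT_TS on the Free Modeling (hardest) targets
  • 90+: the GDT_TS level generally regarded as competitive with experimental structures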
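For readers who have not met GDT_TS before: it averages the percentage of Cα atoms that land within 1, 2, 4 and 8 Å of their experimental positions. The sketch below is a simplification that assumes the model is already superimposed on the reference. The official score searches over many superpositions and keeps the best fraction per cutoff, which this hypothetical `gdt_ts` helper does not do.

```python
import numpy as np

def gdt_ts(distances, cutoffs=(1.0, 2.0, 4.0, 8.0)):
    """Approximate GDT_TS from per-residue C-alpha deviations in angstroms.

    Assumes a fixed superposition; the real score optimizes the superposition
    separately for each cutoff.
    """
    d = np.asarray(distances, dtype=float)
    fractions = [(d <= c).mean() for c in cutoffs]   # fraction of residues under each cutoff
    return 100.0 * float(np.mean(fractions))

# Five residues: two within 1 A, three within 2 A, four within 4 A and 8 A.
deviations = np.array([0.4, 0.9, 1.5, 3.0, 9.0])
score = gdt_ts(deviations)
print(round(score, 1))  # 65.0
```

On this scale, a median of 92.4 means that on a typical target nearly every residue sits within a couple of ångströms of the experimental structure.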

1. The Evoformer: Co-Updating Sequence and Geometry

The heart of AlphaFold 2 is a novel transformer block called the Evoformer.

Instead of processing sequences and alignments into a static profile (like PSSMs), the Evoformer maintains two dynamic representation spaces that are updated simultaneously:

  • The MSA Representation (1D): Captures the evolutionary rows and columns of aligned homologous sequences.
  • The Pair Representation (2D): Captures the spatial geometric relationships (distances, contacts, angles) between all residue pairs.

By letting information flow bi-directionally between sequence space and spatial pair space over 48 Evoformer blocks, the model simultaneously "reasons" about evolutionary genetics and physical 3D geometry.
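The two directions of information flow can be sketched in a few lines of NumPy. This is a stripped-down illustration under my own assumptions (single attention head, random weights, no gating, layer norm or residual connections), and the function names are hypothetical. An outer product averaged over sequences pushes MSA information into the pair representation, and the pair representation comes back as a bias on MSA row attention.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def outer_product_mean(msa, w_a, w_b, w_out):
    """MSA (S, L, C_m) -> pair update (L, L, C_z), averaged over the S sequences."""
    a = msa @ w_a                                            # (S, L, c)
    b = msa @ w_b                                            # (S, L, c)
    outer = np.einsum('sic,sjd->ijcd', a, b) / msa.shape[0]  # mean over sequences
    length = msa.shape[1]
    return outer.reshape(length, length, -1) @ w_out         # (L, L, C_z)

def row_attention_with_pair_bias(msa, pair, w_q, w_k, w_v, w_bias):
    """Per-sequence attention over residues, biased by the pair representation."""
    q, k, v = msa @ w_q, msa @ w_k, msa @ w_v                # each (S, L, d)
    logits = np.einsum('sid,sjd->sij', q, k) / np.sqrt(q.shape[-1])
    logits = logits + (pair @ w_bias)[None, :, :]            # same (L, L) bias for every sequence
    return np.einsum('sij,sjd->sid', softmax(logits), v)     # (S, L, d)

rng = np.random.default_rng(0)
S, L, C_m, c, C_z = 5, 7, 8, 4, 6                            # toy sizes
msa = rng.normal(size=(S, L, C_m))
pair = np.zeros((L, L, C_z))

# MSA -> pair
pair = pair + outer_product_mean(msa,
                                 rng.normal(size=(C_m, c)),
                                 rng.normal(size=(C_m, c)),
                                 rng.normal(size=(c * c, C_z)))
# pair -> MSA
msa_update = row_attention_with_pair_bias(msa, pair,
                                          rng.normal(size=(C_m, C_m)),
                                          rng.normal(size=(C_m, C_m)),
                                          rng.normal(size=(C_m, C_m)),
                                          rng.normal(size=(C_z,)))
print(pair.shape, msa_update.shape)  # (7, 7, 6) (5, 7, 8)
```

Averaging over sequences makes the pair update independent of the order of the alignment rows, which seems like the right property for a set of homologs.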

[Figure: AlphaFold 2 End-to-End Coordinate Predictor. Inputs (1D sequence, 2D MSA grid, templates) → 48-block Evoformer (row/column attention on the MSA, outer product updates the pairs, triangle attention on geometry) → Structure Module (gas of 3D rigid bodies, Invariant Point Attention) → 3D coordinates output. Fully differentiable coordinate assembly pipeline.]
Fig 6 — AlphaFold 2's direct end-to-end structural architecture. The co-updated genetic and geometric representations guide invariant rigid-body assembly directly in 3D coordinate space.
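The "triangle attention" label in the figure deserves a small illustration. My reading is that the edge between residues i and j attends over all edges i to k that share its starting node, with a bias taken from the third edge j to k that closes the triangle. The sketch below is a single-head NumPy version with random weights and no gating or normalization. The weight names and the `triangle_attention_start` function are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def triangle_attention_start(pair, w_q, w_k, w_v, w_b):
    """Edge (i, j) attends over edges (i, k), biased by the closing edge (j, k)."""
    q = pair @ w_q                                   # (L, L, c) queries on edges (i, j)
    k = pair @ w_k                                   # (L, L, c) keys on edges (i, k)
    v = pair @ w_v                                   # (L, L, c) values on edges (i, k)
    bias = pair @ w_b                                # (L, L) one scalar per edge (j, k)
    logits = np.einsum('ijc,ikc->ijk', q, k) / np.sqrt(q.shape[-1])
    weights = softmax(logits + bias[None, :, :], axis=-1)   # normalize over k
    update = np.einsum('ijk,ikc->ijc', weights, v)   # (L, L, c)
    return update, weights

rng = np.random.default_rng(0)
L, C_z, c = 6, 8, 4                                  # toy sizes
pair = rng.normal(size=(L, L, C_z))
update, weights = triangle_attention_start(pair,
                                           rng.normal(size=(C_z, c)),
                                           rng.normal(size=(C_z, c)),
                                           rng.normal(size=(C_z, c)),
                                           rng.normal(size=(C_z,)))
print(update.shape, weights.shape)  # (6, 6, 4) (6, 6, 6)
```

Every update to one edge therefore sees the other two edges of each triangle it belongs to, which nudges the pair representation toward geometrically consistent distances.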

2. The Structure Module: Invariant Point Attention (IPA)

Once the Evoformer compiles this rich geometric representation, how do we get actual 3D coordinates?

AlphaFold 2 does not use distance matrices or physical force fields. Instead, they built a Structure Module that treats the protein as a "gas of 3D rigid bodies."

  1. Every amino acid residue is initialized as a rigid triangular frame defined by its backbone atoms (N, Cα, C), sitting at the origin (0, 0, 0).
  2. The network uses Invariant Point Attention (IPA), a custom attention mechanism whose outputs are invariant to global 3D rotations and translations of the whole structure, so the frame updates built on top of it are SE(3)-equivariant.
  3. In each of 8 iterative blocks, the Structure Module predicts an update for every residue frame, a rotation matrix R and a translation vector t, moving and turning the "gas of residues" in 3D space until the frames assemble into a continuous, folded polypeptide chain.
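The three steps above can be sketched with plain NumPy. This is only an illustration under my own assumptions: random rotations and translations stand in for the network's predicted updates, and the function names are hypothetical. It shows frames starting at the identity, eight rounds of composed rigid updates, and the property IPA relies on: distances between points attached to the frames do not change when the whole structure is moved rigidly.

```python
import numpy as np

def random_rotation(rng):
    """Random proper rotation from a QR decomposition (illustrative helper)."""
    q, _ = np.linalg.qr(rng.normal(size=(3, 3)))
    if np.linalg.det(q) < 0:
        q[:, 0] = -q[:, 0]                          # flip one axis to get det = +1
    return q

def compose(rots, trans, d_rots, d_trans):
    """Apply each update in the residue's own frame: R <- R dR, t <- R dt + t."""
    new_rots = rots @ d_rots
    new_trans = np.einsum('nab,nb->na', rots, d_trans) + trans
    return new_rots, new_trans

def attached_point_sqdist(rots, trans, local_points):
    """Squared distances between points carried by each frame (IPA-style geometric term)."""
    global_pts = np.einsum('nab,nb->na', rots, local_points) + trans
    diff = global_pts[:, None, :] - global_pts[None, :, :]
    return np.sum(diff ** 2, axis=-1)

rng = np.random.default_rng(0)
n_res = 6

# Step 1: every residue starts as an identity frame at the origin.
rots = np.tile(np.eye(3), (n_res, 1, 1))
trans = np.zeros((n_res, 3))

# Step 3: eight rounds of per-residue rigid updates (random stand-ins here).
for _ in range(8):
    d_rots = np.stack([random_rotation(rng) for _ in range(n_res)])
    d_trans = rng.normal(size=(n_res, 3))
    rots, trans = compose(rots, trans, d_rots, d_trans)

# Step 2: the geometric quantity is unchanged by a global rigid motion.
local_points = rng.normal(size=(n_res, 3))
before = attached_point_sqdist(rots, trans, local_points)
g_rot, g_shift = random_rotation(rng), rng.normal(scale=10.0, size=3)
after = attached_point_sqdist(g_rot @ rots, trans @ g_rot.T + g_shift, local_points)
print(np.allclose(before, after))  # True
```

Because the attention logits only ever see quantities like `before`, the network cannot tell which way the structure happens to be oriented in the lab frame, and it never has to learn that orientation is irrelevant.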

Concluding My PhD Journey (2017 - 2020)

When I started my PhD in 2017, the field had plateaued. We wrestled with secondary structure accuracies of around 80% and contact predictions that could barely resolve simple folds. We had complex, unresolved ideas about attention grids and autoencoding BLAST alignments that we couldn't implement because the deep learning primitives simply didn't exist.

Three years later, the landscape is unrecognizable. AlphaFold 2 solved the folding problem by realizing that proteins are not just sequences or images—they are evolutionary arrays that can be modeled using multi-dimensional attention, and physical geometries that can be directly assembled end-to-end.

PhD Thesis Defended (December 2020)

The folding problem is solved, but a new era is just beginning. We now have the tools to design custom proteins from scratch, map out entire viral proteomes in minutes, and simulate cellular mechanics with atomic precision. The code compiles, the thesis is ready, and the future is wide open.

Thank you for following this PhD journal series (2017 - 2020). Compilation complete.


This is a post in the protein-structure-prediction series.