Evolution's Mathematical Whispers: Co-Evolution and the DCA Puzzle

Jul 15, 2018

Author: Ersi Ni | Date: Jul 15, 2018 | Log Entry: PhD Journal #3 | Focus: Evolutionary Bioinformatics & Statistical Models

It is mid-summer in the lab, and the heat outside is matched only by the computational intensity of our servers. Following my work on secondary structure, I’ve spent the last few months diving deep into the most magical data source in computational biology: Multiple Sequence Alignments (MSAs). Today, I want to write about how evolution acts as a natural laboratory, leaving mathematical footprints that allow us to predict 3D shapes from sequence text.

I also want to lay out two distinct, "failed" (or rather, unfinished) research ideas I sketched in my notebook late last year. I got stuck on both of them, but writing them out feels like a necessary step to see if anyone in the community has thoughts, or if future architectures will eventually crack them.

The Co-Evolution Magic

To understand how we predict 3D structures, we have to look at how proteins evolve.

Imagine two amino acids at positions i and j in a protein sequence. In the folded 3D shape, these two residues are physically touching—forming a salt bridge or a hydrophobic contact that stabilizes the entire shape.

Now, imagine a mutation occurs. Position i suddenly changes from a positively charged Lysine (K) to a negatively charged Glutamic Acid (E). If position j remains unchanged, its charge might repel the new residue, destabilizing the protein and lowering the organism's fitness. Lineages that survive tend to carry a compensatory mutation at position j that restores the balance.

Over millions of years, if residue i and residue j physically interact, their mutations will be statistically correlated. By aligning hundreds of related sequences from different organisms (an MSA), we can search for columns that mutate together.

Query:     M - K - A - L - E - Y - R ...
Homolog 1: M - E - A - L - K - Y - R ...
Homolog 2: M - R - A - L - D - Y - R ...
Homolog 3: M - K - A - L - E - Y - R ...
               ^           ^
             Pos i       Pos j

Whenever position i switches (K → E → R), position j switches in tandem (E → K → D), so every pairing keeps one positively charged and one negatively charged residue. This is co-evolution.
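To make "columns that mutate together" concrete, here is a minimal sketch that scores the toy alignment above with mutual information, the simple metric discussed in the next section. The gap-free string encoding and the helper names `column` and `mutual_information` are my own assumptions for illustration, and four sequences are far too few for real statistics.

```python
from collections import Counter
from math import log2

# The toy alignment from above, written without the dash separators.
msa = [
    "MKALEYR",  # Query
    "MEALKYR",  # Homolog 1
    "MRALDYR",  # Homolog 2
    "MKALEYR",  # Homolog 3
]

def column(msa, idx):
    """Return the residues found at one alignment column (0-based index)."""
    return [seq[idx] for seq in msa]

def mutual_information(col_a, col_b):
    """Mutual information in bits between two alignment columns."""
    n = len(col_a)
    freq_a = Counter(col_a)
    freq_b = Counter(col_b)
    freq_ab = Counter(zip(col_a, col_b))
    mi = 0.0
    for (a, b), count in freq_ab.items():
        p_ab = count / n
        mi += p_ab * log2(p_ab / ((freq_a[a] / n) * (freq_b[b] / n)))
    return mi

# Pos i is column index 1, Pos j is column index 4.
mi_ij = mutual_information(column(msa, 1), column(msa, 4))
# A fully conserved column (index 0, always M) carries no information.
mi_conserved = mutual_information(column(msa, 0), column(msa, 1))

print(f"MI(i, j)         = {mi_ij:.2f} bits")
print(f"MI(conserved, i) = {mi_conserved:.2f} bits")
```

On this toy data the i/j pair scores 1.5 bits while the conserved column scores zero. On a real MSA you would typically also reweight near-duplicate sequences and correct for small-sample bias before trusting such scores.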

The Mathematical Hurdle: Direct Coupling Analysis (DCA)

For a long time, researchers used a simple metric called Mutual Information (MI) to detect these correlations. But MI has a fatal flaw: it is a purely pairwise statistic, so it cannot tell a direct correlation from one that is merely transitive.

Imagine residue A touches residue B, and residue B touches residue C. Residue A does not touch residue C. Because A and B are co-evolving, and B and C are co-evolving, A and C will appear highly correlated in our sequence alignments. If we use a simple correlation matrix, we will predict a false 3D contact between A and C.

This was the "indirect correlation" bottleneck that stalled computational structural biology for decades.
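The A, B, C argument can be checked exactly on a tiny model. The sketch below is an assumption-laden toy, not real protein data: each residue is reduced to a two-state spin, the coupling strengths in `J` are made-up numbers, and the A-C coupling is set to exactly zero. It then enumerates all eight configurations and measures mutual information between each pair.

```python
from itertools import product
from math import exp, log2

SITES = ("A", "B", "C")
# Hypothetical coupling strengths: A-B and B-C touch, A-C does not.
J = {("A", "B"): 1.0, ("B", "C"): 1.0, ("A", "C"): 0.0}

def joint_distribution(couplings):
    """Exact Boltzmann distribution over all 8 spin configurations."""
    weights = {}
    for spins in product((-1, +1), repeat=len(SITES)):
        s = dict(zip(SITES, spins))
        score = sum(j * s[a] * s[b] for (a, b), j in couplings.items())
        weights[spins] = exp(score)
    z = sum(weights.values())
    return {spins: w / z for spins, w in weights.items()}

def pair_mi(p, i, j):
    """Mutual information (bits) between sites i and j, summing out the third."""
    p_ij, p_i, p_j = {}, {}, {}
    for spins, prob in p.items():
        a, b = spins[i], spins[j]
        p_ij[(a, b)] = p_ij.get((a, b), 0.0) + prob
        p_i[a] = p_i.get(a, 0.0) + prob
        p_j[b] = p_j.get(b, 0.0) + prob
    return sum(q * log2(q / (p_i[a] * p_j[b])) for (a, b), q in p_ij.items())

p = joint_distribution(J)
mi_ab = pair_mi(p, 0, 1)
mi_bc = pair_mi(p, 1, 2)
mi_ac = pair_mi(p, 0, 2)

print(f"MI(A,B) = {mi_ab:.3f}  MI(B,C) = {mi_bc:.3f}  MI(A,C) = {mi_ac:.3f}")
```

Even though A and C share no coupling in this model, their mutual information comes out clearly above zero, which is exactly the false contact a pairwise score would report.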

The breakthrough came around 2011 with the development of Direct Coupling Analysis (DCA) (and related Potts models). Instead of calculating local pairwise correlations, DCA models the entire sequence alignment as a global probability distribution using the Maximum Entropy principle.
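For reference, the maximum-entropy distribution that reproduces the single-column and pairwise residue frequencies of an alignment takes the Potts form below. The notation is a common convention rather than anything fixed by this post: $a_i$ is the residue at column $i$ of a sequence of length $L$, $h_i$ are per-column fields, $J_{ij}$ are pairwise couplings, and $Z$ is the normalising constant.

```latex
P(a_1, \dots, a_L) = \frac{1}{Z} \exp\left( \sum_{i=1}^{L} h_i(a_i) + \sum_{1 \le i < j \le L} J_{ij}(a_i, a_j) \right)
```

The couplings $J_{ij}$ are the quantities of interest: a strong $J_{ij}$ suggests a direct contact between columns $i$ and $j$. Evaluating $Z$ exactly means summing over every possible sequence, which is why practical methods rely on approximations.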

By globally solving for the coupling parameters, DCA mathematically disentangles direct physical contacts from transitive, indirect correlations. DCA contact maps are beautiful, but solving this inverse statistical physics problem is computationally expensive, and noisy alignments can distort the results.
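As a toy illustration of what "disentangling" means, the sketch below builds a three-site, two-state model with a known set of couplings and then reads the couplings back out of the full joint distribution. Everything here is an assumption for illustration: the values in `TRUE_J` are invented, the model has no fields, and the distribution is known exactly. Real DCA has to estimate these parameters from a finite, noisy alignment over roughly 21 states per column, using approximations such as mean-field inversion or pseudolikelihood maximisation.

```python
from itertools import product
from math import exp, log

N_SITES = 3  # sites 0, 1, 2 stand for residues A, B, C
# Hypothetical ground truth: A-B and B-C are in contact, A-C is not.
TRUE_J = {(0, 1): 1.0, (1, 2): 0.7, (0, 2): 0.0}

def joint_distribution(couplings):
    """Exact pairwise (Ising-like) distribution over all spin configurations."""
    weights = {}
    for spins in product((-1, +1), repeat=N_SITES):
        score = sum(j * spins[a] * spins[b] for (a, b), j in couplings.items())
        weights[spins] = exp(score)
    z = sum(weights.values())
    return {spins: w / z for spins, w in weights.items()}

def correlation(p, i, j):
    """Plain pairwise correlation <s_i s_j>, the kind of signal MI picks up."""
    return sum(prob * spins[i] * spins[j] for spins, prob in p.items())

def direct_coupling(p, i, j):
    """Recover J_ij from a log-odds ratio with the remaining site held at +1.

    In a pairwise model every term except J_ij cancels in this combination.
    """
    def logp(s_i, s_j):
        spins = [+1] * N_SITES
        spins[i], spins[j] = s_i, s_j
        return log(p[tuple(spins)])
    return (logp(+1, +1) + logp(-1, -1) - logp(+1, -1) - logp(-1, +1)) / 4.0

p = joint_distribution(TRUE_J)
recovered = {pair: direct_coupling(p, *pair) for pair in TRUE_J}
corr_ac = correlation(p, 0, 2)

print(f"correlation(A, C)   = {corr_ac:.3f}")
print(f"recovered J(A, C)   = {recovered[(0, 2)]:.3f}")
```

The A-C correlation is substantial, yet the recovered A-C coupling is zero up to floating-point error, while the A-B and B-C couplings come back at their true values. The hard part in practice is doing this without access to the exact distribution.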

Uncharted Territory: My Two Unpublished 2017 Ideas

During my first semester, I spent late nights sketching out deep learning experiments to improve how we process evolutionary alignments. I never got them published, and they are currently sitting in my directory as half-broken scratch files.

Here is what I was trying to do, and why I got stuck:

Idea #1 — Autoencoding BLAST Profiles

Stuck / Shelved

The Concept: I wanted to train a deep Autoencoder to compress a protein's dynamic, high-dimensional alignment distribution (from PSI-BLAST) into a stable, fixed-size latent embedding space, removing noise and parameterized bias.

The Bottleneck: Database Drift & Input Space Boundaries.
Sequence databases are dynamic and constantly expanding. If I run BLAST today, I get a certain set of homolog alignments. Next month, I get a richer set. This introduces input ambiguity. If the definition of "the BLAST profile of protein X" shifts daily, the autoencoder struggles to learn stable latent projections. Without a way to formulate database-invariant inputs, I had to shelve the training loop.
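The drift problem is easy to reproduce on paper. In the sketch below, the hit lists are invented and `column_profile` is a deliberately simplified stand-in for a PSI-BLAST profile (plain per-column frequencies, no pseudocounts, log-odds, or sequence weighting). It only shows that the same query yields a different input vector once the database returns more homologs.

```python
from collections import Counter

def column_profile(msa):
    """Per-column residue frequencies for a gap-free, equal-length alignment."""
    depth = len(msa)
    return [
        {res: count / depth for res, count in Counter(col).items()}
        for col in zip(*msa)
    ]

def profile_distance(p, q):
    """Mean L1 distance per column between two profiles of equal length."""
    total = 0.0
    for col_p, col_q in zip(p, q):
        residues = set(col_p) | set(col_q)
        total += sum(abs(col_p.get(r, 0.0) - col_q.get(r, 0.0)) for r in residues)
    return total / len(p)

# Hypothetical hits for the same query from two database snapshots.
hits_today = ["MKALEYR", "MEALKYR", "MRALDYR"]
hits_next_month = hits_today + ["MQALNYR", "MKVLEYR", "MEALKFR"]

profile_today = column_profile(hits_today)
profile_next_month = column_profile(hits_next_month)
drift = profile_distance(profile_today, profile_next_month)

print(f"mean per-column L1 drift: {drift:.3f}")
```

Any model trained on `profile_today` sees a shifted input the moment the search is rerun, which is the ambiguity described above.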

Idea #2 — 2D Attention Grids for Variable-Depth MSAs

Stuck / Shelved

The Concept: Treat the Multiple Sequence Alignment (MSA) as a 2D grid. The horizontal axis is sequence length L, and the vertical axis is alignment homolog depth N (variable). I wanted to build a vertical column attention mechanism that dynamically attended to homolog variants, outputting a single dense representation per position to feed into a horizontal scanning LSTM.
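The vertical pooling step can be sketched without any framework. This is not the block I actually tried to build: it uses one-hot residue vectors and a parameter-free score (does the homolog match the query residue?) where a real model would have learned projections, and the names `column_attention` and `temperature` are made up for the example. What it does show is the property I was after: whatever the depth N, the output is exactly one fixed-size vector per sequence position.

```python
from math import exp

ALPHABET = "ACDEFGHIKLMNPQRSTVWY-"  # 20 amino acids plus a gap symbol

def one_hot(residue):
    return [1.0 if residue == a else 0.0 for a in ALPHABET]

def softmax(scores):
    peak = max(scores)
    exps = [exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def column_attention(query, homologs, temperature=1.0):
    """Pool an aligned MSA of any depth into one vector per position."""
    rows = [query] + list(homologs)
    pooled = []
    for pos, query_residue in enumerate(query):
        q_vec = one_hot(query_residue)
        keys = [one_hot(row[pos]) for row in rows]
        # Dot-product score of each row against the query residue.
        scores = [
            sum(a * b for a, b in zip(q_vec, key)) / temperature
            for key in keys
        ]
        weights = softmax(scores)  # attention over the depth axis
        pooled.append([
            sum(w * key[d] for w, key in zip(weights, keys))
            for d in range(len(ALPHABET))
        ])
    return pooled

shallow = column_attention("MKALEYR", ["MEALKYR"])
deep = column_attention("MKALEYR", ["MEALKYR", "MRALDYR", "MKALEYR", "MK-LEYR"])

print(len(shallow), len(shallow[0]))  # positions, features
print(len(deep), len(deep[0]))
```

Both calls return a 7 by 21 grid even though the alignments differ in depth, so the output could in principle be fed to a horizontal sequence model.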

The Bottleneck: Variable Tensor Dimension Gradients.
In early 2018, deep learning frameworks treated LSTMs as strictly 1D sequence operators. I got completely lost in tensor dimension gymnastics: trying to write a dynamic, variable-depth vertical attention block that could cooperate with a horizontal recurrence scanner and backpropagate gradients cleanly through it without exhausting GPU memory.

Fig 3 — Visualizing the 2D Attention Grid concept. Scanning horizontally across sequence columns while dynamically "attending" vertically to variable-depth homologous alignments.

Reflections

It is fascinating to look at these blocks. I still believe that processing the raw, two-dimensional MSA grid directly, using attention rather than flattening it into a pre-computed BLAST matrix, is the ultimate way forward for structural biology. The PSSMs we use summarize each column independently, so they discard the joint (pairwise) statistics that co-evolution analysis depends on.
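The point about PSSMs losing joint information can be shown with two made-up, two-column alignments. The sequences and helper names below are hypothetical. Both alignments have identical per-column residue counts, which is all a PSSM-style profile sees, yet only one of them has the columns changing in lockstep.

```python
from collections import Counter

def column_counts(msa):
    """Per-column residue counts: the information a PSSM-style profile keeps."""
    return [Counter(col) for col in zip(*msa)]

def pair_counts(msa, i, j):
    """Joint counts of residue pairs at columns i and j."""
    return Counter((seq[i], seq[j]) for seq in msa)

coupled = ["KE", "KE", "EK", "EK"]    # columns always swap together
shuffled = ["KE", "KK", "EK", "EE"]   # same columns, pairing scrambled

same_profile = column_counts(coupled) == column_counts(shuffled)
same_pairs = pair_counts(coupled, 0, 1) == pair_counts(shuffled, 0, 1)

print("identical per-column counts:", same_profile)
print("identical pair counts:      ", same_pairs)
```

The per-column counts match while the pair counts do not, so a model fed only the profile cannot distinguish the co-evolving alignment from the scrambled one.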

For now, I'm focusing my main PhD line on convolutional residual blocks. But these unresolved ideas continue to whisper in the back of my mind.

Next time, I'm going to write about the benchmark that governs our entire lives in this field: CASP. It’s the double-blind Olympics of structural biology, and the upcoming CASP13 competition this winter is already causing nervous chatter among the postdocs in our lab.
