From Pixels to Peptides: Predicting Secondary Structure with Bi-LSTMs and ResCNNs
Feb 20, 2018
Series: protein-structure-prediction
Five months into my PhD, I have finally crawled out of the biophysics textbooks and into the PyTorch code. After spending the winter getting my hands dirty with structural data, I’ve completed my first major milestone: designing and training a deep neural network for the Secondary Structure Prediction problem. I'm excited to share that my single-model architecture reaches a solid Q3 accuracy of 81% to 83% on standard benchmarks.
Here is the story of how I built it, the architecture design, and how I set up the input encoding pipeline.
What is Secondary Structure, and Why Predict It?
Before we can solve the ultimate "Ab Initio" folding problem (predicting 3D atomic coordinates from a sequence), we need to solve a simpler, intermediate problem.
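Proteins don’t just collapse into a messy ball all at once. They fold locally into highly stable, recurring geometric patterns called Secondary Structures (SS). The two dominant types, α-helices and β-sheets, were both proposed by Linus Pauling and colleagues in 1951; everything else is grouped into a third class, coil. The three classes are: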
- α-helices: Tight, right-handed coils stabilized by hydrogen bonds between the backbone C=O of residue i and the N-H of residue i+4.
- β-sheets: Extended strands that align side-by-side (either parallel or anti-parallel) to form sheet-like structures stabilized by lateral hydrogen bonds.
- Coils/Loops: Anything that isn't a helix or a sheet—these are the flexible hinges and loops that connect the structured blocks.
In bioinformatics, this is treated as a sequence-to-sequence classification problem: every residue in the chain gets exactly one of three labels, H (helix), E (strand), or C (coil).
Structure: C - C - H - H - H - E - E - C - C - H - H - C - C - C - C
If we can predict these local structures with high accuracy, we establish strong geometric constraints that drastically shrink the search space for 3D folding.
Enriching the Input: The BLAST Encoding Pipeline
A naive approach would be to feed the raw amino acid sequence directly into a network. We have 20 standard amino acids, so we could represent each as a one-hot vector of length 20.
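As a point of reference, the naive one-hot baseline described above can be sketched in a few lines of plain Python. The alphabet ordering and the all-zero row for non-standard residues are assumptions made for this illustration, not details from my pipeline.

```python
# Hypothetical alphabet ordering; any fixed order of the 20 standard residues works.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def one_hot_encode(sequence):
    """Return one length-20 one-hot row per residue in the sequence."""
    rows = []
    for aa in sequence:
        row = [0] * len(AMINO_ACIDS)
        # Assumption: non-standard residues (e.g. 'X') stay as an all-zero row.
        if aa in AA_INDEX:
            row[AA_INDEX[aa]] = 1
        rows.append(row)
    return rows


encoded = one_hot_encode("MKV")
print(len(encoded), len(encoded[0]))  # 3 20
```

Each row carries only the identity of the residue, which is exactly why the evolutionary profile is worth adding.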
But machine learning is data-driven, and we can make our model's life much easier by enriching our input profiles using evolutionary information. If we find homologous (similar) proteins in nature, the parts of the protein that perform crucial structural roles will show distinct conservation and substitution patterns.
To capture this, my input pipeline does the following:
- BLAST Alignments: For a given query protein sequence, I run PSI-BLAST against a massive non-redundant database to retrieve related sequences and compile a Position-Specific Scoring Matrix (PSSM).
- Entropy-Based Encoding: For each column in the alignment, I calculate the Shannon entropy of the amino acid distribution. A low entropy indicates that evolution has strictly preserved a specific amino acid at this position (likely a core structural pivot), while high entropy indicates a flexible, highly mutable loop.
- Distance-Based Calculations: I enrich the PSSM with distance-based substitution matrices (like BLOSUM62) to measure the biochemical "drift" or likelihood of specific mutations, giving the model a rich chemical coordinate space.
The resulting input features form a dense 2D matrix of shape (L, 22), where L is the length of the protein chain, representing both the local chemical identity and the multi-million-year evolutionary history of that position.
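To make the entropy step concrete, here is a minimal sketch of per-column Shannon entropy over a toy alignment. The alignment, the function names, and the choice to ignore gap characters are assumptions for illustration; how this value is combined with the PSSM into the final (L, 22) matrix is not shown here.

```python
import math
from collections import Counter


def column_entropy(column):
    """Shannon entropy (in bits) of the residues in one alignment column."""
    # Assumption: gap characters are ignored rather than counted as a symbol.
    residues = [aa for aa in column if aa != "-"]
    if not residues:
        return 0.0
    total = len(residues)
    counts = Counter(residues)
    # sum of p * log2(1 / p) over the observed residues
    return sum((n / total) * math.log2(total / n) for n in counts.values())


def alignment_entropies(aligned_sequences):
    """Per-position entropy for a list of equal-length aligned sequences."""
    return [column_entropy(col) for col in zip(*aligned_sequences)]


# Toy alignment (made up): positions 0 and 3 are fully conserved,
# position 2 is maximally variable across the four sequences.
toy_alignment = ["MKVL", "MKIL", "MRAL", "MKGL"]
entropies = alignment_entropies(toy_alignment)
print([round(h, 3) for h in entropies])  # [0.0, 0.811, 2.0, 0.0]
```

The conserved columns score 0 bits, while the column with four different residues scores the maximum of 2 bits for four sequences.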
Architecture Design: The Bi-LSTM + ResCNN Interlace
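To capture both the global sequential grammar and the local spatial features of proteins, I designed a single, unified neural network that interlaces two powerful deep learning paradigms: residual convolutional blocks (ResCNN) for local patterns and a bidirectional LSTM (Bi-LSTM) for long-range context.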
To prevent vanishing gradients and allow the model to build deep hierarchies, I arranged the convolutional layers into Residual Blocks. A residual block uses a skip connection that adds the block's input x to its output F(x):
y = F(x) + x
This allows local sequence shapes to pass directly through the network, preventing spatial patterns from getting washed out in deep layers.
Convolutions are excellent for local patterns, but proteins also have long-range dependencies. A β-sheet is formed when two strands that are hundreds of residues apart in the linear sequence fold back and pair up. To capture these long-range interactions, I interlace the convolutions with a Bidirectional LSTM (Bi-LSTM).
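The interlacing idea can be sketched in PyTorch roughly as follows. This is a minimal illustration, not my actual network: the class names, layer counts, hidden size, kernel size, and the use of batch normalization are all assumptions. Only the residual form y = F(x) + x, the Bi-LSTM, and the (L, 22) input with three output classes come from the text.

```python
import torch
import torch.nn as nn


class ResidualBlock1D(nn.Module):
    """Two 1D convolutions wrapped in a skip connection: y = F(x) + x."""

    def __init__(self, channels, kernel_size=3):
        super().__init__()
        padding = kernel_size // 2  # keeps the sequence length unchanged (odd kernels)
        self.body = nn.Sequential(
            nn.Conv1d(channels, channels, kernel_size, padding=padding),
            nn.BatchNorm1d(channels),
            nn.ReLU(),
            nn.Conv1d(channels, channels, kernel_size, padding=padding),
            nn.BatchNorm1d(channels),
        )
        self.activation = nn.ReLU()

    def forward(self, x):  # x: (batch, channels, length)
        return self.activation(self.body(x) + x)


class InterlacedTagger(nn.Module):
    """Hypothetical ResCNN -> Bi-LSTM -> ResCNN stack with per-residue logits."""

    def __init__(self, in_features=22, hidden=64, num_classes=3):
        super().__init__()
        self.project = nn.Conv1d(in_features, hidden, kernel_size=1)
        self.res1 = ResidualBlock1D(hidden)
        # Each direction gets hidden // 2 units, so the concatenated output is `hidden` wide.
        self.bilstm = nn.LSTM(hidden, hidden // 2, batch_first=True, bidirectional=True)
        self.res2 = ResidualBlock1D(hidden)
        self.classify = nn.Conv1d(hidden, num_classes, kernel_size=1)

    def forward(self, features):  # features: (batch, length, in_features)
        x = self.project(features.transpose(1, 2))    # (batch, hidden, length)
        x = self.res1(x)                              # local patterns
        lstm_out, _ = self.bilstm(x.transpose(1, 2))  # (batch, length, hidden)
        x = self.res2(lstm_out.transpose(1, 2))       # local patterns again
        return self.classify(x).transpose(1, 2)       # (batch, length, num_classes)


model = InterlacedTagger()
logits = model(torch.randn(2, 50, 22))  # two random "proteins" of length 50
print(logits.shape)  # torch.Size([2, 50, 3])
```

The convolutions expect channels-first tensors while the LSTM expects the sequence dimension second, which is why the sketch transposes between the two layouts at each hand-off.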
Results and Performance
By training this model on a non-redundant subset of the Protein Data Bank (using standard CullPDB splits), my single-model network achieves a Q3 accuracy of 81% to 83% on standard benchmarks.
Hitting 83% on a single model without massive ensembles or pre-trained transformers feels like a massive win for my first PhD project. It shows that interlacing sequence-aware LSTMs with spatially aware ResCNNs captures the dual biophysical nature of proteins: they are both linear chemical strings and 3D spatial objects.
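For readers new to the metric, Q3 is simply the fraction of residues whose three-state label (H, E, or C) is predicted correctly. A minimal sketch follows; the observed string is the example labeling from earlier in this post, and the predicted string is made up for illustration.

```python
def q3_accuracy(predicted, observed):
    """Fraction of residues whose 3-state label (H, E, C) matches the reference."""
    if len(predicted) != len(observed):
        raise ValueError("predicted and observed must have the same length")
    if not observed:
        return 0.0
    correct = sum(p == o for p, o in zip(predicted, observed))
    return correct / len(observed)


observed = "CCHHHEECCHHCCCC"   # example labeling from this post
predicted = "CCHHHHECCHCCCCC"  # hypothetical prediction with two mistakes
score = q3_accuracy(predicted, observed)
print(f"Q3 = {score:.3f}")  # Q3 = 0.867
```

Two wrong residues out of 15 give 13/15, so a score in the low 80s means roughly one residue in six is still mislabeled.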
In my next post, I want to step back and look at Co-Evolution and Direct Coupling Analysis (DCA)—the mathematical concepts that allow us to detect 3D contacts from sequence alignments alone, and write about some wild, unpublished ideas I’ve been sketching in my research notebook.
Next Log Entry: Evolution's Mathematical Whispers: Co-Evolution and the DCA Puzzle.