From Pixels to Peptides: Predicting Secondary Structure with Bi-LSTMs and ResCNNs

Series: protein-structure-prediction

Author: Ersi Ni  |  Date: Feb 20, 2018  |  Log Entry: PhD Journal #2  |  Focus: Deep Learning Sequence-to-Sequence Modeling

Five months into my PhD, I have finally crawled out of the biophysics textbooks and into the PyTorch code. After spending the winter getting my hands dirty with structural data, I’ve completed my first major milestone: designing and training a deep neural network for the Secondary Structure Prediction problem. I'm excited to share that my single-model architecture is achieving 81% to 83% Q3 accuracy (the fraction of residues assigned the correct one of three states) on standard benchmarks.

Here is the story of how I built it, the architecture design, and how I set up the input encoding pipeline.

What is Secondary Structure, and Why Predict It?

Before we can solve the ultimate "Ab Initio" folding problem (predicting 3D atomic coordinates from a sequence), we need to solve a simpler, intermediate problem.

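Proteins don’t just collapse into a messy ball all at once. They fold locally into highly stable, recurring geometric patterns called Secondary Structures (SS). The two dominant regular types, the α-helix and the β-sheet, were proposed by Linus Pauling and Robert Corey in 1951. Together with a catch-all coil class, they give the three states used here: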

Three-state labels: α-Helix (H), β-Sheet (E), Coil/Loop (C)
  • α-helices: Tight, right-handed coils stabilized by hydrogen bonds between the backbone C=O of residue i and the N-H of residue i+4.
  • β-sheets: Extended strands that align side-by-side (either parallel or anti-parallel) to form sheet-like structures stabilized by lateral hydrogen bonds.
  • Coils/Loops: Anything that isn't a helix or a sheet—these are the flexible hinges and loops that connect the structured blocks.

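In bioinformatics, this is treated as a sequence-to-sequence classification problem: each residue in the sequence is assigned one of the three labels, so the output has the same length as the input.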

Sequence: M - K - V - L - L - Y - A - G - I - F - S - Q - L - L - D
Structure: C - C - H - H - H - E - E - C - C - H - H - C - C - C - C

If we can predict these local structures with high accuracy, we establish strong geometric constraints that make 3D folding considerably easier.

Enriching the Input: The BLAST Encoding Pipeline

A naive approach would be to feed the raw amino acid sequence directly into a network. We have 20 standard amino acids, so we could represent each as a one-hot vector of length 20.
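A minimal sketch of that naive encoding, in plain Python. The alphabetical ordering of the 20 letters and the all-zero row for unknown residues are assumptions made for illustration, not details taken from the pipeline described in this post.

```python
# The 20 standard amino acids as one-letter codes.
# Alphabetical order is an arbitrary (assumed) column ordering.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def one_hot_encode(sequence):
    """Return an L x 20 list of rows; unknown residues become all-zero rows."""
    encoded = []
    for residue in sequence.upper():
        row = [0.0] * len(AMINO_ACIDS)
        idx = AA_INDEX.get(residue)
        if idx is not None:
            row[idx] = 1.0
        encoded.append(row)
    return encoded


# The 15-residue example sequence from earlier in the post.
features = one_hot_encode("MKVLLYAGIFSQLLD")
print(len(features), len(features[0]))
```

Each row carries only the identity of one residue, which is why the evolutionary features described next add so much information.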

But machine learning is data-driven, and we can make our model's life much easier by enriching our input profiles using evolutionary information. If we find homologous (similar) proteins in nature, the parts of the protein that perform crucial structural roles will show distinct conservation and substitution patterns.

To capture this, my input pipeline does the following:

  1. BLAST Alignments: For a given query protein sequence, I run PSI-BLAST against a massive non-redundant database to retrieve related sequences and compile a Position-Specific Scoring Matrix (PSSM).
  2. Entropy-Based Encoding: For each column in the alignment, I calculate the Shannon entropy of the amino acid distribution. Low entropy indicates that evolution has strictly preserved a specific amino acid at this position (likely a core structural pivot), while high entropy indicates a variable position, such as a flexible, highly mutable loop.
  3. Distance-Based Calculations: I enrich the PSSM with distance-based substitution matrices (like BLOSUM62) to measure the biochemical "drift" or likelihood of specific mutations, giving the model a rich chemical coordinate space.
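To make the entropy step concrete, here is a small sketch that computes the per-column Shannon entropy H = -Σ p(a) log2 p(a) for a toy alignment. The alignment itself, the function names, and the choice to ignore gap characters are all assumptions for illustration; the real pipeline works on PSI-BLAST output and may treat gaps differently.

```python
import math
from collections import Counter


def column_entropy(column, gap_chars="-."):
    """Shannon entropy in bits of one alignment column, ignoring gaps (assumed)."""
    residues = [c for c in column.upper() if c not in gap_chars]
    if not residues:
        return 0.0
    total = len(residues)
    counts = Counter(residues)
    return sum(-(n / total) * math.log2(n / total) for n in counts.values())


def alignment_entropies(alignment):
    """Per-column entropies for a list of equal-length aligned sequences."""
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")
    return [
        column_entropy("".join(seq[i] for seq in alignment))
        for i in range(length)
    ]


# Hypothetical 4-sequence alignment: columns 0 and 3 are fully conserved,
# column 2 is fully variable among its non-gap residues.
toy_alignment = ["MKVL", "MKIL", "MRLL", "MK-L"]
entropies = alignment_entropies(toy_alignment)
print([round(h, 3) for h in entropies])
```

Conserved columns come out at zero bits, and the value grows as the residue distribution at a position spreads out.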

The resulting input features form a dense 2D matrix of shape (L, 22), where L is the length of the protein chain, representing both the local chemical identity and the multi-million-year evolutionary history of that position.

Architecture Design: The Bi-LSTM + ResCNN Interlace

To capture both the global sequential grammar and the local spatial features of proteins, I designed a single, unified neural network that interlaces two powerful deep learning paradigms:

Diagram components: Input Profile (L × 22 PSSM); 1D Convolutions in Residual Blocks (local spatial windows); Bidirectional LSTM, forward (left to right) and backward (right to left), for long-range context; Q3 Prediction H / E / C (Accuracy: 81-83%)
Fig 2 — The Bi-LSTM + ResCNN Architecture. Convolutions process local secondary structure segments, while the Bidirectional LSTM registers global chemical grammar and long-range alignment contacts.

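The convolutional half of the network uses 1D convolutions that slide along the sequence and pick up local patterns within a fixed window. To prevent vanishing gradients and allow the model to build deep hierarchies, I arranged these convolutions into Residual Blocks. A residual block uses skip connections: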

y = F(x) + x

Here x is the block's input and F(x) is the output of its convolutional layers. Because x is added back unchanged, local sequence patterns can pass directly through the network instead of getting washed out in deep layers.
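A residual block of this kind can be sketched in PyTorch as follows. The class name, the two-convolution form of F, the kernel size, and the channel count are assumptions chosen for illustration, not the exact layer configuration of the model in this post.

```python
import torch
import torch.nn as nn


class ResidualBlock1D(nn.Module):
    """Computes y = F(x) + x, where F is two same-padded 1D convolutions."""

    def __init__(self, channels, kernel_size=3):
        super().__init__()
        padding = kernel_size // 2  # keeps the length L unchanged for odd kernels
        self.conv1 = nn.Conv1d(channels, channels, kernel_size, padding=padding)
        self.conv2 = nn.Conv1d(channels, channels, kernel_size, padding=padding)
        self.relu = nn.ReLU()

    def forward(self, x):
        # x has shape (batch, channels, L)
        out = self.relu(self.conv1(x))
        out = self.conv2(out)
        return out + x  # skip connection: the input is added back unchanged


# Hypothetical sizes: batch of 2 proteins, 64 channels, 100 residues each.
block = ResidualBlock1D(channels=64)
x = torch.randn(2, 64, 100)
y = block(x)
print(y.shape)
```

Because the padding preserves L and the channel count is fixed, the sum F(x) + x is always shape-compatible, which is what lets these blocks be stacked freely.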

Convolutions are excellent for local patterns, but proteins also have long-range dependencies. A β-sheet can form when two strands that are far apart in the linear sequence, sometimes by hundreds of residues, fold back and pair up. To capture these long-range interactions, I interlace the convolutions with a Bidirectional LSTM (Bi-LSTM).
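The overall idea can be sketched as a small PyTorch module: a residual convolutional stage for local context, followed by a Bi-LSTM for long-range context, followed by a per-residue classifier. This is a simplified illustration with a single stage of each kind. The class name, layer counts, and all sizes other than the 22 input features and 3 output classes are assumptions, and the actual model interlaces more layers than shown here.

```python
import torch
import torch.nn as nn


class ResCNNBiLSTM(nn.Module):
    """Sketch: residual 1D conv stage -> Bi-LSTM -> per-residue 3-class logits."""

    def __init__(self, in_features=22, channels=64, hidden=64,
                 num_classes=3, kernel_size=5):
        super().__init__()
        pad = kernel_size // 2  # same-padding for odd kernel sizes
        self.input_proj = nn.Conv1d(in_features, channels, kernel_size=1)
        self.conv_a = nn.Conv1d(channels, channels, kernel_size, padding=pad)
        self.conv_b = nn.Conv1d(channels, channels, kernel_size, padding=pad)
        self.bilstm = nn.LSTM(channels, hidden, batch_first=True,
                              bidirectional=True)
        self.classifier = nn.Linear(2 * hidden, num_classes)

    def forward(self, x):
        # x: (batch, L, in_features)
        h = self.input_proj(x.transpose(1, 2))           # (batch, channels, L)
        h = h + self.conv_b(torch.relu(self.conv_a(h)))  # residual conv stage
        h, _ = self.bilstm(h.transpose(1, 2))            # (batch, L, 2 * hidden)
        return self.classifier(h)                        # (batch, L, num_classes)


# Hypothetical batch: 2 proteins of length 50 with 22 features per residue.
model = ResCNNBiLSTM()
logits = model(torch.randn(2, 50, 22))
print(logits.shape)
```

The forward and backward LSTM states are concatenated, so every residue's prediction can draw on context from both ends of the chain.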

Results and Performance

By training this model on a non-redundant subset of the Protein Data Bank (using standard CullPDB splits), my single-model network achieves:

  • Max Q3 Accuracy: 83.2%
  • α-Helix (H) Precision: 85.3%
  • β-Sheet (E) Precision: 78.2%
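For reference, these metrics can be computed from label strings as sketched below. Q3 is the fraction of residues whose predicted state matches the true state, and precision for a class is the fraction of residues predicted as that class that truly belong to it. The function names and the predicted string are made up for illustration; the true string is the toy example from earlier in the post, not benchmark data.

```python
def q3_accuracy(true_labels, predicted_labels):
    """Fraction of residues whose predicted state (H/E/C) matches the true one."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label sequences must have the same length")
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)


def class_precision(true_labels, predicted_labels, label):
    """Of all residues predicted as `label`, the fraction that truly are."""
    truths_for_predicted = [
        t for t, p in zip(true_labels, predicted_labels) if p == label
    ]
    if not truths_for_predicted:
        return 0.0  # assumed convention when the class is never predicted
    hits = sum(t == label for t in truths_for_predicted)
    return hits / len(truths_for_predicted)


true_ss = "CCHHHEECCHHCCCC"  # toy structure string from earlier in the post
pred_ss = "CCHHHHECCHCCCCC"  # hypothetical prediction with two mistakes

print(q3_accuracy(true_ss, pred_ss))
print(class_precision(true_ss, pred_ss, "H"))
print(class_precision(true_ss, pred_ss, "E"))
```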

Hitting 83% on a single model without massive ensembles or pre-trained transformers feels like a massive win for my first PhD project. It shows that interlacing sequence-aware LSTMs with spatial-aware ResCNNs captures the dual biophysical nature of proteins: they are both linear chemical strings and 3D spatial objects.

In my next post, I want to step back and look at Co-Evolution and Direct Coupling Analysis (DCA)—the mathematical concepts that allow us to detect 3D contacts from sequence alignments alone, and write about some wild, unpublished ideas I’ve been sketching in my research notebook.

Next Log Entry: Evolution's Mathematical Whispers: Co-Evolution and the DCA Puzzle.


This is a post in the protein-structure-prediction series.