A gentle overview of the pipeline in Protein Structure Prediction

Disclaimer: the author has a background in Computer Science; the physics, chemistry and biology anecdotes in this article were picked up through on-demand research on the topic, without formal training in the respective subjects. This article aims to describe the pipeline and help readers form an initial intuition about the problem and the approach. The details of the machine learning part are not within the intended scope of this article.

ELI5 Protein Structure Prediction problem

Proteins are basic building blocks of life. They themselves are messy ball-like objects, much like carelessly rolled-up shoelaces. A single shoelace is a string of amino acids. Amino acids are even smaller stuff that are very important to life. Interestingly, groups of amino acids form different shapes under different conditions. They are like living Lego blocks: some combinations of these blocks tend to stick together, like magnets, and almost all of them behave this way depending on whether they can relax themselves in water. We know that proteins can do many different jobs, and we can tell what type of job a protein does by looking at its shape. The shape of a protein is like a work outfit to us humans: we can tell a person is a police officer because that person is wearing a police uniform.

However, there are many, many proteins, so biology and chemistry experts document them using a technique called “sequencing”: writing down each amino acid one by one. Imagine magically straightening a rolled-up shoelace without having to worry about knots. This technique is so fast that it lets us document a massive number of proteins, but at the same time we lose the shape of the protein. As a result, only a small fraction of the known proteins have had their shape mapped out. Finding the rolled-up shape based only on the straightened shoelace is what we call the “protein structure prediction problem”.

How hard can it be? As it turns out, quite complicated :‘(


Above is an illustration of the broken-down steps of how we predict a folded protein: as you can see, getting the final structure, or even just the tertiary structure, right involves a lot of hidden knowledge from physics, chemistry and biology.

Hold your breath: Terminology and Structural Physics of the Protein

Firstly, we will use the abbreviation AA for amino acids. The “magnets” that bind AA are called peptide bonds. We also sometimes say conformation (of an AA sub-sequence or the whole chain) instead of shape.

The bonds bind small sequences of AA; different orderings and different types of AA form different local shapes. We call these local shapes Secondary Structures (SS); they mostly fall into two groups: $\alpha$-helix and $\beta$-sheet. SS also generally have a characteristic range of lengths for their respective type: an $\alpha$-helix is usually 3 to 5 AA long, while a $\beta$-sheet can stretch to around 7 AA on average.
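To make the idea of local SS segments concrete, here is a minimal sketch that extracts the lengths of contiguous helix or sheet runs from a per-residue label string. The 3-state labels (`H` for helix, `E` for strand/sheet, `C` for coil) follow a common DSSP-style reduction; the example string itself is made up.

```python
from itertools import groupby

def segment_lengths(ss_labels, kind):
    """Lengths of contiguous runs of one secondary-structure label.

    ss_labels: per-residue 3-state string, e.g. 'H' = alpha-helix,
    'E' = beta-strand, 'C' = coil (a common DSSP-style reduction).
    """
    return [sum(1 for _ in run)
            for label, run in groupby(ss_labels) if label == kind]

# A made-up label string for illustration:
ss = "CCHHHHHCCEEEEEEECC"
print(segment_lengths(ss, "H"))  # [5]  -> one 5-residue helix
print(segment_lengths(ss, "E"))  # [7]  -> one 7-residue strand
```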

The conformation of “strings” of SS (the Tertiary Structure) is mainly determined by the compounded bond forces of all the AA in the chain. On an even smaller scale, each AA’s central carbon atom ($C_{\alpha}$) serves as a pivot point on which the chain rotates: the positive and negative polar charges of each AA interact (attract and repel) and define a “stable” conformation, depending on a series of rotational dynamics. To make matters even more “interesting”, a sub-sequence with a known conformation does not always end up in that same conformation in another part of the chain, for instance because that part is not interacting with water.
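The rotations around those backbone pivot points are usually expressed as dihedral (torsion) angles: each angle is defined by four consecutive atom positions. As a sketch of the geometry (not of any particular library), here is the standard dihedral computation from four 3D points; for a protein backbone, the well-known $\phi$ and $\psi$ angles are exactly such dihedrals over consecutive backbone atoms.

```python
import numpy as np

def dihedral(p0, p1, p2, p3):
    """Dihedral (torsion) angle in degrees defined by four points.

    For a protein backbone, phi and psi at each residue are dihedrals
    over consecutive backbone atoms (C-N-CA-C and N-CA-C-N).
    """
    b0 = p0 - p1                 # vector from pivot back to p0
    b1 = p2 - p1                 # rotation axis
    b2 = p3 - p2
    b1 = b1 / np.linalg.norm(b1)
    # Components of b0 and b2 perpendicular to the axis b1:
    v = b0 - np.dot(b0, b1) * b1
    w = b2 - np.dot(b2, b1) * b1
    x = np.dot(v, w)
    y = np.dot(np.cross(b1, v), w)
    return np.degrees(np.arctan2(y, x))

# Four points in a zigzag (trans-like) arrangement give 180 degrees:
p = [np.array(a, dtype=float)
     for a in [(0, 0, 0), (1, 0, 0), (1, 1, 0), (2, 1, 0)]]
print(round(dihedral(*p), 1))  # 180.0
```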

To recap, the conformation of a protein matters because the structure reveals the function of the protein. Many people equate proteins to machines, and the analogy is quite fitting: on the molecular level, a factory of proteins breaks up and assembles genes as if they were workers at different stations on a manufacturing line.

In other words, a rigorous physical rule-set describes the mechanics of proteins, so in principle life itself can be understood, should we gain full knowledge of the dynamics and their functional mappings.

Methodology: a tale of two cities

Due to the complexity of the problem, traditional approaches to structure prediction involve comparison against known structures. The assumption is that evolution has done enough experiments for us to say: if a stable, working structure happened before (“happened” as in proven, not extinct), it should happen again elsewhere. As a matter of fact, the comparison method is effective. The main problem, however, is that we have nowhere near enough known templates to solve the rest of the protein sequences.

For sequences with nearly zero matching templates, we have to seek solutions that assume almost nothing and derive from first principles. This is what we (in Computer Science and Computational Biology) call ab initio or de novo methods.

de novo and ab initio are Latin terms, roughly meaning “from scratch”. Because we at least know the physical bond properties and constraints, we should be able to compute based on energy (how likely a conformation is to take hold, i.e. which conformation requires the least energy).
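The “lower energy is more likely” intuition is usually formalized with Boltzmann weighting: a conformation with energy $E$ gets relative weight $e^{-E/kT}$. Here is a toy sketch; the three conformation names and their energy values (in units of $kT$) are entirely made up for illustration.

```python
import math

def boltzmann_probabilities(energies_kt):
    """Relative probability of each conformation: lower energy -> more likely."""
    weights = {name: math.exp(-e) for name, e in energies_kt.items()}
    z = sum(weights.values())          # partition function (normalizer)
    return {name: w / z for name, w in weights.items()}

# Toy energies for three hypothetical conformations of one sub-sequence:
energies = {"helix": -3.0, "sheet": -2.5, "coil": 0.0}
probs = boltzmann_probabilities(energies)
# The lowest-energy conformation ("helix" here) gets the highest probability.
```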

From early hand-crafted, rule-based simulations to more recent machine learning approaches, ab initio methods continue to gain traction. For instance, SS prediction using bi-directional recurrent neural networks can achieve an accuracy well beyond 80%, and successive iterations show no sign of slowing down as the approach closes in on the estimated theoretical limit of about 90% accuracy.
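The accuracy figure quoted for SS prediction is conventionally the per-residue 3-state accuracy (often called Q3): the fraction of residues whose predicted H/E/C label matches the true one. A minimal sketch, with made-up label strings:

```python
def q3_accuracy(pred, true):
    """Fraction of residues whose 3-state SS label (H/E/C) is correct."""
    assert len(pred) == len(true)
    return sum(p == t for p, t in zip(pred, true)) / len(true)

# One residue out of seven is mislabeled -> 6/7 correct:
print(q3_accuracy("HHHECCC", "HHHECCE"))
```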

So what does a pipeline for the ab initio Protein Structure Prediction task look like?

Pipeline for ab initio or de novo Protein Structure Prediction

Machine learning is data-driven: more meaningful data is always better than less. Instead of taking sequences alone as input for training, we enrich them with a series of pre-processing steps. Although we have no templates to look up, we can still gain insight from sequence similarities. In Computational Biology we build a richer protein profile by aligning sequences based on similarity. Protein sequences are coded sequences of AA: there are 20 standard AA, so we codify each one with a letter of the alphabet (the standard one-letter codes). Similarities include, but are not limited to, letter-wise similarity of the sequences, sub-sequence AA substitution tables, and similarity produced by probabilistic models. The main alignment and search tools are commonly PSI-BLAST and HHblits. The resulting richer data-set is then further encoded and primed for training.
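As a concrete picture of the encoding step, here is a minimal one-hot encoding of an AA sequence over the 20 standard one-letter codes. In a real pipeline the one-hot columns would typically be replaced or augmented by profile columns (e.g. a PSSM from PSI-BLAST or an HMM profile from HHblits); the tripeptide below is made up.

```python
import numpy as np

# The 20 standard amino acids and their one-letter codes (note the
# alphabet skips B, J, O, U, X and Z):
AA_ALPHABET = "ACDEFGHIKLMNPQRSTVWY"
AA_INDEX = {aa: i for i, aa in enumerate(AA_ALPHABET)}

def one_hot(sequence):
    """Encode an AA sequence as an (L, 20) one-hot matrix."""
    mat = np.zeros((len(sequence), len(AA_ALPHABET)))
    for i, aa in enumerate(sequence):
        mat[i, AA_INDEX[aa]] = 1.0
    return mat

x = one_hot("MKV")   # a made-up tripeptide
print(x.shape)       # (3, 20): one row per residue, one column per AA type
```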

Mirroring the illustration in the first section, we design a multi-stage machine learning model: first predict SS and other local properties; then use the predicted SS features together with the original profile to predict distance/contact maps of the tertiary structure; and eventually combine all previous information to predict the final folded structure.
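The staged design can be sketched as three chained functions, each consuming the outputs of the previous stages. The function names, tensor shapes, and zero-filled stubs below are purely illustrative placeholders for trained models, not from any real system.

```python
import numpy as np

def predict_secondary_structure(profile):
    """Stage 1: per-residue SS from the sequence profile (stub model)."""
    return np.zeros((profile.shape[0], 3))       # e.g. H/E/C probabilities

def predict_contact_map(profile, ss):
    """Stage 2: residue-residue contact/distance map (stub model)."""
    features = np.concatenate([profile, ss], axis=1)  # profile + SS features
    length = features.shape[0]
    return np.zeros((length, length))            # pairwise contacts

def predict_structure(profile, ss, contacts):
    """Stage 3: fold into 3D coordinates, one (x, y, z) per residue (stub)."""
    return np.zeros((profile.shape[0], 3))

# Wiring the stages together on a dummy (L, 20) profile of length 100:
profile = np.zeros((100, 20))
ss = predict_secondary_structure(profile)
contacts = predict_contact_map(profile, ss)
coords = predict_structure(profile, ss, contacts)
```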

To the readers who made it here and found this text helpful: thank you for your attention. I hope to talk about the details of my approach in a later article soon.