Exact algorithms for the prediction of optimal structures in lattice protein models

Several different approaches have been proposed to model protein structure prediction, which is the problem of predicting the three-dimensional native structure of a protein given its primary structure. Among these, one approach of particular interest to us aims at investigating the prediction of optimal structures using the
simplified lattice model. In this model, each amino acid is represented as a point in an integer lattice and the prediction problem is reduced to that of finding the self-avoiding walk(s) of minimum energy on the lattice. The energy model used here is the HP model, due to Lau and Dill [1], where one considers the hydrophobic effect as the main interaction in the folding process. In this model, each amino acid is encoded as a bead of two different types: Hydrophobic (H) and Polar (or
hydrophilic) (P). Two beads of type H that are topologically first-neighbours and are not connected in the sequence contribute negatively to the energy. All the other combinations contribute zero energy. The protein structure prediction problem on 2D and 3D lattices is NP complete [2,3].

We have developed an innovative algorithm that outperforms the state of the art in exact approaches for 2-D square lattices.

E Giaquinta, L Pozzi: An effective exact algorithm and a new upper bound for the number of contacts in the hydrophobic-polar 2D-lattice model. Accepted for Publication. Journal of Computational Biology. Mary Ann Liebert, Inc. Publ.

The algorithm, called minwalk, is based on a pruned exhaustive search. The devised pruning criteria guarantee optimality, while enabling to skip an exponential number of calls when compared to exhaustive search. The proposed algorithm can find optimal structures for an input size of up to around 40-50. Only exact approaches dealing with size up to 25 have been previously presented in the literature for the 2-D square lattice [4]. A comprehensive survey of recent results can be found here [7].

Stemming from this result, we envision several directions of research that we intend to investigate. These directions exploit the characteristics of the proposed algorithm and are meant to improve and use its capabilities. They include: adding several missing optimisations that have the potential to increase the algorithm efficiency further; extending the algorithm to work on the 3-D cubic lattice and comparing it to the state of the art [5]; studying degeneracy in 2-D, beyond the limit of size 25 that has been treated so far; using the algorithm as a tool to explore the shape of energy landscapes; exploiting the high parallelism potential of the algorithm
so as to write a parallel version that can take advantage of multi-core platforms.

The algorithm explores the search space of self-avoiding walks for a given HP sequence, and its efficiency lies in avoiding exploration of large parts of such space that are guaranted (by deviced optimal pruing criteria) not to contain an optimal solution.

run times 1

Above, the runtime to find the optimal structure(s) given an input sequence of a certain size, is shown.
The red bars correspond to an exhaustive search, without pruning criteria (the complete search space is explored). This method does not scale and it was only run for sequences up to size 19, in this experiment. The blue bars correspond instead to a pruned, still optimal, search. Some simple pruning criteria are implemented, to show the power of pruning. Only a small fraction of the time is needed to find the same optimal solution, and sequences of around 25 beads can be processed optimally in a matter of seconds.

run times 2

As the project goes on, more and more pruning criteria are devised, and performance increases exponentially.
At the moment, sequences of length 40-50 can be processed optimally in a matter of seconds/minutes.

results

The tables above show some results for random sequences of length 45: for each sequence, we write the time needed by minwalk to find the optimal solution(s), the energy of the optimal solutions (number of H-H contacts) and the degeneracy (number of different solutions with optimal energy).

optimal structures for R45-1

This figure depicts Sequence R45-1 from Table 3 above, and all its 8 optimal structures of 24 contacts.

Sequences and structures are depicted thus in the picture: red and blue squares represent H beads.
P beads are simply depicted as a line without a square.
The red and blue colours are used to distinguish (and quickly visualise) H beads of even and odd index (in the 2D square and 3D cubic lattice, due to the parity issue, beads of odd index can only create contact with beads of even index and viceversa).

Several non-optimal approaches have been engineered, in order to deal with very long HP sequences. In particular, a recent contribution [6] presents a non-optimal, well performing heuristics approach, and compares with several previous ones. minwalk can in some cases detect all optimal solutions, for a given sequence, faster than all published non-optimal approaches could.
In particular, for the following benchmark sequence denoted S1-8 in [6] ( H^12 (PH)^2 (P^2 H^2)^2 P^2 H P^2 (H^2 P^2)^2 (HP)^2 H^12 ), of length 64, the best known result is 42 contacts, and state of the art heuristics find such structure in running times of the order of hours to days.

minwalk finds all 39 optimal structures of 42 contacts in a matter of seconds (six of them are shown in the figure below).

S48

An exact algorithm such as minwalk, that is able to return, given an input sequence, all optimal structures for that sequence—and potentially also all structures for energy levels close to optimal—can be an interesting tool for providing insights into the shape of energy landscapes [8].

Exhaustive exploration of the search space of self-avoiding-walks, without pruning, when in search of an optimal conﬁguration, corresponds to searching a needle in a haystack, as Levinthal put it in his argument [9]. In such scenario, all leaves, which correspond to all possible conﬁgurations, might be visited before the optimal conﬁguration is declared so.

Algorithm pruning, instead, can be seen as a parallel to the reason why a protein does ﬁnd its optimal conformation in a reasonable time. Pruning eliminates an exponential number of weak tree-leaves (low-contacts conﬁgurations), and makes the search more to-the-point. In other words, it cuts away a section of the haystack, where the needle certainly is not.
A graphical view of pruning at work is shown below.

pruning

It depicts all recursive calls that minwalk has performed, for a sequence of length 48 that has several optimal conﬁgurations (several optimal leaves).
Each horizontal line in the ﬁgure corresponds to a search-tree level, and each level expands into diﬀerent choices (branches of the search tree). At some speciﬁc points, pruning occurs, and all calls from a speciﬁc subtree are dropped. The several leaves explored (black lines that reach all the way to the bottom of the
tree) correspond to the optimal conﬁgurations. No other leave is explored, and, in most parts of the tree, pruning occurs very early.

If we wrap the pruned tree around, we obtain a funnel, that has some resemblance to an energy landscape. In this ﬁgure, the vertical axis represents the level of the search tree (the level of each partial conﬁguration explored), while in an energy landscape the vertical axis represents energy instead. On the other hand, the pruned partial conﬁgurations (lines that stop high, and do not reach the bottom of the tree) are those that lead to few-contacts (high energy) structures. The pruned, high points represent, therefore, high energy conﬁgurations—the same way as in energy funnels.

However, the high energy (non optimal) conﬁgurations are not enumerated (not searched, not visited) in the tree --- visiting them would correspond to the needle in the haystack again. They are all ’hidden’ behind the process of pruning.

E Giaquinta, L Pozzi: An effective exact algorithm and a new upper bound for the number of contacts in the hydrophobic-polar 2D-lattice model. Accepted for Publication. Journal of Computational Biology. Mary Ann Liebert, Inc. Publ.

[1] Kit Fun Lau and Ken A. Dill. A lattice statistical mechanics model of the conformational and sequence spaces of proteins. Macromolecules, 22(10):3986–3997, 1989.
[2] Pierluigi Crescenzi, Deborah Goldman, Christos H. Papadimitriou, Antonio Piccolboni, and Mihalis Yannakakis. On the complexity of protein folding. Journal of Computational Biology, 5(3):423–466, 1998.
[3] Bonnie Berger and Frank Thomson Leighton. Protein folding in the hydrophobic-hydrophilic (HP) model is NP- complete. Journal of Computational Biology, 5(1):27–40, 1998.
[4] Anders Irback and Carl Troein. Enumerating designing sequences in the HP model. Journal of Biological Physics, 28(1):1–15, January 2002.
[5] Rolf Backofen and Sebastian Will. A constraint-based approach to fast and exact structure prediction in three- dimensional protein models. Journal of Constraints, 11(1):5–30, January 2006.
[6] Alena Shmygelska and Holger Hoos. An ant colony optimisation algorithm for the 2D and 3D hydrophobic polar protein folding problem. BMC Bioinformatics, 6(1):30, 2005.
[7] Sorin Istrail and Fumei Lam, Combinatorial Algorithms for Protein Folding in Lattice Models: A Survey of Mathematical Results. 2009.
[8] Ken A. Dill and Hue Sun Chan. From Levinthal to pathways to funnels. Nature Structural Biology, 4(1):10–19, 1997.
[9] Cyrus Levinthal. Are there pathways for protein folding? Journal de Chimie Physique et de Physico-Chimie Biologique, 65:44–45, 1968