DALEL: Exhaustive search for linear information in proteins

 

Introduction

Linear information encoded in proteins is the primary component of their structure and the modus operandi of the cell signaling and regulation machinery, including recognition and binding, cleavage, degradation, docking, tagging, targeting, folding, scaffolding, translation, and post-translational modification. Linear information is organized, i.e. encoded, in proteins as multiple independent subunits, each responsible for storing and processing particular information; individually or in cooperation, these subunits mediate protein function.

DALEL exhaustively searches the linear information in proteins. First, it enumerates all possible motifs of variable length, including any number and combination of wildcards. Then, it degenerates the motifs to discover conserved and flexible residues, both individual and correlated. DALEL uses a novel parallel and recursive algorithm that divides the exploding space of enumeration and degeneration into much smaller spaces that can be built and searched quickly and in parallel.
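To make the enumeration step concrete, here is a minimal Python sketch of listing every motif of variable length, with any combination of wildcards, that occurs in a single sequence. It is not DALEL's parallel and recursive algorithm; it is a naive illustration of the search space. The function name `enumerate_motifs`, its parameters, the `.` wildcard symbol, and the choice to keep the first and last positions of each motif fixed are all assumptions made for this example.

```python
from itertools import combinations


def enumerate_motifs(sequence, min_len=3, max_len=5, wildcard="."):
    """Return the set of motifs found in `sequence` (hypothetical helper).

    Every window of length min_len..max_len is taken, and every subset of
    its interior positions is replaced by the wildcard. The first and last
    residues stay fixed (an assumption of this sketch) so that no motif
    starts or ends with a wildcard.
    """
    motifs = set()
    for length in range(min_len, max_len + 1):
        for start in range(len(sequence) - length + 1):
            window = sequence[start:start + length]
            interior = range(1, length - 1)
            # Choose 0, 1, ..., all interior positions to turn into wildcards.
            for n_wild in range(len(interior) + 1):
                for positions in combinations(interior, n_wild):
                    chars = list(window)
                    for p in positions:
                        chars[p] = wildcard
                    motifs.add("".join(chars))
    return motifs


if __name__ == "__main__":
    for motif in sorted(enumerate_motifs("PPLP", min_len=3, max_len=4)):
        print(motif)
```

A window of length L yields 2^(L-2) wildcard patterns under these assumptions, which is why the space grows so quickly with motif length and why dividing it into smaller, independently searchable spaces matters.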

DALEL is based on the fundamental biological premise that proteins of interest known to have a common behaviour are enriched with the linear information mediating that behaviour, while other proteins do not exhibit such enrichment. Therefore, the entire space of linear information encoded in the proteins of interest is visited, and each motif is assessed for significance by scoring its enrichment in the proteins of interest relative to the proteome and/or the negative control proteins, using a statistic based on the cumulative hypergeometric distribution.
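The enrichment score described above can be illustrated with a short Python sketch of an upper-tail cumulative hypergeometric test. The exact statistic DALEL computes is not specified here, so this is only one plausible reading: the function names, the argument names, and the use of `.` as a wildcard matched through a regular expression are assumptions for the example.

```python
import re
from math import comb


def count_proteins_with_motif(motif, proteins):
    """Count sequences containing the motif ('.' matches any residue)."""
    pattern = re.compile(motif)
    return sum(1 for seq in proteins if pattern.search(seq))


def hypergeom_enrichment_pvalue(k, n, K, N):
    """Upper-tail hypergeometric probability P(X >= k).

    N: number of proteins in the background (e.g. the proteome)
    K: number of background proteins containing the motif
    n: number of proteins of interest (assumed drawn from the background)
    k: number of proteins of interest containing the motif
    """
    upper = min(n, K)
    # Sum the probability mass for k, k+1, ..., min(n, K) motif-containing draws.
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return tail / comb(N, n)


if __name__ == "__main__":
    # Toy data, invented purely for illustration.
    background = ["APPLPA", "GGGGGG", "KPTMPR", "AAAAAA", "PQQPSS", "TTTTTT"]
    interest = background[:3]
    motif = "P..P"
    k = count_proteins_with_motif(motif, interest)
    K = count_proteins_with_motif(motif, background)
    print(hypergeom_enrichment_pvalue(k, len(interest), K, len(background)))
```

A small p-value means the motif appears in the proteins of interest more often than would be expected if those proteins were a random sample of the background.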

We applied DALEL to explore the linear information encoded in the SH3 domain recognition peptides in the budding yeast Saccharomyces cerevisiae. Using only the linear information, we independently identified the majority of experimentally determined recognition peptides. We also discovered a number of peptides with distinct properties that may serve ancillary roles. The strategy could be applied to any recognition domain to construct both empirical and quantitative models of biochemical networks.

DALEL source code is available here

Citations

Exhaustive search of linear information encoding protein-peptide recognition
Kelil A, Dubreuil B, Levy ED, Michnick SW (2017) PLOS Computational Biology 13(4): e1005499.

Fast and Accurate Discovery of Degenerate Linear Motifs in Protein Sequences
Kelil A, Dubreuil B, Levy ED, Michnick SW (2014) PLOS ONE 9(9): e106081.