P2P - Theory and Documentation

 

 

General

Predicting the folding pattern of proteins from their sequence and prediction of residue-residue contacts within proteins are closely related problems. These fundamental problems are intensively investigated in computational molecular biology and biophysics. This site describes a new structurally derived pair-to-pair substitution matrix (P2P) 1, constructed from a very large amount of integrated high quality multiple sequence alignments (Blocks) and protein structures. Its elements are the likelihoods of all 160000 pair-to-pair substitutions. P2P matrix implicitly accounts for evolutionary conservation, correlated mutations and residue-residue contact potentials. The matrix is useful for such structural prediction tasks. Predicting protein residue-residue contacts from sequence information alone, by our method (P2PConPred) is particularly accurate in the protein cores  (24% for 59 diverse families and 34% for a subset of proteins shorter than 100 residues). The matrix was also proven useful to identify native structures within large sets of protein decoys. Based on evolutionary information alone our method ranks the native structure in the top 0.3% of the decoys in 4/10 of the sets and in 8/10 of sets the native structure is ranked in the top 10% of the decoys.  The method can, thus, be used to assist filtering wrong models, complementing traditional scoring functions. The P2P method can also be used for analysis of protein structures, for comparison of sequences and in fold recognition. This web site presents the concept of P2P matrices and offers on line calculation and analyses of proteins based on P2P matrices.

Derivation of the P2P substitution matrix

The Blocks database 2 is the source of multiple alignments for derivation of the matrix.  Only the blocks with available structure in the PDB for at least one of their proteins (a "representative structure" for this block) where used. For blocks including several crystal structures, the one with the best resolution was taken. Blocks represented only by NMR models were excluded. Each pair of columns in a block was classified as either contacting or not, according to the contact surface area analysis 3 of the block representative structure.

          The pair-to-pair substitution matrix, P2P matrix 1, of 160,000 ((20*20)*(20*20)) elements was constructed to approximate the probability of changes from any pair of amino acids (x,y) to another pair (u,v) in the same two columns of a multiple alignment. Namely, in one sequence residue x is found in position i and residue y in position j, while in another sequence of the same alignment residue u is found in position i and residue v in position j. Each pair-to-pair matrix element for (x↔u, y↔v) was calculated based on two components. The first component was derived from pair-to-pair substitutions in sites i, j having a contact (according to the representative structure of the protein family). This component reflects a contact signal. The second component was based on pair-to-pair substitution of non-contacting sites, reflecting the background noise for the purpose of contact prediction.

Formally, to derive the P2P matrix element M[xy][uv] with the score of pair-to-pair substitutions from amino acids x↔u at one position and y↔v at another position simultaneously, the following equation was applied:

                                        (1)

 and  are the observed and expected frequencies for contacting pairs, respectively; and  are the observed and expected frequencies for non-contacting pairs, respectively. The observed frequencies  of the simultaneous substitutions x↔u, y↔v in two block columns, which have a direct contact, were calculated according to the following equation:

                                                                     (2)

 where  is a weighted sum for the counts of all x↔u, y↔v simultaneous substitutions that was calculated according to:

                         (3)

where b is an index running over all block entries with representative structures, Nb is the number of sequences and Lb is the alignment length of block b, ws and wt are position specific weights  4 given to sequences s and t respectively in the block b, i and j are sequence positions, and si stands for the amino acid type found at position i in sequence s. The δ function equals 1 if the four conditions in parentheses all hold, and 0 otherwise. Less formally, for each pair of sequences s and t within the same block and each pair i and j of sequence positions having contact in the structure, the product of sequence weights ws and wt was added to the count of .  was calculated in the same way but based on pairs of block positions with no direct contact.

The expected frequencies were calculated as follows:

                                                    (4)

where  is a sum over all substitutions .  is simply a weighted count of substitutions x↔u observed in each block column calculated as follows:

                                                     (5)

 where indexes and functions are same as in Equation (3). Equation (5) states that for each pair of sequences s and t within the same block, the product of the sequence weights wsand wt was added to the count of for each position i.  was calculated in an equivalent way to that of . Only positions i and j separated by three amino acids or more along the sequence were used to derive the matrix. The sequence order of long-range interacting residues relative to each other does not influence their substitution rates, and pair substitutions were considered bi-directional. Equivalent pairs (i.e., ) were combined and averaged, to symmetrize the resulting matrix.

Types and versions of  P2P matrices

P2P can be constructed by different ways and according to different sets of dada. The first version was based on Blocks version 13 from 2001. newer versions are based on Blocks version 14. The number of blocks with representative structures is about 3000 in version 13 and about 7500 in version 14. The matrix can be derived based on intra-blocks statistics. This means that the matrix is derived based on pairs of columns within the same block. Alternatively the matrix can be derived based on columns in different blocks in the same protein family. In this case the pair members are usually located in more distal regions on the protein sequence. A matrix constructed by this way should be more appropriate for detection of such contacts. However, it was shown that the two types of matrices give comparable results for contact prediction. Another type of matrix was constructed only based on  positions which the structural data suggest that are buried within the protein matrix (we used accessibility threshold of <0.15). Such matrix might be  more informative, as the constrained environment is more selective for the allowed pair substitutions. The matrix, which is based on Blocks version 13 and constructed based on intra-block positions, is shown in Figure 1.  Different versions of P2P matrices are available here. In addition, visualization and analysis page of P2P matrices  available. This site allows to explore the different P2P matrices and to retrieve values of individual elements (which is not a trivial task in a matrix of 160,000 elements).

Figure 1. P2P matrix based on version 13 of the Blocks database derived based on pairs of columns located in the same blocks ("intra-blocks matrix").

 

Application of P2P for inter-residue contact prediction using P2PConPred

The P2P is currently designed to reflect probabilities of pair to pair substitutions at two positions with physical contact.  The ultimate goal is to detect residue-residue contact solely based on the evolutionary information stored in multiple sequence alignment (MSA). The score given for each two columns is  a weighted sum of the P2P matrix elements corresponding to all amino acids pair combinations in these two columns (Equation 6).

                                                              (6)

where M[sisj][titj] is the P2P matrix element according to the identity of the amino acids in columns i and j in sequences s and t. Pairs with gaps are ignored.

The accuracy of the prediction is defined as the fraction of correct contacts from the L/k best contact predictions, where L is the alignment length and k is a constant. The accuracy of the L/10 best predicted contacts for a test set of 59 families was found to be 0.135, much better than random prediction (0.039).

            Solvent accessibility can assist in the prediction accuracy, since the method is more sensitive in tightly packed and constrained environments, i.e. the protein core. Using programs for predicting residues solvent accessibility based on sequence information alone, such as SABLE  5 the prediction can be restricted to core regions (predicted accessibility of less than 0.15). In this case the mean accuracy is significantly improved to 0.236 (Figure 2).

Example of contact prediction is shown in Figure 3 for the Galactoside-binding lectins family (β SCOP class). For the prediction of L/10 contacts in the core, the accuracy in this example  was 0.69. The lectins family has a sandwich topology with a tight association between a five-stranded and seven-stranded β-sheets. More than half of our best 13 predicted contacts are between residues from the two sheets. These contacts are likely to be important for the protein topology organization.

 

Figure 2. Prediction accuracy by P2PConPred (this work) in different solvent accessibility filters. Compare also the performance to the methods of Gobel 6 and Singer 7.

 

Figure 3. Example of contact predictions for human galectin-7 (PDB ID: 1bkz), a representative structure for the Galactoside-binding lectins. The structure is presented as cartoon and the L/10 top core contact predicted residues are in sticks with inter-residue distances indicated by dashed lines. Predicted pairs of residues that are actually in contact are shown in purple and predicted pairs that are not in contact are shown in green.

 

Application of P2P for discrimination of native folds

 

            P2P matrix can also be used for evaluating protein fold stability based on structure contacts and the scores between the corresponding positions in a multiple sequence alignment. This idea was examined on ten protein structures with proper decoy sets and multiple sequence alignments. The amount of decoys in the sets ranged from 300 to 2000 structures. Each of the decoys in the examined sets has been given a score using Equation 7. Figure 4 demonstrates that the native structure is usually ranked as one of the top structures in the evaluation.  Two out of ten times it is ranked first, in two other cases it is ranked among the top 0.3% of the decoys, and in 4 other cases it is ranked among the top 10%. When there are subsets of decoys which were constructed to be closer to the native as is the case for some of the Rosetta sets 8, these subsets usually get better scores than the rest of the decoys.

                                                                                   (7)

where Cdij is equal to1 if position i,j have a contact surface area in decoy d structure, and are separated by at least 5 position in the sequence, and equal to 0 otherwise.

 

Figure 4. Evaluation of decoy sets using the P2P matrix. The rank of each decoy (in percentage of the total number of decoys) is plotted versus its RMSD to the native structure. The native structure points appear as triangles, and have zero RMSD value. For each protein, the PDB id of the native structure is given along with the source of the decoys. ro – Rosetta  8, du – Decoys Are Us  9, rd – combined set of Rosetta and Decoys Are Us, wa- taken from the work of Wang et al. 10.

 

P2PConPred program

P2PConPred calculates correlations between sites based on a predefined P2P matrix. The input for the program is a multiple sequence alignment in fasta format. The output includes the following information:

1. Correlation between all pairs of sites.
2. Entropy and conservation of each site
3. Standard deviation of the individual correlation scores which compose the overall score between each two positions.

The time complexity is O(N2L2), where N is the number of sequences in the alignment and L is the alignment length. In practice the calculation is completed in less than a minute on a modern processor using a typical MSA.

The output looks like:  
# P2PConPred results
# Thu Oct  5 21:29:15 2006
# alignment file: /gnm/apache/htdocs/p2ptmp/56751160098138/PF00018.seed.fta
# P2P matrix: P2Pmat
# number of sequences: 61
# alignment length: 65
# mean sequence length: 56.459
# longest sequence: 60
# shortest sequence: 55
# gaps ratio: 0.1314
# mean column conservation: 0.34781
   1 S    2 Y   61   61   61   61  2.67  2.60  0.11  0.13   0.01   0.55
   1 S    3 V   61   61   61   61  2.67  1.67  0.11  0.44  -0.22   0.52
   1 S    4 K   61   61   61   61  2.67  1.86  0.11  0.38  -0.03   0.52
   1 S    5 A   61   61   61   61  2.67  0.45  0.11  0.85  -0.30   0.39
The format of this output is:

          1         2         3         4         5         6         7         8         9
0123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890

   1 S    2 Y   61   61   61   61  2.67  2.60  0.11  0.13   0.01   0.55
   1 S    3 V   61   61   61   61  2.67  1.67  0.11  0.44  -0.22   0.52

   ^ ^    ^ ^   ^    ^    ^    ^    ^     ^     ^     ^      ^      ^  
   1 2    3 4   5    6    7    8    9    10    11    12     13     14  

field  1  columns  0  -  3  sequence index of residue i (the first position of the sequence is assigned index 1)
field  2  column   5        amino acid type of the residue i
field  3  columns  7  - 10  sequence index of residue j
field  4  column  12        amino acid type of residue j
field  5  columns 14  - 17  number of sequences having an amino acid (and not a gap) at position i
field  6  columns 19  - 22  number of sequences having an amino acid (and not a gap) at position j
field  7  columns 24  - 27  number of sequences having an amino acid (and not a gap) at both positions i and j
field  8  columns 29  - 32  number of sequences in the alignment
field  9  columns 34  - 38  sequence entropy at position i (based on Shannon's entropy definition)
field 10  columns 40  - 44  sequence entropy at position j
field 11  columns 46  - 50  sequence conservation at position i  (in a scale between 0 and 1 (most conserved))
field 12  columns 52  - 56  sequence conservation in position j  (in a scale between 0 and 1)
field 13  columns 59  - 63  correlation score
field 14  columns 66  - 70  standard deviation of the score with respect to all pairwise scores
The output above can be directed to the standard output or saved in a file. The program also creates an output file with the correlations  in a matrix form without any additional information or text. This format can be directly read by other programs, such as Matlab.

 

Running P2PConPred through the web

The simplest way to use P2PConPred for a limited number of cases is to use the web interface. The calculation is done on-line and typically takes less than a minute (but this depends of course on the number of sequences and their length). The website allows considerable flexibility for the input. The possible inputs may include:

1. A protein sequence. The server performs a blast 11 search against  the uniprot protein database. It collects 70 significant hits  (E < 0.001) if available. The hits selected from the blast results are the least similar hits (as long as they satisfy the criterion above) in order to get sequence variability. These sequences, together with the query sequence, are aligned using Clustalw 12. The output multiple alignment is then fed to P2PConPred.

2. An input multiple alignment file (fasta format). This is the preferred mode of action, since the user has maximum control over the alignment. "Good" input alignment has the following properties:

3. Pfam ID.  The server uses the alignment corresponding to this ID from Pfam database 13. The user can choose if to use the seed alignment (less sequences, representative) or the full alignment (more sequences, redundant and possibly biased).

4. PDB code. The main utility of P2PConPred is to provide scores for potential inter-residue contacts in the three dimensional structure of query sequences, and therefore to run P2PConPred for sequences of PDB structures may sem redundant. This is important, however, for evaluation of the program against known structures and for structural analysis. For each input PDB code, an option is given to construct an alignment based on the sequence of this protein, as detailed in (1), or to use Pfam alignments that contain sequence fragments from the protein.

In addition to the input sequences, the user should also choose the P2P matrix for the analysis. An option to perform prediction of solvent accessibility is also given. Prediction of contacts in the protein core was found to be much more accurate.

The user should supply an email address for which the results will be sent as soon as the computation is completed. This email address will not serve for any other purpose and  is not saved on any place on the server.

The results are returned both by email and online and are saved on the server for a month.


Graphical analysis 

The main advantage of the web server (apart from the ease of usage) is the enhanced graphical environment for analysis of the results. The correlations can be explored as a colored-coded matrix. Blue colors indicate high correlations, white colors intermediate correlations, and red colors indicate low correlations. The matrix elements are also "hot". Dragging the mouse on top of them will reveal the actual correlation value. If solved structures available for the family (as detected by Blast search against the PDB), the correlation can be mapped on any of them inside a Jmol 14 applet. Pressing the mouse on any matrix elements will highlight the corresponding residues inside the Jmol applet. The colors of the residues are exactly the matrix colors and indicate the degree of sequence correlation.

A list of the top correlations appear on the lower left panel. If there is a representative structures available for the family you can also highlight the residues on the structure by checking the boxes on this list.

The server can also performs SABLE5 prediction for solvent accessibilities (upon request in the submission form). The accessibilities appear in the sorted list of the top correlations near the correlation values.

Figure 5. Graphical analysis page for the results from P2PConPred as released by the P2P web server. The bottom left panel lists the most correlated pairs of residues (positions). The matrix on the right show a complete color coded correlation map for the family. If there is a representative structure available for the family it can be viewed using Jmol 14 and the correlations can be mapped on top of the structure. Blue colors indicate high scores and red colors indicate low scores. Green colors indicate positions with many gaps in the multiple alignment, such that a meaningful score can not be calculated.

 

Downloading P2PConPred and running locally

The source code of P2PConPred, written in C++  is available for download. The program was tested on variety of platforms and operation systems including Linux/Sun OS/Silicon Graphics. An executable version for Windows is also available in this distribution. Instructions about the content of the distribution, compilation, running options, inputs and outputs are included.

 

References

  1. Eyal E, M. Frenkel-Morgenstern, V. Sobolev and S. Pietrokovski A pair-to-pair amino acids substitution matrix and its applications for protein structure prediction. Proteins, 2007. 67: p. 142-153.

  2. Henikoff, J. and S. Henikoff, Automated assembly of protein blocks for database searching. Nucl Acids Res, 1991. 19: p. 6565-6572. Henikoff, S., J. Henikoff, and S. Pietrokovski, Blocks+: A non-redundant database of protein alignment blocks dervied from multiple compilations. Bioinformatics, 1999. 15: p. 471-479.

  3. McConkey, BJ, V. Sobolev, and M. Edelman, Quantification of protein surfaces, volumes and atom-atom contacts using a constrained Voronoi procedure. Bioinformatics, 2003. 18: p. 1365-1373.

  4. Henikoff, S. and J. Henikoff, Position-based sequence weights. J Mol Biol, 1994. 243: p. 574-578.

  5. Adamczak, R., A. Porollo, and J. Meller, Accurate prediction of solvent accessibility using neural networks-based regression. Proteins, 2004. 56: p. 753-767.

  6. Gobel, U., et al., Correlated mutations and residue contacts in proteins. Proteins, 1994. 18: p. 309-317.

  7. Singer, M., G. Vriend, and R. Bywater, Prediction of protein residue contacts with a PDB-derived likelihood matrix. Protein Eng, 2002. 15: p. 721-725.

  8. Tsai, J., et al., An improved protein decoy set for testing energy functions for protein structure prediction. Proteins, 2003. 53: p. 76-87.

  9. Samudrala, R. and M. Levitt, Decoys 'R' Us: a database of incorrect conformations to improve protein structure prediction. Protein Sci, 2000. 9: p. 1399-1401.

  10. Wang, Y., et al., Discriminating compact nonnative structures from the native structure of globular proteins. Proc Natl Acad Sci USA, 1995. 92: p. 709-713.

  11. Altschul, SF., et al., Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res, 1997. 25: p. 3389-3402

  12. Thompson, JD., et al., CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position specific gap penalties and weight matrix choice. Nucleic Acids Res, 1994. 22: p. 4673-4680

  13. Bateman, A., et al., The Pfam protein families database. Nucleic Acids Res, 2004. 32: p. D138-D141.

  14. http://jmol.sourceforge.net