ESpritz - Efficient prediction of protein disorder




Methods

This page describes the methods implemented in ESpritz in slightly more detail. It is meant as a quick reference for users interested in the technical details and datasets. For the full details of ESpritz, please read our paper. For a more general description of the approach, input and output, please refer to the Quick Help page instead.

If you have any further questions or suggestions, please contact silvio.tosatto@unipd.it.

Overview
ESpritz is a web server for detecting disordered regions from the primary sequence. It is based on an efficient prediction system that finds regions of protein disorder (unstructured regions). Disordered regions are key to the function of numerous processes within an organism. Experimental annotations remain scarce, the two most common sources of information being DisProt and the Protein Data Bank (PDB). Determining disorder from the amino acid sequence is a difficult problem, but published methods have nonetheless shown promising results, possibly for two reasons:
  • 1. if the amino acid sequence determines structure then unstructured regions can also have specific amino acid propensities,
  • 2. disorder is important in many biological functions and therefore the unstructured protein segments should be conserved by evolution.

  • Prediction system
    In short ESpritz is constructed as follows:
    1. ESpritz is based on a machine learning method, Bi-directional Recursive Neural Networks (BRNN) [1], which requires neither sliding windows nor complex sources of information. The server can process proteins on a genomic scale with little effort and state-of-the-art accuracy. We have shown that BRNNs can extract more information from the protein sequence than static neural networks.

      Sequence information: The only required source of information for the BRNN is the primary amino acid sequence. Multiple sequence alignments can additionally be generated using PSI-BLAST. Although the PSI-BLAST-based input improves performance slightly, ESpritz without PSI-BLAST is much faster and only 0-3 percentage points lower in performance. We envisage the main usage of ESpritz being fast genome-scale processing.

      Learning: Learning proceeds by using the BRNN to extract the relevant information from the local context of the residue under consideration. Training used gradient descent with the backpropagation through structure algorithm [2].
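
To illustrate how a bidirectional recurrent pass can use sequence context without a sliding window, here is a minimal sketch. It is not the ESpritz implementation: the scalar hidden states, the tanh and logistic functions, the one-hot encoding and all weight values are assumptions chosen for brevity, and `one_hot` and `brnn_forward` are hypothetical names.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def one_hot(residue):
    """Encode a residue as a 20-dimensional indicator vector (all zeros if unknown)."""
    return [1.0 if residue == aa else 0.0 for aa in AMINO_ACIDS]


def brnn_forward(sequence, w_in, w_fwd, w_bwd, w_out):
    """Toy bidirectional recurrent pass with scalar hidden states.

    w_in  : 20 input weights shared by both chains (hypothetical values)
    w_fwd : recurrent weight of the left-to-right chain
    w_bwd : recurrent weight of the right-to-left chain
    w_out : (weight on forward state, weight on backward state, bias)
    Returns one value in (0, 1) per residue.
    """
    n = len(sequence)
    # Input contribution of each residue, computed once.
    drive = [sum(w * x for w, x in zip(w_in, one_hot(r))) for r in sequence]

    # Forward chain: the state at position i summarises residues 0..i.
    fwd, state = [0.0] * n, 0.0
    for i in range(n):
        state = math.tanh(drive[i] + w_fwd * state)
        fwd[i] = state

    # Backward chain: the state at position i summarises residues i..n-1.
    bwd, state = [0.0] * n, 0.0
    for i in range(n - 1, -1, -1):
        state = math.tanh(drive[i] + w_bwd * state)
        bwd[i] = state

    # Output: logistic function of both context summaries.
    a, b, bias = w_out
    return [1.0 / (1.0 + math.exp(-(a * fwd[i] + b * bwd[i] + bias)))
            for i in range(n)]


# Hypothetical weights: alanine pushes the score up, lysine pushes it down.
weights = [0.0] * 20
weights[AMINO_ACIDS.index("A")] = 0.5
weights[AMINO_ACIDS.index("K")] = -0.5
scores = brnn_forward("AAK", weights, 0.8, 0.8, (1.0, 1.0, 0.0))
```

In this sketch the two alanines in `"AAK"` receive different scores because their left and right contexts differ. With both recurrent weights set to zero, each score depends on the residue alone.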

      Datasets: There are three types of disorder data. We believe publicly available data is important for the replication of experiments, and therefore you can link to the appropriate data from here:


      (A)
      X-Ray train (download training set) : For proteins in the PDB, we defined as disordered those residues whose backbone C-alpha atom has no coordinates. To generate the training set we downloaded the list of protein chains deposited in the PDB until 1st of May 2008, restricted to X-ray protein chains of length between 25 and 2,000 amino acids, with resolution at most 2.5 Å and R-factor up to 25%. We reduced redundancy by sequence identity using UniqueProt [4] to an HSSP value of 0, using the -m option to give priority to proteins with better quality. The resulting lists were merged and redundancy reduced in a similar manner, giving priority to proteins with disordered regions.

      X-Ray test (download testing set) : To create a test set the same procedure was repeated, with proteins released by the PDB between May 1st 2008 and September 13th 2010. In order to ensure that the test proteins do not share significant homology with the training data, they were combined with the training set and redundancy reduced using UniqueProt with the same options.
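
The X-ray quality filter and the missing C-alpha definition of disorder can be sketched as follows. This is an illustration under stated assumptions, not the pipeline used to build the datasets: the `Chain` record, its field names and the `D`/`O` label alphabet are hypothetical, and parsing of PDB files is left out.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Chain:
    """Hypothetical record for one X-ray protein chain."""
    sequence: str            # full deposited sequence
    ca_observed: List[bool]  # per residue: True if the C-alpha has coordinates
    resolution: float        # in Angstrom
    r_factor: float          # as a fraction, so 0.25 means 25%


def passes_quality_filter(chain):
    """Length 25 to 2,000, resolution at most 2.5 Angstrom, R-factor up to 25%."""
    return (25 <= len(chain.sequence) <= 2000
            and chain.resolution <= 2.5
            and chain.r_factor <= 0.25)


def xray_disorder_labels(chain):
    """Label residues without C-alpha coordinates 'D' (disordered), others 'O'."""
    if len(chain.ca_observed) != len(chain.sequence):
        raise ValueError("one observation flag is needed per residue")
    return "".join("O" if seen else "D" for seen in chain.ca_observed)


# Example: a 30-residue chain whose first three residues are unresolved.
example = Chain("A" * 30, [False] * 3 + [True] * 27, 1.8, 0.20)
labels = xray_disorder_labels(example)
```

Redundancy reduction with UniqueProt is a separate step and is not modelled here.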

      (B)
      DisProt train (download training set) : DisProt [5] is a manually curated database of partially or completely disordered proteins. Here we define a residue to be disordered if the DisProt curators consider the residue to be disordered at least once. All other residues are considered ordered.

      DisProt test (download testing set) : This set is based on DisProt release 5.7. Each DisProt entry was matched to one or more PDB entries by taking the UniProt accession code present in the DisProt record and linking it to the PDB through the SIFTS database [6]. The data is therefore a combination of information from the PDB and DisProt. However, DisProt takes preference over all types of disorder.
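
The DisProt training definition ("disordered at least once") amounts to taking the union of all annotated regions. A minimal sketch, assuming regions are given as 1-based inclusive (start, end) pairs and using a hypothetical `D`/`O` label alphabet; it does not model the combination with PDB data used for the test set:

```python
def disprot_labels(length, regions):
    """Mark a residue 'D' if any annotated region covers it, otherwise 'O'.

    length  : number of residues in the protein
    regions : iterable of (start, end) pairs, 1-based and inclusive (assumed)
    """
    labels = ["O"] * length
    for start, end in regions:
        # Clip each region to the sequence so stray coordinates cannot overflow.
        for pos in range(max(start, 1), min(end, length) + 1):
            labels[pos - 1] = "D"
    return "".join(labels)


# Two overlapping annotations covering residues 2 to 6 of a 10-residue protein.
merged = disprot_labels(10, [(2, 4), (3, 6)])
```

Overlapping annotations collapse into a single disordered stretch, so a residue annotated by several entries is still counted once.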

      (C)
      NMR train (download training set) : NMR mobility/flexibility is calculated using the Mobi server. Mobi is based on a simple algorithm that finds regions with different conformations among all the models in an NMR ensemble, optimized to replicate the ordered-disordered NMR definition used in CASP8. The extraction and redundancy reduction is identical to the X-ray data collection (see above), except that PDB NMR structures (with no quality filter) are considered.

      NMR test (download testing set) : The extraction and redundancy reduction is identical to the X-ray data collection (see above), except that PDB NMR structures (with no quality filter) are considered.


      The Calpha and DisProt versions of the models in ESpritz are also among the systems used in our more accurate but much slower predictor cspritz.


    References
  • (1) G. Pollastri, D. Przybylski, B. Rost, P. Baldi
    Improving the prediction of protein secondary structure in three and eight classes using recurrent neural networks and profiles
    Proteins: Structure, Function, and Bioinformatics, 47(2), 228-235. (2002)

  • (2) A. Sperduti and A. Starita
    Supervised neural networks for the classification of structures
    Neural Networks, IEEE Transactions on, 8(3) 714-735. (1997)

  • (3) M. Sickmeier, J.A. Hamilton, T. LeGall, V. Vacic, M.S. Cortese, A. Tantos, B. Szabo, P. Tompa, J. Chen, V.N. Uversky, Z. Obradovic, A.K. Dunker.
    DisProt: the Database of Disordered Proteins
    Nucleic Acids Res. Jan;35(Database issue):D786-93. Epub 2006 Dec 1. (2006)

  • (4) S. Mika and B. Rost
    UniqueProt: Creating representative protein sequence sets
    Nucleic Acids Res, 31, 3789-3791. (2003)

  • (5) M. Sickmeier et al.
    DisProt: the Database of Disordered Proteins
    Nucleic Acids Res, 35, D786-793. (2007)

  • (6) S. Velankar et al.
    E-MSD: an integrated data resource for bioinformatics.
    Nucleic Acids Res, 33, D262-265. (2005)