ESpritz - Efficient detection of protein disorder



Quick Help and References

Description
ESpritz is a web server that detects regions of proteins which are thought to contain no structural content. This lack of structure is known as protein disorder. ESpritz uses an efficient and accurate prediction algorithm based on Bi-directional Recursive Neural Networks (BRNNs). BRNNs are a sequence-based machine learning algorithm which has been found to be useful for other structural predictions [1,2]. The method is based solely on the sequence and does not use expensive computations to find multiple sequence alignments.
ESpritz also calculates statistics about all the proteins submitted and about each protein individually. For a slightly more detailed technical description of the capabilities and technology used, see the methods page. For the most detailed description see our paper.

Input (Single mode)
E-Mail address
This is not a requirement. However, if you tick the "E-mail notification of results" box, make sure your e-mail address is typed correctly. If no e-mail address is supplied, do not close the window while waiting for the results unless you have bookmarked the page.

Sequence Name
An optional title for your submission. This will appear in the header of the output. We suggest you provide one.

Sequence
All sequences submitted must be in fasta format. Invalid amino acid types, spaces and return characters should be ignored by the server. However, we suggest that you keep the input in the following format:
>P1
MIETPYYLIDKAKLTRNMERIAHVREKSGAKALLALKCFATWSVFDLMRD
YMDGTTSSSLFEVRLGRERFGKETHAYSVAYGDNEIDEVVSHADKIIFNS
ISQLERFADKAAGIARGLRLNPQVSSSSFDL
>P2
MTFSKELREASRPIIDDIYNDGFIQDLLAGKLSNQAVRQYLRADASYLK
EFTNIYAMLIPKMSSMEDVKFLVEQIEFMLEGEVEAHEVLADFINE
>P3
MENKKMNLLLFSGDYDKALASLIIANAAREMEIEVTIFCAFWGLLLLRD
PEKASQEDKSLYEQAFSSLTPREAEELPLSKMNLGGIGKKMLLEMMKEE
KAPKLSDLLSGARKK
>P4
MKKVFYPLACCCLAAGVFASCGGQKKANAQEEPSKVALSYSKSLKAPET
DSLNLPVDENGYITIFD
where P1, P2, P3 and P4 are the sequence identifiers. It is vital that all identifiers are unique. For an example of a multiple fasta file see here, and for an example of a single fasta file see here.
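A submission can be checked locally before it is pasted or uploaded. The following is a minimal sketch, not the server's own parser: the function names `read_fasta` and `duplicate_ids` are hypothetical, and the identifier is assumed to be the first whitespace-delimited token after `>`.

```python
def read_fasta(text):
    """Parse fasta text into a list of (identifier, sequence) pairs."""
    records = []
    identifier, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if identifier is not None:
                records.append((identifier, "".join(chunks)))
            header = line[1:].split()
            identifier = header[0] if header else ""
            chunks = []
        elif identifier is not None:
            chunks.append(line)  # sequence lines may be wrapped
    if identifier is not None:
        records.append((identifier, "".join(chunks)))
    return records


def duplicate_ids(records):
    """Return identifiers that occur more than once, in order of first repeat."""
    seen, dupes = set(), []
    for identifier, _ in records:
        if identifier in seen and identifier not in dupes:
            dupes.append(identifier)
        seen.add(identifier)
    return dupes


example = """>P4
MKKVFYPLACCCLAAGVFASCGGQKKANAQEEPSKVALSYSKSLKAPET
DSLNLPVDENGYITIFD
>P4
MKKVFYPLACCCLAAGVFASCGGQKKANAQEEPSKVALSYSKSLKAPET
"""
records = read_fasta(example)
print("proteins:", len(records))
print("duplicate identifiers:", duplicate_ids(records))
```

An empty list from `duplicate_ids` means every identifier in the file is unique.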

Once your sequence(s) are in fasta format there are two input types:

  • Paste: Simply copy the fasta sequences (using the "copy and paste" functions available on all operating systems) and paste them into the box provided. Pasting fewer than 3000 proteins is recommended because of memory restrictions.
  • Upload: Uploading a file with multiple fasta sequences is another possibility. This has no memory restrictions since the data is stored on disk rather than in memory. In theory any number of proteins can be submitted. For example, the human proteome can be predicted by uploading the fasta file (note: this file was downloaded from NCBI GenBank and contains 39151 proteins).



  • Options
    ESpritz is trained on three definitions of disorder:
    • Short x-ray: This definition is based on atoms missing from X-ray structures in the Protein Data Bank (PDB). If this option is chosen, the short disorder predictors are executed.
    • Longer disprot: The dataset used for this definition contains longer disorder segments than the x-ray dataset. It was built from DisProt, a manually curated database whose annotations are often based on functional attributes of the disordered region. A residue is defined as disordered if the DisProt curators consider it disordered at least once; all other residues are considered ordered. If this option is chosen, the long disorder predictors are executed.
    • NMR mobility: NMR flexibility is calculated using the Mobi server, optimized to replicate the ordered/disordered NMR definition used in CASP8.

    X-ray, disprot and NMR disorder types can be chosen using the leftmost drop-down menu in the options menu.

    ESpritz produces a probability of disorder for each residue. Although our method produces true Bayesian probabilities, the data used to learn the disordered regions was imbalanced, so using 0.5 as the decision threshold is not recommended. Instead, an internal probability threshold was chosen on each predictor's training set. The server hides these values from the user and offers two types of threshold, both defined with respect to the training dataset:

    • Best Sw: The definition of Sw can be found in [3]. This option chooses the threshold which maximised the Sw measure on the training set. This threshold tends to over-predict disorder but is nonetheless useful.
    • 5% False Positive Rate (FPR): This is a stricter definition of disorder compared to best Sw. It is the threshold which produces a 5% false positive rate on the training data.
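The two threshold choices can be illustrated on toy data. This is a sketch under stated assumptions, not the procedure used to train ESpritz: Sw is taken to be sensitivity + specificity - 1 (see [3] for the authoritative definition), a residue is called disordered when its probability is at or above the threshold, and the probabilities, the candidate grid and all function names are invented for illustration.

```python
def confusion(labels, probs, threshold):
    """Count (tp, fn, tn, fp), calling a residue disordered when prob >= threshold."""
    tp = fn = tn = fp = 0
    for is_disordered, prob in zip(labels, probs):
        predicted_disordered = prob >= threshold
        if is_disordered:
            if predicted_disordered:
                tp += 1
            else:
                fn += 1
        else:
            if predicted_disordered:
                fp += 1
            else:
                tn += 1
    return tp, fn, tn, fp


def sw(labels, probs, threshold):
    """Assumed form of Sw: sensitivity + specificity - 1."""
    tp, fn, tn, fp = confusion(labels, probs, threshold)
    return tp / (tp + fn) + tn / (tn + fp) - 1.0


def best_sw_threshold(labels, probs, candidates):
    """Candidate threshold with the highest Sw on the given data."""
    return max(candidates, key=lambda t: sw(labels, probs, t))


def fpr_threshold(labels, probs, candidates, max_fpr=0.05):
    """Lowest candidate threshold whose false positive rate is at most max_fpr."""
    for t in sorted(candidates):
        tp, fn, tn, fp = confusion(labels, probs, t)
        if fp / (fp + tn) <= max_fpr:
            return t
    return None


# Invented, imbalanced toy "training set": 20 ordered and 5 disordered residues.
ordered_probs = [0.01, 0.02, 0.03, 0.04, 0.04, 0.06, 0.06, 0.07, 0.08, 0.08,
                 0.09, 0.11, 0.11, 0.12, 0.12, 0.13, 0.16, 0.18, 0.22, 0.36]
disordered_probs = [0.09, 0.21, 0.31, 0.46, 0.81]
labels = [0] * len(ordered_probs) + [1] * len(disordered_probs)
probs = ordered_probs + disordered_probs
candidates = [i / 20 for i in range(1, 20)]  # 0.05, 0.10, ..., 0.95

print("best Sw threshold:", best_sw_threshold(labels, probs, candidates))
print("5% FPR threshold:", fpr_threshold(labels, probs, candidates))
```

On this toy set both thresholds fall well below 0.5, and the 5% FPR threshold is the higher (stricter) of the two, which mirrors the behaviour described above.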


    Output (Global stats)
    The first page the user sees is a global statistics and download page.
    Statistics on all proteins
    The global statistics are in the following form:
    Total amino acids 	: ℕ
    Total proteins 	: ℕ
    Total % disorder 	: ℝ
    % of proteins with at least 1 disordered region greater than 30 amino acids 	: ℝ
    % of proteins with at least 1 disordered region greater than 50 amino acids 	: ℝ
    Total no. of disordered regions greater than 30 amino acids 	: ℕ
    Total no. of disordered regions greater than 50 amino acids 	: ℕ
    Number of disordered segments 	: ℕ
    Mean segment length (standard deviation) 	: ℝ (ℝ) 
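The fields above can be recomputed from the D/O strings found in the downloadable fasta files. The sketch below is an illustration only: the function and key names are hypothetical, a disordered segment is assumed to be a maximal run of D symbols, "greater than 30" is read as strictly more than 30 residues, and the population standard deviation is an assumption since the text does not say which form the server reports.

```python
import re
from statistics import mean, pstdev


def disordered_segments(states):
    """Lengths of maximal runs of 'D' in a D/O string, in N- to C-terminal order."""
    return [len(match.group()) for match in re.finditer(r"D+", states)]


def global_stats(predictions):
    """Summarise a non-empty dict mapping protein identifier -> D/O string."""
    all_lengths = []
    total_residues = total_disordered = 0
    proteins_gt30 = proteins_gt50 = 0
    for states in predictions.values():
        lengths = disordered_segments(states)
        all_lengths.extend(lengths)
        total_residues += len(states)
        total_disordered += states.count("D")
        proteins_gt30 += any(n > 30 for n in lengths)
        proteins_gt50 += any(n > 50 for n in lengths)
    n_proteins = len(predictions)
    return {
        "total_amino_acids": total_residues,
        "total_proteins": n_proteins,
        "percent_disorder": 100.0 * total_disordered / total_residues,
        "percent_proteins_region_gt30": 100.0 * proteins_gt30 / n_proteins,
        "percent_proteins_region_gt50": 100.0 * proteins_gt50 / n_proteins,
        "regions_gt30": sum(n > 30 for n in all_lengths),
        "regions_gt50": sum(n > 50 for n in all_lengths),
        "segments": len(all_lengths),
        "mean_segment_length": mean(all_lengths) if all_lengths else 0.0,
        # Population standard deviation: an assumption, see the note above.
        "sd_segment_length": pstdev(all_lengths) if all_lengths else 0.0,
    }


# Invented predictions for two proteins of 100 residues each.
toy = {"P1": "D" * 35 + "O" * 60 + "D" * 5, "P2": "O" * 100}
for name, value in global_stats(toy).items():
    print(name, value)
```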
    
    Residues are split into 3 groups: charged, uncharged and hydrophobic. The occurrences and disorder percentages are reported for the 3 groups.
    Percentage occurrence and percentage disorder for charged, uncharged and hydrophobic residues
    In the example above 29.87% of the residues in all the proteins submitted are charged, and 4.97% of these charged residues are disordered.

    In addition, for completeness this table is expanded to include the percentage occurrence of each of the 20 amino acids and its percentage disorder.

    Available files for download
    Available for download is a Disorder.tar.gz archive containing:
    • A directory for each protein containing:
      • The newly annotated fasta sequences (.fasta extension). The example below should be self-explanatory:
          >sample.fasta
          MSGGGDVVCTGWLRKSPPEKKLRRYAWKKRWFILRSGRMSGDPDVLEYYKNDHSKKPLRIINLNFCEQVDAGLTFNKKELQDSFVFDIKTSERTFYLVAETEED
          >sample.fasta_disorder
          DDDDDDDDOOOOOOOOODDDOOOOOOOOOOOOOOOOODDDDOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOODDDDDDDOOOOOOOOOOOOOOOOOODDDDDD
          >sample.fasta_disorder_confidence_label
          33322111889999888111888899999999999881111889999999988888999999999999999881111111889999999999999999111222
          	

          The confidence label depends on the probability of disorder or order. For example, a label of "2" on a predicted disordered residue means the probability of that residue being disordered was in the range [0.2,0.3).

      • Disorder statistics file (.stats extension). See later for individual statistics.
      • The disorder prediction (.espritz extension). The lines of this file follow the order of the amino acids in your fasta sequence. Each line contains (1) the disorder/order symbol (D/O) and (2) the probability of disorder (e.g. "D 0.462056" means the amino acid is predicted disordered with probability 0.462056).
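The .espritz lines and the confidence labels described above can be read with a few lines of Python. This is a sketch built on assumptions: fields are taken to be whitespace-separated, lines that do not start with D or O are skipped, and the label for an ordered residue is assumed to be derived from the probability of order (1 minus the probability of disorder), which is inferred from the example sequences rather than stated by the server. The names `parse_espritz` and `confidence_label` are hypothetical.

```python
def parse_espritz(text):
    """Parse lines such as 'D 0.462056' into (state, probability of disorder) pairs."""
    residues = []
    for line in text.splitlines():
        fields = line.split()
        # Skip blank lines and anything that is not a D/O prediction line.
        if len(fields) < 2 or fields[0] not in ("D", "O"):
            continue
        residues.append((fields[0], float(fields[1])))
    return residues


def confidence_label(state, prob_disorder):
    """Single-digit label: the first decimal of the probability of the predicted state."""
    # Assumption: ordered residues are labelled by the probability of order.
    prob = prob_disorder if state == "D" else 1.0 - prob_disorder
    return min(int(prob * 10), 9)


# Invented four-residue prediction file.
example = """D 0.462056
D 0.25
O 0.081
O 0.03
"""
residues = parse_espritz(example)
states = "".join(state for state, _ in residues)
labels = "".join(str(confidence_label(state, prob)) for state, prob in residues)
print(states)
print(labels)
```

Under these assumptions a disordered residue with probability 0.25 receives the label 2, matching the [0.2,0.3) range given above.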
    Distribution of disordered segments: A histogram of disordered segments is available for download. An example graph is provided below. The x-axis shows the disordered segments ranked by length. The y-axis shows the length of each disordered segment.
    Distribution of disordered segments
    Distribution of amino acids: Also available to download is the distribution of amino acids in any state and in a disordered state. An example graph is provided below. The x-axis shows the amino acid under investigation. The red bars show the percentage occurrence of each amino acid. The blue bars show the percentage of each red bar that is disordered.
    Distribution of amino acids in any and disordered states
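The two bar heights can be reproduced from an amino acid sequence and its D/O string. The following is a rough sketch with a hypothetical function name, assuming the blue bar is the share of each amino acid's own occurrences that are predicted disordered.

```python
from collections import Counter


def residue_distribution(sequence, states):
    """Map amino acid -> (percent occurrence, percent of those occurrences disordered)."""
    if len(sequence) != len(states):
        raise ValueError("sequence and D/O string must have the same length")
    total = Counter(sequence)
    disordered = Counter(aa for aa, state in zip(sequence, states) if state == "D")
    n = len(sequence)
    return {
        aa: (100.0 * count / n, 100.0 * disordered[aa] / count)
        for aa, count in sorted(total.items())
    }


# Invented six-residue example.
distribution = residue_distribution("MSGGKK", "DDDOOD")
for aa, (occurrence, disorder) in distribution.items():
    print(f"{aa}: {occurrence:.2f}% of residues, {disorder:.2f}% of them disordered")
```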
    Links to individual protein pages
    Finally, links to the individual proteins are available (see next section). An example list is shown below.
    We feel it is important to rank this list, since it may be very large. Rankings are available by % disorder, number of segments, number of segments greater than 30 and number of segments greater than 50. Each page of the list contains 10 proteins; the next 10 can be accessed by clicking the next arrow button.
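The same ranking and paging can be reproduced offline from per-protein statistics. This is only a sketch: the dictionary keys and function names are hypothetical, and descending order is an assumption.

```python
def rank_proteins(rows, key):
    """Sort per-protein statistics by the chosen column, largest value first (assumed)."""
    return sorted(rows, key=lambda row: row[key], reverse=True)


def page(rows, number, size=10):
    """Return page `number` (1-based) of a ranked list, `size` proteins per page."""
    start = (number - 1) * size
    return rows[start:start + size]


# Invented statistics for 12 proteins.
proteins = [
    {"id": f"P{i}", "percent_disorder": float(i % 7), "segments": i}
    for i in range(1, 13)
]
ranked = rank_proteins(proteins, "segments")
print([row["id"] for row in page(ranked, 1)])
print([row["id"] for row in page(ranked, 2)])
```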



    Output (individual protein)
    The output page presented for each protein gives a quick overview of the protein's disorder and statistics. In addition, secondary structure and linear motifs are presented for each disorder segment.

    Available files and links
  • Input parameters: A link to a summary of the input parameters the user supplied.
  • Fasta sequence: A link to 4 fasta sequences: (i) amino acid, (ii) secondary structure, (iii) disorder and (iv) disorder confidence label. The disorder confidence label gives a measure of the probability of disorder produced by our machine learning algorithm. For example, a confidence of 3 means the probability of disorder lies in the range [0.3,0.4).
  • Disorder Prediction (with disorder probability). This is the disorder prediction file (with .espritz extension) described above.
  • Protein statistics: A downloadable statistics file. The file contains:
    Total amino acids: 
    Total % disorder: 
    Total no. of disordered regions greater than 30 amino acids: 
    Total no. of disordered regions greater than 50 amino acids: 
    Number of disordered segments: 
    Length distribution of segments (N to C terminal order): 
    
  • Disordered residues and stats:
    This panel consists of two pieces of information:
    1. Amino acid sequence and the final disorder decision on the left. Residues are labelled every 10 residues, for example:
                    110
      XXXXXXXXXXXXXXYXXXXXXXXXXXXX

       means residue Y is at position 110 in the sequence.

    2. The statistics for the individual protein (this is identical to the individual downloadable statistics file above).


    Examples

    Here we present three examples: a single pasted fasta sequence, all the CASP9 targets and the human proteome downloaded from ftp://ftp.ncbi.nlm.nih.gov/genbank/. This section should also serve as a tutorial on how to use ESpritz.

    Example 1   -    Human p53. UniProt ID: P04637, 393 residues, calculated with the short option. Input: input page for human p53.

    Multiple protein example 2   -    Entire CASP9 disorder targets. 117 proteins. Download the CASP9 fasta sequences, click browse on the input page and upload the file. This example shows how to process multiple sequences.

    Genome example 4   -    Entire proteome of Homo sapiens. 39151 proteins. Download the human fasta sequences, click browse on the input page and upload the file. This took approx. 6.5 hrs on the current infrastructure.

    References

    Please cite:

    I. Walsh, A. J. M. Martin, T. Di Domenico, S. C. E. Tosatto
    ESpritz: Accurate and fast prediction of protein disorder. Bioinformatics 28 (4): 503-509   (2012)

  • (1) G. Pollastri and A. McLysaght
    Porter: a new, accurate server for protein secondary structure prediction
    Bioinformatics 21 (8): 1719-1720   (2004)

  • (2) G. Pollastri, A. J. M. Martin, C. Mooney, A. Vullo
    Accurate prediction of protein secondary structure and solvent accessibility by consensus combiners of sequence and structure information
    BMC Bioinformatics 8: 201   (2007)

  • (3) O. Noivirt-Brik, J. Prilusky, J. L. Sussman
    Assessment of disorder predictions in CASP8
    Proteins 77 (Suppl 9): 210-216   (2009)