ALIGN: Pairwise Alignment

Quick Help and References

ALIGN Pairwise Alignment is a tool for performing sequence alignments in a wide variety of combinations. It implements sequence to sequence, sequence to profile, and profile to profile alignments, with optional support for secondary structure. The alignment options are freely selectable and include the alignment algorithm (local, global, free-shift) and the number of sub-optimal results to report.

E-Mail address

Required. The address is used to notify you of the results by e-mail (notification is selected by default). Please check that it is typed correctly.

Name of alignment

An optional title for your submission. This will appear in the header of the output. We suggest you provide one, especially if you send multiple queries, as they may be completed in a different order from the one in which they were submitted.

Sequence inputs

The sequences must be provided in FASTA format, using either the text area or the file upload option. If both fields are filled for the same sequence, the text area takes precedence. An internal check verifies that the format is correct.

An example of a valid FASTA format follows:

>t111
SKIVKIIGREIIDSRGNPTVEAEVHLEGGFVGMAAAPSGASTGSREALELRDGDKSRFLGKGVTKAVAAV
PIAQALIGKDAKDQAGIDKIMIDLDGTENKSKFGANAILAVSLANAKAAAAAKGMPLYEHIAELNGTPGK
YSMPVPMMNIINGGEHADNNVDIQEFMIQPVGAKTVKEAIRMGSEVFHHLAKVLKAKGMNTAVGDEGGYA
PNLGSNAEALAVIAEAVKAAGYELGKDITLAMDCAASEFYKDGKYVLAGEGNKAFTSEEFTHFLEELTKQ
YPIVSIEDGLDESDWDGFAYQTKVLGDKIQLVGDDLFVTNTKILKEGIEKGIANSI
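
The format of a sequence input can also be checked locally before submission. The following Python sketch is only an illustration: the function names parse_fasta and check_single_sequence and the allowed alphabet are assumptions made here, and the check performed by the server may be different.

```python
# Assumed alphabet: the 20 standard amino acids plus X (unknown residue).
ALLOWED = set("ACDEFGHIKLMNPQRSTVWYX")


def parse_fasta(text):
    """Parse FASTA text into a list of (header, sequence) tuples."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:].strip(), []
        elif header is None:
            raise ValueError("sequence data found before a '>' header line")
        else:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def check_single_sequence(text):
    """Return the upper-case sequence if text holds exactly one valid record."""
    records = parse_fasta(text)
    if len(records) != 1:
        raise ValueError(f"expected 1 record, found {len(records)}")
    header, seq = records[0]
    if not header:
        raise ValueError("the header line has no name")
    if not seq:
        raise ValueError("the record has no sequence")
    bad = sorted(set(seq.upper()) - ALLOWED)
    if bad:
        raise ValueError(f"unexpected characters: {bad}")
    return seq.upper()
```

Sequence lines are joined in order, so a record split over several lines, as in the example above, is read as one sequence.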

Alignment options

Alignment type: Select either the simple sequence to sequence method or one of the sequence to profile and profile to profile methods, which require a multiple alignment to generate the profile used in the pairwise alignment of the two given sequences. In the sequence to profile method, a multiple alignment for the second sequence input must be either provided by the user in multiple FASTA format or calculated automatically by the server. In the profile to profile method, a multiple sequence alignment is required for both sequence inputs, again either provided by the user or generated automatically by the server. Default is sequence to sequence.

Alignment algorithm: The alignment is performed using a local method, a global method, or the free-shift method (also known as "glocal"). Default is local.
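
To illustrate how the three algorithms differ, the following is a minimal Python sketch, not the server's implementation. It uses a toy scoring scheme (the hypothetical parameters match, mismatch and a single linear gap penalty) in place of a substitution matrix with separate gap open and extension costs. It also treats free-shift as an alignment in which leading and trailing gaps are not penalized, which is an assumption about the server's definition.

```python
def align_score(a, b, mode="local", match=2, mismatch=-1, gap=-2):
    """Best alignment score of a and b for mode 'global', 'local' or 'freeshift'."""
    if mode not in ("global", "local", "freeshift"):
        raise ValueError(f"unknown mode: {mode}")
    n, m = len(a), len(b)
    # Only global alignment pays for gaps at the start of the alignment.
    lead = gap if mode == "global" else 0
    H = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        H[i][0] = i * lead
    for j in range(1, m + 1):
        H[0][j] = j * lead
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(H[i - 1][j - 1] + s,   # align a[i-1] with b[j-1]
                          H[i - 1][j] + gap,     # gap in b
                          H[i][j - 1] + gap)     # gap in a
            if mode == "local":
                # A local alignment may restart anywhere at score 0.
                H[i][j] = max(H[i][j], 0)
    if mode == "global":
        return H[n][m]                           # both sequences end to end
    if mode == "local":
        return max(max(row) for row in H)        # best cell anywhere
    # Free-shift: trailing gaps are free, so take the best cell
    # in the last row or the last column.
    return max(max(H[n]), max(row[m] for row in H))


# A short sequence contained in a longer one:
# local and free-shift both score 8, global scores 0.
for mode in ("local", "freeshift", "global"):
    print(mode, align_score("GNPT", "SRGNPTVE", mode))
```

With these toy scores, the global method is penalized for the four unmatched residues of the longer sequence, while the local and free-shift methods are not.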

Sub-optimal alignments: Number of sub-optimal alignments to report. The range of alignments to show is 1 to 999. Default is 1.

Matrix: Substitution matrix used to perform the alignment. Default is BLOSUM62.

Gap open: Cost to open a gap. Range is 0-999. Default is 10.

Gap extension: Cost to extend a gap. Range is 0-999. Default is 2.
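
The two gap parameters combine into a single cost for a gap of a given length. Conventions differ on whether the gap open cost already covers the first gapped position, and this page does not state which one the server uses, so the Python sketch below (with the hypothetical helper gap_cost) shows both as an illustration only.

```python
def gap_cost(length, gap_open=10, gap_extend=2, open_covers_first=True):
    """Cost of a single gap spanning `length` positions.

    open_covers_first=True : cost = open + extend * (length - 1)
    open_covers_first=False: cost = open + extend * length
    Which convention the server applies is not documented here.
    """
    if length < 0:
        raise ValueError("gap length cannot be negative")
    if length == 0:
        return 0
    extensions = length - 1 if open_covers_first else length
    return gap_open + gap_extend * extensions


# With the default values (open 10, extension 2), a gap of length 3 costs:
print(gap_cost(3))                           # 10 + 2 * 2 = 14
print(gap_cost(3, open_covers_first=False))  # 10 + 2 * 3 = 16
```

Under either convention, one long gap is cheaper than several short gaps of the same total length, because the open cost is paid only once.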

If the secondary structure option is checked, the pairwise alignment is performed using the secondary structure information of both sequences given. The secondary structures can be either provided by the user or predicted automatically by the server using PSIPRED.

Multiple alignment

The multiple alignment for profile construction must be in FASTA format if provided by the user, using either the text area or the file upload option. If both fields are filled for the same multiple alignment, the text area takes precedence. An internal check verifies that the format is correct. If no multiple alignment is provided, the server automatically generates the profile using PSI-BLAST, whose options may be set by the user. A multiple alignment provided by the user must contain the corresponding sequence, given in the first input page, as its first entry. Check the sequence link.

NB: The multiple alignment is used to construct the profile exactly as provided by the user, i.e. the gaps (hyphens) must already be placed in the alignment.

A multiple FASTA format example follows:

>t111
SKIVKIIGREIIDSRGNPTVEAEVHLEGGFVGMAAAPSGASTGSREALELRDGDKSRFLGKGVTKAVAAVNG
PIAQALIGK--DAKDQAGIDKIMIDLDGTENKSKFGANAILAVSLANAKAAAAAKGMPLYEHIAELNGTPGK
YSMPVPMMNIINGGEHADNNVDIQEFMIQPVGAKTVKEAIRMGSEVFHHLAKVLKAKG--MNTAVGDEGGYA
PNLGSNAEALAVIAEAVKAAGYELGKDITLAMDCAASEFYK-DGKYVLAG-----EGNKAFTSEEFTHFLEE
LTKQYPIVSIEDGLDESDWDGFAYQTKVLGDKIQLVGDDLFVTNTKILKEGIEKGIANSILIKFNQIGSLTE
TLAAIKMAKDAGYTAVISHRSGETEDATIADLAVGTAAGQIKTGSMSRSDRVAKYNQLIRIEEALGEKAPYN
GRKEIKGQA--- 
>1pdy
-SITKVFARTIFDSRGNPTVEVDLYTSK-GLFRAAVPSGASTGVHEALEMRDGDKSKYHGKSVFNAVKNVND
VIVPEIIKSGLKVTQQKECDEFMCKLDGTENKSSLGANAILGVSLAICKAGAAELGIPLYRHIANLAN-YDE
VILPVPAFNVINGGSHAGNKLAMQEFMILPTGATSFTEAMRMGTEVYHHLKAVIKARFGLDATAVGDEGGFA
PNILNNKDALDLIQEAIKKAGYTG--KIEIGMDVAASEFYKQNNIYDLDFKTANNDGSQKISGDQLRDMYME
FCKDFPIVSIEDPFDQDDWETWSKMTSGTT--IQIVGDDLTVTNPKRITTAVEKKACKCLLLKVNQIGSVTE
SIDAHLLAKKNGWGTMVSHRSGETEDCFIADLVVGLCTGQIKTGAPCRSERLAKYNQILRIEEELGSGAKFA
GKNFRAPS----
>1bcy
-SITKVFARTIFDSRGNPTVEVDLYTSK-GLFRAAVPSGASTGVHEALEMRDGDKSKYHGKSVFNAVKNVND
VIVPEIIKSGLKVTQQKECDEFMCKLDGTENKSSLGANAILGVSLAICKAGAAELGIPLYRHIANLAN-YDE
PNLGSNAEALAVIAEAVKAAGYELGKDITLAMDCAASEFYK-DGKYVLAG-----EGNKAFTSEEFTHFLEE
PNILNNKDALDLIQEAIKKAGYTG--KIEIGMDVAASEFYKQNNIYDLDFKTANNDGSQKISGDQLRDMYME
FCKDFPIVSIEDPFDQDDWETWSKMTSGTT--IQIVGDDLTVTNPKRITTAVEKKACKCLLLKVNQIGSVTE
SIDAHLLAKKNGWGTMVSHRSGETEDCFIADLVVGLCTGQIKTGAPCRSERLAKYNQILRIEEELGSGAKFA
GKNFRAPS----

.....................
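
A user-provided multiple alignment can be checked against the requirements above before submission. The Python sketch below is an illustration with hypothetical function names (read_alignment, check_alignment). It assumes that all rows of a valid alignment have the same length and that the first row, with hyphens removed, must equal the sequence given in the first input page. The server's own check may differ.

```python
def read_alignment(text):
    """Return [(name, aligned_row), ...] from multiple FASTA text."""
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            rows.append([line[1:].strip(), ""])
        elif line and rows:
            rows[-1][1] += line        # join the lines of one aligned row
    return [(name, row) for name, row in rows]


def check_alignment(text, query):
    """Validate a multiple alignment against the query sequence."""
    rows = read_alignment(text)
    if not rows:
        raise ValueError("no sequences found")
    # Assumption: every row of an alignment spans the same number of columns.
    lengths = {len(row) for _, row in rows}
    if len(lengths) != 1:
        raise ValueError(f"rows differ in length: {sorted(lengths)}")
    # The first row, without gaps, must be the query sequence itself.
    first = rows[0][1].replace("-", "")
    if first.upper() != query.upper():
        raise ValueError("first row does not match the query sequence")
    return rows


msa = ">query\nAC-DE\n>hit1\nACWDE\n>hit2\n-CWD-\n"
print(len(check_alignment(msa, "ACDE")))  # 3 rows accepted
```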

Secondary structure

The secondary structure must be in FASTA format if provided by the user, using either the text area or the file upload option. If both fields are filled for the same secondary structure, the text area takes precedence. An internal check verifies that the format is correct.

NB: The secondary structure, if provided by the user, must derive from the corresponding sequence given in the first input page. Check the sequence link.

A FASTA format example of a secondary structure follows (allowed characters: C = 'coil', E = 'extended - beta strand' and H = 'helix'):

>t111
CCEEEEEEEECCCCCCCCCCCCCCCCCCCCCHHHHHHHCCCCCCCCEEEEEEEEEECCCCCC
HHHHHHHHHHHHHHCCCCCEEEEECCCCCHHHHHHHHHHHCCCCCCEEEECCCCCCCCCCCC
HHHHHHHHHHHHCCCCCCCCCCEEEECCEEEECCCEEECCCCCCCCEEECCCCCCEEEECCE
EEECCCCCCCCCCCCCCCCCCCCCCCCEEEEECCCCCCCHHHHHHHHCCCCEEEEEECCCCC
CCHHHHHHHHHHHHCCCEEEEEECCCCCCCCCCCCCCHHHHCCEECCCCCHHHHHHHHHHHC
CCCCCHHHHHHHHCCC
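
A secondary structure string can be checked locally in the same spirit. The Python sketch below uses a hypothetical function name, check_secondary_structure. The allowed characters come from the description above, while the requirement of one state per residue is an assumption about what "derive from the corresponding sequence" means.

```python
ALLOWED_STATES = set("CEH")  # coil, extended (beta strand), helix


def check_secondary_structure(ss, sequence):
    """Return the cleaned state string, or raise ValueError if it is invalid."""
    # Remove line breaks and spaces so that multi-line input is accepted.
    ss = "".join(ss.split()).upper()
    bad = sorted(set(ss) - ALLOWED_STATES)
    if bad:
        raise ValueError(f"unexpected characters: {bad}")
    # Assumption: exactly one state per residue of the sequence.
    if len(ss) != len(sequence):
        raise ValueError(
            f"{len(ss)} states given for {len(sequence)} residues")
    return ss


print(check_secondary_structure("CCEEH\nHHC", "SKIVKIIG"))  # CCEEHHHC
```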

PSI-BLAST options

Database: The available protein databases are nr (NCBI non-redundant; updated weekly) and nr90 and nr60, which are updated monthly and derived from nr by clustering at 90% and 60% sequence identity, respectively, using CD-HIT.

Enable filtering sequence with SEG: Filter the query sequence with SEG. Default is disabled.

Expectation value (E): A real number. Default is 10.0.

PSI-BLAST rounds: Maximum number of passes to use in the multipass version. Default is 4.

E-value threshold for inclusion in multipass model: A real number. Default is 0.002.

Composition-based statistics: Use composition-based statistics. The options are:
       Disabled
       Composition-based statistics as in NAR 29:2994-3005, 2001 (default)
       Composition-based score adjustment as in Bioinformatics 21:902-911, 2005, conditioned on sequence properties in round 1
       Composition-based score adjustment as in Bioinformatics 21:902-911, 2005, unconditionally in round 1

Matrix: Substitution matrix to use. Default is BLOSUM62.

Gap costs: Gap opening ('Existence') and gap extension ('Extension') costs. The acceptable values depend on the substitution matrix selected. Default is 'Existence = 11; Extension = 1' for BLOSUM62.

Profile filtering

A filter based on percentage sequence identity to the master sequence is applied to the sequences of the resulting profile. The identity range is set by the two bounds below.

Upper bound: Maximum (master vs. all) sequence identity threshold. Default is 100.

Lower bound: Minimum (master vs. all) sequence identity threshold. Default is 0.

NB: 'Upper bound' must be greater than 'lower bound'.
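
The effect of the two bounds can be sketched in Python as follows. This is only an illustration under stated assumptions, not the server's code. It assumes that identity is computed over the columns where both the master and the other sequence have a residue, and that sequences whose identity lies within the bounds are kept, so that the defaults of 0 and 100 keep everything. The function names percent_identity and filter_profile are hypothetical.

```python
def percent_identity(master, other):
    """Percent identity over columns where both rows have a residue."""
    pairs = [(x, y) for x, y in zip(master, other) if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    same = sum(1 for x, y in pairs if x == y)
    return 100.0 * same / len(pairs)


def filter_profile(rows, lower=0.0, upper=100.0):
    """Keep the master (first row) and the rows within [lower, upper]."""
    if not lower < upper:
        raise ValueError("upper bound must be greater than lower bound")
    master = rows[0]
    kept = [master]  # assumption: the master sequence is always retained
    for row in rows[1:]:
        if lower <= percent_identity(master, row) <= upper:
            kept.append(row)
    return kept


rows = ["ACDE", "ACDE", "ACWW", "WWWW"]  # identities to master: 100, 50, 0
print(filter_profile(rows))              # defaults keep all four rows
print(filter_profile(rows, 10, 90))      # ['ACDE', 'ACWW']
```

Lowering the upper bound in this way removes near-duplicates of the master sequence, while raising the lower bound removes distant sequences.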

Output

The output is a simple multiple FASTA format file showing all of the sub-optimal alignments. The alternatives are ranked according to their raw score. The alignment may be viewed and edited using Jalview.

Final Example

If you would like to test the alignment server, try using the following example. As input sequences use:

Sequence 1:

>t111
SKIVKIIGREIIDSRGNPTVEAEVHLEGGFVGMAAAPSGASTGSREALELRDGDKSRFLGKGVTKAVAAVNG
PIAQALIGKDAKDQAGIDKIMIDLDGTENKSKFGANAILAVSLANAKAAAAAKGMPLYEHIAELNGTPGK
YSMPVPMMNIINGGEHADNNVDIQEFMIQPVGAKTVKEAIRMGSEVFHHLAKVLKAKGMNTAVGDEGGYA
PNLGSNAEALAVIAEAVKAAGYELGKDITLAMDCAASEFYKDGKYVLAGEGNKAFTSEEFTHFLEE
LTKQYPIVSIEDGLDESDWDGFAYQTKVLGDKIQLVGDDLFVTNTKILKEGIEKGIANSILIKFNQIGSLTE
TLAAIKMAKDAGYTAVISHRSGETEDATIADLAVGTAAGQIKTGSMSRSDRVAKYNQLIRIEEALGEKAPYN
GRKEIKGQA

Sequence 2:

>1pdy
SITKVFARTIFDSRGNPTVEVDLYTSKGLFRAAVPSGASTGVHEALEMRDGDKSKYHGKSVFNAVKNVND
VIVPEIIKSGLKVTQQKECDEFMCKLDGTENKSSLGANAILGVSLAICKAGAAELGIPLYRHIANLANYDE
VILPVPAFNVINGGSHAGNKLAMQEFMILPTGATSFTEAMRMGTEVYHHLKAVIKARFGLDATAVGDEGGFA
PNILNNKDALDLIQEAIKKAGYTGKIEIGMDVAASEFYKQNNIYDLDFKTANNDGSQKISGDQLRDMYME
FCKDFPIVSIEDPFDQDDWETWSKMTSGTTIQIVGDDLTVTNPKRITTAVEKKACKCLLLKVNQIGSVTE
SIDAHLLAKKNGWGTMVSHRSGETEDCFIADLVVGLCTGQIKTGAPCRSERLAKYNQILRIEEELGSGAKFA
GKNFRAPS

Options:

Sequence to sequence alignment,
local algorithm,
3 suboptimal alignments,
BLOSUM62 matrix,
gap open 10,
gap extension 2,
no secondary structure.

References

ALIGN Pairwise alignment:
Silvio C.E. Tosatto, Alessandro Albiero, Alessandra Mantovan, Carlo Ferrari, Eckart Bindewald, Stefano Toppo.
Align: A C++ class library and web server for rapid sequence alignment prototyping.
Current Drug Discovery Technologies 3(3):167-173, 2006.
Jalview:
Michele Clamp, James Cuff, Stephen M. Searle, Geoffrey Barton
The Jalview Java alignment editor.
Bioinformatics 20(3):426-427, 2004.
Secondary structure prediction (PSIPRED):
David T. Jones
Protein secondary structure prediction based on position-specific scoring matrices.
Journal of Molecular Biology 292(2):195-202, 1999.
Type of Alignments:
Durbin, R., Eddy, S.R., Krogh, A., and Mitchison, G.
Biological sequence analysis.
Cambridge University Press, 1998.
PSI-BLAST for profile generation:
Altschul, S.F., Madden, T.L., Schäffer, A.A., Zhang, J., Zhang, Z., Miller, W. & Lipman, D.J.
Gapped BLAST and PSI-BLAST: a new generation of protein database search programs.
Nucleic Acids Res. 25:3389-3402, 1997.
CD-HIT:
Li W, Jaroszewski L, Godzik A.
Clustering of highly homologous sequences to reduce the size of large protein databases.
Bioinformatics 17(3):282-3, 2001.