Difference between revisions of "Introduction"

From victor
Jump to: navigation, search
Line 51: Line 51:
The simplest possible C++ code fragment to generate a global alignment is:
The simplest possible C++ code fragment to generate a global alignment is:
<syntaxhighlight line lang="cpp">
Blosum sub(matrixFile);
Blosum sub(matrixFile);
SequenceData ad(2, seq1, seq2);
SequenceData ad(2, seq1, seq2);
ScoringS2S sc(&sub, &ad);
ScoringS2S sc(&sub, &ad);
NWAlign nwAlign(&sc, &ad, gapPenalty, gapExtension);
NWAlign nwAlign(&sc, &ad, gapPenalty, gapExtension);

Revision as of 15:40, 19 August 2014

The Victor2.0 library (Virtual Construction Toolkit for Proteins) is an open-source project dedicated to providing a C++ implementation of tools for analyzing and manipulating protein structures. Victor is composed of four main modules:

  • Biopool - BIOPolymer Object Oriented Library. Generates the protein object and provides useful methods to manipulate the structure.
  • Align - ALIGNment generation and analysis.
  • Energy - A library to calculate statistical potentials from protein structures.
  • Lobo - LOop Build-up and Optimization. Ab intio prediction of missing loop conformation in protein models.


The Biopool class implementation follows the composite design pattern and for a complete description of the class hierarchy we recommend to see the Doxygen documentation.

Without going into implementation details a Protein object is just a container for vectors representing chains. Each vector has 2 elements: the Spacer and the Ligand Set. The Spacer is the container for AminoAcid objects whereas the LigandSet is a container for all other molecules and ions, including DNA/RNA chains. Ultimately all molecules, both in the Spacer and in the LigandSet are collections of Atom objects. The main feature in Biopool is that each AminoAcid object in the Spacer is connected to its neighbours by means of one rotational vector plus one translational vector.


This implementation make easy the modification of the protein structure and lot of functions were implemented to modify/perturbate/transformate the residue relative position in an efficient way, rotation and Translation vectors.

For more detail on how to use it look the Biopool features section.


The package comes with several options. The necessary data files (e.g. substitution matrices) are provided. The most important feature of the package is the modular object oriented design, which should allow a moderately experienced C++ programmer to rapidly implement and test new features for sequence alignment. Inside this package, you can use, different weighting schemes, scoring functions, ways to penalize gaps, and typologies of structural information. The Align library was designed to be modular and easy to expand. There are four basic components which are needed to use the alignment methods.

The four main components are:

  • Blosum The substitution matrix.
  • AlignmentData Stores information on sequence ("SequenceData") and, when needed, secondary structure ("SecSequenceData").
  • ScoringScheme Stores information on how a single position shall be scored in the alignment,e.g. sequence-to-sequence ("ScoringS2S"), profile-to-sequence ("ScoringP2S") or profile-to-profile ("ScoringP2P") scoring. It requires both "AlignmentData" and "Blosum" objects to be initialized.
  • Align The alignment algorithm. This can be either local (Smith-Waterman, "SWAlign"), global (Needleman-Wunsch, "NWAlign") or glocal/overlap (Free-Shift, "FSAlign"). It requires both "AlignmentData" and "ScoringScheme" objects.

If P2S or P2P scoring is used, the class "Profile" stores the necessary information to generate the profile from a multiple sequence alignment.

Two advanced options, which may be useful in certain circumstances, are supported by the software:

  1. ReverseScoring This allows the estimation of a staistical significance of the raw alignment score by testing it against an ensemble of alignments based on the reversed sequence in the form of a Z-score.
  1. Suboptimal alignments Rather than generating a single solution, the user may decide on a number of different, alternative, suboptimal alignments to be generated.

The simplest possible C++ code fragment to generate a global alignment is:

  1. Blosum sub(matrixFile);
  2. SequenceData ad(2, seq1, seq2);
  3. ScoringS2S sc(&sub, &ad);
  4. NWAlign nwAlign(&sc, &ad, gapPenalty, gapExtension);
  5. </syntaxhughlight>
  8. Supposing you have already found a template candidate, you need to align it against your target sequence. 
  9. The '''subali''' application let you choose from very different type of algorithms, strategies and parameters.
  11. Strategy:
  13. * Sequence to sequence
  14. * Profile to sequence
  15. * Profile to profile
  18. Alignment algorithm:
  20. * Local
  21. * Global
  22. * Freeshift
  25. Blosum substitution matrix:
  27. * 62
  28. * 45
  29. * 50
  30. * 80
  33. Weighting scheme (only for profiles):
  35. * PSIC (Sunyaev et al., 1999)
  36. * Henikoff (Henikoff & Henikoff, 1994)
  37. * SecDivergence (Rychlewski et al., 2000)
  40. Scoring function (only for profile to profile alignments):
  42. * Sum of pairs
  43. ** CrossProduct 
  44. ** LogAverage (Ohsen and Zimmer, 2003; von Ohsen et al., 2003)
  45. * Dot product 
  46. ** DotPFreq (Wang and Dunbrack, 2004)
  47. ** DotPOdds (Wang and Dunbrack, 2004)
  48. * EDistance (Euclidean distance) 
  49. * Pearson (Pietrokovski, 1996)
  50. * JensenShannon (Yona and Levitt, 2002)
  51. * Atchley metric
  52. ** AtchleyDistance (Atchley et al., 2005)
  53. ** AtchleyCorrelation (Atchley et al., 2005)
  55. ==Energy==
  57. '''Energy''' functions are used in a variety of roles in '''protein modelling'''. An energy function precise enough to always discriminate the native protein structure from all possible decoys would not only simplify the protein structure prediction problem considerably. It would also increase our understanding of the '''protein folding process''' itself.
  58. If feasible, one would like to use quantum mechanical models, being the most detailed representation, to calculate the energy of a protein. It can theoretically be done by solving the '''Schrödinger''' equation. This equation can be solved exactly for the hydrogen atom, but is no longer trivial for three or more particles. In recent years it has become possible to approximately solve the Schrödinger equation for systems up to hundred atoms with the '''Hartree-Fock''' or self-consistent field approximations. Their main idea is that the many-body interactions are reduced to several two-body interactions.
  60. Energy functions are important to all aspects of protein structure prediction, as they give a '''measure of confidence for optimization'''. An ideal energy function would also explain the process of protein folding. The most detailed way to calculate energies are '''quantum mechanical''' methods. These are, to date, still overly time consuming and impractical. Two alternative classes of functions have been developed: '''force fields''' and '''knowledge-based potentials'''.
  62. Force fields (e.g. AMBER) are empirical models approximating the energy of a protein with '''bonded and non-bonded interactions''', attempting to
  63. describe all contributions to the total energy. They tend to be very detailed and are prone to yield many erroneous local minima.
  64. An alternative are knowledge-based potentials (e.g. [78]), where the “energy” is derived from the probability of a structure being similar to interaction patterns found in the database of known structures. This approach is very popular for '''fold recognition''', as it produces a smoother “global” energy surface, allowing the detection of a general trend. Abstraction levels for knowledge-based potentials vary greatly, and several functional forms have been proposed.
  67. The '''energy functions''' presented in the package allow to optimize procedures. The main feature is its applicability in the context of the '''protein''' classes implemented in the package. It should be possible to invoke the energy calculation with any structure from all programs. At the same time the parameters of the energy models had to be stored externally to allow their rapid modification. With this considerations in mind, the package Energy was designed to collect the classes and programs dealing with energy calculation. The main design decision was to use the “strategy” design pattern from Gamma et al. The abstract class Potential was defined to provide a common interface for energy calculation. It contains the necessary methods to load the energy parameters during initialization of an object. Computing the energy value for objects of the '''Atom''' and '''Spacer''' classes as well as a combination of both is allowed.
  69. For more detail on how to use energy look [[Energy]]
  75. ==Lobo==
  77. Current database methods using solely experimentally determined loop fragments do not cover all possible '''loop conformations''', especially for longer fragments. On the other hand it is not feasible to use a combinatorial search of all possible '''torsion angle''' combinations.
  78. For an '''algorithm''' to be efficient, a compromise has to be found. One improvement in '''ab initio''' loop modelling is the use of '''look-up tables'''(LUT) to avoid the repetitive calculation of loop fragments. '''LUTs''' can be generated once and stored, only requiring loading during loop modelling. Using a set of LUTs reduces the computational time significantly.
  79. The next problem is how to best explore the '''conformational space'''. Especially for longer loops, it is useful to generate a set of different candidate loops to exclude improbable ones by ranking. The method should therefore be able to select different loops by '''global exploration''' of the conformational space independently of starting conditions. Methods building the loop stepwise from one anchor residue to the other bias the solutions depending on choices made in conformation of the first few residues. Rather a '''global approach''' to the '''optimization''' is required.
  80. This criterion is fulfilled by the '''divide & conquer algorithm''', which is recursively described by the following steps:
  81.  1. if start = end, compute result;
  82.  2. else use algorithm for:
  83.     (a) start to end/2
  84.     (b) end/2 to end
  85.  3. combine the partial solutions into the full result.
  87. Applied to loop modelling, the basic idea of a divide & conquer approach is to divide the loop into two segments of half the original length choosing a good central position, as shown:
  89. [[File:Divide&conquer.png|center|thumb|200px|caption]]
  91. The segments can be '''recursively divided and transformed''', until the problem is small enough to be solved analytically (conquered). The positions of '''main-chain atoms''' for segments of a single amino acid can be calculated analytically, using the vector representation. Longer loop segments can be stored in '''LUTs''' and their coordinates extracted by geometrically transforming the '''coordinates''' for single amino acids back into the context of the initial problem. To this end we need to define an unambiguous way to represent the conformation of any given residue along the chain and a set of operations to concatenate and decompose loop segments.
  93. For more detail on how to use Lobo look [[Lobo]]