Difference between revisions of "Introduction"
m (→Biopolymer Object Oriented Library (Biopool))
m (→Biopolymer Object Oriented Library (Biopool))
|Line 18:||Line 18:|
Revision as of 15:01, 19 August 2014
The Victor2.0 library (Virtual Construction Toolkit for Proteins) is an open-source project dedicated to providing a C++ implementation of tools for analyzing and manipulating protein structures. Victor is composed of four main modules:
- Biopool (BIOPolymer Object Oriented Library) - The core library that generates the protein object and provides useful methods to manipulate the structure.
- Align - ALIGNment generation and analysis.
- Energy - A library to calculate statistical potentials from protein structures.
- Lobo (LOop Build-up and Optimization) - Ab-intio prediction of missing loop conformation in protein models.
Biopolymer Object Oriented Library (Biopool)
The Biopool class implementation follows the composite design pattern and for a complete description of the class hierarchy we recommend to see the Doxygen documentation.
Without going into implementation details a Protein object is just a container for vectors representing chains. Each vector has 2 elements: the Spacer and the Ligand Set. The Spacer is the container for AminoAcid objects whereas the LigandSet is a container for all other molecules and ions, including DNA/RNA chains. Ultimately all molecules, both in the Spacer and in the LigandSet are collections of Atom objects. The main feature in Biopool is that each AminoAcid object in the Spacer is connected to its neighbours by means of one rotational vector plus one translational vector.
This implementation make easy the modification of the protein structure and lot of functions were implemented to modify/perturbate/transformate the residue relative position in an efficient way, rotation and Translation vectors.
For more detail on how to use energy look Biopool
ALIGNment generation and analysis (Align)
A C++ library for the generation of diverse alignments techniques of protein sequences and their analysis.
The package comes in the form of C++ source code with several options that can be compiled and used. The necessary data files (e.g. substitution matrices) are provided. The most important feature of the package is the modular object oriented design, which should allow a moderately experienced C++ programmer to rapidly implement and test new features for sequence alignment.
Inside this package, you can use, different weighting schemes, scoring functions, ways to penalize gaps, typologies of structural information.
The Align library was designed to be modular and easy to expand. There are four basic components which are needed to use the alignment methods.
The four main components are:
- Blosum The substitution matrix
- AlignmentData Stores information on sequence ("SequenceData") and, where needed, secondary structure ("SecSequenceData")
- ScoringScheme Stores information on how a single position shall be scored in the alignment,e.g. sequence-to-sequence ("ScoringS2S"),profile-to-sequence ("ScoringP2S") or profile-to-profile ("ScoringP2P") scoring, etc.Requires both an "AlignmentData" and a "Blosum" object.
- Align The alignment algorithm. This can be either local (Smith-Waterman, "SWAlign"),global (Needleman-Wunsch, "NWAlign") or glocal/overlap (Free-Shift, "FSAlign"). Requires both an "AlignmentData" and a "ScoringScheme" object.
If P2S or P2P scoring is used, the class "Profile" stores the necessary information to generate the profile from a multiple sequence alignment.
Two advanced options, which may be useful in certain circumstances, are supported by the software:
1) ReverseScoring This allows the estimation of a staistical significance of the raw alignment score by testing it against an ensemble of alignments based on the reversed sequence in the form of a Z-score.
2) Suboptimal alignments Rather than generating a single solution, the user may decide on a number of different, alternative, suboptimal alignments to be generated.
The simplest possible C++ code fragment to generate a global alignment is:
Blosum sub(matrixFile); SequenceData ad(2, seq1, seq2); ScoringS2S sc(&sub, &ad); NWAlign nwAlign(&sc, &ad, gapPenalty, gapExtension);
Energy functions implementation
Energy functions are used in a variety of roles in protein modelling. An energy function precise enough to always discriminate the native protein structure from all possible decoys would not only simplify the protein structure prediction problem considerably. It would also increase our understanding of the protein folding process itself. If feasible, one would like to use quantum mechanical models, being the most detailed representation, to calculate the energy of a protein. It can theoretically be done by solving the Schrödinger equation. This equation can be solved exactly for the hydrogen atom, but is no longer trivial for three or more particles. In recent years it has become possible to approximately solve the Schrödinger equation for systems up to hundred atoms with the Hartree-Fock or self-consistent field approximations. Their main idea is that the many-body interactions are reduced to several two-body interactions.
Energy functions are important to all aspects of protein structure prediction, as they give a measure of confidence for optimization. An ideal energy function would also explain the process of protein folding. The most detailed way to calculate energies are quantum mechanical methods. These are, to date, still overly time consuming and impractical. Two alternative classes of functions have been developed: force fields and knowledge-based potentials.
Force fields (e.g. AMBER) are empirical models approximating the energy of a protein with bonded and non-bonded interactions, attempting to describe all contributions to the total energy. They tend to be very detailed and are prone to yield many erroneous local minima. An alternative are knowledge-based potentials (e.g. ), where the “energy” is derived from the probability of a structure being similar to interaction patterns found in the database of known structures. This approach is very popular for fold recognition, as it produces a smoother “global” energy surface, allowing the detection of a general trend. Abstraction levels for knowledge-based potentials vary greatly, and several functional forms have been proposed.
The energy functions presented in the package allow to optimize procedures. The main feature is its applicability in the context of the protein classes implemented in the package. It should be possible to invoke the energy calculation with any structure from all programs. At the same time the parameters of the energy models had to be stored externally to allow their rapid modification. With this considerations in mind, the package Energy was designed to collect the classes and programs dealing with energy calculation. The main design decision was to use the “strategy” design pattern from Gamma et al. The abstract class Potential was defined to provide a common interface for energy calculation. It contains the necessary methods to load the energy parameters during initialization of an object. Computing the energy value for objects of the Atom and Spacer classes as well as a combination of both is allowed.
For more detail on how to use energy look Energy
LOop Build-up and Optimization (Lobo)
Current database methods using solely experimentally determined loop fragments do not cover all possible loop conformations, especially for longer fragments. On the other hand it is not feasible to use a combinatorial search of all possible torsion angle combinations. For an algorithm to be efficient, a compromise has to be found. One improvement in ab initio loop modelling is the use of look-up tables(LUT) to avoid the repetitive calculation of loop fragments. LUTs can be generated once and stored, only requiring loading during loop modelling. Using a set of LUTs reduces the computational time significantly. The next problem is how to best explore the conformational space. Especially for longer loops, it is useful to generate a set of different candidate loops to exclude improbable ones by ranking. The method should therefore be able to select different loops by global exploration of the conformational space independently of starting conditions. Methods building the loop stepwise from one anchor residue to the other bias the solutions depending on choices made in conformation of the first few residues. Rather a global approach to the optimization is required. This criterion is fulfilled by the divide & conquer algorithm, which is recursively described by the following steps:
1. if start = end, compute result; 2. else use algorithm for: (a) start to end/2 (b) end/2 to end 3. combine the partial solutions into the full result.
Applied to loop modelling, the basic idea of a divide & conquer approach is to divide the loop into two segments of half the original length choosing a good central position, as shown:
The segments can be recursively divided and transformed, until the problem is small enough to be solved analytically (conquered). The positions of main-chain atoms for segments of a single amino acid can be calculated analytically, using the vector representation. Longer loop segments can be stored in LUTs and their coordinates extracted by geometrically transforming the coordinates for single amino acids back into the context of the initial problem. To this end we need to define an unambiguous way to represent the conformation of any given residue along the chain and a set of operations to concatenate and decompose loop segments.
For more detail on how to use Lobo look Lobo