Exercise 1

1. Online bioinformatics resources

UniProt database

UniProt is a comprehensive, high-quality and freely accessible database of protein sequence and functional information, many entries being derived from genome sequencing projects. It contains a large amount of information about the biological function of proteins derived from the research literature. UniProt Knowledgebase (UniProtKB) is a protein database partially curated by experts, consisting of two sections: UniProtKB/Swiss-Prot (containing reviewed, manually annotated entries) and UniProtKB/TrEMBL (containing unreviewed, automatically annotated entries). As of 19 March 2014, release "2014_03" of UniProtKB/Swiss-Prot contains 542,782 sequence entries (comprising 193,019,802 amino acids abstracted from 226,896 references) and release "2014_03" of UniProtKB/TrEMBL contains 54,247,468 sequence entries (comprising 17,207,833,179 amino acids).

The UniProt consortium comprises the European Bioinformatics Institute (EBI), the Swiss Institute of Bioinformatics (SIB), and the Protein Information Resource (PIR). EBI, located at the Wellcome Trust Genome Campus in Hinxton, UK, hosts a large resource of bioinformatics databases and services. SIB, located in Geneva, Switzerland, maintains the ExPASy (Expert Protein Analysis System) servers that are a central resource for proteomics tools and databases.

Using the UniProt web server

UniProt collects a large number of different proteins; each entry is associated with a unique "UniProt identification number" (the accession number).
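
As a small illustration of what such an identifier looks like, the sketch below checks whether a string has the shape of a UniProtKB accession number, using the regular expression given in the UniProt help pages. The function names are hypothetical, and the entry URL layout is an assumption that may change over time.

```python
import re

# Accession number pattern as given in the UniProt help pages.
ACCESSION_RE = re.compile(
    r"[OPQ][0-9][A-Z0-9]{3}[0-9]|[A-NR-Z][0-9]([A-Z][A-Z0-9]{2}[0-9]){1,2}"
)


def is_uniprot_accession(text):
    """Return True if text has the shape of a UniProtKB accession number."""
    return ACCESSION_RE.fullmatch(text.strip().upper()) is not None


def entry_url(accession):
    """Build the address of an entry page (URL layout is an assumption)."""
    if not is_uniprot_accession(accession):
        raise ValueError(f"not a UniProt accession: {accession!r}")
    return "https://www.uniprot.org/uniprotkb/" + accession.strip().upper()


# The three accessions used in this exercise, plus a plain keyword.
for candidate in ["P40337", "P02452", "P69905", "hemoglobin"]:
    print(candidate, is_uniprot_accession(candidate))
```

A check like this only tells you that the string is well formed, not that the entry actually exists in the database.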

Now, try to open one "entry", for instance P40337 (Human von Hippel-Lindau protein).

What information do you obtain?

Data are organized in different sections, such as Function, Names & Taxonomy, Subcellular location, Pathology & Biotech, etc.

Please answer the following questions using the UniProt entry for pVHL:

What is the pVHL main function?

How many isoforms are described for pVHL?

What is the pVHL main pathway?

Now try to perform another search, this time using the P02452 entry

What is the protein name?

What is the subcellular localization?

Where is the triple-helical region located?

Other useful resources

Computers scattered around the world host many "applications", or web servers, that we can use to perform bioinformatics calculations. A web server is nothing more than a dynamic website: we fill in the appropriate fields with our data and, after a short wait, the server presents a new web page with the results. The web servers that we will use during our tutorials are numerous and range from the famous Google (to search for other pages with information of interest) to web services that can be used to translate, identify, align and analyze nucleotide or protein sequences.

During our tutorials we will use both online web servers and stand-alone programs (i.e. installed on the local machine). All these web servers and programs work on Linux, as well as on Windows and Mac operating systems. In other words, you can try these programs and practise the same exercises from your home computer connected to the Internet. Here is a list of the main web servers and programs that we will see during our exercises:

PubMed (from NCBI ) for bibliographic search
UniProt where you find information on the most relevant proteins
Translate (from Expasy) to translate a nucleotide sequence into the corresponding amino acid sequence
OMA Browser to retrieve genes and orthologous protein sequences from different species
Water to perform a local alignment between two sequences (Smith-Waterman)
Align to perform several different alignments, e.g. seq-seq, seq-profile and profile-profile; local, global and freeshift
Blastp and PSI-BLAST (from NCBI) or FASTA (from EBI) to align your sequence against a large sequence database (e.g. SwissProt/UniProt or nr, the non-redundant database)
ClustalW (from EBI) to perform multiple sequence alignment
Pfam, Interpro and ELM to retrieve information on domain composition and linear motifs
PROSITE to search information on functional domains, protein family and functional sites
GO to predict the biological function
SCOP and CATH for protein classification (structurally based)
If you have a crystal structure of your protein:
- PDB to identify the right crystal structure
- PDBsum a sort of large summary of all already known crystal structures
- ConSurf to identify conserved regions
When you don't have a crystal structure:
- HMMTOP to predict transmembrane helices
- CSpritz to predict disordered regions
- Homer to predict the 3D structure (homology modeling)
When you have to characterize a disordered protein:
- MobiDB a centralized resource for annotations of intrinsic protein disorder
- DisProt, a database of experimental evidence of disorder manually collected from the literature

Other really useful software:

Jalview, a sequence editor available online and as a stand-alone program
Jmol, an online PDB viewer
Some stand-alone PDB viewers (can be installed on your home computer):
- RasMol (small and light, simple rendering)
- Chimera (advanced, high quality rendering)
- PyMol (download for Windows)

As an example of the usefulness of these online services, try querying some web servers for information about a well-known protein, such as hemoglobin. Later, we will see in more detail what is possible and how to interpret the results from different servers:

Open NCBI, select PubMed, type hemoglobin and click Search -> to retrieve hemoglobin-related scientific publications
Then open NCBI, select Protein, type hemoglobin and click Search -> to find all hemoglobin protein sequences available in the database. Click on one sequence: at the end of the page you will find the entire sequence of the selected protein
Open NCBI, select Nucleotide, type hemoglobin and click Search -> a similar search but using nucleotide sequences
Open UniProt, type hemoglobin and click Search -> this will give you the list of manually annotated proteins. Now, try to open one "entry" (for instance P69905). What information do you obtain?

Europe PMC

What is Europe PMC? Europe PMC is a repository providing access to worldwide life sciences articles, books, patents and clinical guidelines. Europe PMC provides links to relevant records in databases such as UniProt, European Nucleotide Archive (ENA), Protein Data Bank in Europe (PDBe) and BioStudies.

The repository offers several tools to perform bibliographic searches. Among them, the most important is the possibility to automatically highlight relevant keywords within several abstracts. Now, try to search breast cancer (a generic definition for different cancer subtypes).

Papers can be sorted by different criteria, such as Relevance, Date and Times Cited, while a box with Popular Content Sets is located on the right side.

What information do you obtain?

How many papers have you found? (Free full text articles only)

Now, please open the first result (If you have found no relevant information, please try with this example: G protein-coupled KISS1 receptor is overexpressed in triple negative breast cancer and promotes drug resistance).

Using the box on the right side, you can highlight different relevant keywords.

Please answer the following questions using the "colon cancer" keywords:

How many patents have you found?

Select the paper named "Hyperglycemia exacerbates colon cancer malignancy through hexosamine biosynthetic pathway". How many organisms are described?

Which diseases are presented in the abstract?

Now try to perform another search, this time using the apoptosis keyword

What is the most cited paper?

What is the most recent?

2. FASTA

Dynamic programming algorithms, such as Smith-Waterman, are ideal for aligning two sequences accurately, but they are too slow for similarity searches in databases. A modern personal computer performs a complete alignment in a few minutes but, for a large database such as GenBank, which contains millions of sequences, a search may require many hours. For example, if you wanted to run a GenBank search for each gene identified in the genome of an entire organism, e.g. yeast with its 6000 genes, you would have to wait several years to complete the analysis. Therefore, faster programs are needed to perform similarity searches against large databases. The first program that performed really fast similarity searches, even against large databases, was FASTA, originally developed by Lipman and Pearson in 1985. This algorithm is based on a word indexing strategy, an intuition that speeds up searches.
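
To give an idea of what word indexing means, here is a minimal sketch in Python. It is only a toy illustration of the first step (finding short words shared by two sequences and grouping them by diagonal), not the actual implementation of the FASTA program; the function names and the two short sequences are made up for the example.

```python
from collections import defaultdict


def build_word_index(sequence, k=2):
    """Map every word of length k to the positions where it starts."""
    index = defaultdict(list)
    for pos in range(len(sequence) - k + 1):
        index[sequence[pos:pos + k]].append(pos)
    return index


def shared_word_diagonals(query, subject, k=2):
    """Count shared words per diagonal (subject position minus query position)."""
    index = build_word_index(subject, k)
    diagonals = defaultdict(int)
    for q_pos in range(len(query) - k + 1):
        # Look up each query word in the index instead of scanning the subject.
        for s_pos in index.get(query[q_pos:q_pos + k], []):
            diagonals[s_pos - q_pos] += 1
    return dict(diagonals)


# Toy sequences, invented for this example.
query = "HEAGAWGHEE"
subject = "PAWHEAE"
print(shared_word_diagonals(query, subject))
```

Diagonals with many shared words point to regions worth aligning in detail; everything else can be skipped, which is where the speed comes from. The index is built once and can be reused for every query.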

FASTA can be used via the web server on the EBI website. The program also introduced a precise format (FASTA format), nowadays commonly used to describe amino acid sequences (for more information, please scroll down this page to the note marked with *).

Try to identify these two proteins (i.e. obtain the UniProt entry name), first translating them into the corresponding amino acid sequences using Expasy (select the Translate tool). Try to figure out what kind of input sequence is required by the "Translate" tool from Expasy (i.e. whether it must be in FASTA format or not) and then try to understand which is the correct reading frame. Once you have identified the right reading frame, you can download the translated sequence by clicking the link in the right frame, then clicking on the first translated amino acid and choosing virtual FASTA sequence.
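
If you are curious about what a translation in different reading frames looks like, the following Python sketch translates a DNA sequence in all six frames (three on each strand) with the standard genetic code. It is not the code behind the Expasy tool; the function names and the short DNA sequence are invented for the example, and picking the frame with the longest stretch free of stop codons is only a common rule of thumb.

```python
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the sequence of the opposite strand, read 5' to 3'."""
    return dna.upper().translate(COMPLEMENT)[::-1]


def translate(dna, frame=0):
    """Translate dna starting at offset frame (0, 1 or 2); '*' marks a stop."""
    dna = dna.upper()
    codons = (dna[i:i + 3] for i in range(frame, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def six_frames(dna):
    """Translate both strands in all three frames."""
    frames = {}
    for strand, seq in (("+", dna.upper()), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq, offset)
    return frames


def longest_open_stretch(protein):
    """Longest run of amino acids not interrupted by a stop codon."""
    return max(protein.split("*"), key=len)


# Made-up DNA sequence, used only to show the output layout.
dna = "ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG"
for name, protein in six_frames(dna).items():
    print(name, protein, len(longest_open_stretch(protein)))
```

For a real coding sequence, the correct frame is usually the one that gives a long translation without internal stop codons.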

When you have your translated sequence in FASTA format, align it with FASTA against the UniProtKB/SwissProt database. Analyze your results and try to get the UniProt ID for your unknown protein. Note: the waiting time may vary depending on the current server load.

Group A: Sequence

Group B: Sequence

Once you have obtained the protein names, you can determine the nature of each protein through Expasy. To this aim, simply enter in the UniProt search form the accession number retrieved using psi-blast (see next exercise) or from FASTA (use the sequence name found in the field: Primary accession number).
Please inspect the different kinds of information you can obtain from this database; they may be really helpful in the future...

3. BLAST

BLAST (Basic Local Alignment Search Tool) is an algorithm for comparing primary biological sequence information, such as the amino acid sequences of different proteins or the nucleotides of DNA sequences. A BLAST search enables a researcher to compare a query sequence with a library or database of sequences, and identify library sequences that resemble the query sequence above a certain threshold. BLAST was originally developed by Altschul and other researchers at NCBI in 1990 and has been improved since. Like FASTA, BLAST relies on a word indexing strategy, but the two programs differ in both the index content and the way they proceed after finding common words.

Here, we will only use protein databases, with the protein-oriented versions of BLAST, such as blastp and psi-blast.
Starting from the nucleotide sequences below, you are asked to identify the proteins, how many domains are present and what the associated functions are.

Group A: Sequence

Group B: Sequence

Please open Protein BLAST and select blastp; then repeat the search using psi-blast and try to find out the differences between the two programs.

Paste your amino acid sequence into the input box labeled "Search"; as database, please select ("Choose databases" option) "nr" (one of the protein databases). The other parameters should not be changed.
Click on "BLAST" to send your request.
This opens a temporary waiting page; alignment against large databases may take a long time.
After waiting a few minutes (as indicated), a new window containing your results should pop up.
If the server has not finished yet, the waiting message is automatically updated. Note: the waiting time may vary depending on the server load.

A lot of information can be obtained by clicking on the alignment and looking at the various fields provided by the program. Spend a little time working out whether, and how easily, you can find information on the sequence that you provided as input.

What is the most relevant difference between blastp and psi-blast?

Interpreting the blastp output

In the page header, at the top right (link "Show conserved domains"), you can open a window highlighting the domains found by the program in the query sequence. By clicking on the image you can get more information on sequences that contain these domains. The results page includes five parts (top to bottom). The psi-blast results are organized in an entirely equivalent way.

The first part gives information about the program (in this case BLASTP), the "query" sequence (your sequence), including its length (in amino acids), and the databases used. It also presents a link called "Taxonomy reports", from which you can obtain information based on the data in the NCBI taxonomy database.
The second part consists of a picture that illustrates the results:
- the thick red line represents the "query" sequence;
- the numbers refer to its length in amino acids;
- each thin line below, in various colors, indicates an alignment between your sequence and a sequence available in the database;
- the color of these lines represents the alignment quality, according to the color scale placed on top (red = best, black = worst);
- by clicking on a single line you will be redirected to the alignment (located in the fourth part of the page) between the "query" and the given database sequence.
The third part is the list of proteins producing significant alignments with the "query" sequence and begins with the sentence: "Sequences producing significant alignments:". These sequences are ordered according to the E value (right column). Each entry contains a link (colored and underlined) to the related Entrez record. On the right, links to the Gene ("G") database are also present for each sequence.
The fourth part contains the alignments between the "query" sequence and sequences from the chosen database. This section begins with the word: "Alignments".

The alignments with the "query" sequence are then shown. For each alignment the following are indicated:
- "Score": the score of the alignment;
- "Expect": the E value of the alignment;
- "Identities": the number of identical amino acids/the length of the alignment; the resulting percentage is indicated in parentheses. Identity refers only to that part of the alignment;
- "Positives": the number of identical amino acids plus the number of similar amino acids; the resulting percentage over that alignment length is indicated in parentheses; which amino acids count as similar is defined by the matrix used for the alignment;
- "Gaps": the number of gaps present in the alignment/the length of the alignment in question; the resulting percentage of gaps is shown in parentheses.
  Alignment length = identities + gaps + other mismatches in the alignment.
  Then you have the actual alignment between the "query" sequence and the database sequence, named "Sbjct".
  
  The numbers indicate the position of the amino acids within their sequences.
  If, in a given position, the amino acid in the "query" sequence and in the matching database sequence are the same, the corresponding letter is repeated in the line between the two sequences.
  The symbol "+" indicates that, in a given position of the alignment, the amino acid in the "query" sequence and the corresponding one in the database sequence are similar.
  When there is no character in the line between the two sequences, it means that, in that specific position of the alignment, the amino acids in the "query" and in the corresponding database sequence are different, or that one of the two sequences has a gap.
The fifth part contains statistical details of the search.
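
The fields described above can be recomputed by hand from any pairwise alignment. The sketch below does this in Python for a made-up alignment. Note the assumptions: the function names are hypothetical, and SIMILAR_PAIRS is only a small hand-picked subset of the amino acid pairs that score positively in the BLOSUM62 matrix, whereas blastp uses the full matrix selected for the search.

```python
# Small, incomplete subset of pairs with a positive BLOSUM62 score (assumption:
# used here only for illustration, the real program uses the whole matrix).
SIMILAR_PAIRS = {
    frozenset(pair)
    for pair in ("IL", "IV", "LV", "LM", "KR", "DE", "QE", "FY", "WY", "ST")
}


def midline(query, sbjct):
    """Build the line shown between Query and Sbjct in the alignment."""
    chars = []
    for q, s in zip(query, sbjct):
        if q == "-" or s == "-":
            chars.append(" ")          # gap in one of the two sequences
        elif q == s:
            chars.append(q)            # identical amino acids
        elif frozenset((q, s)) in SIMILAR_PAIRS:
            chars.append("+")          # similar amino acids
        else:
            chars.append(" ")          # mismatch
    return "".join(chars)


def alignment_stats(query, sbjct):
    """Count identities, positives, gaps and mismatches of an alignment."""
    if len(query) != len(sbjct):
        raise ValueError("aligned sequences must have the same length")
    mid = midline(query, sbjct)
    length = len(query)
    gaps = sum(1 for q, s in zip(query, sbjct) if q == "-" or s == "-")
    identities = sum(1 for c in mid if c not in " +")
    positives = identities + mid.count("+")
    return {
        "length": length,
        "identities": identities,
        "positives": positives,
        "gaps": gaps,
        "mismatches": length - identities - gaps,
    }


# Made-up alignment ('-' is a gap).
query = "MKV-LITE"
sbjct = "MRVALLSE"
print("Query", query)
print("     ", midline(query, sbjct))
print("Sbjct", sbjct)
stats = alignment_stats(query, sbjct)
for field in ("identities", "positives", "gaps"):
    print(f"{field}: {stats[field]}/{stats['length']}"
          f" ({100 * stats[field] / stats['length']:.0f}%)")
```

In this sketch the similar pairs marked with "+" are counted among the mismatches, so that identities, gaps and mismatches add up to the alignment length.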

4. DIY: Do It Yourself

It's time to test your skills with an unknown protein!

Please repeat both the FASTA and BLAST searches using the nucleotide sequence of your unknown protein (one for each student).

Please open NCBI and select BLAST; then repeat the search using psi-blast and try to find out the differences between the two programs.

Paste your sequence into the input box labeled "Search"; as database, please select ("Choose databases" option) "nr" (one of the protein databases). The other parameters should not be changed.
Click on "BLAST" to send your request.
This opens a temporary page; please note that aligning a sequence against a large database may require a long time.
After waiting a few minutes (as indicated), a new window containing your results should pop up.
If the server has not finished yet, the waiting message is automatically updated. Note: the waiting time may vary depending on the server load.

What protein does your sequence correspond to?

What is the UniProt ID of your protein?

What is its main function?

5. Bibliographic search (Optional)

Search of an article by keywords

An article in the biomedical field can be searched for via the NCBI PubMed database. Let's try retrieving the scientific publication on the human genome sequencing:

Search for pubmed using Google, or open PubMed from the list of useful web servers given before
Try using the following keywords: human genome sequence nature (How many publications have you found? Have you found the right article?)

Now let's use the same keywords but perform a more targeted search.

Click on "Advanced search" to use the advanced search features
Choose the correct field from the pull-down menu, then add your keywords using the "AND" operator (observe the differences with the previous search):
- Title human genome
- Title sequence
- journal nature
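
The advanced search form simply assembles a query string in which every keyword is followed by a field tag in square brackets. As a rough sketch, assuming the PubMed field tags [Title] and [Journal] and the usual search address of the PubMed website, the same query could be built like this (the function name is hypothetical):

```python
from urllib.parse import urlencode


def build_pubmed_query(terms):
    """Join (keyword, field) pairs with the AND operator."""
    return " AND ".join(f"{text}[{field}]" for text, field in terms)


query = build_pubmed_query([
    ("human genome", "Title"),
    ("sequence", "Title"),
    ("nature", "Journal"),
])
print(query)

# Address that opens the same search in a browser (URL layout is an assumption).
url = "https://pubmed.ncbi.nlm.nih.gov/?" + urlencode({"term": query})
print(url)
```

You can paste the printed query directly into the PubMed search box and compare the result with the one obtained through the form.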

This time the list of retrieved items is shorter. The reference that we were looking for is "Nature. 2004 Oct 21; 431 (7011): 931-45"; click on it to read the abstract. By selecting the image with the word "nature" you will be redirected to the publisher's website, where you can download the article in electronic format (PDF).

Searching for a scientific article among journals

Now, search for the electronic version (PDF format, printable) of an article on the genome sequencing of the bacterium E. coli. The right reference is:

Welch et al. (2002), PNAS, 99 (26), 17020-4

Bibliographic references are generally organized (not necessarily in this order) by author or authors (et al.), year of publication, and journal abbreviation (in this example, PNAS stands for Proceedings of the National Academy of Sciences of the United States of America), often in italics. Optionally, these are followed by the volume number and the issue number (in brackets, often bold).

Since we know the exact bibliographic reference of the article, we do not need to look for it in PubMed. Now, we have to check whether the University has a subscription that allows downloading this article in electronic format.

Open the online library of the University of Padua
Select "Books, Journals", then search by "Title"
Type proceedings academy and look for the PNAS journal entry (U.S.A.)

* FASTA format: one of the several ways we can use to represent biological sequences. It is simple and easy to read.

The first line always starts with '>' followed by several pieces of information, e.g. sequence name, gene code, official identification number, database; the second and following lines contain the sequence.
An example:

>envelope protein (TVFV2E)
ELRLRYCAPAGFALLKCNDADYDGFKTNCSNVSVVHCTNLMNTTVTTGLLLNGSYSENRT
QIWQKHRTSNDSALILLNKHYNLTVTCKRPGNKTVLPVTIMAGLVFHSQKYNLRLRQAWC
HFPSNWKGAWKEVKEEIVNLPKERYRGTNDPKRIFFQRQWGDPETANLWFNCHGEFFYCK
MDWFLNYLNNLTVDADHNECKNTSGTKSGNKRAPGPCVQRTYVACHIRSVIIWLETISKK
TYAPPREGHLECTSTVTGMTVELNYIPKNRTNVTLSPQIESIWAAELDRYKLVEITPIGF
APTEVRRYTGGHERQKRVPFVXXXXXXXXXXXXXXXXXXXXXXVQSQHLLAGILQQQKLL
AAVEAQQQMLKLTIWGVK
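
Because the format is so simple, it is also easy to read with a program. Here is a minimal sketch of a FASTA reader in Python (the function name is hypothetical; the example text is the beginning of the entry shown above):

```python
def parse_fasta(text):
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue                      # skip empty lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:].strip(), []
        elif header is not None:
            chunks.append(line)           # sequence lines are concatenated
    if header is not None:
        yield header, "".join(chunks)


example = """>envelope protein (TVFV2E)
ELRLRYCAPAGFALLKCNDADYDGFKTNCSNVSVVHCTNLMNTTVTTGLLLNGSYSENRT
QIWQKHRTSNDSALILLNKHYNLTVTCKRPGNKTVLPVTIMAGLVFHSQKYNLRLRQAWC
"""

for name, sequence in parse_fasta(example):
    print(name, len(sequence))
```

The same function works for files containing several sequences, since every new '>' line starts a new entry.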

2017-04-27 (4)