PAVE OVERVIEW

Contents
Overview of the PAVE assembly
  References
Basic Searches
BLAST Search
Advanced Searches
Summary Tables
Contig Views
  Graphics View
  Bases View



Overview of the PAVE Assembly
  PAVE [1] (Program for Assembling and Viewing ESTs) is a software package for assembling ESTs. It uses Megablast [3] to compare ESTs and CCSs (contig consensus sequences); the matches are then filtered with a set of consistency rules so that only consistent ESTs are assembled. CAP3 [2] is used to assemble sets of ESTs, and the results are retained only if they pass another set of consistency rules. The consistency rules ensure that mate-pairs are in the same contig. When a contig has one or more mate-pairs but assembles into two contigs, the two contigs are joined into one with 50 n's between them.
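
The joining step can be pictured with a minimal Python sketch. The function name and the example sequences are hypothetical, and the sketch assumes both consensus sequences are already in the correct order and orientation; it is not PAVE's own code.

```python
SPACER = "n" * 50  # the 50 n's described above


def join_contigs(left_consensus: str, right_consensus: str) -> str:
    """Join two consensus sequences into one, separated by the 50-n spacer."""
    return left_consensus + SPACER + right_consensus


joined = join_contigs("ACGTACGT", "TTGACCA")
print(len(joined))  # 8 + 50 + 7 = 65
```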

Buried ESTs: To speed up processing, avoid running out of memory on large contigs, and display large contigs faster, PAVE uses the concept of buried clones. When a large contig is displayed, only the non-buried clones are shown. Select Show buried to view all the clones (contigs with a high EST count can be slow to display). If the contig description line says

    			CONTIG name_number - N ESTs (M buried, No Re-CAP)
    
this indicates that the buried clones are aligned using only the right coordinate of their parent clone; because these alignments are not exact, the buried ESTs are drawn entirely in red. If the phrase "No Re-CAP" is not present, then the alignments are correct.
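
If you save contig description lines to a file, the sketch below shows one way to pull out the EST counts and the "No Re-CAP" flag. The regular expression is an assumption based only on the example line above, and the contig names in the usage lines are made up.

```python
import re

# Assumed layout, taken from the example line above:
#   CONTIG name_number - N ESTs (M buried, No Re-CAP)
LINE_RE = re.compile(
    r"CONTIG\s+(?P<name>\S+)\s+-\s+(?P<n_ests>\d+)\s+ESTs"
    r"(?:\s+\((?P<n_buried>\d+)\s+buried(?P<no_recap>,\s*No Re-CAP)?\))?"
)


def parse_contig_line(line):
    """Return a dict describing the contig, or None if the line does not match."""
    m = LINE_RE.search(line)
    if m is None:
        return None
    return {
        "name": m.group("name"),
        "n_ests": int(m.group("n_ests")),
        "n_buried": int(m.group("n_buried") or 0),
        # Without "No Re-CAP", the buried alignments are correct.
        "buried_alignments_correct": m.group("no_recap") is None,
    }


# Hypothetical contig names, for illustration only.
a = parse_contig_line("CONTIG Zm_0012 - 340 ESTs (120 buried, No Re-CAP)")
b = parse_contig_line("CONTIG Zm_0013 - 12 ESTs (3 buried)")
print(a["buried_alignments_correct"], b["buried_alignments_correct"])
```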

Annotation: Assemblies with multiple libraries usually annotate contigs with the R statistic [4]. A contig may also have a "Note", which by default identifies potential problem contigs (this may be changed for a given project). Two annotation pipelines are available with PAVE; if they have been executed, the following annotations also exist: (1) GC content and longest ORFs, (2) best UniProt match [5], GO [6], and GOSlim [7]. Typically the annotation is taxonomy-specific (Plants, Vertebrates, Fungi). A separate system allows viewing of the contigs not annotated by the taxonomy-specific database; these contigs are annotated from the full UniProt database in order to identify possible contamination.

UniProt match determinations: If the annotation pipeline has been run, each contig is given the top three Swiss-Prot and top three TrEMBL matches (limited to unique organisms). A match is included only if its E-Value is 1e-20 or lower.
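
The selection rule can be sketched in Python as follows. This is an illustration, not PAVE's implementation: the `Hit` record and `select_matches` function are hypothetical, hits are assumed to be ranked by E-Value, and "limited to unique organisms" is interpreted as at most one hit per organism within each database.

```python
from dataclasses import dataclass

E_VALUE_CUTOFF = 1e-20   # threshold stated above
MAX_PER_DATABASE = 3     # top three per database


@dataclass
class Hit:
    uniprot_id: str
    database: str   # "Swiss-Prot" or "TrEMBL"
    organism: str
    e_value: float


def select_matches(hits):
    """Keep up to three hits per database, one per organism, best E-Value first."""
    kept = []
    for database in ("Swiss-Prot", "TrEMBL"):
        candidates = sorted(
            (h for h in hits
             if h.database == database and h.e_value <= E_VALUE_CUTOFF),
            key=lambda h: h.e_value,
        )
        seen_organisms = set()
        for hit in candidates:
            if hit.organism in seen_organisms:
                continue  # assumption: only the best hit per organism is kept
            seen_organisms.add(hit.organism)
            kept.append(hit)
            if len(seen_organisms) == MAX_PER_DATABASE:
                break
    return kept


# Made-up hits, for illustration only.
hits = [
    Hit("P1", "Swiss-Prot", "Zea mays", 1e-50),
    Hit("P2", "Swiss-Prot", "Zea mays", 1e-45),
    Hit("P3", "Swiss-Prot", "Oryza sativa", 1e-30),
    Hit("P4", "Swiss-Prot", "Arabidopsis thaliana", 1e-10),
    Hit("Q1", "TrEMBL", "Zea mays", 1e-60),
]
selected_ids = [h.uniprot_id for h in select_matches(hits)]
print(selected_ids)
```

In this example P2 is dropped because its organism is already represented, and P4 is dropped because its E-Value does not reach the cutoff.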

References

  1. Soderlund, C., E. Johnson, M. Bomhoff, and A. Descour. PAVE: Assembling and Viewing ESTs. In preparation.
  2. Huang, X., and A. Madan (1999). CAP3: A DNA sequence assembly program. Genome Res 9:868-87.
  3. Zhang, Z., S. Schwartz, L. Wagner, and W. Miller (2000). A greedy algorithm for aligning DNA sequences. J Comput Biol 7:203-214.
  4. Stekel, D.J., Git, Y., & Falciani, F. The Comparison of Gene Expression from Multiple cDNA Libraries. Genome Research 10, 2055-2061 (2000).
  5. Bairoch, A., Apweiler, R., Wu, C.H., Barker, W.C., Boeckmann, B., Ferro, S., Gasteiger, E., Huang, H., Lopez, R., Magrane, M. et al. (2005) The Universal Protein Resource (UniProt). Nucleic Acids Res, 33, D154-159.
  6. Camon, E., Magrane, M., Barrell, D., Lee, V., Dimmer, E., Maslen, J., Binns, D., Harte, N., Lopez, R. and Apweiler, R. (2004) The Gene Ontology Annotation (GOA) Database: sharing knowledge in Uniprot with Gene Ontology. Nucleic Acids Res, 32 Database issue, D262-266.
  7. Biswas, M., O'Rourke, J.F., Camon, E., Fraser, G., Kanapin, A., Karavidopoulou, Y., Kersey, P., Kriventseva, E., Mittard, V., Mulder, N. et al. (2002) Applications of InterPro in protein annotation and genome analysis. Brief Bioinform, 3, 285-295.


Basic Searches
  Search for UniProt ID: If the assembly is annotated with UniProt, then this search will be available. Enter a valid UniProt ID, and if that protein has aligned to any consensus sequence in the database, it will be displayed. The example UniProt ID belongs to the most frequently occurring organism. The method of populating Best UniProt Match is covered above.

Search for UniProt Description: If the assembly is annotated with UniProt, then this search will be available. Enter a word expected in the UniProt description (the search is case-insensitive).

Search for EST: ESTs are named as shown on the contig displays. The name may have a substituted extension (.5 for 5' and .3 for 3') if the extensions were not uniform across the libraries.

Search for GenBank gi: A GenBank gi search is offered if the assembly is annotated with this information. Enter the gi number of the EST of interest.

Search for GO: A GO search is offered if the assembly is annotated with this information. Enter the GO number of interest.

Search for Contig: Enter a contig name.

For all searches: if the system cannot find the string that you entered, it will search for the substring and list all results that contain the substring. If you wish to control where the wildcard goes, use '*' in the search string. Searches are case-insensitive.
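
The fallback behaviour described above can be sketched in Python. This is only an illustration of the stated rules; the `search` function is hypothetical, and it assumes that a query containing '*' must match the whole name.

```python
import re


def search(names, query):
    """Case-insensitive lookup: wildcard if '*' is present, else exact, else substring."""
    q = query.lower()
    lowered = [(name, name.lower()) for name in names]
    if "*" in q:
        # Translate '*' into "any run of characters"; everything else is literal.
        pattern = ".*".join(re.escape(part) for part in q.split("*"))
        return [name for name, low in lowered if re.fullmatch(pattern, low)]
    exact = [name for name, low in lowered if low == q]
    if exact:
        return exact
    # Nothing matched exactly, so fall back to a substring search.
    return [name for name, low in lowered if q in low]


# Made-up contig names, for illustration only.
names = ["Contig_12", "Contig_120", "contig_3"]
print(search(names, "contig_12"))   # exact match only
print(search(names, "tig_1"))       # substring fallback
print(search(names, "*_3"))         # user-placed wildcard
```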




BLAST Search
  The BLAST Search button under Basic Searches takes you to a query system where you can blast a user-provided sequence (nucleotide or protein) against the PAVE-assembled contigs or the original library ESTs. Various pages in the query system offer links to the blast search with the sequence pre-populated. WARNING: Blasting long nucleotide sequences against a large protein database can cause the browser to time out before returning results.


Advanced Searches
  Contig Search
  The Contig Search button under Advanced Searches takes you to a query system where you can ask for all contigs meeting user-defined criteria or sort contigs on Test Statistic R by Stekel et al. The entrance page of the query system has Instructions that provide guidance on how to use this page. Which filters are offered depends on what annotation has been added. A search will result in a table. Click on a contig name in the results table to access the graphical view for that contig.

A good way to understand the queries is by example. Under Library Descriptions on the main page is a table that links to examples of these queries. To run your own query after viewing an example, click the 'Restore Defaults' button on the query system results page; the system will clear and display instructions.

  Protein Search
  When the assembly is annotated, the Protein Search button is offered from the main page under Advanced Searches. This system gives the set of proteins most likely to be expressed by the libraries. To be included in the set, a UniProt ID must be the top match for a contig, the contig must match the UniProt ID with an E-Value of 1e-40 or better, and at least 60% of all the contig's ESTs must match the UniProt ID. Only the best-formed contigs are included in this system.
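
As a rough sketch, the numeric inclusion criteria can be written as a single predicate. The record and field names below are hypothetical, and the "best-formed contigs" requirement is not modelled because the text above does not define it.

```python
from dataclasses import dataclass

CONTIG_E_VALUE_CUTOFF = 1e-40   # "1e-40 or better"
MIN_PERCENT_MATCHING_ESTS = 60  # "at least 60%"


@dataclass
class ContigTopMatch:
    contig: str
    top_uniprot_id: str
    contig_e_value: float       # E-Value of the contig against its top UniProt ID
    total_ests: int             # all ESTs in the contig
    ests_matching_top_id: int   # ESTs that also match the top UniProt ID


def qualifies_for_protein_search(m: ContigTopMatch) -> bool:
    """Apply the E-Value and EST-fraction thresholds stated above."""
    if m.total_ests == 0:
        return False
    enough_ests = (
        m.ests_matching_top_id * 100 >= MIN_PERCENT_MATCHING_ESTS * m.total_ests
    )
    return m.contig_e_value <= CONTIG_E_VALUE_CUTOFF and enough_ests


# Made-up values, for illustration only.
print(qualifies_for_protein_search(ContigTopMatch("C1", "P1", 1e-45, 10, 6)))
print(qualifies_for_protein_search(ContigTopMatch("C2", "P2", 1e-45, 10, 5)))
```

The percentage is compared with integer arithmetic so that a contig with exactly 60% matching ESTs is not excluded by floating-point rounding.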

  No Anno Search
  When assemblies are annotated with a taxonomy-specific UniProt set, the No Anno Search is made available for the contigs that did not receive a top match (E-Value <= 1e-20) from the taxonomic UniProt blast. These unannotated contigs are blasted against the full UniProt database and the results are offered here. This query can be useful for identifying contamination contigs.

  Annotation Search
  When the assembly is annotated, the Annotation Search button is offered on the main page under Advanced Searches; the search can also be reached from the example tables linked from Additional Summaries/Example Queries on the main page. The UniProt Search button takes you to a query page where you can query the assembly based on UniProt filters (e.g. UniProt IDs unique to a particular library, UniProt IDs matching contigs with ESTs from all libraries, UniProt IDs with a particular GO/GOSlim type, Test Statistic R by Stekel et al.). Click the UniProt ID in the resulting table to show all the contigs in the assembly that match the UniProt ID, each with its E-Value and bit score. The method of populating Best UniProt Match is covered above.

The best way to understand the queries is by example. The main page contains examples (e.g. the Distinct/Unique UniProt queries from the Annotation by EST Library table accessed by Additional Summaries/Example Queries on the main page). To run your own query after viewing an example, click the 'Restore Defaults' button on the query system results page; the system will clear and display instructions.

The UniProt Query system offers several variants on EST count: Total Match ESTs, Sum of Contig ESTs and Match Contig ESTs. These are defined as:

  • Match Contig ESTs: an EST is counted if it matches the UniProt ID with an E-Value <= 1e-10 and its contig matches the UniProt ID with an E-Value <= 1e-20.
  • Total Match ESTs: all ESTs that match the UniProt ID with an E-Value <= 1e-10, regardless of whether their contig has a listed match to this particular UniProt ID (see warning below).
  • Sum of Contig ESTs: the sum of EST counts for matching contigs, regardless of whether each EST matches the UniProt ID. The contig match is limited only by the default BLAST parameters.
WARNING:
When the system reports that an EST's contig does not match a UniProt ID, the contig may still match that ID very well. It only means that the contig had a BETTER match FOR THAT ORGANISM, so the contig's ESTs were assigned to the better match via Match Contig ESTs. This is an important point to remember when using the queries as a screening tool. Total Match ESTs ignores the best organism match; Match Contig ESTs does not.
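
The three counting methods can be compared side by side with a small Python sketch. The dictionaries and the `est_counts` function are hypothetical stand-ins for the database tables. For simplicity, any contig hit present in `contig_hits` is treated as a contig match; the "best match for that organism" reassignment described in the warning is not modelled.

```python
EST_CUTOFF = 1e-10     # EST-level E-Value threshold
CONTIG_CUTOFF = 1e-20  # contig-level E-Value threshold


def est_counts(uniprot_id, contig_members, est_hits, contig_hits):
    """Return the three EST counts for one UniProt ID.

    contig_members: contig name -> list of EST names
    est_hits:       EST name    -> {UniProt ID: E-Value}
    contig_hits:    contig name -> {UniProt ID: E-Value}
    """
    match_contig = total_match = sum_contig = 0
    for contig, ests in contig_members.items():
        contig_e = contig_hits.get(contig, {}).get(uniprot_id)
        if contig_e is not None:
            # Sum of Contig ESTs: every EST of a matching contig counts.
            sum_contig += len(ests)
        for est in ests:
            est_e = est_hits.get(est, {}).get(uniprot_id)
            if est_e is None or est_e > EST_CUTOFF:
                continue
            # Total Match ESTs: the EST matches, whatever its contig does.
            total_match += 1
            # Match Contig ESTs: the contig must match as well.
            if contig_e is not None and contig_e <= CONTIG_CUTOFF:
                match_contig += 1
    return {
        "Match Contig ESTs": match_contig,
        "Total Match ESTs": total_match,
        "Sum of Contig ESTs": sum_contig,
    }


# Made-up data, for illustration only.
contig_members = {"C1": ["e1", "e2", "e3"], "C2": ["e4", "e5"]}
est_hits = {"e1": {"P1": 1e-30}, "e2": {"P1": 1e-5}, "e4": {"P1": 1e-15}}
contig_hits = {"C1": {"P1": 1e-50}}
counts = est_counts("P1", contig_members, est_hits, contig_hits)
print(counts)
```

Here e4 matches P1 but its contig C2 does not, so it adds to Total Match ESTs only, while e2 and e3 add to Sum of Contig ESTs only.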

The library counts displayed, and used for the Test Statistic R by Stekel et al., use the Match Contig ESTs accounting method. This ensures that UniProt IDs identified with high representation in a library are truly highly represented. However, it is possible for another library to be falsely low due to poor quality sequence, mis-assembled contigs, or mis-identification of the best organism match for a contig due to sequencing/assembly errors. Use this query system as a screening tool only.
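
For readers who want to reproduce the statistic by hand, the sketch below computes R for one UniProt ID as it is usually written from Stekel et al. [4]: R = sum over libraries of x_i * ln(x_i / (N_i * f)), where x_i is the count in library i, N_i is the library size, and f = sum(x_i) / sum(N_i). The function name is hypothetical, the use of the natural logarithm is an assumption, and the library sizes PAVE actually uses are not stated here.

```python
import math


def stekel_r(counts, library_sizes):
    """R statistic for one gene/protein across libraries.

    counts:        per-library EST counts (x_i)
    library_sizes: per-library total EST counts (N_i)
    """
    total_count = sum(counts)
    total_size = sum(library_sizes)
    if total_count == 0 or total_size == 0:
        return 0.0
    f = total_count / total_size  # expected frequency under equal expression
    r = 0.0
    for x, n in zip(counts, library_sizes):
        if x > 0:  # libraries with a zero count contribute nothing
            r += x * math.log(x / (n * f))
    return r


# Made-up counts: evenly represented vs. present in one library only.
r_even = stekel_r([10, 20], [1000, 2000])
r_skewed = stekel_r([30, 0], [1000, 1000])
print(r_even, r_skewed)
```

A value near zero means the counts are proportional to library size; larger values indicate a stronger departure from equal representation.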

  GO Search
  When the assembly is annotated and GO databases were included during the annotation step, the GO Search button is offered on the main page, as are the Top 10 GO/GOSlim occurrences for each ontology under Additional Summaries/Example Queries. The GO Search button takes you to a query page where you can query the assembly based on GO (e.g. GOs unique to a particular library, GOs with particular wording, Test Statistic R by Stekel et al.). For an EST to be assigned to a GO, it must match a related UniProt ID with an E-Value of 1e-10 or better, and its contig must match the same UniProt ID with an E-Value of 1e-20 or better.

  EC Search
  When the assembly is annotated and GO databases were included during the annotation step, the EC Search button may be offered from the main page. EC numbers are assigned only to the UniProts appearing in the Protein Search (the best-quality contigs). Since the Protein Search is built from only the top UniProt matches, and these are often TrEMBL proteins with no GO or EC assignments, proteins with no EC assignment can be given EC numbers by blasting against a UniProt set with EC numbers and taking the first match with E-Value = 0. Steps for EC assignment:
  1. First, UniProts are assigned the EC numbers given in their uniprot.dat listing.
  2. Next, UniProts without ECs are given ECs related to their GO annotation.
  3. Finally, UniProts still without ECs are blasted against those with ECs in the uniprot.dat file. The first blast match with E-Value = 0 gives its EC number to the UniProt without the EC annotation.
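
The three steps amount to a fallback chain, sketched below in Python. All names, IDs, and mappings are hypothetical placeholders; in particular, the GO-to-EC table is assumed to be available as a simple dictionary.

```python
def assign_ec(uniprot_id, dat_ec, go_terms, go_to_ec, blast_hits):
    """Return (ec_number, source) for a UniProt ID, or (None, None).

    dat_ec:     UniProt ID -> EC number found in uniprot.dat
    go_terms:   UniProt ID -> list of GO IDs annotated to it
    go_to_ec:   GO ID -> EC number (assumed mapping table)
    blast_hits: UniProt ID -> ordered list of (subject ID, E-Value) against
                the UniProts that have ECs
    """
    # Step 1: EC number listed directly in uniprot.dat.
    if uniprot_id in dat_ec:
        return dat_ec[uniprot_id], "uniprot.dat"
    # Step 2: EC number related to the GO annotation.
    for go_id in go_terms.get(uniprot_id, []):
        if go_id in go_to_ec:
            return go_to_ec[go_id], "GO"
    # Step 3: first BLAST match with E-Value = 0 donates its EC number.
    for subject_id, e_value in blast_hits.get(uniprot_id, []):
        if e_value == 0.0 and subject_id in dat_ec:
            return dat_ec[subject_id], "BLAST"
    return None, None


# Made-up IDs and mappings, for illustration only.
dat_ec = {"U1": "1.1.1.1", "U9": "2.7.1.1"}
go_terms = {"U2": ["GO:EXAMPLE1"]}
go_to_ec = {"GO:EXAMPLE1": "3.1.1.1"}
blast_hits = {"U3": [("U9", 0.0)], "U4": [("U9", 1e-80)]}
for uid in ("U1", "U2", "U3", "U4"):
    print(uid, assign_ec(uid, dat_ec, go_terms, go_to_ec, blast_hits))
```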



Summary Tables/Example Query Tables
  Depending on the extent of annotation offered, summary tables will appear on the page linked by Additional Summaries/Example Queries. These tables offer links to common queries as examples or starting points for your own queries. Once you have followed a link to a query similar to the one you want, either add or change the filters and display settings for the example query, or run your own query from scratch by clicking the 'Restore Defaults' button on the query system results page. The system will clear and display instructions.



Contig Views
  WARNING: The graphical view can be slow if there are many ESTs in the contig.

At the top of the graphical view you are presented with the following links and viewing options:
  • Contig Details: Tabular display of information about the selected contig (e.g. EST count by library).
  • Download: Download the consensus sequence for the current contig.
  • Run Blast: Links to a page that will blast the current contig against an assortment of libraries/contig sets or databases.
  • Sort ESTs by: Change the sort order of the ESTs.
  • Change View: Switch between the Graphics and Bases views and/or change the horizontal scale factor in the graphics view. By default, only the ESTs from your selected libraries will appear in the contig views. If there are others, you will see them as thin lines at the bottom of the image in the graphics view. Clicking on View ESTs from All Libraries will display all ESTs in detail.

Graphics View

 


This is a graphical view of how the non-buried ESTs are assembled within the contig. If you wish to see all ESTs regardless of bury status, use the 'Show Buried' link (it is not offered if there are no buried ESTs). The alignment of each EST with the consensus is represented by a black arrow on its respective row. You will notice different colored symbols along the arrow. Blue rectangles represent low quality regions of the EST (phred quality value < 20). Red rectangles signify mismatches with the consensus sequence. A green rectangle means a gap was inserted into the EST, while a small green arrow above the line means there was a gap in the consensus but the EST actually had a base at that location. A legend is given at the bottom of the image.
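
To illustrate what a low quality region means, the sketch below finds runs of consecutive bases with a phred quality value below 20. The function is hypothetical, and whether PAVE draws one blue rectangle per run in exactly this way is an assumption.

```python
LOW_QUALITY_THRESHOLD = 20  # phred quality value stated above


def low_quality_regions(quals, threshold=LOW_QUALITY_THRESHOLD):
    """Return half-open (start, end) index ranges where quality < threshold."""
    regions = []
    start = None
    for i, q in enumerate(quals):
        if q < threshold:
            if start is None:
                start = i          # a new low quality run begins
        elif start is not None:
            regions.append((start, i))
            start = None
    if start is not None:          # run extends to the end of the read
        regions.append((start, len(quals)))
    return regions


regions = low_quality_regions([30, 10, 12, 25, 5])
print(regions)  # [(1, 3), (4, 5)]
```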

The header section of the image gives the PAVE Contig ID, the best UniProt match of the consensus sequence with the E-Value of the blastx hit, and a scale indicating base position along the alignment. Clicking on the protein name sends you to its niceprot view at ExPASy. The consensus sequence is represented by the black arrow.

The first column in the body of the image lists the EST names. Note that some names appear with a "5-3" suffix. If you have chosen to sort by "Left Position, group 5'/3' pairs", PAVE will display the 5' and 3' reads from the same clone on the same line if they do not overlap within the contig. These reads will be connected by a dashed line, and the clone name will be given a "5-3" extension if the ESTs are correctly oriented within the contig and a "3-5" extension otherwise (the user's extensions are used, as defined in the LIB.cfg file for the library). Clicking on a particular name or row will bring you to an information page for that EST. From there you may blast the EST against other databases, libraries or assembly contigs.

The background color of the rows also holds significance. A light grey background indicates a 3' EST aligned in the expected direction (reverse complemented); a light blue background indicates a 5' EST aligned in the expected direction (not reverse complemented). Pink means the EST's alignment was not in the expected direction and is therefore suspect. A white background is used for ESTs with an unknown direction (e.g. GenBank/454 data).
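
The color rules can be summarized as a small lookup, sketched here in Python. The function and its arguments are hypothetical; the read end is assumed to be given as "5", "3", or None when the direction is unknown.

```python
def row_background(read_end, reverse_complemented):
    """Return the row background color described above.

    read_end: "5", "3", or None when the direction is unknown
    reverse_complemented: True if the EST was reverse complemented in the contig
    """
    if read_end not in ("5", "3"):
        return "white"                      # unknown direction
    expected_rc = (read_end == "3")         # 3' reads are expected reverse complemented
    if reverse_complemented != expected_rc:
        return "pink"                       # unexpected direction, suspect
    return "light grey" if read_end == "3" else "light blue"


print(row_background("3", True))    # light grey
print(row_background("5", False))   # light blue
print(row_background("5", True))    # pink
print(row_background(None, False))  # white
```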
 

Bases View

 
   

This view shows every base in the contig unless the EST is buried (use the Show Buried link to display all ESTs). The leftmost column lists the ESTs. These are links to their respective detailed information pages. The bases are ordered according to their position with respect to the consensus. The color coding is similar to that used in the graphics view, but additional font properties are used here. Lower case blue letters indicate low quality (phred quality values < 20). Non-low quality bases matching the consensus are in black. Red letters are mismatches with the consensus. Green represents locations where some ESTs had bases and others did not. If these bases are not in the consensus, the consensus will contain green arrows at these locations. If, at a certain location, at least two different bases appear more than once, each with an average quality value greater than 20, the bases at this location are shown in bold face as a possible single nucleotide polymorphism: the ones matching the consensus are in black, and the non-matching ones are in red.
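
The possible-SNP rule in the last sentence can be sketched for a single alignment column. The function is hypothetical and only illustrates the stated rule; treating '-' as a gap to be ignored is an assumption.

```python
from collections import defaultdict

MIN_AVERAGE_QUALITY = 20  # average quality must be greater than this


def is_possible_snp(column):
    """column: list of (base, phred quality) pairs at one contig position."""
    quals_by_base = defaultdict(list)
    for base, qual in column:
        if base == "-":
            continue  # assumption: gaps are not counted as bases
        quals_by_base[base.upper()].append(qual)
    # A base is "supported" if it appears more than once with average quality > 20.
    supported = [
        base for base, quals in quals_by_base.items()
        if len(quals) > 1 and sum(quals) / len(quals) > MIN_AVERAGE_QUALITY
    ]
    return len(supported) >= 2


# Made-up columns, for illustration only.
print(is_possible_snp([("A", 30), ("A", 35), ("G", 40), ("G", 25)]))  # True
print(is_possible_snp([("A", 30), ("A", 35), ("G", 40)]))             # False
print(is_possible_snp([("A", 30), ("A", 35), ("G", 10), ("G", 12)]))  # False
```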