PAVE OVERVIEW
| Contents |
|
Overview of the PAVE assembly | |
References | |
|
Basic Searches | |
BLAST Search | |
|
Advanced Searches | |
Summary Tables | |
Contig Views | |
Graphics View | |
Bases View | |
| Overview of the PAVE Assembly [ Top | Back ] | ||
| PAVE [1] (Program for Assembling and Viewing ESTs) is a software package for assembling ESTs. It uses Megablast [2] to
compare ESTs and CCSs (contig consensus sequences); the matches are then filtered
by a set of consistency rules so that only
consistent ESTs are assembled. CAP3 [3] is used to assemble sets of ESTs, and
the results are retained only if they pass another set of consistency rules.
The consistency rules ensure that mate-pairs are in the same contig. When a contig
has one or more mate-pairs but CAP3 assembles it into two contigs, the two contigs are
joined into one with 50 n's between them.
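The joining step can be pictured with a minimal sketch. This is not PAVE's own code: the function name `join_contigs` and the example sequences are hypothetical, and only the gap of 50 n's comes from the description above.

```python
# Illustrative sketch only; join_contigs is a hypothetical helper, not a PAVE function.
GAP_LENGTH = 50  # number of 'n' placeholders, as stated in the overview


def join_contigs(left_consensus: str, right_consensus: str,
                 gap_length: int = GAP_LENGTH) -> str:
    """Join two consensus sequences with a run of 'n' between them."""
    return left_consensus + "n" * gap_length + right_consensus


if __name__ == "__main__":
    joined = join_contigs("ACGTAC", "GGTTCA")  # made-up sequences
    print(len(joined))
```

The run of n's marks a region of unknown sequence, so the two halves stay in one contig without implying any bases between them.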
Buried ESTs: In order to speed up processing, avoid running out of memory on large contigs, and increase the speed for displaying large contigs, PAVE has the concept of buried clones. When a large contig is displayed, only the non-buried clones will be shown. Select Show buried to view all the clones (high EST count contigs can be slow to display). If the contig description line says
CONTIG name_number - N ESTs (M buried, No Re-CAP)
this indicates that the buried clones are aligned using only the right
coordinate of their parent clone; since these alignments are not correct,
the ESTs are shown entirely in red. If the words "No Re-CAP" are not present,
then the alignments are correct.
Annotation: Assemblies with multiple libraries usually annotate contigs with the R statistic [4].
A contig may also have a "Note", which by default
identifies potential problem contigs (this may be changed for a given project).
The following annotations may also exist (two annotation pipelines are
available with PAVE, but if they have not been executed, the corresponding
annotations will not exist): (1) GC content and longest ORFs, (2) Best UniProt
match [5], GO [6],
and GOSlim [7]. Typically the annotation is taxonomy-specific (Plants, Vertebrates, Fungi). A separate system allows viewing of the contigs not annotated by the taxonomy-specific database; these contigs are annotated from the full UniProt database in order to identify possible contamination.
| ||
| Basic Searches [ Top | Back ] | ||
|
Search for UniProt ID: If the assembly is annotated
with UniProt, then this search will be available. Enter a valid UniProt ID, and
if that protein has aligned to any consensus sequence in the database, it will be
displayed. The example UniProt ID is a member of the most frequently occurring organism. The method of populating Best UniProt Match is covered above.
Search for UniProt Description: If the assembly is annotated
with UniProt, then this search will be available. Enter a word expected in the UniProt description. For all searches: if the system cannot find the string that you entered, it will search for it as a substring and list all results that contain it. If you wish to control where the wildcard goes, use '*' in the search string. All searches are case-insensitive. | ||
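The search behaviour described above can be approximated as follows. This is a sketch under assumptions, not the query system's actual code: the function `search_descriptions` is hypothetical, and it assumes that "finding the string" means a whole-description match and that '*' matches any run of characters.

```python
import re


def search_descriptions(query, descriptions):
    """Hypothetical sketch of the search fallback: exact, then substring."""
    q = query.strip().lower()  # searches are case-insensitive
    if "*" in q:
        # The user placed the wildcard explicitly; '*' matches any characters.
        pattern = ".*".join(re.escape(part) for part in q.split("*"))
        return [d for d in descriptions if re.fullmatch(pattern, d.lower())]
    exact = [d for d in descriptions if d.lower() == q]
    if exact:
        return exact
    # Nothing matched exactly, so fall back to a substring search.
    return [d for d in descriptions if q in d.lower()]


if __name__ == "__main__":
    examples = ["Heat shock protein 70", "Kinase", "Protein kinase C"]  # made-up data
    print(search_descriptions("shock", examples))
```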
| BLAST Search [ Top | Back ] | ||
| The BLAST Search button under Basic Searches takes you to a query system where you can blast a user-provided sequence (nucleotide or protein) against the PAVE-assembled contigs or the original library ESTs. Various pages in the query system offer links to the blast search with the sequence pre-populated. WARNING: Blasting long nucleotide sequences against a large protein database can cause the browser to time out before returning results. | ||
| |
Advanced Searches | |
| Contig Search [ Top | Back ] | ||
| The Contig Search button under Advanced Searches takes you to a query system where you can ask for all contigs meeting user-defined criteria or sort contigs on Test Statistic R by Stekel et al. The entrance page of the query system has Instructions that provide guidance on how to use this page. Which filters are offered depends on what annotation has been added. A search will result in a table. Click on a contig name in the results table to access the graphical view for that contig.
A good way to understand the queries is by example. Under Library Descriptions on the main page is
a table that links to examples of these queries. To run your own query after viewing an example, click the 'Restore Defaults' button on the query system results page; the system will clear and display instructions. |
||
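For readers unfamiliar with the Test Statistic R, the sketch below computes it using the formula published by Stekel et al.: R = sum over libraries of x_i * log(x_i / (N_i * f)), where x_i is the EST count for the contig in library i, N_i is the total number of ESTs in library i, and f is the pooled frequency (sum of x_i divided by sum of N_i). The function name `stekel_r` is hypothetical, the natural logarithm is assumed, and PAVE's own implementation may differ in detail.

```python
import math


def stekel_r(counts, library_sizes):
    """Sketch of the R statistic for one contig across several libraries.

    counts[i]        -- ESTs from library i in the contig (x_i)
    library_sizes[i] -- total ESTs in library i (N_i)
    """
    if len(counts) != len(library_sizes):
        raise ValueError("counts and library_sizes must have the same length")
    pooled_frequency = sum(counts) / sum(library_sizes)  # f
    r = 0.0
    for x, n in zip(counts, library_sizes):
        if x > 0:  # a zero count contributes nothing to the sum
            r += x * math.log(x / (n * pooled_frequency))
    return r


if __name__ == "__main__":
    # Made-up counts: evenly expressed contig versus a library-specific one.
    print(stekel_r([10, 10], [100, 100]))
    print(stekel_r([20, 0], [100, 100]))
```

A contig expressed in proportion to library size scores near zero, while one concentrated in a single library scores higher, which is why sorting on R brings differentially represented contigs to the top.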
| Protein Search [ Top | Back ] | ||
| When the assembly is annotated, the Protein Search button is offered from the main page under Advanced Searches. This system gives the set of proteins most likely to be expressed by the libraries. To be included in the set, a UniProt ID must be the top match for a contig, the contig(s) must match the UniProt ID with an E-Value of 1e-40 or better, and at least 60% of all contig ESTs must match the UniProt ID. Only the best-formed contigs are included in this system. |
||
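The three inclusion criteria for the Protein Search can be expressed as a simple filter. This is a sketch, not PAVE code: the `ContigHit` record and its field names are hypothetical, and the additional "best-formed contigs" requirement is not modelled because the text does not define it.

```python
from dataclasses import dataclass

MAX_EVALUE = 1e-40        # "1e-40 or better", from the text above
MIN_EST_PERCENT = 60      # "at least 60% of all contig ESTs"


@dataclass
class ContigHit:
    """Hypothetical record of one contig-to-UniProt match."""
    contig: str
    uniprot_id: str
    is_top_match: bool    # is this UniProt ID the contig's top match?
    evalue: float
    ests_total: int       # ESTs in the contig
    ests_matching: int    # contig ESTs that also match the UniProt ID


def passes_protein_search(hit: ContigHit) -> bool:
    """Apply the three stated criteria to one hit."""
    if not hit.is_top_match or hit.ests_total == 0:
        return False
    enough_ests = hit.ests_matching * 100 >= MIN_EST_PERCENT * hit.ests_total
    return hit.evalue <= MAX_EVALUE and enough_ests


if __name__ == "__main__":
    hit = ContigHit("CONTIG_1", "EXAMPLE_ID", True, 1e-55, 10, 7)  # made-up values
    print(passes_protein_search(hit))
```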
| No Anno Search [ Top | Back ] | ||
| When assemblies are annotated with a taxonomy-specific UniProt set, the No Anno Search is made available for the contigs which did not receive a top match (E-Value <= 1e-20) from the taxonomic UniProt blast. These unannotated contigs are blasted against the full UniProt database and the results are offered here. This query can be useful for identifying contamination contigs. |
||
| Annotation Search [ Top | Back ] | ||
| When the assembly is annotated, the Annotation Search button is offered from the main page under Advanced Searches, and the same queries can be reached from the example tables linked from Additional Summaries/Example Queries on the main page. The UniProt Search button takes you to a query page where you can query the assembly based on UniProt filters (e.g. UniProt IDs unique to a particular library, UniProt IDs matching contigs with ESTs from all libraries, UniProt IDs with a particular GO/GOSlim type, Test Statistic R by Stekel et al.). Click the UniProt ID in the resulting table to show all the contigs in the assembly that match the UniProt ID, each with its e-value/bitscore. The method of populating Best UniProt Match is covered above.
The best way to understand the queries is by example. The main page contains examples (e.g. the Distinct/Unique UniProt queries from the Annotation by EST Library table accessed by Additional Summaries/Example Queries on the main page). To run your own query after viewing an example, click the 'Restore Defaults' button on the query system results page; the system will clear and display instructions.
If the system reports that an EST's contig does not match a UniProt ID, that does not mean the contig fails to match it; the contig may match very well. It only means that the contig had a BETTER match FOR THAT ORGANISM, so the contig's ESTs were assigned to the better match via Match Contig ESTs. This is an important point to remember when using the queries as a screening tool. The Total Match ESTs ignore the best organism match; the Match Contig ESTs do not. The library counts displayed, and used for the Test Statistic R by Stekel et al., use the Match Contig ESTs accounting method. This ensures that UniProt IDs identified with high representation in a library are truly highly represented; however, it is possible for another library to be falsely low due to poor quality sequence, mis-assembled contigs or mis-identification of the best organism match for a contig due to sequencing/assembly errors. Use this query system as a screening tool only. |
||
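The difference between the two accounting methods can be shown with a small sketch. It is an illustration under assumptions rather than the system's code: the function `count_ests`, the tuple layout, and the example data are hypothetical, and per-library breakdown is omitted for brevity.

```python
from collections import defaultdict


def count_ests(hits, contig_est_counts):
    """Tally ESTs per UniProt ID under both accounting methods.

    hits              -- iterable of (contig, uniprot_id, is_best_match)
    contig_est_counts -- dict mapping contig name to its number of ESTs
    Returns (total_match_ests, match_contig_ests) as two dicts.
    """
    total_match = defaultdict(int)    # ignores which match is best
    match_contig = defaultdict(int)   # counts only the contig's best match
    for contig, uniprot_id, is_best_match in hits:
        n_ests = contig_est_counts[contig]
        total_match[uniprot_id] += n_ests
        if is_best_match:
            match_contig[uniprot_id] += n_ests
    return dict(total_match), dict(match_contig)


if __name__ == "__main__":
    # Made-up data: contig c1 matches both IDs, but ID_A is its best match.
    hits = [("c1", "ID_A", True), ("c1", "ID_B", False), ("c2", "ID_B", True)]
    total, best_only = count_ests(hits, {"c1": 5, "c2": 3})
    print(total, best_only)
```

In this example ID_B is credited with the ESTs of both contigs under Total Match ESTs, but only with those of c2 under Match Contig ESTs, which is the behaviour the paragraph above warns about.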
| GO Search [ Top | Back ] | ||
| When the assembly is annotated and GO databases were included during the annotation step, the GO Search button is offered from the main page, as are the Top 10 GO/SlimGO occurrences for each ontology from Additional Summaries/Example Queries. The GO Search button takes you to a query page where you can query the assembly based on GO (e.g. GOs unique to a particular library, GOs with particular wording, Test Statistic R by Stekel et al.). For an EST to be assigned to a GO, it had to match a related UniProt ID with an E-Value of 1e-10 and its contig had to match the same UniProt ID with an E-Value of 1e-20. | ||
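The two-level rule for assigning an EST to a GO can be sketched as below. The function name is hypothetical, and the sketch assumes the stated E-values are upper bounds (that is, "1e-10 or better" and "1e-20 or better"), which the text does not say explicitly.

```python
EST_MAX_EVALUE = 1e-10     # EST versus the related UniProt ID
CONTIG_MAX_EVALUE = 1e-20  # the EST's contig versus the same UniProt ID


def est_assigned_to_go(est_evalue: float, contig_evalue: float) -> bool:
    """Hypothetical check: both the EST and its contig must match well enough."""
    return est_evalue <= EST_MAX_EVALUE and contig_evalue <= CONTIG_MAX_EVALUE


if __name__ == "__main__":
    print(est_assigned_to_go(1e-15, 1e-30))  # made-up E-values
```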
| EC Search [ Top | Back ] | ||
When the assembly is annotated and GO databases were included during the annotation step, the EC Search button may be offered from the main page. EC numbers are assigned only to the UniProt IDs appearing in the Protein Search (the best quality contigs). Since the Protein Search is made from only the top UniProt matches and these are often TrEMBL proteins with no GO or EC assignments, proteins with no EC assignments can be given EC numbers by blasting against a UniProt set with EC numbers and taking the first E-Value = zero match. Steps for EC assignment:
|
||
| Summary Tables/Example Query Tables [ Top | Back ] | ||
|
Depending on the extent of annotation offered, summary tables will appear on the page linked by Additional Summaries/Example Queries. These tables are meant to offer links to common queries as examples or starting points for your own queries. Once you've followed a link to a query similar to the one you want, either add or change the filters and display settings for the example query, or run your own query from scratch by clicking the 'Restore Defaults' button on the query system results page. The system will clear and display instructions. | ||
| Contig Views [ Top | Back ] | ||||||||||||
| WARNING: The graphical view can be slow if there are many ESTs in the contig. At the top of the graphical view you are presented with the following links and viewing options:
|
| |
Graphics View [ Top | Back ] |
This is a graphical view of how the non-buried ESTs are assembled within the contig. If you wish to see all ESTs regardless of bury status, use the 'Show Buried' link (not offered if there are no buried ESTs). The alignment of each EST with the consensus is represented by a black arrow on its respective row. You will notice different colored symbols along the arrow. Blue rectangles represent low quality regions of the EST (phred quality value < 20). Red rectangles signify mismatches with the consensus sequence. A green rectangle means a gap was inserted into the EST, while a small green arrow above the line means there was a gap in the consensus but the EST actually had a base at that location. A legend is given at the bottom of the image.
The header section of the image gives the PAVE Contig ID, the best UniProt match of the consensus sequence, the e-value of the blastx hit, and a scale to indicate base position along the alignment. Clicking on the protein name sends you to its niceprot view at ExPASy. The consensus sequence is represented by the black arrow.
The first column in the body of the image lists the EST names. Note that some names appear with a "5-3" suffix. If you have chosen to sort by "Left Position, group 5'/3' pairs", PAVE will display the 5' and 3' reads from the same clone on the same line if they do not overlap within the contig. These reads will be connected by a dashed line, and the clone name will be given a "5-3" extension if the ESTs are correctly oriented within the contig and a "3-5" extension otherwise (the user's extensions are used, as defined in the LIB.cfg file for the library). Clicking on a particular name or row will bring you to an information page for that EST. From there you may blast the EST against other databases, libraries or assembly contigs.
The background color of the rows also holds significance. A light grey background indicates a 3' EST aligned in the expected direction (reverse complemented); a light blue background means a 5' EST in the expected direction (not reverse complemented). Pink means the EST's alignment was not in the expected direction and is therefore suspect. A white background is used for ESTs with an unknown direction (e.g. GenBank/454 data). |
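The row background rules can be summarised as a small decision function. This is only a sketch of the rules as described: the function `row_background` and its argument names are hypothetical and do not come from the PAVE source.

```python
from typing import Optional


def row_background(read_end: Optional[str], reverse_complemented: bool) -> str:
    """Return the row color for an EST.

    read_end             -- "5", "3", or None when the direction is unknown
    reverse_complemented -- whether the EST was reverse complemented to align
    """
    if read_end is None:
        return "white"  # unknown direction, e.g. GenBank/454 data
    # A 3' read is expected to be reverse complemented; a 5' read is not.
    expected = (read_end == "3")
    if reverse_complemented != expected:
        return "pink"  # unexpected direction, alignment is suspect
    return "light grey" if read_end == "3" else "light blue"


if __name__ == "__main__":
    print(row_background("3", True))
```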
|
| |
Bases View [ Top | Back ] |
  
This view shows every base in the contig unless the EST is buried (use the Show Buried link to display all ESTs). The leftmost column lists the ESTs. These are links to their respective detailed information pages. The bases are ordered according to their position with respect to the consensus. The color coding is similar to that used in the graphics view; however, additional font properties are used here. Lower case blue letters indicate low quality (phred quality values < 20). Non-low quality bases matching the consensus are in black. Red letters are mismatches with the consensus. Green represents locations where some ESTs had bases and others did not. If these bases are not in the consensus, the consensus will contain green arrows at these locations. If, at a certain location, at least two different bases appear more than once, each with an average quality value greater than 20, the bases at this location are shown in bold face as a possible single nucleotide polymorphism: those matching the consensus are black and those not matching are red. |
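The rule for flagging a possible single nucleotide polymorphism can be sketched as follows. This is an interpretation of the description above, not PAVE's code: the function `is_possible_snp` is hypothetical, and gap characters are assumed to have been left out of the input.

```python
from collections import defaultdict

MIN_OCCURRENCES = 2    # each base must "appear more than once"
MIN_AVG_QUALITY = 20   # average phred quality must be greater than this


def is_possible_snp(column):
    """Check one alignment position.

    column -- list of (base, phred_quality) pairs, one per EST at this position
    """
    qualities = defaultdict(list)
    for base, quality in column:
        qualities[base.upper()].append(quality)
    supported = [
        base for base, quals in qualities.items()
        if len(quals) >= MIN_OCCURRENCES
        and sum(quals) / len(quals) > MIN_AVG_QUALITY
    ]
    # At least two different well-supported bases are needed.
    return len(supported) >= 2


if __name__ == "__main__":
    column = [("A", 30), ("A", 32), ("G", 25), ("G", 40)]  # made-up reads
    print(is_possible_snp(column))
```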