[Figure: Task Flow Chart. Steps shown: EST trace files; Translate Trace to Sequence; Sequence Masking; Quality check; BLAST Searching; Results Database Searching; EST Clustering.]

Fig. 2 Flow chart of the ESTAnnotator: programs and rules used for high-throughput annotation of ESTs. (Modified from Ref. [4].)


nonredundant nucleic acid database (ftp://ftp.ncbi.nih.gov/blast/db) should reveal any similar nucleotide sequence deposited in the public databases. For the chromosomal location of the EST within the human genome, another BLAST search was performed against the NCBI human genomic sequence contig assembly database (ftp://ftp.ncbi.nih.gov/genomes/H_sapiens), which contains the nucleotide assemblies of the human chromosomes. For the identification of known, complete cDNA sequences matching our EST reads, additional searches were performed against the NCBI human mRNA database (ftp://ftp.ncbi.nih.gov/genomes/H_sapiens/RNA), which contains human "model mRNAs" constructed by prediction from genomic sequence data, and the RefSeq database (ftp://ftp.ncbi.nih.gov/refseq/, mRNA part), which contains curated mRNA data from human and other model organisms. If a highly similar mRNA corresponding to our EST read was found in either of these mRNA databases, no clustering or further database searching was needed. Among the cartilage EST sequences, 69.6% showed significant similarity to known genes/mRNAs in the human RefSeq collection, and another 4.8% were found to be homologous to human model RNAs. Approximately 23% of the cartilage EST sequences could not be identified as known transcripts, but showed significant similarity to genomic regions and/or other anonymous ESTs.[5] A subset of these potentially novel gene sequences is currently under detailed experimental scrutiny for expression in cartilage tissue using RT-PCR, Northern blotting, and mRNA in situ hybridization.
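The routing rule described above (stop once a highly similar mRNA is found, otherwise continue with clustering) can be sketched roughly as follows. This is an illustrative sketch only, not the ESTAnnotator code: the `Hit` record, the database labels, and the E-value cutoff used to stand in for "highly similar" are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    database: str    # hypothetical label for the searched database
    subject_id: str  # identifier of the matching database entry
    evalue: float    # BLAST expectation value of the hit


# Hypothetical labels for the two mRNA collections named in the text.
MRNA_DATABASES = {"refseq_mrna", "human_model_mrna"}


def needs_clustering(hits, max_evalue=1e-10):
    """Return True if no mRNA database produced a sufficiently strong hit.

    max_evalue is an assumed stand-in for "highly similar"; the source
    does not state the criterion that was actually applied.
    """
    found_mrna = any(
        h.database in MRNA_DATABASES and h.evalue <= max_evalue for h in hits
    )
    return not found_mrna
```

An EST with a strong hit in either mRNA collection would be reported directly; all others would be passed on to the clustering step.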

Using the NCBI human assembly database, a corresponding genomic location could be identified for more than 90% of the EST sequences. This information will be valuable in selecting possible candidate genes from regions of the human genome to which diseases related to malformations of the skeleton have been mapped genetically.

Clustering of Overlapping ESTs and BLAST Analysis at the Protein Level

Expressed sequence tag sequences that could not be reliably assigned to a known mRNA or gene were processed further to obtain an annotation. Because each gene may have many alternative transcripts that contain exons in different combinations, assigning each EST to its progenitor gene is not a trivial task. A BLAST search against other human EST sequences from dbEST was therefore used to find homologous, overlapping EST reads in order to extend the original EST sequence by clustering.

Fig. 3 Parts of the ESTAnnotator report: database search results list and graphical display of the EST contig assembly and protein BLAST hits.

Assembly of such EST clusters, each containing the input EST sequence, was performed by a contig assembly program (CAP),[7] and the resulting consensus sequence of each cluster was saved. The EST cluster consensus sequences (or the input EST sequence alone, if no cluster was formed) were used for further similarity searching at the protein sequence level in order to detect even remote similarities to known proteins or other anonymous EST sequences in the databases. These searches were performed by BLASTX against the Swiss-Prot protein database (ftp://ftp.ebi.ac.uk/pub/databases/swissprot/) to check for matching, already annotated proteins, and by TBLASTX against all ESTs of all organisms in dbEST to detect similarities to anonymous coding sequences of other organisms.
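The choice of query for the protein-level searches can be illustrated with a short sketch. It only plans the searches and does not run BLAST; the function name and the database labels are hypothetical and are not taken from ESTAnnotator.

```python
def plan_protein_level_searches(input_est, cluster_consensus=None):
    """Return (program, database, query) tuples for the protein-level step.

    The cluster consensus is used when a cluster was formed; otherwise the
    input EST sequence alone is searched. Database labels are hypothetical.
    """
    query = cluster_consensus if cluster_consensus else input_est
    return [
        ("blastx", "swissprot", query),  # match already annotated proteins
        ("tblastx", "dbest_all", query),  # match anonymous coding sequences
    ]
```

For an EST that did not cluster, calling `plan_protein_level_searches("ACGT")` would plan both searches with the input sequence itself as the query.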

The ESTAnnotator Report

The final ESTAnnotator report (Fig. 3) is a web page that displays the database ID and the description line of the top three hits of the BLAST search results if their expectation value is below 0.01. Additionally, a link to the original BLAST output is provided. To illustrate the position of the BLAST hits and the clustered EST sequences, corresponding graphical outputs are displayed in the lower part. The alignment information can be accessed by clicking on the hits within the graphical output. By downloading the XML (Extensible Markup Language) report file from the server, the results from the database searches for each EST sequence can easily be parsed into a database file.
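The selection rule for the report (top three hits, shown only if their expectation value is below 0.01) could be expressed as in the sketch below. The tuple layout `(database_id, description, evalue)` and the function name are assumptions for illustration; ranking hits by ascending E-value is also an assumption about what "top" means here.

```python
def select_report_hits(hits, max_evalue=0.01, top_n=3):
    """Pick the hits to display: best top_n hits with E-value below max_evalue.

    hits is a list of (database_id, description, evalue) tuples; this layout
    is hypothetical and chosen only for the example.
    """
    significant = [h for h in hits if h[2] < max_evalue]
    # Lower E-values indicate stronger hits, so sort ascending.
    significant.sort(key=lambda h: h[2])
    return significant[:top_n]
```

Hits at or above the cutoff are dropped before ranking, so an EST with a single significant hit yields a one-line result list rather than three entries.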

WWW Access by the Web Interface to HUSAR (W2H)

Using the W3H task system[8] allowed the immediate integration of ESTAnnotator into the W2H web interface.[9] The ESTAnnotator is available at http://genius.embnet.dkfz-heidelberg.de/menu/biounit/open-husar/.

