Introduction

The protein-coding genetic information of higher eukaryotic genomes represents only a minor percentage of their total DNA, e.g., 1.5% in the human genome, meaning that genes are usually separated by large stretches of non-coding intergenic DNA. Additionally, eukaryotic genes are organized mosaic-style: rather short exons (with an average size of 150 bp in humans), which form the mature mRNA and contain the essential coding information, are separated by introns, which are spliced out from the primary RNA transcripts and do not contribute to protein coding. Because of this complex architecture of higher organism genomes, it is notoriously difficult to recognize exons in large stretches of genomic DNA sequence, e.g., those obtained in the framework of genome projects, and to correctly identify the exons that together make up a complete gene. As a "shortcut" alternative for sequencing and identifying the protein-coding gene repertoire of an organism, Adams et al. introduced a strategy called "EST sequencing" (EST = expressed sequence tag). This approach bypasses the complexities of genome structure by focusing only on the transcribed portions of a genome. The procedure starts with the isolation of an mRNA population from a certain tissue. After the mRNA molecules have been converted into their complementary DNA (cDNA), all resulting cDNA molecules are cloned into suitable vector/host systems. Then, usually thousands of clones are chosen at random and their cDNA inserts are sequenced, yielding a catalogue of EST sequences that essentially represents the transcribed portion of the genome (i.e., the genes). The yield of sequence information from the protein-coding parts of the cDNA can be optimized by producing 5' EST reads instead of 3' EST reads, since the latter usually cover the 3' untranslated gene regions.

The EST approach is extremely cost-effective and fast, and it gives a good overview of those genes that are active in the tissue used as source for the initial RNA preparation. A dedicated freely searchable database (dbEST), a division of GenBank (http://www.ncbi.nlm.nih.gov/Genbank/GenbankOverview.html), has been set up to collect EST data from a huge number of diverse organisms (see http://www.ncbi.nlm.nih.gov/dbEST/dbEST_summary.html), and this database currently contains more than 5.6 million human EST reads alone. Despite this wealth of data, EST sequencing is still an essential tool for the discovery of novel genes that have not been identified by genomic sequencing and gene prediction alone.

The production of a gene catalogue from EST data requires several steps of bioinformatic sequence analysis: 1) Because of the single-pass, "quick and dirty" sequencing strategy, low-quality sequence data and EST sequences contaminated with vector or repetitive non-coding DNA have to be removed. 2) Overlapping EST reads have to be clustered to obtain a contig sequence of their underlying cDNA. 3) Both EST singletons and EST clusters have to be annotated by searching for similarity to known genes or proteins already present in nucleotide and protein sequence databases. Here we describe an EST annotation tool that automates these steps and that was successfully used to categorize 5000 EST sequence reads during a search for genes involved in differentiation and disease processes in human fetal cartilage tissue.
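To make the first two steps more concrete, the following is a minimal Python sketch and not the annotation tool described here. It assumes that reads are plain strings keyed by a read name, that vector contamination can be detected by exact substring matching, and that two reads belong to the same cluster when the end of one exactly matches the start of the other. The function names, thresholds, and example sequences are hypothetical. Real pipelines use quality scores and error-tolerant alignment, and the similarity search of step 3 is not shown.

```python
from typing import Dict, List


def is_clean(read: str, vector_seqs: List[str],
             max_n_fraction: float = 0.05, min_length: int = 50) -> bool:
    """Step 1 (simplified): reject short reads, reads with too many
    ambiguous bases (N), and reads containing a known vector sequence.
    All thresholds are illustrative assumptions."""
    read = read.upper()
    if len(read) < min_length:
        return False
    if read.count("N") / len(read) > max_n_fraction:
        return False
    return not any(v.upper() in read for v in vector_seqs)


def overlap(a: str, b: str, min_overlap: int) -> int:
    """Length of the longest suffix of `a` that equals a prefix of `b`,
    or 0 if it is shorter than `min_overlap`. Exact matching only."""
    for k in range(min(len(a), len(b)), min_overlap - 1, -1):
        if a[-k:] == b[:k]:
            return k
    return 0


def cluster_reads(reads: Dict[str, str], min_overlap: int = 40) -> List[List[str]]:
    """Step 2 (simplified): single-linkage clustering of reads that
    overlap end-to-start, using a small union-find structure.
    Reads fully contained inside another read are not detected."""
    names = list(reads)
    parent = {n: n for n in names}

    def find(n: str) -> str:
        # Follow parent links to the cluster root, compressing the path.
        while parent[n] != n:
            parent[n] = parent[parent[n]]
            n = parent[n]
        return n

    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if (overlap(reads[a], reads[b], min_overlap)
                    or overlap(reads[b], reads[a], min_overlap)):
                parent[find(a)] = find(b)

    clusters: Dict[str, List[str]] = {}
    for n in names:
        clusters.setdefault(find(n), []).append(n)
    return list(clusters.values())
```

Clusters with a single member correspond to EST singletons, while clusters with several members would be assembled into a contig before annotation. The all-against-all comparison is quadratic in the number of reads, which is acceptable for a sketch but not for millions of ESTs.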
