Background

Genome annotation involves procedures in both the wet lab and the dry lab. The wet lab performs experiments, identifies genes, polymorphisms, and other genomic elements, and studies gene function. The dry lab maps those identified elements onto genomic sequences. In this article, we restrict the term "annotation" to dry-lab sequence mapping, and the terms annotation and mapping are used interchangeably. To perform an annotation, two basic inputs are required: genomic DNA (gDNA) sequence data and the genomic elements of interest (feature source data), such as genes and markers. Anchored to the ATGC sequence data, annotations of genes and their regulatory elements can reveal the true functional units of the genome. Annotations of other elements, such as genetic markers, oligos, and primers, are essential for genetic studies because they are the basis of many genetic experiments.
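To make the two inputs concrete, the following is a minimal sketch of dry-lab mapping under simplifying assumptions: a short feature sequence (for example, a primer) is located on a gDNA sequence by exact string matching on both strands. The function names `reverse_complement` and `map_feature` are hypothetical, and exact matching is used only for illustration; it is not the method of any tool discussed in this article, and real annotation tools tolerate mismatches and gaps.

```python
from typing import List, Tuple

# Translation table for complementing unambiguous DNA bases.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an uppercase A/C/G/T sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def map_feature(gdna: str, feature: str) -> List[Tuple[int, int, str]]:
    """Map a feature onto a gDNA sequence by exact matching (illustrative only).

    Returns a sorted list of (start, end, strand) tuples with 1-based,
    inclusive coordinates on the forward strand of the gDNA sequence.
    """
    gdna = gdna.upper()
    forward = feature.upper()
    reverse = reverse_complement(forward)
    queries = [("+", forward)]
    if reverse != forward:  # avoid reporting a palindromic feature twice
        queries.append(("-", reverse))

    hits = []
    for strand, query in queries:
        pos = gdna.find(query)
        while pos != -1:
            hits.append((pos + 1, pos + len(query), strand))
            pos = gdna.find(query, pos + 1)  # allow overlapping matches
    return sorted(hits)


if __name__ == "__main__":
    # Hypothetical toy data: the feature occurs once on each strand.
    print(map_feature("TTGACCATGAAATCATGGTC", "GACCATG"))
```

An empty result corresponds to a mapping failure, and more than one hit to an ambiguous mapping; both outcomes are worth logging rather than silently discarding.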

Ever since sequence data became available from the genome sequencing projects, scientists have been performing large-scale, systematic, genome-wide annotation of those data. Today, human genome sequencing is nearly finished. The human genomic sequences were assembled by the National Center for Biotechnology Information (NCBI) and Celera. The sequence data have been annotated by several public and private institutes using their own tools and data sources. The annotations can be accessed through Web-based databases, including NCBI's Map Viewer (http://www.ncbi.nlm.nih.gov/mapview/map_search.cgi?taxid=9606), University of California Santa Cruz's (UCSC's) Genome Browser (http://genome.ucsc.edu/), Sanger Center's Ensembl (http://www.ensembl.org/), and Celera (http://www.celera.com/). For the public annotation, both the gDNA sequences and the genomic feature source data are available. For example, the human chromosomal sequences can be found in NCBI's human genome database, human gene data can be found in either the Genbank or the Unigene database, and SNP data can be accessed at dbSNP.

In the course of many research projects, investigators need to perform region-wide local annotation. As lab technologies have improved, especially with the introduction of high-throughput experimental methods, many investigators produce large amounts of experimental data and knowledge on genes, regulatory elements, and variants. For example, one ABI 3730 sequencing machine can resequence 1.8 Mb of sequence in 1 day, so hundreds of SNPs could be discovered in a single day. Besides the data generated in the investigators' own labs, many new data can be found in the latest journal publications. Often these new data cannot be found in the existing public genomic annotation databases at the time of publication. Even assuming that the public databases collect all the data, there is a significant gap between the time data are deposited in a public database and the time an updated genome annotation using those data is released; this delay is normally more than 3 months. Unfortunately, the assumption that public databases collect all the data does not always hold either. SNP data are not required to be deposited in any centralized database when a paper is published, and SNPs that have been studied or discovered are frequently found to be absent from dbSNP, yet dbSNP is the only SNP database used in all public annotation. An effective research design requires efficient use of existing knowledge. Hence data produced in an individual investigator's lab, and all other data outside the public annotation databases, are as valuable as data from public annotation and need to be integrated with the data in the current version of the public genome annotation. This basic requirement poses an important challenge to current bioinformatics technology because of the following major problems:

1. Heterogeneous annotation target sequences or platforms. Because the human genome sequence assembly and its annotation are updated with a new release every few months, different annotation databases may use different versions of the sequence assembly at any given time. Investigators may want to use different gDNA sequences, either a small regional sequence or a whole chromosome. Sequence assembly is difficult for some genomic regions because of enriched repeat sequences or difficulty of cloning in the public sequencing project, and an individual investigator might have a better-quality sequence assembly of a region than the public assembly. All these different flavors of gDNA sequence create the need for users' local annotation.

2. Enormous amounts of data. As described above, an individual lab can generate or collect a large amount of data for local annotation.

3. Heterogeneous data formats. Different data sources can use very different data formats. Preparing input data for annotation tools, and formatting output data to meet the requirements of data integration and further data mining, are not trivial tasks.

4. Annotation quality control. Any attempt to map a feature can either succeed or fail. Annotation quality is highly dependent on the sensitivity and specificity of the mapping methods and on the parameter settings. Even a simple bug in the annotation algorithm can cause major defects in the annotation results. The problems observed in public annotation include features that should have been mapped but are missing, wrong mapping positions, and so on. Because the current public annotation does not provide much log information, it is difficult to detect errors and to find their causes.

There are a number of tools supporting region-wide local annotation. They can be roughly put into three categories:

1. Use public source data (stored in databases) and some gene prediction algorithm to annotate the user's gDNA sequence. Genotator,[1] NIX (Williams et al., http://www.hgmp.mrc.ac.uk/Registered/Webapp/nix/), GeneMachine,[2] GAIA,[3] Alfresco,[4] GESTALT,[5] RUMMAGE,[6] and the Oak Ridge National Laboratory Genome Analysis Pipeline (http://compbio.ornl.gov/tools/pipeline) provide integrated annotation, including mapping of known or predicted genes and/or regulatory elements, by running multiple gene-prediction programs and searching against static public databases. But none of them incorporates methods for SNP mapping, which is essential for positional cloning projects for complex diseases. None of these systems takes source data supplied by the end user, unless the user can modify the databases in the annotation system.

2. Annotate the user's own source data to the user's own gDNA sequence. Freeware such as Artemis[7] and Sequin,[8] and some commercial software such as Vector NTI, provide a good interface for manual annotation, but they do not support batch annotation. Some other programs can be used to assist annotation. For example, BLAST[9] is good at homologous sequence searching; Sim4,[10] est_genome,[11] and Spidey[12] can be used to define the intron-exon structure of a gene; and e-PCR[13] can be used to map STSs. However, most of these programs produce data in their own formats, which cannot be directly converted into standard-format annotation. Therefore, strictly speaking, they are not real annotation tools.

3. DNannotator can use the user's own collection of source data (either from public sources or generated in the lab) to annotate both the user's own gDNA sequence and public chromosomal sequences. DNannotator complements the existing tools mentioned above. It is the first toolkit providing SNP mapping and the only one with the capability to migrate annotations from one sequence platform to another. DNannotator was first described in Nucleic Acids Research (2003).[14] It can be accessed at http://sky.bsd.uchicago.edu/DNannotator.htm.
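The idea of migrating an annotation from one sequence platform to another can be illustrated with a small sketch. It assumes that both sequences cover the same region in the same orientation and that the feature is a single position (such as a SNP) whose flanking sequence is identical in both. The function `migrate_position` and its `flank` parameter are hypothetical names chosen for illustration; this is not DNannotator's actual algorithm, which the text above does not specify.

```python
from typing import Optional


def migrate_position(old_seq: str, new_seq: str, pos: int,
                     flank: int = 10) -> Optional[int]:
    """Carry a 1-based position from old_seq over to new_seq (illustrative only).

    The position is re-located by exact matching of its flanking window.
    Returns the 1-based position on new_seq, or None when the window is
    absent from new_seq or occurs more than once (ambiguous).
    """
    i = pos - 1  # convert to 0-based index
    if not 0 <= i < len(old_seq):
        raise ValueError("pos lies outside old_seq")

    left = max(0, i - flank)
    window = old_seq[left:i + flank + 1].upper()
    target = new_seq.upper()

    first = target.find(window)
    if first == -1:
        return None  # unmapped: flanking sequence not found
    if target.find(window, first + 1) != -1:
        return None  # ambiguous: flanking sequence is not unique
    return first + (i - left) + 1


if __name__ == "__main__":
    # Hypothetical toy data: the new sequence extends the old one on both sides.
    old = "ACGTTGCAAGGCTA"
    new = "GGGGG" + old + "TT"
    print(migrate_position(old, new, 7, flank=3))
```

Returning None for both missing and non-unique windows keeps failed migrations explicit, so they can be reported to the user instead of producing a wrong position.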

Design and Function of DNannotator

DNannotator takes annotation source data prepared by the user, such as SNPs, genes, and primers, and/or a specified genomic DNA target, and performs de novo annotation. DNannotator can also robustly migrate existing annotations in Genbank format from one sequence to another, provided that the new sequence covers the same genomic region. The annotation migration function is useful when dealing with different versions of a sequence assembly or different scopes of a region, e.g., when one sequence is a small regional sequence and the other is a whole-chromosome sequence. The major functions of DNannotator are illustrated in Fig. 1. The functions of DNannotator are divided into two groups: one for annotation over
