Databases and Analysis Programs

Much of the early work in bioinformatics focused on processing and analyzing gene and protein sequences catalogued in databases such as GenBank, EMBL, and SWISS-PROT. These databases were developed in academia or by government-sponsored groups and served as repositories where scientists could store their sequence data and share it with other researchers. With the start of the Human Genome Project in 1990, efforts in bioinformatics intensified to meet the challenge of handling large amounts of DNA sequence data generated at an unprecedented rate. By the mid- to late 1990s, much of the effort in bioinformatics centered on genomic data, generated by the Human Genome Project and by private companies, and on proteomic data.

Early analysis of sequence information focused on looking for similarities between genes and between proteins. Algorithms were developed to help researchers rapidly identify similar gene or protein sequences. Such tools were extremely useful for determining whether a newly sequenced piece of DNA was at all similar to sequences already entered in a database. To determine how multiple sequences align and to view their similarities, multiple-alignment programs were developed. Such programs helped scientists compare the sequences of closely related genes or compare the sequence of a particular gene or protein as it appears in several species.
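
As a rough illustration of what "looking for similarities" between two sequences can mean in practice, the sketch below scores a pairwise global alignment with the classic Needleman-Wunsch dynamic-programming method. The text above does not name a specific algorithm, so this choice is an assumption made for illustration. The function name and the scoring values (match, mismatch, gap) are also hypothetical. Real database search tools rely on faster heuristics and on biologically derived scoring schemes.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Return the best global alignment score of sequences a and b.

    Illustrative Needleman-Wunsch scoring with arbitrary (assumed)
    match, mismatch, and gap values; only the score is computed,
    not the alignment itself.
    """
    # Row 0: aligning an empty prefix of a against prefixes of b costs gaps only.
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        # Column 0: aligning a prefix of a against an empty prefix of b.
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            # Pair a[i-1] with b[j-1] (match or mismatch).
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Or insert a gap in one sequence or the other.
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    # Compare a "new" sequence against a tiny, made-up set of stored sequences.
    query = "GATTACA"
    stored = {"seq1": "GATTACA", "seq2": "GATCACA", "seq3": "TTTTTTT"}
    for name, seq in stored.items():
        print(name, global_alignment_score(query, seq))
```

A higher score suggests greater similarity under the chosen scoring values. Ranking stored sequences by this score is one simple way to ask whether a new sequence resembles anything already in a collection.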

To better understand the functional roles of new nucleotide and amino acid sequences, researchers developed algorithms to look for particular sequence "domains." Domains are regions where a particular sequence of

