Design of the data repository

The project data repository was built on a Structured Query Language (SQL) relational database server. The data structure was divided into two main parts: (1) an intermediate structure holding an exact copy of the data from the documents submitted by the project participants, which makes the data available for further processing and for checking the correctness of the submitted documents; and (2) the main data structure, designed to hold the integrated project data.
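The split between the intermediate and the main structure can be illustrated with a minimal sketch. It uses Python's built-in sqlite3 module rather than the project's actual database server, and the table names StagingIdentification and Identification, all column names, the validation rule, and the sample rows are assumptions made for illustration only; they are not taken from the project schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Intermediate structure (hypothetical): rows are kept exactly as submitted.
CREATE TABLE StagingIdentification (
    document_name TEXT,
    row_no        INTEGER,
    lab_code      TEXT,
    protein_id    TEXT
);
-- Main structure (hypothetical): only checked rows are stored here.
CREATE TABLE Identification (
    lab_code   TEXT NOT NULL,
    protein_id TEXT NOT NULL
);
""")

def stage_rows(conn, document_name, rows):
    """Copy submitted (lab_code, protein_id) rows verbatim into staging."""
    conn.executemany(
        "INSERT INTO StagingIdentification VALUES (?, ?, ?, ?)",
        [(document_name, n, lab, prot)
         for n, (lab, prot) in enumerate(rows, start=1)],
    )

def find_invalid_rows(conn):
    """Return (document_name, row_no) for rows with a missing field."""
    cur = conn.execute("""
        SELECT document_name, row_no
          FROM StagingIdentification
         WHERE TRIM(COALESCE(lab_code, '')) = ''
            OR TRIM(COALESCE(protein_id, '')) = ''
         ORDER BY document_name, row_no
    """)
    return cur.fetchall()

def promote_valid_rows(conn):
    """Load rows that pass the check into the main structure."""
    conn.execute("""
        INSERT INTO Identification (lab_code, protein_id)
        SELECT TRIM(lab_code), TRIM(protein_id)
          FROM StagingIdentification
         WHERE TRIM(COALESCE(lab_code, '')) <> ''
           AND TRIM(COALESCE(protein_id, '')) <> ''
    """)

# Placeholder submission: the second and third rows are deliberately incomplete.
stage_rows(conn, "lab_a_submission.txt",
           [("LAB_A", "PROT_X"), ("LAB_A", ""), ("", "PROT_Y")])
invalid = find_invalid_rows(conn)
promote_valid_rows(conn)
```

Keeping the staged rows untouched means that a rejected row can always be traced back to the document and row it came from, while the main table only ever receives rows that passed the check.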

The structure can be divided into four main sections: (1) experiment description, (2) protein identifications made by data producers from peptide sequences, (3) MS/MS peak lists, and (4) protein identifications from database searches made by groups other than the data producers.

In the database design (Fig. 1), experiments performed by the project participants are stored in the entity Experiment. This entity is referenced directly by the entity Laboratory and by a set of look-up entities: Specimen, Depletion, SeparationProtein, ReductionAlkylation, SeparationPeptide, and MassSpec. Experiment also has a many-to-many relationship with a free-text protocol description (entities Protocol and ExperimentProtocol). At the experiment level, the database structure branches into two sections. The first section, which starts at the entity IdentificationSet, stores the protein identifications submitted by the participants. The second section, which starts at the entity MsRun, stores MS peak lists and the results of their analysis. The two-branched structure reflects the change in the project's data collection model, from an identification-oriented model at the beginning to the more fine-grained description used later.
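The layout described above can be sketched as a small schema. The entity names come from the text; everything else is an assumption for illustration: the column names, the choice to place the foreign keys in Experiment, the sample rows, and the use of Python's sqlite3 module in place of the project's database server. Fig. 1 remains the authoritative description.

```python
import sqlite3

# Look-up entities named in the text; their columns are hypothetical.
LOOKUPS = ["Specimen", "Depletion", "SeparationProtein",
           "ReductionAlkylation", "SeparationPeptide", "MassSpec"]

def build_schema(conn):
    conn.execute("PRAGMA foreign_keys = ON")
    conn.execute(
        "CREATE TABLE Laboratory (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
    for name in LOOKUPS:
        conn.execute(
            f"CREATE TABLE {name} (id INTEGER PRIMARY KEY, label TEXT NOT NULL)")
    # Assumption: Experiment carries one foreign key per look-up entity.
    fk_cols = ", ".join(
        f"{name}_id INTEGER REFERENCES {name}(id)" for name in LOOKUPS)
    conn.execute(
        "CREATE TABLE Experiment (id INTEGER PRIMARY KEY, "
        "laboratory_id INTEGER NOT NULL REFERENCES Laboratory(id), "
        f"{fk_cols})")
    conn.executescript("""
    CREATE TABLE Protocol (id INTEGER PRIMARY KEY, description TEXT NOT NULL);
    -- Link table for the many-to-many Experiment/Protocol relationship.
    CREATE TABLE ExperimentProtocol (
        experiment_id INTEGER NOT NULL REFERENCES Experiment(id),
        protocol_id   INTEGER NOT NULL REFERENCES Protocol(id),
        PRIMARY KEY (experiment_id, protocol_id)
    );
    -- First branch: identifications submitted by participants.
    CREATE TABLE IdentificationSet (
        id INTEGER PRIMARY KEY,
        experiment_id INTEGER NOT NULL REFERENCES Experiment(id)
    );
    -- Second branch: peak lists and the results of their analysis.
    CREATE TABLE MsRun (
        id INTEGER PRIMARY KEY,
        experiment_id INTEGER NOT NULL REFERENCES Experiment(id)
    );
    """)

def protocols_for(conn, experiment_id):
    """Return the protocol descriptions linked to one experiment."""
    cur = conn.execute("""
        SELECT p.description
          FROM Protocol p
          JOIN ExperimentProtocol ep ON ep.protocol_id = p.id
         WHERE ep.experiment_id = ?
         ORDER BY p.id
    """, (experiment_id,))
    return [row[0] for row in cur]

conn = sqlite3.connect(":memory:")
build_schema(conn)
# Placeholder rows, not project data.
conn.execute("INSERT INTO Laboratory VALUES (1, 'Lab A')")
conn.execute("INSERT INTO Experiment (id, laboratory_id) VALUES (1, 1)")
conn.executemany("INSERT INTO Protocol VALUES (?, ?)",
                 [(1, "Depletion protocol text"), (2, "Digestion protocol text")])
conn.executemany("INSERT INTO ExperimentProtocol VALUES (1, ?)", [(1,), (2,)])
conn.execute("INSERT INTO IdentificationSet VALUES (1, 1)")
conn.execute("INSERT INTO MsRun VALUES (1, 1)")
```

Because both IdentificationSet and MsRun point back to Experiment, an experiment submitted under the earlier identification-oriented model and one submitted with peak lists share the same experiment description.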

The database can capture three sets of protein identifiers from the same experiment. The first set, stored in the entity Identification, holds the protein identifications made by the data producers. The second set, stored in the entity ProteinByPeptides, holds the results of peptide list searches performed by the data integration center; this set also captures peptide group information. The third set, which may comprise multiple subsets, is derived from the same experimental results by an analytical group other than the data producer and is stored through the MsRun branch of the database (entities MsRun, MzPeak, and ProteinByMsSearch).
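As a rough illustration of how the three sets could be read side by side for one experiment, the sketch below builds a reduced version of the relevant entities and combines them with a single query. The entity names are from the text, but the columns, the link between ProteinByPeptides and IdentificationSet, the analysis_group column, and the protein identifiers (PROT_A and so on) are hypothetical placeholders. MzPeak is omitted for brevity.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE IdentificationSet (id INTEGER PRIMARY KEY, experiment_id INTEGER NOT NULL);
CREATE TABLE Identification (
    identification_set_id INTEGER NOT NULL, protein_id TEXT NOT NULL);
-- Assumption: peptide-based results hang off the same identification set.
CREATE TABLE ProteinByPeptides (
    identification_set_id INTEGER NOT NULL, protein_id TEXT NOT NULL,
    peptide_group INTEGER);
CREATE TABLE MsRun (id INTEGER PRIMARY KEY, experiment_id INTEGER NOT NULL);
CREATE TABLE ProteinByMsSearch (
    ms_run_id INTEGER NOT NULL, analysis_group TEXT NOT NULL,
    protein_id TEXT NOT NULL);
""")

QUERY = """
SELECT 'producer' AS source, i.protein_id AS protein_id
  FROM Identification i
  JOIN IdentificationSet s ON s.id = i.identification_set_id
 WHERE s.experiment_id = :exp
UNION ALL
SELECT 'integration_center', p.protein_id
  FROM ProteinByPeptides p
  JOIN IdentificationSet s ON s.id = p.identification_set_id
 WHERE s.experiment_id = :exp
UNION ALL
SELECT 'ms_search:' || m.analysis_group, m.protein_id
  FROM ProteinByMsSearch m
  JOIN MsRun r ON r.id = m.ms_run_id
 WHERE r.experiment_id = :exp
"""

def identifiers_by_source(conn, experiment_id):
    """Group protein identifiers for one experiment by where they came from."""
    result = {}
    for source, protein_id in conn.execute(QUERY, {"exp": experiment_id}):
        result.setdefault(source, []).append(protein_id)
    return {source: sorted(ids) for source, ids in result.items()}

# Placeholder rows: one experiment, two independent MS search groups.
conn.execute("INSERT INTO IdentificationSet VALUES (1, 1)")
conn.executemany("INSERT INTO Identification VALUES (1, ?)",
                 [("PROT_A",), ("PROT_B",)])
conn.execute("INSERT INTO ProteinByPeptides VALUES (1, 'PROT_A', 1)")
conn.execute("INSERT INTO MsRun VALUES (1, 1)")
conn.executemany("INSERT INTO ProteinByMsSearch VALUES (1, ?, ?)",
                 [("group_1", "PROT_A"), ("group_2", "PROT_C")])

sets = identifiers_by_source(conn, 1)
```

The third source is keyed by analysis group, which mirrors the statement that multiple subsets of identifiers from database searches are possible for the same experiment.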

The main project database does not store SELDI peak lists or MS/MS raw spectra. These data are available as downloadable files.
