Receipt of the data

The data documents were uploaded using a web-based submission site established at the University of Michigan. During submission each document received a unique ID number used subsequently by the document tracking and transforming mechanism. The XML documents submitted by email were processed separately. Data from the received documents were transferred to an intermediate database. The transfer was done automatically for each web-submitted document and separately for the emailed XML submissions. The data in the intermediate structure represent an exact copy of the data from the original documents, without any transformation or integration. The intermediate database allows checking the correctness of the structure of the submitted documents and makes the data available for the integration procedures. Verified data were then rewritten using a consistent format for protein accession numbers, database names, peptide sequences, peak lists, and experimental categories.

Tables shown in Fig. 1 (PK, primary key; FK, foreign key):

Laboratory: PK laboratoryId*; code, name, address, name_pi, email_pi, phone, fax, person, email, date

Specimen: PK specimenId*; description

Depletion

SeparationProtein: PK separationProteinId*; manufacturer, description

ReductionAlkylation: PK reductionAlkylationId*; description

SeparationPeptide: PK separationPeptideId*; manufacturer, description

MassSpec: PK massSpecId*; manufacturer, model, description

Experiment: PK experimentId*; FK laboratoryId*, FK specimenId*, FK depletionId*, FK separationProteinId*, FK reductionAlkylationId*, FK separationPeptideId, FK massSpecId*; depletionDescription, description, operationTime, operationCost

Protocol: PK protocolId*; title, text

ExperimentProtocol: PK,FK protocolId*; PK,FK experimentId*

MsRun: PK msRunId*; FK experimentId*; reference, description

MsScan: PK msScanId*; FK msRunId; number, parentIonMass, parentIonCharge

MzPeak: PK,FK msScanId*; PK mz*; intensity

Search Software: PK searchSoftwareId*; description

SourceDatabase: PK sourceDatabaseId*; description

DatabaseSearch: PK databaseSearchId*; FK laboratoryId*, FK searchSoftwareId*, FK sourceDatabaseId*; modifFixedList, modifVariableList, peptideMassTolerance, ionTolerance, description

ProteinByMsSearch: PK,FK databaseSearchId*; PK,FK msScanId*; PK accessionNumber*; PK variant*; peptideSequenceList, scorePeptide, scoreProtein

IdentificationSet: PK identificationSetId*; FK experimentId*; date, reference, person, email, description

Identification: PK,FK identificationSetId*; sourceDatabaseId*, searchSoftwareId*, proteinName, accessionNumber, peptideSequenceList, confidence, comments

ProteinByPeptides: PK,FK identificationSetId*; PK,FK id*; PK,FK sourceDatabaseId*; PK accessionNumber*; sequenceCoverage, selected, selectedScore, score

IdentificationPeptide: PK,FK identificationSetId*; PK,FK id*; PK peptide*; PK charge*; PK mod*; xCoor, deltaCn, rSp, sp, ionPerc, score, confidence, rnxc, nxc, si, manualy
Fig. 1 Entity-relationship diagram of the HUPO PPP data repository. Boxes symbolize entities or tables; connecting lines represent relations between the entities.
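To make the key relationships in Fig. 1 concrete, the following is a minimal sketch of three of its tables in SQLite, driven from Python. Only the table and column names come from the figure; the column types, the NOT NULL constraints, the choice of SQLite, and the helper names `build_demo_db` and `sets_per_laboratory` are assumptions made for illustration and do not describe the repository's actual implementation.

```python
import sqlite3

# Hypothetical SQLite rendering of three tables from Fig. 1.
# Types and constraints are assumed; the figure supplies names only.
SCHEMA = """
CREATE TABLE Laboratory (
    laboratoryId INTEGER PRIMARY KEY,
    code TEXT,
    name TEXT
);
CREATE TABLE Experiment (
    experimentId INTEGER PRIMARY KEY,
    laboratoryId INTEGER NOT NULL REFERENCES Laboratory(laboratoryId),
    description TEXT
);
CREATE TABLE IdentificationSet (
    identificationSetId INTEGER PRIMARY KEY,
    experimentId INTEGER NOT NULL REFERENCES Experiment(experimentId),
    description TEXT
);
"""


def build_demo_db():
    """Create an in-memory database with one made-up row per table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")  # enforce the FK links
    conn.executescript(SCHEMA)
    conn.execute("INSERT INTO Laboratory VALUES (1, 'L01', 'Example lab')")
    conn.execute("INSERT INTO Experiment VALUES (10, 1, 'demo experiment')")
    conn.execute("INSERT INTO IdentificationSet VALUES (100, 10, 'demo set')")
    conn.commit()
    return conn


def sets_per_laboratory(conn):
    """Count identification sets per laboratory code by following the FKs."""
    rows = conn.execute(
        """
        SELECT l.code, COUNT(s.identificationSetId)
        FROM Laboratory l
        JOIN Experiment e ON e.laboratoryId = l.laboratoryId
        JOIN IdentificationSet s ON s.experimentId = e.experimentId
        GROUP BY l.code
        """
    ).fetchall()
    return dict(rows)
```

The join path Laboratory, Experiment, IdentificationSet is the same chain of references that later lets each identification be attributed to a laboratory and an experiment.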

Inference from peptide level to protein level

In the pilot phase of the HUPO PPP, proteins were identified by MS experiments, followed by searches of protein databases to find peptide sequences matching observed spectra. Often, such a search returns a cluster of proteins, all of which contain the same set of matching peptides. Problems with ambiguity of protein identifications obtained from searches of tandem mass spectra and methods for managing them have been widely discussed, e.g., by Nesvizhskii et al. [7] and Sadygov et al. [8]. In these earlier works, protein identifications were inferred from lists of assigned peptides accompanied by probabilities that those assignments are correct. In the present report, however, we integrated lists of peptides obtained using several different search algorithms and different search databases, which frequently lacked identification probabilities. Although during the course of the project, participants were asked to additionally submit peptide and protein identification probabilities or scores, as well as peak lists and raw MS spectra, the main part of integrating the results was based solely on the sequences of the submitted peptides. The raw spectra and peak lists were subject to separate analysis and will be described elsewhere.

The integration workflow we describe here benefits from the collaborative character of the studies and is based on a heuristic approach that assumes that the proteins most likely to be truly present in the sample are those supported by the largest number of maximally independent experiments. The workflow additionally takes into account the "level of annotation" of the protein, thus preferentially selecting the proteins with the most extensive description available.

The workflow algorithm includes several consecutive steps:

(1) Assemble peptide sequence lists: Protein identifications submitted by the participating laboratories were accompanied by lists of sequences of matched peptides. All the lists were collected to form a set of distinct peptide sequence lists. Each list in that set preserves all references to its origin, e.g., if a particular list is reported from more than one experiment, it has more than one reference.

(2) Search the peptide lists: Each peptide sequence list obtained in the previous step was subsequently searched against the IPI version 2.21 (July 2003) database [9]. This was selected as the standard database of the project. Each match requires 100% identity between sequences and disregards flanking residues.

(3) Select one representative protein from each cluster of equivalent protein hits: Often, more than one entry in the reference protein database matches all of the components of a peptide sequence list. We call this set of matching entries a "cluster of equivalent protein hits" for that peptide sequence list. The clusters for different lists may overlap. When they do, we wish to choose one protein entry from the intersection of several clusters to represent all proteins in each of the overlapping clusters, that is, the proteins identified by each of the associated peptide sequence lists. The selection is done as follows.

Each protein entry in the reference database receives three integer scores:

(a) The number of different laboratories reporting a peptide sequence list whose cluster includes this protein.

(b) The number of distinct experiments (laboratories × specimens × protocols) reporting a peptide sequence list whose cluster includes this protein.

(c) The number of identifications (laboratories × specimens × protocols × clusters) for clusters including this protein.

For each peptide sequence list, the cluster member with the largest value of score (a) is chosen as the representative protein entry. Scores (b) and (c), followed by criteria (d-g) listed below, are applied in succession to break numeric ties at higher levels.

(d) Well-described protein - product of a well-described gene. The EnsEMBL gene model was used for the annotation. The "well-described" proteins and genes are those with a nonempty description line, and without words like "fragment", "similar to", "hypothetical", "putative", etc. in their description.

(e) Well-described protein - product of any gene.

(f) Well-described protein not assigned to any gene.

(g) Protein not assigned to any gene and described as a fragment, by its similarity to another protein, or with no IPI description line at all.

Any remaining ties are broken by selecting the protein having the lower IPI number.
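The annotation criteria (d-g) can be read as an ordered rank. The sketch below is one possible reading, not the project's code: the function names are hypothetical, the flagged-word list contains only the four examples given in the text (the original list ends in "etc."), and a poorly described protein that is assigned to a gene, a case the criteria do not mention, is placed in the lowest rank by assumption.

```python
# Words that mark a description as poorly annotated. Only the examples named
# in the text are included; the full list used in the project is not given.
FLAG_WORDS = ("fragment", "similar to", "hypothetical", "putative")


def is_well_described(description):
    """True for a nonempty description containing none of the flagged words."""
    text = (description or "").strip().lower()
    return bool(text) and not any(word in text for word in FLAG_WORDS)


def annotation_rank(protein_desc, gene_desc=None, has_gene=False):
    """Return 0, 1, 2 or 3 for criteria (d), (e), (f), (g); lower is better."""
    protein_ok = is_well_described(protein_desc)
    if protein_ok and has_gene and is_well_described(gene_desc):
        return 0  # (d) well-described protein, well-described gene
    if protein_ok and has_gene:
        return 1  # (e) well-described protein, any gene
    if protein_ok:
        return 2  # (f) well-described protein, no gene
    return 3  # (g), plus cases the criteria leave unspecified (assumption)
```

For example, under these assumptions a protein described as "Similar to albumin" falls into rank 3 regardless of its gene assignment, while "Serum albumin" with no gene falls into rank 2.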

As a result, one protein will generally be chosen as the representative entry from several overlapping clusters of equivalent protein identifications. This simplifies later comparisons between laboratories and experiments. This particular choice for a representative protein is motivated by the idea that the protein whose identification is supported by the largest number of independent experiments is the protein most likely to be actually present in the specimen. Score (a) counts each laboratory only once, no matter from how many specimens or with how many different peptide sequence lists the laboratory identified this protein. Next in importance, score (b) counts the number of independent experiments in which the protein was identified. Score (c) counts all reported peptide sequence lists, even if several results are from the same experiment. Criteria (d-g) indicate the level of annotation for each database entry. They facilitate selection of the best-described proteins.
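The three steps can be sketched end to end in a few lines of Python. This is an illustrative reading under stated assumptions, not the implementation used in the project: all function names are hypothetical, a peptide list is treated as an unordered set, the reference database is a plain dictionary of accession to sequence, an exact match is a substring test, score (c) counts distinct (experiment, peptide list) pairs, and the annotation rank (0 for criterion (d) up to 3 for (g)) and the IPI number are supplied as precomputed dictionaries.

```python
from collections import defaultdict


def assemble_peptide_lists(submissions):
    """Step (1): map each distinct peptide list to the experiments reporting it."""
    origins = defaultdict(set)
    for lab, specimen, protocol, peptides in submissions:
        origins[frozenset(peptides)].add((lab, specimen, protocol))
    return origins


def find_cluster(peptides, database):
    """Step (2): accessions whose sequence contains every peptide exactly."""
    return frozenset(
        acc for acc, seq in database.items() if all(p in seq for p in peptides)
    )


def score_proteins(origins, clusters):
    """Scores (a), (b), (c) for every protein that occurs in some cluster."""
    labs = defaultdict(set)
    experiments = defaultdict(set)
    idents = defaultdict(set)
    for peptides, refs in origins.items():
        for acc in clusters[peptides]:
            for lab, specimen, protocol in refs:
                labs[acc].add(lab)
                experiments[acc].add((lab, specimen, protocol))
                idents[acc].add((lab, specimen, protocol, peptides))
    return {
        acc: (len(labs[acc]), len(experiments[acc]), len(idents[acc]))
        for acc in labs
    }


def pick_representative(cluster, scores, annotation_rank, ipi_number):
    """Step (3): highest (a), then (b), (c), annotation rank, lowest IPI number."""
    def sort_key(acc):
        a, b, c = scores[acc]
        return (-a, -b, -c, annotation_rank[acc], ipi_number[acc])
    return min(cluster, key=sort_key)


def integrate(submissions, database, annotation_rank, ipi_number):
    """Run steps (1) to (3); returns one representative per peptide list."""
    origins = assemble_peptide_lists(submissions)
    clusters = {peps: find_cluster(peps, database) for peps in origins}
    scores = score_proteins(origins, clusters)
    return {
        peps: pick_representative(cluster, scores, annotation_rank, ipi_number)
        for peps, cluster in clusters.items()
        if cluster  # peptide lists with no database match are skipped
    }
```

Encoding the tie-breaking order as a single sort key keeps the precedence of scores (a), (b), (c), the annotation criteria, and the IPI number explicit in one place. Because the scores are computed per protein across all clusters, overlapping clusters tend to resolve to the same representative.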
