Summary of contributed data

Laboratories participating in the project submitted a total of 12 667 distinct protein accession numbers. This number includes 11 253 accession numbers from MS/MS (both MALDI and LC-ESI) and an additional 1414 IDs from FT-ICR-MS. FT-ICR-MS identified 2230 proteins, but 816 of these were also identified by the MS/MS technologies. In addition, participating laboratories contributed 653 identifications from MALDI-MS peptide mass fingerprints. These data were analyzed separately and will be reported elsewhere.
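The totals in this paragraph can be checked with a few lines of arithmetic. The sketch below uses only the counts quoted above; the variable names are illustrative and not taken from the project.

```python
# Counts quoted in the text (variable names are illustrative).
msms_accessions = 11_253   # MS/MS accession numbers (MALDI and LC-ESI)
fticr_total = 2_230        # all FT-ICR-MS identifications
fticr_also_msms = 816      # FT-ICR-MS identifications also found by MS/MS

# Identifications contributed by FT-ICR-MS alone.
fticr_only = fticr_total - fticr_also_msms

# Distinct accession numbers across both groups.
total_distinct = msms_accessions + fticr_only

print(fticr_only, total_distinct)
```

The two printed values should reproduce the 1414 additional FT-ICR-MS IDs and the overall total of 12 667.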

The majority of reported protein identifications from the MS/MS and FT-ICR-MS experiments (11 960 of 12 667 - 94%) were obtained by searching the tandem mass spectra against the IPI database. The remaining 6% were generated using either the Swiss-Prot or NCBInr databases (Tab. 2). Almost all of the submitted peptide sequence lists (12 388 of 12 667 - 98%) were matched in the standard database for the project, i.e., IPI version 2.21. The 2% of peptide sequence lists for which no exact match was found in this database most likely reflect the mismatch of up to 5% between database entries that is permitted when constructing the IPI database (see [9]). We believe that in these cases the submitting laboratory searched one of the source databases for IPI, rather than IPI itself, and matched the spectrum to a source entry that is included in IPI as a secondary rather than a master entry.

The 12 388 reported identifications with peptides matching the IPI 2.21 database correspond to 18 098 distinct peptide sequence lists. Searching these lists against IPI 2.21 results in 15 710 matching entries. For each of 12 303 of these lists (68%), exactly one of 6601 IPI entries was matched. These were reported with 7000 different protein accession numbers, including Swiss-Prot and NCBI identifiers. The 6% reduction from 7000 to 6601 distinct identifiers comes from converting Swiss-Prot and NCBI identifiers to IPI identifiers. As these identifications are already unique, the integration workflow did not additionally reduce these 6601 accession numbers.

Tab. 2 Usage of the search databases

| Category / search database | IPI | Swiss-Prot | NCBInr | All three |
|---|---|---|---|---|
| Submitted protein identifications | 11 960 | 199 | 508 | 12 667 |
| Submitted identifications with peptide sequence lists found in IPI database | 11 741 (98%) | 196 (98%) | 451 (89%) | 12 388 (98%) |
| Entries in IPI database matching submitted peptide sequence lists | 15 463 | 488 | 552 | 15 710 |
| Average number of IPI entries per submitted protein identification | 1.3 | 2.5 | 1.2 | 1.3 |

Tab. 3 Effectiveness of the integration process

| Category / number of IPI entries matching a single peptide sequence list | One (distinct IDs) | More than one (indistinct IDs) | One or more (all IDs) |
|---|---|---|---|
| Submitted peptide sequence lists | 12 303 | 5795 | 18 098 |
| Submitted protein accession numbers | 7000 | 5388 | 12 388 |
| Matching entries in IPI database | 6601 | 9668 | 15 710 |
| Matching entries in IPI database after the integration | 6601 | 3273 | 9506 |
| Reduction level of submitted accession numbers to IPI entries | 6% | 39% | 23% |
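The reduction levels reported in Tab. 3 follow from the submitted and post-integration counts. A minimal sketch, assuming the reduction level is defined as (submitted - after) / submitted; the function and dictionary names are illustrative.

```python
def reduction(submitted: int, after: int) -> float:
    """Fractional reduction from submitted accession numbers to IPI entries."""
    return (submitted - after) / submitted

# (submitted accession numbers, IPI entries after integration), from Tab. 3.
groups = {
    "distinct": (7000, 6601),
    "indistinct": (5388, 3273),
    "all": (12388, 9506),
}

# Reduction level in whole percent for each group.
reductions = {
    name: round(100 * reduction(submitted, after))
    for name, (submitted, after) in groups.items()
}
print(reductions)
```

Rounded to whole percentages, the values should agree with the 6%, 39% and 23% given in the table.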

In the remaining 5795 (32%) cases, each peptide sequence list matches more than one IPI protein sequence, resulting in an ambiguous identification or a cluster of equivalent hits (Tab. 3). In this group of ambiguous identifications, searches of the 5795 peptide sequence lists return 9668 distinct IPI protein accession numbers. The integration workflow reduces this group to a set of 3273 distinct proteins, which explain the presence of all reported peptides. In the next step, the 6601 accession numbers from the group of uniquely identified proteins are combined with the 3273 accession numbers from the group of ambiguous identifications. Of the resulting 9874 identifications, 9506 represent distinct accession numbers.
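Combining the two groups is a set union: accession numbers present in both groups are counted once. The sketch below illustrates this with made-up accession numbers (the identifiers are hypothetical) and then derives the overlap implied by the counts in the text.

```python
# Toy example with hypothetical accession numbers.
unique_group = {"IPI_A", "IPI_B", "IPI_C"}
ambiguous_group = {"IPI_C", "IPI_D"}

# The union keeps each accession number once.
combined = unique_group | ambiguous_group
print(len(unique_group) + len(ambiguous_group), len(combined))

# Counts from the text: 6601 + 3273 identifications, 9506 distinct.
total_identifications = 6601 + 3273
shared_accessions = total_identifications - 9506
print(total_identifications, shared_accessions)
```

By this arithmetic, 368 accession numbers occur in both the uniquely identified and the ambiguous group.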

Details of the integration process for the 5795 clusters of ambiguous hits are presented in Tab. 4. Scores (a-c) evaluate the level of confirmation of each protein identification by the number of completely independent experiments.

Tab. 4 Number of clusters qualified on different levels of the integration

| Integration level | Number of clusters | Subtotal |
|---|---|---|
| A Number of laboratories | 1680 | 2288 |
| B Number of experiments | 419 | |
| C Number of reports | 189 | |
| D Well-described EnsEMBL gene | 2429 | 3507 |
| E Any EnsEMBL gene reference | 99 | |
| F No EnsEMBL reference | 286 | |
| G Poorly described protein | 693 | |
| Total number of potentially ambiguous peptide sequence lists processed | | 5795 |

Tab. 5 Distribution of numbers of entries from the HUPO PPP and complete IPI databases in the integration categories

| Integration category | Complete IPI database: No. of entries | Complete IPI database: Fraction of all entries | HUPO PPP database: No. of proteins | HUPO PPP database: Fraction of all identifications | HUPO PPP database: Fraction of IPI entries |
|---|---|---|---|---|---|
| All | 56 530 | | 9506 | | 17% |

In 2044 (35%) of the cases, the protein was selected on the basis of score (a), i.e., by choosing the protein detected by the largest number of laboratories. In 1680 (82%) of those cases this yielded a single protein, and no additional selection step was required. In the remaining 18% of the cases, selection by score (a) returned more than one protein. The tie was then broken using additional scoring categories (d-g). In 2966 (51%) of the cases, all proteins in the cluster were indistinguishable using scores (a-c) and the decisions were made exclusively using categories (d-g).
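The selection described above can be read as a cascade of tie-breakers. The following is a minimal sketch of that reading, not the project's actual implementation: it assumes scores (a-c) are applied in order (laboratories, experiments, reports, as listed in Tab. 4) and that remaining ties are resolved by the annotation category, with (d) preferred over (g). The `Candidate` class, its field names and the accession numbers are hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Candidate:
    accession: str     # hypothetical accession number
    labs: int          # score (a): number of laboratories
    experiments: int   # score (b): number of experiments
    reports: int       # score (c): number of reports
    annotation: str    # category 'd' (best described) to 'g' (poorly described)

def select(cluster):
    """Return the candidates that survive the tie-breaking cascade."""
    remaining = list(cluster)
    # Scores (a-c): keep only the candidates with the highest value at each level.
    for score in (lambda c: c.labs, lambda c: c.experiments, lambda c: c.reports):
        best = max(score(c) for c in remaining)
        remaining = [c for c in remaining if score(c) == best]
        if len(remaining) == 1:
            return remaining
    # Categories (d-g): prefer the best-described entry ('d' sorts first).
    best_category = min(c.annotation for c in remaining)
    return [c for c in remaining if c.annotation == best_category]

cluster = [
    Candidate("P1", labs=3, experiments=5, reports=5, annotation="g"),
    Candidate("P2", labs=3, experiments=5, reports=5, annotation="d"),
    Candidate("P3", labs=2, experiments=9, reports=9, annotation="d"),
]
selected = select(cluster)
print([c.accession for c in selected])
```

In this toy cluster, P3 is dropped by score (a), P1 and P2 tie on scores (a-c), and the annotation category decides in favor of P2.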

The categories (d-g) classify IPI database entries by the amount of detail in their description. It is therefore reasonable to compare this classification of proteins in the project database with the same classification of proteins in the complete IPI database. Details of this comparison are given in Tab. 5. It shows that 41% of entries from the HUPO PPP database and 24% of the entries from the IPI database belong to the highest category (d) - the best-described proteins. The intermediate categories (e) and (f) include relatively few proteins, while category (g) - the least described proteins - contains the largest share of the entries, 49 and 63% for the HUPO PPP and IPI databases, respectively. For the HUPO PPP database, the ratio between the percentages of entries in categories (d) and (g) is 41%/49% = 0.84. For the IPI database, this ratio is 24%/63% = 0.38. Thus, the laboratories were more likely to identify better-described proteins. This result can be interpreted as confirming the presence of proteins that were previously studied in detail, possibly because of their relative abundance or ease of identification. Alternatively, the integration workflow itself preferred the best-described proteins wherever possible, pushing the ratio toward category (d).
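The two ratios can be recomputed directly from the percentages quoted for categories (d) and (g). A short sketch, with illustrative variable names:

```python
# Percentage of entries in categories (d) and (g), as quoted in the text.
hupo_ppp = {"d": 41, "g": 49}
ipi = {"d": 24, "g": 63}

ratio_hupo = hupo_ppp["d"] / hupo_ppp["g"]
ratio_ipi = ipi["d"] / ipi["g"]

print(round(ratio_hupo, 2), round(ratio_ipi, 2))
```

A higher ratio means a larger share of well-described entries relative to poorly described ones, so the comparison supports the statement that the HUPO PPP data set is enriched in category (d).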

To further compare results from the HUPO PPP with all the proteins from IPI, we compared the distributions of peptide sequence length (number of amino acid residues per peptide) in both data sets (Fig. 2). The distribution of peptide length from the HUPO PPP database (median 12.9 residues) is noticeably shifted toward longer peptides in comparison to the distribution of the lengths of tryptic peptides in IPI (median 10.5 residues). We hypothesize that the under-representation of short peptides may be explained by the nature of the tandem mass spectrum search algorithms, which require spectra from short peptides to be of much better quality than spectra from longer peptides in order to yield a significant match. Many laboratories did not report any peptides shorter than five residues. The fraction of nontryptic peptides in each peptide length bin is very small. These peptides were identified in a few nonenzyme-specific database searches and, as they passed quality control in the participating laboratories, they were included in our analysis. The origin of these peptides is not analyzed in this paper, but we speculate that they may be products of other endogenous proteases present in the tissue of origin or in human plasma [10].
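A length distribution of tryptic peptides can be obtained by digesting each database sequence in silico. The sketch below assumes the commonly used trypsin rule (cleavage after K or R, except when followed by P) and no missed cleavages; the source does not state which rule was applied to IPI, and the sequence is a toy example.

```python
import re
from statistics import median

def tryptic_peptides(sequence: str) -> list:
    """Split a protein sequence after K or R, unless the next residue is P."""
    # Zero-width split points require Python 3.7 or later.
    return [p for p in re.split(r"(?<=[KR])(?!P)", sequence) if p]

toy_sequence = "AAKRPGGRCCKPDDK"  # hypothetical sequence
peptides = tryptic_peptides(toy_sequence)
lengths = [len(p) for p in peptides]

print(peptides)
print(median(lengths))
```

For the toy sequence, the K-P and R-P bonds are left intact, which gives three peptides of 3, 5 and 7 residues.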

Based on the nonuniform reporting of short peptides from participating laboratories, the limited spectral data available for short peptides, and the limited power for protein identification using a peptide present in multiple protein sequences, we decided to eliminate peptides shorter than six residues from further analysis. In doing so, we disregarded two protein identifications, each based on a single peptide of five amino acids. This reduces the number of accepted protein identifications from 9506 to 9504 accession numbers.
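The length filter can be sketched as follows. The accession numbers and peptide sequences are invented for illustration; only the six-residue threshold and the rule that an identification without remaining peptides is dropped come from the text.

```python
MIN_LENGTH = 6  # peptides shorter than six residues are eliminated

# Hypothetical identifications: accession number -> reported peptides.
identifications = {
    "ACC_A": ["LVNEVTEFAK", "AEFVK"],
    "ACC_B": ["GSLPK"],  # based on a single five-residue peptide
}

# Drop short peptides, then drop identifications left without any peptide.
filtered = {
    acc: [p for p in peptides if len(p) >= MIN_LENGTH]
    for acc, peptides in identifications.items()
}
accepted = {acc: peptides for acc, peptides in filtered.items() if peptides}

print(accepted)
```

In this example ACC_A keeps one peptide and ACC_B is removed, mirroring how two identifications were lost in the project data set (9506 - 2 = 9504).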
