False-positive identifications

False-positive peptide identifications exist and are widely acknowledged to be a problem [7, 8, 11-15]. One arises whenever the top-scoring database match for a particular spectrum has a score that passes all reporting thresholds, yet the matched database sequence is not the same as that of the biological specimen in the instrument. This can occur for a variety of reasons. The spectrum may represent a mixture of different peptides with almost equal parent masses and elution times. The biological specimen may be a contaminant or an allelic variant not recorded in the database being searched. Even if the database contains the correct amino acid sequence, this sequence may fall outside the scope of the search, due to PTMs or requirements for proteolytic cleavage. In each of these cases, the top-scoring match within scope and within the database is returned by the search software. If its score passes reporting thresholds, the (mis)match will be accepted and reported as a peptide identification.

Fig. 4 Distribution of MS/MS and FT-ICR-MS protein identifications as a function of the number of peptides detected per protein.

Fig. 5 Proposed classification of the identification stringency levels; the number of protein identifications at each level is shown in parentheses.

False-positive and true-positive peptide identifications show opposite behavior when we accumulate large numbers of peptide identifications, as in this project [7, 11]. One expects false-positive peptide identifications to accumulate roughly in proportion to total peptide identifications. However, the chance that two or more false-positive peptide identifications coincide on the same database entry should be no better than random. In contrast, a protein that is present at a detectable concentration in the specimen will produce many tryptic peptides in nearly stoichiometric quantities. Increased sampling should increase the number of distinct peptides that are reported, and all of these will map to the same (correct) database entry. This means that, as we accumulate more and more peptide identifications, the class of protein identifications based on a single peptide reported project wide is simultaneously depleted of correct peptide identifications (as these are promoted to multiple-peptide protein identifications) and refilled with false-positive protein identifications.

Below, we consider a range of values for the fraction of such false-positive identifications. One major participating HUPO laboratory, after manually reviewing several hundred of their protein identifications, concluded that a single peptide constituted sufficient evidence in perhaps 20% of the cases where only one peptide from a protein had been seen. The acceptance rate after manual review was much larger for proteins identified using two or three peptides, precisely because of the selection described above. Manual review of all the spectra was not feasible, and all of their identifications were submitted to the database.

To assess the confidence of protein identifications, we use a Poisson model for the distribution of false-positive peptide matches. Two parameters are needed to specify the model: the total number of database proteins and the number of peptide-level matches that are incorrect.

The IPI version 2.21 database contains 56 530 sequences, with some redundancy and overlap between entries. To model the database integration procedure, the two largest tryptic peptides from each database entry were calculated, and all entries containing exact matches to these two peptides were collapsed into a sequence group. This process resulted in 49 924 sequence groups. This is used as the number of bins in the random model.

Lower and upper bounds for the number of false peptide-level matches are estimated by assuming either that all of the lower-confidence single-peptide identifications are erroneous or that all single-peptide identifications, regardless of confidence, are erroneous. Of the 6484 identifications based on a single peptide project wide, 1956 were assigned with high confidence by at least one participating laboratory and 4528 are lower-confidence identifications. The Poisson distribution parameter λ is chosen so that the random model predicts the assumed number of false single-peptide identifications. The range for λ lies between 0.146 and 0.211. The estimate of 80% false-positive rate cited above gives λ = 0.168, within this range. Values for λ larger than 0.211 would predict more protein-level identifications due to false positives alone than the 9504 total identifications reported, and are inconsistent with the random model.
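The upper end of this range can be checked with a short calculation. Under the random model, the expected number of bins hit by at least one false-positive peptide is N(1 − e^(−λ)), where N is the number of sequence groups; setting this equal to the 9504 reported identifications gives the largest admissible value of the Poisson parameter (λ). The following is a minimal sketch in Python, not code from the project, and the function name `max_lambda` is hypothetical.

```python
import math

N_BINS = 49_924      # sequence groups used as bins in the random model
N_REPORTED = 9_504   # total protein identifications reported project wide

def max_lambda(n_bins: int, n_reported: int) -> float:
    """Largest Poisson rate for which false positives alone do not exceed
    the reported identifications: n_bins * (1 - exp(-lam)) = n_reported."""
    return -math.log(1.0 - n_reported / n_bins)

lam_max = max_lambda(N_BINS, N_REPORTED)
print(f"lambda_max = {lam_max:.3f}")
```

The result is approximately 0.211, in agreement with the upper bound quoted in the text.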

For each k = 0, 1, 2, 3, ... the expected number of database entries (out of 49 924) supported by exactly k false-positive peptide matches is calculated from a Poisson distribution. These are allocated in proportion among the reported protein identifications with s ≥ k supporting peptides. Only the predictions for which s = k result in false-positive identifications at the protein level. The principle here is that a protein identification is considered correct if at least one of its supporting peptide identifications is correct. The allocation is illustrated in Tab. 6, and protein-level confidence is summarized in Fig. 6 and Tab. 7.
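The allocation can be sketched in a few lines of Python. The text does not spell out the order in which rows are filled, so the sketch makes an assumption: rows are processed from the largest k downward, and each row's Poisson total is split among the classes with at least k supporting peptides in proportion to the observed counts not yet allocated. This reading was chosen because it gives values close to those in Tab. 6 when λ = 0.146; the function names are hypothetical and the code is an illustration, not the project's implementation.

```python
import math

# Observed bins by number of distinct supporting peptides s = 0, 1, 2, 3, 4, >=5
OBSERVED = [40_420, 6_484, 1_746, 559, 230, 485]   # sums to 49 924

def poisson_expected(lam, n_bins, k_max):
    """Expected bins with k = 0..k_max-1 false matches; last entry is the tail k >= k_max."""
    probs = [math.exp(-lam) * lam**k / math.factorial(k) for k in range(k_max)]
    probs.append(1.0 - sum(probs))
    return [n_bins * p for p in probs]

def allocate(lam, observed=OBSERVED):
    """Return table[k][s]: predicted bins with k false matches among bins observed with s peptides."""
    n = len(observed)
    expected = poisson_expected(lam, sum(observed), n - 1)
    remaining = [float(x) for x in observed]   # observed counts not yet allocated
    table = [[0.0] * n for _ in range(n)]
    for k in range(n - 1, -1, -1):             # assumption: fill from largest k downward
        pool = sum(remaining[k:])
        for s in range(k, n):
            share = expected[k] * remaining[s] / pool
            table[k][s] = share
            remaining[s] -= share
    return table

table = allocate(0.146)
for s in (1, 2, 3):
    false_pos = table[s][s]                    # s == k: every supporting peptide is false
    print(s, round(false_pos, 1), round(1 - false_pos / OBSERVED[s], 3))
```

Because 0.146 is a rounded value of λ, the diagonal entries differ slightly from the tabulated ones (for example, about 4527 rather than 4528 for a single peptide), but the resulting confidences should agree with Tab. 7 to the precision shown there.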

Fig. 6 At protein level, false-positive identifications are strongly concentrated among the protein identifications based on a single peptide project wide. This figure shows predicted error rates (1 − confidence, vertical axis) from the Poisson model as a function of λ (horizontal axis, expressed as the expected number of false-positive peptide reports per IPI database entry). Four curves represent the classes of protein identifications based on exactly one, exactly two, two or more, and three or more distinct peptides reported project wide.


At the lower bound, the random model predicts 268 false-positive identifications at protein level among 1746 proteins with exactly two distinct peptides reported project wide, and 10 false positives among 1274 proteins with three or more distinct peptides project wide. The confidence within each class is the observed number of identifications minus predicted false positives, divided by the observed number of identifications. A lower bound on error becomes an upper bound on confidence. These upper bounds are a confidence of 85% for identifications based on exactly two peptides and 99% for those based on three or more peptides. The corresponding worst-case estimates are 70% and 97% for exactly two and for three or more peptides, respectively.
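The arithmetic behind these figures is simple enough to reproduce. The sketch below applies the definition of confidence given above to the two classes discussed; the best-case false-positive counts (268 and 10) come from the text and the worst-case counts (533 and 29) from Tab. 7. The helper name `confidence` is hypothetical.

```python
def confidence(observed: int, predicted_false: float) -> float:
    """Fraction of identifications in a class not attributed to false positives."""
    return (observed - predicted_false) / observed

# class name -> (observed identifications, false positives at lower bound, at upper bound)
classes = {
    "exactly two peptides": (1746, 268, 533),
    "three or more peptides": (1274, 10, 29),
}

for name, (n, fp_low, fp_high) in classes.items():
    best = confidence(n, fp_low)     # lower bound on error gives upper bound on confidence
    worst = confidence(n, fp_high)
    print(f"{name}: best case {best:.3f}, worst case {worst:.3f}")
```

Rounded to whole percentages, these values correspond to the 85% and 99% upper bounds and the 70% and 97% worst-case estimates.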

We acknowledge uncertainty in the exact value for λ. However, qualitative interpretations of the data are not sensitive to λ. For the quantity of data accumulated in this study, and throughout the range of choices for λ, the confidence in protein identifications based on four or more peptides easily exceeds 0.99, and for identifications based on exactly three peptides project wide, it varies from 0.95 (λ = 0.211) to 0.99 (λ = 0.146). Both classes achieve the traditional 95% confidence threshold for accepting an assertion as true, regardless of λ. The confidence for identifications based on exactly two peptides project wide varies from 0.7 (λ = 0.211) to 0.85 (λ = 0.146). Again, regardless of λ, these identifications would be described in lay language as "probably correct, but by no means sure". The majority of single-peptide identifications are false under any reasonable values for λ.

Tab. 6 Allocating predicted false positives among observed identifications for λ = 0.146. Predicted total number of proteins with exactly k false-positive supporting peptides (right-hand column) is allocated proportionally among the observed identifications with s ≥ k supporting peptides (preceding columns). Each column total is the number of observed identifications with exactly s supporting peptides, and each row total is the number of identifications predicted to have exactly k false-positive supporting peptides. Only the cases where s = k (main diagonal, shown in brackets) produce false-positive identifications at the protein level.

k \ s               0          1          2          3          4         ≥5   Total (Poisson)
0            [40 420]      1 956     445.87     140.24      57.64     121.53         43 141.28
1                        [4 528]   1 032.16     324.65     133.42     281.33          6 299.56
2                                  [267.97]      84.29      34.64      73.04            459.94
3                                                [9.83]       4.04       8.52             22.39
4                                                           [0.26]       0.55              0.82
≥5                                                                     [0.02]              0.02
Observed       40 420      6 484      1 746        559        230        485            49 924

s, number of distinct peptides project wide; k, number of distinct false-positive peptides. Total (Poisson), total number of proteins with k false-positive peptides predicted from the Poisson model. Observed, number of observed protein identifications.

Tab. 7 Confidence in protein identifications as predicted by the Poisson model

s       Reported     FP (λ = 0.146)     FP (λ = 0.211)  Conf. (λ = 0.211)  Conf. (λ = 0.146)
1           6484               4528               6484                  0              0.302
2           1746                268                533              0.695              0.847
3            559                 10                 28              0.950              0.982
4            230               0.26               1.08              0.995              0.999
≥2          3020                278                562              0.814              0.908
≥3          1274                 10                 29              0.977              0.992
≥4           715               0.27               1.12             0.9984             0.9996
≥5           485               0.01               0.04             0.9999             0.9999

s, number of peptides; Reported, reported identifications; FP, predicted false positives; Conf., confidence.

We have chosen to concentrate further analysis on the 3020 identifications made with two or more peptides project wide for two reasons. First, excluding identifications based on exactly two peptides would exclude a large number of identifications that we believe are probably correct. Second, it would introduce a strong bias toward highly abundant proteins. Since the goal of the PPP is to identify a representative set of blood proteins, we chose to base subsequent analyses on the core data set of 3020 identifications, realizing that we are including a number of false positives but obtaining a more representative view of the human plasma proteome.

The wide range of concentrations for proteins in blood plasma and serum presents an additional complication. Clinical ELISA assays, where available, report a measurable concentration for many proteins that were never reported by MS. Almost every protein in the body is potentially present at some concentration in blood plasma or serum, whether as an intact protein or as degradation products. There is no set of proteins we can exclude as known negatives; a large number of potential positives are present at unknown but low concentrations. A similar situation is found in Saccharomyces cerevisiae. A recent tagging experiment [16] measured protein concentrations spanning four orders of magnitude for 4251 proteins, roughly 80% of all proteins expressed in log-phase yeast. Two separate MS/MS surveys conducted earlier [11, 17] show low concordance in protein identifications. They reported roughly 1500 proteins each, with 57% of proteins in common and 43% or 41% reported in one survey but not in the other. In yeast, as in this project, the reporting of low-abundance proteins is highly variable.
