Analysis of confidence of protein identifications

High false-positive rates are acknowledged to be a major problem in protein identification. Error-rate estimates can be generated, at least in relatively homogeneous datasets, by probabilistic methods using PeptideProphet and ProteinProphet, or by matching to reversed-sequence databases [15-20]. The alternative of careful manual inspection of the spectra becomes a huge task and is subjective. A spectrum may represent a mixture of different peptides with almost equal parent masses and elution times. The biological specimen may contain allelic variants or a contaminant not recorded in the database. Even if the sequence is correct, PTMs may take the peptide outside the scope of the match. However, false negatives may be a problem, too, especially when the database sequence is simply not the same as that of the biological specimen analyzed.

To estimate the confidence of protein identifications across our heterogeneous database, we compared the observed number of peptide matches per identification with a model in which identifications are randomly distributed. False-positive and true-positive peptide identifications should show opposite behavior as the number of identifications becomes large. We expect false-positive IDs to accumulate roughly in proportion to the total, so that the chance of two or more false-positive identifications coinciding on the same database entry should be the product of their random probabilities. In contrast, a protein that is present at detectable concentration will produce many tryptic peptides in nearly stoichiometric quantities. Increased sampling, therefore, should increase the number of distinct peptides mapping to the same (correct) database entry. The random model results in a Poisson distribution of the number of peptides matched per sequence. Two parameters are needed to specify the model: the total number of proteins in the database (Ndb) and the expected number of false peptide matches per database entry (lambda, ranging in this case from 0.146 to 0.211). The IPI 2.21 database contains 49 924 sequences after adjustment for redundancy. The upper bound for lambda corresponds to the assumption that every identified protein has at least one false-positive matching peptide; this bound eliminates all single-peptide hits. The lower bound accepts as correct all 1956 protein identifications based on a high-confidence single-peptide report, but treats all 4528 lower-confidence single-peptide identifications as false. Throughout this range of values of lambda, proteins with four or more supporting peptides are predicted to be correct with better than 0.99 confidence; with exactly three peptides, 0.95-0.98; and with exactly two peptides, 0.70-0.85 (Fig. 4). We based our annotations on the 3020 identifications made project-wide with two or more peptides; limiting annotation to proteins based on three or more peptides would have introduced a bias toward highly abundant proteins. Furthermore, a substantial majority of protein IDs based on exactly two peptides are probably correct. Independently, manual review of a large number of spectra led one of our investigators to estimate that at least 20% of one-peptide hits are true positives. In addition, MacCoss et al. [21] concluded that the chance that multi-peptide proteins are false positives declines exponentially with the number of peptides identified.

Fig. 4 Plot of estimated error rate for subsets of PPP proteins based on one, two, or three or more peptides, Poisson model.
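The random-match model described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, and the step that derives lambda from the fraction of database entries identified (9504 of 49 924) is an assumed reading that happens to reproduce the reported upper bound of 0.211; the text does not spell that derivation out.

```python
import math

# Values quoted in the text.
N_DB = 49924                            # IPI 2.21 sequences after redundancy adjustment
LAMBDA_LOW, LAMBDA_HIGH = 0.146, 0.211  # reported range of lambda
N_IDENTIFIED = 9504                     # proteins identified with one or more peptides


def poisson_pmf(k: int, lam: float) -> float:
    """Probability that one database entry collects exactly k random matches."""
    return math.exp(-lam) * lam ** k / math.factorial(k)


def poisson_tail(k_min: int, lam: float) -> float:
    """Probability that one database entry collects k_min or more random matches."""
    return 1.0 - sum(poisson_pmf(k, lam) for k in range(k_min))


def expected_random_entries(n_db: int, lam: float, k: int) -> float:
    """Expected number of database entries with exactly k random matches."""
    return n_db * poisson_pmf(k, lam)


# ASSUMPTION (not stated in the source): if every identified entry carries at
# least one random match, then 1 - exp(-lambda) equals the identified fraction.
lambda_from_fraction = -math.log(1.0 - N_IDENTIFIED / N_DB)


def main() -> None:
    print(f"lambda implied by identified fraction: {lambda_from_fraction:.3f}")
    for lam in (LAMBDA_LOW, LAMBDA_HIGH):
        print(f"lambda = {lam}")
        for k in (1, 2, 3):
            n_random = expected_random_entries(N_DB, lam, k)
            print(f"  entries expected with exactly {k} random matches: {n_random:.1f}")
        n_tail = N_DB * poisson_tail(4, lam)
        print(f"  entries expected with 4 or more random matches: {n_tail:.2f}")


if __name__ == "__main__":
    main()
```

Comparing these expected random counts with the observed number of proteins at each peptide count is one way to arrive at error estimates of the kind plotted in Fig. 4. The observed counts per peptide number are not given in this section, so the sketch stops at the model side.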

Quantitation of protein concentrations

A critical parameter for detection and identification of proteins is the abundance or concentration of the protein and its isoforms. We generated a calibration curve for a set of sentinel proteins for which quantitative immunoassays were available. Four different immunoassay and antibody microarray methods were performed by four independent laboratories (DadeBehring, Genomics Institute of Novartis Foundation, Molecular Staging, and Van Andel Research Institute). A total of 323 assays measured 237 unique analytes (Haab et al. [22]). Where multiple assays measured the same analyte, we cannot be certain that the same epitopes were targeted. This approach permits assessment of systematic variation in protein concentration associated with blood preparation methods (serum and the three anticoagulation methods for plasma in each specimen set) and, after matching to IPI identifiers, facilitates an analysis of the dependence of MS-based protein identifications on concentration using the HUPO PPP specimens. Some proteins were at such low concentrations that they were undetectable even with immunoassay or microarray methods. After extensive curation, we matched 76 IPI proteins in the 9504 protein dataset (based on one or more peptides) and 49 proteins in the 3020 protein dataset (based on two or more peptides) to quantitative analytes. Fig. 1 in Haab et al. [22] shows four parameters used to determine the sensitivity of detection of these proteins as a function of immunoreactive concentration: number of labs reporting that protein, number of peptides on which protein IDs were based, percent coverage of the protein sequence, and score. The correlation coefficient between concentration and the total number of peptides matching a protein is r = 0.86 for the 3020 protein dataset and r = 0.90 for the 9504 protein dataset.

As expected, the most abundant proteins are the most readily detected, with essentially 100% agreement among laboratories; much less abundant proteins were identified only by the laboratories whose protocols and instruments were capable of much more sensitive detection. Among the 49 proteins matched to the 3020 protein dataset, 12 are biologically interesting proteins identified with measured concentrations from 200 pg/mL to 20 ng/mL (Tab. 3).
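The correlation between measured concentration and peptide count reported above is a Pearson coefficient, which can be computed as in the following minimal sketch. The function name is hypothetical, the use of log10 concentration is an assumption (the text does not say which scale was used), and the numbers in the example are invented placeholders, not PPP measurements.

```python
import math
from typing import Sequence


def pearson_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length samples."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two samples of equal length, at least 2 points")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return cov / math.sqrt(var_x * var_y)


# Invented placeholder values for illustration only (NOT data from the study).
concentrations_pg_per_ml = [2e2, 2e4, 1e6, 5e7, 4e10]
peptide_counts = [2, 3, 8, 20, 150]

# ASSUMPTION: concentrations are compared on a log10 scale because they span
# many orders of magnitude.
log_concentrations = [math.log10(c) for c in concentrations_pg_per_ml]
r = pearson_r(log_concentrations, peptide_counts)
print(f"r = {r:.2f}")
```

With real data, each pair would be one matched protein: its immunoassay concentration and the total number of peptides on which its identification was based.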

Tab. 3 Least abundant proteins identified with two or more peptides (included in core dataset) with measured concentrations in the range of 200 to 20 000 pg/mL serum or plasma

Concentration (pg/mL)

Alpha fetoprotein
PDGF-R alpha
Leukemia inhibitory factor receptor
Activated leukocyte adhesion mol
Selectin L
