Cross-laboratory comparison: confidence of the identifications

The distribution of the number of protein identifications among participating laboratories is shown in Fig. 3. Individual laboratories are encoded using their numeric identifiers. The 18 laboratories identified a total of 9504 distinct IPI proteins. The number identified by individual laboratories varied from 52 to 4569. The laboratories were asked to mark as "high confidence" those identifications that passed more stringent criteria. Each laboratory chose these criteria individually, although the PPP did issue guidance after the June 2004 Jamboree Workshop: SEQUEST searches should use Xcorr > 1.9, 2.2, and 3.75 for 1+, 2+, and 3+ ions, respectively, plus DCn > 0.1 and RSp < 4 for tryptic peptides. The number of these lab-reported high-confidence identifications ranged from 21 to 789. To further assess the confidence of protein identifications from individual laboratories, we counted the number of proteins that were also reported by a second laboratory. We considered such

Fig. 2 Comparison of distributions of length of tryptic peptides (dark gray bars), tryptic peptides with missed cleavages allowed (light gray bars), and all peptides, including nontryptic peptides (white bars) detected in the course of the project using MS/MS (both MALDI and LC) and FT-ICR-MS methods, to the distribution of the length of tryptic peptides from the complete IPI database (gray line).

Fig. 3 Distribution of MS/MS and FT-ICR-MS protein identifications among 18 participating laboratories, encoded using their numeric identifiers.

identifications to be confirmed. The fraction of confirmed identifications is higher for laboratories that submitted fewer proteins. Several factors may contribute: (1) Different stringencies for acceptance of the identifications - smaller sets may mean that more stringent criteria were used, so the resulting proteins are more likely to be true identifications. (2) Differences in experimental techniques - smaller sets of proteins may be obtained by shallower sampling, picking up only the more abundant, i.e., more frequently identified, proteins. (3) The intrinsic nature of the confirmation process - the more sensitive the procedures used by a particular laboratory, the more likely it is to be the only laboratory reporting a particular identification. Thus, the requirement for confirmation penalizes the laboratories that submitted the largest data sets.
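
The confirmation rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `confirmation_by_lab`, the laboratory labels, and the protein accessions are hypothetical, and the toy data are not taken from the project. A protein counts as confirmed when at least two laboratories reported it.

```python
from collections import defaultdict


def confirmation_by_lab(reports):
    """Per-laboratory confirmation statistics.

    reports: dict mapping a laboratory id to the set of protein
    accessions that laboratory reported (hypothetical input format).
    Returns {lab: (n_reported, n_confirmed, fraction_confirmed)}.
    """
    # Collect, for every protein, the set of laboratories that reported it.
    labs_per_protein = defaultdict(set)
    for lab, proteins in reports.items():
        for protein in proteins:
            labs_per_protein[protein].add(lab)

    result = {}
    for lab, proteins in reports.items():
        # Confirmed = also reported by at least one other laboratory.
        confirmed = {p for p in proteins if len(labs_per_protein[p]) >= 2}
        fraction = len(confirmed) / len(proteins) if proteins else 0.0
        result[lab] = (len(proteins), len(confirmed), fraction)
    return result


# Toy data (invented): lab_B reports the largest set.
toy = {
    "lab_A": {"P1", "P2"},
    "lab_B": {"P1", "P2", "P3", "P4", "P5", "P6"},
    "lab_C": {"P2", "P3"},
}
stats = confirmation_by_lab(toy)
for lab, (n_reported, n_confirmed, fraction) in sorted(stats.items()):
    print(lab, n_reported, n_confirmed, round(fraction, 2))
```

In this toy set, the laboratory with the largest submission has the lowest confirmed fraction simply because half of its proteins were seen by no one else, which mirrors factor (3).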

The level of cross-laboratory confirmation of the identifications, as a function of the number of peptides detected across experiments and laboratories, is shown in Fig. 4. The first category - all identifications - has a confirmation level of 25%. The second category excludes single-peptide identifications; this reduces the number of proteins dramatically, from the original 9504 to 3020, and at the same time raises the confirmation level to 75%. The absolute number of confirmed identifications in these two categories is virtually the same, meaning that of the 6484 single-peptide protein identifications almost none were confirmed. Limiting the identifications to those supported by an even larger number of peptides further reduces the number of proteins and raises the confirmation level.
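
The relationship between the peptide-count cutoff and the confirmation level can be illustrated with a short Python sketch. It assumes a hypothetical per-protein summary of (number of distinct peptides, number of reporting laboratories); the function name `confirmation_level` and the toy data are invented, with the toy proportions chosen to reproduce the 25% and 75% levels quoted above.

```python
def confirmation_level(proteins, min_peptides):
    """Confirmation level among proteins with >= min_peptides peptides.

    proteins: dict mapping accession -> (n_distinct_peptides, n_labs),
    a hypothetical integrated summary.
    Returns (n_kept, n_confirmed, level).
    """
    kept = [
        (n_pep, n_labs)
        for n_pep, n_labs in proteins.values()
        if n_pep >= min_peptides
    ]
    # Confirmed = reported by two or more laboratories.
    confirmed = sum(1 for _, n_labs in kept if n_labs >= 2)
    level = confirmed / len(kept) if kept else 0.0
    return len(kept), confirmed, level


# Toy data (invented): 8 unconfirmed single-peptide proteins and
# 4 multi-peptide proteins, 3 of which were seen by two or more labs.
toy = {f"S{i}": (1, 1) for i in range(8)}
toy.update({"M1": (2, 2), "M2": (3, 4), "M3": (5, 2), "M4": (2, 1)})

for cutoff in (1, 2, 3):
    print(cutoff, confirmation_level(toy, cutoff))
```

Because the single-peptide proteins contribute almost nothing to the confirmed count, removing them shrinks the denominator while leaving the numerator essentially unchanged, which is why the level rises so sharply.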

The analysis described above led us to categorize protein identifications into four classes, based on the level of identification confidence. The four categories are organized in a diamond-shaped parallelogram (Fig. 5). Identifications from the least stringent category - "all identifications" (9504 proteins) - are divided into two more stringent, parallel categories: "high-confidence identifications" (2857 proteins), including proteins reported at least once as high-confidence, and "multi-peptide identifications" (3020 proteins), including proteins for which two or more distinct peptides were reported project-wide, following data integration. The most stringent category, "high-confidence multi-peptide identifications" (1555 proteins), includes proteins from the intersection of the preceding categories. Proteins in this category are identified with two or more distinct peptides, at least one of which was reported as part of a high-confidence protein identification.
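
The four nested categories can be expressed as set operations. The sketch below is a simplification: it assumes each protein carries a distinct-peptide count and a single protein-level flag recording whether any laboratory ever reported it as high-confidence. The function name `categorize`, the dictionary keys, and the toy data are hypothetical.

```python
def categorize(proteins):
    """Assign proteins to the four confidence categories.

    proteins: dict mapping accession ->
        (n_distinct_peptides, ever_high_confidence)
    (hypothetical integrated summary; the flag is protein-level).
    Returns a dict of category name -> set of accessions.
    """
    all_ids = set(proteins)
    # Reported at least once as high-confidence by some laboratory.
    high_conf = {p for p, (_, hc) in proteins.items() if hc}
    # Two or more distinct peptides reported project-wide.
    multi = {p for p, (n_pep, _) in proteins.items() if n_pep >= 2}
    return {
        "all": all_ids,
        "high_confidence": high_conf,
        "multi_peptide": multi,
        # Most stringent class: intersection of the two parallel classes.
        "high_confidence_multi_peptide": high_conf & multi,
    }


# Toy data (invented).
toy = {
    "P1": (1, False),
    "P2": (1, True),
    "P3": (3, False),
    "P4": (2, True),
}
classes = categorize(toy)
for name, members in classes.items():
    print(name, sorted(members))
```

The categories are nested rather than mutually exclusive: every protein belongs to "all", and the most stringent class is contained in both intermediate classes.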
