Discussion

The PPP integration workflow is based on a heuristic: the protein identifications most likely to be true are those supported by the largest number of independent experiments. The strictness of the "independent experiment" criterion is gradually loosened in consecutive steps of the algorithm to select a single protein that represents a whole cluster of equivalent identifications.

Such an optimization approach, by its nature, may not always lead to the smallest set of proteins possible. For example, let us consider a simplified problem in which there are only six protein identifications in the database: A, B, C, D, E, and F. All of them are products of independent experiments. Furthermore, they are single-peptide identifications associated with distinct peptides a, b, c, d, e, and f, respectively. Searching for these peptide sequences in the protein database shows that the peptides can be found in three different proteins with overlapping sequences: p1, p2, and p3.

Fig. 7 depicts the problem: rows represent the three proteins, columns the six peptide identifications. If a particular peptide can be found in a specified protein, it appears in the appropriate row.

Scoring the proteins using the algorithm results in: p1 = 4 (four different identifications), p2 = 3 (three different identifications), and p3 = 2 (two different identifications). This leads to the following assignment of protein accession numbers to the identifications: ID A → p1, ID B → p1, ID C → p1, ID D → p1, ID E → p2, ID F → p3. Although it complies with the algorithm, the selection of protein p2 for identification E is not optimal from a mathematical point of view. If protein p3 were assigned instead of p2, the size of the set of proteins would reach its minimum. In a real experiment, the coincidence of overlapping protein sequences and the specific scoring conditions necessary to cause the algorithm to fail in this way is very rare. Processing a subset of the HUPO PPP MS/MS and FT-ICR-MS data, resulting in 9504 distinct protein identifications, caused the algorithm to fail (i.e., not to reach the minimum) in only ten cases.
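The failure mode above can be reproduced with a short sketch. This is an assumed, simplified rendering of the greedy scoring rule described in the text, not the PPP production code; the peptide-to-protein membership sets follow the Fig. 7 example.

```python
# Peptide membership per protein, as in the Fig. 7 example:
# p1 contains a-d, p2 contains c-e, p3 contains e-f.
members = {
    "p1": {"a", "b", "c", "d"},
    "p2": {"c", "d", "e"},
    "p3": {"e", "f"},
}
peptides = ["a", "b", "c", "d", "e", "f"]

# Score each protein by the number of identifications it could explain.
score = {prot: len(peps) for prot, peps in members.items()}

# Greedy rule: assign each identification to its highest-scoring protein.
assignment = {
    pep: max((p for p in members if pep in members[p]), key=score.get)
    for pep in peptides
}
greedy_set = set(assignment.values())
print(assignment["e"])  # 'p2' (score 3) wins over 'p3' (score 2)
print(greedy_set)       # all three proteins are selected

# The minimal cover, however, needs only two proteins:
# {p1, p3} already explains all six peptides.
covered = members["p1"] | members["p3"]
print(covered == set(peptides))  # True
```

As the text notes, peptide e is routed to the higher-scoring p2 even though choosing p3 would have made p2 unnecessary, so the greedy result has three proteins where two suffice.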

Maximizing the number of independent supporting experiments also biases the selection of representative proteins towards those with the longest sequence, as illustrated in Fig. 8. The algorithms used to construct the IPI database also systematically select longer precursor sequences in preference to shorter forms [9].
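The length bias follows directly from any count-based score. In the hypothetical sketch below (the peptide names are invented for illustration), a precursor that contains every identifying peptide of its cleavage products can never score lower than either product, so it is always the one selected.

```python
# Hypothetical peptide sets mirroring Fig. 8: the precursor carries all
# identifying peptides found in its two proteolytic products.
members = {
    "p1_precursor": {"x", "y", "z", "w"},  # full-length form
    "p2_product": {"x", "y"},              # one cleavage product
    "p3_product": {"z", "w"},              # the other cleavage product
}

# A count-based score must rank the precursor first, regardless of
# which form was actually present in the sample.
best = max(members, key=lambda p: len(members[p]))
print(best)  # p1_precursor
```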

A more sophisticated approach might incorporate additional sources of biological information in choosing a representative protein for each group. Sources of such information include protein annotation databases like GO [18] or HPRD [19]. We chose not to pursue this option because current annotation databases have limited coverage and might introduce historical biases into the protein identification process.

The integration algorithm seeks to assign the minimum number of proteins necessary to account for the observed peptide sequence lists. With no a priori knowledge of which proteins are present in the blood, an alternative, and equally valid, approach would be to list all proteins from which each peptide might have been derived. Fig. 9 compares the results of this latter approach with the integration algorithm presented above. Note that many proteins not selected by the integration algorithm may, nevertheless, have been the source of a large number of observed peptides.
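The contrast between the two tallies plotted in Fig. 9 can be sketched as follows. The data here are invented toy values, not PPP results: for each protein we count the distinct peptides the integration algorithm assigned to it (x axis) versus all observations of peptides that could have been derived from it (y axis).

```python
from collections import defaultdict

# Hypothetical peptide -> candidate-protein map, observation counts
# (laboratory x experiment x specimen), and integration-step output.
candidates = {"a": {"p1"}, "b": {"p1", "p2"}, "c": {"p2"}}
observations = {"a": 5, "b": 7, "c": 1}
assigned = {"a": "p1", "b": "p1", "c": "p2"}

identifying = defaultdict(int)  # distinct peptides assigned (x axis)
supporting = defaultdict(int)   # observations possibly derived (y axis)
for pep, prot in assigned.items():
    identifying[prot] += 1
for pep, prots in candidates.items():
    for prot in prots:
        supporting[prot] += observations[pep]

print(dict(identifying))  # {'p1': 2, 'p2': 1}
print(dict(supporting))   # p2 is credited with 8 observations (b and c)
                          # despite being assigned only one peptide
```

A protein can thus sit low on the x axis yet high on the y axis, which is exactly the population highlighted in the Fig. 9 caption.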

Fig. 7 Theoretical example presenting a situation where the integration workflow may not produce the minimal possible set of proteins.

Fig. 8 Length bias in representative protein selection. Shown in the figure are a precursor, p1, and two proteolytically cleaved products, p2 and p3. The precursor contains all the identifying peptides contained in the products. As a result, the integration algorithm will select the precursor independently of any other knowledge about which form might be present in the sample.

Concluding remarks

The pilot phase for the HUPO PPP is the first large-scale collaborative proteome project ever undertaken, and our experience highlights the challenges in data integration that are likely to be encountered in future high-throughput and collaborative proteomics studies. Several issues are identified.

A key decision was to define one recommended protein database and release, IPI 2.21 of July 2003, for all subsequent work in the project. Although this was not universally adhered to by all project participants, it simplified early data comparisons and later merging of results. However, the decision to standardize on IPI release 2.21 also complicated the annotation process. By the time the data-gathering phase of the project had concluded, this release was necessarily out of date. The process of mapping version 2.21 identifiers to version 3.01 identifiers proved to be challenging because of the large number and complex nature of the changes that have taken place in the underlying sequence collection.

We overestimated the laboratories' ability to use XML data formats. Although tools and support for XML were offered, the vast majority of laboratories chose to submit data in Word/Excel formats.

We underestimated the importance of collecting peak lists and raw spectra. The decision to collect data at the level of protein identifications rather than individual peptide identifications meant that information defined at the peptide level, such as peak lists and SEQUEST scores, was not collected.

Fig. 9 Number of identifying and supporting observations. This figure shows a scatterplot for all 15 695 proteins in IPI version 2.21 that contain at least one peptide observed in the project. The x axis is the number of distinct peptides assigned to a protein by the integration algorithm. The y axis is the number of distinct (laboratory x experiment x specimen) observations of a peptide which could have been derived from the protein. Note that for some proteins not selected or assigned only one peptide by the integration algorithm, a large number of supporting observations are present in the data set.

In order to use tools like PeptideProphet and ProteinProphet [15, 7] to assess the reliability of protein identifications, search results or complete sets of peak lists are required, including those which match with extremely low scores. At the inception of the project, the decision was to perform all data analysis at the participating laboratories and to submit only protein identifications to the central repository. The initial submission forms specified only a minimal set of supporting data. As the project progressed and the data repository group assumed more responsibility for quality assurance, we requested more supporting data from the contributing laboratories including mass spectrum peak lists and full binary data files.

The decision to request a pilot round of data submissions proved invaluable in allowing the data repository group to assess the data and identify the problems described above. As a result of this pilot round, significant changes were introduced during the project's operation. As a consequence, the data collection/integration center had to deal with data formatted according to both the old and the new protocols, but the final product of the project was greatly enhanced.

A revised database schema for future projects has been developed; this more extensive, finer-grained schema will better serve the future needs of the PPP, and will also serve as the core for schemata tailored to meet the requirements of other HUPO tissue projects (e.g., liver, brain). In this revised protocol, all entries, whether they contain new data or reanalysis of existing data, are assigned an accession number as a point of reference for use in the publications. The schema is straightforwardly extensible to accommodate additional technologies. For example, we are coordinating with project participants that generate quantitative data. Reliable quantitations, both relative and absolute, can come from a variety of methods such as differential gel electrophoresis, isotope tagging or chemical modification for MS, and protein array technologies [20].

There is also a need to "point outwards" to different resources, often done by creating a field to capture a Uniform Resource Identifier or URI (a generalized version of the familiar URL web address). Such resources include annotation resources such as UniProt (http://www.uniprot.org), EnsEMBL (http://www.ensembl.org), HPRD (http://www.hprd.org), and PeptideAtlas (http://www.peptideatlas.org) [21]. Importantly, URIs can also link to "raw" mass spectrum data repositories (the original output of a mass spectrometer scan as opposed to the heavily processed peak list); these data are increasingly in demand for in-depth analyses [22], but require special handling separate from the main project database, due to their size (see also Martens et al., this issue).

In addition to its main goals of beginning the map of the human plasma proteome and assessing the power of different techniques to resolve proteins, the HUPO-PPP pilot phase has generated an extensive "real world" collection of data that will be invaluable in developing and testing enhanced software tools for proteomics. Both the structure of the revised schema and the experience gained in the pilot phase of the PPP will contribute to other HUPO proteome initiatives, in particular the Liver and Brain Proteome Projects, and the HUPO Proteomics Standards Initiative [23], which seeks to provide general standards for proteomics, both for the level of detail required when reporting work (the Minimum Information About a Proteomics Experiment, MIAPE) and the file format in which such information should be captured.