Data management was one of the key elements in the pilot phase of the HUPO Plasma Proteome Project (PPP). Data submission and collection approaches were defined collaboratively by the Bioinformatics and Technology Committees, and were extensively discussed at the PPP Workshop in Bethesda, USA, in July 2003.
* Originally published in Proteomics 2005, 13, 3246-3261
Ideally, experimental methods and the data generated by their execution would be fully described in a thoroughly decomposed manner, facilitating sophisticated searches and analyses. However, when dealing with the results from real experiments, multiple compromises must be made. The first concerns the level of detail that can be requested: while it is, in principle, desirable to have all methodological steps, parameters, data, and analyses described in full detail, many laboratories lack automated laboratory information management systems, and manual record keeping is laborious, limiting the granularity of information that can be captured. The second compromise concerns the degree to which experimental reports will be decomposed and structured by the submitter: from a long run of free text, as in a journal paper, to a fully annotated list of all the relevant items of information, arranged in an elaborate and well-specified hierarchy that captures the interrelationships of those items. It is notoriously difficult to automatically extract even the simplest information from free text [2, 3]. However, thoroughly classifying information for submission is burdensome. Indeed, developing standards, data definitions, forms or submission tools, and the associated documentation and training material is a substantial task. Third, the pilot phase of the PPP was designed to encourage individual laboratories to push the limits of their technologies to detect and identify low-abundance proteins; the Technology Committee was not able to define in advance all the parameters that emerged as desirable inputs for analysis in this broad, largely voluntary collaboration. The fourth compromise concerns the design and implementation of the data systems used to store the data at the central repository.
It is desirable for the central repository to retain as close a link as possible to the original submissions from the participating laboratories. However, this implies that the details of which data sets superseded earlier submissions, the exceptions encountered during data loading, and other particulars of submission processing must all be encoded in subsequent queries, complicating the task of writing and debugging software to analyze the data.
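The burden this places on downstream queries can be illustrated with a small sketch. The schema, table names, and values below are invented for illustration and do not reflect the actual PPP repository design; the point is only that every analysis query must itself exclude superseded submissions rather than reading from a pre-cleaned table.

```python
import sqlite3

# Hypothetical, simplified schema: one row per submission from a lab, where a
# later submission may supersede an earlier one for the same lab and specimen.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE submission (
        id         INTEGER PRIMARY KEY,
        lab        TEXT,
        specimen   TEXT,
        submitted  TEXT,
        supersedes INTEGER REFERENCES submission(id)  -- NULL if original
    )
""")
conn.executemany(
    "INSERT INTO submission VALUES (?, ?, ?, ?, ?)",
    [
        (1, "lab_A", "serum",  "2003-09-01", None),
        (2, "lab_A", "serum",  "2004-02-15", 1),    # corrected resubmission
        (3, "lab_B", "plasma", "2003-11-20", None),
    ],
)

# Any analysis query must explicitly filter out rows that a later
# submission has superseded -- this clause recurs in every query.
current = conn.execute("""
    SELECT id, lab, specimen FROM submission s
    WHERE NOT EXISTS (
        SELECT 1 FROM submission t WHERE t.supersedes = s.id
    )
    ORDER BY id
""").fetchall()
print(current)  # only the corrected lab_A row and the lab_B row remain
```

Keeping the raw submissions intact preserves provenance, but every consumer of the data pays the cost of this filtering logic.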
Finally, a compromise at the level of the overall project relates to the choice of sequence database used for analysis and whether to "freeze" on a particular release of that database. The results of protein identification by searching mass spectra against a database necessarily depend on the database being searched. Freezing on a particular protein sequence database release facilitates comparison of identification data sets, but it also prevents corrections and revisions to the protein sequence collection from being incorporated into the identification process. Further, freezing complicates the task of linking the findings of the current study to evolving knowledge of the human genome and its annotation, because many of the entries in the protein sequence database available at the initiation of the project have been revised, replaced, or withdrawn over the course of this project, and continue to be revised.
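The bookkeeping implied by a frozen release can be sketched as a simple remapping step. The accession identifiers and the mapping table below are invented for illustration; in practice such a table would be derived from the sequence database provider's release notes or cross-reference files.

```python
# Hypothetical mapping from accessions in a frozen sequence-database release
# to their status in a later release; values are invented for illustration.
remap = {
    "ACC_00001": "ACC_00001",  # entry unchanged between releases
    "ACC_00002": "ACC_09876",  # entry replaced by a revised record
    "ACC_00003": None,         # entry withdrawn; no current equivalent
}

def to_current(accessions):
    """Translate frozen-release accessions to the later release,
    dropping entries that were withdrawn in the interim."""
    return [remap[a] for a in accessions if remap.get(a) is not None]

ids = to_current(["ACC_00001", "ACC_00002", "ACC_00003"])
print(ids)  # the withdrawn accession is silently lost
```

Even this toy version shows the cost: identifications tied to withdrawn entries cannot be carried forward, so the frozen-release results gradually diverge from current genome annotation.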
The major aim of the pilot phase of the HUPO PPP was the comparison of protein identifications made from multiple reference specimens by all participating laboratories. An additional important aim was the development of an efficient method of data acquisition, storage, and analysis in such a large collaborative proteomics experiment. Here we describe the data management system developed during the pilot phase of the HUPO PPP.