Development of the data model

To encourage participation by laboratories, the data model focused on identifications of whole proteins as a high-level, concise description of experimental results, requiring a minimum of data input, transmission, and potential reformatting. The guidance specified the collection of protein accession numbers and names, a binary description of the confidence of each protein identification (high or lower), lists of identified peptides, and free-text descriptions of experimental protocols. Analysis of the preliminary results exposed a major problem with a data integration and validation process based exclusively on protein accession numbers: participating laboratories used not only different search databases but also different algorithms to assemble protein identifications from their database search results. Additionally, the estimation of identification confidence, which rested on search scores and each laboratory's own binary judgment, was inconsistent across laboratories. To address these problems, the original data model was enhanced to include the peak lists used to obtain the protein identifications and the raw spectra in the instrument's native format.
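The fields listed above can be pictured as a small record structure. The following is only an illustrative sketch in Python, not the schema actually used in the project: the class names, field names, and the `search_database` field are assumptions, and the accession numbers, peptide sequence, and file names are made-up placeholders. It also shows why integration by accession number alone can fail when two laboratories search different databases.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Set


class Confidence(Enum):
    """Binary confidence of a protein identification, as in the guidance."""
    HIGH = "high"
    LOWER = "lower"


@dataclass
class ProteinIdentification:
    accession: str                      # database-specific accession number
    name: str
    confidence: Confidence
    peptides: List[str] = field(default_factory=list)  # identified peptide sequences


@dataclass
class LabSubmission:
    lab_id: str                         # hypothetical laboratory identifier
    search_database: str                # assumption: recorded per submission
    protocol_description: str           # free-text experimental protocol
    identifications: List[ProteinIdentification] = field(default_factory=list)
    # Added in the enhanced data model:
    peak_list_files: List[str] = field(default_factory=list)
    raw_spectra_files: List[str] = field(default_factory=list)


def shared_accessions(a: LabSubmission, b: LabSubmission) -> Set[str]:
    """Accession numbers reported by both laboratories."""
    return ({p.accession for p in a.identifications}
            & {p.accession for p in b.identifications})


def shared_peptides(a: LabSubmission, b: LabSubmission) -> Set[str]:
    """Peptide sequences reported by both laboratories."""
    pep_a = {s for p in a.identifications for s in p.peptides}
    pep_b = {s for p in b.identifications for s in p.peptides}
    return pep_a & pep_b


# Placeholder data: the same peptide is assigned to differently named
# entries because the two laboratories searched different databases.
lab_a = LabSubmission(
    lab_id="lab_A", search_database="database_1",
    protocol_description="free text",
    identifications=[ProteinIdentification(
        "DB1:0001", "example protein", Confidence.HIGH, ["PEPTIDEK"])],
    peak_list_files=["lab_A_run1.peaks"],
)
lab_b = LabSubmission(
    lab_id="lab_B", search_database="database_2",
    protocol_description="free text",
    identifications=[ProteinIdentification(
        "DB2:A0001", "example protein", Confidence.LOWER, ["PEPTIDEK"])],
)

print(shared_accessions(lab_a, lab_b))  # no overlap by accession number
print(shared_peptides(lab_a, lab_b))    # overlap is visible at the peptide level
```

In this toy case the two submissions share no accession number even though they rest on the same peptide evidence, which is the kind of mismatch that motivated collecting peptide lists, peak lists, and raw spectra alongside the protein-level results.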

The expanded data model is generally consistent with recently proposed guidelines for publication of protein and peptide identification data [4]. Because our studies were started before these guidelines were published, our data collection decisions do not reflect all of the requirements proposed by Carr et al. [4]. Tab. 1 compares the guidance proposed in [4] with the information collected in the present study. The HUPO PPP data model consists of the following main objects:

