Identification of novel peptides using whole genome ORF search

In a particularly interesting application, States used the PPP database to enhance the annotation of the human genome itself [60]. The mass spectral data obtained by PPP investigators represent a resource for identifying novel and cryptic genes that may have been missed in previous annotations of the human genome. A total of 583 proteins in the 3020 protein set, including 185 identifications supported by three or more peptides, are not associated with genes in EnsEMBL. These are confident to highly confident experimental observations. The fact that they are not associated with known genes demonstrates that the annotation of the human genome remains incomplete.

To test the feasibility of this approach, we searched all ORFs using peak list data from six PPP laboratories (17, 30, 37, 41, 52, 55). NCBI human genome sequence build 33 was translated in all three reading frames on both strands; all non-redundant ORFs were assembled into chromosome-specific sequence collections. The open source tool X!Tandem [61] was used in these analyses. Peptide matches were accepted only if they were supported by multiple mass spectra and reached a threshold hyperscore of 30, requirements that greatly reduce the likelihood of false positive matches to ORFs. In all, 118 novel peptides were identified as highly probable matches to ORFs in the human genome not previously known to have protein products. This kind of protein-to-DNA mapping of the human genome is a notable bonus of the Plasma Proteome Project.
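The two computational steps described above, six-frame translation into non-redundant ORFs and filtering of peptide matches by hyperscore and spectral support, can be illustrated with a minimal Python sketch. This is not the pipeline used in the study. It rests on several assumptions: an ORF is taken to be any stop-to-stop stretch of at least `min_len` residues, "multiple mass spectra" is read as at least two spectra, and matches are represented as hypothetical `(peptide, hyperscore)` pairs, one per spectrum, rather than parsed from real X!Tandem output. The function names and the `min_len` and `min_spectra` parameters are illustrative only.

```python
from collections import Counter
from itertools import product

# Standard genetic code, built from the conventional TCAG ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of an upper-case DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna, frame):
    """Translate one reading frame (0, 1 or 2); unknown codons become 'X'."""
    codons = (dna[i:i + 3] for i in range(frame, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def six_frame_orfs(dna, min_len=10):
    """Collect non-redundant stop-to-stop ORFs from both strands.

    Assumption: an ORF is any run of residues between stop codons that is
    at least min_len long; no start codon is required.
    """
    dna = dna.upper()
    orfs = set()  # a set keeps the collection non-redundant
    for strand in (dna, reverse_complement(dna)):
        for frame in range(3):
            for segment in translate(strand, frame).split("*"):
                if len(segment) >= min_len:
                    orfs.add(segment)
    return orfs


def accept_peptides(matches, min_hyperscore=30.0, min_spectra=2):
    """Keep peptides seen in at least min_spectra spectra at or above the score cutoff.

    matches is an iterable of (peptide, hyperscore) pairs, one per spectrum
    (a hypothetical representation, not the X!Tandem output format).
    """
    support = Counter(pep for pep, score in matches if score >= min_hyperscore)
    return {pep for pep, n in support.items() if n >= min_spectra}
```

For example, `six_frame_orfs("ATGGCCTAA", min_len=2)` yields the short segments found in the six frames of that toy sequence, and `accept_peptides` discards any peptide seen in only one spectrum or scoring below 30. A real search would run chromosome by chromosome and pass the resulting sequence collections to the search engine.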

