Annotation of the HUPO PPP core datasets

From its inception, HUPO has intended that the Plasma Proteome Project facilitate extensive and innovative annotation of the human plasma and serum proteome. A large part of the Jamboree Workshop focused on collaborative annotation. Several papers in this issue report on those collaborations.

Ping et al. [56] emphasize use of peptide identification results from MS/MS to reveal cleavage of signal peptides, proteolysis within hydrophobic stretches in transmembrane protein sites, and PTMs. Using 2446 of the 3020 PPP proteins from IPI that matched EnsEMBL gene products, they highlight subproteomes comprising glycoproteins, low Mr proteins and peptides, DNA-binding proteins, and coagulation pathway, cardiovascular, liver, inflammation, and mononuclear phagocyte proteins. Surprises include 216 proteins matched by Gene Ontology to DNA binding and 350 to the nucleus, including histone proteins, suggesting detection of proteins released by apoptosis or other means of cell degradation. Using the Novartis Atlas of mRNA expression profiles for 79 human tissues, liver dominated as the source of the majority of proteins, although many of these proteins are also produced in other tissues. Many classic protein markers of leukocytes were not detected, including markers of B-cell, T-cell, granulocyte, platelet, and macrophage lineages, presumably because all are at low abundance with little shedding. In contrast, some quite low abundance proteins were found repeatedly, such as VCAM-1 and especially IL-6.

Signal peptide cleavage sites are generally predicted based on the presence of a hydrophobic stretch of amino acids flanked at one end by basic amino acids. Seeking experimental evidence for such cleavage sites, these authors focused on semi-tryptic peptides, presuming that the signal cleavage event does not involve trypsin in vivo. Such evidence may override database predictions, as, apparently, in the cited example of SERPINA3/alpha-1-antichymotrypsin. They also identified two previously unreported proteins that undergo regulated intramembrane proteolysis, one of which releases an extracellular immunoglobulin domain, which is a reason not to reject all immunoglobulin matches. The MS/MS spectra can be examined for evidence of unrecognized PTMs. Using the Osprey tool, they found an average of nearly six protein-protein interactions per protein for a subset of 652 proteins; if these proteins circulate as multi-protein complexes, they will be less likely to be cleared through the kidney glomeruli.
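The semi-tryptic criterion described above can be illustrated with a minimal Python sketch. It is not the procedure of Ping et al. [56]; it assumes the common simplified trypsin rule (cleavage after K or R, but not before P), and the function names and the toy sequence are hypothetical, chosen only for illustration. A peptide with a tryptic C-terminus and a non-tryptic N-terminus inside the protein marks a candidate position for a cleavage that occurred in vivo.

```python
def classify_cleavage(protein: str, peptide: str):
    """Classify a peptide by how many of its termini fit the trypsin rule.

    Returns (label, n_term_is_tryptic, start), where start is the 0-based
    position of the first occurrence of the peptide in the protein.
    Simplified rule (an assumption): trypsin cleaves after K or R,
    except when the next residue is P.
    """
    if not peptide:
        raise ValueError("peptide must not be empty")
    start = protein.find(peptide)
    if start < 0:
        raise ValueError("peptide not found in protein sequence")
    end = start + len(peptide)

    # A terminus that coincides with the protein terminus counts as tryptic.
    n_tryptic = start == 0 or (protein[start - 1] in "KR" and peptide[0] != "P")
    c_tryptic = end == len(protein) or (peptide[-1] in "KR" and protein[end] != "P")

    labels = {2: "tryptic", 1: "semi-tryptic", 0: "non-tryptic"}
    return labels[int(n_tryptic) + int(c_tryptic)], n_tryptic, start


def candidate_signal_cleavage(protein: str, peptide: str):
    """Return the putative cleavage position (length of the removed
    N-terminal segment) if the peptide is semi-tryptic with a non-tryptic
    N-terminus; otherwise return None."""
    label, n_tryptic, start = classify_cleavage(protein, peptide)
    if label == "semi-tryptic" and not n_tryptic:
        return start
    return None


# Invented toy sequence, not a real protein: a hydrophobic leader
# followed by a short "mature" region.
TOY_PROTEIN = "MALLLLLLAAGSAQDTEKLVR"

for pep in ("QDTEK", "LVR", "DTE"):
    print(pep, classify_cleavage(TOY_PROTEIN, pep)[0],
          candidate_signal_cleavage(TOY_PROTEIN, pep))
```

In practice a candidate site would need support from several independent spectra and from its position relative to the predicted hydrophobic stretch before it could be taken as evidence against a database annotation.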

Berhane et al. [57] focused on 345 proteins of particular interest for cardiovascular research. They classified the proteins into eight categories, most of which have relevance to other organ systems as well: markers of inflammation in cardiovascular disease, vasoactive and coagulation proteins, signal transduction pathways, growth and differentiation-associated, cytoskeletal, transcription, channels and receptors, and heart failure and remodeling-related proteins. Of particular interest was the detection for the first time in plasma of the ryanodine receptor, part of the intracellular calcium channel in cardiac (and skeletal) muscle, and smoothelin, a structural protein restricted to smooth muscle cells and co-localized with actin. They used the number of identified peptides as an indicator of abundance of the protein (as in Section 3.3, above); for the first two categories, about 50% of proteins were identified with fewer than ten peptides, whereas no proteins among transcription factors had more than ten peptides and 56% had the minimum of two peptides. No cardiac contractile proteins were identified, even though they are far more abundant than transcription factors or signaling proteins in the heart, suggesting that necrotic cell death and uncontrolled cell rupture had no part in the appearance of any of the detected proteins in the healthy donors studied.

Muthusamy et al. [58] used a Java 2 Platform literature search tool to facilitate manual curation of functional classes of proteins, starting with the PPP set of 3020 IPI proteins (2446 genes). They subjected protein and nucleotide sequences in NCBI to BLAST queries to identify splice isoforms; they report that 51% of the genes encoded more than one protein isoform (a total of 4932 products). A total of 11 381 single nucleotide polymorphisms involving protein-coding regions were mapped onto protein sequences.

The Core Dataset of 3020 proteins was annotated using Gene Ontology for subcellular localization, molecular processes, and biological functions, showing very broad representation of cellular proteins. Subcellular component classification of the 1276 IPI-3020 proteins included in GO showed a relatively high proportion of proteins from membrane compartments (26%), nuclei (19%), cytoskeleton (11%), and other cell sites (23%), compared with the expected predominance of secreted proteins ("traditional plasma proteins") (14%). GO analyses of molecular processes showed 39% binding, 28% catalytic, 7% signal transducer, 6% transporter, 4% transcription regulator, and 3% enzyme regulator. GO analyses of biological functions revealed 36% metabolism, 25% cell growth and maintenance, 5% immune response, 1% blood coagulation, and 1% complement activation. Examination of specific Gene Ontology terms against a random sample of 3020 from the human genome (Supplementary Fig. 1) shows some terms >3 SD from the expected line. Over-represented categories include extracellular, immune response, blood coagulation, lipid transport, complement activation, and regulation of blood pressure, as expected; on the other hand, surprisingly large numbers of cytoskeletal proteins, receptors, and transporters were also identified.
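One way to read the ">3 SD from the expected line" comparison is as a z-score for each GO term against random sampling from the background. The sketch below assumes a hypergeometric model (sampling without replacement), which the text does not confirm as the method actually used; the function name is hypothetical and the counts in the example are invented for illustration, not taken from the PPP data.

```python
import math


def go_term_z_score(observed: int, sample_size: int,
                    term_in_background: int, background_size: int) -> float:
    """Z-score of an observed GO term count versus random sampling.

    Assumes (as a modelling choice, not the documented PPP method) that
    the sample is drawn without replacement from the background, so the
    count follows a hypergeometric distribution.
    """
    p = term_in_background / background_size
    expected = sample_size * p
    # Finite population correction for sampling without replacement.
    fpc = (background_size - sample_size) / (background_size - 1)
    sd = math.sqrt(sample_size * p * (1 - p) * fpc)
    if sd == 0:
        raise ValueError("standard deviation is zero; z-score undefined")
    return (observed - expected) / sd


# Invented counts: a term carried by 100 of 1000 background proteins,
# seen 30 times in a sample of 100 (10 would be expected by chance).
z = go_term_z_score(observed=30, sample_size=100,
                    term_in_background=100, background_size=1000)
print("over-represented" if z > 3 else
      "under-represented" if z < -3 else "within 3 SD")
```

With many GO terms tested at once, a fixed 3 SD cutoff will flag some terms by chance alone, so a multiple-testing adjustment would normally accompany a screen of this kind.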

An InterPro analysis similarly compared the 3020 protein dataset with the fine-grained protein families and domains described for the full IPI v2.21 dataset of 56 530 human proteins (Supplementary Fig. 2). Over-represented domains include EGF, intermediate filament protein, sushi, thrombospondin, complement C1q, and cysteine protease inhibitor, while under-represented domains include zinc finger (C2H2, B-box, RING), tyrosine protein phosphatase, tyrosine and serine/threonine protein kinases, helix-turn-helix motif, and IQ calmodulin binding region, compared with frequencies in the entire human genome.

Of the 1297 proteins in the 3020 protein dataset that had identifiers in Swiss-Prot 44, 230 were annotated as transmembrane proteins. Another 25 have mitochondrial transit signals, and an N-terminal signal sequence occurred in 373 proteins. Putative PTMs were noted for 254, including 85 with phosphorylation and 45 with glycosylation sites. A separate analysis of nearly twice as many proteins based on EnsEMBL matches using the Human Protein Reference Database (Muthusamy et al. [58]) found 628 with a signal sequence, 405 with transmembrane domains, 153 with a total of 1169 phosphorylation events, and 112 with a total of 555 glycosylation events.

One of the aims of the HUPO initiatives, as noted in the Section, is to link organ-based proteomes (liver, brain) with detection of corresponding proteins in plasma, and with proteins that are mediators, or at least biomarker candidates, of inherited or acquired diseases. Using the Online Mendelian Inheritance in Man (OMIM), we found 338 of our 3020 IPI proteins that match EnsEMBL genes in OMIM, including RAG 2 for severe combined immunodeficiency (SCID)/Omenn syndrome, polycystin 1 for polycystic kidney disease (PKD), and BRCA 1, BRCA 2, p53, and APC for inherited cancer syndromes.

In the final article of this special issue, Martens et al. [59] describe the development and usefulness of the EBI PRoteomics IDEntifications database (PRIDE). The HUPO PPP dataset was the first large dataset to populate this database. The aim is to make publicly available data publicly accessible, in contrast to voluminous lists in printed articles or, more often now, on journals' websites, with custom layouts not suited to computer-based re-analysis. PRIDE offers an Application Programming Interface. In contrast, tables in PDF are described as notoriously difficult to extract. As noted, the PPP established a short-term solution with a relational database using a Microsoft Structured Query Language (SQL) server, which centralized all data collection and served as the testbed for the centralized, project-independent database that is now PRIDE. In turn, PRIDE has been designed with several features intended to facilitate future collaborative studies.
