Constructing a PPP database for human plasma and serum proteins

Data management for this project included guidance and protocols for data collection, then centralized integration, analysis, and dissemination of findings worldwide via a communications infrastructure. As described in great detail by Adamski et al. [7, 8], key challenges were integration of heterogeneous datasets, reduction of redundant information to minimal identification sets, and data annotation. Multiple factors had to be balanced, including when to "freeze" on a particular release of the ever-changing database selected for the PPP and how to deal with "lower confidence" peptide identifications. Freezing of the database was essential to conduct extensive comparisons of complex datasets and annotations of the dataset as a whole. However, it complicates the work of linking findings of the current study to evolving knowledge of the human genome and its annotation. Many of the entries in the protein sequence database(s) available at the initiation of the project or even the analytical phase were revised, replaced, or withdrawn over the course of the project, and continue to be revised. Our policies and practices anticipated the guidelines issued recently by Carr et al. [9], as documented by Adamski et al. [7].

The 18 participating laboratories using MS/MS or FT-ICR-MS submitted a total of 42 306 protein identifications using various search engines and databases to handle spectra and generate peptide sequence lists from the specimens analyzed.

Tab. 1 Protein identifications by lab, by specimen, and by methods

| Lab ID | Specimen | Depletion | Protein separation | Reduction/alkylation | Peptide separation | Mass spectrum | Search software | 3020 high confidence | 3020 lower confidence | Single peptide |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | b1-cit | aig | none | iam | rp/scx/rp | esi-ms/ms_decaxp | PepMiner | 61 | 39 | 12 |
| 1 | b1-edta | aig | none | iam | rp/scx/rp | esi-ms/ms_decaxp | PepMiner | 35 | 30 | 14 |
| 1 | b1-hep | aig | none | iam | rp/scx/rp | esi-ms/ms_decaxp | PepMiner | 50 | 38 | 13 |
| 1 | b1-serum | aig | none | iam | rp/scx/rp | esi-ms/ms_decaxp | PepMiner | 21 | 6 | 5 |
| 1 | b2-cit | aig | none | iam | rp/scx/rp | esi-ms/ms_decaxp | PepMiner | 57 | 37 | 12 |
| 1 | b2-hep | aig | none | iam | rp/scx/rp | esi-ms/ms_decaxp | PepMiner | 58 | 30 | 12 |
| 1 | b2-serum | aig | none | iam | rp/scx/rp | esi-ms/ms_decaxp | PepMiner | 59 | 31 | 12 |
| 1 | b3-serum | aig | none | iam | rp/scx/rp | esi-ms/ms_decaxp | PepMiner | 17 | 6 | 7 |
| 2 | b1-cit | none | cho affinity | iam | scx/rp | esi-ms/ms_qtof | SEQUEST | 165 | 79 | 94 |
| 2 | b1-serum | none | cho affinity | iam | scx/rp | esi-ms/ms_qtof | SEQUEST | 136 | 48 | 38 |
| 2 | nibsc | none | cho affinity | iam | scx/rp | esi-ms/ms_qtof | SEQUEST | 171 | 121 | 85 |
| 11 | b1-cit | none | cho affinity | iam | rp | esi-ms/ms_decaxp | SEQUEST | 59 | 4 | 9 |
| 11 | b1-edta | none | cho affinity | iam | rp | esi-ms/ms_decaxp | SEQUEST | 64 | 6 | 4 |
| 11 | b1-hep | none | cho affinity | iam | rp | esi-ms/ms_decaxp | SEQUEST | 62 | 9 | 15 |
| 11 | b1-serum | none | cho affinity | iam | rp | esi-ms/ms_decaxp | SEQUEST | 64 | 3 | 16 |
| 12 | b1-cit | aig | none | iam | rp/scx/rp | esi-ms/ms_deca | SEQUEST | 111 | 0 | 113 |
| 12 | b1-edta | aig | none | iam | rp/scx/rp | esi-ms/ms_deca | SEQUEST | 111 | 0 | 101 |
| 12 | b1-hep | aig | none | iam | rp/scx/rp | esi-ms/ms_deca | SEQUEST | 127 | 0 | 130 |
| 12 | b1-serum | aig | none | iam | rp/scx/rp | esi-ms/ms_deca | SEQUEST | 123 | 0 | 111 |
| 17 | b1-serum | aig | 1d sds | iam | rp | esi-ms/ms_lcq | SEQUEST | 50 | 19 | 7 |
| 21 | b1-cit | top6 | rotofor-ief/rp/1d-sds | iam | rp | esi-ms/ms_qtof | MASCOT | 40 | 0 | 1 |
| 21 | b1-cit | top6 | rotofor-ief/rp/1d-sds | none | none | maldi-ms/ms_abi4700 | MASCOT | 51 | 0 | 3 |
| 21 | b1-cit | top6 | rotofor-ief/rp/1d-sds | none | rp | esi-ms/ms_qtof | MASCOT | 39 | 0 | 1 |
| 21 | b1-edta | top6 | rotofor-ief/rp/1d-sds | iam | rp | esi-ms/ms_qtof | MASCOT | 40 | 0 | [missing] |
| 21 | b1-edta | top6 | rotofor-ief/rp/1d-sds | none | none | maldi-ms/ms_abi4700 | MASCOT | 51 | 0 | 3 |
| 21 | b1-edta | top6 | rotofor-ief/rp/1d-sds | none | rp | esi-ms/ms_qtof | MASCOT | 39 | 0 | 1 |
| 21 | b1-serum | top6 | rotofor-ief/rp/1d-sds | iam | rp | esi-ms/ms_qtof | MASCOT | 40 | 0 | 1 |
| 21 | b1-serum | top6 | rotofor-ief/rp/1d-sds | none | none | maldi-ms/ms_abi4700 | MASCOT | 51 | 0 | 3 |
| 21 | b1-serum | top6 | rotofor-ief/rp/1d-sds | none | rp | esi-ms/ms_qtof | MASCOT | 39 | 0 | 1 |
| 22 | b1-serum | top6 | 1d sds | iam | rp/scx/rp | esi-ms/ms_decaxp | SEQUEST | 277 | 0 | 161 |
| 24 | b1-serum | a | rp | iam | rp | esi-ms/ms_qtrap | MASCOT | 7 | 12 | 1 |
| 24 | b1-serum | none | rp | iam | rp | esi-ms/ms_qtrap | MASCOT | 17 | 21 | 3 |
| 26 | b2-cit | none | rotofor-ief/1d-sds | iam | rp | esi-ms/ms_qtof | MASCOT | 160 | 44 | 12 |
| 28 | b1-cit | ig | none | none | rp | esi-fticr | VIPER | 218 | 45 | 208 |
| 28 | b1-serum | ig | none | none | rp | esi-fticr | VIPER | 223 | 50 | 239 |
| 28 | b2-cit | ig | none | none | rp | esi-fticr | VIPER | 255 | 140 | 346 |
| 28 | b2-serum | ig | none | none | rp | esi-fticr | VIPER | 244 | 181 | 405 |
| 28 | b3-cit | ig | none | none | rp | esi-fticr | VIPER | 214 | 188 | 359 |
| 28 | b3-serum | ig | none | none | rp | esi-fticr | VIPER | 218 | 193 | 384 |
| 29 | b1-cit | top6 | none | iam | scx/rp | esi-ms/ms_decaxp | SEQUEST | 19 | 129 | 136 |
| 29 | b1-cit | top6 | none | iam | scx/rp/2mz | esi-ms/ms_decaxp | SEQUEST | 51 | 160 | 181 |
| 29 | b1-edta | top6 | none | iam | scx/rp | esi-ms/ms_decaxp | SEQUEST | 50 | 199 | 264 |
| 29 | b1-edta | top6 | none | iam | scx/rp/2mz | esi-ms/ms_decaxp | SEQUEST | 82 | 491 | 557 |
| 29 | b1-hep | top6 | none | iam | scx/rp | esi-ms/ms_decaxp | SEQUEST | 26 | 97 | 122 |
| 29 | b1-serum | top6 | none | iam | scx/rp | esi-ms/ms_decaxp | SEQUEST | 90 | 338 | 432 |
| 29 | c1-cit | top6 | none | iam | scx/rp/2mz | esi-ms/ms_decaxp | SEQUEST | 82 | 449 | 517 |
| 29 | c1-edta | top6 | none | iam | scx/rp/2mz | esi-ms/ms_decaxp | SEQUEST | 72 | 555 | 610 |
| 29 | c1-hep | top6 | none | iam | scx/rp/2mz | esi-ms/ms_decaxp | SEQUEST | 82 | 227 | [missing] |
| 29 | c1-serum | top6 | none | iam | scx/rp/2mz | esi-ms/ms_decaxp | SEQUEST | 97 | 519 | 570 |
| 29 | nibsc | top6 | none | iam | scx/rp/2mz | esi-ms/ms_decaxp | SEQUEST | 82 | 371 | 432 |
| 33 | nibsc | top6 | ffe/rp | none | rp/ziptip | maldi-ms/ms_qstar | Digger | 54 | 0 | 0 |
| 33 | nibsc | top6 | ffe/rp | none | rp/ziptip | maldi-ms/ms_qstar | MASCOT | 58 | 0 | 3 |
| 34 | b1-hep | top6 | zoom-ief/1d-sds | iam | rp | esi-ms/ms_decaxp | SEQUEST | 123 | 148 | 146 |
| 34 | b1-serum | top6 | zoom-ief/1d-sds | iam | rp | esi-ms/ms_ltq | SEQUEST | 427 | 741 | 1172 |
| 40 | b1-hep | none | aig affinity/rp | iam | scx/rp | esi-ms/ms_lcq | Sonar | 160 | 253 | 185 |
| 41 | b1-cit | none | gradiflow/tca | none | scx/rp | esi-ms/ms_qstar | SEQUEST | 72 | 0 | 34 |
| 41 | b1-edta | none | gradiflow/tca | none | scx/rp | esi-ms/ms_qstar | SEQUEST | 62 | 0 | 16 |
| 41 | b1-hep | none | gradiflow/tca | none | scx/rp | esi-ms/ms_qstar | SEQUEST | 51 | 0 | 7 |
| 41 | b1-serum | none | gradiflow/tca | none | scx/rp | esi-ms/ms_qstar | SEQUEST | 76 | 0 | 27 |
| 41 | nibsc | none | gradiflow/tca | none | scx/rp | esi-ms/ms_qstar | SEQUEST | 53 | 0 | 1 |
| 43 | b1-cit | aig | none | iam | rp | esi-ms/ms_qtof | MASCOT | 26 | 0 | 0 |
| 43 | b1-edta | aig | none | iam | rp | esi-ms/ms_qtof | MASCOT | 31 | 0 | 0 |
| 43 | b1-hep | aig | none | iam | rp | esi-ms/ms_qtof | MASCOT | 37 | 0 | 0 |
| 43 | b1-hep | aig | none | iam | rp | maldi-ms/ms_abi4700 | MASCOT | 26 | 0 | 0 |
| 43 | b1-serum | aig | none | iam | rp | esi-ms/ms_qtof | MASCOT | 24 | 0 | 0 |
| 43 | nibsc | aig | none | iam | rp | esi-ms/ms_qtof | MASCOT | 21 | 0 | 0 |
| 46 | c1-serum | top6 | none | iam | rp | esi-ms/ms_ltq | SEQUEST | 185 | 522 | 571 |
| 55 | b1-cit | none | sax | iam | rp | esi-ms/ms_ltq | SEQUEST | 216 | 48 | 73 |

High and lower confidence

1. PepMiner results: score >80/100

2. ProteinProphet: high p > 0.95; lower 0.95 > p > 0.2

11. Xcorr > 1.5/2.0/2.5 for charge states +1/+2/+3. Tryptic cleavage rules. High confidence: two or more peptide IDs, or a single peptide ID manually inspected; the spectrum must show high signal and the top 3 ions must be assigned as either b or y. Otherwise, lower confidence.

12. PeptideProphet high confidence p > 0.35. All IDs reported as high confidence.

17. SEQUEST results: no-enzyme searches, acceptance criteria not stated. (For the automatic interpretation of fragment ion spectra the SEQUEST algorithm is used screening the NCBI protein database (weekly updated version)). The chosen parameters are: aver

21. MASCOT result; high confidence only: probability > 98%, numerous isoforms identified

22. SEQUEST result: Xcorr > 1.9/2.5/3.75 for charge states +1/+2/+3, no manual inspection, no other criteria used

24. MASCOT result. High confidence: if two or more peptides, each of them has to have MASCOT score > 20; if single peptide, it has to have MASCOT score > 30.

26. High confidence fully tryptic peptides: MASCOT individual peptide score > 21 or total score > 80; if single peptide hit, score > 60; if lower scores, manually inspected to check fragment ions and mass error.

28. Confidence is based on reproducibility of identification in triplicate analyses of a sample. High confidence = identification of AMT peptides for a given ORF in two or three of triplicate FT-ICR analyses. Lower confidence = identification of AMT peptides in only one of three FT-ICR analyses. VIPER and Q-Rollup software were used to match FT-ICR accurate masses to the AMT database.

29. High confidence: Xcorr > 1.9/2.2/3.75 (for charges +1/+2/+3), deltaCn > 0.1, and Rsp < 4. Lower confidence: Xcorr > 1.5/2.0/2.5 (for charges +1/+2/+3), deltaCn > 0.1

33. High confidence: Digger nxc > 0.3; MASCOT score > 15

34. High and lower confidence both used PPP stringent segment parameters of Xcorr > 1.9, 2.2, and 3.75; deltaCn > 0.1; Rsp < 4; high: two or more peptides; lower: one peptide.

40. Sonar results. High confidence: protein expect value < 1; lower confidence: protein expect value > 1

41. DTA Select results, criteria not stated, manually inspected

43. MASCOT results: protein p-value < 0.05 and at least one peptide with MASCOT score > 20.

46. High confidence: Xcorr > 1.9/2.2/3.75 (for charges +1/+2/+3), deltaCn > 0.1, and Rsp < 4; lower confidence: Xcorr > 1.5/2.0/2.5.

55. Identical sets of .dta files were searched using SEQUEST, Sonar and X!Tandem. SEQUEST criteria: Xcorr > 1.8/2.0/2.5 for charge states +1/+2/+3, deltaCn > 0.1, Sp < 200. X!Tandem criteria: expectation value < 0

These reports matched to 15 710 non-redundant entries (of which 15 519 were based on peptides with six or more amino acids) in the International Protein Index, which had been chosen as the standard reference database for this Project (IPI version 2.21, July 2003) [9]. We designed an integration algorithm that selected one representative protein among multiple proteins (homologs and isoforms) to which identified peptides gave 100% sequence matches. This integration process resulted in 9504 proteins in the IPI v2.21 database identified with one or more peptides. In this respect the PPP database is conservative: homologous proteins and all isoforms of a particular protein (and their corresponding genes) are counted just once, unless the peptide sequences actually differentiated additional matches. We included at this stage proteins identified by matches to one or more peptide sequences of "high" or "lower" confidence according to the cutpoints applied with the various search engines used for the different MS/MS instruments. Tab. 1 shows the details of the cutpoints or filters used by each investigator and the numbers of "high" and "lower" confidence protein IDs. All laboratories utilizing SEQUEST were asked to reanalyze their results using the PPP specified filters for "high confidence" identifications: Xcorr values > 1.9, 2.2, and 3.75 for singly, doubly, and triply charged ions, with deltaCn > 0.1 and Rsp < 4, for fully tryptic peptides; most did so. No equivalency rules were applied across all the search algorithms for all the cutpoints.
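The SEQUEST cutpoints can be written down as a small classifier. The following is only a minimal sketch: the `PeptideHit` record and the `classify` function are hypothetical names introduced here, the "high" tier uses the PPP-specified values quoted above (with Rsp < 4, as given in the notes to Tab. 1), and the "lower" tier uses the thresholds reported by Lab 29 rather than any project-wide rule.

```python
from dataclasses import dataclass

# Xcorr thresholds by precursor charge state (+1, +2, +3).
HIGH_XCORR = {1: 1.9, 2: 2.2, 3: 3.75}   # PPP-specified "high confidence" filter
LOWER_XCORR = {1: 1.5, 2: 2.0, 3: 2.5}   # "lower confidence" tier as reported by Lab 29
MIN_DELTA_CN = 0.1
MAX_RSP = 4                               # notes to Tab. 1 give Rsp < 4


@dataclass(frozen=True)
class PeptideHit:
    """Hypothetical record for one SEQUEST peptide-spectrum match."""
    sequence: str
    charge: int
    xcorr: float
    delta_cn: float
    rsp: int
    fully_tryptic: bool


def classify(hit: PeptideHit) -> str:
    """Return 'high', 'lower', or 'rejected' for a single peptide hit."""
    # Both tiers require a supported charge state and deltaCn > 0.1.
    if hit.charge not in HIGH_XCORR or hit.delta_cn <= MIN_DELTA_CN:
        return "rejected"
    # High tier: fully tryptic, charge-specific Xcorr, and Rsp below the cap.
    if (hit.fully_tryptic
            and hit.xcorr > HIGH_XCORR[hit.charge]
            and hit.rsp < MAX_RSP):
        return "high"
    # Lower tier: the more permissive charge-specific Xcorr only.
    if hit.xcorr > LOWER_XCORR[hit.charge]:
        return "lower"
    return "rejected"
```

A triply charged peptide with Xcorr 3.0 would, under these assumptions, fall into the "lower" tier because it clears 2.5 but not 3.75. Laboratories using other search engines applied their own criteria (Tab. 1), so nothing here generalizes beyond SEQUEST.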

However, Kapp et al. [11] provide such a cross-algorithm analysis for three specified false-positive rates using one laboratory dataset. Since the approaches and analytical instruments used by the various laboratories (Tab. 1) were far too diverse to utilize a standardized set of mass spec/search engine criteria, we created a relatively stringent defined set of protein IDs from the 9504 above by requiring that the same protein be identified with at least a second peptide. In a peptide chromatography run for MS, not all peaks are selected for MS/MS analysis, and the identification of peptide fragment ions is a low-percentage sampling process. Thus, additional analyses in the same lab and in other labs would be expected to enhance the yield of peptide IDs. Consequently, MS data from the individual laboratories were combined to increase the probability of peptide and protein identification. The use of different instrumentation with proprietary software and different search engines for identification made it unfeasible to apply a standard set of parameters to peptide sequences. Therefore, we required a minimum of two distinct peptides to be inferred from mass spectra and matched 100% to the database protein sequence, as a uniform criterion for a given protein to be considered identified.
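The two-distinct-peptide criterion amounts to pooling peptide identifications across laboratories and counting distinct sequences per protein. The sketch below assumes that each report has already been mapped to its representative IPI entry by the integration step; the function name, the tuple layout, and the accession and peptide strings are all made up for illustration.

```python
from collections import defaultdict


def core_proteins(reports, min_peptides=2):
    """Pool reports from all labs and keep proteins with enough distinct peptides.

    `reports` is an iterable of (lab_id, protein_id, peptide_sequence) tuples.
    Returns a dict mapping each retained protein_id to its set of peptides.
    """
    peptides_by_protein = defaultdict(set)
    for _lab_id, protein_id, peptide in reports:
        # A set collapses the same peptide reported by several labs.
        peptides_by_protein[protein_id].add(peptide)
    return {
        protein_id: peptides
        for protein_id, peptides in peptides_by_protein.items()
        if len(peptides) >= min_peptides
    }


# Made-up example: IPI_A has two distinct peptides from different labs,
# IPI_B has one peptide seen twice, so only IPI_A is retained.
example = [
    (1, "IPI_A", "PEPTIDEK"),
    (29, "IPI_A", "ANOTHERK"),
    (2, "IPI_B", "SINGLEK"),
    (34, "IPI_B", "SINGLEK"),
]
print(sorted(core_proteins(example)))
```

Because the pooling ignores which laboratory contributed a peptide, a protein seen with one peptide in each of two labs qualifies, which is the reason the text gives for combining the datasets.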

Of this total of 9504 protein IDs, 6484 were based on one peptide, while 3020 were based on two or more peptides (Tab. 2). That process generated the list of 3020 proteins (5102 before integration) which is utilized as our Core Protein Dataset for the HUPO PPP knowledge base. Full details with unique IPI accession numbers for each protein are accessible for examination and re-analysis at http://www.bioinformatics.med.umich.edu/hupo/ppp and www.ebi.ac.uk/pride. Fig. 2 shows the numbers of proteins identified with > n peptides with the percentage of those IDs confirmed in a second laboratory. Of these peptides, the vast majority were ten or more amino acids in length, with a median of 12.9 and a minimum of six amino acids in this dataset; the distribution of lengths is shifted to the right compared with the theoretical tryptic peptides from the total IPI database. The 3020 proteins represent a very broad sampling of the IPI proteins in terms of characterization by pI and by molecular weight of the transcription product (often a "precursor" protein).

Tab. 2 Protein identifications by lab and specimen, based on two or more peptides for each protein match, generating the PPP 3020 protein core dataset

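| Lab ID | nibsc | b1-cit | b1-edta | b1-hep | b1-serum | b2-cit | b2-hep | b2-serum | b3-cit | b3-serum | c1-cit | c1-edta | c1-hep | c1-serum | plasma | serum | both |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0 | 100 | 65 | 88 | 27 | 94 | 88 | 90 | 0 | 23 | 0 | 0 | 0 | 0 | 197 | 108 | 220 |
| 2 | 292 | 244 | 0 | 0 | 184 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 399 | 184 | 469 |
| 11 | 0 | 63 | 70 | 71 | 67 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 102 | 67 | 120 |
| 12 | 0 | 111 | 111 | 127 | 123 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 277 | 123 | 348 |
| 17 | 0 | 0 | 0 | 0 | 69 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 69 | 69 |
| 21 | 0 | 78 | 78 | 0 | 78 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 78 | 78 | 78 |
| 22 | 0 | 0 | 0 | 0 | 277 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 277 | 277 |
| 24 | 0 | 0 | 0 | 0 | 51 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 51 | 51 |
| 26 | 0 | 0 | 0 | 0 | 0 | 204 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 204 | 0 | 204 |
| 28 | 0 | 263 | 0 | 0 | 273 | 395 | 0 | 425 | 402 | 411 | 0 | 0 | 0 | 0 | 565 | 572 | 693 |
| 29 | 453 | 323 | 724 | 123 | 428 | 0 | 0 | 0 | 0 | 0 | 531 | 627 | 309 | 616 | 1576 | 867 | 1839 |
| 33 | 60 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 60 | 0 | 60 |
| 34 | 0 | 0 | 0 | 271 | 1168 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 271 | 1168 | 1251 |
| 40 | 0 | 0 | 0 | 413 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 413 | 0 | 413 |
| 41 | 53 | 72 | 62 | 51 | 76 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 113 | 76 | 137 |
| 43 | 21 | 26 | 31 | 43 | 24 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 51 | 24 | 52 |
| 46 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 707 | 0 | 707 | 707 |
| 55 | 0 | 264 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 264 | 0 | 264 |
| All labs | 679 | 1016 | 876 | 838 | 1749 | 568 | 88 | 470 | 402 | 419 | 531 | 627 | 309 | 1124 | 2580 | 2353 | 3020 |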

Fig. 2 Number of proteins identified as a function of number of peptides matched.

The PPP database permits future users to choose their own cut-points for subanalyses, including 2857 proteins identified at least once with "high confidence" criteria; 1555 proteins based on two or more peptides, at least one of which was reported as high confidence (from the intersection of the 3020 and the 2857); and 1274 proteins based on matching to three or more peptides.

Fig. 3 shows the methods used and the log of the number of proteins identified by the various laboratories. At the top of the figure are results with MALDI-MS. Four labs reported MALDI-MS without MS/MS for certain specimens. For example, Lab 22 analyzed all four samples of each of the B1, B2, and B3 specimens by MALDI-MS, and then used in-depth ESI-MS/MS Deca-xp for B1 serum only. Altogether there were 367 distinct protein IDs by MALDI-MS, of which 226 were confirmed by MS/MS or FT-ICR/MS in the core dataset of 3020 IPI proteins, while 141 were not so confirmed. The mean and median numbers of peptides for the confirmed proteins were significantly higher than for those not confirmed. The MALDI-MS data were not used in identifying the 3020 protein dataset or creating Fig. 2.

Fig. 3 Categorization of depletion, fractionation, and MS methods and yield of proteins identified (log scale).

The capillary LC-FT-ICR-MS results (Lab 28) were included. This method (Adkins et al. [12]) depends upon previous ion-trap MS/MS studies to generate a database of highly accurate mass and normalized elution time parameters for each peptide. Proteins in new specimens cannot be recognized if those proteins were not already detected and characterized in creating (and updating) the AMT database. Only 22% of 722 proteins identified across the six PPP specimens had more than one peptide match; ProteinProphet clustered these 722 into 377 non-redundant proteins. The LC-MS/AMT method has the potential to expedite analysis of large numbers of specimens once the mass tolerance is tightened, the elution times are made highly reproducible, and the AMT parameters are known for a very substantial number of true-positive peptides. Even then, however, samples of differing origin and complexity may have different PTMs and different elution times, limiting the usefulness of the AMT tags. At present, peptide coverage seems to be quite limited. However, powerful MS-FT-ICR-MS (MS3) combinations are being introduced [13]. Lab 28 contributed valuable data on serum/plasma comparisons. Adkins et al. [12] also demonstrated that their approach gives a rough quantitative estimation of protein concentrations based on average ion current for all the peptides identified for 18 particular proteins, correlated in log-log plots with nephelometric immunoassay results.
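The dependence of the AMT approach on a pre-built tag database can be illustrated with a toy matcher. This is a sketch under stated assumptions and not the VIPER or Q-Rollup implementation: the function names, the dictionary layout of the tags, and the tolerance values (5 ppm in mass, 0.02 in normalized elution time) are all hypothetical choices for illustration.

```python
def ppm_error(observed_mass, reference_mass):
    """Signed mass error of an observed feature in parts per million."""
    return (observed_mass - reference_mass) / reference_mass * 1e6


def match_amt(features, amt_db, mass_tol_ppm=5.0, net_tol=0.02):
    """Match LC-MS features to AMT tags by accurate mass and elution time.

    `features` is a list of (mass, net) pairs observed in a new specimen.
    `amt_db` is a list of dicts with keys 'peptide', 'mass', and 'net',
    built beforehand from MS/MS identifications.
    Tolerances are illustrative assumptions, not values from the study.
    """
    matches = []
    for mass, net in features:
        for tag in amt_db:
            mass_ok = abs(ppm_error(mass, tag["mass"])) <= mass_tol_ppm
            net_ok = abs(net - tag["net"]) <= net_tol
            if mass_ok and net_ok:
                matches.append(((mass, net), tag["peptide"]))
    return matches


# Made-up tags and features: only the first feature lies within both tolerances.
amt_db = [
    {"peptide": "TAG_ONE", "mass": 1000.000, "net": 0.30},
    {"peptide": "TAG_TWO", "mass": 1500.000, "net": 0.55},
]
features = [(1000.003, 0.31), (1000.020, 0.31), (1500.001, 0.70)]
print(match_amt(features, amt_db))
```

A feature whose peptide is absent from `amt_db` can never be reported, which mirrors the limitation noted above. Loosening either tolerance admits more matches at the cost of more ambiguous assignments, which is why the text stresses tighter mass tolerance and reproducible elution times.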

The most striking difference in MS was the comparison of LCQ-Deca XP+ ion trap (IT) and LTQ linear IT MS/MS instruments by Lab 34. The analyses were of two different specimens from the BD B1 set, using similar depletion, protein array pixelation prefractionation, and tryptic peptide fractionation (Tang et al. [14] this issue). LCQ analysis of B1-heparin-plasma yielded 575 IDs, while LTQ analysis of B1-serum yielded 2890 protein IDs, both with the PPP high-stringency SEQUEST filters. Many low abundance proteins in the low ng/mL to pg/mL range were identified. The comparison is complicated, however, by the fact that the protein identifications used different amounts of starting material. Depletion was applied to 193 µL (14.5 mg) of plasma and 415 µL (35.3 mg) of serum. After the fractionation steps, fractions equivalent to 0.6 µL (45 µg) of the plasma and 2.4 µL (204 µg) of the serum were analyzed in the LC-MS/MS. Thus, some or possibly most of the difference in yield may be attributable to a larger volume analyzed. There were some other differences as well, including use of protease inhibitors with the depletion buffer, higher DTT concentration, fewer Micro-Sol-IEF fractions, and data-dependent MS/MS scans of the three most abundant ions with the LCQ instead of ten ions in the LTQ B1-serum experiment. There were also some differences in the searching of databases with one (serum) versus two (plasma) missed cleavage sites permitted. Tang et al. [14] describe extensive sensitivity analyses of experimental parameters that affect the tradeoff between numbers of high confidence protein IDs and analysis time. For example, gas phase fractionation to analyze different segments of the m/z range in each run was judged to be inefficient.
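The caveat about starting material can be checked with a few lines of arithmetic. The sketch below reads the depletion volumes as microlitres and the analyzed amounts as microlitres and micrograms, the reading under which each quoted volume and mass pair gives a consistent protein concentration; the variable and function names are introduced here for illustration only.

```python
# Depletion input per specimen: (volume in microlitres, protein in milligrams).
depleted = {"plasma": (193.0, 14.5), "serum": (415.0, 35.3)}
# Material analyzed by LC-MS/MS: (volume in microlitres, protein in micrograms).
analyzed = {"plasma": (0.6, 45.0), "serum": (2.4, 204.0)}
# Protein IDs reported: LCQ on B1-heparin-plasma, LTQ on B1-serum.
protein_ids = {"plasma": 575, "serum": 2890}


def conc_mg_per_ml(volume_ul, protein_mg):
    """Protein concentration in mg/mL from a volume in microlitres and a mass in mg."""
    return protein_mg / (volume_ul / 1000.0)


for specimen in ("plasma", "serum"):
    dep_vol, dep_mg = depleted[specimen]
    ana_vol, ana_ug = analyzed[specimen]
    c_depleted = conc_mg_per_ml(dep_vol, dep_mg)
    c_analyzed = conc_mg_per_ml(ana_vol, ana_ug / 1000.0)
    print(f"{specimen}: {c_depleted:.1f} mg/mL depleted, {c_analyzed:.1f} mg/mL analyzed")

# Serum-to-plasma ratios for volume analyzed, protein analyzed, and IDs obtained.
volume_ratio = analyzed["serum"][0] / analyzed["plasma"][0]
mass_ratio = analyzed["serum"][1] / analyzed["plasma"][1]
id_ratio = protein_ids["serum"] / protein_ids["plasma"]
print(f"volume x{volume_ratio:.1f}, protein x{mass_ratio:.1f}, IDs x{id_ratio:.1f}")
```

On these figures the serum run analyzed about four times the volume and about 4.5 times the protein, against roughly five times as many protein IDs, so the numbers are compatible with the suggestion that much of the difference reflects the amount of material rather than the instrument alone.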

Labs 46 and 55 also employed LTQ instruments and obtained large numbers of identifications for reference specimens C1-serum and B1-citrate-plasma, respectively (Tables 1 and 2, Fig. 3).
