Don't let the statistician leave the biggest challenges with you; instead, challenge him

You have a short list and start screening it for a gene that you know is up-regulated. You have confirmed this in single-gene assays several times, and it can also be found in the literature. However, the gene is not in the list. Did the microarray experiment disprove the previous results? Or have you learned, once more, not to believe in microarrays? We think the most likely problem is the p-value filter: it was too harsh. You start looking for your pet gene in the complete ranked list, and you find it up-regulated, but somewhat below the cutoff line. May you enlarge the list and include all genes down to this one? Yes, but it comes at a price: the list will be contaminated by more false positives. How badly contaminated is it? This question does not lie on the main track of statistics, and it is controversial whether it is a good one. On the main track, you first define the standard, then the statistician produces the list of genes, and you are left with the challenge of interpreting it. Extending the list, as suggested above, assigns the jobs differently: now you define the list, and the statistician is left with the challenge of estimating its degree of contamination. Recently, this challenge was accepted by parts of the statistical community, and, not surprisingly, the first software tools such as SAM became very popular.
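A minimal numeric sketch of this back-of-the-envelope contamination estimate may help. All numbers below are invented for illustration: the p-values are simulated, the pet gene's rank is made up, and the ratio m·p/k is only the rough reasoning behind an FDR estimate, not the exact procedure of SAM or any other tool.

```python
# Sketch: estimating contamination of an extended gene list (simulated data).
import numpy as np

rng = np.random.default_rng(0)
m = 10_000                                  # total genes tested
pvals = rng.uniform(size=m)                 # placeholder p-values for null genes
pvals[:300] = rng.uniform(0, 0.001, 300)    # pretend 300 genes are truly regulated
order = np.argsort(pvals)                   # ranked gene list, most significant first

# Suppose the original cutoff kept the top 150 genes, but the pet gene
# sits at rank 220. Extend the list down to it:
k = 220
p_k = pvals[order][k - 1]                   # p-value at the new cutoff

# Rough estimate: under the null, each of the m genes has probability ~p_k
# of falling below the cutoff, so about m * p_k false positives are expected.
expected_fp = m * p_k
estimated_fdr = expected_fp / k             # expected proportion of false positives

print(f"Extended list of {k} genes: about {expected_fp:.1f} false positives "
      f"expected (estimated FDR of roughly {estimated_fdr:.2%})")
```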

Back to the question: how badly contaminated is the extended list? Or, in statistical terms, what is the expected proportion of false positives in it? Storey (14) gave a first answer by introducing the q-value. The q-value of a gene is, roughly, the estimated FDR of the list that includes all genes up to this gene. The main difference between the Benjamini-Hochberg approach and the Storey approach is that of controlling the FDR versus estimating it. In simpler words, it is the difference between specifying a tolerable FDR and letting the computer produce the list, and specifying the list and letting the computer estimate its FDR.
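To make the two directions concrete, here is a hedged sketch using the Benjamini-Hochberg step-up q-values on simulated p-values. The function name bh_qvalues, the FDR target, and the pet gene's index are all illustrative assumptions; this is not the SAM or Storey implementation, which additionally estimates the proportion of true null genes.

```python
# Sketch: controlling the FDR (fix the threshold, get the list) versus
# estimating it (fix the list, read off its q-value). Simulated p-values.
import numpy as np

def bh_qvalues(pvals):
    """Step-up q-values: q_i is the smallest FDR at which gene i enters the list."""
    m = len(pvals)
    order = np.argsort(pvals)
    ranked = pvals[order] * m / np.arange(1, m + 1)     # p_(i) * m / i
    q = np.minimum.accumulate(ranked[::-1])[::-1]       # enforce monotonicity
    out = np.empty(m)
    out[order] = np.clip(q, 0, 1)
    return out

rng = np.random.default_rng(1)
pvals = np.concatenate([rng.uniform(0, 0.001, 300),    # truly regulated genes
                        rng.uniform(size=9_700)])       # null genes
q = bh_qvalues(pvals)

# Direction 1 (control): define a tolerable FDR, the computer produces the list.
fdr_target = 0.05
gene_list = np.where(q <= fdr_target)[0]
print(f"FDR <= {fdr_target}: {len(gene_list)} genes selected")

# Direction 2 (estimate): define the list (down to the pet gene) and read off
# its estimated contamination from the q-value at that gene.
pet_gene = 250                                          # invented index
print(f"List extended to the pet gene: estimated FDR of roughly {q[pet_gene]:.2%}")
```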
