P-values

Now we proceed deeper into statistical analysis, describing the classical main roads first. You want to filter out genes that are not differentially expressed. The most widely used filters are p-values. P-value filters treat false positives (genes that are not differentially expressed but pass the filter) as dirt, and they follow firm standards on how much pollution is tolerable. They comply with these standards and keep pollutants out of your list. Be aware that the filter will not hesitate to absorb truly induced genes, too. This cannot be controlled directly. Fortunately, it is you who sets the standards, and thus you can indirectly calibrate the resulting list of genes.

How do these standards express themselves in statistical analysis? In our experience, the most widespread association with p-values is: 'They need to be below 0.05.' That is a standard of cleanness. What happens if we keep all genes with p < 0.05? For a non-induced gene, the chance of surviving this treatment is 5%. With 20 000 genes on the array and 19 500 of them non-induced, this leads to around 975 (= 19 500 x 0.05) false positives in the list. Is your standard of hygiene higher than that? In this case you can adjust the filter.

Note that the simple computation above depends on the number of non-induced genes on the chip. If the chip were much smaller, say 500 genes with 250 of them non-induced, you would only have to expect about a dozen false positives (250 x 0.05 = 12.5). That might be tolerable. In general, larger chips need stronger filters to achieve the same standards for clean gene lists.
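
The arithmetic above can be reproduced in a few lines of R. This is only a sketch: the helper names expected_false_positives and cutoff_for are hypothetical and not part of any package, and the number of non-induced genes is assumed to be known, which it never is in practice.

```r
# Expected number of false positives when keeping all genes with p < alpha.
# Hypothetical helper for illustration; assumes the number of non-induced
# genes is known.
expected_false_positives <- function(n_non_induced, alpha = 0.05) {
  n_non_induced * alpha
}

large_chip <- expected_false_positives(19500)  # about 975
small_chip <- expected_false_positives(250)    # 12.5

# Cutoff needed to expect at most 'target' false positives (hypothetical helper).
cutoff_for <- function(n_non_induced, target) {
  target / n_non_induced
}

# To expect only about a dozen false positives on the large chip,
# the cutoff has to drop from 0.05 to roughly 0.00064.
strict_alpha <- cutoff_for(19500, 12.5)
```

The last line illustrates why larger chips need stronger filters: the same expected number of false positives requires a much smaller cutoff.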

As genes are not independent of each other but connected through pathways and coregulation, we recommend using empirical p-values that are derived by permuting the condition label vector. The empirical p-value of a gene is the fraction of random scores, computed from permuted labels, that exceed the original score. To compute empirical p-values from, say, 10 000 permutations, type:

pvalue <- score$result$pvalue
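
To make the idea of an empirical p-value concrete, here is a minimal base R sketch that does not use the package call above. Everything in it is an assumption made for illustration: the data are simulated, the score is a plain difference of group means, the comparison is two-sided, and only 1000 permutations are drawn to keep the run short.

```r
set.seed(1)  # simulated data, for illustration only

n_genes <- 100
n_per_group <- 5
labels <- rep(c(0, 1), each = n_per_group)  # condition label vector
expr <- matrix(rnorm(n_genes * 2 * n_per_group), nrow = n_genes)
# Assumption: the first 10 genes are induced in condition 1.
expr[1:10, labels == 1] <- expr[1:10, labels == 1] + 3

# Score per gene: difference of group means (an assumed choice of score).
mean_diff <- function(x, lab) {
  rowMeans(x[, lab == 1, drop = FALSE]) - rowMeans(x[, lab == 0, drop = FALSE])
}

observed <- abs(mean_diff(expr, labels))

n_perm <- 1000
exceed <- numeric(n_genes)
for (b in seq_len(n_perm)) {
  # Shuffle the labels and recompute the scores for all genes.
  permuted <- abs(mean_diff(expr, sample(labels)))
  exceed <- exceed + (permuted > observed)
}

# Empirical p-value: fraction of random scores that exceed the original score.
emp_pvalue <- exceed / n_perm
```

Because the labels are permuted for all genes at once, the correlation structure between genes is preserved in every random score vector, which is the reason for preferring this approach when genes are not independent.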
