Two Group Analysis (Microarray)




Preprocessing


Normalization


For differential expression computation, B-Fabric by default applies the quantile normalization implemented in the Bioconductor package preprocessCore to make the samples comparable.
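The idea behind quantile normalization can be illustrated with a minimal Python sketch. This is not the preprocessCore implementation: the function name `quantile_normalize` and the example values are hypothetical, and ties are resolved by sort order instead of being averaged.

```python
def quantile_normalize(columns):
    """Quantile-normalize a list of equally long sample columns (illustrative only)."""
    n_probes = len(columns[0])
    # Sort the values of each sample independently.
    sorted_cols = [sorted(col) for col in columns]
    # Reference distribution: mean across samples at each rank.
    reference = [
        sum(col[rank] for col in sorted_cols) / len(columns)
        for rank in range(n_probes)
    ]
    normalized = []
    for col in columns:
        # Probe indices ordered by their value in this sample.
        order = sorted(range(n_probes), key=lambda i: col[i])
        out = [0.0] * n_probes
        for rank, probe in enumerate(order):
            # Each probe receives the reference value of its rank.
            out[probe] = reference[rank]
        normalized.append(out)
    return normalized


# Made-up log2 signals: three samples, four probes each.
samples = [
    [5.0, 2.0, 3.0, 4.0],
    [4.0, 1.0, 4.5, 2.0],
    [3.0, 4.0, 6.0, 8.0],
]
normalized = quantile_normalize(samples)
```

After this step every sample has the same value distribution, while the ranking of the probes within each sample is unchanged.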


Probe Selection


For completeness, the result table contains p-values for all probes. However, for probes whose signal was not above background, the p-values are not meaningful.

A probe is considered Present (above background) for the comparison if it is considered as Present in at least 50% of the samples of one condition. Whether a probe satisfies this criterion is reported in the result table in the column isPresent.
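The presence rule can be sketched as follows. The sketch assumes that a per-sample present call is already available for each probe (how that call is made is not covered here); the function name `is_present` and the example data are hypothetical.

```python
def is_present(calls_by_condition, min_fraction=0.5):
    """Return True if the probe is called present in at least
    `min_fraction` of the samples of at least one condition.

    calls_by_condition maps a condition name to a list of per-sample
    booleans (assumed input, one entry per sample).
    """
    for calls in calls_by_condition.values():
        if calls and sum(calls) / len(calls) >= min_fraction:
            return True
    return False


# Made-up per-sample calls for two probes.
probe_a = {
    "control": [True, False, False, False],
    "treated": [True, True, False, False],
}
probe_b = {
    "control": [True, False, False],
    "treated": [False, False, True],
}

probe_a_present = is_present(probe_a)  # 2 of 4 treated samples reach the 50% threshold
probe_b_present = is_present(probe_b)  # 1 of 3 in both conditions stays below it
```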

The FDR is computed only for probes that are present. For the remaining probes the FDR is set to NA.

It is recommended to follow up only on genes whose probes have a valid FDR.

Using probes with FDR values set to NA implies an increased risk of following up on false-positive changes. However, using probes with small changes may be warranted if the effect on gene expression is expected to be small (e.g. because of cell mixtures or low dosages).

Hypothesis Test Methods


The method used to compute the significance of the differential expression is indicated by the Method parameter in the resulting HTML report.

Available methods are:
  • limma: Uses the limma package in Bioconductor
  • paired limma: Uses the limma package in Bioconductor with a paired model
  • t-test: Student's t-test
  • paired t-test: Paired Student's t-test
  • Wilcox: Wilcoxon's rank-sum test

The methods above compute the p-values. The false discovery rate (FDR) reported in the result table is then derived from these p-values with the Benjamini-Hochberg procedure.
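As an illustration, the Benjamini-Hochberg adjustment can be sketched in a few lines of Python. This is a minimal sketch, not the code B-Fabric runs: the function name `benjamini_hochberg` is hypothetical, and `None` is an assumed stand-in for the NA given to probes that are not present, which are left out of the correction.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values.

    Entries that are None (assumed stand-in for probes that are not
    present) are excluded from the correction and stay None.
    """
    indexed = [(p, i) for i, p in enumerate(p_values) if p is not None]
    m = len(indexed)  # number of tests actually corrected
    adjusted = [None] * len(p_values)
    running_min = 1.0
    # Walk from the largest to the smallest p-value and keep a running
    # minimum so that the adjusted values are monotone in the p-value.
    for rank, (p, i) in zip(range(m, 0, -1), sorted(indexed, reverse=True)):
        running_min = min(running_min, p * m / rank)
        adjusted[i] = running_min
    return adjusted


# Made-up p-values; the third probe is treated as not present.
p_values = [0.01, 0.04, None, 0.03, 0.005]
fdr = benjamini_hochberg(p_values)
```

Because only the non-`None` entries are counted, the correction in this sketch is based on four tests, not five.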

Result File Format


The result is generated as a tab-separated text file that can be loaded into Excel. Make sure that annotation columns are loaded in "text" format; otherwise Excel may convert some gene symbols into dates or round integer Gene IDs or chromosomal coordinates.

The columns of the file are, from left to right:

Column Names | Column Values
--- | ---
Probe Identifier | ID of the microarray probe
Optional columns like: Gene Symbol, ... | Annotation
IsControl | whether the probe is designed to be a control for hybridization, labeling or gridding
log2 Signal | the average log2 hybridization signals of the two conditions compared
isPresent | whether the probe signal was above background in at least one of the conditions
log2 Ratio | log2 of the expression ratio of the two conditions
ratio | ratio of the average expression in the two conditions
pValue | significance value computed by the hypothesis test
fdr | false discovery rate (FDR) associated with the set of probes with this or higher significance. The FDR is "N/A" for probes where isPresent is FALSE and for probes that showed only a small amount of variation and fold-change in the entire data set
Avg of .... | average log2 expression values of the two conditions compared
column names that are sample names | the columns hold the normalized log2 expression of the probe in that sample
(Optional) Cluster | If clustering was performed, the color of the cluster the probe is part of. If a probe has an empty string here, the probe was not used for clustering


How to select candidate genes from the result table


The probes/genes in the result table are sorted according to p-value. If you want to select "promising" candidate genes, you should follow these rules:

  • do not use probes where the "IsControl" column shows TRUE; these are probes that served as hybridization or other controls
  • do not use probes where the "isPresent" column shows FALSE; these are probes whose signal did not go well above the background signal in any of the conditions (see the Probe Selection section for how the present status is computed)
  • do filter genes based on p-value; a commonly accepted cutoff is p=0.05. Also record the FDR that you obtain for this choice: it is the maximal value in the FDR column after filtering away everything with p>0.05. Reviewers will probably ask for this FDR value.
  • do filter genes based on log-ratio; a commonly accepted choice is a log2 ratio greater than 1 or below -1.
  • to be more stringent, you may also want to remove low expressors. This is in addition to the isPresent filtering. Depending on the platform, "low expressed" genes are those with log2 expression values of roughly 2 to 5.
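The rules above can be sketched in Python using only the standard library. The column headers follow the result file description. The example rows, the helper name `select_candidates`, the TRUE/FALSE spelling of the flags and the optional `min_log2_signal` cutoff are assumptions made for illustration.

```python
import csv
import io

# Hypothetical excerpt of a result file; all values are made up.
RESULT_TSV = (
    "Probe Identifier\tIsControl\tlog2 Signal\tisPresent\tlog2 Ratio\tpValue\tfdr\n"
    "probe_1\tFALSE\t9.1\tTRUE\t1.8\t0.0004\t0.01\n"
    "probe_2\tFALSE\t8.3\tTRUE\t-1.2\t0.03\t0.12\n"
    "probe_3\tTRUE\t11.0\tTRUE\t2.5\t0.0001\t0.005\n"
    "probe_4\tFALSE\t3.2\tFALSE\t2.0\t0.002\tNA\n"
    "probe_5\tFALSE\t7.7\tTRUE\t0.4\t0.01\t0.06\n"
    "probe_6\tFALSE\t4.1\tTRUE\t1.5\t0.02\t0.09\n"
    "probe_7\tFALSE\t8.8\tTRUE\t1.3\t0.2\t0.4\n"
)


def select_candidates(tsv_text, max_p=0.05, min_abs_log2_ratio=1.0,
                      min_log2_signal=None):
    """Return (candidate rows, FDR to report for the p-value cutoff)."""
    # csv keeps every field as a string, so identifiers and gene symbols
    # are never reinterpreted as dates or numbers.
    rows = list(csv.DictReader(io.StringIO(tsv_text), delimiter="\t"))

    # Rules 1-3: drop controls and absent probes, then filter on p-value.
    significant = [
        r for r in rows
        if r["IsControl"] != "TRUE"
        and r["isPresent"] == "TRUE"
        and float(r["pValue"]) <= max_p
    ]
    # FDR to record: the largest FDR left after the p-value filter.
    reported_fdr = max((float(r["fdr"]) for r in significant), default=None)

    # Rule 4: filter on the absolute log2 ratio.
    candidates = [
        r for r in significant
        if abs(float(r["log2 Ratio"])) > min_abs_log2_ratio
    ]
    # Rule 5 (optional): remove low expressors.
    if min_log2_signal is not None:
        candidates = [
            r for r in candidates
            if float(r["log2 Signal"]) >= min_log2_signal
        ]
    return candidates, reported_fdr


candidates, reported_fdr = select_candidates(RESULT_TSV)
strict_candidates, _ = select_candidates(RESULT_TSV, min_log2_signal=5.0)
```

With this made-up input, `candidates` holds probe_1, probe_2 and probe_6, the FDR to record is 0.12, and the stricter call additionally drops probe_6 because of its low signal.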