R-Server Analyses
The following analyses are available:
- Agilent QC
- Affymetrix QC
- Two Group Analysis
- ANOVA Analysis
- Exon Two Group Analysis
- Exon ANOVA Analysis
R Analysis Options can be passed in the name string of the analysis.
Advanced Options
R-Server analyses accept advanced options in the name of the analysis. The syntax is:
<analysis name> [<option name>=<option value> ....]
Spaces in an option must be escaped by a backslash.
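As an illustration of this syntax, the following R sketch splits an analysis name string into the analysis name and its options. It assumes that the square brackets are literal delimiters around the option list. parseAnalysisName is a hypothetical helper, not a function of the R-Server, and the option "comment" is made up to show an escaped space.

```r
# Hypothetical helper (not part of the R-Server): splits
# "<analysis name> [<option name>=<option value> ...]" into its parts.
# Assumption: the square brackets are literal characters in the name string.
parseAnalysisName <- function(nameString) {
  bracketPos <- regexpr("[", nameString, fixed = TRUE)
  if (bracketPos < 0) {
    # No option list present
    return(list(name = trimws(nameString), options = list()))
  }
  analysisName <- trimws(substr(nameString, 1, bracketPos - 1))
  optionString <- substr(nameString, bracketPos + 1, nchar(nameString))
  optionString <- sub("\\]\\s*$", "", optionString)
  # Protect backslash-escaped spaces, split on the remaining whitespace, restore
  placeholder <- "\001"
  optionString <- gsub("\\ ", placeholder, optionString, fixed = TRUE)
  tokens <- strsplit(trimws(optionString), "\\s+")[[1]]
  tokens <- tokens[nzchar(tokens)]
  tokens <- gsub(placeholder, " ", tokens, fixed = TRUE)
  # Split each token at its first "="
  eqPos <- regexpr("=", tokens, fixed = TRUE)
  options <- as.list(substr(tokens, eqPos + 1, nchar(tokens)))
  names(options) <- substr(tokens, 1, eqPos - 1)
  list(name = analysisName, options = options)
}

parsed <- parseAnalysisName("Two Group Analysis [testMethod=limma normMethod=quantile]")
escaped <- parseAnalysisName("ANOVA Analysis [comment=first\\ run]")
```

All values are returned as strings; converting them to the type of the corresponding config entry (logical, numeric) is left out of this sketch.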
File locations
dataRoot the root directory for the data files
config$dataRoot = "/srv/www/htdocs"
annoDir the directory where all chip annotation files are located
config$annoDir = "/srv/GT/reference/microarray/annotations"
htmlDir the output html directory
config$htmlDir = "."
cssFile the CSS file to be included in the header of the report
config$cssFile = "/usr/local/ngseq/bfab_scripts/bfabStyle.css"
aptCommand the command to launch the Affymetrix Power Tools (APT)
config$aptCommand = "/usr/local/ngseq/bfab_scripts/affymetrix/apt/Linux/bin/apt-probeset-summarize"
cdfRepository the directory with all the CDF files. The suffix .CDF of the files must be uppercase
config$cdfRepository = "/srv/GT/reference/microarray/cdfRepository"
gseaDatabaseDir the directory with the geneset databases for the GSEA analysis
config$gseaDatabaseDir = "/srv/GT/reference/microarray/annotations/GSEA/GeneSetDatabases"
jarDir the directory with the jar files for metacore connection
config$jarDir = "/usr/local/ngseq/bfab_scripts"
logFile the file where the starting and the finishing of jobs is reported
config$logFile = "/srv/GT/reference/microarray/jobLog.txt"
exonmapConfDir location info for exonmap databases
config$exonmapConfDir="/srv/GT/reference/microarray/.exonmap"
xmapcoreConfDir location of xmapcore databases
config$xmapcoreConfDir="/srv/GT/reference/microarray/.xmapcore"
Probe Detection Thresholds
useSigThresh boolean flag telling whether a probe must have a signal above a threshold in order to be considered present
config$useSigThresh = TRUE
sigThresh the signal threshold for a probe to be present, only used if useSigThresh is TRUE. The Agilent QC method overrides this threshold and sets it to 200.
config$sigThresh = 25
useDetectionPThresh boolean flag telling whether the detection p-value should be considered. Only used if the data actually has detection p-values, which is currently only true for MAS5-processed Affy data.
config$useDetectionPThresh = FALSE
detectionPThresh threshold for the detection p-value, see useDetectionPThresh
config$detectionPThresh = 0.05
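The following R sketch shows one way the four options above could combine into a per-probe present call. It is an assumption that both criteria are joined with a logical AND and that a detection p-value counts as passing when it is below the threshold; isPresent and the example matrix are hypothetical.

```r
# Hypothetical helper: present call per probe (row) and sample (column).
# Assumption: signal and p-value criteria are combined with AND.
isPresent <- function(signal, detectionP = NULL, config) {
  present <- matrix(TRUE, nrow = nrow(signal), ncol = ncol(signal),
                    dimnames = dimnames(signal))
  if (config$useSigThresh) {
    present <- present & signal > config$sigThresh
  }
  # Detection p-values are only considered if the data provides them
  if (config$useDetectionPThresh && !is.null(detectionP)) {
    present <- present & detectionP < config$detectionPThresh
  }
  present
}

config <- list(useSigThresh = TRUE, sigThresh = 25,
               useDetectionPThresh = FALSE, detectionPThresh = 0.05)
signal <- matrix(c(10, 30, 200, 25), nrow = 2,
                 dimnames = list(c("probe1", "probe2"), c("s1", "s2")))
presentCall <- isPresent(signal, config = config)
```

With the defaults, probe1 is present only in s2 and probe2 only in s1, because a signal of exactly 25 is not above the threshold.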
Input and Preprocessing Options
normMethod Normalization method to be used. Valid options are: "none", "quantile", "logMean", "vsn". Different analyses will use different default normalizations. The normalization used is reported in the HTML report.
config$normMethod = "quantile"
doBatchNorm whether batch normalization using the pamr package should be applied. This is only valid for experimental designs where the batch factor is balanced with respect to the other experimental factors. As a side effect, this normalizes each probe to mean 1, which means that signal thresholding cannot be used in combination with this flag. Requires a column called "batch" in the sample annotation that indicates the batch for each sample.
config$doBatchNorm = FALSE
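Whether a design is balanced can be checked before enabling doBatchNorm. The R sketch below uses a hypothetical sample annotation; only the column name "batch" comes from the description above. It treats a design as balanced when every combination of batch and treatment holds the same number of samples, which is one common reading of "balanced" and an assumption here.

```r
# Hypothetical sample annotation; only the "batch" column name is given by the text
sampleAnno <- data.frame(
  sample = paste0("s", 1:8),
  treatment = rep(c("control", "treated"), times = 4),
  batch = rep(c("A", "B"), each = 4)
)

# Cross-tabulate batch against the experimental factor
counts <- table(sampleAnno$batch, sampleAnno$treatment)

# Assumption: balanced means identical counts in every cell of the table
isBalanced <- length(unique(as.vector(counts))) == 1
```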
Input and Preprocessing Options: Agilent
AgilentSignalColumn the column to read from Agilent TXT files. For two-channel data the corresponding control signal column is also read. Common choices are "gMedianSignal", "gProcessedSignal", "LogRatio". For "LogRatio" the control column will be "gProcessedSignal".
config$AgilentSignalColumn = "gMedianSignal"
AgilentFlagColumn the column in the Agilent File that provides the present flags
config$AgilentFlagColumn = "gIsWellAboveBG"
loadAgilentFeatureData for Agilent files the loaded signal always holds one value per probe. Replicate spots/features are averaged. If this is true, the original feature values are also kept and stored in the rawData as featureSignal.
config$loadAgilentFeatureData = FALSE
useBarcodeAsReplicateId this will load the slide barcode and fill the replicate slot of the sample annotation with the barcode. Using this flag is deprecated.
config$useBarcodeAsReplicateId = FALSE
Input and Preprocessing Options: Affymetrix
AffyPreprocessing preprocessing method for Affy data, supported values are "rma" and "mas5"
config$AffyPreprocessing = "rma"
runMas5 flag whether additionally MAS5 processing should be run
config$runMas5 = FALSE
useFirstExonOnly flag whether for exon arrays only probe sets in the 5'-exon of genes should be used. Only for exploratory analysis checking for degradation
config$useFirstExonOnly = FALSE
useLastExonOnly flag whether for exon arrays only probe sets in the 3'-exon of genes should be used. Only for exploratory analysis checking for degradation.
config$useLastExonOnly = FALSE
useExonicOnly should only "exon"-targeting probe sets as defined by the xmap annotation database be used? Only relevant when using exon arrays. Speeds up processing and reduces false positives when only looking for expression in well-defined genes.
config$useExonicOnly = FALSE
removePoorAffyProbes should probes that have a signal above affyProbeSignalFilterThresh in fewer than minAffyPresentProbeValues samples be removed? Particularly useful for tiling arrays where many probes may have bad hybridization properties.
config$removePoorAffyProbes = FALSE
affyProbeSignalFilterThresh threshold for removal of "non-working" probes on Affy chips
config$affyProbeSignalFilterThresh = 32
minAffyPresentProbeValues number of samples in which a value above affyProbeSignalFilterThresh is required in order to keep a probe
config$minAffyPresentProbeValues = 3
minAffyProbeCount min number of probes to keep in a probe set, even if some of them are "non-working"
config$minAffyProbeCount = 3
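The probe filter controlled by removePoorAffyProbes can be sketched in R as follows. keepWorkingProbes and the example matrix are hypothetical, and the sketch deliberately leaves out the minAffyProbeCount rule, since it does not model which probes belong to which probe set.

```r
# Hypothetical helper: keep probes (rows) whose signal exceeds the threshold
# in at least minPresent samples (columns).
# Not modeled: minAffyProbeCount, which needs probe set membership.
keepWorkingProbes <- function(probeSignal, thresh = 32, minPresent = 3) {
  nAbove <- rowSums(probeSignal > thresh)
  probeSignal[nAbove >= minPresent, , drop = FALSE]
}

probeSignal <- rbind(
  good = c(100, 80, 50, 40),  # above 32 in 4 samples: kept
  poor = c(10, 40, 12, 8),    # above 32 in 1 sample: removed
  edge = c(33, 33, 33, 5)     # above 32 in exactly 3 samples: kept
)
filtered <- keepWorkingProbes(probeSignal)
```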
exonLevelCdf set of cdfs that define exon-level probe sets and for which exon-level analyses can be applied
config$exonLevelCdf = c("ratexonpm", "mouseexonpm", "exon.pm", "raex10stv1", "huex10stv2", "moex10stv1")
xmapCdf set of cdfs that are available in the xmap database
config$xmapCdf = c("ratexonpm", "mouseexonpm", "exon.pm")
exonProbeLevel which probes from Affy exon chips to use: "core", "extended", or "full"; this is ignored when using xmap CDFs
config$exonProbeLevel = "core"
Plotting Options
writeScatterPlots flag telling whether scatter plots should be drawn
config$writeScatterPlots = TRUE
logColorRange for log-ratio heatmap plots, the color range will be -logColorRange to +logColorRange, values outside this interval are clamped
config$logColorRange = 4
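Clamping to the color range can be illustrated with a short R sketch. The input values are made up, and the use of pmin/pmax is an assumption about how such clamping is typically written, not a statement about the R-Server code.

```r
logColorRange <- 4
log2Ratios <- c(-6.2, -1.5, 0, 3.9, 7.1)  # made-up example values

# Values outside [-logColorRange, +logColorRange] are set to the nearest bound
clamped <- pmin(pmax(log2Ratios, -logColorRange), logColorRange)
# clamped is -4.0 -1.5 0.0 3.9 4.0
```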
topGeneSize for the top gene QC correlation plots and sample clustering this number of probes with highest variance in the data set is used
config$topGeneSize = 100
maxGenesForClustering maximum number of genes to use for clustering; if there are more genes, only the most varying are used
config$maxGenesForClustering = 2000
minGenesForClustering minimum number of genes needed for a clustering
config$minGenesForClustering = 30
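How the two clustering limits interact can be sketched in R. selectGenesForClustering is a hypothetical helper; ranking by per-gene variance is taken from the description of maxGenesForClustering, while returning NULL for too few genes is an assumption made for the sketch.

```r
# Hypothetical helper: choose the genes (rows) that go into a clustering
selectGenesForClustering <- function(log2Signal, maxGenes = 2000, minGenes = 30) {
  if (nrow(log2Signal) < minGenes) {
    return(NULL)  # assumption: no clustering is done with too few genes
  }
  if (nrow(log2Signal) <= maxGenes) {
    return(log2Signal)
  }
  # Keep only the maxGenes genes with the highest variance across samples
  geneVar <- apply(log2Signal, 1, var)
  keep <- order(geneVar, decreasing = TRUE)[seq_len(maxGenes)]
  log2Signal[keep, , drop = FALSE]
}

set.seed(1)
x <- matrix(rnorm(50 * 4), nrow = 50)  # 50 simulated genes, 4 samples
selected <- selectGenesForClustering(x, maxGenes = 10, minGenes = 5)
```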
showGeneClusterLabels should gene labels be drawn on the clustering heatmap
config$showGeneClusterLabels = FALSE
plotDegradation should degradation plots be generated; only for Affymetrix Exon Data; plots value from the first exon versus values from the last exon
config$plotDegradation = FALSE
highVarThreshold for the heatmap showing the most varying genes; only genes where the standard deviation of the log2 values exceeds the threshold are used
config$highVarThreshold = 0.5
showTailEffects show the tail effects for miRNA arrays
config$showTailEffects = FALSE
Differential Expression Options
minimalLog2Effect threshold for prefiltering probes on variance before running hypothesis tests; see the hypothesis test section for an explanation
config$minimalLog2Effect = 0.3
pValueHighlightThresh only probes with a p-value below this threshold are highlighted in the plots
config$pValueHighlightThresh = 0.01
log2RatioHighlightThresh only probes with an absolute log2 ratio above the threshold are highlighted
config$log2RatioHighlightThresh = 0.5
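A minimal R sketch of the highlighting rule follows. It assumes that a probe must pass both thresholds to be highlighted; highlightProbes and the example values are hypothetical.

```r
# Hypothetical helper: TRUE for probes that would be highlighted.
# Assumption: both criteria must hold (logical AND).
highlightProbes <- function(pValue, log2Ratio, pThresh = 0.01, ratioThresh = 0.5) {
  pValue < pThresh & abs(log2Ratio) > ratioThresh
}

highlighted <- highlightProbes(pValue = c(0.001, 0.001, 0.2),
                               log2Ratio = c(1.2, 0.1, -2.0))
# highlighted is TRUE FALSE FALSE
```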
testMethod the test method to use in a two-group test; possible values: "t-test", "Wilcox", "limma"
config$testMethod = "t-test"
tukeyThresh for probes with ANOVA p-values below this threshold we compute the Tukey post-hoc tests
config$tukeyThresh = 0
Gene Set Analysis Options
runGO whether GO analysis should be run or not; will only be run if a probe to gene mapping is available
config$runGO = FALSE
runMetaCore whether MetaCore's pathway analysis should be run; will only be run if a probe to gene mapping is available
config$runMetaCore = FALSE
pValThreshGO only probes with a differential expression p-value below this threshold will be used as input for overrepresentation analysis
config$pValThreshGO = 1e-2
log2RatioThreshGO only probes with an expression change above this threshold will be used as input for overrepresentation analysis
config$log2RatioThreshGO = 0
pValThreshFisher only GO categories with a Bonferroni-Holm corrected p-value below this threshold will be shown
config$pValThreshFisher = 1e-4
pValThreshFisherKegg only KEGG pathways with a Bonferroni-Holm corrected p-value below this threshold will be shown
config$pValThreshFisherKegg = 1e-2
minCountFisher only GO categories that have at least this many genes are searched for overrepresented genes
config$minCountFisher = 3
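The interplay of minCountFisher and pValThreshFisher can be sketched in R with the base function p.adjust and its "holm" method. The category names, p-values and gene counts are made up, and it is an assumption that the size filter is applied before the multiple-testing correction.

```r
# Made-up Fisher test p-values and gene counts per category
fisherP <- c(categoryA = 1e-7, categoryB = 2e-5, categoryC = 0.003, categoryD = 0.4)
geneCount <- c(categoryA = 12, categoryB = 2, categoryC = 8, categoryD = 20)

minCountFisher <- 3
pValThreshFisher <- 1e-4

# Assumption: categories that are too small are dropped before the correction
tested <- geneCount >= minCountFisher

# Bonferroni-Holm correction over the tested categories only
adjusted <- p.adjust(fisherP[tested], method = "holm")
shown <- names(adjusted)[adjusted < pValThreshFisher]
# shown is "categoryA"
```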
runGSEA whether the Gene Set Enrichment Analysis (GSEA) should be run; is very slow
config$runGSEA = FALSE
pValThreshGsea only gene sets with a p-value below this threshold will be reported
config$pValThreshGsea = 1e-4
maxNumberGroupsDisplayed the maximum number of GO groups to show in the HTML tables
config$maxNumberGroupsDisplayed = 40
Output Options
writeAllProbes whether all probes should be written in the result of a test; if false
config$writeAllProbes = TRUE
doZip whether text files should be zipped
config$doZip = TRUE
writeAffyTxt whether the Affy Txt files produced by APT should be written
config$writeAffyTxt = FALSE
Annotation Options
probeAnnotationFromBioC the names of the probe annotation fields from Bioconductor packages that should be used, and how they should be renamed
config$probeAnnotationFromBioC = c("Gene Symbol"="SYMBOL", "Gene Description"="GENENAME", "Entrez Gene ID"="ENTREZID")
geneColumnSet the annotation columns that can be used to map probes to genes
config$geneColumnSet =c("Entrez Gene ID", "Gene Symbol", "Ensembl Gene ID", "Gene Symbol [Agilent]")
useAnnotationFromFile if annotations should be loaded from existing annotation files only; not trying to use Bioconductor packages or BioMart
config$useAnnotationFromFile = TRUE
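The named vector in probeAnnotationFromBioC maps Bioconductor field names (the values) to report column names (the names). The R sketch below shows how such a mapping could be applied to an annotation table; the data frame biocAnno and its contents are hypothetical stand-ins for what an annotation package would return.

```r
probeAnnotationFromBioC <- c("Gene Symbol" = "SYMBOL",
                             "Gene Description" = "GENENAME",
                             "Entrez Gene ID" = "ENTREZID")

# Hypothetical annotation table with made-up probes and genes
biocAnno <- data.frame(
  PROBEID = c("probe_1", "probe_2"),
  SYMBOL = c("geneA", "geneB"),
  GENENAME = c("hypothetical gene A", "hypothetical gene B"),
  ENTREZID = c("1001", "1002"),
  stringsAsFactors = FALSE
)

# Select the Bioconductor fields (vector values), then rename them (vector names)
probeAnno <- biocAnno[, probeAnnotationFromBioC, drop = FALSE]
colnames(probeAnno) <- names(probeAnnotationFromBioC)
rownames(probeAnno) <- biocAnno$PROBEID
```

The renamed columns "Entrez Gene ID" and "Gene Symbol" then match entries of geneColumnSet, so they can serve as probe to gene mappings.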
Other Options
printChips print some header information from the loaded Agilent files on stdout
config$printChips = TRUE
saveImage save an image at the end of the analysis?
config$saveImage = FALSE
saveRawData save an .RData file holding the rawData Object right after import
config$saveRawData = FALSE
subset for testing purposes, use only a subset of probes/reads to speed up the processing
config$subset = FALSE
NGS Options
readUnmapped if unmapped reads should be loaded from BAM files; specific methods will have their own defaults
config$readUnmapped = TRUE
multiMatch if multi-matching reads should be loaded; currently the only supported value is "all". Planned for the future: "unique", "random" (randomly take one of the hits), N (use all reads up to a multi-matching of N)
config$multiMatch = "all"
Choosing CDF Files for Affymetrix Data
Affymetrix data can be analysed using different groupings of probes into probe sets. The grouping is defined by CDF files. In addition to the standard Affymetrix CDF files, we also support files from the Brainarray group:
http://brainarray.mbni.med.umich.edu/Brainarray/Database/CustomCDF/genomic_curated_CDF.asp
3'-IVT Arrays
By default, the Affymetrix CDF is used; you can override this with the CDF option field in the analysis settings. If your CDF file is not available, please ask your FGCZ Bioinformatics contact to install it for you.
Exon arrays
For exon arrays, the data analysis can be done at the "exon" or at the "transcript" level. The available CDFs are:
Species | Exon level (exonmap) | Transcript level (Brainarray) |
Human | exon.pm | HuEx10stv2_Hs_ENSG |
Mouse | mouseexonpm | MoEx10stv1_Mm_ENSG |
Rat | ratexonpm | RaEx10stv1_Rn_ENSG |
The default choice of the CDF file depends on the analysis:
- Affymetrix QC: Brainarray version is used; because we're only looking at overall quality and sample similarity at gene expression level.
- Two Groups Analysis, ANOVA Analysis: Brainarray version is used
- Exon Two Groups Analysis, Exon ANOVA Analysis: exonmap version is used
List of manually installed CDF environments
Species | Environments |
Arabidopsis | ATH1121501_At_TAIRG, ATH1121501_At_TAIRT, atsschiptilingprobes, atsschipallprobes, atsschipathprobes, Atdschip_expr |
Manually generated annotation files
- rice.txt: for the Affymetrix Rice chip. Holds annotations extracted from the Affymetrix annotation file. The column "Entrez Gene ID" does not hold the Entrez Gene ID but a mixture of GenBank and TIGR IDs extracted from the "Target Description" column. This was done in order to have a usable gene column for GO analysis.
Preprocessing
Affymetrix
- 3'-IVT arrays, if a CDF file is available: B-Fabric uses the Affymetrix Power Tools implementation to run the RMA and MAS5 algorithms
- Exon arrays: B-Fabric uses Bioconductor's RMA implementation
KEGG Pathway analysis
We use the mapping of full species names to the three-character KEGG organism acronym:
ftp://ftp.genome.jp/pub/kegg/genes/taxonomy
Arrays such as the Arabidopsis ATH1 array are not supported because we do not have NCBI gene ID annotations for them.