% NOTE -- ONLY EDIT THE .Rnw FILE!!! The .tex file is % likely to be overwritten. % %\VignetteIndexEntry{Quality assessment for gated Flow Cytometry Data} %\VignetteDepends{flowWorkspace,hwriter} %\VignetteKeywords{QA,svg,flowCore} %\VignettePackage{QUALIFIER} \documentclass[11pt]{article} \usepackage{times} \usepackage{hyperref} \usepackage[authoryear,round]{natbib} \usepackage{times} \usepackage{graphicx} %\usepackage{subfigure} \textwidth=6.2in \textheight=8.5in \oddsidemargin=.1in \evensidemargin=.1in \headheight=-.3in \newcommand{\scscst}{\scriptscriptstyle} \newcommand{\scst}{\scriptstyle} \newcommand{\Rfunction}[1]{{\texttt{#1}}} \newcommand{\Rcode}[1]{{\texttt{#1}}} \newcommand{\Robject}[1]{{\texttt{#1}}} \newcommand{\Rpackage}[1]{{\textsf{#1}}} \newcommand{\Rclass}[1]{{\textit{#1}}} \title{QUALIFIER: Quality Assessment of Gated Flow Cytometry Data} \author{Mike Jiang,Greg Finak} \begin{document} \maketitle \begin{abstract} \noindent \textbf{Background} The current \Rpackage{flowQ} package does the quality assessment on the ungated FCM data.However,there is need to identify deviant samples by monitoring the consistencies of the underlying statistical properties of different gated cell populations(such as white blood cells,lymphacytes,monocytes etc). The current package was also not designed for dealing with large datasets. To meet these needs, We developed \Rpackage{QUALIFIER} package using the gating template created in flowJo and performing QA checks on different gated populations. It divides the data preprocessing from the actual outlier detection process so that the statistics are calcuated all at once and the outlier detections and visualization can be done more efficiently and interactively. \Rpackage{ncdfFlow} is used to solve the memory limit issue for large datasets. \noindent \textbf{keywords} Flow cytometry,Quality Assessment, high throughput,svg,flowWorkspace,ncdfFlowSet \end{abstract} <>= library(QUALIFIER) @ \section{Parsing the QA gating template} Optionally, parsing xml workspace can be done in parallel mode. It can speed up the process for large datasets. <>= library(Rmpi) library(snowfall) library(ncdfFlow) @ \Rpackage{ncdfFlow} also needs to be loaded in order to support netCDF storage for flow data that will solve the limitation of memory issue, \Rfunction{parseWorkspace} function from \Rpackage{flowWorkspace} package is used to parse the flowJo workspace. If Rmpi and snowfall package are loaded and it will automatically switch to the parallel mode and \Robject{nslaves} arugment is used to specify the number of computing nodes used. <>= ws<-openWorkspace("~/QA_MFI_RBC_bounary_eventsV3.xml") G<-parseWorkspace(ws,execute=TRUE,isNcdf=TRUE,nslaves=6) saveNcdf("G","gatingHierarchy") save(G,file="gatingHierarchy/GS.Rda") @ The result G is a \Robject{GatingSet} containing multiple \Robject{GatingHierarchy} within which gated cell populations are stored. Note that this step is most time consuming especially for large datasets. So it is convienient to save the gatingset once the parsing is done so that it be loaded directly from disk later on for the further processing \section{Calculating the statistics} This is the second preprocessing step followed by parsing gating template from flowJo workspace. Firstly,we need to save the gating hierarchies are calculated and the sample annoation data (containing all the meta information about the FCS files and samples) into a global environment. <>= anno<-read.csv("~/FCS_File_mapping.csv")###read annotation data db<-new.env() saveToDB(db,G,anno) @ Then statistics of each gated population is extracted and saved in db.Again,\Rfunction{getQAStats} can be speeded up by running in parallel mode. It uses \Rpackage{parallel} package and automatically detection the number of computing nodes available. Optionally \Robject{nslaves} arugment can also be provided to manually specify the computing node. <>= library(parallel) getQAStats(db) ls(db) db$statsOfGS[1:5,] @ It is recommened to save all the preprocessed data to avoid the efforts of recomputing from the beginning: <>= save(db,file="ITNQASTUDY.rda")#save stats @ Once this is done,the more interactive quality assessment task can be performed based on the statistics extracted for each gated population. \section{Defining qaTasks} We provide a function to create a list of \Rclass{qaTask} objects by reading external csv spreadsheet containing descriptions of each QA task: <>= data("ITNQASTUDY") @ <>= checkListFile<-file.path(system.file("data",package="QUALIFIER"),"qaCheckList.csv.gz") qaTask.list<-makeQaTask(db,checkListFile) qaTask.list[1:2] @ This is a convenient way to construct multiple \Rclass{qaTask}s. Users can also create the individual \Rclass{qaTask} by using \Rfunction{new} method. \section{Quality assessment and visualiztion} The package provides two important methods:\Rfunction{qaCheck} and \Rfunction{plot} to perform quality assessment and visualize the QA results. They both use the information stored in \Robject{qaTask} object and the \Robject{formula}, which is given either explicitly by the argument or implicitly by the \Robject{qaTask} object. It is generally of the form y $\sim$ x | g1 * g2 * ... ,y is the statistics to be checked in this QA, It must be one of the four types: "MFI": Median Fluorescence Intensity of the cell population specified by \Rclass{qaTask}, "proportion": the percentage of the cell population specified by \Rclass{qaTask} in the parent population, "count": the number of events of the cell population specified by \Rclass{qaTask}, "spike": the variance of intensity over time of each channel ,which indicating the stability of the fluorescence intensity. x specifies the variable plotted on x-axis (such as date) in \Rfunction{plot} method. g1,g2,.... are the conditioning variables, which divide the data into subgroups and apply the outlier detection whitin each individual groups or plot them in different panels. They may also be omitted,in which case the outliers detection is peformed in the entire dataset. For example, RBC Lysis efficiency(percentage of WBC population) check is defined by \Rclass{qaTask} . <>= qaTask.list[["RBCLysis"]] @ According to the formula stored in \Rclass{qaTask}, it uses the statistical property "proportion" and groups the data by "Tube"(or staining panel). "RecdDt" is reserved for plotting purpose. Cell population is defined as "WBC\_perct" <>= qaCheck(qaTask.list[["RBCLysis"]],outlierfunc=outlier.cutoff,lBound=0.8) @ As we see,\Rfunction{qaCheck} reads all the necessary information about the gated data from \Rclass{qaTask} object. The only thing needs to be specified by end users is how the outliers are called. This is done by providing an outlier detection function \Rfunction{outlierfunc} that takes a numeric vector as input and returns a logical vector as the output. Here "outlier.cutoff" is used and threshold "lBound"("less than",using uBound for "larger than") is specified. <>= plot(qaTask.list[["RBCLysis"]]) @ By default all the data are plotted,argument "subset" can be used to visualize a small subset. <>= plot(qaTask.list[["RBCLysis"]],subset=Tube=='CD8/CD25/CD4/CD3/CD62L') @ With \Robject{scatterPlot} flag set as true and \Robject{subset} properly specified plot method can generate scatter plots for the selected FCS files, <>= plot(qaTask.list[["RBCLysis"]],subset=name=='06087181_F01_I010.fcs',scatterPlot=TRUE) @ \includegraphics{QUALIFIER-plot-subset2} x term in the formula is normally ignored in \Rfunction{qaCheck}.However,when "plotType" of the \Rfunction{qaTask} is "bwplot", it is used as the conditioning variable that divides the data into subgroups within which the \Rfunction{outlierfunc} is applied. <>= qaTask.list[["MNC"]] @ This qaTask detects the significant variance of MNC cell population percentage among aliquots,which have the same "coresampleid". Plot type of this object tells the method to group data by "coresampleid". <>= qaCheck(qaTask.list[["MNC"]],z.cutoff=1.5) @ Interquartile Range based outlier detection function is used to detect outliers <>= plot(qaTask.list[["MNC"]]) @ The red circles in the boxplot indicate the possible outlier samples and the box of red color indicates the entire sample group has significant variance and is marked as the group outlier. Again,with \Robject{scatterPlot} and \Robject{subset} arguments, scatter plots can be generated for the selected FCS files or sample groups, <>= plot(qaTask.list[["MNC"]] ,scatterPlot=TRUE ,subset=coresampleid==11730) @ \includegraphics{QUALIFIER-plot-MNC-scatter} We can also apply simple aggregation to the statisics through the formula. <>= qaTask.list[["BoundaryEvents"]] @ Here the default formula only extracts the "proportion" from each individual channel. In order to check the total percentage of boundary events of all channels for each fcs file, we can write a new formula by applying aggregation function "sum" to "proportion" and group the data by fcs file ("name" in this case). <>= qaCheck(qaTask.list[["BoundaryEvents"]] ,sum(proportion) ~ RecdDt | name ,outlierfunc=outlier.cutoff ,uBound=0.0003 ) @ And we still can visualize the results chanel by chanel. <>= plot(qaTask.list[["BoundaryEvents"]],proportion ~ RecdDt | channel) @ Another three examples: QA check of Fluorescence stability overtime using t-distribution based outlier detection function. <>= qaCheck(qaTask.list[["MFIOverTime"]] ,rFunc=rlm ,z.cutoff=3 ) plot(qaTask.list[["MFIOverTime"]] ,y=MFI~RecdDt|stain ,subset=channel%in%c('FITC-A') ,rFunc=rlm ,par=list(scales=list(y=c(relation="free"))) ) @ Note that the robust linear regression is applied in each group in order to capture the significant MFI change over time. The individual outliers within each group is detected based on the residue. \Robject{par} can be used to pass any lattice arguments to control the apparance of the plot. <>= qaCheck(qaTask.list[["spike"]] ,outlierfunc=outlier.t ,alpha=0.00001) plot(qaTask.list[["spike"]],y=spike~RecdDt|channel ,subset=Tube=='CD8/CD25/CD4/CD3/CD62L'&channel%in%c('FITC-A') ) @ When minitoring the total number of events for each tube, a pre-determined events number can be provided as the threshold to the qaCheck method. <>= tubesEvents<-read.csv(file.path(system.file("data",package="QUALIFIER"),"tubesevents.csv.gz"),row.names=1) tubesEvents<-QUALIFIER:::.TubeNameMapping(db,tubesEvents) @ <>= qaCheck(qaTask.list[["NumberOfEvents"]] ,formula=count ~ RecdDt | Tube ,outlierfunc=outlier.cutoff ,lBound=0.8*tubesEvents ) plot(qaTask.list[["NumberOfEvents"]] ,subset=Tube=='CD8/CD25/CD4/CD3/CD62L' ) @ \Robject{tubesEvents} could be a one-column data frame or a named list/vector. Threshold values are stored in the column or list/vecor and conditioning values stored in rownames or names of the list/vector. <>= tubesEvents @ <>= qaCheck(qaTask.list[["RedundantStain"]],z.cutoff=1) plot(qaTask.list[["RedundantStain"]] ,y=proportion~coresampleid|channel:stain ,subset=stain%in%c('CD8') ) @ \section{Creating quality assessment report} Besides the interactive visualization provided by \Rfunction{plot} method,we also provide one routine to generate all plots in one report.This function reads the QA results calculated by \Rfunction{qaCheck} and the meta information of each QA task provided in spreadsheet qaCheckList and generate the summary tables and svg plots. Svg plots provide tooltips containing the detail information about each sample as and hyperlinks of densityplot for each individual FCS file. <>= qaReport(qaTask.list,outDir="~/output",plotAll=FALSE) @ \Robject{plotAll} is the argument to control the plotting of the individual scatter plot for each FCS file. When TRUE, all the FCS files are plotted. If FALSE,only the FCSs marked as Outliers will be plotted. It can also be set to "none" meaning that no scatter plot will be generated, which provides the quick review of the html report. \section{Conclusion} By the formula-based \Rfunction{qaCheck} and \Rfunction{plot} methods,different QA tasks can be defined and performed in a generic way. And \Rfunction{plot} only reads the outliers detection results pre-calculated by \Rfunction{qaCheck}, which reduces the cost of interactive visualization. Two kinds of lattice plots are currently supported:xyplot and bwplot(boxplot),depends on the \Robject{plotType} in \Rfunction{qaTask} object. When the output path is provided by \Robject{dest}, the svg plot is generated. In svg plot, each dot or box (or only the one marked as outliers) is annotated by the tooltip or hyperlink.which further points to the individual density plot of the gated population. \clearpage %\bibliographystyle{plainnat} %\bibliography{cytoref} \end{document}