%\VignetteIndexEntry{Importing flowJo Workspaces into R} %\VignetteKeywords{flow cytometry, single cell assay, import} %\VignettePackage{flowWorkspace} \documentclass[11pt]{article} \usepackage{Sweave} \newcommand{\Rfunction}[1]{{\texttt{#1}}} \newcommand{\Robject}[1]{{\texttt{#1}}} \newcommand{\Rpackage}[1]{{\textit{#1}}} \newcommand{\Rclass}[1]{{\textit{#1}}} \title{flowWorkspace: A Package for Importing flowJo Workspaces into R} \author{Greg Finak } \begin{document} \maketitle \section{Purpose} The purpose of this package is to provide functionality to import relatively simple \textit{flowJo} workspaces into R. By this we mean, accessing the samples, groups, transformations, compensation matrices, gates, and population statistics in the \textit{flowJo} workspace, and replicating these using (primarily) \textit{flowCore} functionality. \section{Why Another flowJo Workspace Import Package?} There was a need to import \textit{flowJo} workspaces into R for comparative gating. The \textit{flowFlowJo} package did not meet our needs. Many groups have legacy data with associated flowJo XML workspace files in version 2.0 format that they would like to access using BioConductor's tools. Hopefully this package will fill that need. \section{Support} This package supports importing of \textbf{Version 2.0 XML workspaces only}. We cannot import \textbf{.jo} files directly. You will have to save them in XML workspace format, and ensure that that format is \textit{workspace version 2.0}. The package has been tested and works with files generated using flowJo version 9.1 on Mac OS X. XML generated by older versions of \textit{flowJo} on windows should work as well. We do not yet support \textit{flowJo}'s \textbf{Chimera} XML schema, though that support will be provided in the future. The package supports import of only a subset of the features present in a flowJo workspace. The package allows importing of sample and group names, gating hierarchy, compensation matrices, data transformation functions, a subset of gates, and population counts. BooleanGates are now supported by flowWorkspace. \section{Data Structures} The following section walks through opening and importing a flowJo workspace. \subsection{Loading the library} Simply call: <>= library(flowWorkspace) @ The library depends on numerous other packages, including \textit{graph, XML, Rgraphviz, flowCore, flowViz, RBGL}. \subsection{Opening a Workspace} We represent flowJo workspaces using \texttt{flowJoWorkspace} objects. We only need to know the path to, and filename of the flowJo workspace. <>= d<-system.file("extdata",package="flowWorkspaceData"); wsfile<-list.files(d,pattern="A2004Analysis.xml",full=T) @ In order to open this workspace we call: <>= ws<-openWorkspace(wsfile) summary(ws) @ We see that this a version 2.0 workspace file. It's location and filename are printed. Additionally, you are notified that the workspace file is open. This refers to the fact that the XML document is internally represented using 'C' data structures from the \Rpackage{XML} package. After importing the file, the workspace must be explicitly closed using \Rfunction{closeWorkspace()} in order to free up that memory. \subsection{Parsing the Workspace} With the workspace file open, we have not yet imported the XML document. The next step parses the XML workspace and creates R data structures to represent some of the information therein. Specifically, by calling \Rfunction{parseWorkspace()} the user will be presented with a list of \textit{groups} in the workspace file and need to choose one group to import. Why only one? Because of the way flowJo handles data transformation and compensation. Each group of samples is associated with a compensation matrix and specific data transformation. These are applied to all samples in the group. When a particular group of samples is imported, the package generates a \Rclass{GatingHierarchy} for each sample, describing the set of gates applied to the data (note: polygons, rectangles, quadrants, and ovals and boolean gates are supported). The set of GatingHierarchies for the group of samples is stored in a \Rclass{GatingSet} object. Calling \Rfunction{parseWorkspace()} is quite verbose, informing the user as each gate is created. The parsing can also be done non--interactively by specifying which group to import directly in the function call (either an index or a group name). An additional optional argument \texttt{execute=T/F} specifies whether you want to load, compensate, transform the data and compute statistics immediately after parsing the XML tree. <>= G<-parseWorkspace(ws,name=1,path=ws@path,isNcdf=FALSE,cleanup=FALSE,keep.indices=TRUE); #import the first group #Lots of output here suppressed for the vignette. @ When isNcdf flag is set TRUE,the data is stored in ncdf format on disk. <>= G @ We have generated a \Rclass{GatingSet} with $\Sexpr{length(G)}$ samples, each of which has 19 associated gates. Subsets of gating hierarchies can be accessed using the standard R subset syntax. At this point we have parsed the workspace file and generate the gating hierarchy associated with each sample imported from the file. The data have been loaded, compensated, and transformed in the workspace, and gating has been executed. The resulting \Rclass{GatingSet} contains a replicated analysis of the original flowJo workspace. <>= G<-lapply(G,function(x)execute(x)) @ We can plot the gating hierarchy for a given sample: <>= require(Rgraphviz) plot(G[[1]]) @ We can list the nodes (populations) in the gating hierarchy: <>= getNodes(G[[1]]) @ Note that the number preceding the period in the node names is just an identifier to help uniquely label populations in the gating hierarchy. It does not represent any information about population statistics. We can get a specific gate definition: <>= getGate(G[[1]],getNodes(G[[1]])[3]) @ We can extract the dimensions relating to a specific gate: <>= getDimensions(G[[1]],getNodes(G[[1]])[3]) @ We can extract vertices of a gate: <>= getBoundaries(G[[1]],getNodes(G[[1]])[3]) @ We can get the population proportion (relative to its parent) for a single population: <>= getProp(G[[1]],getNodes(G[[1]])[3]) @ Or we can retrieve the population statistics for all populations in the sample: <>= getPopStats(G[[1]]) @ We can plot the coefficients of variation between the counts derived using flowJo and flowCore for each population: <>= print(plotPopCV(G[[2]])) @ We can plot individual gates: note the scale of the transformed axes. <>= print(plotGate(G[[1]],getNodes(G[[1]])[6],lwd=2,cex=2)) @ If we have metadata associated with the experiment, it can be attached to the \Robject{GatingSet}. <>= d<-data.frame(sample=factor(c("sample 1", "sample 2")),treatment=factor(c("sample","control")) ) G@metadata<-new("AnnotatedDataFrame",data=d) pData(G); @ We can retrieve the subset of data associated with a node: <>= getData(G[[1]],getNodes(G[[1]])[3]); @ Or we can retrieve the indices specifying if an event is included inside or outside a gate using: <>= getIndices(G[[1]],getNodes(G[[1]])[3]) @ The indices returned are relative to the parent population (member of parent AND member of current gate), so they reflect the true hierarchical gating structure. If we wish to do compensation or transformation manually, we can retrieve all the compensation matrices from the workspace: <>= C<-getCompensationMatrices(ws); C @ Or we can retrieve transformations: <>= T<-getTransformations(ws) names(T) names(T[[1]]) T[[1]][[1]] @ <>= A<-names(T) B<-names(T[[1]]) @ \Rfunction{getTransformations} returns a list, each element of which corresponds to a transformation applied to a group of samples. The transformation is presented as a list of functions to be applied to different dimensions of the data. Above, the transformation is applied to all samples of the group and for each sample in the group, the appropriate dimension is transformed using a channel--specific function from the list. The list of samples in a workspace can be accessed by: <>= getSamples(ws); @ And the groups can be accessed by: <>= getSampleGroups(ws) @ The \texttt{compID} column tells you which compensation matrix to apply to a group of files, and similarly, based on the name of the compensation matrix, which transformations to apply. \subsection{Converting to flowCore Objects} You may want to convert the imported workspace into \Robject{flowCore} objects, such as workflows. We provide this functionality via the \Rfunction{flowWorkspace2flowCore} function. \Rfunction{flowWorkspace2flowCore} extracts the compensation matrices,transformation functions and all the gates from GatingHierarchies generated by flowWorkspace package and converts them to the respective views and actionItems of workFlows defined by flowCore package. It takes a gatingHierarchy, flowJoWorkspace or GatingSet as the input, and returns one or multiple workflows as the result, depending on whether the gating hierarchies for each sample (including gate coordinates) are identical. <>= wfs<-flowWorkspace2flowCore(G,path=ws@path); wfs @ \Rfunction{plotWf} plots the workflow tree <>= plotWf(wfs[[1]]) @ Finally, when we are finished with the workspace, we close it: <>= closeWorkspace(ws); ws @ \subsection{Exporting to FlowJo OSX 9.2} The \Rfunction{exportAsFlowJoXML} function can be used to export a \Robject{flowCore::workFlow} as an XML workspace for FlowJo 9.2 OSX. If flowWorkspace has been used to import an existing FlowJo workspace, \Rfunction{flowWorkspace2flowCore} can be used to obtain a \Robject{workFlow} for exporting. Currently this function can export one workFlow at a time. \subsection{Parallel Support} Parsing and gating can be time--consuming. This latest version (>1.0.0) of flowWorkspace supports parallelization via multicore, snowfall, and Rmpi. If multicore is loaded, or a snowfall cluster is initialized, flowWorkspace will use snowfall or multicore (in that order of preference) to parse the workspace. Parallel gating of the workspace can be performed by loading Rmpi and running parseWorkspace(). This corresponds to the execute() step of the parseWorkspace function. Rmpi is needed to handle concurrent reads/writes to the ncdfFlowSet file. Parallel gating / parsing will work with netCDF-backed data or if data is stored in RAM. \subsection{Deprecated Functionality} The following behaviour is no longer supported and has been replace by more extensive netCDF support via the ncdfFlow package. If you have particularly large data files (millions of events), then you won't want to keep the data around, nor the indices for gate membership. Instead, pass the options \texttt{cleanup=TRUE, keep.indices=FALSE} to the \Rfunction{execute()} function, and the data will be scrubbed after computing population statistics. With future improvements making use of the netCDF framework, and bitvector representations of population memberships; this will improve memory usage in high--throughput unsupervised analysis settings. \section{Troubleshooting} If this package is throwing errors when parsing your workspace, and you are certain your workspace is version 2.0, contact the package author. If you can send your workspace by email, we can test, debug, and fix the package so that it works for you. Our goal is to provide a tool that works, and that people find useful. \section{Future Improvements} We are working on support for flowJo XML workspaces exported from the Windows version of flowJo. Efforts are underway to integrate GatingSet and GatingHierarchy objects more closely with the rest of the flow infrastructure. \end{document}