--- title: "The ChIPanalyser User's Guide" author: "Patrick Martin" output: pdf_document: null keep_tex: yes number_sections: no html_document: default code_folding: show package: '`ChIPanalyser''' vignette: | %\VignetteIndexEntry{ChIPanalyser User's Guide for Genetic Algorithms} %\VignetteEncoding{UTF-8} %\VignetteEngine{knitr::rmarkdown} --- # Introduction - What is this package? ChIPanalyser provides a quick an easy method to predict and explain TF binding. The package uses a statistical thermodynamic framework to model the binding of proteins to DNA. The model assumes that there are four main driver to TF binding: * Chromatin State affinity * Binding Energy * Number of bound Molecules * Scaling Factor modulating binding affinity While binding energy is given by PWM matrices, the other parameters will be inferred by maximizing or minimizing a goodness of fit score between ChIP data and ChIPanalyser predictions. To do so, ChIPanalyser provides a genetic algorithm to infer optimal values for each parameter. # Methods - The Chromatin State Model The chromatin state model is described by the following equation derived from statistical thermodynamics: $$P(N,a,\lambda,\omega)_j = \frac{N \cdot a_{j} \cdot e^{(\frac{1}{\lambda} \cdot \omega_{j})}}{N \cdot a_{j} \cdot e^{(\frac{1}{\lambda} \cdot \omega_{j})}+ L \cdot n \cdot [a_{i} \cdot e^{(\frac{1}{\lambda} \cdot \omega_{j})}]_ {i}} $$ * $N$ , the average number of bound molecules * $a_{j}$ , chromatin state affinity at site j * $\omega$ , the binding energy required for a TF to bind to site $j$ - in the form of a Position Weight Matrix Score * $\lambda$ , a scaling factor for the Position Weight Matrix score * $L$ , the length of the genome of interest * $n$ , the ploidy level of the organism Chromatin state affinity is defined as the following: $$a_{j} = \sum_{k} \alpha_k \cdot c_{j}^k$$ with $\alpha$ the chromatin state affinity scores for a given TF and $c$ the chromatin state at site $j$. Shortly, the model above generates ChIP like profiles and describes the affinity of a TF to chromatin states. The affinity scores $a$, the number of bound molecules $N$, and lambda $\lambda$ are inferred by optimising the fit between predicted profiles and ChIP data using a genetic algortihm. # Using ChIPanalyser ## Loading data To demonstrate the use of ChIPanalyser and the genetic algorithm, we provide some internal data sets. We complement this data with external data source such a DNA sequence sets taken from the `BSgenome`package suit. ### ChIPanalyser First, we load ChIPanlyser and the internal data ```{r loading_ChIPanalyser, eval = TRUE, echo = TRUE} library(ChIPanalyser) # Input data data(ChIPanalyserData) # PFM Matrix PFM <- file.path(system.file("extdata",package="ChIPanalyser"),"BEAF-32.pfm") ``` ### External Data To reduce the size of the internal data size, we complement our data with the _Drosophila_ DNA sequence taken from the BSgenome package suit. ```{r bsgenome, eval = TRUE, echo = TRUE} library(BSgenome.Dmelanogaster.UCSC.dm6) DNASequenceSet <- getSeq(BSgenome.Dmelanogaster.UCSC.dm6) ``` ### Input data : What is it? Now we that have loaded some data, what exactly are we looking at? The environment should contain the following new objects: * *chip* - GRanges object containing ChIP scores. Essentially, ChIP data that will be used to _train_ our model. * *cs* - GRanges object containing Chromatin State (CS) information. Essentially, where each CS can be found in the genome. * *top* - GRanges object containing regions of interest. Esssentially, the genomic regions we will use for training and testing. * *PFM* - Path to file containing the Position Frequency Matrix. Essentially, this represents the binding affinity of our TF (here we are using BEAF-32) * *DNASequenceSet* - DNAStringSet containing the DNASequenceSet of the organism we are looking at. * *geneRef* - GRanges object containing gene reference taken from the (Genome UCSC website)[https://genome.ucsc.edu/] ## Setting Parameters The next step consitst in setting initial parameters to run the genetic algorithm. ```{r GA_params, eval = TRUE, echo = TRUE} # Number of individuals per generation pop <- 10 # Number of generations gen <- 2 # Mutation Probability mut <- 0.3 # Children - Number of ofspring passed to the next generation child <- 2 # Method - Goodness of fit metric used to optimise the Genetic algorithm method <- "MSE" ``` Please note that the parameters presented here are for the sake of the vignette. We recommend using the following parameters for a first trial with your own data. Please be advised that these parameters will depend on the nature of the data you are using. ```{r GA_params_recom, class.source = 'fold-hide' ,eval = FALSE, echo = TRUE} # Number of individuals per generation pop <- 100 # Number of generations gen <- 50 # Mutation Probability mut <- 0.3 # Children - Number of ofspring passed to the next generation child <- 10 # Method - Goodness of fit metric used to optimise the Genetic algorithm method <- "MSE" ``` We can also define which parameters we wish to optimise. In this example, we will optimise the number of bound molecules ($N$), lambda ($\lambda$), the PWM Threshold and 11 different chromatin states. ```{r opti_params, eval = FALSE, echo = TRUE} # Parameters to optimised params <- c("N","lambda","PWMThreshold", paste0("CS",seq(1:11))) ``` Alternatively, we can also set custom ranges for the parameters. We will use custom ranges here to reduce computational time. ```{r opti_params2,eval = TRUE,echo= TRUE} params_custom <- vector("list", 14) names(params_custom) <- c("N","lambda","PWMThreshold", paste0("CS",seq(1:11))) # vector in the format of min value, max value and number of values params_custom$N <- c(1,1000000,5) params_custom$lambda <- c(1,5,5) # Bound between 0 and 1 params_custom$PWMThreshold <- c(0.1,0.9,5) # Bound between 0 and 1 CS <- c(0,1,5) CS_loc <- grep("CS",names(params_custom)) for(i in CS_loc){ params_custom[[i]] <- CS } ``` ## Initializing ChIPanalyser ### Building Initial objects The first step of any ChIPanalyser anaylsis is to create parameter object. These object contain input paramters and hold basic data. Here, we will load the PFM as PWM and compute the Base Pair Frequency from the DNASequenceSet. ```{r gpp, eval = TRUE, echo = TRUE} GPP <- genomicProfiles(PFM=PFM,PFMFormat="JASPAR", BPFrequency=DNASequenceSet) ``` ### Generating a starting population For clairty, we will show what a starting population looks like. This step is not stricly required as the `evolve` function can take the number of individuals in the poppulation (defined above by `pop`) and the parameters to be optimised (defined above by `params` or `params_custom`). ```{r generateStartingPopulation, class.source = 'fold-hide', eval = TRUE, echo = TRUE} start_pop <- generateStartingPopulation(pop, params_custom) ``` ### Pre-processing ChIP data We will pre-proccess the ChIP data by reducing noise and converting GRanges to a ChIPscore object. This object contains normalised and smoothed ChIP scores that will be used to train and test the model. ```{r preprocess, eval = TRUE, echo = TRUE} chipProfile <- processingChIP(chip,loci = top) # Splitting data into training and testing # We recommend setting dist to 20/80. However, here we only have 4 loci. splitdata <- splitData(chipProfile,dist = c(50,50),as.proportion = TRUE) trainingSet <- splitdata$trainingSet testingSet <- splitdata$testingSet ``` ## Evolution We can run the Genetic algorithm provided by ChIPanalyser using a single function. The function will generate intermediate files that allow you to check the status of the algorithm while it is running. It should be noted that the intermediate files are updated at each generation. This method also provides a lambda database. This allows for a faster run time on larger data sets as all values for lambda are pre-computed. If you do not want intermediate files, set the `checkpoint` argument to `FALSE`. If you do not want to pre-compute lambda values, set the `lambda` argument to `FALSE`. For the purpose of this vignette, we will not save intermediate files. ```{r evolve, eval = TRUE, echo = TRUE} evo <- evolve(population = pop, DNASequenceSet = DNASequenceSet, ChIPScore = trainingSet, genomicProfiles = GPP, parameters = params_custom, mutationProbability = mut, generations = gen, offsprings = child, chromatinState = cs, method = method, filename = "This_TF_is_Best_TF", checkpoint = FALSE, cores= 1) ``` The output of this function returns a list of 3 elements: * *database* - a data frame with all parameters that have been computed through out each generation * *poppulation* - list containing the last generated population * *fitest* - list containing goodness of metric for best perfomring individuals ## Fitest of them all Once we have the best perfomring paramters, we can plug them into a single run of ChIPanalyser and use these parameters on a tesing set. ### Get fitest individual The first step is to extract the best performing individual and its associated traits from the population. ```{r getFit, echo = TRUE, eval = TRUE} SuperFit <- getHighestFitnessSolutions(evo$population, child = 1, method = method) single<-evo[["population"]][SuperFit] ``` ### Running ChIPanalyser with fitest individual We can use these parameters as input to the `singleRun` function to obtain the ChIP like profiles. This run represents the predicted run and TF binding affinity of our TF of choice. We will set fitness to `all`. This means that the function will return all goodness of fit metrics available. ```{r singleRun, eval = TRUE, echo = TRUE} # Set chromatin states for single run - create CS Granges with affinity scores cs_single <- setChromatinStates(single,cs)[[1]] superFit <- singleRun(indiv = single, DNAAffinity = cs_single, genomicProfiles = GPP, DNASequenceSet = DNASequenceSet, ChIPScore = testingSet, fitness = "all") ``` This final sections returns a list containing 3 elements: * *occupnacy* - genomicProfiles object containing occupancy scores for each region * *ChIP* - genomicProfiles object containing ChIP like scores for each region * *gof* - goodness of fit scores for each regions and mean scores over all regions. ## Plotting Finally, we plot the resulting profiles. ```{r plotOccup, eval = TRUE, echo = TRUE,fig.height= 8, fig.width=15,fig.show=TRUE} par(mfrow = c(2,1)) plotOccupancyProfile(predictedProfile = superFit$ChIP, ChIPScore = testingSet, chromatinState = cs_single, occupancy = superFit$occupancy, goodnessOfFit = superFit$gof, geneRef = geneRef, addLegend = TRUE) ```