--- title: "Introduction to ClustIRR" author: "Kai Wollek and Simo Kitanovski" date: "`r Sys.Date()`" output: BiocStyle::html_document vignette: > %\VignetteIndexEntry{Introduction to ClustIRR} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r setup, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>", fig.height = 5, fig.width = 5, fig.align = "center" ) ``` ```{r} library(ClustIRR) library(knitr) library(visNetwork) ``` # Introduction Adaptive immune immunity relies on diverse immune receptor repertoires (IRRs: B- and T-cell receptor repertoires) to protect the host against genetically diverse and rapidly evolving pathogens, such as viruses, bacteria, or cancers. The B- and T-cell receptor (BCR/TCR) sequence diversity originates in part due to V(D)J recombination, in which different germline-encoded genes are joined to form immune receptors. As a result of this process, practically every newly formed naive mature T cell and B cell is equipped with a distinct IR, and this allows them to recognize distinct sets of antigens. B-cells bind antigens directly via the complementarity determining regions (CDR) of their BCRs, and T-cells recognize antigenic peptides presented by major histocompatibility (MHC) molecules via the CDRs of their TCRs. Antigen recognition may lead to B/T cell activation, and in such a case, the cells start to proliferate rapidly, forming antigen-specific clones that are capable of mounting effective immune response. Recent studies have shown that similarity in TCR sequences implies shared antigen specificity between receptors. Hence, by clustering of TCR sequences of a repertoire derived by high-throughput sequencing (HT-seq), we can identify groups of TCRs with shared antigen specificity, which is essential for the development of cancer immunotherapies, vaccines, antiviral drugs, etc. This vignette introduces `r Biocpkg("ClustIRR")`, a computational method for clustering of IRRs. # Installation `r Biocpkg("ClustIRR")` is freely available as part of Bioconductor, filling the gap that currently exists in terms of software for quantitative analysis of IRRs. To install `r Biocpkg("ClustIRR")` please start R and enter: ```{r, eval=FALSE} if(!require("BiocManager", quietly = TRUE)) { install.packages("BiocManager") } BiocManager::install("ClustIRR") ``` # ClustIRR algorithm The algorithm of `r Biocpkg("ClustIRR")` performs clustering of IRR sequences to find groups of receptors with similar specificity. ```{r graphic, echo = FALSE, fig.align="left", out.width='90%'} knitr::include_graphics("../inst/extdata/logo.png") ``` ## Input The main input of `r Biocpkg("ClustIRR")` are two repertoires: `s` and `r`: * `s`: IRR under investigation (case/sample) * `r`: reference IRR (control/reference) Both repertoires should be provided as data.frames. The rows in the data.frames correspond to clones (group of cells derived from a common parent cell by clonal expansion). `r Biocpkg("ClustIRR")` requires the **CDR3 sequence** and the **size** of each clone. **CDR3 sequences** For each clone, `s` and `r`, must contain the amino acid sequences of the corresponding complementarity determining regions 3 (CDR3s). We may have the CDR3s from one T cell receptor chain (e.g., \ only CDR3$\alpha$s or only CDR3$\beta$s) or from both chains (CDR3$\alpha\beta$). Similarly, we may have CDR3s from $\gamma\delta$ T-cells (CDR3$\gamma$ and CDR3$\delta$) or B-cells (CDR3H and CDR3L). This is explained in case study 4. **clone size** `s` and `r` may also contain the size of each clone. This is provided by the column `clone_size`. If the clone sizes are not available, then we assume that `clone_size` = 1. Let's have a look at an example data set which we will use as input: ```{r} data("CDR3ab") ``` ```{r} # take the first 500 CDR3b sequences from the data -> s s <- data.frame(CDR3b = c(CDR3ab$CDR3b[1:500], "CASSSSPGDQEQFF"), clone_size = 1) # take 5000 CDR3b sequences 501:5500 from the data -> r r <- data.frame(CDR3b = CDR3ab$CDR3b[501:5500], clone_size = 1) ``` ```{r} str(s) ``` ```{r} str(r) ``` ## Clustering `r Biocpkg("ClustIRR")` employs two strategies for clustering: * **local**: detects enrichment of motifs in `s` compared to `r` * **global**: identifies pairs of CDR3s in `s` that have similar sequences The rationale behind these two clustering strategies is the following: two identical CDR3 sequences have the same specificity. Similar CDR3 sequences (e.g., CDR3 sequences that differ by only one amino acid) may have similar specificity. Global clustering is designed to find pairs of CDR3s that are globally (based on the complete CDR3 sequences) similar. We also know that two CDR3s with significantly different sequences may still recognize the same peptide[^1] if they share a motif in their core regions (e.g., identical 4-mer). Such "useful" motifs may be enriched in `s` but not in `r`, and local clustering aims to identify them. ### Local clustering CDR3 sequences are segmented into overlapping **motifs** ($k$-mers), where $k$ is specified by setting the input $ks = 4$. Example of segmenting CDR3 sequence into 4-mers: ```{r} cdr3 <- "CASSTTTGTGELFF" ks <- 4 colnames(stringdist::qgrams(cdr3, q = ks)) ``` $k$-mers found in the **core region** of the CDR3 loop are more likely to establish contact with an antigenic peptide than the $k$-mers found at the flanks of the CDR3 sequence. Hence, the user is encouraged to remove a few amino acids from the flanks of each CDR3 sequence. This can be done by changing the control input from `control$trim_flank_aa,` e.g.: * `control$trim_flank_aa = 0`: no trimming * `control$trim_flank_aa = 3`: trim three amino acids from both flanks of the CDR3 sequence An example of trimming CDR3 flanks and segmenting the core of the CDR3 sequence into 4-mers is shown below. We now focus on five overlapping motifs found in the core region of the `CASSTTTGTGELFF`. ```{r} t <- 3 cdr3_trimmed <- substr(x = cdr3, start = t + 1, stop = nchar(cdr3) - t) colnames(stringdist::qgrams(cdr3_trimmed, q = ks)) ``` A motif is considered enriched if the following (user-defined) criteria are satisfied: 1. `control$local_min_o`: minimum motif frequency in `s` 2. `control$local_min_ove`: minimum ratio of observed vs. expected (OvE) relative motif frequency, with $OvE=\dfrac{f_s}{n_s}/\dfrac{f_r}{n_r}$ * $f_{s}$ and $f_{r}$: motif frequencies in repertoires `s` and `r` * $n_{s}$ and $n_{r}$: total number of motifs in repertoires `s` and `r` 3. `control$local_max_fdr`: maximum false discovery rate (FDR). Corrected p-value, computed by a one-sided Fisher's exact test (effectively a hypergeometric test). ### Global clustering For global clustering, `r Biocpkg("ClustIRR")` employs one of three strategies. The user can select each strategy with the control parameters: `global_smart` and `global_pairs` *Strategy 1:* if `control$global_smart = FALSE` (default) Here, `r Biocpkg("ClustIRR")` quantifies the dissimilarity between pairs of CDR3 sequences using Hamming distances. Two CDR3 sequences with Hamming distance $\leq x$ are considered globally similar, where $x$ is the user- defined threshold `control$global_max_hdist` (default = 1). With this, `r Biocpkg("ClustIRR")` provides a very simple and intuitive heuristic for identifying globally similar CDR3s. But, this approach also has drawbacks: 1. CDR3 sequences with different lengths are by definition dissimilar 2. the Hamming distance, by design, ignores the properties of the amino acids *Strategy 2:* if `control$global_smart = TRUE` In this case, `r Biocpkg("ClustIRR")` does pairwise alignment of CDR3 sequences and computes a BLOSUM62 score $B$ that is normalized by the alignment length $l$ $\rightarrow$ $NB = B/l$. *Strategy 3:* if `control$global_pairs` is specified by the user `r Biocpkg("ClustIRR")` also provides a second input option (see red input in `r Biocpkg("ClustIRR")` workflow), which allows the user to provide a data.frame containing globally similar CDR3 sequences computed by complementary approaches (e.g., *tcrdist*). This input can be provided via the input parameter `control$global_pairs`. In this case, global clustering is not performed by `r Biocpkg("ClustIRR")`. ## Output The main function in `r Biocpkg("ClustIRR")` is `cluster_irr`. This function returns an S4 object of class `clust_irr`. The object contains two sublists (slots): 1. clustering results: tables and lists 2. processed inputs: processed form of the input data (`s`) and parameters We will describe the inputs, outputs, and the algorithm of `r Biocpkg("ClustIRR")` with the help of the following case studies. # Case study 1: simple TCR repertoire analysis with `r Biocpkg("ClustIRR")` In this example, we will insert the motif *LEAR* in the core regions of the first 20 CDR3b sequences of repertoire `s`. With this, we simulate enrichment of this motif. This motif is not enriched in repertoire `r`, and `r Biocpkg("ClustIRR")` should be able to detect *LEAR* as enriched: ```{r} # insert motif LEAR substr(x = s$CDR3b[1:20], start = 6, stop = 9) <- "LEAR" ``` ... and then we perform clustering with `r Biocpkg("ClustIRR")`: ```{r} o <- cluster_irr(s = s, r = r) ``` ## Local clustering results In the following table we see that `r Biocpkg("ClustIRR")` has detected enrichment of *LEAR* in `s` compared to `r`: ```{r} # extract local motifs local_motifs <- get_clustirr_clust(o)$CDR3b$local$m # display only passed motifs knitr::kable(local_motifs[local_motifs$pass == TRUE, ], row.names = FALSE) ``` Interestingly, the motif *LLEA* is also enriched. This is probably because *LLEA* and *LEAR* share the substring *LEA*, hence, the enrichment of *LLEA* can be seen as a byproduct of the inserted motif *LEAR*. This can also be seen from the lower frequency of *LLEA* ($f_s$ = 5) in `s` compared to *LEAR* ($f_s$ = 20). For *LEAR* we see FDR $\approx 10^{-15}$, which is significantly smaller than the FDR $\approx 10^{-2}$ observed for *LLEA*. Finally, for *LEAR* we see OvE $\approx 100$, whereas for *LLEA*'s OvE $=\infty$ (*LLEA* has $f_r$ = 0, which results in a division by 0 when calculating OvE). ## Global clustering results In our data we had only one pair of globally similar CDR3 sequences, i.e. the CDR3 sequences *CAS**S**PLEARGYTF* and *CAS**R**PLEARGYTF* which differ by one amino acid at position 4. `r Biocpkg("ClustIRR")` has identified this: ```{r} # display globally similar pairs knitr::kable(get_clustirr_clust(o)$CDR3b$global, row.names = FALSE) ``` ## Putting it all together $\rightarrow$ graph To interpret the `r Biocpkg("ClustIRR")` output we can inspect the tables of locally/globally similar CDR3s. Furthermore, we provide the functions `get_graph`, which generates an undirected graph (`r CRANpkg("igraph")` object), and `plot_graph` for visualization of the graph. Each vertex in the graph is a T-cell clone (row in the input data.frame for repertoire `s`), and we draw an edge between two vertices if they are globally similar or if they share an enriched motif (locally similar). Multiple edges for global and local similarity between two vertices are possible. Let's visualize the graph output for this case study: ```{r, fig.width=5, fig.height=4, fig.align='center'} par(mai = c(0,0,0,0)) plot_graph(clust_irr = o) ```   ### Vertices The graph shows a cluster of T-cell clones (densely connected vertices). These are the 20 clones with CDR3 sequences in which we inserted the motif *LEAR*. All clones that have an enriched motif are connected in the graph, i.e. enriched motifs give rise to cliques. The remaining clones (about 480) are shown as singleton vertices. We also see two vertices that are connected by an edge. These are the two clones that are globally similar. We can plot an **interactive** graph with `r CRANpkg("visNetwork")`. ```{r, fig.width=5, fig.height=4, fig.align='center'} plot_graph(clust_irr = o, as_visnet = TRUE) ``` To remove all vertices without edges from the graph, we can disable the option *show_singletons*. ```{r, fig.width=5, fig.height=4, fig.align='center'} plot_graph(clust_irr = o, as_visnet = TRUE, show_singletons = FALSE) ``` ### Edges Between a pair of vertices, we draw an edge if: a) they are globally similar; or b) they share an enriched motif. We can have multiple edges between a pair of vertices (clones) **for each chain**: local, global, local and global. # Case study 2: analysis of TCR repertoire with large expanded clone In this case study, we assume that the data contains a clone with 10 T-cells ($\approx$ 2% of the size of the initial repertoire). All T-cells in the expanded clone have the same CDR3b sequence *CATSRPDGLAQYF*. `r Biocpkg("ClustIRR")` should detect most motifs at the core of *CATSRPDGLAQYF* as enriched while also reporting that all cells within `s` have globally similar (in fact identical) CDR3b sequences. Let's insert a clone in data set `s`: ```{r} # create a clone of 10 T-cells clone <- data.frame(CDR3b = "CATSRPDGLAQYF", clone_size = 10) # append the clone to the original repertoire 's' s <- rbind(s, clone) ``` ... and once again perform clustering with `r Biocpkg("ClustIRR")`: ```{r} o <- cluster_irr(s = s, r = r, ks = 4, cores = 1, control = list(global_smart = FALSE, trim_flank_aa = 3)) ``` ## Local clustering results `r Biocpkg("ClustIRR")` once again reports enrichment of *LEAR*, but also of many additional motifs that are part of the core of *CATSRPDGLAQYF*, such as *DGLA*, *PDGL*, *RPDG*, *SRPG*, etc. ```{r, fig.align='center'} # extract local motifs local_motifs <- get_clustirr_clust(o)$CDR3b$local$m # display only passed motifs knitr::kable(local_motifs[local_motifs$pass == TRUE, ], row.names = FALSE) ``` ## Global clustering results Once again, `r Biocpkg("ClustIRR")` finds the same pair of globally similar CDR3 sequences. The CDR3 sequences *CAS**S**PLEARGYTF* and *CAS**R**PLEARGYTF* differ by one amino acid at position 4. Let's check how the global and local similarities are represented in the graph (see next section). Can you find the vertex that corresponds to the expanded clone? ## Graph output Let's plot the clustering results: ```{r} plot_graph(clust_irr = o, as_visnet = TRUE) ```   We see one connected component, which is identical to the one we saw in case study 1. Furthermore, we see a large vertex. This represents the clonal expansion, where the size of the vertex scales as the logarithm of the number of T-cells in the clone. Clonally expanded cells are globally similar to each other. If the specific clonal expansion is only found in `s` but not in `r`, then it is likely that we will also see an enrichment of certain motifs from the core of the corresponding CDR3 sequences. In summary, CDR3s of expanded clones are similar in terms of the global sequences but may also be locally similar. # Case study 3: analysis of TCR repertoire with paired $\alpha\beta$ TCR chains Single-cell technology allows us to sequence entire TCR repertoires, and to extract the sequences of both TCR chains: $\beta$ and $\alpha$. `r Biocpkg("ClustIRR")` can analyze such data. Clustering is performed separately using the CDR3 sequences of each chain. Let's create the input data. We create two repertoires: * repertoire `S0`: 100 CDR3$\beta$ and CDR3$\alpha$ sequence pairs. * repertoire `S1`: 1,000 CDR3$\beta$ and CDR3$\alpha$ sequence pairs. Imagine that `S0` and `S1` are two TCR repertoires of a cancer patient, sequenced before and after cancer therapy, respectively. ```{r} data("CDR3ab") S0 <- data.frame(CDR3a = CDR3ab$CDR3a[3001:3100], CDR3b = CDR3ab$CDR3b[3001:3100], clone_size = c(50, 10, rep(1, times = 98))) S1 <- data.frame(CDR3a = CDR3ab$CDR3a[5001:5200], CDR3b = CDR3ab$CDR3b[5001:5200], clone_size = c(20, rep(1, times = 199))) # create a clone of 15 T-cells clone <- data.frame(CDR3a = "CASSQPGTDHGYTF", CDR3b = "CASSPQGREATGELFF", clone_size = 15) # append the clone to the original repertoire 'S0' S0 <- rbind(S0, clone) # insert motif LEAR into S0 substr(x = S0$CDR3b[1:20], start = 6, stop = 9) <- "LEAR" # insert motif WWWW into S1 substr(x = S1$CDR3b[1], start = 5, stop = 8) <- "WWWW" ``` TCR repertoire `S0` has * two enriched motifs: *LEAR* and *LLEA* * two expanded clones: * clone `S0-1`: 50 T-cells (CDR3a: CASSFGPYGSQPQHF, CDR3b: CASSRDAGNTIYF) * clone `S0-2`: 15 T-cells (CDR3a: CASSQPGTDHGYTF, CDR3b: CASSPQGREATGELFF) * clone `S0-3`: 10 T-cells (CDR3a: CATSRLRQGLNEKLFF, CDR3b: CASGGGLAYEQYF) TCR repertoire `S1` has * one enriched motif: *WWWW* * one expanded clone: * clone `S1-1`: 20 T-cells (CDR3a: CASSQPGTDHGYTF, CDR3b: CASSPQGREATGELFF) We will perform two sets of analysis with `r Biocpkg("ClustIRR")`. First, we will inspect how the specificity structure of repertoire `S1` (`s`=`S1`) is modulated compare to repertoire `S0` (`r`=`S0`). ```{r} clust_irrs <- vector(mode = "list", length = 2) names(clust_irrs) <- c("S1", "S0") clust_irrs[[1]] <- cluster_irr(s = S1, r = S0, control = list(global_smart = FALSE, trim_flank_aa = 3)) ``` Second, we will do the reverse, i.e. we will inspect the specificity structure of repertoire `S0` (`s`=`S0`) compared to `S1` (`r`=`S1`). ```{r} clust_irrs[[2]] <- cluster_irr(s = S0, r = S1, control = list(global_smart = FALSE, trim_flank_aa = 3)) ``` Let's plot the joint graph: ```{r} # beta & alpha chain plot_joint_graph(clust_irrs = clust_irrs, cores = 1, as_visnet = TRUE) ```   ## How to interpret the joint graph? The clones of repertoire `S0` and `S1` are shown as red and yellow vertices, respectively. Repertoire `S0` contains three expanded clones. Hence, we see three large red vertices. Repertoire `S1` contains only one expanded clone shown as a large yellow vertex. The remaining vertices are small and likely contain a single T-cell. Within each repertoire, the edges are drawn as explained earlier. Meanwhile, in the joint graph we also have edges between the vertices from both repertoires. These will be drawn, if a pair of clones in `S0` and `S1` have globally similar or equal CDR3 sequences. In this particular example, the vertices that represents the expanded clones `S0-2` and `S1-1` (large red and yellow vertices) are connected by such an edge. # Case study 4: analysis of TCR repertoires with paired $\gamma\delta$ chains `r Biocpkg("ClustIRR")` can be employed to study the specificity structure of: * $\alpha\beta$ chain TCR repertoires * $\gamma\delta$ chain TCR repertoires * heavy(H)/light(L) chain BCR repertoires To carry out this analysis, the user must appropriately configure the inputs. Here we demonstrate this point. The rest of the clustering analysis proceeds similarly as outlined in case studies 1-3. ## Configuring the input To analyze $\alpha\beta$ chain TCR repertoires, the input data sets `s` and `r` are required to have one or both (e.g., if we have paired data as explained in case study 3) of the columns: `CDR3a` and `CDR3b`. For the analysis of $\gamma\delta$ chain TCR repertoires the columns `CDR3g` and `CDR3d` have to be specified. For the analysis of heavy(H)/light(L) chain BCR repertoires `r Biocpkg("ClustIRR")` uses the two columns CDR3h and CDR3l. Dummy example, where we take the `CDR3ab` data set as input and simply rename the columns: ```{r, eval=F} data("CDR3ab") # gamma/delta chain TCR data -> notice CDR3g and CDR3d columns of 's' and 'r' s <- data.frame(CDR3g = CDR3ab$CDR3a[4501:5000], CDR3d = CDR3ab$CDR3b[4501:5000]) r <- data.frame(CDR3g = CDR3ab$CDR3a[5001:10000], CDR3d = CDR3ab$CDR3b[5001:10000]) ``` ```{r, eval=F} data("CDR3ab") # heavy/light chain BCR data -> notice CDR3h and CDR3l columns of 's' and 'r' s <- data.frame(CDR3h = CDR3ab$CDR3a[4501:5000], CDR3l = CDR3ab$CDR3b[4501:5000]) r <- data.frame(CDR3h = CDR3ab$CDR3a[5001:10000], CDR3l = CDR3ab$CDR3b[5001:10000]) ``` The rest of the analysis proceeds as usual, i.e., by calling the function `cluster_irr`. # Case study 5: joint analysis of 4 TCR repertoires Imagine that we have a two patients, P1 and P2. From each patient we have two IRRs, one before (A) and one after (B) treatment. ```{r} data("CDR3ab") P1_A <- data.frame(CDR3a = CDR3ab$CDR3a[3001:3100], CDR3b = CDR3ab$CDR3b[3001:3100], clone_size = rpois(n = 100, lambda = 1)+1) P1_B <- data.frame(CDR3a = CDR3ab$CDR3a[c(3100, 3101:3200)], CDR3b = CDR3ab$CDR3b[c(3100, 3101:3200)], clone_size = rpois(n = 101, lambda = 1)+1) # clone_size column -> poisson distributed ``` ```{r} P2_A <- data.frame(CDR3a = CDR3ab$CDR3a[c(3100, 4001:4100)], CDR3b = CDR3ab$CDR3b[c(3100, 4001:4100)]) P2_B <- data.frame(CDR3a = CDR3ab$CDR3a[c(3100, 4001:4100)], CDR3b = CDR3ab$CDR3b[c(3100, 4001:4100)]) # no clone_size column -> default clone_size = 1 ``` We can analyze the specificity structure of IRR A and B by performing clustering with `r Biocpkg("ClustIRR")`. Here we perform two analyses: A vs B and B vs. A, which allows us to find enriched motifs from either direction. The results are stored in a list of size 4. When `r` is set to `NULL`, only global clustering is performed. ```{r} clust_irrs <- vector(mode = "list", length = 4) names(clust_irrs) <- c("P1_A", "P1_B", "P2_A", "P2_B") clust_irrs[[1]] <- cluster_irr(s = P1_A, r = NULL) clust_irrs[[2]] <- cluster_irr(s = P1_B, r = NULL) clust_irrs[[3]] <- cluster_irr(s = P2_A, r = NULL) clust_irrs[[4]] <- cluster_irr(s = P2_B, r = NULL) ``` `r Biocpkg("ClustIRR")` provides the unique opportunity to merge the four graphs (in fact: as many `r Biocpkg("ClustIRR")` outputs as we want) and to visualize them. ```{r, fig.width=6, fig.height=6} # beta & alpha chain plot_joint_graph(clust_irrs = clust_irrs, as_visnet = TRUE, show_singletons = T) ```   ## How to interpret the joint graph? Clones are color-coded according to the four input repertoires: * `P1_A` = red * `P1_B` = yellow * `P2_A` = green * `P2_B` = blue The clones present in both samples from the same patient are shown as connected vertices. Clones with larger clone size appear as bigger vertices than clones with smaller clone size. Similarities between clones from different repertoires are drawn as edges between different colored vertices. In this case, we can see many connections between clones from `P2_A` and `P2_B`, and only one connection `P1_A` and `P1_B`, which also connects to clones from `P2_A` and `P2_B`. Within each repertoire, edges are drawn as explained earlier (not the case in this particular graph). ```{r session_info} utils::sessionInfo() ``` [^1]: Glanville, Jacob, et al. "Identifying specificity groups in the T cell receptor repertoire." Nature 547.7661 (2017): 94-98.