Making 2D Hilbert Curve ======================================== **Author**: Zuguang Gu ( z.gu@dkfz.de ) **Date**: `r Sys.Date()` **Package version**: `r installed.packages()["HilbertCurve", "Version"]` ------------------------------------------------------------- ```{r, echo = FALSE, message = FALSE} library(markdown) options(markdown.HTML.options = c(options('markdown.HTML.options')[[1]], "toc")) library(knitr) knitr::opts_chunk$set( error = FALSE, tidy = FALSE, message = FALSE, fig.align = "center", fig.width = 6, fig.height = 6) options(markdown.HTML.stylesheet = "custom.css") options(width = 100) library(HilbertCurve) ``` ## Introduction [Hilbert curve](https://en.wikipedia.org/wiki/Hilbert_curve) is a type of space-filling curves that folds one dimensional axis into a two dimensional space, but still keeps the locality. It has advantages to visualize data with long axis in following two aspects: 1. greatly improve resolution of the visualization fron n to $\sqrt{n}$; 2. easy to visualize clusters because generally data points in the axis will also be close in the 2D space. This package aims to provide an easy and flexible way to visualize data through Hilbert curve. The implementation and example figures are based on following sources: - http://mkweb.bcgsc.ca/hilbert/ - http://corte.si/posts/code/hilbert/portrait/index.html - http://bioconductor.org/packages/devel/bioc/html/HilbertVis.html We first load the packages and set seed for random numbers. ```{r} library(HilbertCurve) library(circlize) set.seed(12345) ``` Following shows Hilbert curves with level 2, 3, 4, 5: ```{r, fig.width = 12, fig.height = 3, echo = FALSE} grid.newpage() pushViewport(viewport(layout = grid.layout(nr = 1, nc = 4))) pushViewport(viewport(layout.pos.row = 1, layout.pos.col = 1)) hc = HilbertCurve(1, 16, level = 2, reference = TRUE, newpage = FALSE, title = "level = 2") hc_segments(hc, x1 = 1, x2 = 2, gp = gpar(col = "red", lwd = 2)) upViewport() pushViewport(viewport(layout.pos.row = 1, layout.pos.col = 2)) hc = HilbertCurve(1, 64, level = 3, reference = TRUE, newpage = FALSE, title = "level = 3") hc_segments(hc, x1 = 1, x2 = 2, gp = gpar(col = "red", lwd = 2)) upViewport() pushViewport(viewport(layout.pos.row = 1, layout.pos.col = 3)) hc = HilbertCurve(1, 256, level = 4, reference = TRUE, newpage = FALSE, title = "level = 4") hc_segments(hc, x1 = 1, x2 = 2, gp = gpar(col = "red", lwd = 2)) upViewport() pushViewport(viewport(layout.pos.row = 1, layout.pos.col = 4)) hc = HilbertCurve(1, 1024, level = 5, reference = TRUE, reference_gp = gpar(col = "grey"), arrow = FALSE, newpage = FALSE, title = "level = 5") hc_segments(hc, x1 = 1, x2 = 2, gp = gpar(col = "red", lwd = 2)) upViewport() upViewport() ``` As shown in the above plots, as level increases, the length of the curve becomes longer and the curve folds more densely. The number of segments (one segment is marked in red) on the Hilbert curve is `4^level - 1`. If a Hilbert curve with level 11 is used to map to human chromosome 1, the resolution would be `249250621/4^11` (approximately `r round(249250621/4^11)`bp). ### Locality [Hilbert curve](https://en.wikipedia.org/wiki/Hilbert_curve) folds one-dimensional axis into a two-dimensional space while still preserves the locality of the data points. We visualize this attribute with a Hilbert curve with level 5. In following animation, the point moves with its natural order on the axis. ```{r, eval = FALSE} for(i in 1:1024) { hc = HilbertCurve(1, 1024, level = 5, reference = TRUE, arrow = FALSE) hc_points(hc, x1 = i, np = NULL, pch = 16, size = unit(2, "mm")) } ```
Next, we calculate the pairwise distance between data points in the 2D space and visualize it as heatmap. The numbers on x-axis (top) and y-axis (left) illustrate the position of data points in the original 1D axis. ```{r, fig.width = 9, fig.height = 8} library(HilbertVis) pos = HilbertVis::hilbertCurve(5) mat = as.matrix(dist(pos)) library(ComplexHeatmap) ht = Heatmap(mat, name = "dist", cluster_rows = FALSE, cluster_columns = FALSE, show_row_names = FALSE, show_column_names = FALSE, heatmap_legend_param = list(title = "euclidean_dist")) draw(ht, padding = unit(c(5, 5, 5, 2), "mm")) decorate_heatmap_body("dist", { grid.segments(c(0.25, 0.5, 0.75, 0, 0, 0), c(0, 0, 0, 0.25, 0.5, 0.75), c(0.25, 0.5, 0.75, 1, 1, 1), c(1, 1, 1, 0.25, 0.5, 0.75), gp = gpar(lty = 2)) grid.text(rev(c(256, 512, 768, 1024)), 0, c(0, 256, 512, 768)/1024, just = "bottom", rot = 90, gp = gpar(fontsize = 10)) grid.text(c(1, 256, 512, 768, 1024), c(1, 256, 512, 768, 1024)/1024, 1, just = "bottom", gp = gpar(fontsize = 10)) }) ``` Basically, from the heatmap, data points which are close in the 1D axis are still close in the 2D space (illustrated in the areas around diagonals). Also there are some data points in (e.g, 1 ~ 256) which are close to data points in (768 ~ 1024)(bottom left or top right area in the heatmap), but it is only due to that the curve folds back. This exception can be easily distinguished by adding assistant grid lines or if users are clear with the structure of the curve. ## Basic settings The **HilbertCurve** package provides a rather simple way to display the data in the form of Hilbert curve. It hides all the technical parts and users only need to think that they are adding graphics on a novel axis based on specifying positions. Generally, customizing a Hilbert curve follows following steps: ```{r, eval = FALSE} hc = HilbertCurve(...) # initialize the curve hc_points(hc, ...) # add points hc_rect(hc, ...) # add rectangles hc_polygon(hc, ...) # add polygons hc_segments(hc, ...) # add lines hc_text(hc, ...) # add text ``` `HilbertCurve()` is a constructor function and initializes the Hilbert curve. Following example means initializing a Hilbert curve with level 4 which maps data ranging from 1 to 100. The function returns a `HilbertCurve` class instance and it can be used to add more graphics later. `reference` argument here is only used to show the structure of the curve. ```{r, eval = FALSE} hc = HilbertCurve(1, 100, level = 4, reference = TRUE) ``` By default, the Hilbert curve starts from the bottom left corner. The start position can be specifyed by `start_from` option and the orientation of the first segment can be set by `first_seg`: ```{r, fig.width = 12, fig.height = 6, echo = FALSE} grid.newpage() pushViewport(viewport(layout = grid.layout(nr = 2, nc = 4))) pushViewport(viewport(layout.pos.row = 1, layout.pos.col = 1)) hc = HilbertCurve(1, 16, level = 2, reference = TRUE, reference_gp = gpar(col = "#999999"), newpage = FALSE, start_from = 'bottomleft', first_seg = "horizontal", title = "start_from = 'bottomleft'\nfirst_seg = 'horizontal'") upViewport() pushViewport(viewport(layout.pos.row = 1, layout.pos.col = 2)) hc = HilbertCurve(1, 16, level = 2, reference = TRUE, reference_gp = gpar(col = "#999999"), newpage = FALSE, start_from = 'topleft', first_seg = "horizontal", title = "start_from = 'topleft'\nfirst_seg = 'horizontal'") upViewport() pushViewport(viewport(layout.pos.row = 1, layout.pos.col = 3)) hc = HilbertCurve(1, 16, level = 2, reference = TRUE, reference_gp = gpar(col = "#999999"), newpage = FALSE, start_from = 'bottomright', first_seg = "horizontal", title = "start_from = 'bottomright'\nfirst_seg = 'horizontal'") upViewport() pushViewport(viewport(layout.pos.row = 1, layout.pos.col = 4)) hc = HilbertCurve(1, 16, level = 2, reference = TRUE, reference_gp = gpar(col = "#999999"), newpage = FALSE, start_from = 'topright', first_seg = "horizontal", title = "start_from = 'topright'\nfirst_seg = 'horizontal'") upViewport() pushViewport(viewport(layout.pos.row = 2, layout.pos.col = 1)) hc = HilbertCurve(1, 16, level = 2, reference = TRUE, reference_gp = gpar(col = "#999999"), newpage = FALSE, start_from = 'bottomleft', first_seg = "vertical", title = "start_from = 'bottomleft'\nfirst_seg = 'vertical'") upViewport() pushViewport(viewport(layout.pos.row = 2, layout.pos.col = 2)) hc = HilbertCurve(1, 16, level = 2, reference = TRUE, reference_gp = gpar(col = "#999999"), newpage = FALSE, start_from = 'topleft', first_seg = "vertical", title = "start_from = 'topleft'\nfirst_seg = 'vertical'") upViewport() pushViewport(viewport(layout.pos.row = 2, layout.pos.col = 3)) hc = HilbertCurve(1, 16, level = 2, reference = TRUE, reference_gp = gpar(col = "#999999"), newpage = FALSE, start_from = 'bottomright', first_seg = "vertical", title = "start_from = 'bottomright'\nfirst_seg = 'vertical'") upViewport() pushViewport(viewport(layout.pos.row = 2, layout.pos.col = 4)) hc = HilbertCurve(1, 16, level = 2, reference = TRUE, reference_gp = gpar(col = "#999999"), newpage = FALSE, start_from = 'topright', first_seg = "vertical", title = "start_from = 'topright'\nfirst_seg = 'vertical'") upViewport() upViewport() ``` The curve can be though as a folded axis. When the coordinate for this folded axis is initialized, low-level graphics can be added by specifying the positions. There are several ways to specify the "positions" of the data points. The most common way is to construct a `IRanges` object: ```{r} x = sort(sample(100, 20)) s = x[1:10*2 - 1] e = x[1:10*2] ir = IRanges(s, e) ir ``` Here `ir` contains intervals which are composed by integers. In later sections, you will see it is also possible to specify positions with numeric values and negative values. ### Points There are two modes for adding points. Normally, intervals are always long in the curve and can not be sufficiently represented by single points, thus, by default, a list of e.g. circles split the intervals. ```{r} hc = HilbertCurve(1, 100, level = 4, reference = TRUE) hc_points(hc, ir) ``` The number of circles used to represent the intervals can be controlled by `np` (number of points per segment). `np` controls number of tiny segments that split every Hilbert curve segment (e.g. the first horizontal segment in the left bottom in the curve). Following plot is under `np = 3`. ```{r, fig.width = 3, fig.height = 3} hc = HilbertCurve(1, 16, level = 2, reference = TRUE, title = "np = 3") hc_points(hc, x1 = 1, x2 = 2, np = 3) ``` Graphic parameters can be set by `gp`. Note under this mode, the size of points can only be controlled by `np` argument. To make it more interesting, you can choose different shapes for the points. There are some pre-defined shapes that you can choose from: "circle", "square", "triangle", "hexagon", "star". ```{r} hc = HilbertCurve(1, 100, level = 4, reference = TRUE) hc_points(hc, ir, np = 3, gp = gpar(fill = rand_color(length(ir))), shape = sample(c("circle", "square", "triangle", "hexagon", "star"), length(ir), replace = TRUE)) ``` In above figure, you may notice for some points, the color is more light in the ends. This is because the segments represented by these light circles are not fully covered by user's input intervals, thus, averaging is applied here. E.g. if only the half of the segment represented by the circle is covered by user's interval and the color for this circle is set to red, then after the averaging, the color would be semi-red (`#FF8080`, an average between red and white). Averaging is very important when you visualize the genome with zooming. You can find more detailed explanation in the [**Averaging**](#toc_14) section. If `np` is set to 1 or `NULL`, there will be second mode for adding points that the points are plotted at the center of every interval in `ir`. In this case, `size` argument is used to control the size of the points. This mode is useful if you have a lot of small intervals. ```{r} hc = HilbertCurve(1, 100, level = 4, reference = TRUE) hc_points(hc, ir, np = NULL, size = unit(runif(length(ir)), "cm"), pch = 16) ``` ### Segments Adding segments is quite straightforward. You can assign `lwd` to a large value to simulate long and twisted rectangles. ```{r} hc = HilbertCurve(1, 100, level = 4, reference = TRUE) hc_segments(hc, ir, gp = gpar(lwd = 5)) ``` ### Rectangles `hc_rect()` fills intervals by a list of squares. The squares always locate at the turning points of the Hilbert curve and the width/height cannot be modified and it is automatically adjusted according to the level of the curve. There is a difference from `hc_points()` with setting `shape` to `square`. The number of squares per segment is always 2 for `hc_rect()` while it can be any number for `hc_points()`. When the interval is long enough, `hc_rect()` can completely fill the surrounding areas. ```{r} hc = HilbertCurve(1, 100, level = 4, reference = TRUE) hc_rect(hc, ir, gp = gpar(fill = "#FF000080")) ``` As you can see some rectangles are not full red as well. It is also because of averaging. See [**Averaging**](#toc_14) sections for explanation. ### Polygons Basically polygons are quite similar to the rectangles. The major difference is 1. borders can be set for irregular polygons; 2. there is only single color in one polygon while for rectangles, the color can change in the ends of the interval if the corresponding segment is not completely covered by the input interval. ```{r} hc = HilbertCurve(1, 100, level = 4, reference = TRUE) hc_polygon(hc, ir, gp = gpar(fill = "#FF0000", col = "black")) ``` ### Text Text is always added at the center of each interval in `ir`. Additional settings such as `just` and `rot` can also be set. ```{r} hc = HilbertCurve(1, 100, level = 4, reference = TRUE) labels = sample(letters, length(ir), replace = TRUE) hc_text(hc, ir, labels = labels, gp = gpar(fontsize = width(ir)*2+5)) ``` ### Combine low-level functions With combination of these basic low-level graphic functions, complicated graphics can be easily made: ```{r} hc = HilbertCurve(1, 100, level = 4) hc_segments(hc, IRanges(1, 100)) # This is an other way to add background line hc_rect(hc, ir, gp = gpar(fill = rand_color(length(ir), transparency = 0.8))) hc_polygon(hc, ir[c(1,3,5)], gp = gpar(col = "red")) hc_points(hc, ir, np = 3, gp = gpar(fill = rand_color(length(ir))), shape = sample(c("circle", "square", "triangle", "hexagon", "star"), length(ir), replace = TRUE)) hc_text(hc, ir, labels = labels, gp = gpar(fontsize = width(ir)*2+5, col = "blue", font = 2)) ``` ## Non-integer positions It doesn't matter if your positions are integers or not. Internally, adjustment is automatically applied. When positions are not integers, you can specify the positions by `x1` and `x2`. All low-level graphical funtions accept `x1` and `x2`. ```{r} hc = HilbertCurve(0.1, 0.8, level = 4, reference = TRUE) hc_points(hc, x1 = c(0.15, 0.55), x2 = c(0.25, 0.65)) ``` This is also useful to specify the positions with huge values which can not be represented by the `IRagnes` class (e.g. the summation of total length of human chromosomes). ```{r, eval = FALSE} # code not run when generating the vignette hc = HilbertCurve(1, 1000000000000, level = 4, reference = TRUE) hc_points(hc, x1 = 400000000000, x2 = 600000000000) ``` Negative positions are allowed as well. ```{r, eval = FALSE} # code not run when generating the vignette hc = HilbertCurve(-100, 100, level = 4, reference = TRUE) hc_points(hc, x1 = -50, x2 = 50) ``` ## Pixel mode The pixel mode can be thought as a high resolution version of `hc_rect()`. When the level is high (e.g. > 10), the whole 2D space will be almost completely filled by the curve and it is impossible to add or visualize e.g. points on the curve. In this case, the 'pixel' mode visualizes each tiny 'segment' as a pixel and maps values to colors. Internally, the whole plot is represented as an RGB matrix and every time a new layer is added to the plot, the RGB matrix will be updated according to the color overlay. When all the layers are added, normally a PNG figure is generated directly from the RGB matrix. So the Hilbert curve with level 11 will generate a PNG figure with 2048x2048 resolution. This is extremely useful for visualize genomic data. E.g. If we make a Hilbert curve for human chromosome 1 with level 11, then each pixel can represent 60bp (``249250621/2048/2048``) which is of very high resolution. Under 'pixel' mode, every time a new layer is added, the image will be add to the interactive device as a rastered image. ```{r} hc = HilbertCurve(1, 100, level = 9, mode = "pixel") hc_layer(hc, ir) ``` Since color is graphic representation to values, there is only one graphic setting `col`. You can add more than one layers, and just remember to set transparent colors for the additional layers. To map values to colors, users may try `colorRamp2()` function in **circlize** package to generate a color mapping function. Another advantage of using `colorRamp2()` is it can be used to generate the corresponding color legend. ```{r} col_fun = colorRamp2(c(0, 1), c("white", "red")) x = sample(1000, 100) value = runif(length(x)) hc = HilbertCurve(1, 1000, level = 9, mode = "pixel", title = "pixel mode") hc_layer(hc, x1 = x, x2 = x+2, mean_mode = "absolute", col = col_fun(value)) hc_layer(hc, x1 = 750, x2 = 850, col = "#00000010") ``` Grid lines can be added to the plot for better distinguishing blocks in the Hilbert curve. The 2D space will be split into `2^(grid_line-1)` rows and `2^(grid_line-1)` columns. ```{r} hc = HilbertCurve(1, 1000, level = 9, mode = "pixel", title = "pixel mode") hc_layer(hc, x1 = x, x2 = x+2, mean_mode = "absolute", col = col_fun(value), grid_line = 3) ``` Borders can be added by setting `border` to `TRUE` or a vector of colors to distinguish different intervals. ```{r} hc = HilbertCurve(1, 1000, level = 9, mode = "pixel", title = "pixel mode") hc_layer(hc, x1 = x, x2 = x+2, mean_mode = "absolute", col = col_fun(value), border = TRUE) ``` The Hilbert curve can be save as PNG figure by `hc_png()` with resolution `2^level x 2^level`. Note only the curve itself is exported. ```{r, eval = FALSE} # code not run, only for demonstration hc_png(hc, file = "test.png") ``` ### Color overlay As discussed before, under "pixel" mode, the Hilbert curve is stored in an RGB matrix, every time a new layer is added, the RGB values that correspond to the new input intervals will be updated according to the color and transparency of the new layer. By default, the color overlay for layers is applied based on the transparency of the new layer. Under "pixel" mode, each pixel is represented by red, green and blue colors for which the value is between 0 and 1. When a new layer with transparency color is added, the overlayed RGB is calculated as follows (`r0`, `g0` and `b0` correspond to the layer which is already added, `r`, `g`, `b` and `alpha` correspond to the layer which is going to be added). The method is implemented in `default_overlay()` function. ```{r, eval = FALSE} r = r*alpha + r0*(1-alpha) g = g*alpha + g0*(1-alpha) b = b*alpha + b0*(1-alpha) ``` `hc_layer()` provides a `overlay` argument for which users can supply their own overlay method. This may be easy to use but it is powerful for representing different features. In following example, we change the red regions which are overlapped with the blue regions to green. ```{r} hc = HilbertCurve(1, 1000, level = 9, mode = "pixel", title = "pixel mode") hc_layer(hc, x1 = x, x2 = x+2, mean_mode = "absolute") hc_layer(hc, x1 = 750, x2 = 850, col = "#0000FF20", overlay = function(r0, g0, b0, r, g, b, alpha) { l = r0 > 0 # if the pixel is red # red -> green, just switch red and green channel tmp = r0[l] r0[l] = g0[l] g0[l] = tmp # add to the curve default_overlay(r0, g0, b0, r, g, b, alpha) }) ``` The self-defined `overlay` should accept seven arguments for which, `r0`, `g0` and `b0` correspond to RGB channles for the layers which are already added, `r`, `g`, `b` and `alpha` correspond to the layer which are going to be added. The elements of e.g. `r`, `g`, `b` only correspond to the pixels that are covered by the regions in the new layer. If the colors are generated by the color mapping function which is generated by `colorRamp2()`, `col2value()` function which is also from **circlize** package transforms back from colors to the original values (it can be thought as the reversed function of `colorRamp2()`). It is usefull to integrate with `overlay` to change the color theme if the color is from a continuous mapping. In following plot, we change the color theme in the overlapped areas to white-blue. To do this, we first select the pixels that correspond to the intersection of "red" areas and "grey" areas. Since `overlay` is only applied to "grey" area and "red" area contains non-white pixels, we only need to extract these non-white pixels inside the overlay function. Then transform back to the original values by `col2value()` and finally change to another color theme. ```{r} hc = HilbertCurve(1, 1000, level = 9, mode = "pixel", title = "pixel mode") hc_layer(hc, x1 = x, x2 = x+2, mean_mode = "absolute", col = col_fun(value)) hc_layer(hc, x1 = 750, x2 = 850, col = "#00000010", # remember r0, g0, ... only correspond to the pixels in the grey area overlay = function(r0, g0, b0, r, g, b, alpha) { # non-white areas l = !(r0 == 1 & g0 == 1 & b0 == 1) # original value v = col2value(r0[l], g0[l], b0[l], col_fun = col_fun) # new color theme col_fun_new = colorRamp2(c(0, 1), c("white", "blue")) col_new = col_fun_new(v, return_rgb = TRUE) r0[l] = col_new[, 1] g0[l] = col_new[, 2] b0[l] = col_new[, 3] default_overlay(r0, g0, b0, r, g, b, alpha) }) ``` In above example, actually you can implement such overlay by first split "red" regions into two parts by whether they overlap with the "grey" regions. Then add the un-overlapped part, "grey" regions, and overlapped regions as three independent layers. But the `overlay` option provides an easier and more straightforward way to do it. In later section, we will demonstrate using self-defined overlay method can effectively visualize the correspondance between H3K36me3 histone mark and gene body. ## Add title and legends Title can be set by the `title` argument in `HilbertCurve()`. Legend can be passed to `legend` argument in `HilbertCurve()` as a `grob` object or a list of `grob` objects. Basically, colors are the main aesthetic mappings to values, thus, `ColorMapping` class in **ComplexHeatmap** package can be very helpful to create a legend `grob`. Also you can consider `legendGrob()` in **grid** package. ```{r} value = runif(length(ir)) col_fun = colorRamp2(c(0, 1), c("white", "red")) cm1 = ColorMapping(col_fun = col_fun) legend1 = color_mapping_legend(cm1, plot = FALSE, title = "continuous") cm2 = ColorMapping(levels = c("A", "B"), color = c("#00FF0080", "#0000FF80")) legend2 = color_mapping_legend(cm2, plot = FALSE, title = "discrete") legend = list(legend1, legend2) hc = HilbertCurve(1, 100, reference = TRUE, title = "points", legend = legend) hc_points(hc, ir, np = 3, gp = gpar(fill = col_fun(value))) hc_rect(hc, ir, gp = gpar(fill = sample(c("#00FF0020", "#0000FF20"), length(ir), replace = TRUE))) ``` Under "pixel" mode, and when the curve is shown on the interactive device as raster image, title and legends can be added as well. But if the plot is exported by `hc_png()`, there will be not title and legends. ```{r} col_fun = colorRamp2(c(0, 1), c("white", "red")) cm = ColorMapping(col_fun = col_fun) lgd = color_mapping_legend(cm, title = "value", plot = FALSE) value = runif(length(x)) hc = HilbertCurve(1, 1000, level = 9, mode = "pixel", title = "pixel mode", legend = lgd) hc_layer(hc, x1 = x, x2 = x+2, mean_mode = "absolute", col = col_fun(value)) ``` ## Averaging models **Note: this "averaging" concept is very important. If you have unexpected plot, try other `mean_mode` option.** When using points or rectangles under "normal" mode, or using pixels under "pixel" mode, each point/rectangle/pixel is mapped to a small window in the genome. When overlapping certain genomic regions to the curve, and when the window is not full covered by the input regions, proper averaging method should be applied to summarize the value in the window. Depending on different scenarios, **HilbertCurve** provides three metrics for averaging. The overlapping model is illustrated in following plot. The red line in the bottom represents the small window on the Hilbert curve. Black lines on the top are the parts of input regions that overlap with the window. The thick lines indicate the intersected part between the input regions and the window. ```{r, fig.height = 3, echo = FALSE} library(grid) grid.lines(c(0.1, 0.9), c(0.3, 0.3), gp = gpar(col = "red", lwd = 4)) grid.text("window on the curve", 0.5, 0.25, just = "top", gp = gpar(col = "red")) grid.lines(c(0, 0.2), c(0.6, 0.6)) grid.lines(c(0.3, 0.5), c(0.6, 0.6)) grid.lines(c(0.7, 1), c(0.6, 0.6)) grid.lines(c(0.1, 0.2), c(0.6, 0.6), gp = gpar(lwd = 4)) grid.lines(c(0.3, 0.5), c(0.6, 0.6), gp = gpar(lwd = 4)) grid.lines(c(0.7, 0.9), c(0.6, 0.6), gp = gpar(lwd = 4)) grid.lines(c(0.1, 0.1), c(0, 1), gp = gpar(lty = 2, col = "grey")) grid.lines(c(0.9, 0.9), c(0, 1), gp = gpar(lty = 2, col = "grey")) grid.text("genomic regions", 0.5, 0.65, just = "bottom") ``` For a window on the curve, $n$ is the number of input regions which overlap with the window (it is 3 in the above plot), $w_i$ is the width of the intersected segments (black thick lines), $x_i$ is the value associated with the original regions. The "absolute" method which is denoted as $v_a$ and is simple calculated as the mean regardless of the width: $$ v_a = \frac{\sum_i^n{x_i}}{n} $$ The "weighted" method which is denoted as $v_w$ is the mean weighted by the width of intersections: $$ v_w = \frac{\sum_i^n{x_iw_i}}{\sum_i^n{w_i}} $$ "Absolute" and "weighted" mode should be applied when background information is not of interest. E.g. when summarizing mean methylation in a small window, since methylation is only related with CpG sites while not other bases, non-CpG background should be ignored. The "w0" mode is the weighted mean between the intersected parts and un-intersected parts: $$ v_{w0} = \frac{v_wW + v_bW'}{W+W'} $$ for which $W$ is sum of width of the intersected parts and $W'$ is the sum of width for the un-intersected parts. $V_b$ is the value corresponding to the background. E.g when averaging colors which are represented as numeric RGB values, the background value is set to 255 which corresponds to white. ## Examples This section gives several interesting examples. Visualize rainbow colors: ```{r} col = rainbow(100) hc = HilbertCurve(1, 100, level = 5) hc_points(hc, x1 = 1:99, x2 = 2:100, np = 3, gp = gpar(col = col, fill = col)) ``` or use the rectangles: ```{r} hc = HilbertCurve(1, 100, level = 5) hc_rect(hc, x1 = 1:99, x2 = 2:100, gp = gpar(col = col, fill = col)) ``` Reproduce [the visualization for IPv4 address](http://dds.ec/hilvis/iframe.html). The CIDR data is from http://www.iwik.org/ipcountry/. ```{r, fig.width = 7, fig.height = 6} library(ComplexHeatmap) library(RColorBrewer) load(system.file("extdata", "cidr_list.RData", package = "HilbertCurve")) cidr_list = cidr_list[sapply(cidr_list, length) > 0] country = rep(names(cidr_list), times = sapply(cidr_list, length)) ip = unlist(cidr_list, "r") # convert ip address to numbers mat = t(as.matrix(data.frame(lapply(strsplit(ip, "\\.|/"), as.numeric)))) start = mat[, 1]*256^3 + mat[, 2]*256^2 + mat[, 3]*256 + mat[, 4] width = sapply(mat[, 5], function(x) strtoi(paste(rep(1, 32 - x), collapse = ""), base = 2)) # top 8 countries col = structure(rep("grey", length(cidr_list)), names = names(cidr_list)) top8_rate = sort(tapply(width, country, sum), decreasing = TRUE)[1:8]/256^4 top8 = names(top8_rate) col[top8] = brewer.pal(8, "Set1") top8_rate = paste0(round(top8_rate*100), "%") # this is the part of using HilbertCurve package cm = ColorMapping(levels = c(top8, "Others"), color = c(col[top8], "grey")) lgd = color_mapping_legend(cm, title = "Top8 countries", labels = c(paste(top8, top8_rate), "Others"), plot = FALSE) hc = HilbertCurve(0, 256^4, start_from = "topleft", first_seg = "horizontal", mode = "pixel", level = 9, legend = lgd) hc_layer(hc, x1 = start, x2 = start + width, col = col[country]) ``` Duration of Chinese dynasties. Data from http://www.360doc.com/content/11/1123/02/7829717_166641336.shtml. ```{r} load(system.file("extdata", "chinese_dynasty.RData", package = "HilbertCurve")) detect_os = function() { if (grepl('w|W', .Platform$OS.type)) { os = "Windows" } else { if (grepl('darwin', version$os)) { os = "MacOS" } else { os = "Linux" } } return(os) } # default font family for Chinese under different OS fontfamily = switch(detect_os(), Windows = "SimSun", MacOS = "Songti SC", Linux = "Songti SC") hc = HilbertCurve(min(chinese_dynasty[[2]]), max(chinese_dynasty[[3]]), title = "Chinese dynasty (1122 B.C. ~ 1911 A.D.)", title_gp = gpar(fontsize = 16, fontfamily = fontfamily)) hc_segments(hc, x1 = chinese_dynasty[[2]], x2 = chinese_dynasty[[3]], gp = gpar(col = rand_color(nrow(chinese_dynasty), transparency = 0.2), lwd = runif(nrow(chinese_dynasty), min = 1, max = 10))) hc_text(hc, x1 = chinese_dynasty[[2]], x2 = chinese_dynasty[[3]], labels = chinese_dynasty[[1]], gp = gpar(fontsize = (chinese_dynasty[[3]] - chinese_dynasty[[2]])/500 * 10 + 8, fontfamily = fontfamily)) ``` ## GenomicHilbertCurve: specific for genomic data To make it easy with genomic data which is always stored as a data frame or a `GRanges` object, the package provides specific class and functions to deal with it. `GenomicHilbertCurve()` initializes the Hilbert curve by specifying chromosomes or specific background regions. After that, the data frame or `GRanges` object which contains genomic intervals can be mapped to the curve. Following functions are depatched to the `GenomicHilbertCurve` class, and the second argument should be a data frame (which is in bed-format) or a `GRanges` object. ```{r, eval = FALSE} hc_points(hc, gr, ...) hc_rect(hc, gr, ...) hc_polygon(hc, gr, ...) hc_segments(hc, gr, ...) hc_text(hc, gr, ...) hc_layer(hc, gr, ...) ``` Following illustrates several examples of making genome-level Hilbert curves. ### Genes The first example visualizes genes on chromosome 1 ([RefSeq genes for human, hg19](https://genome.ucsc.edu/cgi-bin/hgTables?hgsid=451688581_FoCuggvGvavyfGhdj9sSGrGlnErE&clade=mammal&org=Human&db=hg38&hgta_group=genes&hgta_track=refGene&hgta_table=0&hgta_regionType=genome&position=chr9%3A133252000-133280861&hgta_outputType=primaryTable&hgta_outFileName=)). Here random colors are used to distinguish different genes. ```{r} load(system.file("extdata", "refseq_chr1.RData", package = "HilbertCurve")) hc = GenomicHilbertCurve(chr = "chr1", level = 5, reference = TRUE, reference_gp = gpar(lty = 1, col = "grey"), arrow = FALSE) hc_segments(hc, g, gp = gpar(lwd = 6, col = rand_color(length(g)))) ``` ### Sequence conservation The conserved regions on human chromosome 1 between mouse and human or between zebrafish and human are downloaded from [UCSC Table Browser](http://genome.ucsc.edu/cgi-bin/hgTables?hgsid=472144813_MQ4BbmgW37du53Nw6n9SKKy6mb6r&clade=mammal&org=Human&db=hg38&hgta_group=compGeno&hgta_track=placentalChainNet&hgta_table=netMm10&hgta_regionType=genome&position=chr9%3A133252000-133280861&hgta_outputType=primaryTable&hgta_outFileName=). The parameters for downloading are: ``` clade: Mammal genome: Human assembly: Feb. 2009(GRCh37/hg19) group: Comparative Genomics # for mouse track: Placental Chain/Net table: Mouse Net (netMm10) # for zebrafish track: Vertebrate Chain/Net table: Zebrafish Net(netDanRer7) ``` The "net" alignment which is used here shows the best alignment (the alignment algorithm allows longer gaps than traditional affine gap scoring systems) between each part in human genome and other genomes. For a detailed description of the alignment strategy, please go to [UCSC Table Browser](http://genome.ucsc.edu/cgi-bin/hgTables?hgsid=472144813_MQ4BbmgW37du53Nw6n9SKKy6mb6r&clade=mammal&org=Human&db=hg38&hgta_group=compGeno&hgta_track=placentalChainNet&hgta_table=netMm10&hgta_regionType=genome&position=chr9%3A133252000-133280861&hgta_outputType=primaryTable&hgta_outFileName=), and click **describe table schema** button. In following example, the conservation data is mapped to a level 6 Hilbert curve under "normal" mode. Points are used as the graphic and each segment on the Hilbert curve is split by 3 points. In this case, each point on the curve represents approximately 30kb. Only chromosome 1 is visualized. The point is fully red if the window which is represented by this point is completely covered by the conserved regions, and it is yellow if it is not covered by any region. Colors are interpolated linearly between red and yellow if it is partially covered by the conserved regions. ```{r} # for generating the legend cm = ColorMapping(col_fun = colorRamp2(c(0, 1), c("yellow", "red"))) lgd = color_mapping_legend(cm, title = "Conservation", plot = FALSE, at = c(0, 0.2, 0.4, 0.6, 0.8, 1), labels = c("0%", "20%", "40%", "60%", "80%", "100%")) chr1_len = 249250621 ``` Conservation between human and mouse. ```{r, fig.width = 7.5, fig.height = 7} load(system.file("extdata", "mouse_net.RData", package = "HilbertCurve")) seqlengths(mouse) = chr1_len # it is only used to extract the complement nonmouse = gaps(mouse); nonmouse = nonmouse[strand(nonmouse) == "*"] gr = c(mouse, nonmouse) col = c(rep("red", length(mouse)), rep("yellow", length(nonmouse))) hc = GenomicHilbertCurve(chr = "chr1", level = 6, title = "Conservation between mouse and human on chr1", legend = lgd) hc_points(hc, gr, np = 3, gp = gpar(col = NA, fill = col)) ``` Conservation between human and zebrafish. ```{r, fig.width = 7.5, fig.height = 7} load(system.file("extdata", "zebrafish_net.RData", package = "HilbertCurve")) seqlengths(zebrafish) = chr1_len nonzebrafish = gaps(zebrafish); nonzebrafish = nonzebrafish[strand(nonzebrafish) == "*"] gr = c(zebrafish, nonzebrafish) col = c(rep("red", length(zebrafish)), rep("yellow", length(nonzebrafish))) hc = GenomicHilbertCurve(chr = "chr1", level = 6, title = "Conservation between zebrafish and human on chr1", legend = lgd) hc_points(hc, gr, np = 3, gp = gpar(col = NA, fill = col)) ``` Compareing these two plots, we have following conclusions: 1. Although it is already known that mouse is far more orthologically close to human, the Hilber curve illustrates more clearly that almost all parts of human chromosome 1 have similar sequences on mouse genome. 2. The start and end of the human chromosome 1 is less conserved to mouse compared to other parts in the chromosome. Also there are several less conserved regions inside the human chromosome. 3. It illustrates how the regions which are conserved to zebrafish spread in the human chromosome and it gives an estimation of the positions and length of the conserved regions. The idea of visualizing sequencing conservation through the Hilbert curve is from [here](http://mkweb.bcgsc.ca/hilbert/scientificamerican.mhtml). ### GC percent Following figure visualizes GC percent on chromosome 1, under "normal" mode. The GC content per 1KB window is calculated by `hgGcPercent.bin` binary which is downloaded from http://hgdownload.soe.ucsc.edu/admin/exe/linux.x86_64/. ```{r, eval = grepl("tbi", Sys.info()["nodename"]) & Sys.info()["user"] == "guz", echo = 2:11} png('gc_percent_chr1_points.png', width = 500, height = 500) df = read.table(pipe("awk '$1==\"chr1\"' ~/HilbertCurveTest/hg19_gc_percent_window1000.bed")) col_fun = colorRamp2(quantile(df[[5]], c(0.1, 0.5, 0.9)), c("green", "#FFFFCC", "red")) cm = ColorMapping(col_fun = col_fun) lgd = color_mapping_legend(cm, title = "GC percent", plot = FALSE, at = c(300, 400, 500, 600), labels = c("30%", "40%", "50%", "60%")) hc = GenomicHilbertCurve(chr = "chr1", level = 6, legend = lgd) hc_points(hc, df, np = 3, gp = gpar(fill = col_fun(df[[5]]), col = col_fun(df[[5]]))) hc_rect(hc, reduce(g), gp = gpar(fill = "#00000020", col = NA)) invisible(dev.off()) ``` ```{r, echo = FALSE, results = "asis"} cat("