Table of Contents

Making Hilbert Curve

Making Hilbert Curve

Author: Zuguang Gu ( z.gu@dkfz.de )

Date: 2015-10-14

Hilbert curve is a type of space-filling curves that fold one dimensional axis into a two dimensional space, but with still keeping the locality. It has advantages to visualize data with long axis in following two aspects:

greatly improve resolution for the visualization;
easy to visualize clusters because generally data points in the cluster will also be close in the Hilbert curve.

This package aims to provide an easy and flexible way to visualize data through Hilbert curve. The implementation and example figures are based on following sources:

suppressPackageStartupMessages(library(HilbertCurve))
library(circlize)
set.seed(12345)

Following shows Hilbert curve with level 2, 3, 4, 5:

plot of chunk unnamed-chunk-3

Following heatmap shows distance between elements in the Hilbert curve. From left to right and from top to bottom, the order is the natural order of data points and colors correspond to the distance for pair distance in the Hilbert curve. Basically, if data points are close in the one dimensional axis, they also have small distance in the Hilbert curve (regions around diagonals)

library(HilbertVis)
pos = HilbertVis::hilbertCurve(5)
mat = as.matrix(dist(pos))
library(ComplexHeatmap)

ht = Heatmap(mat, name = "dist", cluster_rows = FALSE, cluster_columns = FALSE, 
    show_row_names = FALSE, show_column_names = FALSE, 
    heatmap_legend_param = list(title = "euclidean_dist"))
draw(ht, padding = unit(c(5, 5, 5, 2), "mm"))
decorate_heatmap_body("dist", {
    grid.segments(c(0.25, 0.5, 0.75, 0, 0, 0), c(0, 0, 0, 0.25, 0.5, 0.75), 
          c(0.25, 0.5, 0.75, 1, 1, 1), c(1, 1, 1, 0.25, 0.5, 0.75), gp = gpar(lty = 2))
    grid.text(rev(c(256, 512, 768, 1024)), 0, c(0, 256, 512, 768)/1024, just = "bottom", 
        rot = 90, gp = gpar(fontsize = 10))
    grid.text(c(1, 256, 512, 768, 1024), c(1, 256, 512, 768, 1024)/1024, 1, just = "bottom",
        gp = gpar(fontsize = 10))
})

plot of chunk unnamed-chunk-4

Basic settings

Generally, customizing a Hilbert curve follows following steps:

hc = HilbertCurve(...)
hc_points(hc, ...)
hc_rect(hc, ...)
hc_segments(hc, ...)
hc_text(hc, ...)

HilbertCurve() is a constructor function and initializes the Hilbert curve. Here the start and end position that correspond to the two ends of the Hilbert curve are specified as well as the level of the Hilbert curve. The function returns a HilbertCurve class instance and it can be used to add more graphics later.

hc = HilbertCurve(1, 100, level = 4, reference = TRUE)

The curve can be though as a folded axis. When the coordinate for this folded axis is initialized, low-level graphics can be added with specifying the positions.

Before we demonstrate low-level functions, first we generate a random regions by IRanges package.

x = sort(sample(100, 20))
s = x[1:10*2 - 1]
e = x[1:10*2]
ir = IRanges(s, e)
ir

## IRanges of length 10
##      start end width
## [1]      1   4     4
## [2]     14  15     2
## [3]     16  31    16
## [4]     33  34     2
## [5]     40  44     5
## [6]     48  65    18
## [7]     67  73     7
## [8]     75  78     4
## [9]     86  87     2
## [10]    91  97     7

Points

There are two modes for adding points. By default, every segment in the Hilbert curve are segmented into several small segments and a circle is plotted at every small segments.

hc = HilbertCurve(1, 100, level = 4, reference = TRUE)
hc_points(hc, ir)

plot of chunk unnamed-chunk-8

The number of circles in the small segment can be controlled by np and graphical parameters can be set by gp. Note under this mode, the size of points can only be changed by np argument. This mode is useful if you have long regions.

hc = HilbertCurve(1, 100, level = 4, reference = TRUE)
hc_points(hc, ir, np = 3, gp = gpar(fill = rand_color(length(ir))))

plot of chunk unnamed-chunk-9

Here such circles are not real points in R, they are polygons actually. There are some pre-defined shapes that you can choose from: “circle”, “square”, “triangle”, “hexagon”, “star”.

If np is set less than 2 or NULL, the points will be plotted at the center of every interval in ir. In this case, size argument is used to control the size of the points. This mode is useful if you have a lot of small regions.

hc = HilbertCurve(1, 100, level = 4, reference = TRUE)
hc_points(hc, ir, np = NULL, size = unit(runif(length(ir)), "cm"), pch = 16)

plot of chunk unnamed-chunk-10

Segments

Adding segments is straightforward.

hc = HilbertCurve(1, 100, level = 4, reference = TRUE)
hc_segments(hc, ir)

plot of chunk unnamed-chunk-11

Rectangles

Adding rectangles is straightforward. Note You cannot set the width or height of the rectangles. Rectangles are always located at the turning points and have width or height equal to the length of the segments.

hc = HilbertCurve(1, 100, level = 4, reference = TRUE)
hc_rect(hc, ir)

plot of chunk unnamed-chunk-12

As you can see some rectangles are not full red. It is because these segments in the curve do not full cover regions in ir, Averaging is applied in such segments. Actually it is also applicable for the 'segmented circle' mode for hc_points.

Graphical values that correspond to ir are always represented as numeric values, such as size of points, width of lines or even colors which can be converted to numeric RGB values. If a segment in the Hilbert curve can not full overlap with some intervals in ir, these numeric parameters can be averaged.

There are three different modes for averaging which are controlled by mean_mode argument. Following illustrates different settings for mean_mode:

      100     80    60     values in ir (e.g. red compoment for colors)
    ++++++   +++   +++++   ir
      ================     window (width = 16)
       4      3     3      overlap

    absolute: (100 + 80 + 60)/3
    weighted: (100*4 + 80*3 + 60*3)/(4 + 3 + 3)
    w0:       (100*4 + 80*3 + 60*3)/16

So use of the mode depends on specific scenario. For example, if ir corresponds to positions of genes, then the mode of w0 is perhaps a good choise. If ir corresponds to positions of CpG sites which is has width of 1 and most of the time is sparse in genomic windows, then absolute is a correct choice.

Text

Adding texts is straightforward. Note text is added at the center of each interval in ir.

hc = HilbertCurve(1, 100, level = 4, reference = TRUE)
labels = sample(letters, length(ir), replace = TRUE)
hc_text(hc, ir, labels = labels)

plot of chunk unnamed-chunk-13

Combine low-level functions

With combination of these basic graphic functions, complicated graphics can be easily made:

hc = HilbertCurve(1, 100, level = 4)
hc_segments(hc, IRanges(1, 100))   # background line
hc_rect(hc, ir)
hc_points(hc, ir, np = 3)
hc_text(hc, ir, labels = labels, gp = gpar(fontsize = 16, col = "blue"))

plot of chunk unnamed-chunk-14

Non-integer positions

It doesn't matter if your positions are integers or not. Internally, rounding and zooming are applied to ensure the accuracy.

Since the positions are not integers, you can specify the positions by x1 and x2. All low-level graphical funtions accept x1 and x2.

hc = HilbertCurve(0.1, 0.8, level = 4, reference = TRUE)
hc_points(hc, x1 = c(0.15, 0.55), x2 = c(0.25, 0.65))

plot of chunk unnamed-chunk-15

Add title and legends

Title are allowed by setting title argument. Legend can be passed to legend argument in HilbertCurve() as a grob object or a list of grob objects. ColorMapping class in ComplexHeatmap package or legendGrob() in grid package can be used to create a legend grob. Or you can consider to use frameGrob() and packGrob() to build a legend from ground.

Because width and height are enforced to be equal for the Hilbert curve, sometimes you may see blank spaces around the curve.

value = runif(length(ir))
col_fun = colorRamp2(c(0, 1), c("white", "red"))
cm = ColorMapping(col_fun = col_fun)
legend = color_mapping_legend(cm, plot = FALSE, title = "value")
hc = HilbertCurve(1, 100, reference = TRUE, title = "points", legend = legend)
hc_points(hc, ir, np = 3, gp = gpar(fill = col_fun(value)))

plot of chunk unnamed-chunk-16

Pixel mode

When the level is high (e.g. > 10), the whole 2D space will be almost completely filled by the curve and it is impossible to add or visualize e.g. points on the curve. In this case, the 'pixel' mode visualizes each tiny 'segment' as a pixel and maps values to colors. So the Hilbert curve with level 11 will generate a PNG figure with 2048x2048 resolution. This is extremely useful for visualize genomic data. E.g. If we make a Hilbert curve for human chromosome 1 with level 11, then each pixel can represent 60bp (249250621/2048/2048) which is of very high resolution.

Under 'pixel' mode, if the current device is an interactive deivce, every time a new layer is added, the image will be add to the interactive device as a rastered image.

hc = HilbertCurve(1, 100, level = 9, mode = "pixel")
hc_layer(hc, ir)

plot of chunk unnamed-chunk-17

Since you can only use color to map to other values, there is only one graphical setting col. You can add more than one layers, just remember to set transparent colors. And of course you can add title and legend to the plot. Grid lines can be added to the plot for better distinguish blocks in the Hilbert curve.

hc = HilbertCurve(1, 100, level = 9, mode = "pixel", title = "pixel mode", legend = legend)
hc_layer(hc, ir, col = col_fun(value), grid_line = 2)
hc_layer(hc, x1 = 75, x2 = 85, col = "#0000FF10")

plot of chunk unnamed-chunk-18

The Hilbert curve can be save as PNG figure by hc_png(), with resolution 2^level x 2^level.

hc_png(hc, file = "test.png")

Examples

Visualize rainbow colors:

col = rainbow(100)
hc = HilbertCurve(1, 100, level = 5)
ir = IRanges(start = 1:99, end = 2:100)
hc_points(hc, ir, np = 4, gp = gpar(col = col, fill = col))

plot of chunk unnamed-chunk-20

Genes on chromosome 1 (RefSeq genes for human, hg19). Here random colors are used to represent to different genes.

chr1_len = 249250621
load(paste0(system.file("extdata", package = "HilbertCurve"), "/refseq_chr1.RData"))
hc = HilbertCurve(1, chr1_len, level = 5)
hc_segments(hc, IRanges(start = 1, end = chr1_len), gp = gpar(col = "grey"))
hc_segments(hc, ranges(g), gp = gpar(lwd = unit(6, "mm"), col = rand_color(length(g))))

plot of chunk unnamed-chunk-21

Following two figures compare sequence conservations. Data are from here.

load(paste0(system.file("extdata", package = "HilbertCurve"), "/mouse_net.RData"))
hc = HilbertCurve(1, chr1_len, level = 6)
ir1 = ranges(mouse)
ir2 = setdiff(IRanges(1, chr1_len), ir1)
ir = c(ir1, ir2)
col = c(rep("red", length(ir1)), rep("yellow", length(ir2)))
hc_points(hc, ir, np = 3, gp = gpar(col = col, fill = col))

plot of chunk unnamed-chunk-22

load(paste0(system.file("extdata", package = "HilbertCurve"), "/zebrafish_net.RData"))
hc = HilbertCurve(1, chr1_len, level = 6)
ir1 = ranges(zebrafish)
ir2 = setdiff(IRanges(1, chr1_len), ir1)
ir = c(ir1, ir2)
col = c(rep("red", length(ir1)), rep("yellow", length(ir2)))
hc_points(hc, ir, np = 3, gp = gpar(col = col, fill = col))

plot of chunk unnamed-chunk-22

GC percent on chromosome 1, under “normal” mode:

df = read.table(pipe("awk '$1==\"chr1\"' /path-to/hg19_gc_percent_window1000.bed"))
col_fun = colorRamp2(c(0, 500, 1000), c("green", "#FFFFCC", "red"))
png("gc_percent_chr1_points.png", width = 500, height = 500)
hc = HilbertCurve(1, chr1_len, level = 6)
hc_points(hc, IRanges(df[[2]], df[[3]]), np = 3, gp = gpar(fill = col_fun(df[[5]]), col = col_fun(df[[5]])))
hc_rect(hc, reduce(ranges(g)), gp = gpar(fill = "#00000020", col = "#00000020"))
dev.off()

GC percent on chromosome 1, under “pixel” mode:

hc = HilbertCurve(1, chr1_len, level = 9, mode = "pixel")
hc_layer(hc, IRanges(df[[2]], df[[3]]), col = col_fun(df[[5]]))
hc_layer(hc, reduce(ranges(g)), col = "#00000020")
hc_png(hc, file = "gc_percent_chr1.png")

Following data are from http://genboree.org/EdaccData/Release-9/sample-experiment/Lung/

First is the distribution of histone marks as well as the gene regions (shadow).

library(GetoptLong)

for(mark in c("H3K27ac", "H3K36me3", "H3K4me3", "H3K9me3")) {
    df = read.table(pipe(qq("awk '$5>0 && $1==\"chr1\"' /path-to/UCSD.Lung.@{mark}.STL002.bed")), sep = "\t")
    col_fun = colorRamp2(c(0, quantile(df[[5]], 0.99)), c("white", "red"))
    hc = HilbertCurve(1, chr1_len, level = 9, mode = "pixel")
    hc_layer(hc, IRanges(df[[2]], df[[3]]), col = col_fun(df[[5]]))
  hc_layer(hc, reduce(ranges(g)), col = "#00000010")
    hc_png(hc, file = qq("@{mark}_chr1.png"))
}

H3K27ac histone mark:

H3K36me3 histone mark. (H3K36me3 is found in actively transcribed gene bodies)

H3K4me3 histone mark. (H3K4me3 is found in actively transcribed promoters, particularly just after the transcription start site)

H3K9me3 histone mark. (H3K9me3 is found in constitutively repressed genes)

Coverage for RNAseq.

df = read.table(pipe(qq("awk '$5>0 && $1==\"chr1\"' /path-to/UCSD.Lung.mRNA-Seq.STL002.bed")), sep = "\t")
df[[5]] = log(df[[5]] + 1)
col_fun = colorRamp2(c(0, quantile(df[[5]], 0.99)), c("white", "red"))
hc = HilbertCurve(1, chr1_len, level = 9, mode = "pixel")
hc_layer(hc, IRanges(df[[2]], df[[3]]), col = col_fun(df[[5]]))
hc_layer(hc, reduce(ranges(g)), col = "#00000010")
hc_png(hc, file = qq("RNASeq_coverage_chr1.png"))

Methylation, blue corresponds to un-methylation, red corresponds to full-methylation.

df = read.table(pipe("awk '$1==\"chr1\"' /path-to/UCSD.Lung.Bisulfite-Seq.STL002.bed"), sep = "\t")
col_fun = colorRamp2(c(0, 0.5, 1), c("blue", "white", "red"))
hc = HilbertCurve(1, chr1_len, level = 9, mode = "pixel")
hc_layer(hc, IRanges(df[[2]], df[[3]]), col = col_fun(df[[5]]), mean_mode = "absolute")
hc_png(hc, file = "methylation_chr1.png")

Put all chromosomes in one plot

We demonstrated how to visualize one single chromosome through Hilbert curve, but sometimes we also want to put all chromosomes into one plot to get rid of reading too many plots simutaneously. One solution is to construct a 'huge' chromosome which merges all real chromosomes.

chr.len = read.chromInfo()$chr.len
sum_chr = 0
s = numeric(length(ratio))
e = numeric(length(ratio))
for(chr in paste0("chr", c(1:22, "X", "Y"))) {
    l = as.character(seqnames(ratio)) == chr
    s[l] = start(ratio[l]) + sum_chr
    e[l] = end(ratio[l]) + sum_chr
    sum_chr = sum_chr + chr.len[chr]
}

ir = IRanges(start = round(s/1000),
             end = round(e/1000))
hc = HilbertCurve(0, round(max(chr_cumsum)/1000), level = 9, mode = "pixel")
hc_layer(hc, ir, col = ..., grid_line = 3)

Following is an example to show log2-ratio of coverage between tumor and control samples for all chromosomes.

Also a map which shows the chromosome positions in Hilbert curve is necessary.

chr_cumsum = cumsum(chr.len[paste0("chr", c(1:22, "X", "Y"))])
chr_ir = IRanges(start = round(c(1, chr_cumsum[-length(chr_cumsum)]+1)/1000),
                 end = round(chr_cumsum/1000),
                 names = names(chr_cumsum))
hc = HilbertCurve(0, round(max(chr_cumsum)/1000), level = 8)
hc_rect(hc, chr_ir, gp = gpar(fill = rand_color(length(chr_ir), transparency = 0.5), col = NA))
hc_text(hc, chr_ir, names(chr_ir), gp = gpar(fontsize = 16))

Session info

sessionInfo()

## R version 3.2.2 (2015-08-14)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 14.04.3 LTS
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8       
##  [4] LC_COLLATE=C               LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                  LC_ADDRESS=C              
## [10] LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
##  [1] stats4    parallel  grid      stats     graphics  grDevices utils     datasets  methods  
## [10] base     
## 
## other attached packages:
##  [1] ComplexHeatmap_1.6.0 HilbertVis_1.28.0    lattice_0.20-33      circlize_0.3.1      
##  [5] HilbertCurve_1.0.0   GenomicRanges_1.22.0 GenomeInfoDb_1.6.0   IRanges_2.4.0       
##  [9] S4Vectors_0.8.0      BiocGenerics_0.16.0  knitr_1.11           markdown_0.7.7      
## 
## loaded via a namespace (and not attached):
##  [1] whisker_0.3-2       XVector_0.10.0      magrittr_1.5        zlibbioc_1.16.0    
##  [5] colorspace_1.2-6    rjson_0.2.15        stringr_1.0.0       tools_3.2.2        
##  [9] png_0.1-7           RColorBrewer_1.1-2  formatR_1.2.1       GlobalOptions_0.0.8
## [13] dendextend_1.1.0    shape_1.4.2         evaluate_0.8        stringi_0.5-5      
## [17] GetoptLong_0.1.0