1 Introduction

PICB is a flexible toolkit for assembling, prioritizing, and characterizing piRNA clusters. piRNAs (PIWI-interacting RNAs) and their PIWI protein partners play a key role in fertility and maintaining genome integrity by restricting mobile genetic elements (transposons) in germ cells. piRNAs originate from genomic regions called piRNA clusters.

PICB identifies genomic regions with a high density of piRNAs through stepwise integration of unique and multimapping piRNAs.

Figure 1: PICB considers unique mapping piRNAs (NH=1), primary alignments of multimapping piRNAs (NH>1), and all possible alignments stepwise to build seeds, cores, and clusters. Find additional information in our recent publication.

Only very limited programming knowledge is needed to run PICB. Check out the following instructions and demonstration.

2 Getting Started

It is possible to run PICB in RStudio, in an R script on your local machine or with High Performance Computing (HPC) resources or in Jupyter Notebook (using R). Keep in mind that as for any handling of large-scale sequencing data you need to have sufficient memory on your device (or cluster) allocated.

PICB requires the CRAN packages data.table, seqinr, openxlsx, dplyr, and the Bioconductor packages IRanges, GenomicRanges, GenomicAlignments, Rsamtools, Biostrings, GenomeInfoDb, BSgenome and rtracklayer installed.

Load PICB in your R environment:

install.packages(c(
    "data.table",
    "seqinr",
    "openxlsx",
    "dplyr"
))
if (!require("BiocManager", quietly = TRUE)) {
    install.packages("BiocManager")
}
BiocManager::install(c(
    "IRanges",
    "GenomicRanges",
    "GenomicAlignments",
    "Rsamtools",
    "Biostrings",
    "GenomeInfoDb",
    "BSgenome",
    "rtracklayer"
))

Next, install PICB from Bioconductor:

BiocManager::install("PICB")

Or install the latest development version from GitHub. Note that this version may be less stable than the Bioconductor release. Make sure to have devtools installed.

devtools::install_github("HaaseLab/PICB")

Then load PICB in your R environment:

library(PICB)

2.1 How to run PICB

PICB requires only two main inputs: a BAM file containing the piRNA alignments and the reference genome.

2.1.1 BAM File Requirements

  • piRNA dataset: Verify your system! piRNAs (short for PIWI-interacting RNAs) are defined by their association with PIWI proteins. Verify the expression of PIWI proteins before looking for piRNAs. If possible, work with ‘bona-fide’ piRNAs that are prepared from purified PIWI-piRNA complexes Girard et al., 2006. Unrelated RNA fragments can be easily mistaken for ‘piRNAs’ in small RNA sequencing data.
  • NH and NM tags: The BAM file must include these tags. PICB stops when these tags are not included and provides a descriptive error message.
  • Coordinate-sorted and indexed: Ensure that your BAM file is sorted and accompanied by a .bai index file. PICB’s dependency, GenomicAlignments, stops if the index file is not found and provides a descriptive error message.

2.1.2 Information on the demo dataset

In the following example, we demonstrate the basic workflow of PICB using a sample dataset.

For this demonstration, we utilize a subset of our mapped small RNAs from Drosophila ovaries SRR7346241. In short, we trimmed the 3’ adaptor and filtered for reads ≥24 nucleotides. Next, we removed abundant cellular RNAs (rRNAs, tRNAs, snRNAs, snoRNAs) by annotation. Finally, we mapped the processed reads to the Dm6 genome using STAR (Dobin et al., 2013) allowing up to one mismatch and 100 alignments per read. Using samtools (Li et al., 2009), we extracted alignments from chr2L:20000000-21000000. To further reduce the file size, we retained only the essential tags for PICB (NH, NM) in the BAM file, replaced the sequence quality field (QUAL) with a placeholder (*) and downsampled the file to retain 22% of the reads. Lastly, we applied a high level of compression to further reduce its size and indexed the compressed and sorted BAM file. More information on the processing steps can be found in the inst/script folder.

The purpose of this entire reduction process was to significantly minimize the file size, making it manageable for storage. It is important to note that this downsampling, combined with the prior restriction to a smaller genomic region (chr2L:20000000-21000000), has drastically reduced the coverage and scope of the data, rendering it no longer representative of a biologically relevant sample.

2.1.3 Reference Genome

You can provide the reference genome to PICB in one of four ways. Make sure to use the same reference genome version as used for mapping.

  1. Using a BSgenome package

If your genome is available in a BSgenome package, load it directly:

# Example for Drosophila melanogaster (make sure to install the package first), replace with your own genome
library(BSgenome.Dmelanogaster.UCSC.dm6)
myGenome <- "BSgenome.Dmelanogaster.UCSC.dm6"
  1. Using chromosome names and lengths

If you have the chromosome names and lengths, create a Seqinfo object:

myGenome <- GenomeInfoDb::Seqinfo(
    seqnames = c("chr2L", "chr2R", "chr3L", "chr3R", "chr4", "chrX", "chrY"),
    seqlengths = c(23513712, 25286936, 28110227, 32079331, 1348131, 23542271, 3667352)
)
  1. Using a Seqinfo object

Or use an existing Seqinfo object:

myGenome <- GenomeInfoDb::Seqinfo(genome = "dm6")
  1. Using a FASTA file

If you have a FASTA file of the assembled genome sequence:

myGenome <- PICBloadfasta(FASTA.NAME = "/path/to/your/genome.fa")

3 Running PICB

PICB involves two main functions:

  1. PICBload: Loads the piRNA alignments from the BAM file.
  2. PICBbuild: Builds the piRNA clusters from the loaded alignments.

Next to these two main functions, PICB also includes a function to optimize parameters for PICBbuild (PICBoptimize) and a function to perform strand-specific analysis (PICBstrandanalysis).

3.1 Load Alignments with PICBload

Load the sample BAM file provided with the package or provide your own BAM file path.

# demonstration bam_file is in the "inst/extdata/" folder of the PICB package
bam_file <- system.file("extdata", "Fly_Ov1_chr2L_20To21mb_filtered.bam", package = "PICB")
myAlignments <- PICBload(
    BAMFILE = bam_file,
    REFERENCE.GENOME = myGenome
)
#> PICB v0.99.15 Processing ...
#> 
#> Sorting into uniquemappers vs multimappers and primary vs secondary alignments
#> Loading primary only
#> 
#> prepared loading parameters
#>  TAGS:   NH, NM
#>  CIGAR:  simple cigar
#>  WHAT:   flag
#> loading .bam file into GAlignments
#> 
#> ******
#> Loading unique and primary alignments !!!
#> ******
#>  IMPORTED: 78906
#> 
#> filtering by read size
#>  RANGE:  18-50
#>  REMAINDER: 100 %
#> 
#> filtering based on flags
#>  REMAINDER: 100 %
#> 
#> converting to GRanges
#> 
#> Seqlevels style set to: UCSC
#> Loading secondary only
#> 
#> prepared loading parameters
#>  TAGS:   NH, NM
#>  CIGAR:  simple cigar
#>  WHAT:   flag
#> loading .bam file into GAlignments
#> 
#> ******
#> Loading secondary alignments only !!!
#> ******
#>  IMPORTED: 369219
#> 
#> filtering by read size
#>  RANGE:  18-50
#>  REMAINDER: 100 %
#> 
#> filtering based on flags
#>  REMAINDER: 100 %
#> 
#> converting to GRanges
#> 
#> Seqlevels style set to: UCSC
#> 
#> Done!
#> 

The output is a list with three GenomicRanges objects:

  • myAlignments$unique: Unique mapping piRNAs (read mapped only to a single location on the reference genome)
  • myAlignments$multi.primary: Primary multimapping piRNAs (read mapped to multiple locations on the reference genome, one alignment is assigned as primary by the alignment software based on the mapping quality score. If multiple alignments have the same mapping quality score, the primary alignment is selected randomly.)
  • myAlignments$multi.secondary: Secondary multimapping piRNAs (read mapped to multiple locations on the reference genome, all other alignments are secondary)

3.2 Build piRNA Clusters with PICBbuild

Next, we want to construct the piRNA clusters using the PICBbuild function. Usually you would not need to include the size of the library (LIBRARY.SIZE) since PICB calculates it automatically. However, just for this demo, please include this parameter to adjust to the subset we chose to create this demo.

myClusters <- PICBbuild(
    IN.ALIGNMENTS = myAlignments,
    REFERENCE.GENOME = myGenome,
    LIBRARY.SIZE = 12799826 #usually not necessary
)$clusters
#> PICB v0.99.15 Starting ...
#> 
#>  Keeping standard linear chromosomes
#> 
#> Building ... STEP 1... Searching using windows
#>  Sliding window analysis
#>      UNIQUE MAPPERS
#>      WINDOW: 350
#>      STEP: 35
#>      PRIMARY MULTI MAPPERS
#>      WINDOW: 350
#>      STEP: 35
#>      SECONDARY MULTI MAPPERS
#>      WINDOW: 1000
#>      STEP: 100
#>      MIN.UNIQUE.ALIGNMENTS.PER.WINDOW >= 8.9598782
#> Annotating seeds
#> Removing seed with unique mapping coverage less than 0 nt
#>      MIN.PRIMARY.MULTIMAPPING.ALIGNMENTS.PER.WINDOW >= 17.9197564
#>      MIN.SECONDARY.MULTIMAPPING.ALIGNMENTS.PER.WINDOW >= 2.5599652
#> 
#> Building ... STEP 2 ... Annotating & filtering
#>  Annotating cores and clusters
#> 
#>  Clusters stats:
#>      Accomodated UNIQUE MAPPERS: 30315 (97.273 %)
#>      Accomodated PRIMARY MULTI MAPPERS: 43382 (90.869 %)
#> 
#> Done!

We note that ranking piRNA clusters is essential for proper interpretation. PICB provides various output columns as options to rank the clusters.

💡 Both PICBload and PICBbuild allow parameter adjustments. Refer to chapter Parameter Adjustments for insights into customization options. PICBoptimize can help you find the optimal parameters for your dataset.

3.3 Optimize parameters with PICBoptimize

You can use PICBoptimize to find the optimal parameters for PICBbuild for your dataset. This is especially useful if you are not sure about the quality of your data or your reference genome. The goal is to find a set of parameters that maximize the number of piRNAs accomodated in the clusters while minimizing genomic space occupied by the clusters.

Make sure to provide the values for the parameters as a vector. An example is shown below:

parameterExploration <- PICBoptimize(
    IN.ALIGNMENTS = myAlignments,
    REFERENCE.GENOME = myGenome,
    MIN.UNIQUE.ALIGNMENTS.PER.WINDOW = c(1, 2, 3, 4, 5), 
    LIBRARY.SIZE = 12799826, #usually not necessary
    VERBOSITY = 1
)
#> PICB v0.99.15 Starting ...
#> Checking the inputs
#> Adding MIN.UNIQUE.ALIGNMENTS.PER.WINDOW to iteration targets
#> Warning in PICBoptimize(IN.ALIGNMENTS = myAlignments, REFERENCE.GENOME =
#> myGenome, : The total number of primary alignments is not equal to the total
#> number of read names. This discrepancy may occur when using a subset of data,
#> such as in the PICB demonstration, where you specified the LIBRARY.SIZE but the
#> actual number of reads is reduced. For full piRNA datasets, this indicates a
#> potential issue with read assignment and your numbers of explained reads may be
#> wrong.
#> Building the parameter combinations
#> Iteration 1 out of 5
#> PICB v0.99.15 Starting ...
#> 
#> Building ... STEP 1... Searching using windows
#> 
#> Building ... STEP 2 ... Annotating & filtering
#> 
#> Done!
#> Iteration 2 out of 5
#> PICB v0.99.15 Starting ...
#> 
#> Building ... STEP 1... Searching using windows
#> 
#> Building ... STEP 2 ... Annotating & filtering
#> 
#> Done!
#> Iteration 3 out of 5
#> PICB v0.99.15 Starting ...
#> 
#> Building ... STEP 1... Searching using windows
#> 
#> Building ... STEP 2 ... Annotating & filtering
#> 
#> Done!
#> Iteration 4 out of 5
#> PICB v0.99.15 Starting ...
#> 
#> Building ... STEP 1... Searching using windows
#> 
#> Building ... STEP 2 ... Annotating & filtering
#> 
#> Done!
#> Iteration 5 out of 5
#> PICB v0.99.15 Starting ...
#> 
#> Building ... STEP 1... Searching using windows
#> 
#> Building ... STEP 2 ... Annotating & filtering
#> 
#> Done!
#> Values normalization

This function works on numerical parameters (see list of parameters in Parameter Adjustments) and returns a dataframe with any combination of your chosen parameters and their corresponding number of clusters, the total width of the clusters in nucleotides, the number of reads explained by the clusters, the mean RPKM of the clusters, the fraction of reads explained by the clusters and the fraction of genomic space occupied by the clusters.

Be aware that for most datasets, the default parameters will yield great results and you do not need to optimize parameters.

As an example, we can plot the results to find the optimal parameters for your dataset. First specify the parameter you want to optimize.

x_column <- "MIN.UNIQUE.ALIGNMENTS.PER.WINDOW" # change parameter to optimize, if applicable

Then plot the results.

library(ggplot2)
# Determine a scaling factor for the secondary axis
scaling_factor <- max(parameterExploration$fraction.of.library.explained.by.clusters) / max(parameterExploration$fraction.of.genome.space.clusters)
# plot graph
ggplot(parameterExploration, aes(x = .data[[x_column]])) +
    geom_line(aes(y = fraction.of.library.explained.by.clusters * 100, color = "piRNAs Explained"), linewidth = 1) +
    geom_point(aes(y = fraction.of.library.explained.by.clusters * 100, color = "piRNAs Explained"), size = 3) +
    geom_line(aes(y = fraction.of.genome.space.clusters * scaling_factor * 100, color = "Genome Space"), linewidth = 1) +
    geom_point(aes(y = fraction.of.genome.space.clusters * scaling_factor * 100, color = "Genome Space"), size = 3) +
    scale_y_continuous(name = "piRNAs Explained by Clusters (%, piRNA sample)", sec.axis = sec_axis(~ . / scaling_factor, name = "Total piRNA cluster-length (Genome, %)")) +
    scale_x_reverse(name = paste0("Parameter chosen: ", x_column), breaks = parameterExploration[[x_column]], labels = parameterExploration[[x_column]]) +
    scale_color_manual(name = "Metrics", values = c("piRNAs Explained" = "#00a100", "Genome Space" = "black")) +
    theme_classic() +
    theme(axis.title.y.left = element_text(color = "#00a100"), axis.title.y.right = element_text(color = "black"), legend.position = "top")

As described earlier, we subset the dataset to include only reads from chromosome 2L and further downsampled the file to retain 22% of the reads for this demonstration. This is why the proportion of reads explained by clusters and the genome space occupied by the clusters is so low, as shown in the plot. The plot can help you identify the optimal parameters for your dataset. However, for this demonstration, optimization is neither necessary nor meaningful. Refer to the paper for more information on how to interpret the results. In most cases, the default parameters yield excellent results (high piRNA coverage and low genome space occupation), so parameter optimization is usually unnecessary.

3.4 Strand-specific analysis with PICBstrandanalysis

Next to the core-PICB analysis, PICBstrandanalysis allows for strand-specific analysis of piRNA clusters. This computes the sense/antisense ratio of piRNAs per cluster (non-collapsed) and adds this information as a new metadata column (s_as_ratio).

myClustersWithStrandAnalysis <- PICBstrandanalysis(
    IN.ALIGNMENTS = myAlignments,
    IN.RANGES = myClusters
)
#> PICB v0.99.15 Starting with strand analysis ...
#>  Using unique alignments only
#>  Finding overlaps of clusters with piRNAs from plus strand
#>  Finding overlaps of clusters with piRNAs from minus strand
#>  Adding ratio of sense/antisense piRNAs per cluster
#> Done!

To visualize the sense/antisense ratio of the clusters, you can plot the sense/antisense ratio of the clusters in a violin plot. Higher numbers indicate a higher ratio of sense to antisense piRNAs, while values close to 1 indicate an equal ratio of sense and antisense piRNAs.

In Explore Results you can see that the sense/antisense ratio is added as a new metadata column (s_as_ratio) and is quite low. This indicates that the demonstration example is dealing with dual-stranded piRNA clusters.

3.5 Explore Results

Inspect the piRNA clusters:

myClustersWithStrandAnalysis
#> GRanges object with 4 ranges and 10 metadata columns:
#>       seqnames            ranges strand | width_in_nt uniq_reads_FPM
#>          <Rle>         <IRanges>  <Rle> |   <integer>      <numeric>
#>   [1]    chr2L 20103901-20119900      + |       16000        444.772
#>   [2]    chr2L 20147701-20224700      + |       77000       1134.937
#>   [3]    chr2L 20104201-20121400      - |       17200        113.595
#>   [4]    chr2L 20147701-20224600      - |       76900        675.087
#>       multimapping_reads_primary_alignments_FPM
#>                                       <numeric>
#>   [1]                                  131.7205
#>   [2]                                 1687.6011
#>   [3]                                   75.0791
#>   [4]                                 1494.8641
#>       all_reads_primary_alignments_FPM uniq_reads_FPKM
#>                              <numeric>       <numeric>
#>   [1]                          576.492        27.79823
#>   [2]                         2822.538        14.73945
#>   [3]                          188.674         6.60438
#>   [4]                         2169.951         8.77877
#>       multimapping_reads_primary_alignments_FPKM
#>                                        <numeric>
#>   [1]                                    8.23253
#>   [2]                                   21.91690
#>   [3]                                    4.36507
#>   [4]                                   19.43906
#>       all_reads_primary_alignments_FPKM
#>                               <numeric>
#>   [1]                           36.0308
#>   [2]                           36.6563
#>   [3]                           10.9694
#>   [4]                           28.2178
#>       fraction_of_width_covered_by_unique_alignments         type s_as_ratio
#>                                            <numeric>  <character>  <numeric>
#>   [1]                                       0.482312 ExtendedCore   3.916094
#>   [2]                                       0.204636    MultiCore   1.681092
#>   [3]                                       0.289535    MultiCore   0.254950
#>   [4]                                       0.181638    MultiCore   0.594851
#>   -------
#>   seqinfo: 7 sequences from an unspecified genome; no seqlengths

Explainations of the output columns are listed under the chapter Output.

3.6 Export Results

You can export the results to an excel file using PICBexporttoexcel.

PICBexporttoexcel(
    IN.RANGES = myClustersWithStrandAnalysis,
    EXCEL.FILE.NAME = "myClusters_demonstration.xlsx"
)
#> IN.RANGES is provided as a single GRanges object. Converting to a list with the name 'clusters'. If you intend to save seeds or cores, please input the data as a list (e.g., list(seeds = IN.RANGES)).
#> Saved to Excel.

When exporting clusters only (PICBbuildOutput$clusters), you receive an excel file with just one sheet containing the clusters. When you export all resulting list items of PICBbuild (PICBbuildOutput with all 3 elements: seeds, cores and clusters), you receive an excel file with three sheets containing the seeds, cores and clusters.

3.7 Import Results

When needed, you can import the results from an excel file using PICBimportfromexcel.

myClustersWithStrandAnalysis <- PICBimportfromexcel(EXCEL.FILE.NAME = "myClusters_demonstration.xlsx")

💡 Keep in mind that you just run a demo. These are not representative clusters! Though you can apply these steps to your organism of choice and sequencing data.

4 Parameter Adjustments

As previously mentioned, PICB allows wide-ranging parameter adjustments to adapt to e.g. sparse reference genomes and specific limitations of the data set. Lists of possible parameter adjustments for PICBload and PICBbuild are shown below.

4.1 Parameters for PICBload

Required parameters:

  • BAMFILE
  • REFERENCE.GENOME

Optional parameters:

  • VERBOSE default TRUE • Allows disabling progress messages while running PICBload.
  • IS.SECONDARY.ALIGNMENT default NA (all alignments) • Determines which alignment types (primary multimappers and secondary multimappers) will be loaded.
  • SEQ.LEVELS.STYLE default “UCSC” • Allows changing the style of chromosome names in the output. Supported styles are listed in GenomeInfoDb::genomeStyles(). NA allows keeping all chromosomes from REFERENCE.GENOME and does not filter by standard linear chromosome names.
  • STANDARD.CONTIGS.ONLY default TRUE • Determines whether alignments from non-standard contigs are used.
  • FILTER.BY.FLAG default TRUE • Allows only those alignments with flag values present in the vector of allowed flags SELECT.FLAG. Default values of SELECT.FLAG are 0, 16, 272 and 256 (primary and secondary alignments on plus and minus strands). If FALSE, includes all flags.
  • USE.SIZE.FILTER default TRUE • Allows filtering of alignments based on size. Default value is 18-50 nt.
  • TAGS default c(“NH”,“NM”) • Indicates list of tags to be extracted from given bam file. The “NH” tag is required to deduce if the alignment is unique mapping or multimapping. The “NM” is required to identify mismatches if required in further analysis.
  • GET.ORIGINAL.SEQUENCE default FALSE • Allows extraction of original read sequence from the bam file.
  • SIMPLE.CIGAR default TRUE • Allows filtering out spliced alignments.
  • PERFECT.MATCH.ONLY default FALSE • Allows filtering out alignments with mismatches.
  • WHAT default c(“flag”) • Allows importing flags. Corresponds to “what” from ScanBamParam-class Rsamtools.

4.2 Parameters for PICBbuild

Required parameters:

  • IN.ALIGNMENTS (output of PICBload)
  • REFERENCE.GENOME (reference genome previously used in PICBload)

The library size can be adjusted as shown in our PICB demo, however in most cases this is not necessary.

  • LIBRARY.SIZE: default number of unique mapping alignments + number of primary multimapping alignments • number of reads in the library.
  • VERBOSITY default 2 • Allows choosing the quantity of progress messages while running PICBbuild. Depending on VERBOSITY’s value, printed messages are missing (0), include current processing step (1), include additionally current processing sub-step (2) or include chosen parameters for PICBbuild.
  • COMPUTE.1U.10A.FRACTIONS default FALSE • Calculates the fractions of piRNAs with a 1U (Uridine at the 5’ most piRNA position) and a 10A (adenine at position 10). Requirement: GET.ORIGINAL.SEQUENCE set to TRUE in PICBload.

PICBbuild integrates unique mapping (seeds), primary multimapping (cores) and secondary alignments stepwise using the sliding window algorithm. Each step allows following parameter adjustments.

Adjustable parameters for processing unique mapping alignments:

  • UNIQUEMAPPERS.SLIDING.WINDOW.WIDTH default 350 nt • Sets length of sliding windows for uniquely mapping alignments.
  • UNIQUEMAPPERS.SLIDING.WINDOW.STEP default UNIQUEMAPPERS.SLIDING. WINDOW.WIDTH divided by 10 and rounded to the nearest integer • Sets distance between starts of the windows for uniquely mapping alignments.
  • MIN.UNIQUE.ALIGNMENTS.PER.WINDOW default Value corresponding to 2 FPKM • Sets coverage threshold for seed discovery. Corresponds to normalized counts of uniquely mapping alignments throughout the window (in FPKM).
  • MIN.UNIQUE.SEQUENCES.PER.WINDOW default 1 per 50 nt of window length or MIN.UNIQUE.ALIGNMENTS.PER.WINDOW whichever is smaller • Minimum number of distinct piRNA sequences within the sliding window. This parameter allows identification of clusters supported by a diverse set of piRNA sequences rather than a high number of reads from one or a few piRNA sequences.

The called windows are reduced into seeds and further filtered:

  • THRESHOLD.SEEDS.GAP default 0 nt • Removes gaps between seeds if below the given length value.
  • MIN.SEED.LENGTH default 2 * UNIQUEMAPPERS.SLIDING. WINDOW.WIDTH + 100 nt = 800 nt • Removes seeds below a certain length (Default requires an actual piRNA coverage of at least 100 nt: UNIQUEMAPPERS.SLIDING.WINDOW.WIDTH (see above, default: 350 nt) at both the 5’ end and the 3’ end, thus making a 350+350+100=800 nt long seed).
  • MIN.COVERED.SEED.LENGTH default 0 nt • Allows filtering by minimum number of seed bases covered by unique mapping alignments.

The next step includes primary multimapping alignments using a similar but simplified algorithm. Following parameters can be adjusted:

  • PRIMARY.MULTIMAPPERS.SLIDING.WINDOW.WIDTH: default 350 nt • Sets lengths of sliding windows for primary multimapping alignments.
  • PRIMARY.MULTIMAPPERS.SLIDING.WINDOW.STEP: default PRIMARY.MULTIMAPPERS. SLIDING.WINDOW.WIDTH divided by 10 and rounded to the nearest integer • Sets distances between starts of windows for primary multimapping alignments.
  • MIN.PRIMARY.MULTIMAPING.ALIGNMENTS.PER.WINDOW: default 4 FPKM • Sets coverage threshold for calling primary multimapping alignments windows. Corresponds to normalized counts of primary multimapping alignments throughout the window (in FPKM).

These resulting windows are being reduced into cores and are further filtered:

  • THRESHOLD.CORES.GAP: default 0 nt • Removes gaps between cores if below the given length value.

Seeds and primary multimapping windows overlapping seeds merge into cores. Standalone seeds are also considered cores. Cores not overlapping with a seed are being filtered out as their transcription cannot be verified.

In the next step secondary alignments are processed identically as primary multimapping alignments. This step is dependent on whether PICBload loaded the secondary alignments (parameter IS.SECONDARY.ALIGNMENT).

  • SECONDARY.MULTIMAPPERS.SLIDING.WINDOW.WIDTH: default 1000 nt • Sets lengths of sliding windows for secondary alignments.
  • SECONDARY.MULTIMAPPERS.SLIDING.WINDOW.STEP: default 100 nt • Sets distances between starts of the windows for secondary alignments.
  • MIN.SECONDARY.MULTIMAPING.ALIGNMENTS.PER.WINDOW: default 0.2 FPKM • Sets coverage threshold for calling secondary alignments windows. Corresponds to normalized counts of secondary alignments throughout the window (in FPKM).

Cores and secondary alignments overlapping cores merge into clusters. Standalone cores are also considered clusters. piRNA clusters were formed and include all alignment information.

5 Output

The output of PICBbuild includes seeds, cores and clusters, each in GenomicRanges format. In Running PICB, we extracted directly the clusters using $clusters. Extracting the seeds and cores can be done similarly using $seeds and $cores.

The piRNA clusters follow GRanges convention including the genomic coordinates (seqnames, ranges, and strand) and metacolumns. There are in total 9 metacolumns when running PICB with default parameters:

5.1 Metadata Columns

  • type: Type of cluster (SingleCore, ExtendedCore, MultiCore).
  • width_in_nt: Width of the seed/core/cluster.
  • uniq_reads_FPM: Uniquely mapping piRNA reads aligned to the seed/core/cluster (normalized to the number of all aligned reads)
  • multimapping_reads_primary_alignments_FPM: Number of primary multimapping alignments overlapping to the seed/core/cluster (normalized to the number of all aligned reads)
  • all_reads_primary_alignments_FPM: Number of primary alignments overlapping the seed/core/cluster (normalized to the number of all aligned reads)
  • uniq_reads_FPKM: Number of unique reads normalized to the number of all aligned reads and to the size of the corresponding seed/core/cluster length (unique reads FPM per kilobase)
  • multimapping_reads_primary_alignments_FPKM: Number of multimapping reads normalized to the number of all aligned reads and to the size of the corresponding seed/core/cluster length (primary alignments of multimapping reads FPM per kilobase)
  • all_reads_primary_alignments_FPKM: Number of all reads primary alignments normalized to the number of all aligned reads and to the size of the corresponding seed/core/cluster length (primary alignments of all reads FPM per kilobase)
  • fraction_of_width_covered_by_unique_alignments: Fraction of seed/core/cluster length coreved by at least 1 unique mapping piRNA read

Further analyze and visualize your piRNA clusters as needed. PICB provides an excellent starting point for identifying and characterizing piRNA clusters and serves as a stepping stone for further piRNA research. # Acknowledgments

We thank all contributors and supporters of PICB.

Authors:

The Haase Lab at the National Institute of Health (NIH)

Visit the lab website of Astrid D. Haase, M.D., Ph.D.

6 Session Information

sessionInfo()
#> R Under development (unstable) (2025-01-20 r87609)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.1 LTS
#> 
#> Matrix products: default
#> BLAS:   /home/biocbuild/bbs-3.21-bioc/R/lib/libRblas.so 
#> LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0  LAPACK version 3.12.0
#> 
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_GB              LC_COLLATE=C              
#>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
#>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
#> 
#> time zone: America/New_York
#> tzcode source: system (glibc)
#> 
#> attached base packages:
#> [1] stats4    stats     graphics  grDevices utils     datasets  methods  
#> [8] base     
#> 
#> other attached packages:
#>  [1] ggplot2_3.5.1                         BSgenome.Dmelanogaster.UCSC.dm6_1.4.1
#>  [3] BSgenome_1.75.1                       rtracklayer_1.67.0                   
#>  [5] BiocIO_1.17.1                         Biostrings_2.75.3                    
#>  [7] XVector_0.47.2                        GenomicRanges_1.59.1                 
#>  [9] GenomeInfoDb_1.43.4                   IRanges_2.41.2                       
#> [11] S4Vectors_0.45.2                      BiocGenerics_0.53.6                  
#> [13] generics_0.1.3                        PICB_0.99.15                         
#> [15] BiocStyle_2.35.0                     
#> 
#> loaded via a namespace (and not attached):
#>  [1] SummarizedExperiment_1.37.0 gtable_0.3.6               
#>  [3] rjson_0.2.23                xfun_0.50                  
#>  [5] bslib_0.9.0                 Biobase_2.67.0             
#>  [7] lattice_0.22-6              vctrs_0.6.5                
#>  [9] tools_4.5.0                 bitops_1.0-9               
#> [11] curl_6.2.0                  parallel_4.5.0             
#> [13] tibble_3.2.1                pkgconfig_2.0.3            
#> [15] Matrix_1.7-2                data.table_1.16.4          
#> [17] lifecycle_1.0.4             GenomeInfoDbData_1.2.13    
#> [19] farver_2.1.2                compiler_4.5.0             
#> [21] Rsamtools_2.23.1            tinytex_0.54               
#> [23] munsell_0.5.1               codetools_0.2-20           
#> [25] htmltools_0.5.8.1           sass_0.4.9                 
#> [27] RCurl_1.98-1.16             yaml_2.3.10                
#> [29] pillar_1.10.1               crayon_1.5.3               
#> [31] jquerylib_0.1.4             BiocParallel_1.41.0        
#> [33] cachem_1.1.0                DelayedArray_0.33.5        
#> [35] magick_2.8.5                abind_1.4-8                
#> [37] zip_2.3.2                   tidyselect_1.2.1           
#> [39] digest_0.6.37               stringi_1.8.4              
#> [41] dplyr_1.1.4                 restfulr_0.0.15            
#> [43] bookdown_0.42               labeling_0.4.3             
#> [45] fastmap_1.2.0               grid_4.5.0                 
#> [47] colorspace_2.1-1            cli_3.6.3                  
#> [49] SparseArray_1.7.5           magrittr_2.0.3             
#> [51] S4Arrays_1.7.3              XML_3.99-0.18              
#> [53] withr_3.0.2                 UCSC.utils_1.3.1           
#> [55] scales_1.3.0                rmarkdown_2.29             
#> [57] httr_1.4.7                  matrixStats_1.5.0          
#> [59] openxlsx_4.2.8              evaluate_1.0.3             
#> [61] knitr_1.49                  rlang_1.1.5                
#> [63] Rcpp_1.0.14                 glue_1.8.0                 
#> [65] BiocManager_1.30.25         jsonlite_1.8.9             
#> [67] R6_2.5.1                    MatrixGenerics_1.19.1      
#> [69] GenomicAlignments_1.43.0

For any questions or issues, please contact the PICB development team through email.