GBScleanR 1.2.14
The GBScleanR package has been developed to conduct error correction on genotype data obtained via NGS-based genotyping methods such as RAD-seq and GBS (Miller et al. 2007; Elshire et al. 2011). It is designed to estimate true genotypes along chromosomes from given allele read counts in the VCF file generated by SNP callers like GATK and TASSEL-GBS (McKenna et al. 2010; Glaubitz et al. 2014). The current implementation supports genotype data of a mapping population derived from two or more diploid founders followed by selfings, sibling crosses, or random crosses. e.g. F\(_2\) and 8-way RILs. Our method supposes markers to be biallelic and ordered along chromosomes by mapping reads on a reference genome sequence. To support smooth access to large size genotype data, every input VCF file is first converted to a genomic data structure (GDS) file (Zheng et al. 2012). The current implementation cannot convert a VCF file containing non-biallelic markers to a GDS file correctly. Thus, any input VCF file should be subjected to filtering for retaining only biallelic SNPs using, for example, bcftools (Li and Barrett 2011). GBScleanR provides functions for data visualization, filtering, and loading/writing a VCF file. Furthermore, the data structure of the GDS file created via this package is compatible with those used in the SNPRelate, GWASTools and GENESIS packages those are designed to handle large variant data and conduct regression analysis (Zheng et al. 2012; Gogarten et al. 2012, 2019). In this document, we first walk through the utility functions implemented in GBScleanR to introduce a basic usage. Then, the core function of GBScleanR which estimates true genotypes for error correction will be introduced.
This package internally uses the following packages.
- ggplot2
- dplyr
- tidyr
- expm
- gdsfmt
- SeqArray
You can install GBScleanR from the Bioconductor repository with the following code.
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("GBScleanR")
The code below let you install the package from the github repository. The package released in the github usually get frequent update more than that in Bioconductor due to the release schedule.
if (!requireNamespace("devtools", quietly = TRUE))
install.packages("devtools")
devtools::install_github("tomoyukif/GBScleanR", build_vignettes = TRUE)
The main class of the GBScleanR package is GbsrGenotypData
which inherits the GenotypeData
class in the SeqArray package. The gbsrGenotypeData
class object has three slots: sample
, marker
, and scheme
. The data
slot holds genotype data as a gds.class
object which is defined in the gdsfmt
package while snpAnnot
and scanAnnot
contain objects storing annotation information of SNPs and samples, which are the SnpAnnotationDataFrame
and ScanAnnotationDataFrame
objects defined in the GWASTools package. See the vignette of GWASTools for more detail. GBScleanR follows the way of GWASTools in which a unique genotyping instance (genotyped sample) is called “sample”.
Load the package.
library("GBScleanR")
GBScleanR only supports a VCF file as input. As an example data, we use sample genotype data for a biparental F2 population derived from inbred founders.
vcf_fn <- system.file("extdata", "sample.vcf", package = "GBScleanR")
gds_fn <- tempfile("sample", fileext = ".gds")
As mentioned above, the GbsrGenotypeData
class requires genotype data in the gds.class
object which enable us quick access to the genotype data without loading the whole data on RAM. At the beginning of the processing, we need to convert data format of our genotype data from VCF to GDS. This conversion can be achieved using gbsrVCF2GDS()
as shown below. A compressed VCF file (.vcf.gz) also can be the input.
# `force = TRUE` allow the function to over write the GDS file,
# even if a GDS file exists at `out_fn`.
gbsrVCF2GDS(vcf_fn = vcf_fn, out_fn = gds_fn, force = TRUE, verbose = FALSE)
## [1] "/tmp/RtmpfzvM99/sample1c3d062b33ce07.gds"
Once we converted the VCF to the GDS, we can create a GbsrGenotypeData
instance for our data.
gds <- loadGDS(gds_fn, verbose = FALSE)
The first time to load a newly produced GDS file will take long time due to data reformatting for quick access.
Getter functions allow you to retrieve basic information of genotype data, e.g. the number of SNPs and samples, chromosome names, physical position of SNPs and alleles.
# Number of samples
nsam(gds)
## [1] 102
# Number of SNPs
nmar(gds)
## [1] 242
# Indices of chromosome ID of all markers
head(getChromosome(gds))
## [1] "1" "1" "1" "1" "1" "1"
# Chromosome names of all markers
head(getChromosome(gds))
## [1] "1" "1" "1" "1" "1" "1"
# Position (bp) of all markers
head(getPosition(gds))
## [1] 522289 571177 577905 720086 735019 841286
# Reference allele of all markers
head(getAllele(gds))
## [1] "A,G" "A,G" "C,T" "G,C" "C,T" "G,T"
# SNP IDs
head(getMarID(gds))
## [1] 1 2 3 4 5 6
# sample IDs
head(getSamID(gds))
## [1] "F2_1054" "F2_1055" "F2_1059" "F2_1170" "F2_1075" "F2_1070"
The function getGenotype()
returns genotype call data in which integer numbers 0, 1, and 2 indicate the number of reference allele.
geno <- getGenotype(gds)
The function getRead()
returns read count data as a list with two elements ref
and alt
containing reference read counts and alternative read counts, respectively.
geno <- getRead(gds)
countGenotype()
and countRead()
are class methods of GbsrGenotypeData
and they summarize genotype counts and read counts per marker and per sample.
gds <- countGenotype(gds)
gds <- countRead(gds)
These summary statistics can be visualized via plotting functions.
With the values obtained via countGenotype()
, we can plot histograms of missing rate , heterozygosity, reference allele frequency as shown below.
# Histgrams of missing rate
histGBSR(gds, stats = "missing")
# Histgrams of heterozygosity
histGBSR(gds, stats = "het")
# Histgrams of reference allele frequency
histGBSR(gds, stats = "raf")
With the values obtained via countRead()
, we can plot histograms of total read depth , allele read depth , reference read frequency as shown below.
# Histgrams of total read depth
histGBSR(gds, stats = "dp")
# Histgrams of allelic read depth
histGBSR(gds, stats = "ad_ref")
# Histgrams of allelic read depth
histGBSR(gds, stats = "ad_ref")
# Histgrams of reference allele frequency
histGBSR(gds, stats = "rrf")
# Histgrams of mean allelic read depth
histGBSR(gds, stats = "mean_ref")
# Histgrams of standard deviation of read depth
histGBSR(gds, stats = "sd_ref")
# Histgrams of standard deviation of read depth
histGBSR(gds, stats = "sd_ref")
plotGBSR()
and pairsGBSR()
provide other ways to visualize statistics. plotGBSR()
draws a line plot of a specified statistics per marker along each chromosome. pairsGBSR()
give us a two-dimensional scatter plot to visualize relationship between statistics.
plotGBSR(gds, stats = "missing")
plotGBSR(gds, stats = "geno")
pairsGBSR(gds, stats1 = "missing", stats2 = "dp")
The statistics obtained via countGenotype()
, countReat()
, and calcReadStats()
are sotred in the snpAnnot
and scanAnnot
slots. They can be retrieved using getter functions as follows.
# Reference genotype count per marker
head(getCountGenoRef(gds, target = "marker"))
## [1] 35 38 33 41 42 47
# Reference genotype count per sample
head(getCountGenoRef(gds, target = "sample"))
## [1] 172 76 76 118 28 89
# Heterozygote count per marker
head(getCountGenoHet(gds, target = "marker"))
## [1] 48 33 45 14 12 25
# Heterozygote count per sample
head(getCountGenoHet(gds, target = "sample"))
## [1] 35 79 26 72 179 97
# Alternative genotype count per marker
head(getCountGenoAlt(gds, target = "marker"))
## [1] 19 28 24 23 29 25
# Alternative genotype count per sample
head(getCountGenoAlt(gds, target = "sample"))
## [1] 25 46 122 46 20 39
# Missing count per marker
head(getCountGenoMissing(gds, target = "marker"))
## [1] 0 3 0 24 19 5
# Missing count per sample
head(getCountGenoMissing(gds, target = "sample"))
## [1] 10 41 18 6 15 17
# Reference allele count per marker
head(getCountAlleleRef(gds, target = "marker"))
## [1] 118 109 111 96 96 119
# Reference allele count per sample
head(getCountAlleleRef(gds, target = "sample"))
## [1] 379 231 178 308 235 275
# Alternative allele count per marker
head(getCountAlleleAlt(gds, target = "marker"))
## [1] 86 89 93 60 70 75
# Alternative allele count per sample
head(getCountAlleleAlt(gds, target = "sample"))
## [1] 85 171 270 164 219 175
# Missing allele count per marker
head(getCountAlleleMissing(gds, target = "marker"))
## [1] 0 6 0 48 38 10
# Missing allele count per sample
head(getCountAlleleMissing(gds, target = "sample"))
## [1] 20 82 36 12 30 34
# Reference read count per marker
head(getCountReadRef(gds, target = "marker"))
## [1] 696 1012 604 182 163 271
# Reference read count per sample
head(getCountReadRef(gds, target = "sample"))
## [1] 1227 489 600 1693 1097 859
# Alternative read count per marker
head(getCountReadAlt(gds, target = "marker"))
## [1] 654 1555 537 156 199 428
# Alternative read count per sample
head(getCountReadAlt(gds, target = "sample"))
## [1] 269 575 1516 613 1103 481
# Sum of reference and alternative read counts per marker
head(getCountRead(gds, target = "marker"))
## [1] 1350 2567 1141 338 362 699
# Sum of reference and alternative read counts per sample
head(getCountRead(gds, target = "sample"))
## [1] 1496 1064 2116 2306 2200 1340
# Mean of reference allele read count per marker
head(getMeanReadRef(gds, target = "marker"))
## [1] 4034.762 2966.011 4032.764 1213.087 1248.329 1702.554
# Mean of reference allele read count per sample
head(getMeanReadRef(gds, target = "sample"))
## [1] 3905.653 3481.716 3993.717 3546.723 2493.182 3237.600
# Mean of Alternative allele read count per marker
head(getMeanReadAlt(gds, target = "marker"))
## [1] 3221.486 2783.018 3501.793 1117.116 1469.465 1836.445
# Mean of Alternative allele read count per sample
head(getMeanReadAlt(gds, target = "sample"))
## [1] 3047.675 3651.443 3958.266 2507.814 2387.446 2804.338
# SD of reference allele read count per marker
head(getSDReadRef(gds, target = "marker"))
## [1] 2244.3271 4346.0563 2620.9744 830.4441 818.4920 1080.3064
# SD of reference allele read count per sample
head(getSDReadRef(gds, target = "sample"))
## [1] 3265.505 2489.965 3467.100 3115.169 2048.550 2443.671
# SD of Alternative allele read count per marker
head(getSDReadAlt(gds, target = "marker"))
## [1] 1814.5382 4932.9596 2109.4078 739.2224 912.1266 1480.2307
# SD of Alternative allele read count per sample
head(getSDReadAlt(gds, target = "sample"))
## [1] 2483.399 3011.280 3373.579 2137.899 1875.358 2070.380
# Minor allele frequency per marker
head(getMAF(gds, target = "marker"))
## [1] 0.4215686 0.4494949 0.4558824 0.3846154 0.4216867 0.3865979
# Minor allele frequency per sample
head(getMAF(gds, target = "sample"))
## [1] 0.1831897 0.4253731 0.3973214 0.3474576 0.4823789 0.3888889
# Minor allele count per marker
head(getMAC(gds, target = "marker"))
## [1] 86 89 93 60 70 75
# Minor allele count per sample
head(getMAC(gds, target = "sample"))
## [1] 85 171 178 164 219 175
You can get the proportion of each genotype call with prop = TRUE
.
head(getCountGenoRef(gds, target = "marker", prop = TRUE))
## [1] 0.3431373 0.3838384 0.3235294 0.5256410 0.5060241 0.4845361
head(getCountGenoHet(gds, target = "marker", prop = TRUE))
## [1] 0.4705882 0.3333333 0.4411765 0.1794872 0.1445783 0.2577320
head(getCountGenoAlt(gds, target = "marker", prop = TRUE))
## [1] 0.1862745 0.2828283 0.2352941 0.2948718 0.3493976 0.2577320
head(getCountGenoMissing(gds, target = "marker", prop = TRUE))
## [1] 0.00000000 0.02941176 0.00000000 0.23529412 0.18627451 0.04901961
The proportion of each allele counts.
head(getCountAlleleRef(gds, target = "marker", prop = TRUE))
## [1] 0.5784314 0.5505051 0.5441176 0.6153846 0.5783133 0.6134021
head(getCountAlleleAlt(gds, target = "marker", prop = TRUE))
## [1] 0.4215686 0.4494949 0.4558824 0.3846154 0.4216867 0.3865979
head(getCountAlleleMissing(gds, target = "marker", prop = TRUE))
## [1] 0.00000000 0.02941176 0.00000000 0.23529412 0.18627451 0.04901961
The proportion of each allele read counts.
head(getCountReadRef(gds, target = "marker", prop = TRUE))
## [1] 0.5155556 0.3942345 0.5293602 0.5384615 0.4502762 0.3876967
head(getCountReadAlt(gds, target = "marker", prop = TRUE))
## [1] 0.4844444 0.6057655 0.4706398 0.4615385 0.5497238 0.6123033
Based on the statistics we obtained, we can filter out less reliable markers and samples using setMarFilter()
and setSamFilter()
.
gds <- setMarFilter(gds, missing = 0.2, het = c(0.1, 0.9), maf = 0.05)
gds <- setSamFilter(gds, missing = 0.8, het = c(0.25, 0.75))
setCallFilter()
is another type of filtering which works on each genotype call. We can replace some genotype calls with missing. If you would like to filter out less reliable genotype calls supported by less than 5 reads, set the arguments as below.
gds <- setCallFilter(gds, dp_count = c(5, Inf))
If need to remove genotype calls supported by too many reads, which might be the results of mismapping from repetitive sequences, set as follows.
# Filtering genotype calls based on total read counts
gds <- setCallFilter(gds, dp_qtile = c(0, 0.99))
# Filtering genotype calls based on reference read counts
# and alternative read counts separately.
gds <- setCallFilter(gds, ref_qtile = c(0, 0.99),
alt_qtile = c(0, 0.99))
Usually reference reads and alternative reads show different data distributions. Thus, we can set the different thresholds for them via dp_qtile
, ref_qtile
, and alt_qtile
to filter out genotype calls based on quantiles of read counts per marker and per sample.
Here, the following codes filter out calls supported by less than 5 reads and then filter out markers having more than 20% of missing rate.
gds <- setCallFilter(gds, dp_count = c(5, Inf))
gds <- setMarFilter(gds, missing = 0.2)
In addition to those statistics based filtering functions, GBScleanR provides filtering function based on relative marker positions. Markers locating too close each other usually have redundant information, especially if those markers are closer each other than the read length, in which case the markers are supported by completely (or almost) the same set of reads. To select only one marker from those markers, we can sue thinMarker()
. This function selects one marker having the least missing rate from each stretch with the specified length. If some markers have the least missing rate, select the first marker in the stretch.
# Here we select only one marker from each 150 bp stretch.
gds <- thinMarker(gds, range = 150)
We can obtain the summary statistics using countGenotype()
, countRead()
, and calcReadStats()
for only the SNPs and samples retained after the filtering.
gds <- countGenotype(gds)
gds <- countRead(gds)
calcReadStats()
calculate the normalized read counts based on non filtered data but gets mean, sd, and quantiles from the normalized values of the retained markers of samples.
We can check which markers and samples are retained after the filtering using getValidSnp()
and getValidScan()
.
head(validMar(gds))
## [1] TRUE TRUE TRUE FALSE TRUE TRUE
head(validSam(gds))
## [1] TRUE TRUE TRUE TRUE TRUE TRUE
The class methods of GbsrGenotypeData
basically work with only the markers and samples having TRUE
in the returned values of getValidSnp()
and getValidScan()
. To use all markers and samples, please specify valid = FALSE
.
nmar(gds)
## [1] 206
nmar(gds, valid = FALSE)
## [1] 242
We can reset filtering as following.
# Reset the filter on markers
gds <- resetMarFilter(gds)
# Reset the filter on samples
gds <- resetSamFilter(gds)
# Reset the filter on calls
gds <- resetCallFilter(gds)
## <None> <None>
# Reset all filters
gds <- resetFilter(gds)
The error correction algorithm of GBScleanR bases on the HMM assuming observed allele read counts for each SNP marker along a chromosome as the outputs from a sequence of latent true genotypes. Our model supposes that a population of \(N^o \geq 1\) sampled offspring was originally derived form the crosses between \(N^f \geq 2\) founder individuals. The founders can be inbred lines having homozygotes at all markers and outbred lines in which markers show heterozygous genotype.
As the first step for genotype error correcion, we have to specify which samples are the founders of the population via setParents()
. In the case of genotype data in the biparental population, people usually filter out SNPs which are not monomorphic in each parental sample and not biallelic between parents. setParents()
automatically do this filtering, if you set mono = TRUE
and bi = TRUE
.
p1 <- grep("Founder1", getSamID(gds), value = TRUE)
p2 <- grep("Founder2", getSamID(gds), value = TRUE)
gds <- setParents(gds, parents = c(p1, p2), mono = TRUE, bi = TRUE)
The next step is to visualize statistical summaries of the data. Get genotype data summaries as mentioned in the previous section.
gds <- countGenotype(gds)
Then, get histograms.
histGBSR(gds, stats = "missing")
histGBSR(gds, stats = "het")
histGBSR(gds, stats = "raf")
As the histograms showed, the data contains a lot of missing genotype calls with unreasonable heterozygosity in a F2 population. Reference allele frequency shows a bias to reference allele. If you can say your population has no strong segregation distortion in any positions of the genome, you can filter out the markers having too high or too low reference allele frequency.
# filter out markers with reference allele frequency
# less than 5% or more than 95%.
gds <- setMarFilter(gds, maf = 0.05)
However, sometimes filtering based on allele frequency per marker removes all markers from regions truly showing segregation distortion. Although heterozygosity also can be a criterion to filter out markers, this will removes too many markers which even contains useful information for genotyping.
If we found poor quality samples in you dataset based on missing rate, heterozygosity, and reference allele frequency, we can omit those samples with setSamFilter()
.
# Filter out samples with more than 90% missing genotype calls,
# less than 5% heterozygosity, and less than 5% minor allele frequency.
gds <- setSamFilter(gds, missing = 0.9, het = 0.05, maf = 0.05)
Before filtering using setMarFilter()
and setSamFilter()
, we recomend that you conduct filtering on each genotype call based on read depth. The error correction via GBScleanR is robust against low coverage calls, while genotype calls messed up by mismapping might lead less reliable error correction. Therefore, filtering for extremely high coverage calls are necessary rather than that for low coverage ones.
# Filter out genotype calls supported by reads less than 2 reads.
gds <- setCallFilter(gds, dp_count = c(2, Inf))
# Filter out genotype calls supported by reads more than 100.
gds <- setCallFilter(gds, dp_count = c(0, 100))
# Filter out genotype calls based on quantile values
# of read counts at markers in each sample.
gds <- setCallFilter(gds, ref_qtile = c(0, 0.9), alt_qtile = c(0, 0.9))
Since missing genotype calls left in the data basically give no negative effect on genotype error correction. Therefore, you can leave any missing genotype calls. We can, however, remove markers based on missing genotype calls, if you need.
# Remove markers having more than 75% of missing genotype calls
gds <- setMarFilter(gds, missing = 0.2)
nmar(gds)
## [1] 207
To check statistical summaries of the filtered genotype data, we need to set node = "filt
. Otherwise, countGenotype()
used the raw genotype data.
gds <- countGenotype(gds, node = "filt")
histGBSR(gds, stats = "missing")
histGBSR(gds, stats = "het")
histGBSR(gds, stats = "raf")
We can still see the markers showing distortion in allele frequency, while the expected allele frequency is 0.5 in a F2 population. To investigate that those markers having distorted allele frequency were derived from truly distorted regions or just error prone markers, we must check if there are regions where the markers with distorted allele frequency are clustered.
plotGBSR(gds, stats = "raf")
No region seem to have severe distortion. Based on the histogram of reference allele frequency, we can roughly cut off the markers with frequency more than 0.75 or less than 0.25, in other words, less than 0.25 of minor allele frequency.
gds <- setMarFilter(gds, maf = 0.25)
nmar(gds)
## [1] 195
Let’s see the statistics again.
gds <- countGenotype(gds)
histGBSR(gds, stats = "missing")
histGBSR(gds, stats = "het")
histGBSR(gds, stats = "raf")
At the end of filtering, check marker density and genotype ratio per marker along chromosomes.
# Marker density
plotGBSR(gds, stats = "marker")
plotGBSR(gds, stats = "geno")
The coord
argument controls the number of rows and columns of the facets in the plot.
Before executing the function for true genotype estimation, we need to build a scheme object. Our sample data is of a biparental F2 population derived from inbred founders. Therefore, we should run initScheme()
and addScheme()
as following.
# As always the first step of breeding scheme would be "pairing" cross(es) of
# founders, never be "selfing" and a "sibling" cross,
# the argument `crosstype` in initScheme() was deprecated on the update on April 6, 2023.
# gds <- initScheme(gds, crosstype = "pairing", mating = matrix(1:2, 2))
gds <- initScheme(gds, mating = matrix(1:2, 2))
gds <- addScheme(gds, crosstype = "selfing")
## [,1]
## [1,] 3
## [2,] 3
The function initScheme()
initializes the scheme object with information about founders. You need to specify a matrix indicating combinations of mating
, in which each column shows a pair of parental samples. For example, if you have only two parents, the mating
matrix should be mating = matrix(1:2, nrow = 2, ncol = 1)
or equivalents. The indices used in the matrix should match with the IDs labeled to parental samples by setParents()
. To confirm the IDs for parental samples, run the following code.
getParents(gds)
## sampleID memberID indexes
## 1 Founder1 1 18
## 2 Founder2 2 54
The created GbsrScheme
object is set in the scheme
slot of the GbsrGenotypeData
object.
The function addScheme()
adds the information about the next breeding step of your population. In the case of our example data, the second step was selfing to produce F2 individuals from the F1 obtained via the first founder crossing.
If your population was derived from a 4-way or 8-way cross, you need to add more paring
steps. In the case of 8-way RILs developed by three pairing crosses followed by five selfing cycles, the scheme object should be built as following. First we need to initialize the scheme object with specifying the first mating scheme. The crosstype argument should be “pairing” and the mating argument should be given as a matrix in which each pairing combination of parents are shown in each column. The following case indicates the pairing of parent 1 and 2 as well as 3 and 4, 5 and 6, and 7 and 8.
# As always the first step of breeding scheme would be "pairing" cross(es) of
# founders, never be "selfing" and a "sibling" cross,
# the argument `crosstype` in initScheme() was deprecated on the update on April 6, 2023.
# gds <- initScheme(gds, crosstype = "pair", mating = cbind(c(1:2), c(3:4), c(5:6), c(7:8)))
gds <- initScheme(gds, mating = cbind(c(1:2), c(3:4), c(5:6), c(7:8)))
Now the progenies of the crosses above have member ID 9, 10, 11, and 12 for each combination of mating. You can check IDs with showScheme().
showScheme(gds)
Then, add the step to make 4-way crosses.
gds <- addScheme(gds, crosstype = "pair", mating = cbind(c(9:10), c(11:12)))
# Check IDs.
showScheme(gds)
Add the last generation of the paring step .
gds <- addScheme(gds, crosstype = "pair", mating = cbind(c(13:14)))
#' # Check IDs.
showScheme(gds)
Now we have the scheme information of a 8-way cross. The follwoing steps add the selfing cycles.
# Inbreeding by five times selfing.
gds <- addScheme(gds, crosstype = "self")
gds <- addScheme(gds, crosstype = "self")
gds <- addScheme(gds, crosstype = "self")
gds <- addScheme(gds, crosstype = "self")
gds <- addScheme(gds, crosstype = "self")
You can set crosstype = "sibling"
or crosstype = "random"
, if your population was developed through sibling crosses or random crosses, respectively.`
The update on April 4, 2023 introduced new function to GBScleanR. The genotype estimation algorithm in estGeno() supports populations that consist of samples belonging to different pedigree. For example, if you have a population of F1 samples that derived from three different crosses of four founders: Founder1 x Founder2, Founder1 x Founder3, Founder1 x Founder4. You can build a scheme info as following.
gds <- setParents(object = gds,
parents = c("Founder1", "Founder2", "Founder3", "Founder4"))
gds <- initScheme(object = gds,
mating = cbind(c(1, 2), c(1, 3), c(1,4)))
# The initScheme() function here automatically set 5, 6, and 7 as member ID to
# the progenies of the above maiting (pairing) combinations, respectively.
# Then you have to assign member IDs to your samples to indicate which sample
# belongs to which pedigree.
gds <- assignScheme(object = gds,
id = c(rep(5, 10), rep(6, 15), rep(7, 20)))
The assignScheme() assign member IDs id
to the samples in order.
Please confirm the order of the member IDs in id
and the order of the
sample IDs shown by getSamID(gds).
# Get sample ID
sample_id <- getSamID(object = gds)
# Initialize the id vector
id <- integer(nsam(gds))
# Assume your samples were named with prefixes that indicate which
# sample was derived from which combination of founders.
id[grepl("P1xP2", sample_id)] <- 5
id[grepl("P1xP3", sample_id)] <- 6
id[grepl("P1xP4", sample_id)] <- 7
gds <- assignScheme(object = gds, id = id)
The following codes suppose that you built the scheme object for the example data that is a biparental F2 population derived from a cross between inbred parents, not for the 8-way RILs explained above.
Now we can execute genotype estimation for error correction. GBScleanR estimates error pattern via iterative optimization of parameters for genotype estimation. We could not guess the best number of iterations, but our simulation tests showed iter = 4
usually saturates the improvement of estimation accuracy.
gds <- estGeno(gds, iter = 4)
##
Founder genotype probability calculation ...
Founder genotype probability calculation at marker#: 10
Founder genotype probability calculation at marker#: 20
Founder genotype probability calculation at marker#: 30
Founder genotype probability calculation at marker#: 40
Founder genotype probability calculation at marker#: 50
Founder genotype probability calculation at marker#: 60
Founder genotype probability calculation at marker#: 70
Founder genotype probability calculation at marker#: 80
Founder genotype probability calculation at marker#: 90
Founder genotype probability calculation at marker#: 100
Founder genotype probability calculation at marker#: 110
Founder genotype probability calculation at marker#: 120
Founder genotype probability calculation at marker#: 130
Founder genotype probability calculation at marker#: 140
Founder genotype probability calculation at marker#: 150
Founder genotype probability calculation at marker#: 160
Founder genotype probability calculation at marker#: 170
Founder genotype probability calculation at marker#: 180
Founder genotype probability calculation at marker#: 190
Backtracking best genotype sequences at marker#: 190
Backtracking best genotype sequences at marker#: 180
Backtracking best genotype sequences at marker#: 170
Backtracking best genotype sequences at marker#: 160
Backtracking best genotype sequences at marker#: 150
Backtracking best genotype sequences at marker#: 140
Backtracking best genotype sequences at marker#: 130
Backtracking best genotype sequences at marker#: 120
Backtracking best genotype sequences at marker#: 110
Backtracking best genotype sequences at marker#: 100
Backtracking best genotype sequences at marker#: 90
Backtracking best genotype sequences at marker#: 80
Backtracking best genotype sequences at marker#: 70
Backtracking best genotype sequences at marker#: 60
Backtracking best genotype sequences at marker#: 50
Backtracking best genotype sequences at marker#: 40
Backtracking best genotype sequences at marker#: 30
Backtracking best genotype sequences at marker#: 20
Backtracking best genotype sequences at marker#: 10
Backtracking best genotype sequences: Done!
Offspring genotype probability calculation ...
Founder genotype probability calculation ...
Founder genotype probability calculation at marker#: 10
Founder genotype probability calculation at marker#: 20
Founder genotype probability calculation at marker#: 30
Founder genotype probability calculation at marker#: 40
Founder genotype probability calculation at marker#: 50
Founder genotype probability calculation at marker#: 60
Founder genotype probability calculation at marker#: 70
Founder genotype probability calculation at marker#: 80
Founder genotype probability calculation at marker#: 90
Founder genotype probability calculation at marker#: 100
Founder genotype probability calculation at marker#: 110
Founder genotype probability calculation at marker#: 120
Founder genotype probability calculation at marker#: 130
Founder genotype probability calculation at marker#: 140
Founder genotype probability calculation at marker#: 150
Founder genotype probability calculation at marker#: 160
Founder genotype probability calculation at marker#: 170
Founder genotype probability calculation at marker#: 180
Founder genotype probability calculation at marker#: 190
Backtracking best genotype sequences at marker#: 190
Backtracking best genotype sequences at marker#: 180
Backtracking best genotype sequences at marker#: 170
Backtracking best genotype sequences at marker#: 160
Backtracking best genotype sequences at marker#: 150
Backtracking best genotype sequences at marker#: 140
Backtracking best genotype sequences at marker#: 130
Backtracking best genotype sequences at marker#: 120
Backtracking best genotype sequences at marker#: 110
Backtracking best genotype sequences at marker#: 100
Backtracking best genotype sequences at marker#: 90
Backtracking best genotype sequences at marker#: 80
Backtracking best genotype sequences at marker#: 70
Backtracking best genotype sequences at marker#: 60
Backtracking best genotype sequences at marker#: 50
Backtracking best genotype sequences at marker#: 40
Backtracking best genotype sequences at marker#: 30
Backtracking best genotype sequences at marker#: 20
Backtracking best genotype sequences at marker#: 10
Backtracking best genotype sequences: Done!
Offspring genotype probability calculation ...
Founder genotype probability calculation ...
Founder genotype probability calculation at marker#: 10
Founder genotype probability calculation at marker#: 20
Founder genotype probability calculation at marker#: 30
Founder genotype probability calculation at marker#: 40
Founder genotype probability calculation at marker#: 50
Founder genotype probability calculation at marker#: 60
Founder genotype probability calculation at marker#: 70
Founder genotype probability calculation at marker#: 80
Founder genotype probability calculation at marker#: 90
Founder genotype probability calculation at marker#: 100
Founder genotype probability calculation at marker#: 110
Founder genotype probability calculation at marker#: 120
Founder genotype probability calculation at marker#: 130
Founder genotype probability calculation at marker#: 140
Founder genotype probability calculation at marker#: 150
Founder genotype probability calculation at marker#: 160
Founder genotype probability calculation at marker#: 170
Founder genotype probability calculation at marker#: 180
Founder genotype probability calculation at marker#: 190
Backtracking best genotype sequences at marker#: 190
Backtracking best genotype sequences at marker#: 180
Backtracking best genotype sequences at marker#: 170
Backtracking best genotype sequences at marker#: 160
Backtracking best genotype sequences at marker#: 150
Backtracking best genotype sequences at marker#: 140
Backtracking best genotype sequences at marker#: 130
Backtracking best genotype sequences at marker#: 120
Backtracking best genotype sequences at marker#: 110
Backtracking best genotype sequences at marker#: 100
Backtracking best genotype sequences at marker#: 90
Backtracking best genotype sequences at marker#: 80
Backtracking best genotype sequences at marker#: 70
Backtracking best genotype sequences at marker#: 60
Backtracking best genotype sequences at marker#: 50
Backtracking best genotype sequences at marker#: 40
Backtracking best genotype sequences at marker#: 30
Backtracking best genotype sequences at marker#: 20
Backtracking best genotype sequences at marker#: 10
Backtracking best genotype sequences: Done!
Offspring genotype probability calculation ...
Founder genotype probability calculation ...
Founder genotype probability calculation at marker#: 10
Founder genotype probability calculation at marker#: 20
Founder genotype probability calculation at marker#: 30
Founder genotype probability calculation at marker#: 40
Founder genotype probability calculation at marker#: 50
Founder genotype probability calculation at marker#: 60
Founder genotype probability calculation at marker#: 70
Founder genotype probability calculation at marker#: 80
Founder genotype probability calculation at marker#: 90
Founder genotype probability calculation at marker#: 100
Founder genotype probability calculation at marker#: 110
Founder genotype probability calculation at marker#: 120
Founder genotype probability calculation at marker#: 130
Founder genotype probability calculation at marker#: 140
Founder genotype probability calculation at marker#: 150
Founder genotype probability calculation at marker#: 160
Founder genotype probability calculation at marker#: 170
Founder genotype probability calculation at marker#: 180
Founder genotype probability calculation at marker#: 190
Backtracking best genotype sequences at marker#: 190
Backtracking best genotype sequences at marker#: 180
Backtracking best genotype sequences at marker#: 170
Backtracking best genotype sequences at marker#: 160
Backtracking best genotype sequences at marker#: 150
Backtracking best genotype sequences at marker#: 140
Backtracking best genotype sequences at marker#: 130
Backtracking best genotype sequences at marker#: 120
Backtracking best genotype sequences at marker#: 110
Backtracking best genotype sequences at marker#: 100
Backtracking best genotype sequences at marker#: 90
Backtracking best genotype sequences at marker#: 80
Backtracking best genotype sequences at marker#: 70
Backtracking best genotype sequences at marker#: 60
Backtracking best genotype sequences at marker#: 50
Backtracking best genotype sequences at marker#: 40
Backtracking best genotype sequences at marker#: 30
Backtracking best genotype sequences at marker#: 20
Backtracking best genotype sequences at marker#: 10
Backtracking best genotype sequences: Done!
Offspring genotype probability calculation ...
Founder genotype probability calculation ...
Founder genotype probability calculation at marker#: 10
Founder genotype probability calculation at marker#: 20
Founder genotype probability calculation at marker#: 30
Founder genotype probability calculation at marker#: 40
Founder genotype probability calculation at marker#: 50
Founder genotype probability calculation at marker#: 60
Founder genotype probability calculation at marker#: 70
Founder genotype probability calculation at marker#: 80
Founder genotype probability calculation at marker#: 90
Founder genotype probability calculation at marker#: 100
Founder genotype probability calculation at marker#: 110
Founder genotype probability calculation at marker#: 120
Founder genotype probability calculation at marker#: 130
Founder genotype probability calculation at marker#: 140
Founder genotype probability calculation at marker#: 150
Founder genotype probability calculation at marker#: 160
Founder genotype probability calculation at marker#: 170
Founder genotype probability calculation at marker#: 180
Founder genotype probability calculation at marker#: 190
Backtracking best genotype sequences at marker#: 190
Backtracking best genotype sequences at marker#: 180
Backtracking best genotype sequences at marker#: 170
Backtracking best genotype sequences at marker#: 160
Backtracking best genotype sequences at marker#: 150
Backtracking best genotype sequences at marker#: 140
Backtracking best genotype sequences at marker#: 130
Backtracking best genotype sequences at marker#: 120
Backtracking best genotype sequences at marker#: 110
Backtracking best genotype sequences at marker#: 100
Backtracking best genotype sequences at marker#: 90
Backtracking best genotype sequences at marker#: 80
Backtracking best genotype sequences at marker#: 70
Backtracking best genotype sequences at marker#: 60
Backtracking best genotype sequences at marker#: 50
Backtracking best genotype sequences at marker#: 40
Backtracking best genotype sequences at marker#: 30
Backtracking best genotype sequences at marker#: 20
Backtracking best genotype sequences at marker#: 10
Backtracking best genotype sequences: Done!
Offspring genotype probability calculation ...
Founder genotype probability calculation ...
Founder genotype probability calculation at marker#: 10
Founder genotype probability calculation at marker#: 20
Founder genotype probability calculation at marker#: 30
Founder genotype probability calculation at marker#: 40
Founder genotype probability calculation at marker#: 50
Founder genotype probability calculation at marker#: 60
Founder genotype probability calculation at marker#: 70
Founder genotype probability calculation at marker#: 80
Founder genotype probability calculation at marker#: 90
Founder genotype probability calculation at marker#: 100
Founder genotype probability calculation at marker#: 110
Founder genotype probability calculation at marker#: 120
Founder genotype probability calculation at marker#: 130
Founder genotype probability calculation at marker#: 140
Founder genotype probability calculation at marker#: 150
Founder genotype probability calculation at marker#: 160
Founder genotype probability calculation at marker#: 170
Founder genotype probability calculation at marker#: 180
Founder genotype probability calculation at marker#: 190
Backtracking best genotype sequences at marker#: 190
Backtracking best genotype sequences at marker#: 180
Backtracking best genotype sequences at marker#: 170
Backtracking best genotype sequences at marker#: 160
Backtracking best genotype sequences at marker#: 150
Backtracking best genotype sequences at marker#: 140
Backtracking best genotype sequences at marker#: 130
Backtracking best genotype sequences at marker#: 120
Backtracking best genotype sequences at marker#: 110
Backtracking best genotype sequences at marker#: 100
Backtracking best genotype sequences at marker#: 90
Backtracking best genotype sequences at marker#: 80
Backtracking best genotype sequences at marker#: 70
Backtracking best genotype sequences at marker#: 60
Backtracking best genotype sequences at marker#: 50
Backtracking best genotype sequences at marker#: 40
Backtracking best genotype sequences at marker#: 30
Backtracking best genotype sequences at marker#: 20
Backtracking best genotype sequences at marker#: 10
Backtracking best genotype sequences: Done!
Offspring genotype probability calculation ...
Founder genotype probability calculation ...
Founder genotype probability calculation at marker#: 10
Founder genotype probability calculation at marker#: 20
Founder genotype probability calculation at marker#: 30
Founder genotype probability calculation at marker#: 40
Founder genotype probability calculation at marker#: 50
Founder genotype probability calculation at marker#: 60
Founder genotype probability calculation at marker#: 70
Founder genotype probability calculation at marker#: 80
Founder genotype probability calculation at marker#: 90
Founder genotype probability calculation at marker#: 100
Founder genotype probability calculation at marker#: 110
Founder genotype probability calculation at marker#: 120
Founder genotype probability calculation at marker#: 130
Founder genotype probability calculation at marker#: 140
Founder genotype probability calculation at marker#: 150
Founder genotype probability calculation at marker#: 160
Founder genotype probability calculation at marker#: 170
Founder genotype probability calculation at marker#: 180
Founder genotype probability calculation at marker#: 190
Backtracking best genotype sequences at marker#: 190
Backtracking best genotype sequences at marker#: 180
Backtracking best genotype sequences at marker#: 170
Backtracking best genotype sequences at marker#: 160
Backtracking best genotype sequences at marker#: 150
Backtracking best genotype sequences at marker#: 140
Backtracking best genotype sequences at marker#: 130
Backtracking best genotype sequences at marker#: 120
Backtracking best genotype sequences at marker#: 110
Backtracking best genotype sequences at marker#: 100
Backtracking best genotype sequences at marker#: 90
Backtracking best genotype sequences at marker#: 80
Backtracking best genotype sequences at marker#: 70
Backtracking best genotype sequences at marker#: 60
Backtracking best genotype sequences at marker#: 50
Backtracking best genotype sequences at marker#: 40
Backtracking best genotype sequences at marker#: 30
Backtracking best genotype sequences at marker#: 20
Backtracking best genotype sequences at marker#: 10
Backtracking best genotype sequences: Done!
Offspring genotype probability calculation ...
Founder genotype probability calculation ...
Founder genotype probability calculation at marker#: 10
Founder genotype probability calculation at marker#: 20
Founder genotype probability calculation at marker#: 30
Founder genotype probability calculation at marker#: 40
Founder genotype probability calculation at marker#: 50
Founder genotype probability calculation at marker#: 60
Founder genotype probability calculation at marker#: 70
Founder genotype probability calculation at marker#: 80
Founder genotype probability calculation at marker#: 90
Founder genotype probability calculation at marker#: 100
Founder genotype probability calculation at marker#: 110
Founder genotype probability calculation at marker#: 120
Founder genotype probability calculation at marker#: 130
Founder genotype probability calculation at marker#: 140
Founder genotype probability calculation at marker#: 150
Founder genotype probability calculation at marker#: 160
Founder genotype probability calculation at marker#: 170
Founder genotype probability calculation at marker#: 180
Founder genotype probability calculation at marker#: 190
Backtracking best genotype sequences at marker#: 190
Backtracking best genotype sequences at marker#: 180
Backtracking best genotype sequences at marker#: 170
Backtracking best genotype sequences at marker#: 160
Backtracking best genotype sequences at marker#: 150
Backtracking best genotype sequences at marker#: 140
Backtracking best genotype sequences at marker#: 130
Backtracking best genotype sequences at marker#: 120
Backtracking best genotype sequences at marker#: 110
Backtracking best genotype sequences at marker#: 100
Backtracking best genotype sequences at marker#: 90
Backtracking best genotype sequences at marker#: 80
Backtracking best genotype sequences at marker#: 70
Backtracking best genotype sequences at marker#: 60
Backtracking best genotype sequences at marker#: 50
Backtracking best genotype sequences at marker#: 40
Backtracking best genotype sequences at marker#: 30
Backtracking best genotype sequences at marker#: 20
Backtracking best genotype sequences at marker#: 10
Backtracking best genotype sequences: Done!
Offspring genotype probability calculation ...
If your population derived from outbred founders, please set het_parents = TRUE
.
gds <- estGeno(gds, het_parent = TRUE, iter = 4)
The larger number of iterations makes running time longer. If you would like to execute no optimization, set optim = FALSE
or iter = 1
.
# Following codes do the same.
gds <- estGeno(gds, iter = 1)
gds <- estGeno(gds, optim = FALSE)
All of the results of estimation are stored in the gds file linked to the GbsrGenotypeData
object. You can obtain the estimated genotype data via the getGenotype()
function with node = "cor"
.
est_geno <- getGenotype(gds, node = "cor")
GBScleanR also estimates phased founder genotypes and you can access it.
founder_geno <- getGenotype(gds, node = "parents")
GBScleanR first internally estimate phased haplotype and then convert them to genotype calls. If you need the estimated haplotype data, run getHaplotype()
.
est_hap <- getHaplotype(gds)
The function gbsrGDS2VCF()
generate a VCF file containing the estimated genotype data and phased haplotype information. The estimated haplotypes are indicated in the FORMAT field with the HAP tag. The founder genotypes correspond to each haplotype are indicated in the INFO field with the PGT tag. HAP shows the pair of haplotype for each marker of each sample, while PGT shows the allele of each haplotype.
out_fn <- tempfile("sample_est", fileext = ".vcf.gz")
gbsrGDS2VCF(gds, out_fn)
Please use reopenGDS()
to open the connection again if you need.
gds <- reopenGDS(gds)
To safely close the connection to the GDS file, use closeGDS()
.
closeGDS(gds)
sessionInfo()
## R version 4.2.3 (2023-03-15)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 20.04.6 LTS
##
## Matrix products: default
## BLAS: /home/biocbuild/bbs-3.16-bioc/R/lib/libRblas.so
## LAPACK: /home/biocbuild/bbs-3.16-bioc/R/lib/libRlapack.so
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_GB LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] GBScleanR_1.2.14 SeqArray_1.38.0 gdsfmt_1.34.1 BiocStyle_2.26.0
##
## loaded via a namespace (and not attached):
## [1] Rcpp_1.0.10 lattice_0.21-8 tidyr_1.3.0
## [4] Biostrings_2.66.0 digest_0.6.31 utf8_1.2.3
## [7] R6_2.5.1 GenomeInfoDb_1.34.9 stats4_4.2.3
## [10] evaluate_0.20 ggplot2_3.4.2 highr_0.10
## [13] pillar_1.9.0 zlibbioc_1.44.0 rlang_1.1.0
## [16] jquerylib_0.1.4 magick_2.7.4 S4Vectors_0.36.2
## [19] Matrix_1.5-4 rmarkdown_2.21 labeling_0.4.2
## [22] RCurl_1.98-1.12 munsell_0.5.0 compiler_4.2.3
## [25] xfun_0.38 pkgconfig_2.0.3 BiocGenerics_0.44.0
## [28] htmltools_0.5.5 tidyselect_1.2.0 tibble_3.2.1
## [31] GenomeInfoDbData_1.2.9 expm_0.999-7 bookdown_0.33
## [34] IRanges_2.32.0 fansi_1.0.4 crayon_1.5.2
## [37] dplyr_1.1.1 withr_2.5.0 bitops_1.0-7
## [40] grid_4.2.3 jsonlite_1.8.4 gtable_0.3.3
## [43] lifecycle_1.0.3 magrittr_2.0.3 scales_1.2.1
## [46] RcppParallel_5.1.7 cli_3.6.1 cachem_1.0.7
## [49] farver_2.1.1 XVector_0.38.0 bslib_0.4.2
## [52] vctrs_0.6.1 generics_0.1.3 tools_4.2.3
## [55] glue_1.6.2 purrr_1.0.1 parallel_4.2.3
## [58] fastmap_1.1.1 yaml_2.3.7 colorspace_2.1-0
## [61] BiocManager_1.30.20 GenomicRanges_1.50.2 knitr_1.42
## [64] sass_0.4.5
Elshire, Robert J., Jeffrey C. Glaubitz, Qi Sun, Jesse A. Poland, Ken Kawamoto, Edward S. Buckler, and Sharon E. Mitchell. 2011. “A Robust, Simple Genotyping-by-Sequencing (GBS) Approach for High Diversity Species.” PLoS ONE 6 (5): e19379. https://doi.org/10.1371/journal.pone.0019379.
Glaubitz, Jeffrey C., Terry M. Casstevens, Fei Lu, James Harriman, Robert J. Elshire, Qi Sun, and Edward S. Buckler. 2014. “TASSEL-GBS: A High Capacity Genotyping by Sequencing Analysis Pipeline.” PLoS ONE 9 (2): e90346. https://doi.org/10.1371/journal.pone.0090346.
Gogarten, Stephanie M., Tushar Bhangale, Matthew P. Conomos, Cecelia A. Laurie, Caitlin P. McHugh, Ian Painter, Xiuwen Zheng, et al. 2012. “GWASTools: an R/Bioconductor package for quality control and analysis of genome-wide association studies.” Bioinformatics 28 (24): 3329–31. https://doi.org/10.1093/BIOINFORMATICS/BTS610.
Gogarten, Stephanie M, Tamar Sofer, Han Chen, Chaoyu Yu, Jennifer A Brody, Timothy A Thornton, Kenneth M Rice, and Matthew P Conomos. 2019. “Genetic association testing using the GENESIS R/Bioconductor package.” Bioinformatics 35 (24): 5346–8. https://doi.org/10.1093/BIOINFORMATICS/BTZ567.
Li, Heng, and Jeffrey Barrett. 2011. “A statistical framework for SNP calling, mutation discovery, association mapping and population genetical parameter estimation from sequencing data” 27 (21): 2987–93. https://doi.org/10.1093/bioinformatics/btr509.
McKenna, Aaron, Matthew Hanna, Eric Banks, Andrey Sivachenko, Kristian Cibulskis, Andrew Kernytsky, Kiran Garimella, et al. 2010. “The genome analysis toolkit: A MapReduce framework for analyzing next-generation DNA sequencing data.” Genome Research 20 (9): 1297–1303. https://doi.org/10.1101/gr.107524.110.
Miller, Michael R., Joseph P. Dunham, Angel Amores, William A. Cresko, and Eric A. Johnson. 2007. “Rapid and cost-effective polymorphism identification and genotyping using restriction site associated DNA (RAD) markers.” Genome Research 17 (2): 240–48. https://doi.org/10.1101/gr.5681207.
Zheng, Xiuwen, David Levine, Jess Shen, Stephanie M. Gogarten, Cathy Laurie, and Bruce S. Weir. 2012. “A high-performance computing toolset for relatedness and principal component analysis of SNP data.” Bioinformatics 28 (24): 3326–8. https://doi.org/10.1093/BIOINFORMATICS/BTS606.