Long-Read VCF Preprocessing User Guide

This vignette shows how to process long-read PacBio HiFi variant calls from a validated trio (HG002–HG003–HG004) and prepare them for UPDhmm analysis.

Data source

Ashkenazi trio (GIAB, NIST) – PacBio HiFi Revio, DeepVariant calls (GRCh38).

1. Download phased VCFs for each individual


# Proband (HG002)
wget ftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/data/AshkenazimTrio/analysis/PacBio_HiFi-Revio_20231031/pacbio-wgs-wdl_germline_20231031/HG002.GRCh38.deepvariant.phased.vcf.gz
wget ftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/data/AshkenazimTrio/analysis/PacBio_HiFi-Revio_20231031/pacbio-wgs-wdl_germline_20231031/HG002.GRCh38.deepvariant.phased.vcf.gz.tbi

# Father (HG003)
wget ftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/data/AshkenazimTrio/analysis/PacBio_HiFi-Revio_20231031/pacbio-wgs-wdl_germline_20231031/HG003.GRCh38.deepvariant.phased.vcf.gz
wget ftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/data/AshkenazimTrio/analysis/PacBio_HiFi-Revio_20231031/pacbio-wgs-wdl_germline_20231031/HG003.GRCh38.deepvariant.phased.vcf.gz.tbi

# Mother (HG004)
wget ftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/data/AshkenazimTrio/analysis/PacBio_HiFi-Revio_20231031/pacbio-wgs-wdl_germline_20231031/HG004.GRCh38.deepvariant.phased.vcf.gz
wget ftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/data/AshkenazimTrio/analysis/PacBio_HiFi-Revio_20231031/pacbio-wgs-wdl_germline_20231031/HG004.GRCh38.deepvariant.phased.vcf.gz.tbi
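
The six downloads above follow one pattern (three sample IDs, each with a VCF and a tabix index), so they can also be generated in a loop. The following is a minimal sketch, not part of the original workflow: the variable `BASE` and the helper `giab_urls` are hypothetical names introduced here for illustration, and the URLs are the same ones listed above.

```shell
#!/usr/bin/env bash
# Base directory of the GIAB PacBio HiFi Revio release used in this guide.
BASE="ftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/data/AshkenazimTrio/analysis/PacBio_HiFi-Revio_20231031/pacbio-wgs-wdl_germline_20231031"

# Hypothetical helper: print one URL per line for each sample's VCF and index.
giab_urls() {
  for sample in HG002 HG003 HG004; do
    for ext in vcf.gz vcf.gz.tbi; do
      printf '%s/%s.GRCh38.deepvariant.phased.%s\n' "$BASE" "$sample" "$ext"
    done
  done
}

# To download, pipe the list to wget (-c resumes partial transfers):
#   giab_urls | xargs -n 1 wget -c
```

Printing the list first makes it easy to inspect the URLs before starting several large transfers.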

2. Merge individual VCFs into a trio VCF

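# Merge the three single-sample VCFs into one multi-sample VCF.
# bcftools merge needs indexed inputs (the .tbi files downloaded in step 1).
bcftools merge \
  -O z \
  -o trio_HiFi_GRCh38_phased.vcf.gz \
  HG002.GRCh38.deepvariant.phased.vcf.gz \
  HG003.GRCh38.deepvariant.phased.vcf.gz \
  HG004.GRCh38.deepvariant.phased.vcf.gz

# Index the merged file
bcftools index trio_HiFi_GRCh38_phased.vcf.gz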
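
Before filtering, it can be worth confirming that the merged file carries the three sample columns that the R step later refers to by name. The sketch below reads the `#CHROM` header line with standard tools only. The helpers `vcf_samples` and `has_trio_samples` are hypothetical names introduced here, and the check assumes the sample columns are labelled HG002, HG003 and HG004, as the `vcfCheck()` call in step 4 implies.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the sample names of a (b)gzipped or plain VCF,
# one per line, taken from columns 10 onward of the #CHROM header line.
vcf_samples() {
  gzip -cdf "$1" | awk -F'\t' '/^#CHROM/ { for (i = 10; i <= NF; i++) print $i; exit }'
}

# Hypothetical helper: succeed only if all three trio members are present.
has_trio_samples() {
  for s in HG002 HG003 HG004; do
    if ! vcf_samples "$1" | grep -qx "$s"; then
      echo "missing sample: $s" >&2
      return 1
    fi
  done
}

# Example use:
#   has_trio_samples trio_HiFi_GRCh38_phased.vcf.gz && echo "trio complete"
```

If bcftools is on the path, `bcftools query -l trio_HiFi_GRCh38_phased.vcf.gz` lists the same sample names directly.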

3. Filter variants for UPDhmm input

The following filtering steps are applied:

  • keep only biallelic variants

  • remove sites where all trio members are reference (0/0 or 0|0)

  • remove sites where all trio members are missing (./. or .|.)

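# -m2 -M2 keeps sites with exactly two alleles (REF plus one ALT).
# -e excludes sites where all three genotypes are reference
#    or all three genotypes are missing.
bcftools view \
  -m2 -M2 \
  -e 'COUNT(GT="0/0" || GT="0|0")==3 || COUNT(GT="./." || GT=".|.")==3' \
  -O z \
  -o trio_HiFi_GRCh38_phased_biallelic_nonref_nomissing.vcf.gz \
  trio_HiFi_GRCh38_phased.vcf.gz

# Index the filtered file
bcftools index trio_HiFi_GRCh38_phased_biallelic_nonref_nomissing.vcf.gz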
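
As an independent sanity check on the three filtering rules, the filtered file can be scanned for any record that should have been removed. This is a minimal sketch using `gzip` and `awk` rather than bcftools. The helper `count_uninformative` is a hypothetical name introduced here, and it assumes GT is the first FORMAT subfield (as the VCF specification requires when GT is present) and treats a comma in the ALT column as a multiallelic site.

```shell
#!/usr/bin/env bash
# Hypothetical helper: count records that violate the filtering rules, i.e.
# multiallelic sites, sites where every sample is 0/0 or 0|0, and sites
# where every sample is ./. or .|.
count_uninformative() {
  gzip -cdf "$1" | awk -F'\t' '
    /^#/ { next }                       # skip header lines
    {
      ref = 0; miss = 0; n = NF - 9     # n = number of sample columns
      for (i = 10; i <= NF; i++) {
        split($i, f, ":")               # f[1] holds the GT subfield
        if (f[1] == "0/0" || f[1] == "0|0") ref++
        if (f[1] == "./." || f[1] == ".|.") miss++
      }
      if (ref == n || miss == n || $5 ~ /,/) bad++
    }
    END { print bad + 0 }'
}

# Example use (a correctly filtered file should report 0):
#   count_uninformative trio_HiFi_GRCh38_phased_biallelic_nonref_nomissing.vcf.gz
```

The scan reads the whole file, so on a whole-genome VCF it may take a few minutes.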

4. UPDhmm analysis in R

library(UPDhmm)
library(VariantAnnotation)

vcf <- readVcf(
  "trio_HiFi_GRCh38_phased_biallelic_nonref_nomissing.vcf.gz"
)

# Pass the VCF object first, then map each sample ID to its trio role
vcf_check <- vcfCheck(
  vcf,
  proband = "HG002",
  father  = "HG003",
  mother  = "HG004"
)

events <- calculateEvents(
  vcf_check,
  add_ratios = TRUE
)

Session Info

sessionInfo()
## R version 4.6.0 (2026-04-24)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.4 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so;  LAPACK version 3.12.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: Etc/UTC
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] BiocStyle_2.40.0
## 
## loaded via a namespace (and not attached):
##  [1] digest_0.6.39       R6_2.6.1            fastmap_1.2.0      
##  [4] xfun_0.57           maketools_1.3.2     cachem_1.1.0       
##  [7] knitr_1.51          htmltools_0.5.9     rmarkdown_2.31     
## [10] buildtools_1.0.0    lifecycle_1.0.5     cli_3.6.6          
## [13] sass_0.4.10         jquerylib_0.1.4     compiler_4.6.0     
## [16] sys_3.4.3           tools_4.6.0         evaluate_1.0.5     
## [19] bslib_0.10.0        yaml_2.3.12         BiocManager_1.30.27
## [22] jsonlite_2.0.0      rlang_1.2.0