
Updated standard reference Genome-in-a-Bottle (GIAB) samples HG001-HG007


Last update: November 02, 2023


Description:

The reference data set was introduced in the paper “Cost-efficient whole genome-sequencing using novel mostly natural sequencing-by-synthesis chemistry and open fluidics platform,” which included variant calling by a UG-adapted version of GATK.

An additional variant calling analysis of the same dataset using a UG-adapted version of DeepVariant was described in the whitepaper “Adapting Google DeepVariant to Ultima Genomics Reads for Improved Variant Calling.”


Methods:

The seven standard GIAB reference samples HG001-HG007 were sequenced on a UG 100™ sequencer with a read length of ~290 bp. Each sample was assigned one barcode, except for HG001, which was sequenced in four replicates, each with a distinct barcode. The data were base-called with base-calling pipeline version APL5.0, filtered by read quality (rq≤1), and randomly down-sampled from an original coverage of ~70X to ~35.6X (post de-duplication).
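As a rough illustration of the down-sampling step, the fraction of reads to keep can be estimated from the two coverage figures quoted above. This is only a sketch: both values are rounded, the ~35.6X target is measured after de-duplication, and the exact fraction and tool used for this dataset are not stated here. The variable names are illustrative.

```shell
# Rounded coverage figures from the Methods text (not exact values).
orig_cov=70       # approximate original coverage
target_cov=35.6   # approximate target coverage, post de-duplication

# Fraction of reads to keep, to three decimals. LC_ALL=C forces a "." decimal point.
frac=$(LC_ALL=C awk -v o="$orig_cov" -v t="$target_cov" 'BEGIN { printf "%.3f", t / o }')
echo "approximate keep fraction: $frac"
```

A fraction like this is what a random subsampler would take as input, for example the `-s` option of `samtools view`. Whether that tool was used here is an assumption, not something the source confirms.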


For variant-calling evaluation we excluded homopolymer regions of length ≥11. Since low-complexity genomic regions tend to amplify inefficiently, we isolated the sequencing accuracy by further excluding selected low-complexity, tandem-repeat, and low-mappability regions, while still retaining 98.2-98.5% of the original GIAB high-confidence regions (HCRs). The exclusion BED files are included below.
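The retained fraction of the GIAB HCRs can be checked by comparing the number of bases in two BED files: the original GIAB HCR BED and its intersection with the UG high-confidence regions. The sketch below uses tiny made-up BED files in place of the real ones. The file names, coordinates, and the `bed_bases` helper are hypothetical, and the intervals are assumed to be sorted and non-overlapping.

```shell
# Sum the bases covered by a BED file (BED intervals are 0-based, half-open,
# so each interval covers end - start bases).
bed_bases() {
  LC_ALL=C awk -F '\t' '{ n += $3 - $2 } END { printf "%d", n }' "$1"
}

tmp=$(mktemp -d)
# Toy stand-ins for the real files (hypothetical coordinates, illustration only).
printf 'chr1\t0\t1000\nchr1\t2000\t3000\n' > "$tmp/giab_hcr.bed"
printf 'chr1\t0\t990\nchr1\t2000\t2980\n'  > "$tmp/giab_hcr_in_ug_hcr.bed"

total=$(bed_bases "$tmp/giab_hcr.bed")
kept=$(bed_bases "$tmp/giab_hcr_in_ug_hcr.bed")
pct=$(LC_ALL=C awk -v k="$kept" -v t="$total" 'BEGIN { printf "%.1f", 100 * k / t }')
echo "retained: $kept / $total bases ($pct%)"
rm -rf "$tmp"
```

With the real data, the intersected file could be produced by a tool such as `bedtools intersect -a <GIAB HCR BED> -b ug_hcr.bed`, assuming bedtools is installed.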


Included data files:

The GIAB aligned data (CRAM) for the following 10 run-sample-barcode combinations is available for download from AWS; download instructions are given below.

RUN-SAMPLE-BC GIAB
030945-NA12878-Z0113-CAGTTCATCTGTGAT HG001
030945-NA12878-Z0008-CACATCCTGCATGTGAT HG001
030945-NA12878-Z0025-CTCGAGATTGATGAT HG001
030945-NA12878-Z0027-CACTGTCAGCCAGAT HG001
030945-NA24385-Z0114-CAACATACATCAGAT HG002
030945-NA24149-Z0115-CGGCTAGATGCAGAT HG003
030945-NA24143-Z0016-CATCCTGTGCGCATGAT HG004
030945-NA24631-Z0024-CTGAGCCTGTCAGAT HG005
030945-NA24694-Z0116-CAGTTATGTGCTGAT HG006
030945-NA24695-Z0005-CATGTATCCTCTGAT HG007

The dataset also includes:

  • GIAB DeepVariant variant calls (VCF) for the 10 files
  • Full UG-LCR exclusion file (BED) - [ug_lcr.bed], with its definition in ug_hcr.md
  • UG-HCR file (BED) - [ug_hcr.bed], the complement of ug_lcr.bed
  • Readme file with steps to reproduce the variant calling: howto-germline-calling-dv.md
  • Basic QC sequencing statistics for this dataset: WGS.QC_stats.tsv
  • Summary of the variant calling evaluations on ug_hcr: VC_stats.tsv
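
Each RUN-SAMPLE-BC identifier in the table above can be split into its run, sample, and barcode parts when scripting over the dataset. The sketch below assumes the first two hyphen-separated fields are the run and the sample ID and the remainder is the barcode. That reading is inferred from the column header and not otherwise documented here. The variable names are illustrative.

```shell
# Split a RUN-SAMPLE-BC identifier using POSIX parameter expansion.
id="030945-NA12878-Z0113-CAGTTCATCTGTGAT"

run=${id%%-*}        # text before the first hyphen
rest=${id#*-}        # everything after the first hyphen
sample=${rest%%-*}   # second field: the sample ID
barcode=${rest#*-}   # remainder: barcode ID and sequence

# Map the sample ID to its GIAB name, using the pairs listed in the table.
case "$sample" in
  NA12878) giab=HG001 ;;
  NA24385) giab=HG002 ;;
  NA24149) giab=HG003 ;;
  NA24143) giab=HG004 ;;
  NA24631) giab=HG005 ;;
  NA24694) giab=HG006 ;;
  NA24695) giab=HG007 ;;
  *)       giab=unknown ;;
esac

printf '%s\t%s\t%s\t%s\n' "$run" "$sample" "$barcode" "$giab"
```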

CRAM and VCF download instructions:

The 10-file GIAB (HG001-HG007) dataset is available in an S3 bucket on AWS (total size: 862.74 GB).

You can download the full dataset using the AWS CLI:


$ mkdir ultima-GIAB-Oct-2023
$ cd ultima-GIAB-Oct-2023
$ aws s3 cp s3://ultima-ashg-2023-reference-set/ . --recursive --no-sign-request

The CRAM reference can be downloaded here.
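
Because the full dataset is large, it may be worth listing the bucket contents before starting the download. The sketch below wraps a listing command in a function. The `list_bucket` name is hypothetical, and the call is left commented out so that nothing touches the network until you enable it. It assumes the AWS CLI is installed.

```shell
bucket="s3://ultima-ashg-2023-reference-set/"

# List every object with human-readable sizes, followed by a summary
# of the total object count and total size. Nothing is downloaded.
list_bucket() {
  aws s3 ls "$bucket" --recursive --human-readable --summarize --no-sign-request
}

# Uncomment to run; the last two lines of output are the summary.
# list_bucket | tail -n 2
```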


VCF-only download instructions:

You can download just the VCF files for this dataset using the AWS CLI:


$ mkdir ultima-GIAB-Oct-2023-vcf-only
$ cd ultima-GIAB-Oct-2023-vcf-only
$ aws s3 cp s3://ultima-ashg-2023-reference-set/DeepVariant_vcfs/ . --recursive --no-sign-request
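
To fetch the VCFs for a single sample instead of all of them, the same command can be combined with the AWS CLI `--exclude` and `--include` filters. The sketch below assumes that object names in the bucket contain the sample ID (for example NA12878). This is a guess based on the RUN-SAMPLE-BC naming and should be checked against a bucket listing. The `fetch_sample_vcfs` name is hypothetical, and the call is left commented out.

```shell
bucket="s3://ultima-ashg-2023-reference-set/DeepVariant_vcfs/"
sample="NA12878"   # HG001; assumption: object names contain the sample ID

fetch_sample_vcfs() {
  # --exclude "*" drops every object, then --include re-adds matching names.
  # Extra arguments (such as --dryrun) are passed through via "$@".
  aws s3 cp "$bucket" . --recursive --no-sign-request \
    --exclude "*" --include "*${sample}*" "$@"
}

# Preview what would be copied; drop --dryrun to download for real.
# fetch_sample_vcfs --dryrun
```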

Previous versions:

Previous datasets were uploaded on June 07, 2022, and March 21, 2023.
