Updated standard reference Genome-in-a-Bottle (GIAB) samples HG001-HG007
Last update: November 02, 2023
Description:
The reference data set was introduced in the paper “Cost-efficient whole genome-sequencing using novel mostly natural sequencing-by-synthesis chemistry and open fluidics platform,” which included variant calling by a UG-adapted version of GATK.
An additional variant calling analysis of the same dataset using a UG-adapted version of DeepVariant was described in the whitepaper “Adapting Google DeepVariant to Ultima Genomics Reads for Improved Variant Calling.”
Methods:
The seven standard GIAB reference samples HG001-HG007 were sequenced on a UG 100™ sequencer with a read length of ~290 bp. Each sample was assigned one barcode, except for HG001, which was sequenced in four replicates, each with a distinct barcode. The data were base-called with base-calling pipeline version APL5.0, filtered by read quality (rq≤1), and randomly down-sampled to ~35.6X (post de-duplication) from an original coverage of ~70X.
For variant-calling evaluation, we excluded homopolymer regions of length ≥11. Since low-complexity genomic regions tend to amplify inefficiently, we isolated the sequencing accuracy by further excluding selected low-complexity, tandem-repeat, and low-mappability regions while still maintaining 98.2-98.5% of the original GIAB high-confidence regions (HCRs). The exclusion BED files are included below.
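The retained fraction quoted above can be checked by comparing base-pair totals of BED files. The sketch below is a minimal illustration, not the procedure used to produce the published figures. It assumes tab-separated BED files with non-overlapping intervals, and it assumes you have already intersected a GIAB HCR BED with ug_hcr.bed using a tool of your choice. The function names and the example file names in the comments are hypothetical.

```shell
# Sum the lengths of all intervals in a BED file.
# BED coordinates are 0-based and half-open, so each length is end - start.
# Assumption: intervals do not overlap (merge them first if they might).
bed_total_bp() {
    awk -F'\t' '!/^(#|track|browser)/ && NF >= 3 { total += $3 - $2 }
                END { printf "%.0f\n", total }' "$1"
}

# Percentage of the bases in BED file $2 that remain in BED file $1.
# Assumption: $1 is a subset of $2, for example the intersection of a
# GIAB HCR BED with ug_hcr.bed (hypothetical file: hcr_in_ug_hcr.bed).
bed_percent_retained() {
    kept=$(bed_total_bp "$1")
    orig=$(bed_total_bp "$2")
    awk -v k="$kept" -v o="$orig" \
        'BEGIN { if (o == 0) exit 1; printf "%.1f\n", 100 * k / o }'
}

# Example call (hypothetical file names):
#   bed_percent_retained hcr_in_ug_hcr.bed giab_hcr.bed
```

Summing `end - start` only gives a correct total when the intervals are disjoint, which is why the subset file should be produced by an intersection step rather than by concatenating BED files.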
Included data files:
The GIAB aligned data (CRAM) is available for download from AWS as shown below.
RUN-SAMPLE-BC | GIAB |
---|---|
030945-NA12878-Z0113-CAGTTCATCTGTGAT | HG001 |
030945-NA12878-Z0008-CACATCCTGCATGTGAT | HG001 |
030945-NA12878-Z0025-CTCGAGATTGATGAT | HG001 |
030945-NA12878-Z0027-CACTGTCAGCCAGAT | HG001 |
030945-NA24385-Z0114-CAACATACATCAGAT | HG002 |
030945-NA24149-Z0115-CGGCTAGATGCAGAT | HG003 |
030945-NA24143-Z0016-CATCCTGTGCGCATGAT | HG004 |
030945-NA24631-Z0024-CTGAGCCTGTCAGAT | HG005 |
030945-NA24694-Z0116-CAGTTATGTGCTGAT | HG006 |
030945-NA24695-Z0005-CATGTATCCTCTGAT | HG007 |
- GIAB DeepVariant variant calls (VCF) for each of the 10 CRAM files
- Full UG-LCR exclusion file (BED): ug_lcr.bed, with its definition in ug_hcr.md
- UG-HCR file (BED): ug_hcr.bed, the complement of ug_lcr.bed
- Readme file with steps to reproduce the variant calling: howto-germline-calling-dv.md
- The basic QC sequencing statistics of this dataset are in WGS.QC_stats.tsv
- The results of the variant calling evaluations on ug_hcr are summarized in VC_stats.tsv
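Each RUN-SAMPLE-BC name in the table above embeds the sample identifier (for example NA12878), so the GIAB ID can be recovered from a name when labeling or grouping files. The following is a small sketch of that lookup; the function name `giab_id_from_name` is hypothetical, and the mapping is copied from the table.

```shell
# Map a RUN-SAMPLE-BC name to its GIAB ID, using the table in this document.
# Prints the GIAB ID, or returns 1 if the sample identifier is not recognized.
giab_id_from_name() {
    case "$1" in
        *-NA12878-*) echo HG001 ;;
        *-NA24385-*) echo HG002 ;;
        *-NA24149-*) echo HG003 ;;
        *-NA24143-*) echo HG004 ;;
        *-NA24631-*) echo HG005 ;;
        *-NA24694-*) echo HG006 ;;
        *-NA24695-*) echo HG007 ;;
        *) echo "unknown sample: $1" >&2; return 1 ;;
    esac
}

# Example call:
#   giab_id_from_name 030945-NA24631-Z0024-CTGAGCCTGTCAGAT
```

All four HG001 replicates map to the same GIAB ID, so keep the full RUN-SAMPLE-BC name wherever the replicates need to stay distinguishable.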
CRAM and VCF download instructions:
The 10-file GIAB (HG001-HG007) dataset is available in an S3 bucket on AWS (total size: 862.74 GB).
You can download the full dataset using the AWS CLI:
```shell
$ mkdir ultima-GIAB-Oct-2023
$ cd ultima-GIAB-Oct-2023
$ aws s3 cp s3://ultima-ashg-2023-reference-set/ . --recursive --no-sign-request
```
The CRAM reference can be downloaded here.
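Because the full dataset is large, it may be convenient to fetch a single sample. The sketch below wraps `aws s3 cp` with its `--exclude` and `--include` filters. It assumes that object keys in the bucket contain the RUN-SAMPLE-BC string, which this document does not state, so run it with `--dryrun` first to see which objects match. The function name `download_giab_sample` is hypothetical.

```shell
# Download only the objects whose key contains one RUN-SAMPLE-BC name.
# Assumption: object keys in the bucket contain the RUN-SAMPLE-BC string.
# Any arguments after DEST_DIR are passed through to "aws s3 cp".
download_giab_sample() {
    if [ "$#" -lt 2 ]; then
        echo "usage: download_giab_sample RUN-SAMPLE-BC DEST_DIR [aws s3 cp options]" >&2
        return 2
    fi
    sample="$1"
    dest="$2"
    shift 2
    # Exclude everything, then re-include keys that match the sample name.
    aws s3 cp s3://ultima-ashg-2023-reference-set/ "$dest" \
        --recursive --no-sign-request \
        --exclude "*" --include "*${sample}*" "$@"
}

# Example calls (preview first, then download):
#   download_giab_sample 030945-NA24385-Z0114-CAACATACATCAGAT ./HG002 --dryrun
#   download_giab_sample 030945-NA24385-Z0114-CAACATACATCAGAT ./HG002
```

The filters are order-sensitive: `--exclude "*"` must come before `--include`, since later filters take precedence over earlier ones.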
VCF only download instructions:
You can download just the VCF files for this dataset using the AWS CLI:
```shell
$ mkdir ultima-GIAB-Oct-2023-vcf-only
$ cd ultima-GIAB-Oct-2023-vcf-only
$ aws s3 cp s3://ultima-ashg-2023-reference-set/DeepVariant_vcfs/ . --recursive --no-sign-request
```
Previous versions:
Previous datasets were uploaded on June 07, 2022, and March 21, 2023.