Updated standard reference Genome-in-a-Bottle (GIAB) samples HG001-HG007


Last update: February 04, 2024


Description:

Germline variant calling plays a crucial role in unlocking the mysteries encoded within our DNA and helps us understand the unique genetic variations passed down from generation to generation.

Analyzing the files in this folder reproduces the figures presented in our germline WGS application note, which shows the utility of the UG 100 for germline variant calling.


Methods:

The seven standard GIAB reference samples HG001-HG007 were sequenced on a UG 100™ sequencer with a read length of ~290 bp. HG001-HG004 were each assigned one barcode. HG005, HG006, and HG007 were each sequenced in two replicates, each with a distinct barcode. The data were base-called with base-calling pipeline version APL5.1, filtered by read quality (rq≤1), and randomly down-sampled from an original coverage of ~47X to ~40X (post de-duplication).
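The down-sampling described above implies keeping roughly 40/47 of the reads. As a rough sketch of that arithmetic (the helper name `downsample_fraction` is illustrative, the coverages are the approximate values quoted above, and this page does not state which tool performed the down-sampling):

```shell
# Hypothetical helper: fraction of reads to keep when down-sampling
# from an original coverage ($2) to a target coverage ($1), both in X.
downsample_fraction() {
    awk -v target="$1" -v original="$2" \
        'BEGIN { printf "%.3f\n", target / original }'
}

# Approximate coverages from the Methods text: ~47X down to ~40X
downsample_fraction 40 47
```

Because the ~40X target is measured after de-duplication, the fraction actually used may differ somewhat from this estimate.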

Variant calling was performed with a UG-adapted version of DeepVariant, as described in the whitepaper “Adapting Google DeepVariant to Ultima Genomics Reads for Improved Variant Calling.”


Included data files:

The GIAB aligned data (CRAM) is available for download from AWS as shown below:

SAMPLE-BC
NA12878-Z0025
NA24143-Z0113
NA24149-Z0008
NA24385-Z0027
NA24631-Z0114
NA24631-Z0115
NA24694-Z0016
NA24694-Z0024
NA24695-Z0005
NA24695-Z0116
  • CRAM files for the 10 GIAB samples: s3://ultimagen-feb-2024-giab/Crams/
  • GIAB DeepVariant variant calls (VCF) for the 10 samples: s3://ultimagen-feb-2024-giab/DeepVariant_vcfs/
  • cn.mops CNV calls (BED) for the 10 samples: s3://ultimagen-feb-2024-giab/cn.mops_beds/
  • UG high-confidence region files in s3://ultimagen-feb-2024-giab/UG-High-Confidence-Regions/
    • Full UG-LCR exclusion file (BED): ug_lcr.bed
    • Definition of the regions: ug_hcr.md
    • UG-HCR file (BED), the complement of ug_lcr.bed: ug_hcr.bed
  • Basic sequencing QC statistics for this dataset: s3://ultimagen-feb-2024-giab/GIAB_WGS.QC_stats.tsv
  • Summary of the variant calling evaluations on ug_hcr: s3://ultimagen-feb-2024-giab/VC_stats.tsv
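
To fetch a single sample rather than a whole prefix, the AWS CLI `--exclude` and `--include` filters can be combined with a recursive copy. The sketch below assumes that object names under each prefix contain the SAMPLE-BC identifier from the list above; that naming is an assumption not confirmed by this page, and `aws s3 ls s3://ultimagen-feb-2024-giab/Crams/ --no-sign-request` shows the actual names. The function names are illustrative.

```shell
BUCKET="s3://ultimagen-feb-2024-giab"

# Succeed only if $1 is one of the 10 SAMPLE-BC identifiers listed above.
is_giab_sample() {
    case "$1" in
        NA12878-Z0025|NA24143-Z0113|NA24149-Z0008|NA24385-Z0027) return 0 ;;
        NA24631-Z0114|NA24631-Z0115|NA24694-Z0016|NA24694-Z0024) return 0 ;;
        NA24695-Z0005|NA24695-Z0116) return 0 ;;
        *) return 1 ;;
    esac
}

# Copy every object under a prefix whose key contains the given SAMPLE-BC.
# $1 = SAMPLE-BC, $2 = prefix (e.g. Crams or DeepVariant_vcfs), $3 = destination
# ASSUMPTION: object keys contain the SAMPLE-BC string.
download_giab_sample() {
    is_giab_sample "$1" || { echo "unknown sample: $1" >&2; return 1; }
    # Exclude everything first, then re-include only the matching keys.
    aws s3 cp "$BUCKET/$2/" "$3" --recursive --no-sign-request \
        --exclude "*" --include "*$1*"
}
```

For example, `download_giab_sample NA12878-Z0025 DeepVariant_vcfs ./vcfs` would copy only the objects whose names match that sample.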

CRAM and VCF download instructions:

The 10-file GIAB (HG001-HG007) dataset is available in an S3 bucket on AWS (total size: 420.6 GB).

You can download the full dataset using the AWS CLI:

$ mkdir ultima-GIAB-Feb-2024
$ cd ultima-GIAB-Feb-2024
$ aws s3 cp s3://ultimagen-feb-2024-giab/ . --recursive --no-sign-request
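
Because the full dataset is about 420.6 GB, it may be worth confirming free disk space before starting the recursive copy. A minimal sketch, assuming a POSIX `df`; the helper name `has_free_gb` is illustrative, and the check treats 1 GB as 1024 × 1024 KiB, which is slightly conservative:

```shell
# Hypothetical helper: succeed if directory $1 has at least $2 GB free.
has_free_gb() {
    # df -Pk prints one line per filesystem; column 4 is available 1024-byte blocks
    avail_kb=$(df -Pk "$1" | awk 'NR == 2 { print $4 }')
    need_kb=$(awk -v gb="$2" 'BEGIN { printf "%.0f", gb * 1024 * 1024 }')
    [ "$avail_kb" -ge "$need_kb" ]
}

# 421 rounds up the 420.6 GB total quoted above
if has_free_gb . 421; then
    echo "enough space for the full dataset"
else
    echo "less than 421 GB free in $(pwd)"
fi
```

The `aws s3 cp` command also accepts `--dryrun`, which lists the operations that would be performed without transferring any data.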

The CRAM reference can be downloaded here.


VCF only download instructions:

You can download just the VCF files for this dataset using the AWS CLI:

$ mkdir ultima-GIAB-Feb-2024-vcf-only
$ cd ultima-GIAB-Feb-2024-vcf-only
$ aws s3 cp s3://ultimagen-feb-2024-giab/DeepVariant_vcfs/ . --recursive --no-sign-request
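
Once the VCFs are downloaded, a quick sanity check is to count the variant records in each file. The sketch below assumes the files are gzip- or bgzip-compressed with a `.vcf.gz` extension and sit directly in the current directory (neither is confirmed by this page); it uses only standard tools, and `count_vcf_records` is an illustrative name.

```shell
# Hypothetical helper: count non-header lines in a (b)gzip-compressed VCF.
count_vcf_records() {
    gzip -cd "$1" | grep -vc '^#'
}

# Report the record count for every matching file in the current directory
for vcf in ./*.vcf.gz; do
    [ -e "$vcf" ] || continue   # skip when the glob matches nothing
    printf '%s\t%s\n' "$vcf" "$(count_vcf_records "$vcf")"
done
```

A file that fails to decompress or reports zero records is a hint that the transfer should be repeated.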

How to reproduce these results:

The GitHub repository https://github.com/Ultimagen/healthomics-workflows contains our WDL pipelines for calling short variants and copy-number variants, along with instructions for running them on the AWS HealthOmics service.

The short variant calling pipeline is available in AWS HealthOmics as a Ready2Run workflow.


Previous versions:

Previous datasets were uploaded on June 07, 2022, March 21, 2023, and November 2023.