Updated standard reference Genome-in-a-Bottle (GIAB) samples HG001-HG007
Last update: February 04, 2024
Description:
Germline variant calling plays a crucial role in unlocking the mysteries encoded within our DNA and helps us understand the unique genetic variations passed down from generation to generation.
Analyzing the files in this folder reproduces the figures presented in our germline WGS application note, which shows the utility of the UG 100 for germline variant calling.
Methods:
The seven standard GIAB reference samples HG001-HG007 were sequenced on a UG 100™ sequencer with a read-length of ~290bp. HG001-HG004 were each assigned to one barcode. HG005, HG006, and HG007 were each sequenced across two replicates, each with a distinct barcode. The data were base-called with base-calling pipeline version APL5.1, quality filtered by read quality (rq≤1) and randomly down-sampled to ~40X (post de-duplication) from an original coverage of ~47X.
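As a rough illustration of the down-sampling step, the fraction of reads to keep can be estimated from the two coverage figures above. This is only a sketch: it assumes the ~47X and ~40X values are directly comparable, and the tool used for the actual down-sampling is not stated here.

```shell
# Approximate fraction of reads to keep when going from ~47X to ~40X.
# Both coverage values are the approximate figures quoted in the text.
orig_cov=47
target_cov=40

# awk does the floating-point division; the result is rounded to 3 decimals.
frac=$(awk -v t="$target_cov" -v o="$orig_cov" 'BEGIN { printf "%.3f", t / o }')
echo "keep fraction: $frac"
```

A fraction of roughly 0.85 means about 15% of the reads are discarded at random.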
Variant calling was performed with a UG-adapted version of DeepVariant, as described in the whitepaper “Adapting Google DeepVariant to Ultima Genomics Reads for Improved Variant Calling.”
Included data files:
The GIAB aligned data (CRAM) is available for download from AWS as shown below:
SAMPLE-BC |
---|
NA12878-Z0025 |
NA24143-Z0113 |
NA24149-Z0008 |
NA24385-Z0027 |
NA24631-Z0114 |
NA24631-Z0115 |
NA24694-Z0016 |
NA24694-Z0024 |
NA24695-Z0005 |
NA24695-Z0116 |
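The SAMPLE-BC labels above use NA-style sample identifiers rather than the HG001-HG007 names used elsewhere in this document. The sketch below maps one to the other. The NA-to-HG correspondence is the commonly used GIAB naming and is an assumption here, since it is not stated in this document; `giab_id` is a hypothetical helper name.

```shell
# Map a SAMPLE-BC label from the table (e.g. NA12878-Z0025) to a GIAB ID.
# Assumption: the NA-to-HG correspondence follows the usual GIAB naming.
giab_id() {
  local sample="${1%%-*}"   # drop the "-Zxxxx" barcode suffix
  case "$sample" in
    NA12878) echo "HG001" ;;
    NA24385) echo "HG002" ;;
    NA24149) echo "HG003" ;;
    NA24143) echo "HG004" ;;
    NA24631) echo "HG005" ;;
    NA24694) echo "HG006" ;;
    NA24695) echo "HG007" ;;
    *) echo "unknown"; return 1 ;;
  esac
}

# Print a two-column (label, GIAB ID) listing for a few labels.
for label in NA12878-Z0025 NA24631-Z0114 NA24631-Z0115; do
  printf '%s\t%s\n' "$label" "$(giab_id "$label")"
done
```

Under this mapping, the two labels that share a sample identifier (for example NA24631-Z0114 and NA24631-Z0115) are the two barcoded replicates of the same sample.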
- CRAM files for the 10 GIAB samples: s3://ultimagen-feb-2024-giab/Crams/
- GIAB DeepVariant variant calls (VCF) for the 10 samples: s3://ultimagen-feb-2024-giab/DeepVariant_vcfs/
- Cn.mops CNV calls in bed files for the 10 samples: s3://ultimagen-feb-2024-giab/cn.mops_beds/
- In s3://ultimagen-feb-2024-giab/UG-High-Confidence-Regions/
  - Full UG-LCR exclusion file (BED): ug_lcr.bed
  - Its definition: ug_hcr.md
  - UG-HCR file (BED): ug_hcr.bed, the complement of ug_lcr.bed
- The basic QC sequencing statistics of this dataset are in s3://ultimagen-feb-2024-giab/GIAB_WGS.QC_stats.tsv
- The results of the variant calling evaluations on ug_hcr are summarized in s3://ultimagen-feb-2024-giab/VC_stats.tsv
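The two summary tables listed above can be fetched on their own, without the rest of the bucket. The sketch below is one way to do that and then list the header columns of a TSV; the column names are not documented here, so the helper prints whatever the header contains. `fetch_stats` and `list_columns` are hypothetical helper names.

```shell
BUCKET="s3://ultimagen-feb-2024-giab"

# Download only the two summary tables (requires the AWS CLI and network access).
fetch_stats() {
  aws s3 cp "$BUCKET/GIAB_WGS.QC_stats.tsv" . --no-sign-request
  aws s3 cp "$BUCKET/VC_stats.tsv" . --no-sign-request
}

# Print "index<TAB>name" for each column in the header line of a TSV file.
list_columns() {
  awk -F'\t' 'NR == 1 { for (i = 1; i <= NF; i++) printf "%d\t%s\n", i, $i }' "$1"
}
```

A typical session would call `fetch_stats` once and then `list_columns VC_stats.tsv` to see which columns are available before filtering or plotting.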
CRAM and VCF download instructions:
The 10-sample GIAB (HG001-HG007) dataset is available in an S3 bucket on AWS (total size: 420.6 GB).
You can download the full dataset using the AWS CLI:
$ mkdir ultima-GIAB-Feb-2024
$ cd ultima-GIAB-Feb-2024
$ aws s3 cp s3://ultimagen-feb-2024-giab/ . --recursive --no-sign-request
The CRAM reference can be downloaded here.
VCF only download instructions:
You can download just the VCF files for this dataset using the AWS CLI:
$ mkdir ultima-GIAB-Feb-2024-vcf-only
$ cd ultima-GIAB-Feb-2024-vcf-only
$ aws s3 cp s3://ultimagen-feb-2024-giab/DeepVariant_vcfs/ . --recursive --no-sign-request
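To download the files for a single sample rather than all ten, the `--exclude` and `--include` filters of `aws s3 cp` can narrow a recursive copy. The sketch below assumes that object names contain the SAMPLE-BC label (for example NA12878-Z0025), which is not confirmed by this document. `download_sample` and the `AWS` override variable are hypothetical conveniences.

```shell
BUCKET="s3://ultimagen-feb-2024-giab"

# Copy only objects under PREFIX whose key contains LABEL into DEST.
# Assumption: file names in the bucket contain the SAMPLE-BC label.
# Setting AWS="echo aws" prints the command instead of running it.
download_sample() {
  local prefix="$1" label="$2" dest="${3:-.}"
  ${AWS:-aws} s3 cp "$BUCKET/$prefix/" "$dest" --recursive \
    --exclude "*" --include "*${label}*" --no-sign-request
}
```

For example, `download_sample DeepVariant_vcfs NA12878-Z0025 .` would fetch only the matching VCF files, and the same call with `Crams` as the prefix would fetch the matching CRAM files. Filters are applied in order, so the initial `--exclude "*"` drops everything and the later `--include` adds back the matching keys.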
How to reproduce these results:
The GitHub repository https://github.com/Ultimagen/healthomics-workflows contains our WDL pipelines for calling short variants and copy-number variants, along with instructions for running them in the AWS HealthOmics service.
The short variant calling pipeline is available in AWS HealthOmics as a Ready2Run workflow.
Previous versions:
Previous datasets were uploaded on June 07, 2022, March 21, 2023, and in November 2023.