
10x Genomics: 3' scRNA Data


Last Update: February 05, 2024

Introduction:

This page shows the analysis of 10x 3’ single-cell expression data for a PBMC library. The pipeline converts an Ultima CRAM file into two simulated paired-end fastq files. The process is described here for a 10x 3’ library but can easily be adapted to other, similar read structures.


ultima-10x-file-conversion
Ultima-compatible-libraries

Data and code location:

File | Size | Location
Input cram | 156.86 GB | s3://ultimagen-feb-2024-scrna/10x/035843-scRNA_3-Z0007-CGATTCATGCTCGAT_rq1.cram
Output fastq R1 | 15.78 GB | s3://ultimagen-feb-2024-scrna/10x/035843-scRNA_3-Z0007-CGATTCATGCTCGAT_rq1_S1_L001_R1_001.fastq.gz
Output fastq R2 | 50.37 GB | s3://ultimagen-feb-2024-scrna/10x/035843-scRNA_3-Z0007-CGATTCATGCTCGAT_rq1_S1_L001_R2_001.fastq.gz
Statistics csv | 4.47 KB | s3://ultimagen-feb-2024-scrna/10x/035843-scRNA_3-Z0007-CGATTCATGCTCGAT_rq1_combined_statistics.csv

WDL:

Path to wdl: https://github.com/Ultimagen/UltimaGenomicsApplications/tree/main/single_cell

Input template for wdl: [single_cell/Input_templates/single_cell_general_template.10x-atac.json]


Prerequisite Files and Skills:

  1. Cram file generated on Ultima Genomics UG 100™ tool
  2. Familiarity with the Linux command line
  3. Familiarity with the SAM/BAM/CRAM format

Software and Packages Used:

  • Capability to execute Workflow Description Language (WDL)
  • The main step of the pipeline involves running Trimmer
    • https://github.com/Ultimagen/UltimaGenomicsApplications/tree/main/trimmer

Objectives:

  • Produce simulated paired-end reads that can be processed using other software packages (e.g., STARsolo and Cellranger)
  • Generate statistics (CSV file) which contain metrics regarding trimmed sequences, aligned sequences, data quality, etc.

Running Analysis pipelines:

Overview of processing steps

The analysis pipeline processes the single-end reads to produce simulated paired-end reads that can be processed using other software packages (e.g., STARsolo and Cellranger). The following steps are run:

  • The relevant json for the pipeline is taken as input (e.g., 10x 5’ GEX, Parse Biosciences WT, Fluent pipseq v3; for this example, 10x 3’ GEX). This json includes a Trimmer input json, which describes the relevant read structure, as well as information on which sequence segments make up the barcode read. Optionally, downstream analyses can be configured to run once the paired reads are created. These include STARsolo, STAR, sorting of the aligned output, and FastQC.
  • Using Trimmer, adapters are trimmed, and cell barcode sequences are matched to a reference whitelist and trimmed. The sequence that remains untrimmed comprises the insert. The cell barcode read is stored as a tag within the trimmed cram and is built by combining the matched cell barcode, the UMI, and any expected linkers.
  • The trimmed cram is used to create the two fastq files. To this end, the trimmed cram is first converted to fastq format using Demux, with the barcode read sequence saved in the fastq header. Next, the barcode read portion is written as the barcode read (since this is a synthetic read, the quality is set to “I” for all barcode read bases), and the insert is retained as the insert read.
  • Optionally, STARsolo, STAR, sorting of the alignment output, and/or FastQC are run.
  • The Trimmer and other statistics (STARsolo/STAR, if run) are gathered into a single csv.
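
The constant quality character used for the synthetic barcode read can be decoded with a short shell sketch. It assumes the standard Phred+33 FASTQ encoding (an assumption; this page does not state the encoding), under which "I" corresponds to a quality score of 40. The function name phred33_score is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: decode one FASTQ quality character, assuming Phred+33.
phred33_score() {
    local ascii
    # printf with a leading single quote yields the character's ASCII code.
    ascii=$(printf '%d' "'$1")
    echo $(( ascii - 33 ))
}

phred33_score "I"   # prints 40
```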

Running locally

The steps for generating the simulated paired-end reads can be run using two Docker images:

1. Running Trimmer

  • The latest trimmer docker: us-central1-docker.pkg.dev/ganymede-331016/ultimagen/trimmer:master_b679323
  • The following are the inputs:
    • {trimmer_input_json} : The 3’ 10x GEX json input format is contained within the Trimmer docker. The path in the Trimmer docker is /trimmer/dev-formats/single_cell_trimmer_formats_10x_3p_v3_gex.json
    • {trimmer_input_format} : The format to use within {trimmer_input_json} . For 3’ 10x GEX, the input format used is "10x V3 3' 9bp UMI" , which takes the first 9bp of the 12bp UMI sequence.
    • {read_structure_for_barcode_read} : For 10x 3’ v3, this should be “br:Z:%1%2TTT”. This creates a br tag in the output cram, which is composed of the cell barcode (defined as token 1 in the Trimmer json), the UMI (defined as token 2 in the Trimmer json), and the sequence TTT , which effectively results in a 12bp UMI that has the last 3bp masked as T.
    • {barcode_whitelist_path} : The path with the cell barcode list files. The 10x 3’ v3 whitelist files can be copied from: gs://concordanz/single_cell/10x-3M-february-2018.csv .
    • {input_bam} : A cram or bam as input. If a cram is provided, it can be passed with the Trimmer input argument: --input. If a bam is provided, the input can be piped with samtools.
    • {n_threads} : The number of threads to use
  • File names for Trimmer outputs:
    • {trimmer_stats_csv} : Statistics regarding the trimming of each component in the read.
    • {trimmer_failure_code_csv} : Detailed statistics with a breakdown of the reasons each component failed Trimming.
    • {trimmed_ucram} : An unaligned, trimmed cram. The reads contain the reverse complement of the cDNA portion of the read, and the matched cell barcode and UMI (9bp + TTT) are stored within the br tag.
  • The command for running Trimmer:
samtools view -h {input_bam} -@ 32 | \
        /trimmer/trimmer \
        --description={trimmer_input_json} \
        --format={trimmer_input_format} \
        --statistics={trimmer_stats_csv} \
        --directory={barcode_whitelist_path} \
        --skip-unused-pattern-lists=true \
        --discard \
        --output-field {read_structure_for_barcode_read} \
        --failure-code-file {trimmer_failure_code_csv} \
        --progress \
        --nthreads={n_threads} \
        --cram true \
        --output {trimmed_ucram}
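
To make the "br:Z:%1%2TTT" layout concrete, the sketch below builds a barcode read by hand. It is only an illustration of the resulting string, not of how Trimmer works internally. The function name build_barcode_read and the example sequences are made up, and the 16bp cell barcode length is an assumption based on the 10x 3’ v3 chemistry.

```shell
#!/usr/bin/env bash
# Hypothetical helper mirroring the "%1%2TTT" layout:
# token 1 (matched cell barcode) + token 2 (first 9 bp of the UMI) + literal TTT.
build_barcode_read() {
    local cbc="$1" umi="$2"
    printf '%s%sTTT\n' "$cbc" "${umi:0:9}"
}

# Made-up 16 bp cell barcode and 12 bp UMI, for illustration only.
cbc="AAACCCAAGAAACACT"
umi="ACGTACGTACGT"
build_barcode_read "$cbc" "$umi"
# prints AAACCCAAGAAACACTACGTACGTATTT
```

The result is 28 bases long (16 + 9 + 3), which matches the R1 length expected by tools configured for a 16bp barcode and a 12bp UMI.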

2. Run demux to create a fastq file with the cell barcode and UMI in the fastq header. Next, save the reads in the fastq as R2 and the CBC+UMI as R1.

  • The latest sorter docker (which contains the demux software): us-central1-docker.pkg.dev/ganymede-331016/ultimagen/sorter:master_4ebb634
  • The following are the inputs:
    • {trimmed_ucram} : The trimmed ucram, created in the previous step
  • File names for demux outputs:
    • {output_path} : The output path for creating a fastq
    • {basename} : The prefix to use for the fastq output file
  • File names for the command which splits the fastq into two reads
    • {fastq_output_from_demux} : The output filename created by demux
    • {r1_fastq} : The r1 output filename (CBC + 9bp UMI + TTT, the quality is set to “I” for all bases, since the CBC has been matched to the whitelist)
    • {r2_fastq} : The r2 output filename (cDNA, revcom)
  • The command for running demux: the demux command itself is not given here; the command block at this position was a copy of the Trimmer command from step 1. In the wdl, the arguments passed to demux are set through demux_extra_args.
  • To create the paired-end reads, tee is used to pipe the demux-generated fastq into two commands: 1) one that writes r2 (identical to the input file, except that the CBC+UMI is removed from the header), and 2) one that writes r1 (which contains the CBC+UMI):
# Requires bash (process substitution) and pigz. Header fields are split on ":";
# field 15 holds the barcode read, which is dropped together with its ":".
zcat {fastq_output_from_demux} | \
tee >( awk -F: 'NR % 4 == 1 {print substr($0, 1, length($0) - length($15) - 1); next} {print}' | \
    pigz > {r2_fastq} ) | \
    awk -F: 'NR % 4 == 1 {
        print substr($0, 1, length($0) - length($15) - 1)
        print $15
        print "+"
        q = $15; gsub(/./, "I", q); print q
    }' | \
    pigz > {r1_fastq}
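
The header-splitting logic can be tried on a toy record without demux or pigz. The sketch below is a minimal illustration under stated assumptions: the header layout (15 colon-separated fields with placeholder values f2..f14) and the sequences are made up, and substr is called with an explicit start index of 1, which is intended to give the same result as the command above.

```shell
#!/usr/bin/env bash
# Toy record: the header has 15 colon-separated fields (placeholders f2..f14
# are made up) and the last field carries the barcode read.
workdir=$(mktemp -d)
barcode="AAACCCAAGAAACACTACGTACGTATTT"
printf '%s\n' "@read1:f2:f3:f4:f5:f6:f7:f8:f9:f10:f11:f12:f13:f14:${barcode}" \
    "GATTACAGATTACA" "+" "IIIIIIIIIIIIII" > "$workdir/demux.fastq"

# R2: same record, with the barcode field (and its ':') removed from the header.
awk -F: 'NR % 4 == 1 {print substr($0, 1, length($0) - length($15) - 1); next} {print}' \
    "$workdir/demux.fastq" > "$workdir/r2.fastq"

# R1: trimmed header, barcode as the sequence, constant "I" qualities.
awk -F: 'NR % 4 == 1 {
    print substr($0, 1, length($0) - length($15) - 1)
    print $15
    print "+"
    q = $15; gsub(/./, "I", q); print q
}' "$workdir/demux.fastq" > "$workdir/r1.fastq"

cat "$workdir/r1.fastq" "$workdir/r2.fastq"
```

Both output files keep the same trimmed header, so the records stay paired by name and by order.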

Running the wdl:

The wdl input json fields:

Field | Input | Comments
input_file | Input cram file | Needs to be edited
base_file_name | The base name to be used in output files | Needs to be edited
demux_extra_args | Arguments to demux | Set to add underscores for missing fields in the fastq header, and to take the barcode read sequence from the br tag in the trimmed cram
fastqc_limits | Input parameter to FastQC | FastQC is run on the insert.
barcode_fastq_file_suffix | The suffix to add to the file name of the barcode read. | The default addition is needed in order to run Cellranger
insert_fastq_file_suffix | The suffix to add to the file name of the insert read. | The default addition is needed in order to run Cellranger
Trimmer parameters: The parameters to pass to Trimmer. If using a different read structure, the trimmer parameters (including the formats_description) must be changed.

Field | Input | Comments
local_formats_description | The path, within the Trimmer software, to the Trimmer json |
formats_description | If there is not a local format, then a format can be passed with this parameter |
untrimmed_reads_action | Must be left as "discard" |
format | Which format to use (by label) with the Trimmer json |
extra_args | Additional arguments to pass to Trimmer | "--output-field br:Z:%1%2TTT" describes the structure of the barcode read. In this case, the barcode read is constructed by taking token 1 (%1; the matched cell barcode) + token 2 (%2; the first 9bp of the UMI sequence) + "TTT". This creates a sequence with the cell barcode and a UMI sequence with the last 3bp masked as T.
pattern_files | The cell barcode whitelist which is used by Trimmer |
downstream_analysis | Set as star_solo | If you do not want to run star_solo, set this as empty.
star_solo_params | Has params to pass to STARsolo, including a reference genome (zipped). | For explanations regarding the additional params, see the STARsolo documentation.
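
The two fields marked "Needs to be edited" can be filled in from the command line. The sketch below is only an illustration on a toy stand-in file: the real template's key names may carry a workflow prefix, so the bare keys used here are an assumption, and the bucket path and base name are made-up values.

```shell
#!/usr/bin/env bash
# Toy stand-in for the input template; the real template's key names may carry
# a workflow prefix, so the bare keys below are an assumption.
workdir=$(mktemp -d)
cat > "$workdir/template.json" <<'EOF'
{
  "input_file": "<path to input cram>",
  "base_file_name": "<base name>"
}
EOF

# Hypothetical values to fill in.
input_cram="s3://my-bucket/sample.cram"
base_name="sample"

# Replace the value of each key; '|' is the sed delimiter because paths contain '/'.
sed -E \
    -e "s|(\"input_file\"[[:space:]]*:[[:space:]]*)\"[^\"]*\"|\1\"${input_cram}\"|" \
    -e "s|(\"base_file_name\"[[:space:]]*:[[:space:]]*)\"[^\"]*\"|\1\"${base_name}\"|" \
    "$workdir/template.json" > "$workdir/inputs.json"

cat "$workdir/inputs.json"
```

A JSON-aware tool would be more robust than sed for anything beyond simple string values; sed is used here only to keep the sketch dependency-free.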
