The long-term goal of the Optimus workflow is to support any 3 prime single cell transcriptomics assay selected by the HCA project. Using the correct modularity, we hope to grow a generic pipeline that has specific modules to address differences in assays, while leveraging common code where steps of the assays are the same. We offer this as a community resource for community development and improvement. The first assay this workflow supports is the 10x v2 and v3 gene expression assay. Optimus has been validated for analyzing both human and mouse data sets.
The introduction of droplet-based technologies such as inDrop (Klein, et al., 2015) and Drop-seq (Macosko, et al., 2015) moved the throughput of a single-cell RNA sequencing experiment from hundreds to thousands of cells. Technology developed by 10x Genomics further increased throughput to hundreds of thousands of cells and has opened up the possibility of creating datasets for millions of cells. Common among many of the single cell transcriptomics high-throughput technologies is the use of:
The bead-specific barcodes and UMIs are encoded on sequencing primers that also contain polyT tracts to enable binding of the primers to polyA+ mRNA transcripts. After lysing cells, mRNA transcripts bind to the polyT tracts in the primer and transcripts are reverse transcribed to generate barcoded cDNA. Note that all cDNA molecules from a single cell have the same barcode, but they have different UMIs. Thus every transcript that is captured from an individual cell can be mapped to its cognate cell and also counted as a single transcript, correcting for PCR bias. cDNAs are pooled for amplification and construction of libraries to facilitate 3’ DNA sequencing.
|Assay Type||10x Single Cell Expression (v2 and v3)||10x Genomics|
|Overall Workflow||Quality control module and transcriptome quantification module||Code available from Github|
|Genomic Reference Sequence||GRCh38 human genome primary sequence and M21 (GRCm38.p6) mouse genome primary sequence||GENCODE Human and Mouse|
|Transcriptomic Reference Annotation||V27 GenCode human transcriptome and M21 mouse transcriptome||GENCODE Human and Mouse|
|Aligner||STAR||Dobin, et al.,2013|
|Transcript Quantification||Utilities for processing large-scale single cell datasets||Sctools|
|Data Input File Format||File format in which sequencing data is provided||FASTQ|
|Data Output File Format||File formats in which Optimus output is provided||BAM, Zarr version 2, Python numpy arrays (internal), Loom|
Overall, the workflow:
Special care is taken to avoid the removal of reads that are not aligned or that do not contain recognizable barcodes. This design (which differs from many pipelines currently available) allows use of the entire dataset by those who may want to use alternative filtering or leverage the data for methodological development associated with the data processing.
A general overview of the pipeline is shown below, followed by more detailed descriptions of the steps.
Each 10X v2 and v3 3’ sequencing experiment generates triplets of Fastq files:
Because the pipeline processing steps require a BAM file format, the first step of Optimus is to convert the R2 Fastq file, containing the alignable genomic information, to a BAM file. Next, the Attach10xBarcodes step appends the UMI and Cell Barcode sequences from R1 to the corresponding R2 sequence as tags, in order to properly label the genomic information for alignment.
Although the function of the cell barcodes is to identify unique cells, barcode errors can arise during sequencing (such as incorporation of the barcode into contaminating DNA or sequencing and PCR errors), making it difficult to distinguish unique cells from artifactual appearances of the barcode. Barcode errors are evaluated in the Attach10xBarcodes step mentioned above, which compares the sequences against a whitelist of known barcode sequences.
The output file contains the reads with correct barcodes, including barcodes that came within one edit distance (Levenshtein distance) of matching the whitelist of barcode sequences and were corrected by this tool. Correct barcodes are assigned a “CB” tag. Uncorrected barcodes (with more than one error) are preserved and given a “CR” (Cell barcode Raw) tag. Cell barcode quality scores are also preserved in the file under the “CY” tag.
The STAR alignment software (Dobin, et al., 2013 is used to map barcoded reads in the BAM file to the human genome primary assembly reference (see table above for version information). STAR (Spliced Transcripts Alignment to a Reference) is widely-used for RNA-seq alignment and identifies the best matching location(s) on the reference for each sequencing read.
The TagGeneExon tool then annotates each read with the type of sequence to which it aligns. These annotations include INTERGENIC, INTRONIC, and EXONIC, and are stored using the XF BAM tag. In cases where the gene corresponds to an intron or exon, the name of the gene that overlaps the alignment is associated with the read and stored using the GE BAM tag.
UMIs are designed to distinguish unique transcripts present in the cell at lysis from those arising from PCR amplification of these same transcripts. But, like cell barcodes, UMIs can also be incorrectly sequenced or amplified. Optimus uses the UMI-tools software package, which applies a network-based method to account for such errors (Smith, et al., 2017). Optimus uses the “directional” method.
A number of quality control tools are used to assess the quality of the data output each time this pipeline is run. For a list of the tools and information about each one please see our QC Metrics page.
In addition, the pipeline runs the EmptyDrops function from the dropletUtils R package to identify cell barcodes that correspond to empty droplets. Empty droplets are those that did not encapsulate a cell but instead acquired cell-free RNA from the solution in which the cells resided -- such as secreted RNA or RNA released when some cells lysed in solution (Lun, et al., 2018). This ambient RNA can serve as a substrate for reverse transcription, leading to a small number of background reads. Cell barcodes that are not believed to represent cells are identified in the metrics and raw information from dropletUtils is provided to the user.
The pipeline outputs a count matrix that contains, for each cell barcode and for each gene, the number of molecules that were observed. The script that generates this matrix evaluates every read. It discards any read that maps to more than one gene, and counts any remaining reads provided the triplet of cell barcode, molecule barcode, and gene name is unique, indicating the read originates from a single transcript present at the time of lysis of the cell represented by that respective barcode.
Outputs of the pipeline include: