Algorithm Introduction

Algorithm Overview

mobivision quantify can be used to analyze single-cell transcriptomic data from the MobiNova platform. The key analysis steps are shown in the figure below:

Barcode Correction

The MobiNova platform offers two types of single-cell transcriptomic sequencing data: 3’ transcriptome and 5’ transcriptome, both of which can be analyzed by mobivision quantify.

The read structure of a single-cell 3' RNA library is shown in the figure below:

The read structure of a single-cell 5' RNA library is shown in the figure below:

The read structures show that, for both the 3' and the 5' transcriptome, the 5' end of Read1 consists of the cell barcode (20 bp) and the UMI (10 bp). To determine whether the cell barcode in Read1 is correct, MobiVision compares it against the cell barcodes in the known whitelist. Currently, the MobiCube High-throughput Single Cell 3' Transcriptome v2.0 Kit provides nearly 3,000,000 cell barcodes. A sequencing read is retained if it meets either of the following conditions:

  • The cell barcode of Read1 exists in the whitelist;
  • The cell barcode of Read1 does not exist in the whitelist, but its minimum Hamming distance to a whitelist barcode is <= 2, in which case the Read1 barcode is corrected to that whitelist barcode.

For reads that pass this check, Read1 retains only the corrected cell barcode and the UMI sequence; Read2 is not processed at this step.
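
The correction rule above can be sketched in Python as follows. This is a minimal illustration, not MobiVision's implementation: the function names are hypothetical, the whitelist is scanned by brute force (impractical for millions of barcodes), and discarding a barcode that is equally close to several whitelist entries is an assumption, since the text does not say how such ties are handled.

```python
from typing import Iterable, Optional


def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between two equal-length strings."""
    if len(a) != len(b):
        raise ValueError("barcodes must have equal length")
    return sum(x != y for x, y in zip(a, b))


def correct_barcode(observed: str, whitelist: Iterable[str],
                    max_dist: int = 2) -> Optional[str]:
    """Return the whitelist barcode for `observed`, or None to discard the read."""
    whitelist = set(whitelist)
    if observed in whitelist:
        return observed  # exact match, nothing to correct

    best, best_dist, tie = None, max_dist + 1, False
    for candidate in whitelist:
        d = hamming(observed, candidate)
        if d > max_dist:
            continue  # too far away to be a correction target
        if d < best_dist:
            best, best_dist, tie = candidate, d, False
        elif d == best_dist:
            tie = True  # another whitelist barcode is equally close

    # Assumption: an ambiguous nearest neighbour is treated as uncorrectable.
    return None if tie else best
```

With a toy 4 bp whitelist {"AAAA", "CCCC"}, the barcode "AAAT" is corrected to "AAAA", while "AACC" (distance 2 from both entries) and "GGGG" (distance 4 from both) are discarded.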

Reads Trimming

After barcode correction, Read1 in the fastq data should in theory no longer contain adapter sequence, so it requires no special treatment. Read2 is processed as follows:

  • For Read2 of single-cell 3' transcriptome fastq data, there may be a 30bp TSO sequence ("AAGCAGTGGTATCAACGCAGAGGTACATGGG") at the 5' end of the read and a poly-A sequence at the 3' end. Both reduce the alignment rate of the library, so any TSO and poly-A sequences at the two ends of the insert fragment must be removed before alignment.
  • For Read2 of single-cell 5' transcriptome fastq data, there may be a poly-T sequence at the 5' end and a 13bp reverse-complemented TSO sequence ("CCCATATAAGAAA") at the 3' end, both of which also need to be removed before alignment.
  • Removing adapter, poly-A, and poly-T sequences can leave insert fragments that are too short, which increases the probability of incorrect alignment. Therefore, after trimming, reads whose insert fragment is shorter than 30bp are filtered out.
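
As a rough illustration of the 3' library case, the sketch below strips a leading TSO and a trailing poly-A run from a Read2 sequence and then applies the 30bp length filter. It is not the trimming code used by mobivision quantify: the function name is hypothetical, the TSO is matched exactly at the start of the read, and the minimum poly-A length of 10 is an assumed value that the text does not specify.

```python
import re
from typing import Optional

# TSO sequence as quoted in the text for 3' transcriptome libraries.
TSO_3P_LIBRARY = "AAGCAGTGGTATCAACGCAGAGGTACATGGG"
MIN_INSERT_LEN = 30  # reads with shorter inserts are filtered out

# Assumption: a run of at least 10 A's at the very end of the read is a poly-A tail.
_POLY_A_TAIL = re.compile(r"A{10,}$")


def trim_read2_3p_library(seq: str) -> Optional[str]:
    """Return the trimmed insert, or None if it is shorter than 30 bp."""
    # Assumption: only an exact TSO match at the 5' end is removed.
    if seq.startswith(TSO_3P_LIBRARY):
        seq = seq[len(TSO_3P_LIBRARY):]
    # Remove a poly-A tail at the 3' end, if present.
    seq = _POLY_A_TAIL.sub("", seq)
    return seq if len(seq) >= MIN_INSERT_LEN else None
```

The 5' library case would mirror this, removing a poly-T run at the 5' end and the reverse-complemented TSO at the 3' end. Real trimmers usually also tolerate mismatches and partial adapter matches, which this sketch omits.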

Reads Mapping

mobivision quantify uses STARsolo for read mapping. Aligned reads are annotated as shown in the figure below:

  • If more than 50% of the length of the sequenced fragment is aligned to an exon region, the fragment is considered an Exonic Read;
  • If greater than or equal to 50% of the length of the sequenced fragment is aligned to an intron region, the fragment is considered an Intronic Read;
  • If the sequenced fragment can be aligned to the genome but is neither an Exonic Read nor an Intronic Read, the fragment is considered an Intergenic Read;
  • If more than 50% of the length of the sequenced fragment is aligned to the reverse strand of an exon region, the fragment is considered an Antisense Read;
  • By default, MobiVision v2.0 includes introns in the transcriptome (--intron included), that is, a sequenced fragment must be 100% aligned to introns and/or exons; if the --intron excluded mode is selected, the sequenced fragment must be 100% aligned to exonic regions.
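
The thresholds in the list above can be expressed as a small decision function. This is only a sketch of the stated rules: the function and argument names are hypothetical, the inputs are assumed to be precomputed fractions of the read length, and the order in which the categories are tested (in particular checking Antisense before falling back to Intergenic) is an assumption, because the text does not give a precedence.

```python
def classify_read(exon_sense: float, intron_sense: float,
                  exon_antisense: float = 0.0) -> str:
    """Classify a genome-aligned read from the fraction of its length
    overlapping each feature type (values between 0.0 and 1.0)."""
    if exon_sense > 0.5:        # more than 50% on exons
        return "exonic"
    if intron_sense >= 0.5:     # at least 50% on introns
        return "intronic"
    if exon_antisense > 0.5:    # more than 50% on the reverse strand of exons
        return "antisense"
    return "intergenic"         # aligned, but none of the above
```

For example, a read split evenly between an exon and an intron (0.5 and 0.5) is classified as intronic under these thresholds, because the exonic rule requires strictly more than 50%.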

mobivision quantify records all sequenced fragments that are aligned to the genome. An alignment quality of MAPQ = 255 means that the fragment is mapped uniquely to the genome. Only sequenced fragments that are uniquely aligned to the transcriptome are used for downstream UMI counting.

UMI Counting

Before UMI counting, UMIs in the read alignment results that do not meet the requirements are removed or merged according to the following rules:

  • UMIs composed of a single repeated base are removed;
  • UMIs containing N are removed;
  • When reads with identical UMIs are aligned to the same gene, the UMI count is recorded as 1; when reads with identical UMIs are aligned to different genes, the gene supported by the most reads is retained, the alignments to the other genes are removed, and the UMI count is recorded as 1;
  • If two UMIs differ by only 1 base and both are aligned to the same gene, the two UMIs are considered identical: only one of them is retained, and the UMI count is recorded as 1.

After these filters are applied, the retained UMIs and cell barcode sequences are used to generate the raw cell-gene matrix.
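
The four rules can be combined into a short counting routine. The sketch below is an illustration only, with hypothetical names and several assumptions: reads arrive as (cell barcode, UMI, gene) tuples, a UMI whose reads are tied between two genes is dropped, and 1-base-different UMIs are merged greedily starting from the best supported UMI. None of these details are stated in the text.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, Tuple

Read = Tuple[str, str, str]  # (cell barcode, UMI, gene)


def is_valid_umi(umi: str) -> bool:
    """Reject UMIs containing N or made of a single repeated base."""
    return "N" not in umi and len(set(umi)) > 1


def differs_by_one(a: str, b: str) -> bool:
    return len(a) == len(b) and sum(x != y for x, y in zip(a, b)) == 1


def count_umis(reads: Iterable[Read]) -> Dict[Tuple[str, str], int]:
    """Return UMI counts keyed by (cell barcode, gene)."""
    # Reads per gene for every (barcode, UMI) pair with a valid UMI.
    support = defaultdict(Counter)
    for barcode, umi, gene in reads:
        if is_valid_umi(umi):
            support[(barcode, umi)][gene] += 1

    # Assign each (barcode, UMI) to the gene with the most reads.
    per_gene = defaultdict(Counter)  # (barcode, gene) -> {UMI: reads}
    for (barcode, umi), genes in support.items():
        ranked = genes.most_common(2)
        if len(ranked) == 2 and ranked[0][1] == ranked[1][1]:
            continue  # assumption: a tie between genes is unresolvable
        gene, n_reads = ranked[0]
        per_gene[(barcode, gene)][umi] = n_reads

    # Merge UMIs that differ by one base within the same cell and gene.
    counts = {}
    for key, umis in per_gene.items():
        kept = []
        for umi, _ in umis.most_common():  # best supported first
            if not any(differs_by_one(umi, k) for k in kept):
                kept.append(umi)
        counts[key] = len(kept)
    return counts
```

The returned dictionary corresponds to the non-zero entries of a raw cell-gene matrix.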

Cell Calling

mobivision quantify currently provides two cell calling algorithms: CR2.2 and EmptyDrops (the algorithm published by Lun et al. in Genome Biology in 2019). To specify the number of cells directly, use --cellnumber INT, which selects the INT barcodes with the highest UMI counts among all barcodes as valid cells.
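
The effect of fixing the number of cells can be illustrated with a simple ranking. This sketch only mirrors the behaviour described above; the function name is hypothetical, and breaking ties between equal UMI counts by barcode sequence is an assumption made to keep the result deterministic.

```python
from typing import Dict, List


def call_cells_top_n(umi_per_barcode: Dict[str, int], n: int) -> List[str]:
    """Return the n barcodes with the highest total UMI counts."""
    ranked = sorted(umi_per_barcode.items(),
                    key=lambda item: (-item[1], item[0]))  # count desc, then barcode
    return [barcode for barcode, _ in ranked[:n]]
```

For instance, with UMI totals {"A": 100, "B": 5, "C": 250, "D": 90} and n = 2, barcodes "C" and "A" are kept as cells.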

For samples that mix two species, such as human and mouse, mobivision quantify divides the cells into three types: human-derived cells, mouse-derived cells, and human-mouse mixed cells (multiplets). A cell barcode is assigned to a species only if no less than 90% of its UMIs come from that species. For example, when 80% of the UMIs in a cell barcode come from species 1 and the other 20% come from species 2, mobivision quantify determines that the cell is a multiplet.

Although mobivision quantify cannot directly determine the rate of doublets or multiplets in a library, the rate can be evaluated indirectly from the mixed-species multiplets. Theoretically, among the doublets or multiplets in a library, the case of species 1 + species 1 should account for 1/4, the case of species 2 + species 2 for 1/4, and the case of species 1 + species 2 for 1/2. Only the mixed-species case can be detected, so the total rate is about twice the observed rate. For example, if the multiplet rate observed in a two-species library is 5%, the doublet or multiplet rate of the library can be estimated at around 10%.
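
The species assignment and the multiplet estimate can be written out as a short sketch. The function names are hypothetical. The `frac_species1` parameter is an added generalisation: the 1/4, 1/4, 1/2 split in the text corresponds to an equal mix of the two species (`frac_species1 = 0.5`), and the estimate treats all multiplets as doublets.

```python
def assign_species(umis_species1: int, umis_species2: int,
                   purity: float = 0.9) -> str:
    """Label a cell barcode from its per-species UMI counts."""
    total = umis_species1 + umis_species2
    if total == 0:
        raise ValueError("barcode has no UMIs")
    if umis_species1 / total >= purity:   # no less than 90% from species 1
        return "species1"
    if umis_species2 / total >= purity:   # no less than 90% from species 2
        return "species2"
    return "multiplet"


def estimate_total_multiplet_rate(observed_mixed_rate: float,
                                  frac_species1: float = 0.5) -> float:
    """Scale the observed mixed-species rate up to all doublets.

    Assumption: cells pair at random, so a doublet is mixed-species with
    probability 2 * p * (1 - p), which equals 1/2 for an equal mix.
    """
    mixed_fraction = 2 * frac_species1 * (1 - frac_species1)
    return observed_mixed_rate / mixed_fraction
```

With the numbers from the text, `assign_species(80, 20)` gives "multiplet", and `estimate_total_multiplet_rate(0.05)` gives approximately 0.10.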

Quality Control Report

By default, after the filtered cell-gene matrix is generated, mobivision quantify summarizes the raw data and analysis results of the entire library and generates a quality control report. The report gives feedback on the library as a whole, helping users understand the quality of the raw data and the analysis results from a macro perspective; it does not filter any data. If necessary, extra quality control steps can be performed based on the report before starting downstream analysis.