Support Center
1. FASTQ data cannot be directly merged because MobiVision data and 10X data have different read structures and barcode whitelists.
2. However, cell-gene expression matrices (filtered-cell-gene-matrix) can be merged. It is recommended to use software like Seurat, Liger, Harmony, or Scanorama for batch effect correction.
There are two scenarios:
1. When using the --intron exclude parameter, a read is counted if it aligns to the exonic region of a gene (over 50% of the read length is mapped to the exonic region). Reads mapped to intronic or intergenic regions are not counted.
2. When using the --intron included parameter (the default), a read is counted if it aligns to the exonic or intronic regions of a gene (over 50% of the read length is mapped to these regions). Reads mapped to intergenic regions are not counted.
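The two scenarios can be sketched as a simple rule. This is a minimal illustration, not MobiVision's implementation: the function name and the idea of passing pre-computed exonic and intronic base counts are assumptions made for the example, and how the software measures overlap internally is not described here.

```python
def read_is_counted(read_length, exonic_bases, intronic_bases, intron_mode="included"):
    """Apply the >50% rule described above to a single aligned read.

    read_length    -- length of the read in bases
    exonic_bases   -- bases of the read overlapping exons of a gene
    intronic_bases -- bases of the read overlapping introns of the same gene
    intron_mode    -- "included" (the default) or "exclude"
    """
    if intron_mode == "included":
        countable = exonic_bases + intronic_bases
    elif intron_mode == "exclude":
        countable = exonic_bases
    else:
        raise ValueError("intron_mode must be 'included' or 'exclude'")
    # The read is counted only if more than half of it lies in countable regions.
    return countable > 0.5 * read_length


# A 100 bp read with 40 exonic and 30 intronic bases:
print(read_is_counted(100, 40, 30, "included"))  # True  (70 of 100 bases countable)
print(read_is_counted(100, 40, 30, "exclude"))   # False (only 40 of 100 bases countable)
```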
The runtime for 100GB of data depends on the server configuration and parameter settings. For example, using a Hygon C86 7285H 32-core Processor (2.5GHz):
1. For samples around 10GB, increasing the number of threads doesn't significantly reduce analysis time but does increase memory usage. For 10GB libraries, 2-8 threads are recommended.
2. For 100GB samples, raising the thread count up to 24 reduces analysis time without significantly increasing memory usage. Using more than 24 threads increases memory usage. For 100GB libraries, 16-24 threads are recommended.
3. The runtime and memory consumption depend on the library size and the number of threads used. For a library size of 300GB, at least 64GB of memory is recommended for analysis.
Sequencing saturation reflects the overall complexity and depth of sequencing across all fragments. It is calculated from the redundancy of sequencing fragments that carry valid barcodes and UMIs and align to unique regions of the genome. The formula is: Sequencing Saturation = 1 - non-duplicated_unique_mapped_reads / total_unique_mapped_reads. In BAM files produced by MobiVision Quantify, MAPQ=255 marks reads aligned to unique genome regions. total_unique_mapped_reads is the number of MAPQ=255 sequencing fragments that carry corrected UMIs and barcodes; non-duplicated_unique_mapped_reads is the number of distinct barcode and UMI combinations among those same fragments. The code is as follows:
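# Assumes the corrected barcode and UMI tags are columns 19 and 20 of each record; adjust the column numbers if your BAM differs.
samtools view -q 255 Aligned.bam | gawk '{if (NF>=20) {total_reads+=1; umi[$19 $20]++}} END {printf("%s %s\n", total_reads, length(umi))}'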
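The formula itself can be checked on a toy input. The sketch below assumes the uniquely mapped reads have already been reduced to (barcode, UMI) pairs; the function name and the example data are illustrative only.

```python
def sequencing_saturation(barcode_umi_pairs):
    """Return 1 - distinct (barcode, UMI) pairs / total uniquely mapped reads."""
    pairs = list(barcode_umi_pairs)
    total_unique_mapped_reads = len(pairs)
    if total_unique_mapped_reads == 0:
        raise ValueError("no uniquely mapped reads with a valid barcode and UMI")
    non_duplicated_unique_mapped_reads = len(set(pairs))
    return 1 - non_duplicated_unique_mapped_reads / total_unique_mapped_reads


# Four reads, one of which duplicates an existing barcode/UMI combination.
example_reads = [
    ("AAAC", "UMI1"),
    ("AAAC", "UMI1"),  # duplicate of the first read
    ("AAAC", "UMI2"),
    ("CCGT", "UMI1"),
]
print(sequencing_saturation(example_reads))  # 0.25
```

A value near 0 means most reads still reveal new molecules, while a value near 1 means additional sequencing mostly re-reads molecules already seen.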
The mobivision mkindex command constructs reference genomes. The -m parameter controls the size of the constructed reference: the larger the -m value, the larger the reference genome folder and the faster the analysis that uses it. The default -m value is 16. With default parameters, the reference genome folder for the human genome is approximately 19GB. The command to build the reference is:
mobivision mkindex -n GRCh38 -f GRCh38.primary_assembly.genome.fa -g gencode.v38.primary_assembly.annotation.gtf -r human-gencode-v1.0
It is recommended to use FASTA and GTF files from Gencode or Ensembl for reference genome construction. The GTF annotation file should include at least exon, gene, and transcript information.
MobiVision Quantify currently provides two cell filtering algorithms: CR2.2 and EmptyDrops (Lun et al., 2019, Genome Biology). Users can also specify the number of cells by using the --cellnumber INT parameter to select the top INT cells based on UMI count.
CR2.2 Algorithm: Barcodes are sorted by UMI count from high to low. If N is the expected number of cells (default is 3000), and m is the UMI count of the 99th percentile barcode among the top N barcodes, then all barcodes with UMI counts greater than m/10 are identified as cells.
EmptyDrops Algorithm: This algorithm further identifies cells with low RNA content, following a preliminary identification similar to CR2.2. It compares the RNA profile of these barcodes against a background model and identifies barcodes that significantly deviate from the background as cells.
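The CR2.2-style threshold can be sketched in a few lines. This is an illustration under stated assumptions, not MobiVision's code: the function name is hypothetical, and "the 99th percentile barcode" is read here as the barcode ranked roughly 1% from the top of the N highest-count barcodes. The EmptyDrops step is not covered.

```python
import math


def cr22_cell_barcodes(umi_counts, expected_cells=3000):
    """Return barcodes whose UMI count exceeds m/10, where m is the UMI count
    of the barcode about 1% from the top of the `expected_cells` best barcodes.

    umi_counts -- dict mapping barcode -> total UMI count
    """
    ranked = sorted(umi_counts.items(), key=lambda item: item[1], reverse=True)
    top_n = ranked[:expected_cells]
    if not top_n:
        return []
    # Rank of the 99th percentile barcode within the top N (1-based), as an index.
    index = max(0, math.ceil(0.01 * len(top_n)) - 1)
    m = top_n[index][1]
    threshold = m / 10
    return [barcode for barcode, count in ranked if count > threshold]


counts = {"bc_a": 1000, "bc_b": 150, "bc_c": 100, "bc_d": 5}
print(cr22_cell_barcodes(counts, expected_cells=3))  # ['bc_a', 'bc_b']
```

In the example, m is 1000, so the threshold is 100; bc_c sits exactly on the threshold and is not called as a cell because the rule requires a count greater than m/10.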
The primary goal of V(D)J analysis is to extract the V(D)J gene sequences and clonotypes of B cells or T cells from raw sequencing data. This process is typically adaptable to different sequencing platforms and data formats. Therefore, the V(D)J analysis pipeline does support FASTQ files from multiple sequencing platforms.
For example, at the data analysis level, V(D)J analysis software like IgBlast can handle FASTQ files from various sequencing platforms, including Illumina, BGI, and Ion Torrent. Similarly, MobiVision analysis software can process FASTQ files from different sequencing platforms. However, due to differences in read length and quality characteristics across platforms, these factors may need to be considered during V(D)J analysis.
The V(D)J analysis pipeline can generally support single-end reads, including cases where only one end contains V(D)J gene information. However, this depends on the V(D)J analysis software used and the specific experimental design.
For single-end reads, V(D)J analysis software typically performs additional preprocessing and filtering to improve the accuracy of V(D)J rearrangement and clonotype identification. MobiVision can process both single-end and paired-end FASTQ files, specifying which read contains the V(D)J gene information. It can also identify which barcodes the reads belong to and determine the heavy and light chains of the V(D)J genes for effective V(D)J analysis.
It's important to note that for single-end reads containing only V(D)J gene information, the absence of other sequence information, such as UMI, might affect the accuracy and reliability of single-cell V(D)J analysis. Therefore, it's advisable to choose an appropriate sequencing strategy to ensure sufficient sequence information is available for V(D)J analysis.
Constructing a reference genome sequence file for particularly uncommon species can be challenging due to the lack of available reference genomes or genome annotation data. Here are some potentially useful methods:
1. Based on Known Related Species: You can use the genome sequence of a closely related species, utilizing genome sequence alignment and assembly techniques to construct the reference genome sequence for the target species. This method requires sufficient similarity in genome sequences and adequate alignment depth.
2. RNA-Seq Data Assembly: If RNA-Seq data for the target species is available, it can be used to assemble transcriptome sequences, which can then be used with alignment tools like BLAST and STAR for genome assembly and annotation. This method is suitable when a full genome sequence is not required.
3. Subgenome Annotation: If no genome data is available, subgenome annotation methods can be considered. This involves using alignment tools like BLAST and HMMER to map the genome annotation information from known species to the target species' genome, inferring the composition and structure of the target species' genome.
4. Third-Generation Sequencing-Assisted Assembly: Consider using third-generation sequencing technologies, such as Oxford Nanopore and PacBio SMRT, which produce long reads that aid in better genome assembly and annotation.
For species with available genome FA files and genome annotation GTF or GFF files, the mk_vdj_ref command can be used for construction. For specific construction methods, please refer to the corresponding page link.
Before analyzing raw FASTQ files, proper naming is essential. While different labs and analysis pipelines may have varying naming conventions, the following basic requirements should be met:
● The filename should clearly reflect the sample source, including sample ID, tissue origin, and treatment method, separated by characters such as underscores or hyphens (e.g., Sample1_Blood_RNAseq_R1.fastq(.gz) and Sample1_Blood_RNAseq_R2.fastq(.gz)).
● The filename should include key experimental information, such as sequencing type and platform, separated by characters like underscores or hyphens (e.g., Sample1_Blood_RNAseq_Illumina_PE.fastq(.gz)).
● The filename should uniquely identify each FASTQ file to avoid duplication or overwriting existing data. Unique filenames or time-stamped names are recommended (e.g., Sample1_Blood_RNAseq_Illumina_PE_20220331.fastq(.gz)).
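One way to keep names consistent is to build them from fields instead of typing them by hand. The helper below is a hypothetical sketch of the general conventions listed above; the function name, field order, and allowed characters are assumptions, and it does not encode the naming format any particular tool requires.

```python
import re


def build_fastq_name(sample_id, tissue, assay, platform, read, date=None, gz=True):
    """Join descriptive fields with underscores into a FASTQ filename."""
    fields = [sample_id, tissue, assay, platform]
    if date is not None:
        fields.append(date)  # e.g. "20220331", to keep names unique
    fields.append(read)      # e.g. "R1" or "R2"
    for field in fields:
        # Underscore is the separator, so it is not allowed inside a field.
        if not re.fullmatch(r"[A-Za-z0-9-]+", field):
            raise ValueError(f"invalid field for a filename: {field!r}")
    return "_".join(fields) + (".fastq.gz" if gz else ".fastq")


print(build_fastq_name("Sample1", "Blood", "RNAseq", "Illumina", "R1", date="20220331"))
# Sample1_Blood_RNAseq_Illumina_20220331_R1.fastq.gz
```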
Naming conventions should be consistent and follow general naming practices to facilitate data management and sharing. Some analysis software and tools may require specific file naming formats, so it's important to review the relevant software documentation before analysis to determine exact file naming requirements. For naming conventions specific to MobiVision single-cell V(D)J input files, please refer to question 8.
The appropriate sequencing depth for single-cell V(D)J sequencing depends on several factors, including sample complexity and experimental design.
Generally, the goal of single-cell V(D)J sequencing is to obtain comprehensive clonotype information, requiring sufficient sequencing depth to support high-quality rearrangement and clonotype identification. As a rule of thumb, each single cell should ideally have at least 4,000 reads to ensure high-quality V(D)J analysis results.
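As a rough planning aid, the rule of thumb translates into a minimum read count per library. The function name is illustrative, and the 4,000 reads per cell figure is the only value taken from the text above.

```python
def minimum_vdj_reads(expected_cells, reads_per_cell=4000):
    """Minimum total reads implied by a per-cell read target."""
    if expected_cells < 0 or reads_per_cell < 0:
        raise ValueError("cell and read counts must be non-negative")
    return expected_cells * reads_per_cell


print(minimum_vdj_reads(5000))  # 20000000
```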
It's important to consider the specific experimental design and research question when choosing the appropriate sequencing depth. Some studies may require deeper sequencing, while others may need less. Therefore, the sequencing depth should be chosen based on the actual needs of the study.
Fraction Reads in Cells is a key metric in single-cell sequencing data analysis used to evaluate the quality of sequencing data and the efficiency of single-cell capture. It represents the proportion of reads that can be assigned to individual cells out of all sequencing data. Typically, a higher Fraction Reads in Cells indicates better single-cell sequencing performance and a higher probability of capturing individual cells in the sample.
If the Fraction Reads in Cells is relatively low, it may indicate the following:
● Low Single-Cell Capture Efficiency: This could be due to experimental operation or sequencing technology issues, requiring further optimization of experimental conditions and sequencing parameters.
● Poor Sample Quality: This might result from RNA degradation or cytoplasmic rupture in the sample, affecting the quality of single-cell sequencing.
● Poor Data Quality: Low-quality reads or low coverage might prevent reads from being assigned to individual cells, necessitating stricter data quality control and filtering.
It's important to note that the ideal value for Fraction Reads in Cells depends on experimental design and sequencing technology, and there is no fixed threshold. The quality and efficiency of single-cell capture should be assessed comprehensively by considering other metrics and analysis results.
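The metric can be illustrated with a small calculation. This sketch assumes per-barcode read counts and a set of barcodes already called as cells; the names are hypothetical, and the exact numerator and denominator MobiVision uses may differ from this simplification.

```python
def fraction_reads_in_cells(reads_per_barcode, cell_barcodes):
    """Share of reads that belong to barcodes identified as cells.

    reads_per_barcode -- dict mapping barcode -> number of reads
    cell_barcodes     -- set of barcodes called as cells
    """
    total_reads = sum(reads_per_barcode.values())
    if total_reads == 0:
        raise ValueError("no reads to evaluate")
    reads_in_cells = sum(
        count for barcode, count in reads_per_barcode.items() if barcode in cell_barcodes
    )
    return reads_in_cells / total_reads


reads = {"bc_a": 600, "bc_b": 300, "bc_c": 100}
print(fraction_reads_in_cells(reads, {"bc_a", "bc_b"}))  # 0.9
```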
Paired Clonotype Diversity is a metric used to evaluate clonotype diversity in single-cell V(D)J sequencing data. It calculates the number of clonotypes in the same cell based on paired heavy and light chain V(D)J rearrangement information, clustering different clonotypes across cells, and determining the average number of clonotypes in each cluster. This metric is typically used to describe the clonotype diversity within individual cells.
The specific calculation process for Paired Clonotype Diversity is as follows:
1. Concatenate the heavy and light chain V(D)J sequences in each cell to obtain a full-length V(D)J sequence.
2. Use V(D)J analysis software to analyze the full-length V(D)J sequences and obtain clonotype information for the same cell.
3. Cluster the clonotypes within the same cell and calculate the number of clonotypes in different clusters.
4. Calculate the inverse Simpson index of the number of cells per clonotype to obtain the Paired Clonotype Diversity metric.
A higher Paired Clonotype Diversity value indicates greater clonotype diversity within single cells, meaning that more distinct clonotypes are present in individual cells. This metric can be used to compare the effects of different cell types, experimental conditions, and treatments on clonotype diversity within single cells. It's worth noting that Paired Clonotype Diversity reflects clonotype diversity within the sample. The Paired Clonotype Diversity value is included in the VDJ annotation section of MobiVision's websummary.html quality control results, providing a reference for clonotype diversity.
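The final step, the inverse Simpson index over the number of cells per clonotype, can be sketched as follows. The input format (one paired clonotype label per cell) and the function name are assumptions for illustration; the upstream clonotype assignment is not reproduced here.

```python
from collections import Counter


def inverse_simpson(clonotype_per_cell):
    """Inverse Simpson index: 1 / sum(p_i^2), where p_i is the fraction
    of cells assigned to clonotype i."""
    cells_per_clonotype = Counter(clonotype_per_cell)
    total_cells = sum(cells_per_clonotype.values())
    if total_cells == 0:
        raise ValueError("no cells with a paired clonotype")
    return 1 / sum((n / total_cells) ** 2 for n in cells_per_clonotype.values())


print(inverse_simpson(["c1", "c1", "c2", "c2"]))  # 2.0 (two equally sized clonotypes)
print(inverse_simpson(["c1", "c1", "c1", "c1"]))  # 1.0 (a single expanded clonotype)
```

The index equals the number of clonotypes when all are equally abundant and drops toward 1 as a few clonotypes dominate.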
For MobiVision TCR/BCR sequencing, approximately 30M reads (about 9GB of data) are recommended per TCR/BCR library. On a 16-core, 64GB system, analysis typically takes about an hour to complete. Adjusting the thread count and memory allocation may alter the runtime. The default thread count for MobiVision v1.6.1 is 8; if there are no special requirements, it's advisable to use this thread count.
By default, MobiVision VDJ runs with the auto parameter, automatically recognizing TCR and BCR. However, if the data quality is poor and the pipeline cannot recognize the TCR or BCR mode, it may be necessary to manually specify the TCR or BCR type.
Significant differences in cell counts between single-cell 5’ transcriptome and V(D)J joint analysis generally occur in two situations. One is when the V(D)J cell count is much lower than the GEX cell count, commonly seen with TCRs due to their often low expression levels, sometimes failing to capture the full-length gene, resulting in false negatives. The other is when the VDJ cell count is much higher than the GEX cell count, typically seen with BCRs due to their often high expression levels, particularly in the presence of plasma cells, where a large amount of background mRNA may cause empty droplets to be counted, leading to false positives. Therefore, combining VDJ results with 5' transcriptome data for joint analysis improves the accuracy of the results.
Of course not. The reference genomes for single-cell transcriptomics and single-cell V(D)J differ significantly, even though they target the same species. In data analysis, the focus and application scenarios of these references differ. Single-cell immunogenomics mainly focuses on V(D)J recombination and clonotypes of immune receptors to understand the diversity, clonal expansion, and immune response of immune cells. Therefore, when constructing a reference genome for single-cell immunogenomics, particular attention should be paid to immune receptor-related genes such as V, D, and J gene annotations and databases. Single-cell transcriptomics, on the other hand, focuses on gene expression across the entire genome to reveal the transcriptional characteristics of different cell types and states. Consequently, when constructing a reference genome for single-cell transcriptomics, consideration must be given to gene annotations and databases across the entire genome.