Table 1

Tools for analysing million-scale single-cell transcriptomes

| Method | Learning strategy | Cell number | Task | Principle | Specificity | Ref. |
| --- | --- | --- | --- | --- | --- | --- |
| Statistical models | | | | | | |
| Scarf | Graph-based t-distributed stochastic neighbour embedding | 4 million | Visualization | Graph-based neighbour embedding and hierarchical clustering | Emphasizes rare cells and lineage trajectories | 27 |
| iNMF | Online integrative non-negative matrix factorization | 1.3 million | Data integration | Jointly decomposes inputs into shared and dataset-specific metagenes | Integrates datasets without needing the entire data during training | 28 |
| scMerge2 | Hierarchical integration of single cells | 11 million | Data integration | Hierarchical integration for local and global variations | Integrates incoming datasets without complete dataset availability during training | 29 |
| Seurat v5 | Dictionary learning | 8.6 million | Data integration for multi-omic data | Decomposes cells into a multi-omics dictionary | Integrates data independently of single-cell omics measurements | 30 |
| Deep-learning methods | | | | | | |
| Cumulus | Supervised learning | 1.3 million | Visualization | Learns to project unseen cells with subsampling | Ensures a higher rate of sampling from rare cells | 26 |
| INSCT | Semi-supervised learning | 2.6 million | Data integration | Employs a batch-aware triplet network to generate a combined embedding space | Projects unseen single-cell data into pre-generated embeddings | 24 |
| Fugue | Self-supervised learning | 18 million | Data integration | Encodes batch information in an unsupervised network | Maintains consistent memory usage across varying data magnitudes | 25 |
| SCALEX | Unsupervised learning | 4 million | Data integration | Applies a VAE to project cells into a batch-invariant space | Incorporates incoming data without recalculating | 31 |
| scPoli | Semi-supervised learning | 7.8 million | Data integration | Applies a conditional VAE to regress batch effects | Explains sample and cell-level variations with sample embeddings | 32 |
| Concerto | Self-supervised learning | 10 million | Data integration for multi-omic data | Utilizes an asymmetric teacher-student architecture for cell pairing and batch separation | Pioneers multi-omics data integration | 33 |
| Large-scale single-cell pre-training | | | | | | |
| iSEEEK | Masked language modelling | 11.9 million | Cell clustering, development trajectory, cell-cell communication | Leverages top 126 genes for each cell; predicts masked genes with bidirectional self-attention | Enables focused analysis and noise reduction in single-cell data; enhances contextual understanding | 20 |
| Geneformer | Masked language modelling | 29.9 million | Chromatin network and therapeutic targets inference | Leverages all genes within each cell; predicts masked genes using bidirectional self-attention | Fosters a comprehensive understanding of the cellular context; enhances contextual understanding | 22 |
| tGPT | Auto-regressive modelling | 22.3 million | Cell clustering, cell-phenotype, development trajectory, therapeutic targets inference | Leverages top 64/126 genes for each cell; predicts the next gene based on previously generated genes | Enables focused analysis and noise reduction; suitable for single-cell data with temporal or positional order | 21 |
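
The "top genes for each cell" input listed for iSEEEK and tGPT, and the masked-gene objective listed for iSEEEK and Geneformer, can be illustrated with a minimal sketch. It assumes that "top" means ranking genes by expression within a cell, which the table does not state explicitly. The function names, the `[MASK]` token and the toy counts are hypothetical and are not taken from any of the listed tools.

```python
import numpy as np

def top_k_gene_tokens(expression, gene_names, k):
    """Return the names of the k most highly expressed genes, highest first."""
    expression = np.asarray(expression, dtype=float)
    # Stable sort on negated values so tied genes keep their original order.
    order = np.argsort(-expression, kind="stable")
    # Keep only genes that are actually expressed in this cell, then truncate.
    order = [i for i in order if expression[i] > 0][:k]
    return [gene_names[i] for i in order]

def mask_tokens(tokens, mask_positions, mask_token="[MASK]"):
    """Hide the tokens at the given positions; return model inputs and targets."""
    masked = list(tokens)
    targets = {}
    for pos in mask_positions:
        targets[pos] = masked[pos]  # the gene the model would have to predict
        masked[pos] = mask_token
    return masked, targets

# Toy cell: made-up counts for five genes (illustrative only).
genes = ["CD3E", "MS4A1", "LYZ", "NKG7", "GNLY"]
counts = [5, 0, 12, 3, 3]

tokens = top_k_gene_tokens(counts, genes, k=3)
masked, targets = mask_tokens(tokens, mask_positions=[1])
```

In this sketch a masked language model would be trained to recover `targets` from `masked`, whereas an auto-regressive model would instead predict each entry of `tokens` from the entries before it.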