| Category | Method | Approach | Scale (cells) | Task | Mechanism | Key advantage | Ref. |
|---|---|---|---|---|---|---|---|
| Statistical models | Scarf | Graph-based t-stochastic neighbour embedding | 4 million | Visualization | Graph-based neighbour embedding and hierarchical clustering | Emphasizes rare cells and lineage trajectories | 27 |
| | iNMF | Online integrative non-negative matrix factorization | 1.3 million | Data integration | Jointly decomposes inputs into shared and dataset-specific metagenes | Integrates datasets without requiring the entire dataset during training | 28 |
| | scMerge2 | Hierarchical integration of single cells | 11 million | Data integration | Hierarchical integration capturing local and global variations | Integrates incoming datasets without requiring complete dataset availability during training | 29 |
| | Seurat v5 | Dictionary learning | 8.6 million | Data integration for multi-omic data | Decomposes cells into a multi-omic dictionary | Integrates data independently of single-cell omics measurements | 30 |
| Deep-learning methods | Cumulus | Supervised learning | 1.3 million | Visualization | Learns to project unseen cells via subsampling | Ensures a higher sampling rate for rare cells | 26 |
| | INSCT | Semi-supervised learning | 2.6 million | Data integration | Employs a batch-aware triplet network to generate a combined embedding space | Projects unseen single-cell data into pre-generated embeddings | 24 |
| | Fugue | Self-supervised learning | 18 million | Data integration | Encodes batch information in an unsupervised network | Maintains consistent memory usage across varying data magnitudes | 25 |
| | SCALEX | Unsupervised learning | 4 million | Data integration | Applies a VAE to project cells into a batch-invariant space | Incorporates incoming data without recalculation | 31 |
| | scPoli | Semi-supervised learning | 7.8 million | Data integration | Applies a conditional VAE to regress out batch effects | Explains sample- and cell-level variations with sample embeddings | 32 |
| | Concerto | Self-supervised learning | 10 million | Data integration for multi-omic data | Uses an asymmetric teacher-student architecture for cell pairing and batch separation | Pioneers multi-omics data integration | 33 |
| Large-scale single-cell pre-training | iSEEEK | Masked language modelling | 11.9 million | Cell clustering, developmental trajectory, cell-cell communication | Uses the top 126 genes per cell; predicts masked genes with bidirectional self-attention | Enables focused analysis and noise reduction in single-cell data; enhances contextual understanding | 20 |
| | Geneformer | Masked language modelling | 29.9 million | Chromatin network and therapeutic target inference | Uses all genes within each cell; predicts masked genes using bidirectional self-attention | Fosters a comprehensive understanding of the cellular context; enhances contextual understanding | 22 |
| | tGPT | Auto-regressive modelling | 22.3 million | Cell clustering, cell phenotype, developmental trajectory, therapeutic target inference | Uses the top 64/126 genes per cell; predicts the next gene from previously generated genes | Enables focused analysis and noise reduction; suits single-cell data with temporal or positional order | 21 |
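The online factorization idea behind the iNMF row — updating shared metagenes from streaming batches without revisiting earlier data — can be sketched with a toy NumPy implementation. This is an illustrative sketch using standard multiplicative updates on accumulated sufficient statistics, not the published iNMF algorithm; all variable names, batch sizes, and iteration counts below are assumptions for demonstration:

```python
import numpy as np

rng = np.random.default_rng(0)
k, n_genes = 5, 50                      # number of metagenes, number of genes

# Ground-truth non-negative factors used only to simulate streaming batches.
W_true = rng.random((k, n_genes))

W = rng.random((k, n_genes)) + 0.1      # shared metagene dictionary, learned online
A = np.zeros((k, k))                    # running sum of H^T H
B = np.zeros((k, n_genes))              # running sum of H^T X
eps = 1e-9

def fit_loadings(X, W, n_iter=50):
    """Solve for non-negative cell loadings H (X ~ H @ W) with W held fixed."""
    H = np.full((X.shape[0], W.shape[0]), 0.5)
    WWt = W @ W.T
    for _ in range(n_iter):
        H *= (X @ W.T) / (H @ WWt + eps)  # multiplicative update keeps H >= 0
    return H

for _ in range(30):                      # 30 incoming mini-batches of 40 cells each
    H_sim = rng.random((40, k))
    X_batch = H_sim @ W_true             # simulated expression batch
    H = fit_loadings(X_batch, W)
    A += H.T @ H                         # accumulate sufficient statistics;
    B += H.T @ X_batch                   # the raw batch can now be discarded
    W *= B / (A @ W + eps)               # update metagenes from statistics alone

# Reconstruction quality on the most recent batch.
H = fit_loadings(X_batch, W)
err = np.linalg.norm(X_batch - H @ W) / np.linalg.norm(X_batch)
print(round(err, 3))
```

The key design point is that the dictionary update touches only the fixed-size statistics `A` (k×k) and `B` (k×genes), so memory stays constant no matter how many cells stream through — the property that lets such methods scale to millions of cells.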