Abstract
Colorectal cancer (CRC) remains a major global health burden, with the gut microbiome emerging as a critical contributor to tumor initiation and progression. Advances in high-throughput sequencing have deepened our understanding of host-microbe interactions across genomic, transcriptomic, epigenomic, and metabolomic levels. This review synthesizes current knowledge on how microbial communities shape colorectal carcinogenesis, including induction of genomic instability, remodeling of host transcriptional and epigenetic landscapes, and reprogramming of metabolic pathways within the tumor microenvironment. Integrative multi-omics strategies and advanced computational tools are powerful means of dissecting these complex biological systems. However, analytical challenges, such as data compositionality, sparsity, and high dimensionality, still hinder meaningful interpretation. Emerging technologies, such as long-read sequencing and bacterial single-cell spatial transcriptomics, are enhancing the resolution and accuracy of microbiota profiling. Finally, the convergence of advanced experimental models, artificial intelligence-driven computational integration, and precision microbiome medicine is highlighted as a key avenue for translating microbiome insights into preventive, diagnostic, and therapeutic innovations in CRC.
Introduction
Colorectal cancer (CRC) is the third most commonly diagnosed cancer and the second leading cause of cancer-related death worldwide1. The development of CRC involves a complex interplay between genetic, environmental, and lifestyle factors, with chronic inflammation serving as a key contributor. Within this context, the gut microbiota, a vast community of bacteria, fungi, archaea, and viruses, has emerged as a dynamic component of the tumor microenvironment (TME) that actively influences cancer initiation and progression2,3. This complex ecosystem has a crucial role in maintaining human health through nutrient metabolism, immune modulation, and maintenance of epithelial barrier integrity4. However, perturbations of this ecosystem, referred to as dysbiosis, can convert commensal populations into drivers of tumorigenesis. Compelling evidence from fecal microbiota transplantation (FMT) studies demonstrates that germ-free mice receiving microbiota from CRC patients exhibit increased cell proliferation, a higher number of polyps, greater dysplasia, and elevated inflammatory markers compared to mice receiving microbiota from healthy individuals5. Large-scale multi-kingdom metagenomic analyses have further identified specific microbiome signatures associated with the adenoma-carcinoma sequence, underscoring the critical role of the microbiota in CRC evolution6–8.
The mechanisms by which the microbiota influences CRC initiation and progression are multifaceted and are primarily mediated through direct microbial contact with host cells and the production of bioactive metabolites9. For example, some pathogens, including Fusobacterium nucleatum, can directly bind to host epithelial E-cadherin via surface virulence factors [e.g., fusobacterial adhesin A (FadA)], activating Wnt/β-catenin signaling to promote cellular proliferation10,11. Concurrently, microbial components [e.g., lipopolysaccharide (LPS)] engage host pattern-recognition receptors, such as Toll-like receptor 4 (TLR4), triggering downstream NF-κB-mediated inflammation and fostering a tumor-promoting microenvironment12. In addition to direct contact, microbiota-derived metabolites exert profound and context-dependent influences. Secondary bile acids, such as deoxycholic acid (DCA), which are generated via microbial biotransformation of primary bile acids, have been shown to impair cytotoxic CD8⁺ T cell functionality and thereby promote immune evasion in CRC13. Likewise, trimethylamine N-oxide (TMAO), a microbial product of dietary choline metabolism, is associated with increased CRC risk through pro-inflammatory and pro-angiogenic effects14. Collectively, these findings indicate that the gut microbiome functions not as a passive passenger within the TME but as an active trigger and sustained modulator of CRC pathogenesis.
In the past two decades, the application of culture-independent, high-throughput sequencing technologies has driven the expansion of human microbiome databases and exponential growth in ‘-omics’ data15. These advances have shifted the field beyond simple compositional profiling toward integrative analyses that interrogate microbial function, ecology, and molecular interactions with the host. As a result, dysbiosis in CRC is now understood to encompass not only alterations in microbial abundance but also profound functional remodeling, including disruptions in short-chain fatty acid (SCFA) biosynthesis, bile acid metabolism, and the production of genotoxic or pro-inflammatory metabolites16. Statistical, computational, and mathematical frameworks now enable researchers to identify core microbial pathways that may disrupt the delicate homeostatic balance with the host, providing mechanistic insight into how the microbiota compromises epithelial barrier integrity, modulates antitumor immunity, and contributes to oncogenic signaling17,18. Understanding these complex and multilayered interactions is essential for developing targeted therapeutic strategies in the era of microbiome-aware precision oncology.
In this review, contemporary advances in microbiota-host interactions in CRC are synthesized across multiple omics layers, highlighting mechanistic insights into how microbes and microbial products modulate tumorigenesis. The prevalent computational frameworks and persistent methodological challenges that shape current microbiome research are discussed. Finally, innovative technologies and future directions that may ultimately enable the clinical translation of microbiome-informed diagnostics, prognostics, and therapeutics in CRC are explored.
Microbiome and host interplay at multi-omics layers
The intricate interplay between the microbiome and host during CRC development can be systematically characterized across multiple molecular layers, including the genome, transcriptome, epigenome, and metabolome (Figure 1). To this end, researchers use a diverse suite of profiling techniques to generate multidimensional datasets and jointly analyze these data with microbiome features to uncover key insights into host-microbiota interactions in CRC (Table 1).
Figure 1. Microbiome and host interaction in CRC. pks+ Escherichia coli: Produces the genotoxin colibactin, which induces DNA double-strand breaks and genomic instability. Enterotoxigenic Bacteroides fragilis (ETBF): Secretes BFT, which contributes to ROS generation and induces DNA damage; BFT also disrupts E-cadherin-mediated cell adhesion, activating β-catenin signaling. ETBF downregulates the tumor-suppressive microRNA, miR-149-3p. Fusobacterium nucleatum: Utilizes virulence factors (FadA and Fap2) to bind host E-cadherin and Gal-GalNAc, respectively. F. nucleatum promotes TNFSF9 gene expression, upregulates the lncRNA ENO1-IT1, and suppresses the m6A “writer” enzyme (METTL3) via the YAP signaling pathway. Other bacteria promote cholesterol biosynthesis. Commensal-derived metabolites, such as butyrate, function as HDAC inhibitors, while microbial LPS activates TLR4 and NF-κB signaling, fostering an inflammatory and pro-tumorigenic microenvironment. BFT, Bacteroides fragilis toxin; CRC, colorectal cancer; Gal-GalNAc, galactose-N-acetyl-d-galactosamine; HDAC, histone deacetylase; LPS, lipopolysaccharide; ROS, reactive oxygen species; TLR4, Toll-like receptor 4.
Table 1. Overview of the application of multi-omics technologies and the main findings in CRC-related microbiome research
Microbiome and host genetic mutation
The gut microbiota contributes to CRC initiation and clonal evolution by directly inducing genomic instability and modulating host oncogenic pathways. Whole-genome sequencing (WGS) and whole-exome sequencing (WES) of CRC tumors have identified distinct mutational signatures that are mechanistically attributed to bacterial genotoxins36. The pks island carried by pks+ Escherichia coli encodes a series of enzymes responsible for synthesizing colibactin, a potent genotoxin capable of inducing DNA damage in epithelial cells37. This damage manifests as interstrand crosslinks and double-strand breaks and ultimately leaves a characteristic mutational signature (i.e., the “colibactin damage motif”), in which damage occurs preferentially at AAWWTT sequences (where W represents A or T); the resulting genomic instability is a hallmark of cancer progression19. Colibactin-associated mutations have been identified in > 12% of CRC cases38, underscoring the substantial contribution of this genotoxin to disease burden. Similar genotoxic effects have been observed with other bacterial toxins, including Bacteroides fragilis toxin (BFT), which is produced by enterotoxigenic Bacteroides fragilis (ETBF) and promotes DNA damage through spermine oxidase-dependent generation of reactive oxygen species (ROS), further establishing direct molecular links between specific gut bacteria and somatic mutations in CRC39,40.
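To make the motif definition above concrete, the following minimal sketch scans a DNA string for AAWWTT sites, expanding the IUPAC code W to A or T. The function name and the example sequence are hypothetical and chosen purely for illustration. The sketch locates candidate motif sites only and is not a mutational signature analysis.

```python
import re

# W is the IUPAC code for A or T, so AAWWTT expands to AA[AT][AT]TT.
# The lookahead lets overlapping sites be reported as well.
COLIBACTIN_MOTIF = re.compile(r"(?=(AA[AT][AT]TT))")


def find_motif_sites(sequence):
    """Return (0-based start, matched hexamer) for every AAWWTT site."""
    return [(m.start(), m.group(1))
            for m in COLIBACTIN_MOTIF.finditer(sequence.upper())]


# Hypothetical sequence, for illustration only.
sites = find_motif_sites("GCAAATTTGGAATATTCCAAGCTT")
print(sites)
```

Because AAWWTT reads the same on the reverse-complement strand, scanning one strand is sufficient in this simplified setting.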
In addition to inducing de novo mutations, the microbiome can reshape the function of key tumor suppressor genes. TP53, the most frequently mutated gene in human cancers, is altered in approximately 43% of CRC cases41. Mutant p53 typically promotes tumor development, enhances cancer cell survival, and is associated with treatment resistance and poor prognosis. However, the behavior of mutant p53 appears to depend on the microbial context: upon exposure to gallic acid, a polyphenol metabolite produced by gut commensal microbes, mutant p53 loses its tumor-suppressive property and switches to oncogenic functions by over-activating the Wnt signaling pathway, significantly enhancing tumor cell proliferation and invasion20. In addition, some microbes can promote p53 ubiquitination and degradation, further attenuating its tumor-suppressive function42.
Host genetics reciprocally influence microbial colonization, establishing a bidirectional feedback loop that reinforces tumor-promoting interactions. For example, KRAS mutations promote intratumoral colonization of ETBF in CRC by regulating the miRNA3655/SURF6/IRF7/IFNβ axis21, whereas F. nucleatum is preferentially enriched in KRAS p.G12D mutant CRC tumor tissues and contributes to tumorigenesis via adhesion to the RNA helicase, DHX1522. Similarly, BRAF V600E mutations render intestinal stem cells susceptible to colonization with toxigenic bacteria43, highlighting how common oncogenic drivers can create a niche for specific tumor-promoting microbes.
The search for microbiome-genome interactions is increasingly powered by microbiome genome-wide association studies (mGWAS), which correlate host genetic variation with microbial abundance44. The single nucleotide polymorphism (SNP), rs2355016, which is located in an intron of ATP-sensitive inward rectifier potassium channel 11 (KCNJ11), was shown to be significantly associated with the abundance of F. nucleatum in a cohort of 748 Chinese CRC patients. Decreased KCNJ11 expression enhances bacterial adhesion through galactose-N-acetyl-d-galactosamine (Gal-GalNAc) binding and promotes CRC growth45. A two-sample Mendelian randomization (MR) study, which tests for potential causal relationships between gut microbiota and health outcomes by comparing summary statistics from separate microbiome and outcome genome-wide association studies (GWAS), has also identified a causal relationship between six bacterial taxa and CRC at a locus-wide significance level46. Additional studies indicate that early-life microbial exposures interact with host polymorphisms (e.g., DUOX2 variants) to modulate CRC susceptibility through microbiota-mediated pathways23.
In summary, the microbiota contributes to CRC-associated genetic instability through both direct genotoxic activity and functional interference with canonical tumor suppressors, while host genetic alterations also shape microbial colonization patterns. This bidirectional interplay solidifies the role of the microbiome as an active contributor to the mutational and ecological landscape of CRC.
Microbiome and host transcriptomic alteration
The gut microbiota exerts significant influence on host gene expression programs during CRC development, reshaping signaling pathways, immune responses, and cellular phenotypes47. Researchers have used a wide range of techniques, including bulk RNA sequencing (RNA-seq), single-cell RNA sequencing (scRNA-seq), and spatial transcriptomics, to profile host gene expression, and jointly analyze these data with microbiome features to uncover key insights into host-microbiota interactions. Early bulk RNA-seq of CRC tumor tissues has been instrumental in revealing broad associations between specific microbial presence or dysbiosis and distinct host gene expression programs, identifying links between pathogenic mucosal bacteria and the expression of host genes involved in inflammation, proliferation, and epithelial barrier dysfunction48. For example, colonization with F. nucleatum activates TNFSF9, a co-stimulatory molecule that modulates tumor-associated immune responses24.
More recently, advances in single-cell technologies have transformed our understanding of host-microbe interactions by enabling cell type-specific resolution of these complex relationships. Spatial host-microbiome sequencing, a novel technique that combines spatial transcriptomics and 16S rRNA gene amplicon sequencing, allows simultaneous profiling of host gene expression and microbial composition in tissues with spatial resolution. Using this approach, distinct spatial niches were identified in the mouse gut. Bacterial genera, such as Pseudobutyrivibrio and Oscillibacter, were shown to influence the expression of host genes (e.g., Muc2 and Ceacam20) that are critical for maintaining gut barrier integrity and immune signaling49. The development of invasion-adhesion-directed expression sequencing (INVADEseq) represents a major technological leap, enabling precise identification of bacterial invasion events at single-cell resolution25. F. nucleatum-infected epithelial cells exhibit robust induction of inflammatory genes (e.g., CXCL1) and matrix-remodeling factors (e.g., MMP9) in CRC tissues, implicating bacteria in driving metastatic potential. Concurrently, distinct immune cell populations, such as tumor-associated macrophages and T cells, exhibit profoundly altered expression profiles in pathways related to DNA repair and cellular dormancy upon bacterial exposure. These findings suggest sophisticated, cell-type-specific mechanisms of bacterial modulation that are entirely obscured in bulk tissue analyses.
In addition to protein-coding mRNAs, the microbiota also exerts influence through the regulation of non-coding (nc)RNAs. For example, F. nucleatum upregulates the long non-coding (lnc)RNA, ENO1-IT1, which in turn promotes chemoresistance by activating autophagic pathways that allow cancer cells to survive therapeutic stress26,27. In contrast, ETBF downregulates the tumor-suppressive micro (mi)RNA, miR-149-3p, enhancing Th17 cell differentiation and creating a pro-tumorigenic environment. Moreover, microbial suppression of the lncRNA, Snhg9, disrupts its role in stabilizing p53 via the SIRT1–CCAR2 complex, illustrating how microbes exploit ncRNA circuits to reinforce oncogenic signaling50.
In summary, these findings reveal that the gut microbiota shapes CRC progression through cell-type-specific transcriptional reprogramming, direct modulation of cancer-relevant pathways, and sophisticated regulation of ncRNA networks. This comprehensive rewiring of the host transcriptome represents a fundamental mechanism through which microbial communities influence tumor behavior and therapeutic responses.
Microbiome and host epigenomic modulation
Epigenetics refers to the study of heritable changes in gene expression that occur without alterations to the underlying DNA sequence. The primary epigenetic mechanisms include DNA methylation, various histone modifications (e.g., acetylation and methylation), and regulatory ncRNAs, which collectively mediate the interplay between genetic predisposition and environmental factors. The gut microbiota, as a dynamic and abundant environmental factor resident within the host, is a potent modulator of this epigenetic landscape51,52. Advances in technologies, such as chromatin immunoprecipitation sequencing (ChIP-seq) for histone modifications, whole-genome bisulfite sequencing for DNA methylation, and assay for transposase-accessible chromatin with high-throughput sequencing (ATAC-seq) for chromatin accessibility, have begun to illuminate how microbial signals directly reshape the host epigenome in CRC.
A growing body of evidence delineates a direct causal link between CRC-associated dysbiosis and aberrant epigenetic patterning. Commensal bacteria have been shown to induce TET2/3-dependent DNA demethylation in intestinal epithelial cells, thereby reprogramming transcriptional responses associated with colitis and CRC28. FMT from CRC patients to mice induces hypermethylation of the promoter regions of key tumor suppressor genes, such as SFRP1, PENK, and WIF153. Several pathogens, such as F. nucleatum, Hungatella hathewayi, and Streptococcus species, which are consistently enriched in CRC tumors, have been reported to promote hypermethylation of tumor suppressor gene promoters by upregulating the expression of DNA methyltransferases (DNMTs)29. Moreover, a population-based study revealed that CRC tissues with a high abundance of F. nucleatum are significantly associated with the CpG island methylator phenotype (CIMP)-high subtype and with a poorer patient prognosis54. Supporting these findings, integrated multi-omics analyses combining 16S/metagenomic sequencing with whole-genome bisulfite sequencing have consistently identified significant methylome differences between CRC tumors and adjacent normal tissues, pinpointing specific microbe-associated CpG methylation patterns in gene promoter regions30.
The microbial influence extends beyond DNA methylation to the rapidly growing field of RNA epigenetics, in which bacteria and bacteria-derived metabolites can modulate mRNA N6-methyladenosine (m6A) modification. For example, F. nucleatum has been shown to suppress the m6A “writer” enzyme, METTL3, via the YAP signaling pathway, leading to reduced m6A modification and subsequent stabilization of pro-metastatic genes, such as Kif26b31. In contrast, butyrate, a metabolite derived from commensal bacteria, such as Clostridium, functions not only as a potent endogenous histone deacetylase inhibitor (HDACi) but also as a key regulator of epi-transcriptomic modifications. Butyrate inhibits CRC development by downregulating the expression of METTL3 and its downstream targets, such as cyclin E1, thereby attenuating processes such as epithelial-mesenchymal transition (EMT)55,56.
Therefore, the gut microbiota could serve as a master regulator of the host epi-transcriptome, influencing CRC progression through both classical epigenetic mechanisms (DNA methylation) and emerging epi-transcriptomic pathways (m6A modification). While these insights reveal promising therapeutic targets for epigenetic therapy, this field remains nascent and warrants further investigation to fully exploit the potential of targeting the microbiota-epigenome axis for CRC intervention.
Microbiome and host metabolomic crosstalk
Metabolomic crosstalk represents one of the most direct and dynamic interfaces of the microbiota-host interaction in CRC. The gut microbiota functions as a vast, diverse, and highly adaptable bioreactor, transforming dietary and host-derived compounds into a wide array of bioactive metabolites that directly influence host cellular processes. Mass spectrometry-based metabolomics has enabled systematic and comprehensive profiling of thousands of small molecules in biological samples, such as serum and stool. Integrated multi-omics approaches have linked specific metabolite profiles to CRC-associated microbiome signatures, identifying both enriched metabolites, such as leucylalanine, serotonin, and imidazole propionate, and depleted metabolites, including perfluorooctane sulfonate and sphingadienine, in CRC patients32.
Maladaptation in host-microbiota metabolic crosstalk has a critical role in colorectal carcinogenesis. An arginine succinyltransferase (AST)-deficient strain of E. coli, Nissle 1917 (ΔacEcN), inhibits intestinal arginine catabolism, leading to arginine accumulation that promotes M2 macrophage polarization and activates Wnt/β-catenin signaling, ultimately accelerating tumor growth33. Similarly, multi-omics analyses involving single-cell transcriptomics, microbiome profiling, metabolomics, and clinical data have revealed significant activation of the host urea cycle as a hallmark of colorectal tumorigenesis. This metabolic shift is accompanied by a loss of ureolytic beneficial bacteria, such as Bifidobacterium, and an expansion of non-ureolytic pathobionts, collectively disrupting intestinal nitrogen homeostasis and local immune responses34.
The gut microbiome also profoundly modulates lipid metabolism in CRC. For example, F. nucleatum infection reduces lipid accumulation in CRC stem-like cells (CCSCs) by enhancing fatty acid oxidation, thus promoting CCSC self-renewal. In non-CCSCs, by contrast, F. nucleatum infection induces fatty acid formation and promotes lipid accumulation, which in turn drives cancer stemness via activation of the Notch/Numb signaling pathway35. Similarly, Peptostreptococcus anaerobius interacts with TLR2 and TLR4 on colon epithelial cells to increase intracellular levels of reactive oxygen species, which promote cholesterol synthesis and cell proliferation via SREBP2 and PI3K-Akt-mTOR signaling57. Bile acid metabolism constitutes another major node of microbiota-driven metabolic reprogramming. Genes involved in secondary bile acid synthesis are strongly associated with CRC progression58. A landmark shotgun metagenomics-metabolomics study demonstrated that the levels of the carcinogenic secondary bile acid, DCA, are markedly elevated in patients with advanced adenomas and that Bilophila wadsworthia is the only species consistently correlated with DCA abundance59.
In summary, the metabolome provides a functional readout of the biochemical activity of the microbiome within the TME. Microbial metabolites act as signaling molecules, immune modulators, and metabolic substrates that collectively shape tumorigenesis. Deciphering this complex crosstalk is essential for developing metabolism-focused interventions and biomarker strategies in CRC.
Computational approaches for microbiome multi-omics integration
The complexity of microbiota-host interactions necessitates computational methods that can integrate disparate omics datasets to infer meaningful biological relationships. These approaches move beyond analyzing each data type in isolation, aiming to construct a holistic model of the interplay between microbial communities and host biology in cancer (Figure 2).
Figure 2. Overview of multi-omics integration methods used to reveal host-microbiota interactions. Principal coordinates analysis is a fundamental linear method that transforms the original high-dimensional data into a set of orthogonal axes capturing major sources of variation. Hierarchical clustering organizes microbiota samples into dendrograms through pairwise distance metrics, such as Bray-Curtis or Jensen-Shannon divergence. The Spearman rank correlation is widely used due to its robustness to the non-normal distributions typical of microbial data. Random forest models can handle high-dimensional data and provide measures of feature importance, ranking microbes by their predictive power. Network analysis provides a systems-level framework for integrating multi-omics data by representing complex microbiota-host interactions as unified graphs.
Dimensionality reduction methods
Microbiome and transcriptome datasets are typically high-dimensional, characterized by numerous features [e.g., operational taxonomic units (OTUs) and genes] relative to sample size. Dimensionality reduction techniques provide an essential first step for exploratory data analysis, providing a high-level overview of sample similarities and underlying patterns60. Principal component analysis (PCA) and an extension for distance matrices, principal coordinates analysis (PCoA), are fundamental linear methods that transform the original high-dimensional data into a set of orthogonal axes capturing major sources of variation61. These techniques are widely used to visualize the overall structure of microbiome data (e.g., based on beta-diversity metrics, like Bray-Curtis or UniFrac distances) and host omics data (e.g., gene expression profiles from RNA-seq), enabling researchers to assess sample relationships, detect potential batch effects, and identify the main sources of biological variation. For more complex, non-linear structures that linear methods fail to capture, non-linear dimensionality reduction techniques, such as t-distributed stochastic neighbor embedding (t-SNE) and uniform manifold approximation and projection (UMAP), have gained prominence62,63. These methods are particularly powerful for revealing subtle sample clusters that may correspond to biologically or clinically distinct patient subtypes, including those defined by specific microbiome-host interaction states.
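As an illustration of the ordination step described above, the following minimal sketch computes Bray-Curtis dissimilarities between samples and applies classical PCoA (double-centering of the squared distances followed by eigendecomposition). The count table is a hypothetical toy example, and the function names are illustrative rather than taken from any specific package. Negative eigenvalues, which can arise with non-Euclidean dissimilarities, are simply clipped to zero here.

```python
import numpy as np


def bray_curtis(counts):
    """Pairwise Bray-Curtis dissimilarity between rows (samples)."""
    n = counts.shape[0]
    dist = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            num = np.abs(counts[i] - counts[j]).sum()
            den = (counts[i] + counts[j]).sum()
            dist[i, j] = dist[j, i] = num / den if den > 0 else 0.0
    return dist


def pcoa(dist, n_axes=2):
    """Classical PCoA: double-center squared distances, then eigendecompose."""
    n = dist.shape[0]
    centering = np.eye(n) - np.ones((n, n)) / n
    gower = -0.5 * centering @ (dist ** 2) @ centering
    eigvals, eigvecs = np.linalg.eigh(gower)
    order = np.argsort(eigvals)[::-1]              # largest eigenvalue first
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    kept = np.clip(eigvals[:n_axes], 0.0, None)    # clip negative eigenvalues
    coords = eigvecs[:, :n_axes] * np.sqrt(kept)
    explained = kept / eigvals[eigvals > 0].sum()  # share of positive variation
    return coords, explained


# Hypothetical taxon counts: rows are samples, columns are taxa.
counts = np.array([
    [50, 30, 5, 0],
    [45, 35, 3, 1],
    [2, 4, 60, 40],
    [1, 6, 55, 45],
], dtype=float)

dist = bray_curtis(counts)
coords, explained = pcoa(dist)
print(np.round(coords, 3))
print(np.round(explained, 3))
```

In this toy table, the first axis separates the first two samples from the last two, mirroring how PCoA plots are read for group separation or batch effects.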
Clustering methods
Clustering methods are essential for deciphering complex microbiome data structures by grouping samples or microbial features based on similarity measures64. Hierarchical clustering, one of the most established approaches, organizes microbiota samples into dendrograms through pairwise distance metrics, such as Bray-Curtis or Jensen-Shannon divergence. This method enables multi-resolution visualization of sample relationships and is particularly valuable for identifying broad ecological patterns in microbial communities65. K-means clustering offers an alternative partitioning approach that groups samples into a predefined number of clusters by minimizing within-cluster variance, providing efficient processing of large datasets despite requiring prior specification of cluster numbers66. Partitioning around medoids (PAM) clustering with Jensen-Shannon distance is widely adopted for enterotype analysis due to robustness to outliers, effectively identifying stable states of the gut ecosystem across populations67. Recent advances have incorporated machine learning to enhance clustering performance.
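The medoid-based partitioning described above can be sketched as follows, assuming a small hypothetical abundance table. Note that this sketch uses a simple alternating k-medoids update rather than the full PAM swap procedure, and uses the square root of the base-2 Jensen-Shannon divergence as the distance. All function names are illustrative.

```python
import numpy as np


def jensen_shannon_distance(p, q):
    """Square root of the base-2 Jensen-Shannon divergence of two profiles."""
    p = p / p.sum()
    q = q / q.sum()
    m = 0.5 * (p + q)

    def kl(a, b):
        mask = a > 0                       # 0 * log(0) is treated as 0
        return np.sum(a[mask] * np.log2(a[mask] / b[mask]))

    return np.sqrt(max(0.5 * kl(p, m) + 0.5 * kl(q, m), 0.0))


def pairwise_jsd(table):
    n = table.shape[0]
    dist = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            dist[i, j] = dist[j, i] = jensen_shannon_distance(table[i], table[j])
    return dist


def k_medoids(dist, k, n_iter=100, seed=0):
    """Alternating k-medoids (a simplification of PAM) on a distance matrix."""
    rng = np.random.default_rng(seed)
    medoids = rng.choice(dist.shape[0], size=k, replace=False)
    for _ in range(n_iter):
        labels = np.argmin(dist[:, medoids], axis=1)
        new_medoids = medoids.copy()
        for c in range(k):
            members = np.where(labels == c)[0]
            if members.size == 0:
                continue
            # The medoid is the member with the smallest total distance
            # to all other members of its cluster.
            within = dist[np.ix_(members, members)].sum(axis=0)
            new_medoids[c] = members[np.argmin(within)]
        if np.array_equal(new_medoids, medoids):
            break
        medoids = new_medoids
    labels = np.argmin(dist[:, medoids], axis=1)
    return labels, medoids


# Hypothetical genus-level counts: rows are samples, columns are genera.
table = np.array([
    [60, 25, 10, 5],
    [55, 30, 10, 5],
    [65, 20, 10, 5],
    [5, 10, 30, 55],
    [10, 5, 25, 60],
    [5, 10, 25, 60],
], dtype=float)

labels, medoids = k_medoids(pairwise_jsd(table), k=2)
print(labels, medoids)
```

Because medoids are actual samples rather than averaged centroids, the resulting clusters can be summarized by a representative community profile, which is one reason this family of methods is less sensitive to outliers than k-means.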
The integration of microbial data with host transcriptomic classifications has yielded significant insights into CRC heterogeneity. A landmark study by Guinney et al. established the consensus molecular subtypes (CMS) classification, which categorizes CRC tumors into four distinct subgroups based on gene expression patterns: CMS1 (microsatellite instability immune, 14%); CMS2 (canonical, 37%); CMS3 (metabolic dysregulation, 13%); and CMS4 (mesenchymal, 23%)68. Subsequent research has revealed critical interactions between these molecular subtypes and specific microbial signatures. For example, F. nucleatum enrichment has been identified as a prognostic factor for patients within the CMS4 subtype, with studies showing that mesenchymal tumors (CMS4) exhibiting high levels of Fusobacteriales are associated with an approximately two-fold higher risk of poor clinical outcomes69. Building on this finding, more recent frameworks have integrated microbial community data directly into stratification models, leading to the proposal of onco-microbial community subtypes (OCSs). This classification system stratifies CRCs into three distinct subgroups based on microbial composition, each demonstrating unique clinico-molecular features and patient outcomes, thereby highlighting the power of integrated omics approaches for refining cancer taxonomy70. Furthermore, multi-omics profiling of fecal samples has identified five enterotypes associated with immunotherapy responses in cancer patients, including patients with CRC71. These approaches not only reveal microbial community structures but also enable the development of novel classification systems that combine microbial and molecular features, offering new perspectives for personalized cancer diagnostics and therapeutics.
Correlation-based methods
Correlation-based analyses represent a fundamental approach for identifying specific pairwise associations between microbial features (e.g., species abundance) and host molecular features (e.g., gene expression and metabolite concentration). Standard non-parametric tests, such as Spearman’s rank correlation, are widely used due to their robustness to the non-normal distributions that are typical of microbial data. However, a critical limitation arises from the compositional nature of microbiome data, in which the relative abundance of each taxon is intrinsically dependent on other taxa, often leading to spurious correlations that reflect data structure rather than true biological relationships. To address these limitations, more sophisticated multivariate approaches have been developed. Multivariate statistical methods, including partial least squares regression (PLS), orthogonal partial least squares, and non-metric multidimensional scaling (NMDS), enable the identification of key features contributing to associations across multiple omics datasets. Canonical correlation analysis (CCA) and its sparse variant (sCCA) enable the identification of complex associations between groups of microbial and host features. Priya et al. demonstrated the power of this approach by applying sCCA to colonic mucosal samples from CRC patients, revealing coordinated microbial-host gene networks with both shared and disease-specific interaction patterns72. Other advanced methodologies, including weighted correlation network analysis (WGCNA), facilitate the construction of co-expression networks that capture complex interaction patterns, while Procrustes analysis coupled with Mantel testing provides a framework for assessing overall concordance between microbial and host data matrices.
Procrustes analysis rotates, scales, and translates one dataset (e.g., microbiome PCoA coordinates) to maximize its similarity with another dataset (e.g., transcriptome PCoA coordinates), and the statistical significance of the association is typically assessed using a Mantel test73. Tools, such as MaAsLin2, are specifically designed to identify multivariate associations while controlling for confounding factors, such as age, BMI, and technical batch effects, providing robust statistical frameworks for high-dimensional data74. Despite the challenges posed by high-dimensional multi-omics data, these correlation-based methods remain essential tools for generating initial hypotheses in microbiota-host interaction studies. They provide a crucial foundation for subsequent experimental validation of identified relationships, bridging the gap between observational associations and mechanistic understanding.
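The Procrustes superimposition described above can be sketched as follows. Both coordinate sets are centered and scaled to unit norm, the optimal orthogonal transformation (rotation or reflection) is obtained from a singular value decomposition, and the residual statistic m² ranges from 0 (perfect fit) to 1. As an assumption made for this sketch, significance is assessed with a simple row-permutation test on m² rather than the Mantel test mentioned in the text, and the two ordinations are simulated rather than derived from real data.

```python
import numpy as np


def procrustes_m2(x, y):
    """Symmetric Procrustes statistic: 0 is a perfect fit, 1 is no fit."""
    x = x - x.mean(axis=0)                 # translate to a common origin
    y = y - y.mean(axis=0)
    x = x / np.linalg.norm(x)              # scale to unit Frobenius norm
    y = y / np.linalg.norm(y)
    # The best orthogonal alignment is given by the SVD of the cross-product.
    singular_values = np.linalg.svd(x.T @ y, compute_uv=False)
    return 1.0 - singular_values.sum() ** 2


def permutation_test(x, y, n_perm=999, seed=0):
    """Permute sample order in y and count fits at least as good as observed."""
    rng = np.random.default_rng(seed)
    observed = procrustes_m2(x, y)
    hits = 0
    for _ in range(n_perm):
        shuffled = y[rng.permutation(len(y))]
        if procrustes_m2(x, shuffled) <= observed:
            hits += 1
    return observed, (hits + 1) / (n_perm + 1)


# Simulated example: the "host" ordination is a rotated, scaled, and shifted
# copy of the "microbiome" ordination, so the fit should be near perfect.
rng = np.random.default_rng(1)
microbiome_axes = rng.normal(size=(10, 2))
theta = np.pi / 6
rotation = np.array([[np.cos(theta), -np.sin(theta)],
                     [np.sin(theta), np.cos(theta)]])
host_axes = 2.5 * microbiome_axes @ rotation + np.array([3.0, -1.0])

m2, p_value = permutation_test(microbiome_axes, host_axes)
print(round(m2, 6), p_value)
```

In practice, the inputs would be ordination coordinates for the same samples in the same row order, since the method relies on matched sample pairs.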
Regression and classification methods
Regression and classification methods represent a critical advancement beyond correlation-based analyses, moving from identifying associations to building microbiome-based predictive models in which microbial features forecast host-related outcomes75. These techniques often involve feature selection, probabilistic inference, model optimization, and performance assessment. Regularized regression techniques, particularly least absolute shrinkage and selection operator (LASSO) and Elastic Net, are well-suited to high-dimensional microbiome data. These techniques perform variable selection by penalizing regression coefficients, effectively identifying a small subset of microbial taxa that are most predictive of a continuous host outcome (regression), such as an immune cell infiltration score, or a categorical outcome (classification), such as response to immunotherapy. For more complex, non-linear relationships, machine learning algorithms, such as random forests, are widely used. These models can handle high-dimensional data and provide measures of feature importance, ranking microbes by their predictive power. CRC research has provided particularly successful applications of these methods. For example, a random forest-based multiclass model combining gut bacterial abundances with metabolite markers significantly improved diagnostic performance in distinguishing CRC from other gastrointestinal diseases, showcasing the clinical potential of integrated microbiome-host feature analysis76. The MetaNN framework, a neural network-based deep learning approach, demonstrates superior classification accuracy for host phenotype prediction by leveraging both synthetic and real metagenomic data77. These frameworks illustrate the growing potential of predictive modeling to translate microbiome-host interactions into actionable biomarkers.
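To illustrate how an L1 penalty performs variable selection, the following minimal sketch fits a LASSO model by cyclic coordinate descent with soft-thresholding. It covers only the continuous-outcome case; Elastic Net and classification are not shown. The data are simulated under the assumption that only two hypothetical taxa (columns 3 and 7) drive the host score, and the hand-written solver stands in for the optimized implementations available in established libraries.

```python
import numpy as np


def soft_threshold(value, penalty):
    """Shrink value toward zero by penalty, setting small values to exactly 0."""
    return np.sign(value) * max(abs(value) - penalty, 0.0)


def lasso_coordinate_descent(x, y, alpha, n_iter=200):
    """Minimize (1/2n)||y - Xb||^2 + alpha * ||b||_1 on standardized columns.

    Assumes no column of x is constant (standard deviation must be non-zero).
    """
    n, p = x.shape
    x = (x - x.mean(axis=0)) / x.std(axis=0)
    y = y - y.mean()
    beta = np.zeros(p)
    residual = y.copy()
    for _ in range(n_iter):
        for j in range(p):
            residual += x[:, j] * beta[j]         # remove feature j's contribution
            rho = x[:, j] @ residual / n
            beta[j] = soft_threshold(rho, alpha)  # column variance is 1 after scaling
            residual -= x[:, j] * beta[j]         # add the updated contribution back
    return beta


# Simulated data: 60 samples, 30 taxa; only taxa 3 and 7 affect the outcome.
rng = np.random.default_rng(0)
abundance = rng.normal(size=(60, 30))
score = (2.0 * abundance[:, 3] - 1.5 * abundance[:, 7]
         + rng.normal(scale=0.1, size=60))

beta = lasso_coordinate_descent(abundance, score, alpha=0.3)
selected = np.flatnonzero(beta).tolist()
print(selected, np.round(beta[selected], 2))
```

The penalty strength alpha controls how many taxa are retained and is normally chosen by cross-validation. The retained coefficients are shrunk toward zero, so they should be read as a ranking aid rather than as unbiased effect sizes.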
Network methods
Network analysis provides a systems-level framework for integrating multi-omics data by representing complex microbiota-host interactions as unified graphs78. In these networks, nodes correspond to entities across different biological layers (e.g., bacterial species, host genes, and metabolites), while edges represent statistically significant associations between the nodes, including correlations, partial correlations, or more advanced dependency measures. This approach moves beyond pairwise analyses to reveal larger, interconnected functional modules that may drive biological processes. Methods, like graphical LASSO, enable the inference of conditional dependency networks that estimate direct interactions between nodes after accounting for all other variables in the system. Such estimates provide a more accurate representation of potential direct biological relationships than simple correlation analyses. Once constructed, these networks can be analyzed to identify highly interconnected modules, which are clusters of nodes that may represent functional units, such as a group of co-occurring bacteria interacting with host genes involved in specific signaling pathways (e.g., JAK-STAT or Wnt/β-catenin signaling)79. Network topology analysis can further reveal biologically significant patterns through centrality measures. Nodes with high-degree centrality (many connections) or high-betweenness centrality (positioned on many shortest paths) may represent keystone species or critical host factors central to network stability and function. For example, microbial networks have been shown to vary spatially within tumors and across disease states with specific modules enriched in cancer samples compared to healthy tissue80. Functional enrichment analysis [e.g., Gene Ontology (GO) and Kyoto Encyclopedia of Genes and Genomes (KEGG)] of host genes within these modules helps elucidate the mechanistic pathways through which microbial communities may influence tumor biology.
Overall, network methods offer a powerful integrative approach for dissecting multi-omics interactions and identifying system-level drivers of cancer progression.
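The two centrality measures mentioned above can be illustrated on a small toy graph. The sketch below computes degree and betweenness centrality (the latter with Brandes' algorithm for unweighted graphs) using only the standard library. The node names and edges are hypothetical placeholders rather than associations inferred from data; in a real analysis, the edge list would come from a network inference step, such as graphical LASSO.

```python
from collections import deque


def build_graph(edges):
    """Undirected adjacency lists from a list of (node, node) pairs."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, []).append(v)
        adj.setdefault(v, []).append(u)
    return adj


def degree_centrality(adj):
    """Number of neighbors per node."""
    return {node: len(neighbors) for node, neighbors in adj.items()}


def betweenness_centrality(adj):
    """Unnormalized betweenness for an unweighted, undirected graph (Brandes)."""
    score = {v: 0.0 for v in adj}
    for source in adj:
        order = []  # nodes in order of non-decreasing distance from source
        preds = {v: [] for v in adj}
        n_paths = {v: 0 for v in adj}
        dist = {v: -1 for v in adj}
        n_paths[source], dist[source] = 1, 0
        queue = deque([source])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    n_paths[w] += n_paths[v]
                    preds[w].append(v)
        dependency = {v: 0.0 for v in adj}
        for w in reversed(order):
            for v in preds[w]:
                dependency[v] += n_paths[v] / n_paths[w] * (1.0 + dependency[w])
            if w != source:
                score[w] += dependency[w]
    # Each unordered pair was counted from both endpoints.
    return {v: s / 2.0 for v, s in score.items()}


# Hypothetical network: a three-taxon module linked to two host genes via a metabolite.
edges = [
    ("taxon_A", "taxon_B"), ("taxon_A", "taxon_C"), ("taxon_B", "taxon_C"),
    ("taxon_C", "metabolite_M"),
    ("metabolite_M", "gene_G1"), ("metabolite_M", "gene_G2"),
]
graph = build_graph(edges)
degree = degree_centrality(graph)
betweenness = betweenness_centrality(graph)
```

In this toy graph, `taxon_C` and `metabolite_M` have the same degree, but `metabolite_M` lies on more shortest paths because every route between the taxa and the host genes passes through it. This is the sense in which betweenness can flag a bridging node that degree alone does not separate.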
Challenges in microbiome multi-omics integration
Current approaches to microbiome profiling, primarily through 16S rRNA amplicon sequencing or shotgun metagenomics, have revolutionized our ability to characterize microbial communities. While 16S sequencing provides a cost-effective method for taxonomic classification, particularly valuable in low-biomass environments, shotgun metagenomics provides deeper functional and strain-level insights. However, integrating these microbial data with host multi-omics layers (genome, transcriptome, epigenome, and metabolome) presents substantial computational and biological challenges that complicate the extraction of meaningful biological inferences.
Compositionality
A fundamental analytical challenge in microbiome studies arises from the compositional nature of sequencing data. Because sequencing depth (library size) varies among samples and does not reflect the total microbial load, high-throughput sequencing yields relative rather than absolute measures of microbial abundance81. As a result, an increase in the relative abundance of one microbe necessarily lowers the relative abundances of all other microbes, even if their absolute abundances remain unchanged. This phenomenon, known as compositionality, can distort common analytical tasks, such as diversity estimation, correlation analysis, and differential abundance testing82. Standard statistical techniques assuming absolute measurements may therefore yield misleading results. For example, a significant negative correlation between two bacteria may reflect this mathematical artifact rather than a true biological interaction. To address this issue, data normalization methods, such as rarefaction, scaling, and additive or centered log-ratio transformation, have been developed to transform compositional data into a form that can be readily analyzed with non-compositional analysis techniques, such as linear models83.
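A small numerical example may help to show both the artifact and the effect of the centered log-ratio (CLR) transformation. The sketch below uses made-up counts for three taxa; the pseudocount of 0.5 added before taking logarithms is an assumption of this example (one common way to handle zeros), not a recommendation from the cited work.

```python
import math


def closure(counts):
    """Convert counts to relative abundances that sum to 1."""
    total = sum(counts)
    return [c / total for c in counts]


def clr(counts, pseudocount=0.5):
    """Centered log-ratio: log of each part minus the mean log of all parts."""
    logs = [math.log(c + pseudocount) for c in counts]
    mean_log = sum(logs) / len(logs)
    return [value - mean_log for value in logs]


# Made-up absolute counts for taxa A, B, and C in two samples.
# Only taxon A changes; B and C are identical in both samples.
baseline = [100, 200, 400]
bloom = [1000, 200, 400]

rel_baseline = closure(baseline)
rel_bloom = closure(bloom)

clr_baseline = clr(baseline)
clr_bloom = clr(bloom)

# The difference between two CLR values is the log-ratio of the two taxa,
# which does not depend on the other parts of the composition.
log_ratio_bc_baseline = clr_baseline[1] - clr_baseline[2]
log_ratio_bc_bloom = clr_bloom[1] - clr_bloom[2]
```

The relative abundance of taxon B falls when taxon A blooms, although its count is unchanged. The B-to-C log-ratio, by contrast, is identical in both samples. Individual CLR values still shift with the geometric mean of the sample, so ratios between taxa, rather than single CLR values, are the quantities that are invariant.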
Sparsity and zero-inflation
Microbiome datasets typically exhibit high sparsity, meaning that many microbial taxa are absent from most samples, leading to an abundance of zero values in the data matrix. These zeros can arise from two distinct sources: true biological absence, in which a microbe is genuinely not present (structural zeros); or technical limitations, in which a microbe is present but not detected due to insufficient sequencing depth (sampling zeros). The prevalence of zeros violates the assumptions of many conventional statistical models, biases estimates of diversity and correlation, and complicates the application of machine learning algorithms. Distinguishing between structural and sampling zeros is essential for accurate biological interpretation. Statistical approaches, such as zero-inflated negative binomial and hurdle models, are often used to account for this characteristic, although improved study design, increased sequencing depth, and technical replicates remain the most effective strategies to reduce technical zeros and improve inference.
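The logic of a zero-inflated negative binomial (ZINB) model can be written out directly. In the sketch below, a count is a structural zero with probability `pi_zero` and otherwise follows a negative binomial distribution, which can itself produce sampling zeros. The parameter values are hypothetical and chosen only for illustration; in a real analysis they would be estimated from the data.

```python
import math


def nb_log_pmf(k, mean, size):
    """Log-probability of count k under a negative binomial with given mean and size."""
    return (
        math.lgamma(k + size) - math.lgamma(size) - math.lgamma(k + 1)
        + size * math.log(size / (size + mean))
        + k * math.log(mean / (size + mean))
    )


def zinb_pmf(k, mean, size, pi_zero):
    """Mixture of a point mass at zero (weight pi_zero) and a negative binomial."""
    count_part = (1.0 - pi_zero) * math.exp(nb_log_pmf(k, mean, size))
    return pi_zero + count_part if k == 0 else count_part


def prob_structural_zero(mean, size, pi_zero):
    """Probability that an observed zero is structural rather than a sampling zero."""
    return pi_zero / zinb_pmf(0, mean, size, pi_zero)


# Hypothetical parameters for one taxon.
mean, size, pi_zero = 5.0, 2.0, 0.3

p_zero_nb = math.exp(nb_log_pmf(0, mean, size))    # zeros expected from sampling alone
p_zero_zinb = zinb_pmf(0, mean, size, pi_zero)     # zeros expected under the mixture
posterior_structural = prob_structural_zero(mean, size, pi_zero)
total_probability = sum(zinb_pmf(k, mean, size, pi_zero) for k in range(500))
```

A hurdle model differs in that all zeros come from a single binary component and the count component is truncated at zero, so it does not attempt to split zeros into the two sources.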
High dimensionality and biological variability
Microbiome-host studies often face the “small n, large p” problem, in which the number of features (e.g., thousands of microbial taxa, genes, or metabolites) vastly exceeds the number of samples (e.g., dozens or hundreds of patients). In this high-dimensional space, models are highly prone to overfitting, in which the model memorizes noise and patterns specific to the training dataset but fails to generalize to new, independent data. This situation can lead to the identification of numerous false-positive associations. Utilizing regularization techniques (e.g., LASSO, ridge regression), performing rigorous cross-validation, and seeking validation in independent cohorts are necessary steps to mitigate this risk. In addition, the gut microbiota exhibits substantial interindividual variation influenced by diet, genetics, age, medication, and other factors, which can confound true associations. Including these variables as covariates in models and expanding cohort sizes are necessary to distinguish genuine microbiome-host interactions from confounding effects.
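The overfitting risk described above can be demonstrated with pure noise. In the following sketch, 40 simulated samples have 500 random features and random class labels, so there is nothing to learn. A one-nearest-neighbor classifier is used only because it memorizes its training data; the data, the classifier choice, and the helper names are assumptions of this illustration.

```python
import random


def k_fold_indices(n_samples, k, seed=0):
    """Shuffle sample indices and deal them into k disjoint folds."""
    order = list(range(n_samples))
    random.Random(seed).shuffle(order)
    return [order[i::k] for i in range(k)]


def nearest_neighbor_predict(train_X, train_y, query):
    """Label of the training sample closest to query (squared Euclidean distance)."""
    best = min(
        range(len(train_X)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(train_X[i], query)),
    )
    return train_y[best]


def accuracy(train_idx, test_idx, X, y):
    """Fraction of test samples classified correctly by a model fit on train_idx."""
    train_X = [X[i] for i in train_idx]
    train_y = [y[i] for i in train_idx]
    correct = sum(
        nearest_neighbor_predict(train_X, train_y, X[i]) == y[i] for i in test_idx
    )
    return correct / len(test_idx)


# Simulated "small n, large p" data with no real signal.
rng = random.Random(42)
n_samples, n_features = 40, 500
X = [[rng.gauss(0, 1) for _ in range(n_features)] for _ in range(n_samples)]
y = [rng.randint(0, 1) for _ in range(n_samples)]

all_idx = list(range(n_samples))
train_accuracy = accuracy(all_idx, all_idx, X, y)  # evaluated on the training data

folds = k_fold_indices(n_samples, k=5)
fold_scores = []
for held_out in folds:
    held = set(held_out)
    train_idx = [i for i in all_idx if i not in held]
    fold_scores.append(accuracy(train_idx, held_out, X, y))
cv_accuracy = sum(fold_scores) / len(fold_scores)
```

The accuracy on the training data is perfect because each sample is its own nearest neighbor, whereas the cross-validated accuracy reflects the absence of signal. The same logic applies to feature selection: any selection step has to be repeated inside each training fold, otherwise information from the held-out samples leaks into the model.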
Emerging technologies for studying host-microbiota interplay
Advances in sequencing and imaging technologies have expanded our ability to characterize host-microbiota interplay at unprecedented resolution. Long-read sequencing platforms, including PacBio single-molecule real-time (SMRT) sequencing and Oxford Nanopore Technologies (ONT), generate long reads spanning thousands to millions of base pairs, overcoming key limitations of short-read sequencing. While short-read approaches remain valuable for affordability, the limited read length often compromises assembly accuracy and taxonomic resolution84. In contrast, long-read technologies enable more contiguous genome assemblies, improved detection of structural variants, and direct sequencing of epigenetic modifications, including bacterial DNA methylation patterns. These advances significantly enhance metagenome-assembled genomes (MAGs) in microbiome research by resolving challenging genomic regions, such as repetitive elements, mobile genetic elements, and loci with extreme GC content85,86. Recent applications have successfully reconstructed complete bacterial genomes and plasmids from complex microbial communities, providing unprecedented insights into strain-level variation, horizontal gene transfer dynamics, and functional adaptation within host-associated microbiota87,88.
Beyond bulk taxonomic characterization, emerging single-cell technologies now enable investigation of microbial community function at the resolution of individual cells. The field has witnessed transformative advances with the adaptation of single-cell gene expression profiling techniques to microbial systems, revealing how cellular heterogeneity underpins population-level bacterial dynamics89,90. A prime example is bacterial-multiplexed error-robust fluorescence in situ hybridization (MERFISH), which integrates expansion microscopy with high-throughput RNA imaging to achieve spatial transcriptomic profiling of individual bacterial cells within complex communities. This methodology overcomes the fundamental challenges of bacterial cell size and high RNA density, enabling researchers to map gene expression dynamics with subcellular resolution under various environmental conditions91. Such technologies provide unprecedented opportunities to investigate microbial subpopulations, host-microbe spatial interactions, and transcriptional responses at the single-cell level within native biological contexts, opening new frontiers in understanding the functional basis of host-microbiota interactions in health and disease.
Despite the transformative potential, these emerging technologies present practical limitations that must be considered. Long-read sequencing remains relatively costly and computationally intensive, requiring specialized bioinformatics pipelines for data processing and analysis. Similarly, spatial host-microbiome methods, such as bacterial-MERFISH, involve complex experimental workflows and currently lack standardized analytical frameworks, which can hinder reproducibility and cross-study comparisons. Addressing these challenges will be essential to fully realize the potential of high-resolution technologies in advancing microbiome research and clinical translation.
Conclusions and future directions
In conclusion, this systematic review has delineated the complex and multifaceted interplay between the microbiota and the host in CRC across genomic, transcriptomic, epigenomic, and metabolomic layers. It has been established using integrative multi-omics approaches that the microbiome functions not as a passive bystander but as an active contributor to carcinogenesis, influencing genomic instability, reshaping transcriptional and epigenetic landscapes, and creating a metabolite-rich TME that promotes cancer progression. The application of advanced computational models to multi-omics data has been instrumental in identifying key microbiota-host interactions that may be mechanistically linked to CRC pathophysiology. Despite significant progress, this field continues to face analytical challenges related to data dimensionality, compositionality, and sparsity. However, emerging technologies in long-read sequencing, single-cell microbial omics, spatial transcriptomics, and high-resolution microbial imaging are poised to greatly refine our understanding of microbiota-driven processes in CRC.
Future research should prioritize bridging the gap between correlation and causation through establishing sophisticated experimental models that are capable of recapitulating human pathophysiology while permitting precise manipulation of microbial and host variables. Germ-free and gnotobiotic mouse systems colonized with defined microbial consortia provide a robust platform for causal inference, while human-derived organoids and organ-on-chip systems co-cultured with specific bacterial strains offer unprecedented opportunities to dissect host-microbe interactions within a patient-relevant context92. Specifically, microfluidic-based gut-on-chip systems provide fine control over chemical gradients, shear stress, oxygen tension, and microbial spatial organization, enabling detailed interrogation of microbial behaviors, biofilm formation, and host responses under near-physiologic conditions93. Integrating these experimental systems with longitudinal and spatially resolved multi-omics profiling will be essential to elucidate the temporal dynamics of microbiota influence on tumorigenesis and therapy response.
Concurrently, rapid advances in computational methodologies, especially artificial intelligence (AI), machine learning (ML), and deep learning (DL), are reshaping microbiome research94. Traditional association-based pipelines are increasingly limited by the complexity, scale, and multimodality of microbiome datasets. In contrast, AI-powered frameworks offer transformative capabilities in modeling non-linear relationships, integrating cross-kingdom omics data, predicting host-microbial interactions, and uncovering latent biological patterns from high-dimensional datasets95. For example, protein large language models, such as ProteoGPT, enable high-throughput screening of antimicrobial peptides with high predicted efficacy and minimal toxicity96, while explainable AI approaches, such as Shapley Additive Explanations, have revealed reproducible microbial signatures of CRC, including Fusobacterium, Peptostreptococcus, and Parvimonas, and improved subtype stratification97,98. These computational advances are not only enhancing mechanistic insight but also paving the way for clinically oriented prediction frameworks capable of stratifying patients, forecasting treatment outcomes, and identifying microbial biomarkers with greater accuracy and robustness99.
Efforts are also underway to translate microbiome insights into preventive, diagnostic, and therapeutic innovations in CRC, including probiotic administration and elimination of deleterious microorganisms, such as F. nucleatum, E. coli, or B. fragilis, from the TME100. Advances in precision microbiome medicine require the development of comprehensive computational frameworks that integrate multi-omics data into predictive models of individual host-microbiome interactomes101. Such “digital twin” approaches could revolutionize personalized cancer care by forecasting responses to dietary interventions, prebiotics, and next-generation live biotherapeutic products102. Engineering bacteria capable of modulating specific cancer-relevant pathways, such as immune activation, barrier function restoration, or metabolic pathway manipulation, offers a promising avenue for microbiome-based therapeutics103. However, realizing this potential demands large-scale, longitudinal cohort studies with standardized methodologies for sample processing, sequencing, and computational analysis to ensure robust and reproducible biomarker discovery104. Beyond these technical challenges, clinical translation faces additional hurdles, including substantial inter-individual variability in microbiome composition and function, the need for standardized protocols for microbiome modulation therapies, and the development of clear regulatory pathways for live biotherapeutic products. Addressing these barriers will be essential for translating microbiome research into clinically actionable interventions for CRC prevention and treatment.
In summary, the future of microbiome research in CRC lies at the convergence of advanced experimental models, AI-driven computational integration, and large-scale clinical data integration. By addressing current challenges in mechanistic validation, methodologic standardization, and clinical translation, the potential of microbiome research can be harnessed to develop novel strategies for CRC prevention, diagnosis, and treatment, ultimately improving patient outcomes and reshaping our understanding of cancer biology.
Conflict of interest statement
No potential conflicts of interest are disclosed.
Author contributions
Conceived and designed the analysis: Jun Yu.
Wrote the paper: Yinghong Lu.
- Received November 28, 2025.
- Accepted January 12, 2026.
- Copyright: © 2026, The Authors
This work is licensed under the Creative Commons Attribution-NonCommercial 4.0 International License.