- Research
- Open access
- Published:
Relating ecological diversity to genetic discontinuity across bacterial species
Genome Biology volume 26, Article number: 8 (2025)
Abstract
Background
Genetic discontinuity represents abrupt breaks in genomic identity among species. Advances in genome sequencing have enhanced our ability to track and characterize genetic discontinuity in bacterial populations. However, exploring the degree to which bacterial diversity exists as a continuum or sorted into discrete and readily defined species remains a challenge in microbial ecology. Here, we aim to quantify the genetic discontinuity (\(\delta\)) and investigate how this metric is related to ecology.
Results
We harness a dataset comprising 210,129 genomes to systematically explore genetic discontinuity patterns across several distantly related species, finding clear breakpoints which vary depending on the taxa in question. By delving into pangenome characteristics, we uncover a significant association between pangenome saturation and genetic discontinuity. Closed pangenomes are associated with more pronounced breaks, exemplified by Mycobacterium tuberculosis. Additionally, through a machine learning approach, we detect key features such as gene conservation patterns and functional annotations that significantly impact genetic discontinuity prediction.
Conclusions
Our study clarifies bacterial genetic patterns and their ecological impacts, enhancing the delineation of species boundaries and deepening our understanding of microbial diversity.
Background
Bacteria exhibit remarkable genetic makeup and ecological versatility, thriving in diverse niches worldwide. Plummeting sequencing costs have led to a wealth of genomic data, enabling genetic diversity and evolutionary relationships to be explored across many bacterial species. However, an essential inquiry in microbial ecology pertains to whether bacterial genetic diversity exists as a continuum or as distinct species groups [1,2,3].
The definition of bacterial species faces challenges because bacteria can exchange genetic material through horizontal gene transfer (HGT) [4], potentially blurring the species boundaries. This complexity has led to divergent views on bacterial species existence: while it was once thought that excessive recombination would preclude species formation [5], a contemporary perspective suggests that the gene flow patterns can even delineate species [6, 7]. Besides, recent studies have revealed a clear genetic discontinuity across bacterial genomes, supporting the existence of discrete genetic clusters [8,9,10,11].
Genetic discontinuity refers to the occurrence of a significant difference in genetic makeup between populations or groups of organisms [12], indicating potential boundaries between and within species [13]. This discontinuity can arise over time due to natural selection, genetic drift, or geographic isolation. By assessing the degree of discontinuity, researchers can determine whether populations should be classified as separate species, providing a clearer understanding of evolutionary relationships and ensuring precise taxonomic classification [9, 11].
Defining bacterial species is important for reasons beyond any human desire to catalog bacterial diversity; it is vital for understanding how evolutionary forces shape genetic lineages [14, 15]. Furthermore, proper classification impacts practical applications in industry, agriculture, and medicine. For instance, Gardnerella vaginalis is known to cause bacterial vaginosis, but it can also be found in healthy vaginal microbiomes [16]. Historically, one of the explanations was that certain strains of G. vaginalis were more pathogenic than others. However, genome-based taxonomic methods have since shown that G. vaginalis actually comprises multiple distinct species rather than just different strains [17]. Therefore, the previous misclassification of G. vaginalis limited the ability of clinicians to assess when and whether the presence of Gardnerella indicated a health risk.
One way to assess species boundaries is by estimating the genetic relatedness between genomes. A robust method to classify bacterial species is based on the Average Nucleotide Identity (ANI) estimate, with organisms belonging to the same species if they possess around 95% ANI or more among themselves [4, 5]. Since estimating ANI for thousands of genomes is computationally expensive, alternative methods were developed to accommodate the growing genomic databases [11, 18].
Despite observed breaks in genetic identity distributions for various species, quantifying the magnitude of these breaks and their ecological implications remains a challenge. Here, we address the complex nature of bacterial diversity through an empirical genomic distance network approach and pangenome analysis. We aimed to quantify the extent of genetic discontinuity within and between bacterial populations to determine whether intrinsic genetic boundaries can provide a more ecologically relevant basis for species redefinition. Here, we seek not to pigeonhole bacterial species, but to examine the presence of genetic boundaries, quantify their extent, and explore their ecological implications for species classification.
Results
Dataset information
We obtained 258,603 genomes from RefSeq that were filtered following the GTDB protocol [19]. After removing low quality and fragmented genomes, we retained 210,129 genomes to explore bacterial genetic discontinuity. According to the NCBI classification, the top four most abundant species in our dataset are Escherichia coli (n = 22,853), Staphylococcus aureus (n = 12,747), Klebsiella pneumoniae (n = 10,387), and Salmonella enterica (n = 9755).
Over 44 billion comparisons were performed to construct a network from an identity matrix \(M\), in which genomes are represented as nodes. After removing edges below 95% identity in \(M\), 7122 communities were identified—a proxy for species number (see “Methods” for more details). Notably, 84.84% of the communities contained fewer than 10 genomes, consistent with prior observations in the genus Pseudomonas [20]. For instance, a previous study found that 29% of officially recognized Pseudomonas type strains appeared as isolated nodes in a similar network analysis, highlighting the substantial underestimation of diversity when relying solely on type strains [20]. By focusing on communities with over 40 genomes (3.70%), we obtained a set of 261 representative genomes, enabling meaningful downstream comparative genomic analyses.
We reannotated a total of 45,550 genomes, representing nine phyla according to GTDB classification (Fig. 1). Out of 942,094 detected genes, 95.1% were successfully assigned across 39,552 orthogroups. Moreover, species-specific orthogroups represented approximately 0.9% of the genes. Single-copy genes were used to reconstruct the phylogenetic tree. Except for Proteobacteria, the tree indicated monophyly in all other phyla based on GTDB classification, contrasting with the polyphyly observed based on NCBI classification, as noted before [19]. The GC content in these genomes ranged from 25.9% in Mesomycoplasma hyorhinis to 73.4% in Streptomyces albidoflavus (Additional file 2: Table S1). Notably, Clostridioides difficile exhibited the greatest number of CRISPR arrays (median = 8) in a community of 500 genomes, with at least one array per genome. The dataset included genomes with diverse sizes, from the smallest commensal Metamycoplasma hominis (0.7 Mb) to the larger free-living bacterium Burkholderia cepacia (8.5 Mb).
Clear genetic discontinuity revealed across bacterial species
To explore genetic discontinuity across different species, we employed a strategy in which a representative genome serves as a “bait” node within the network and the identity of all other genomes from it is calculated. This method allowed us to assess the ranked identity distribution from each representative genome (Fig. 2a), revealing clear breakpoints within the distribution. For instance, considering the representative species of Acinetobacter baumannii, the 4968th genome maintains a 97.27% identity. However, the identity of the 4969th genome drops drastically to 93.34%, exemplifying the observed genetic discontinuity or genetic break.
Genetic discontinuity properties of ten selected species. a Distribution of ranked genomic identities, revealing breakpoints around 95%. Vertical dashed lines for Chlamydia trachomatis and Mycobacterium tuberculosis indicate breaks beyond 90% from their closest genomes. The x-axis is represented in log-scale. b Genetic Rate of Change is depicted across different identity values, with the break-associated point emphasized by a red circle. This point represents the key measure of genetic discontinuity \((\delta )\) examined in this work (see “Methods” for comprehensive information). Genomes are colored based on GTDB classification
We systematically quantified the genetic discontinuity by estimating how rapidly the genomic identity decayed as we moved through the sorted identity array. We took the first derivative of the distribution, offering a measure of variability in genomic similarity that we named as Genetic Rate of Change (GRC) (Fig. 2b). The maximum value of GRC resulted in the genetic discontinuity metric \(\delta\), which characterizes the steepest change in genomic identity (see “Methods”). For instance, the genetic break observed from 97.27 to 93.34% in A. baumannii, corresponds to \(\delta=<0.9727-0.9334=0.0393\) for this species (Table 1, Fig. 2b).
Eight species were selected to showcase bacterial discontinuity, encompassing both pathogenic and non-pathogenic strains across various phyla and lifestyles (Fig. 2b). Notably, Chlamydia trachomatis and M. tuberculosis exhibited pronounced \(\delta\), indicating substantial shifts in genetic similarity. Conversely, Helicobacter pylori represented few instances where a species lacks a clear genetic discontinuity, suggesting a blurred genetic boundary possibly influenced by its evolutionary history and lifestyle.
Higher genetic discontinuity associates with allopatric lifestyle
To explore the ecological implications of bacterial discontinuity, we analyzed the pangenome of the 261 species mentioned above, which sheds light on their lifestyles and evolutionary patterns [21, 22]. The pangenome encompasses the core genome (genes present in all isolates), the accessory genome (genes in more than one but not all isolates), and isolate-specific genes. Pangenome openness, measured by the saturation coefficient (\(\alpha\)), indicates the extent to which new gene families are detected in the pangenome as more genomes are included. Higher \(\alpha\) values suggest gene pool saturation (the addition of new genomes contributes little to the detection of new genes), while lower \(\alpha\) values imply a more flexible genomic repertoire.
We used the pangenome openness to indirectly assess the species lifestyle [21] (Table 1). A low saturation coefficient (open pangenome) suggests a flexible genomic repertoire, characteristic of sympatric populations that frequently exchange genes to adapt to various environments. Conversely, a high saturation coefficient (closed pangenomes) is associated with allopatric populations adapted to specific niches with limited gene exchange due to physical isolation or genetic incompatibility [21, 22].
In our investigation, we identified a noteworthy correlation between pangenome openness and genomic fluidity (\(\phi\)) (Fig. 3a)—a measure of genomic dissimilarity at the gene level [23]. Specifically, we found a negative correlation, indicating that species with closed pangenomes exhibit lower genomic fluidity, as previously noted [24]. Furthermore, we observed a pronounced increase in genetic discontinuity as the pangenome saturation coefficient rises.
Pangenome properties of representative species. a Genomic fluidity in function of pangenome saturation with a linear regression line given by \(\phi =-0.51148\times \alpha +0.52509\). The size of the dots corresponds to different levels of genetic discontinuity, grouped into four categories. Additionally, the color of the dots indicates the coding density of each representative genome. b Pangenome openness of ten selected species, with \(\alpha\) highlighted for reference
C. trachomatis and Bacillus cereus exhibit distinct pangenome characteristics that reflect their contrasting lifestyles. C. trachomatis exhibited a closed pangenome (\(\alpha\) = 0.97), indicating a limited capacity for gene acquisition through HGT. This suggests a relatively stable genome and a more specialized lifestyle, features associated with an obligate intracellular pathogenic behavior [25]. This pattern is frequent among species with high genomic discontinuity, closed pangenomes, allopatric lifestyles, and highly conserved pangenomes (Table 1). In contrast, B. cereus displays an open pangenome (\(\alpha\) = 0.64), indicating a high propensity for gene acquisition and genomic diversity. This suggests a more versatile lifestyle, potentially enabling B. cereus to occupy various ecological niches and adapt to changing environments. The variations in pangenome openness observed in these two species highlight differences in their lifestyles and on how their pangenomes evolve in the context of genetic discontinuity.
Uncovering most influential features to predict bacterial genetic discontinuity
We aimed to evaluate the importance of different features in predicting bacterial genetic discontinuity. To address the asymmetrical nature of \(\delta\) distribution (Fig. 4a) and the impact of the number of genomes to estimate the pangenome openness, we employed a quantile regression that allowed us to assess the influence of pangenome saturation on genetic discontinuity while controlling for the number of genomes used to compute \(\alpha\). We found a significant impact across all quantiles examined (Fig. 4b). Notably, as the quantile increased, the impact of pangenome saturation in \(\delta\) became more pronounced. For instance, in the top quantile (\(\tau=0.95;\lbrack0.1040-0.291\rbrack\)), an increase of one unit in \(\alpha\) corresponded to a 0.44-unit rise in \(\delta\). These results highlight the crucial influence of pangenome saturation on the patterns of genetic discontinuity.
Genetic discontinuity modeling. a Probability distribution of genetic discontinuity (\(\delta\)) estimated from representative genomes. b Quantile regression analysis of \(\delta\) as a function of pangenome saturation (\(\alpha\)), highlighting a positive impact of \(\alpha\) on \(\delta\) across different quantiles (0.15, 0.25, 0.50, 0.75, 0.95). c SHAP values derived from random forest regression model, indicating the importance of ten features in predicting \(\delta\). Each representative genome is displayed, along with the positive or negative impact of each feature. d Linear regression between \(\delta\) and the percentage of orthogroups containing species, the feature with the highest impact on predicting \(\delta\). The equation of the regression line and the associated p value are also shown
Beyond pangenome features, we also used a rich set of features ranging from taxonomical classification to orthology assessment to model bacterial discontinuity through a machine learning approach (see “Methods”, Additional file 3: Table S2). Among the six tested methods, linear regression and random forest demonstrated superior performance in terms of root mean squared error, mean absolute error, and quantile loss over different quantiles (Additional file 1: Fig. S2). Random forest was chosen due to its ability to handle data distribution and collinearity without imposing strict assumptions. After hyperparameter tuning via k-fold cross-validation and grid search, we delved into feature importance using SHAP values to predict \(\delta\).
The most significant variable affecting the prediction of genetic discontinuity was the “percentage of orthogroups containing species” (Fig. 4c). This metric gauges the proportion of orthogroups containing at least one gene from a given species. For example, if at least one gene from species A is present in 90 orthogroups from a total of 100, the percentage of orthogroups containing species A would be 90%. Moreover, this feature negatively impacts \(\delta\) (Fig. 4d; p value < 0.001). This metric helps associate the presence and representation of a species within orthogroups with genetic discontinuity, as it indicates how frequently genes from that species are involved in shared functional contexts across different organisms.
Discussion
In this study, we systematically quantified genetic discontinuity (\(\delta\)), which reflects significant shifts or abrupt changes in genetic similarity compared to a representative genome. Our work offers insights into the ecological roles of bacterial discontinuity and its implications for species classification. We carefully considered whether rigid taxonomic boundaries capture the fluidity and evolutionary dynamics that shape species’ histories, especially considering organisms where genetic exchange, adaptation, and hybridization prevail [26, 27].
Philosophically, species classification raises the issue of Aristotelian essentialism—the idea that there are inherent qualities that define a species [28, 29]. The act of assigning species to specific categories becomes an exercise in grappling with the fundamental question of what defines a species. Is it solely genetic similarity, shared phenotypic traits, ecological niche, or something deeper that eludes our current understanding? In this work, we employed genomics and ecology approaches to assess species boundaries. We used the essentialism idea as a prior to select representative genomes and explore whether the resulting genetic variation exhibited continuity or discreteness. To be agnostic about the choice of representative genomes, we employed a network approach where we could also retrieve genomes from understudied populations to represent a given community, avoiding the limitation of exploring species based only on well-studied type strains [20].
Bacterial genetic discontinuity has already been observed in studies comprising thousands of genomes by using different approaches [8, 9, 20], including metagenomic data [1]. The remaining inquiry was how to measure genetic discontinuity and examine its ecological significance, while accounting for potential external factors such as sampling biases. To address this, we devised a novel metric (\(\delta\)) by examining the maximum value in the first derivative distribution of genome identity derived from a representative genome. We observed \(\delta\) values spanning from 0.005 (Acinetobacter pittii) to 0.290 (Coxiella burnetii)—species with remarkably distinct lifestyles. For the extreme case, C. burnetii, the idea of \(\delta =0.29\) means that the most similar genome in a different community within a network of over 200 thousand genomes shares only 29% genetic identity with the representative genome of the C. burnetii community, which comprises 65 genomes. This identity value is way beyond what we expect to distinguish species and approaches to those used to define higher taxonomical ranks such as genera and families [30, 31]. C. burnetii is an obligate intracellular pathogen responsible for causing the Q fever disease in humans [32]. Its allopatric lifestyle may explain its extreme genetic discontinuity.
After identifying breaks that varied according to species lifestyles (Table 1), the next challenge was to assign an ecological meaning to them. It has been shown that bacterial lifestyle shapes pangenomes [33]. Therefore, we reversely used pangenome metrics to retrieve the likely lifestyle and evolution of a set of species [21, 22, 33]. In our study, we established a clear link between bacterial genetic discontinuity and pangenome openness, ensuring accurate control for the number of genomes utilized in estimating the saturation coefficient. Specifically, organisms such as C. burnetii, M. tuberculosis, and C. trachomatis demonstrate how significant genetic breaks correlate with specific lifestyles that influence the dynamics of HGT. These species exhibit closed pangenomes, potentially due to their ecological niches providing little advantage or necessity for gene exchange. This result highlights the impact of lifestyle on genetic architecture, suggesting that certain ecological conditions can lead to pronounced genetic discontinuities.
Conversely, species with ubiquitous or environmental lifestyles such as E. coli and B cereus presented smaller breaks, but still well-defined boundaries. Those bacteria are known for their ability to thrive in various environments, including soil, water, and plant surfaces. Their diverse ecological niches might lead to more continuous genetic variation, exhibiting less pronounced breaks. In contrast, H. pylori posed an intriguing challenge with regard to genetic discontinuity. Our analysis revealed either an ambiguous or non-existent genetic break in this species, rendering it difficult to draw a line to delineate its boundaries. The lack of a discernible genetic break in H. pylori may be connected to its extremely high rate of homologous recombination [34]. This emphasizes the need for a nuanced approach to understand species boundaries and genetic cohesion, suggesting unique evolutionary dynamics that require further investigation.
Beyond the use of pangenome features to predict bacterial discontinuity, assessing orthogroups can be vital for understanding genetic discontinuity. By using both the SHAP values and ExtraTrees regressor to retrieve feature importance, the “percentage of orthogroups containing species,” assigned by Orthofinder, was the most important variable. This metric quantifies the proportion of orthogroups that include at least one gene from a given species. It measures genetic connectivity and functional commonality among bacterial species within the same ecological niche, highlighting the extent of shared evolutionary history. A higher percentage suggests ecological overlap. Conversely, a lower percentage implies species specialization, indicating distinct ecological roles. For example, species with closed pangenomes such as C. trachomatis had genes present in only 2% of the total number of orthogroups.
In conclusion, our study highlights the significance of bacterial genetic discontinuity in understanding microbial diversity and evolution. Ultimately, species boundaries represent discontinuities between distinct populations. We anticipate that early stages of speciation will manifest as such discontinuities. However, this paper focuses on the characteristics of established species rather than the processes of their formation. We have shown that closed pangenomes and pronounced genetic breaks correspond to specific bacterial lifestyles, highlighting significant speciation events or ecological adaptations. This study enhances our understanding of bacterial diversity by highlighting the importance of quantifying genetic discontinuity to reassess traditional species classifications in microbial ecology.
Methods
Data collection and network analysis
We downloaded a dataset comprising 210,129 genomes available on RefSeq [35] as of September 2022. To ensure data quality, we retained only genomes with less than 500 scaffolds and utilized the GTDB quality control [19] to exclude genomes displaying low quality or contamination. Filtered genomes were used to perform tens of billions of comparisons with mash v2.2.2 [18] to construct a weighted network (\(M\)) with igraph v1.5.1 [36]. \(M\) comprises genomic relationships among the set of genomes (\(g\)), with edges (\(e\)) corresponding to the genomic identity between pairs of genomes, quantified as the inverse of Mash distance (1 − Mash).
The weighted network was used to select representative genomes to infer the genetic identity patterns. To select representative genomes, we subsetted the network \(M\) to create \(M{^\prime}\):
Therefore, \(M{^\prime}\) represents a species network containing the same set of genomes, but retaining edges above 95% identity, a threshold used to define species [4]. We employed the label propagation algorithm [37] to detect communities in \({M}{^\prime}\) that represent species. Only communities containing more than 40 genomes were used. This criterion was adopted to mitigate downstream modeling errors and enhance confidence in the biological significance of species representation.
Representative genomes of each species were chosen based on the following criteria: (i) prior designation as representative in both GTDB and RefSeq databases, (ii) representative at least in GTDB, or (iii) fewest scaffolds if the previous criteria were not met. Ties were resolved by random selection. This yielded a subset of 261 representative genomes (\(t\)) used for subsequent analysis.
Genome annotation, pangenome, and phylogenetic analysis
Genomes assigned to communities containing representative genomes were annotated using Bakta v1.5.1 [38]. To ensure consistency in annotation, all genomes were reannotated within the same framework, mitigating potential discrepancies. In cases where communities exceeded 500 genomes, we implemented a downsampling strategy. Genomes showing 99.5% or higher identity with the representative genome were removed. For communities still surpassing this threshold, the remaining genomes were randomly selected.
We employed Panaroo v1.2.10 [39] in the moderate mode to obtain the pangenome. The R packages Pagoo [40] and Micropan [41] were used to estimate the pangenome openness and genomic fluidity, respectively. We used Orthofinder v2.5.4 [42] to obtain the orthogroups from representative genome communities. An orthogroup refers to a collection of genes across multiple species that are identified as orthologs, meaning they are derived from a single ancestral gene in the last common ancestor of the species being studied. Orthogroups containing a single copy of a gene present in all species represent core single-copy core genes. All core single-copy genes were aligned with Mafft v7.505 [43], filtered with BMGE v1.12 [44], and concatenated to reconstruct the phylogenetic tree with IQ-Tree v2.1.4 [45], incorporating ModelFinder [46] to identify the best fitting model. One thousand bootstrap replicates were generated to assess the significance of internal nodes. The phylogenetic tree was visualized with ggtree [47].
Genetic discontinuity estimation
To estimate and quantify the genetic discontinuity across species, we employed an egocentric-based approach using representative genomes as “baits” within the network \(M\). For each genome (\(i\)) in the representative subset (\(t\)), \(i\) served as an egocentric node to calculate its genomic distance (\({d}_{i}(g)\)) from all other genomes \(g\) in \(M\). These distances \({d}_{i}(g)\) were sorted in descending order to retrieve genomes most similar to the representative genome \(i\). The sorted array (\({D}_{i}\)) yielded a ranked list of genomic similarities. For each index \(j\) in \({D}_{i}\), the corresponding genomic identity (\({I}_{i}(j)\)) was determined, quantifying the similarity between the representative genome \(i\) and the genome at index \(j\) in the array.
To assess the GRC, we calculated the first derivative of the genomic identity (\({I}_{i}(j)\)) with respect to \(j\). The first derivative \({\Delta }_{i}\) was calculated as the change in genomic identity between two consecutive indices \(j\) and \(j+1\) in \({D}_{i}\), divided by index difference:
This rate of change \({\Delta }_{i}\) offered insights into how rapidly the genomic identity changed as we moved through the sorted array (\({D}_{i}\)), offering a measure of variability in genomic similarity.
Finally, let \(\delta\) represent the genetic discontinuity, reflecting the idea of a break or sharp change in the genetic identity between a species and its closest relative. For each species, \(\delta\) was defined as the maximum GRC above 94%. This threshold aligns with the commonly used interval to define species (94–96% ANI) [4, 13, 48] and was adopted to exclude breaks representing higher taxonomical classifications, focusing solely on the species level. For instance, consider hypothetical genera G1 and G2: \(\delta\) captures the steepest change in genomic identity from species in G1, not those from G1 to G2 (see Fig. 1). Thus, \(\delta\) can be calculated as:
Additionally, we also incorporated GTDB species classification as a second prior to enhance accuracy in modeling \(\delta\) for downstream analyses. To assess the robustness and uncertainty of this metric, we utilized a jackknife resampling strategy. This involved conducting 100 replications, where in each iteration, 20% of the genomes were randomly excluded to reduce sampling bias and variance. This resampling method enabled us to calculate both lower and upper confidence intervals for \(\delta\). The metric demonstrated robustness, with observed estimates consistently falling within the calculated confidence intervals (Additional file 1: Fig. S2).
Machine learning modeling
The outcome prediction task was formulated as a regression problem. We tested four different machine learning (ML) models to predict the genetic discontinuity (\(\delta\)) for each species: linear regression, Lasso regression, support vector regressor, random forest regressor, and gradient booster regressor. This analysis was employed using the Scikit-learn v1.0.2 [49] and XGBoost v1.5.2 [50] python [51] libraries.
We considered 261 species and 32 features categorized into four main groups: taxonomical, orthology-related, annotation-based, and pangenome metrics. Taxonomical features encompassed GTDB classification for each species (phylum, class, order, family, and genus). Orthology-related features refer to six metrics obtained after detecting orthogroups for reference genomes (e.g., proportion of species-specific orthogroups) (Additional file 3: Table S2). Categorical features with more than two categories were represented by a set of dummy variables, with one variable for each category.
Annotation-based features were retrieved from all 45,550 reannotated genomes. Continuous variables, including %GC content and coding density, were represented by their median value to account for their asymmetrical probability distribution. For discrete variables such as the number of CRISPR arrays, we calculated their relative frequency, indicating the likelihood of a species carrying such elements. Pangenome metrics, encompassing pangenome openness, genomic fluidity, and core genome proportion (core genes to pangenome size ratio), were included in the feature dataset.
Given the relatively low number of observations, the entire dataset was employed for training the model. We utilized the extra tree regression feature selection method to reduce dimensionality, improve the estimator’s accuracy, and boost the model performance. This algorithm employs randomized feature selection and ensemble averaging to make predictions, helping identify influential features, reduce overfitting, and enhance the model’s performance [52]. Also, we adopted k-fold cross-validation to mitigate dataset size limitations and to evaluate model performance metrics (root mean squared error (RMSE), mean absolute error (MAE), and quantile loss).
To tune hyperparameters in the final ML model, we conducted a grid search with k-fold cross-validation utilizing 5 folds. We used the RMSE as the model score metric. We retrieved the importance of variables on explaining the model, by adopting the SHAP (Shapley Additive exPlanations) technique. The essence of SHAP is to measure the feature contribution of each individual to the outcome and whether the feature has a positive or negative impact on predictions [53]. We also utilized quantile regression to model the relationship between pangenome saturation (\(\alpha\)) and genetic discontinuity (\(\delta\)), focusing on various quantiles (15%, 25%, 50%, 75%, and 95%) to understand the impacts across the distribution of genetic discontinuity.
References
Caro-Quintero A, Konstantinidis KT. Bacterial species may exist, metagenomics reveal. Environ Microbiol. 2012;14:347–55.
Cohan FM. Systematics: the cohesive nature of bacterial species taxa. Curr Biol. 2019;29:R169–72.
Shapiro BJ, Friedman J, Cordero OX, Preheim SP, Timberlake SC, Szabo G, Polz MF, Alm EJ. Population genomics of early events in the ecological differentiation of bacteria. Science. 2012;336:48–51.
Bobay LM. The prokaryotic species concept and challenges. In: Tettelin H, Medini D, editors. The pangenome. Cham: Springer; 2020. p. 21-49. https://doiorg.publicaciones.saludcastillayleon.es/10.1007/978-3-030-38281-0_2.
Doolittle WF, Papke RT. Genomics and the bacterial species problem. Genome Biol. 2006;7:116.
Diop A, Torrance EL, Stott CM, Bobay LM. Gene flow and introgression are pervasive forces shaping the evolution of bacterial species. Genome Biol. 2022;23:239.
Arevalo P, VanInsberghe D, Elsherbini J, Gore J, Polz MF. A reverse ecology approach based on a biological definition of microbial populations. Cell. 2019;178(820–834):e814.
Knight DR, Imwattana K, Kullin B, Guerrero-Araya E, Paredes-Sabja D, Didelot X, Dingle KE, Eyre DW, Rodriguez C, Riley TV. Major genetic discontinuity and novel toxigenic species in Clostridioides difficile taxonomy. Elife. 2021;10:e64325.
Olm MR, Crits-Christoph A, Diamond S, Lavy A, Matheus Carnevali PB, Banfield JF. Consistent metagenome-derived metrics verify and delineate bacterial species boundaries. mSystems. 2020;5(1):e00731-19.
Passarelli-Araujo H, Jacobs SH, Franco GR, Venancio TM. Phylogenetic analysis and population structure of Pseudomonas alloputida. Genomics. 2021;113:3762–73.
Jain C, Rodriguez RL, Phillippy AM, Konstantinidis KT, Aluru S. High throughput ANI analysis of 90K prokaryotic genomes reveals clear species boundaries. Nat Commun. 2018;9:5114.
Hanage WP, Fraser C, Spratt BG. Sequences, sequence clusters and bacterial species. Philos Trans R Soc Lond B Biol Sci. 2006;361:1917–27.
Rodriguez RL, Conrad RE, Viver T, Feistel DJ, Lindner BG, Venter SN, Orellana LH, Amann R, Rossello-Mora R, Konstantinidis KT. An ANI gap within bacterial species that advances the definitions of intra-species units. mBio. 2024;15:e0269623.
Fraser C, Alm EJ, Polz MF, Spratt BG, Hanage WP. The bacterial species challenge: making sense of genetic and ecological diversity. Science. 2009;323:741–6.
Hugenholtz P, Chuvochina M, Oren A, Parks DH, Soo RM. Prokaryotic taxonomy and nomenclature in the age of big sequence data. ISME J. 2021;15:1879–92.
Hill JE, Albert AYK, Group VR. Resolution and cooccurrence patterns of Gardnerella leopoldii, G. swidsinskii, G. piotii, and G. vaginalis within the vaginal microbiome. Infect Immun. 2019;87(12):e00532-19.
Potter RF, Burnham CD, Dantas G. In silico analysis of Gardnerella genomospecies detected in the setting of bacterial vaginosis. Clin Chem. 2019;65:1375–87.
Ondov BD, Treangen TJ, Melsted P, Mallonee AB, Bergman NH, Koren S, Phillippy AM. Mash: fast genome and metagenome distance estimation using MinHash. Genome Biol. 2016;17:132.
Parks DH, Chuvochina M, Waite DW, Rinke C, Skarshewski A, Chaumeil PA, Hugenholtz P. A standardized bacterial taxonomy based on genome phylogeny substantially revises the tree of life. Nat Biotechnol. 2018;36:996–1004.
Passarelli-Araujo H, Franco GR, Venancio TM. Network analysis of ten thousand genomes shed light on Pseudomonas diversity and classification. Microbiol Res. 2022;254:126919.
Rouli L, Merhej V, Fournier PE, Raoult D. The bacterial pangenome as a new tool for analysing pathogenic bacteria. New Microbes New Infect. 2015;7:72–85.
Brockhurst MA, Harrison E, Hall JPJ, Richards T, McNally A, MacLean C. The ecology and evolution of pangenomes. Curr Biol. 2019;29:R1094–103.
Kislyuk AO, Haegeman B, Bergman NH, Weitz JS. Genomic fluidity: an integrative view of gene diversity within microbial populations. BMC Genomics. 2011;12:32.
Henaut-Jacobs S, Passarelli-Araujo H, Venancio TM. Comparative genomics and phylogenomics of Campylobacter unveil potential novel species and provide insights into niche segregation. Mol Phylogenet Evol. 2023;184:107786.
Stelzner K, Vollmuth N, Rudel T. Intracellular lifestyle of Chlamydia trachomatis and host-pathogen interactions. Nat Rev Microbiol. 2023;21:448–62.
Arnold BJ, Huang IT, Hanage WP. Horizontal gene transfer and adaptive evolution in bacteria. Nat Rev Microbiol. 2022;20:206–18.
Soucy SM, Huang J, Gogarten JP. Horizontal gene transfer: building the web of life. Nat Rev Genet. 2015;16:472–82.
Austin C. Aristotelian essentialism: essence in the age of evolution. Synthese. 2017;194:2539–56.
Shtulman A, Schulz L. The relation between essentialist beliefs and evolutionary reasoning. Cogn Sci. 2008;32:1049–62.
Deloger M, El Karoui M, Petit MA. A genomic distance based on MUM indicates discontinuity between most bacterial species and genera. J Bacteriol. 2009;191:91–9.
Qin QL, Xie BB, Zhang XY, Chen XL, Zhou BC, Zhou J, Oren A, Zhang YZ. A proposed genus boundary for the prokaryotes based on genomic insights. J Bacteriol. 2014;196:2210–5.
Seshadri R, Paulsen IT, Eisen JA, Read TD, Nelson KE, Nelson WC, Ward NL, Tettelin H, Davidsen TM, Beanan MJ, et al. Complete genome sequence of the Q-fever pathogen Coxiella burnetii. Proc Natl Acad Sci U S A. 2003;100:5455–60.
Dewar AE, Hao C, Belcher LJ, Ghoul M, West SA. Bacterial lifestyle shapes pangenomes. Proc Natl Acad Sci U S A. 2024;121:e2320170121.
Falush D. The remarkable genetics of Helicobacter pylori. mBio. 2022;13:e0215822.
Pruitt KD, Tatusova T, Maglott DR. NCBI reference sequences (RefSeq): a curated non-redundant sequence database of genomes, transcripts and proteins. Nucleic Acids Res. 2007;35:D61-65.
Csardi G, Nepusz T. The igraph software package for complex network research. InterJournal. 2006, Complex Systems:1695.
Raghavan UN, Albert R, Kumara S. Near linear time algorithm to detect community structures in large-scale networks. Phys Rev E Stat Nonlin Soft Matter Phys. 2007;76:036106.
Schwengers O, Jelonek L, Dieckmann MA, Beyvers S, Blom J, Goesmann A. Bakta: rapid and standardized annotation of bacterial genomes via alignment-free sequence identification. Microb Genom. 2021;7:000685.
Tonkin-Hill G, MacAlasdair N, Ruis C, Weimann A, Horesh G, Lees JA, Gladstone RA, Lo S, Beaudoin C, Floto RA, et al. Producing polished prokaryotic pangenomes with the Panaroo pipeline. Genome Biol. 2020;21:180.
Ferres I, Iraola G. An object-oriented framework for evolutionary pangenome analysis. Cell Rep Methods. 2021;1:100085.
Snipen L, Liland KH. micropan: an R-package for microbial pan-genomics. BMC Bioinformatics. 2015;16:79.
Emms DM, Kelly S. OrthoFinder: phylogenetic orthology inference for comparative genomics. Genome Biol. 2019;20:238.
Katoh K, Standley DM. MAFFT multiple sequence alignment software version 7: improvements in performance and usability. Mol Biol Evol. 2013;30:772–80.
Criscuolo A, Gribaldo S. BMGE (Block Mapping and Gathering with Entropy): a new software for selection of phylogenetic informative regions from multiple sequence alignments. BMC Evol Biol. 2010;10:210.
Minh BQ, Schmidt HA, Chernomor O, Schrempf D, Woodhams MD, von Haeseler A, Lanfear R. IQ-TREE 2: new models and efficient methods for phylogenetic inference in the genomic era. Mol Biol Evol. 2020;37:1530–4.
Kalyaanamoorthy S, Minh BQ, Wong TKF, von Haeseler A, Jermiin LS. ModelFinder: fast model selection for accurate phylogenetic estimates. Nat Methods. 2017;14:587–9.
Yu G. Using ggtree to visualize data on tree-like structures. Curr Protoc Bioinformatics. 2020;69: e96.
Konstantinidis KT, Tiedje JM. Genomic insights that advance the species definition for prokaryotes. Proc Natl Acad Sci U S A. 2005;102:2567–72.
Buitinck L, Louppe G, Blondel M, Pedregosa F, Mueller A, Grisel O, Niculae V, Prettenhofer P, Gramfort A, Grobler J, et al. API design for machine learning software: experiences from the scikit-learn project. ECML PKDD workshop: languages for data mining and machine learning 2013:108–122.
Chen T, Guestrin C. XGBoost: a scalable tree boosting system. Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining 2016:785–794.
Rossum V, Drake F. Python 3 reference manual. Scotts Valley: CreateSpace; 2009.
Ferri J, Pavel P, Hatef M. Comparative study of techniques for large-scale feature selection. Pattern recognition in practice, IV: multiple paradigms, comparative studies and hybrid systems 2001.
Lundberg S, Lee S. A unified approach to interpreting model predictions. 31st conference on neural information processing systems 2017.
Passarelli-Araujo H. Relating ecological diversity to genetic discontinuity across bacterial species. Zenodo 2024. https://doiorg.publicaciones.saludcastillayleon.es/10.5281/zenodo.14120246.
Acknowledgements
We thank the Fulbright Brasil Commission for supporting HP-A studies at Harvard T.H. Chan School of Public Health in 2022 and 2023.
Peer review information
Tim Sands was the primary editor of this article and managed its editorial process and peer review in collaboration with the rest of the editorial team.
Review history
The review history is available as Additional file 4.
Funding
TMV research group was funded by Fundação Carlos Chagas Filho de Amparo à Pesquisa do Estado do Rio de Janeiro (FAPERJ; grant E-26/201.117/2022), Coordenação de Aperfeiçoamento de Pessoal de Nível Superior (CAPES; Finance Code 001), and Conselho Nacional de Desenvolvimento Científico e Tecnológico (CNPq).
Author information
Authors and Affiliations
Contributions
Hemanoel Passarelli-Araujo: conceptualization, investigation, software, methodology, visualization, and writing. Thiago M. Venancio: writing—reviewing and editing and supervision. William P. Hanage: conceptualization and writing—reviewing and editing and supervision. All authors read and approved the final manuscript.
Corresponding authors
Ethics declarations
Ethics approval and consent to participate
No ethical approval was required for this study.
Competing interests
The authors declare no conflict of interest.
Additional information
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary Information
13059_2024_3443_MOESM1_ESM.docx
Additional file 1: Supplementary figures. Fig. S1. Model evaluation through k-fold cross-validation using different metrics. Fig. S2. Jackknife resampling of genetic discontinuity across representative bacterial species.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Passarelli-Araujo, H., Venancio, T.M. & Hanage, W.P. Relating ecological diversity to genetic discontinuity across bacterial species. Genome Biol 26, 8 (2025). https://doiorg.publicaciones.saludcastillayleon.es/10.1186/s13059-024-03443-z
Received:
Accepted:
Published:
DOI: https://doiorg.publicaciones.saludcastillayleon.es/10.1186/s13059-024-03443-z