- Method
- Open access
- Published:
Bayesian multi-study non-negative matrix factorization for mutational signatures
Genome Biology volume 26, Article number: 98 (2025)
Abstract
Mutational signatures are typically identified from tumor genome sequencing data using non-negative matrix factorization (NMF). However, existing NMF techniques only decompose a single dataset, limiting rigorous comparisons of signatures across conditions. We propose a Bayesian NMF method that jointly decomposes multiple datasets to identify signatures and their sharing pattern across conditions. We propose a fully unsupervised “discovery-only” model and a semi-supervised “recovery-discovery” model that simultaneously estimates known and novel signatures, and extend both to estimate covariate effects. We demonstrate our approach on extensive simulations, and apply our method to answer questions related to colorectal cancer and early-onset breast cancer.
Background
The somatic mutations found in tumor genomes are the result of many mutational processes acting in concert. These processes can arise from a number of possible sources, including inaccuracies in DNA replication, DNA modification by enzymes, defective DNA repair, and exposure to mutagenic agents, such as tobacco smoking or ultraviolet light radiation [1]. Identifying and characterizing these processes is an important step in understanding how various types of cancers arise. Moreover, the particular processes that contributed to any given individual’s cancer development could have key implications for prognosis and/or treatment.
Different processes have been shown to give rise to different patterns of mutations, which are termed mutational signatures. A mutational signature expresses the propensity of a particular process to produce various types of mutations and can be mathematically represented as a vector of rates for each type of mutation under consideration. Alexandrov et al. [1] and Nik-Zainal et al. [2] introduced the usage of non-negative matrix factorization (NMF) to estimate these signatures from sequenced tumor genomes. NMF decomposes mutation count data into a product of two lower-dimensional matrices: one represents the signatures and the other represents the exposures, i.e., the extent to which each signature contributed to each tumor’s accumulation of mutations.
Since then, a number of approaches using NMF have been developed and employed for signature estimation. Some, like the original approach of [1, 2], use optimization techniques to identify the two lower-dimensional matrices [3]. Recent developments in this area include an approach that explicitly accounts for background noise and uses regularization to enforce sparsity [4], and an approach that combines the sampling likelihood with regularization and informative priors to fine-tune model complexity [5]. Fischer et al. [6] introduced a probabilistic variant assuming the Poisson likelihood, and [7] generalized this Poisson model using an empirical Bayesian approach. Other Bayesian approaches include [8]. There are also some methods that estimate signatures using topic models, e.g., through hierarchical Dirichlet processes [9,10,11], instead of NMF.
However, there has been very limited work on extending mutational signatures analysis to multiple matrices. This includes analyzing data from multiple studies, groups, or conditions to understand what signatures are shared or specific across these categories. For example, we may be interested in jointly analyzing breast cancer datasets from different populations, or datasets spanning multiple cancer types. Although grouping structure can be incorporated into models based on the hierarchical Dirichlet process [10, 11], existing NMF-based methods only consider a single data matrix. As such, analyzing multiple groups with NMF requires running separate analyses on each dataset and then subjectively comparing the identified signatures to determine what is shared and what is unique. Alternatively, multiple datasets can be concatenated into one and decomposed, but this again requires heuristic post-processing to understand how signatures are shared. These ad hoc solutions result in a loss of efficiency and fail to propagate important uncertainties from one step to the next.
In many cases, it is also of interest to compare the signatures estimated from one or more datasets to previously known signatures, such as those in the COSMIC database [3]. This again generally requires heuristic approaches, such as considering an estimated signature the same as a previously known signature if the cosine similarity exceeds a pre-determined threshold. One extension of the hierarchical Dirichlet process allows “freezing” nodes that represent previously known signatures to effectively encode priors around these signatures [11]. However, an equivalent based on NMF for the multi-study context is needed that both incorporates prior information about signatures and updates their estimates, allowing for variability around the provided values.
Finally, we may also often be interested in understanding the relationships between signatures and tumor-level covariates in one or more datasets. For example, we may want to know whether covariates such as the patient’s sex, smoking history, or the presence of homologous repair deficiency affect the expected exposure for a certain signature. Much existing work has correlated estimated exposures with patient features, but this does not allow the covariates to inform the matrix factorization, and does not propagate uncertainty. While the Tumor Covariate Signature Model [12] does estimate the effects of covariates within a topic model, this has not been extended to the multi-study context. Recent work [13] also estimates covariate effects for mutational signatures, but is also based in the single-study context and requires signatures to be specified ahead of time, rather than simultaneously discovering signatures and covariate effects.
Here, we introduce the first comprehensive multi-study NMF framework for mutational signatures estimation to address these gaps. In particular, we extend the Poisson model of [7] to the multi-study context by explicitly modeling the shared ownership of signatures. We propose two versions of this model: a “discovery-only” model in which all signatures are estimated de novo and a “recovery-discovery” model in which signatures are estimated both de novo and from informative priors built from existing signatures. Finally, we extend both of these models to incorporate tumor-level covariates, with a non-local prior based on the spike-and-slab prior of [14] to enforce sparsity in the coefficients. Because we simultaneously estimate signatures and covariate effects, this approach offers the opportunity for covariates to improve signature estimation. We compare these models to existing approaches on an extensive set of simulations, and demonstrate the utility of our method in two applications to colorectal cancer samples and to early-onset breast cancer.
Results
A comprehensive multi-study NMF framework for mutational signatures estimation
We show a schematic in Fig. 1 to illustrate our NMF framework. In the discovery-only model (Fig. 1A), any number of counts matrices are taken as input, where each corresponds to a different study (or, equivalently, condition or group). Our method jointly decomposes these studies using a Bayesian framework to estimate three key quantities: the signatures that appear in at least one study, a signature indicator matrix that identifies which signatures are present in each study, and the study-specific exposures for each sample in every study to each present signature. By estimating these quantities simultaneously, including an explicit representation of ownership over the signatures, this approach avoids having to run NMF on each study individually and compare results in an ad hoc manner.
Schematic of our multi-study NMF framework. A Under the discovery-only model, we take the counts matrices for any number of studies as input, and learn signatures, exposures, and indicators of which signatures belong to each study. In this illustration, two studies are decomposed into a total of three signatures; the top study is found to contain the first two signatures, whereas the bottom study is found to contain the first and third signatures. This sharing pattern is shown by the signature indicator matrix. B Under the recovery-discovery model, we expand the NMF decomposition to both recover previously known signatures, encoded via informative priors, and to discover new signatures. Here, we show an example decomposition for one study, where one signature (SBS 2) is recovered and one signature is discovered. C In both models, we can incorporate covariates, such as gender and age as illustrated here, to learn relationships with the exposures
In many settings, we are also interested in specifically determining whether any previously found signatures are present in the data. Under the recovery-discovery model (Fig. 1B), we again take any number of counts matrices as input for joint estimation. However, we now expand the NMF decomposition into two terms, which we label as recovery and discovery respectively. The discovery term represents the de novo signatures, and is identical to the decomposition from the discovery-only model. The recovery term represents the previously found signatures and encodes the COSMIC v3.2 single-base substitution signatures [3] through informative priors, unlike the priors used in the discovery term. This encourages the estimation of signatures that are very similar to those in the COSMIC v3.2 database, while still allowing their values to be updated based on the provided new data. Hence, under this model, we learn signature estimates, signature indicator matrices, and exposures for both a set of discovered signatures and a set of recovered signatures.
Finally, we allow both models to take in optional, sample-level covariates and learn their associations with the exposures (Fig. 1C). In particular, we model covariates as having multiplicative effects on the exposure distributions; since we use the Gamma distribution as the exposure prior, this can be interpreted as a multiplicative effect on the mean exposure. However, we choose a prior on the coefficients, specifically a non-local formulation of the spike-and-slab prior [14], that enforces sparsity in these effects. This improves interpretability as well as biological plausibility in the model by encouraging covariates to have meaningful effects on the exposures to only a small set of signatures.
We provide further details on our models and inferential procedures in the “Methods” section.
Our approach improves signature estimation in the multi-study setting
To evaluate the performance of our approach first in the simpler setting without covariates, we designed a simulation study motivated by exploratory analyses of real data from different cancer types. In particular, we considered two main scenarios: a three-study simulation and a ten-study simulation. For each scenario, we generated 10 datasets and ran both our discovery-only and recovery-discovery methods, alongside existing methods. In particular, we compared our performance to two popular single-study approaches, signeR [7] and sigProfiler [3]; a multi-study approach, HDP [11], that is outside of the NMF framework; and a semi-supervised version of HDP using frozen nodes to encode known signatures, which we refer to as HDP-frozen. We ran the single-study methods in two ways: carrying out separate runs on each study, or carrying out a single run on all studies concatenated together. As a comparison, we also ran our discovery-only approach in “single-study mode” on the concatenated studies, though this would not typically be recommended. Further details on simulation set-up and running existing methods are in the “Methods” section.
We show results for both simulation scenarios in Fig. 2 and Additional file 1: Fig. S1 and examine three metrics. Namely, we report sensitivity (the proportion of ground-truth signatures for which sufficiently similar signatures were estimated in the correct study), precision (the proportion of estimated signatures that were sufficiently similar to ground-truth signatures in the correct study), and cosine similarity of estimated signatures with ground-truth signatures.
Simulation results in the setting without covariates. A Sensitivity and precision for the three-study simulation, comparing our discovery-only and recovery-discovery approaches to existing methods. Points represent median sensitivity and precision for each approach across 10 simulated datasets; horizontal and vertical error bars indicate the 10th and 90th quantile values. Here, signeR and SigProfiler refer to running those methods on each study individually, while signeR concat and SigProfiler concat refer to running those methods on all studies concatenated. B Cosine similarities in the three-study simulation for a subset of signatures that are estimated under all approaches. C Sensitivity and precision for the ten-study simulation. D Cosine similarities in the ten-study simulation for a subset of signatures that are estimated under all approaches
In both the three-study and ten-study simulations, our discovery-only model outperformed both unsupervised single-study methods. In particular, our approach had better sensitivity with comparable precision than the single-study methods run on each study individually, and better precision with comparable or better sensitivity than the single-study methods run on all studies concatenated. This demonstrates the improvements of a multi-study approach in borrowing strength across datasets. When our discovery-only model was run in “single-study mode” on all studies concatenated, there was no longer an advantage in performance, highlighting that the improvements specifically come from jointly modeling multiple studies rather than simply pooling all samples together. We note that evaluating performance in the concatenated case can be challenging due to ambiguities in whether each signature should be considered as present in each study (details in the “Methods” section); however, these ambiguities further underscore the challenges of running concatenated analyses and the improved interpretability of a multi-study approach.
When compared to the other fully unsupervised multi-study approach, HDP, the discovery-only model had overall similar performance, representing a different trade-off between sensitivity and precision. Because this is a benchmark in the setting without covariates, the similar performance between the two is to be expected. Our recovery-discovery model had even better results than the discovery-only model, most notably in terms of sensitivity. Since the ground-truth signatures were taken from the COSMIC database, this model’s informative priors made it easier to capture weak signals such as SBS17a that were otherwise challenging to learn under any of the unsupervised approaches. While the results were very similar to HDP-frozen in the ten-study simulation, under the three-study simulation, the recovery-discovery model showed comparable sensitivity but better precision.
When examining a set of signatures that were consistently found across all methods under consideration, our recovery-discovery model produced estimates with the highest cosine similarity to the ground-truth signatures across the board. This performance was closely followed by HDP-frozen. Among the fully unsupervised methods, performance could be variable, but our discovery-only model had overall similar results to HDP. The single-study methods notably yielded higher cosine similarities when run on all studies concatenated than when run on each study individually. Thus, borrowing strength across samples improves the accuracy of the signature estimates themselves in both the single-study and the multi-study contexts.
Finally, we benchmarked the computational timing of our approaches against the other methods (Additional file 1: Table S1). In both the 3-study and the 10-study scenarios, our recovery-discovery model takes about twice as long to run as the discovery-only model, which can be attributed to the larger number of signatures under consideration. Nevertheless, our model runtimes are comparable to those of the single-study methods. In both scenarios, HDP and especially HDP-frozen have by far the longest runtimes. However, it is important to note that the runtimes of this method are highly dependent on the user-specified number of iterations; we specified a large number of iterations to avoid convergence issues observed at lower numbers of iterations (“Methods” section), but this may reflect conservatively long runtimes as a result.
We also conducted a simulation study to further characterize the properties of the informative priors used in the recovery-discovery model and to better understand their performance. In particular, for each of the 78 signatures in the COSMIC v3.2 database, we generated five single-study datasets where all signal comes from a perturbed version of that reference signature. For example, this could arise in a setting with batch effects. We then ran the recovery-discovery model on each of these single-study datasets.
In the majority of runs, signatures were found in both the discovery and the recovery components, sometimes even multiple of each. Hence, when the ground-truth generating signature is a perturbation of a reference signature, a combination of signatures is often estimated to explain the signal. When the generating signature represented only a minor perturbation from the reference signature, the mean exposure to the estimated reference signature in the recovery component (hereafter referred to as “R_ref”) was larger than to any estimated in the discovery component, and the discovery signature typically had low cosine similarity to the reference signature (Fig. 3). This suggests that in such cases, the most signal is still found on R_ref, with some extra additive signal in the discovery component to explain the perturbation. As the generating signatures became more dissimilar to the reference signatures, this pattern began to reverse, with the discovery signatures taking on greater exposures and, to an extent, more similarity with the reference signatures. This suggests that when the perturbation is large enough, our approach will estimate a signature in the discovery component that more closely resembles the generating signature, with some signal still remaining on R_ref. In some cases, R_ref was not found at all, indicating that there was enough deviation from the reference signature to escape shrinkage from the informative prior entirely.
Simulation results from running the recovery-discovery model on single-study data generated from perturbed versions of the reference signatures. (Left) Log-ratio of maximum mean exposure of any signature in the discovery component to the exposure of the reference signature in the recovery component, versus the cosine similarity of the generating signature to the reference signature. For runs where the reference signature was not found in the recovery component, the log-ratio is set to 2 for visualization. The color of each point indicates the maximum cosine similarity between the discovery signatures and the reference signature. (Right) Subsets of the left plot to simulation runs for three example reference signatures
Figure 3 also shows the results for three signatures, as specific examples. While the relative exposure to the discovery signature compared to R_ref tends to increase with the extent of perturbation, there is some variability both within and across signatures in this relationship. This suggests that some degree of noise may influence the outcome. Collectively, these results help characterize when and how the recovery-discovery model recovers known signatures in ambiguous scenarios, and shed insight into what estimation might reveal in real data settings where, hypothetically, a cancer- or tissue-specific version (i.e., a perturbation) of a signature may be present.
Our approach accurately detects and estimates covariate effects
In the setting with covariates, we conducted a simulation study to assess how well our model identifies which covariates have non-zero effects on which signatures, as well as how accurately the corresponding coefficients are estimated. This simulation study was motivated by the colorectal cancer application, thus containing signatures and covariate effects likely to be present in each. Further details are in the “Methods” section.
First, we describe estimation of the signatures and sharing pattern. In the discovery-only model, out of ten runs of simulations using the same data-generating set-up, the ground-truth set of signatures with exactly the correct sharing pattern across the three studies was found in six of the runs. Three of the other runs found a sparser solution, in which the signature SBS10b was omitted but the other signatures were still estimated with the correct sharing pattern, and the fourth run found the correct signatures and sharing pattern plus one extra signature. We ran the recovery-discovery model on the same data, and found the correct solution in all runs. The greater sensitivity to SBS10b in particular, which was difficult to detect, can be attributed to the informative priors of the recovery-discovery model.
Next, we examine identification and estimation of covariate effects. In this simulation, non-null effects from the covariates (gender and age) were only present in one of the studies. The study with non-null effects only has 16 samples, which offers an opportunity to demonstrate our approach’s utility in the small-sample setting. Under both the discovery-only and recovery-discovery models, covariate effects in all signatures for the other two studies had posterior inclusion probabilities (PIPs) under 0.1%, correctly indicating with high probability that there are no effects in those studies. Results for both the discovery-only and the recovery-discovery model in the study that does have non-null effects are shown in Fig. 4. Both models tend to find PIPs close to 1 for signatures with ground-truth non-null effects, and PIPs close to 0 for ground-truth null effects. This again shows that our sparsity-encouraging prior enables our approach to distinguish non-null and null signal effectively. We also show that both models accurately estimate the coefficients. It should be noted that the estimation of the gender effect for SBS10b tended to be more variable both in the posterior inclusion probability and in the coefficient. This can be attributed to the fact that the exposures to this signature were on average smaller than to SBS10a, which makes estimation more challenging especially in this small-sample setting. However, this variability was lower under the recovery-discovery model, suggesting that the semi-supervised inference of the signatures also lent more stability to estimating the covariate effects.
Results from a simulation with covariates, using 10 simulated datasets. A Under the discovery-only model, posterior inclusion probabilities for the two covariates (gender and age) in a study with ground-truth non-null effects in two signatures and a ground-truth null effect in the other signature. B Under the discovery-only model, estimated associations with gender and age for the two signatures with non-null effects. The true associations are indicated by the dark blue and dark red lines, and the lighter opacity lines indicate estimated associations from each simulation run. C Under the recovery-discovery model, posterior inclusion probabilities. D Under the recovery-discovery model, estimated associations
Our approach reveals shared and distinct mechanisms in colorectal cancer
We demonstrate our approach on real data by applying it to important questions in colorectal cancer (CRC). Many different mutational signatures have been previously identified in CRC samples, from those associated with aging to those associated with the failure of endogenous molecular processes [15]. In particular, two distinct classes of CRC samples, characterized by their high mutational burden as “hypermutated” and “ultra-hypermutated” respectively, have been found to contain specific mutational processes. Hypermutated samples are associated with microsatellite instability (MSI), and ultra-hypermutated samples are associated with POLE (polymerase epsilon) mutations, which affect proofreading in DNA replication [15]. Despite the characterization of these processes, much remains unknown about their relationship with epidemiological factors. For example, while POLE mutations have been associated with younger age and male sex [16], limited work has examined relationships between covariates such as these and the repertoire of mutational signatures in CRC.
To address questions like these, we applied our recovery-discovery model with covariates to 406 CRC samples from the TCGA Research Network. Specifically, our goals were to precisely identify which signatures are shared and distinct across the mutational burden classes, and to understand what relationships, if any, exist with gender and age. Our approach offers a unique opportunity to investigate these questions because of our explicit representation of the signatures sharing pattern across classes, as well as our ability to simultaneously estimate covariate effects in a way that can influence signature estimation too. We divided the samples into three “studies” or groups based on their tumor mutational burden: non-hypermutated (328 samples), hypermutated (69 samples), and ultra-hypermutated (9 samples), by fitting Gaussian mixture models following guidance from [15]. As covariates, we included gender coded as a binary variable and age coded as a centered and scaled continuous variable.
Results from our model are shown in Fig. 5A. Our model found a total of 34 signatures, four in the discovery component and 30 in the recovery component. Notably, many of the signatures only had high exposures for a relatively small subset of samples, suggesting that these reflect rarer mutational processes. When examining the signatures with a greater prevalence of high exposures, we find three main categories of signatures: clock-like signatures, MSI, and POLE mutations, all of which are expected in CRC based on previous work [15].
A (Top) Signature indicator matrix estimated using the recovery-discovery model with covariates in the colorectal cancer application. Signatures in the discovery component are labeled beginning with “D.” (Bottom) As above, but showing the proportion of samples in each study with more than 5% of total exposures attributed to each signature. B Posterior inclusion probabilities for each covariate and signature in the ultra-hypermutated group
Of the clock-like signatures, SBS1 notably has strong contributions to both the non-hypermutated and the hypermutated samples, but is not present in the ultra-hypermutated samples. SBS40, on the other hand, is only present in the non-hypermutated samples. We also found a signature in the discovery component, labeled as D2, that is unique to the non-hypermutated samples and has > 0.90 cosine similarity to SBS1. This could possibly represent another clock-like signature or a deviation from SBS1 specific to these samples. All together, these results suggest that clock-like signatures only play a major role in the non-hypermutated and hypermutated samples, with some processes unique to the non-hypermutated. The MSI signatures, by contrast, are primarily found in both the hypermutated and the ultra-hypermutated samples, which is consistent with previous reports implicating MSI as a feature of hypermutation [15]. However, our result adds greater nuance to this by finding both shared and unique mechanisms of MSI between the two groups. In particular, SBS6, SBS14, and SBS21 are all present in both groups, though SBS6 and SBS21 have particularly prevalent contributions in the hypermutated, whereas SBS14 is particularly prevalent in the ultra-hypermutated. On the other hand, SBS20 and SBS26 are only found in the hypermutated; notably, the presence of SBS20 only in hypermutated but not ultra-hypermutated samples was reported in a prior study [15]. While SBS15 in the recovery component is primarily prevalent in the ultra-hypermutated, both the hypermutated and ultra-hypermutated have unique signatures in the discovery component (labeled as D3 and D1, respectively) with high cosine similarity to SBS15. This suggests a potentially complicated relationship between SBS15 and these two groups of samples, with similar but not identical processes at play. Overall, these findings point to a substantial role of MSI in both groups, albeit with differing manifestations.
Finally, as expected, signatures associated with POLE mutations (SBS10a and SBS10b, as well as commonly co-occurring SBS28) are unique to the ultra-hypermutated samples and have high contributions. Interestingly, a signature in the discovery component (labeled D4) is shared by both the hypermutated and ultra-hypermutated samples and is most similar to SBS10b. However, this signature is much more prevalent in the ultra-hypermutated samples. This suggests that while POLE mutations are largely associated just with the ultra-hypermutated, there may still be some relevance to a small subset of the hypermutated samples. It is important to note as well that there is some degree of arbitrariness to distinguishing hypermutated from ultra-hypermutated samples.
When examining the effects of the two covariates, we found very low posterior inclusion probabilities (< 10%) for all signatures in the non-hypermutated and hypermutated groups. The one exception was SBS25 in the non-hypermutated samples, which had a high PIP for gender; however, this signature had very low exposures in most samples, and the few samples with high exposures were female. Thus, this effect results from a limited number of influential observations. More associations were found in the ultra-hypermutated samples (Fig. 5B). While several of these again represent signatures that only have high exposures in a few samples, there are some interesting relationships with more prevalent signatures. In particular, the PIPs for age in SBS10a and SBS14 are 41% and 27%, respectively, with corresponding median coefficients of −1.51 and −1.44. These suggest that there is decreasing exposure to these two signatures with age; while there is a fair amount of uncertainty in this result, potentially driven by the small sample size, this is consistent with previous findings [16]. There is also a low but non-zero PIP of 11% for gender in SBS10a, with a coefficient of 1.44 suggesting higher exposure in males than females. While uncertainty is again high, this finding would agree with previous reports [16]. Overall, these results show how our method can be used to assess associations with quantifiable uncertainty in highly challenging, small sample settings.
Our approach sheds insight into early-onset breast cancer
We also demonstrate our approach in a multi-study analysis to investigate the mutational processes underlying early-onset breast cancer. Previous studies have reported an increasing incidence of breast cancer before the age of 40, which is linked to more aggressive histology and worse survival rates [17]. Somatic mutations are thought to play a critical role in these early-onset cases, since fewer than 10% are attributed to heritable BRCA1 or BRCA2 mutations. Single-study approaches have previously been used to compare mutational signatures in breast cancer patients under or over the age of 40, and found some evidence for a differing mutational landscape [17].
To further interrogate this question especially in younger populations, we considered more finely stratified age groups and leveraged data from both TCGA and PCAWG [18, 19]. In particular, we applied our recovery-discovery model to four groups of breast tumors: seven TCGA samples diagnosed at ages 20–29, 59 TCGA samples at ages 30–39, 16 PCAWG samples at ages 30–39, and 107 PCAWG samples diagnosed over age 55. We refer to these groups as TCGA 20–29, TCGA 30–39, PCAWG 30–39, and PCAWG 55+, respectively. We note that this is a highly challenging scenario, well-suited to our approach for several reasons. First, our main group of interest, i.e., TCGA 20–29, consists of only seven samples, which would be very difficult for a conventional single-study approach to analyze. Our approach instead enables us to draw strength from other studies as well as our informative priors built from COSMIC. Second, these samples are derived from two technologies, with the TCGA samples representing whole-exome sequencing and the PCAWG samples representing whole-genome sequencing. This introduces both technical and biological differences that our multi-study approach can capture. Finally, likely driven largely by these differences, the total number of mutations in each sample varies widely across these groups (Fig. 6A), with fewer than 100 mutations per sample in TCGA 20–29 but sometimes many tens of thousands of mutations per sample in PCAWG 55+.
A Distribution of total numbers of mutations per sample in each of the four groups considered in the breast cancer application. B (Top) Signature indicator matrix estimated using the recovery-discovery model. Signatures in the discovery component are labeled beginning with “D.” (Bottom) As above, but showing the proportion of samples in each study with more than 5% of total exposures attributed to each signature. C (Left) Signature D1 from the discovery component found to belong to both PCAWG 30–39 and PCAWG 55+, compared to (right) SBS5 from COSMIC v3.2
Our recovery-discovery model found a total of 45 signatures, of which 41 were in the recovery component and 4 were in the discovery component, though as before a large portion have low prevalence among the samples (Fig. 6B). Many of the signatures found, particularly those with high prevalence, overlap with those reported in previous breast cancer analyses [3, 17], including the canonical defective homologous recombination-based DNA damage repair signature SBS3, the clock-like signature SBS1, and the APOBEC signatures SBS2 and SBS13. Broadly, we note that only two signatures (SBS1 and SBS2) are common to all four groups. Many signatures are shared by just the two PCAWG groups, which could reflect either technical or biological variation due to the differences in sequencing technology. Intriguingly, we also find several signatures shared by all except the TCGA 20–29 group, as well as one signature unique to the TCGA 20–29 group. This points to possible differences in mutational processes between this youngest on-set group and the others.
In particular, the TCGA 20–29 group did not contain several signatures traditionally expected in breast cancer, most notably SBS3. SBS3 is strongly associated with both germline and somatic BRCA1 and BRCA2 mutations as well as promoter methylation in these genes. This suggests either a reduced role of these mutations in this age group, or a possible alternative and more rapid pathway. Interestingly, this group contained several signatures with high prevalence that relate to environmental factors, such as smoking (SBS4) which was unique to this group. Given the small sample size, the results should be interpreted with caution, as they may or may not be generalizable to a larger population. Nevertheless, these findings point to a possible greater role of lifestyle and environmental factors in very early on-set cases.
When considering the other groups, we also note that the clock-like signature SBS5 was not found in the recovery component, while previous analyses of breast cancer attributed a large number of mutations to it [3]. Instead, however, we find a high-prevalence signature in the discovery component (labeled as D1) which was shared by both PCAWG groups and had > 0.70 cosine similarity to the COSMIC v3.2 SBS5 signature (Fig. 6C). While this could reflect noise, the fact that the informative prior for COSMIC SBS5 was “escaped” suggests a possible tissue- or technology-specific version of SBS5 present in these breast cancer samples. This ability to distinguish potentially novel signatures from noisy estimates of previously reported signatures again highlights a strength of our recovery-discovery approach.
Discussion
While our proposal represents a powerful and comprehensive approach to mutational signatures estimation, there are still some limitations. One key challenge lies in the multi-modality of the posterior. A naively implemented sampler can easily become stuck in a local posterior mode because of the high energy barrier required to move from one mode to another. In our approach, we employ tempering to more easily and rapidly explore the parameter space, which introduces an additional computational burden. While our approach still has overall comparable run times as other methods, we note that the applicability of our approach is limited in very large settings. Alternative approaches to characterize the multi-modal posterior could be very valuable.
There are also several opportunities for extensions of our modeling approach. For example, our signature sharing matrix imposes regularization through the binary elements of \(\mathcal {A}\). This can be highly beneficial for interpretability, since it implies a signature is either present in a study or it’s not, but can over-simplify sharing patterns in scenarios where a more nuanced relationship is present. Hence, our approach offers substantial benefits in settings where binary sharing patterns are likely, but may be less appropriate if patterns across groups are more continuous in nature. Conversely, we do not enforce any sparsity in the exposures of individual samples to each signature, conditional on the signatures present. While this carries some computational advantages, it can be difficult to interpret which signatures meaningfully contribute to each sample within a study.
Conclusions
We presented a comprehensive Bayesian multi-study NMF framework for mutational signatures to address three key challenges: extending NMF to the multi-study setting, implementing semi-supervised inference, and simultaneously estimating covariate effects. We showed the strengths of our approach against competing methods through a set of simulations. Finally, we demonstrated practical utility with thorough analyses of colorectal cancer and breast cancer samples, in which we both recapitulated previously known biology and proposed new insights. These applications particularly highlighted our approach’s ability to rigorously identify sharing patterns and associations that would have required ad-hoc analyses, and would have lacked systematic uncertainty assessment, using existing methods.
Methods
NMF for mutational signatures
In this work, we focus primarily on single-base substitution (SBS) mutational signatures. SBSs are mutations in which one nucleotide base is substituted for another. There are six possible such substitutions, namely C>A, C>G, C>T, T>A, T>C and T>G, where, for example, C>A denotes a C mutated into an A. However, the properties of these substitutions are often also affected by the bases on either side of the mutated site. Since there are four possible bases that could be on each side, we typically consider a total of \(4 \cdot 6 \cdot 4 = 96\) total mutational motifs [1]. Note that it is possible to consider other types of mutational signatures, such as by increasing the number of bases considered on each side, or by instead examining double-base substitutions or indels. Although we ground the context of this work in SBS signatures for expositional simplicity, our methods can be applied with any classification of mutational motifs.
Mathematically, each mutational signature is represented as a \(K \times 1\) vector of rates, where K is the number of mutational motifs. The key assumption in the NMF approach is that the mutations in each tumor sample arise from the additive effects of the N mutational signatures at play. In particular, suppose we have a \(K \times G\) matrix of mutational counts \(\varvec{M}\), where the (k, g)th entry indicates how many mutations of motif k were observed in sample g, for \(k = 1, \ldots , K\) and \(g = 1, \ldots , G\). Then, under the NMF model, we can decompose
for \(K \times N\) signatures matrix \(\varvec{P}\) and \(N \times G\) exposures matrix \(\varvec{E}\). Here, each column of \(\varvec{P}\) is a \(K \times 1\) mutational signature, and the (n, g)th entry of \(\varvec{E}\) quantifies how much signature n contributed to the mutations of sample g.
Under the classical NMF approaches to this problem, [1, 2] solved the optimization problem
for chosen N and norm. However, it can be shown for a specific choice of norm, this optimization is equivalent to maximum likelihood estimation of \(\varvec{P},\varvec{E}\) under the model
which was leveraged by [6] using an EM algorithm. Rosales et al. [7] then introduced an empirical Bayesian treatment to this model, signeR, by combining MCMC and EM techniques. This approach results in improved uncertainty quantification compared to previous methods by estimating the posterior distributions for \(\varvec{P}\) and \(\varvec{E}\).
Although NMF methods have been proven to provide powerful and interpretable results, these existing approaches still have several important limitations: (1) they can only decompose a single data matrix at a time, (2) they cannot simultaneously estimate the presence of previously known and de novo signatures in a way that also updates the known signatures’ estimates, and (3) they cannot incorporate tumor-level covariates into the matrix factorization. Hence, we propose a comprehensive NMF framework that addresses all of these limitations via two multi-study models, which we call the discovery-only model and the recovery-discovery model, that can both be extended to incorporate covariate effects.
Discovery-only model
Under the discovery-only model, S datasets are jointly decomposed in an entirely unsupervised manner to estimate de novo signatures. In particular, if \(\varvec{M}_s\) represents the \(K \times G_s\) matrix of mutational counts for the \(G_s\) samples in study s, then we build on the Poisson model of [6] to assume
Here, \(\varvec{P}\) is the \(K \times N\) matrix of signatures as defined in the previous subsection, where N is the total number of signatures present across all studies under analysis, and \(\varvec{E}_s\) is the \(N \times G_s\) matrix of exposures for study s. Note that \(\varvec{P}\) is a common parameter across all studies.
The \(N \times N\) study-level indicator matrix \(\varvec{A}_s\) is our key element to estimate the pattern of sharing of signatures, as introduced in the context of factor analysis by [20]. This matrix consists of all 0s, except for the diagonal entries, which are either 1 or 0. The nth diagonal entry of \(\varvec{A}_s\) is 1 whenever the nth signature is present in study s, so the product \(\varvec{P} \varvec{A}_s \varvec{E}_s\) will include the corresponding elements. Hence, \(\varvec{A}_s\) controls which signatures are included and which are not into the model for study s.
We denote by \(\mathcal {A}\) the overall \(S \times N\) signature indicator matrix whose sth row consists of the N diagonal entries of \(\varvec{A}_s\). For example, if the nth column of \(\mathcal {A}\) consists of all 1 s, the nth signature is shared by all; if it consists of exactly a single 1, this signature only belongs to one study; and if it consists of more than one 1 and at least one 0, this signature is shared by some but not all studies.
Our prior model builds on signeR [7]. In particular, we assume conjugate Gamma priors for the entries of \(\varvec{P}\) and \(\varvec{E}_s\), such that
We then further assume hyperpriors for these shape and rate parameters, specifically
Although the Indian Buffet Process (IBP) was used as the prior on \(\mathcal {A}\) in the original factor analysis context [20], we found in practice that mixing was more challenging with the IBP in this model. Instead, we use a simpler Bernoulli prior on each entry of \(\mathcal {A}\), i.e.
We fix the number of columns (i.e., the number of signatures) N to some large value, for example 50, and set q to encourage sparsity in \(\mathcal {A}\). If all entries in a given column are 0, the corresponding signature can be interpreted as not present or contributing to any of the studies under consideration. As a result, the expected number of signatures is typically much smaller than the dimension N, and estimating \(\mathcal {A}\) in this way allows us to automatically learn a posterior distribution over the total number of signatures that should be present.
In signeR, hyperparameter values are estimated from the data through an empirical Bayesian approach. While a similar approach could in principle be applied here, we instead found it sufficient to use default hyperparameter values as specified in Additional file 1: Section A. Because different studies can potentially have highly varying total numbers of mutations, we add further stabilization by fixing a study-specific normalizing constant \(\varvec{W}_s\) in the decomposition, i.e., as
where \(\cdot\) here indicates element-wise multiplication and \(\varvec{W}_s\) is a \(K \times G_s\) matrix whose elements are all set to the median number of total mutations across all samples in the study. This can effectively be interpreted as normalizing exposures across studies, but because Gamma distributions form scale families, the implied priors on the exposures still follow Gamma distributions.
To estimate these parameters, we use a Metropolis-within-Gibbs sampler extended from the sampler used in signeR [7]. To improve mixing, we use a tempering scheme for the first set of iterations to “warm up” the sampler and encourage faster exploration of the parameter space. Further details on the sampling steps and tempering scheme can be found in Additional file 1: Section A. Because the number of signatures as well as the sharing pattern can change from iteration to iteration, summarizing the posterior output is not straightforward. We choose point estimates to report by taking the five MCMC samples of \(\mathcal {A}\) that occurred most often in the chain and running the sampler conditional on each of these values of \(\mathcal {A}\). We estimate the BIC in each case, report the value of \(\mathcal {A}\) yielding the smallest BIC, and summarize the remaining parameters as their posterior medians from the conditional sampler output.
Recovery-discovery model
Under the recovery-discovery model, we jointly decompose S datasets to learn the presence of both de novo (discovered) and previously known (recovered) signatures. We implement this model using the known signatures from the COSMIC v3.2 database [3], but different or additional signatures can be easily incorporated as well. The recovery-discovery model expands the previously described model as
Here, the superscript R refers to parameters associated with the recovered (previously known) signatures, while the superscript D refers to parameters associated with the discovered (de novo) signatures.
The priors on \(\varvec{P}^D, \varvec{A}^D_s,\) and \(\varvec{E}^D_s\) are exactly as described in the previous section. We also use the same priors for \(\varvec{A}^R_s\) and \(\varvec{E}^R_s\). However, we instead build informative Gamma priors for each of the recovered signatures in the matrix \(\varvec{P}^R\). In particular, for each signature that we wish to include, we simulate a 500-sample dataset under our model using just the published estimate of that signature and a constant exposure of 100. Hence, this simulated dataset represents mutations that arise only from that signature. We then run the discovery-only sampler on this dataset with the number of signatures fixed to 1, and use the resulting learned posterior shape and rate parameters of the signature as its prior shape and rate parameters.
With these priors in place, the MCMC sampler for this model is a simple extension of the one developed for the discovery-only case. Point estimates are chosen as in the discovery-only model. We specify the default hyperparameters and describe the detailed steps in Additional file 1: Section A.
Extensions with covariates
We extended both the discovery-only and the recovery-discovery models to incorporate sample-level covariates. Suppose we have covariate vector \(\varvec{x}_{js}\) for sample j in study s. Then we modify the exposure prior as
in the discovery-only case and
in the recovery-discovery case.
Here, \(\varvec{\delta }_{ns}\) represents the vector of coefficients for the effects of each covariate on the exposure to signature n in study s. Nonzero values of \(\varvec{\delta }_{ns}\) can be interpreted as indicating multiplicative effects on the mean exposure. For example, if we had a single binary covariate \(x_{js}\), then a sample with \(x_{js}=1\) has \(\exp (\delta _{ns})\) times the mean exposure to signature n as a sample with \(x_{js}=0\).
It is important to enforce sparsity in the coefficients \(\varvec{\delta }_{ns}\), since we expect most covariates to have effects on the exposures of particular, but not all, signatures. To do so, we use a product-moment prior version [21] of the spike-and-slab prior from [14]. In particular, suppose the coefficient \(\delta _{fns}\) for covariate f arises from one of two possible distributions: a “spike” component, which is a low-variance normal distribution peaked sharply about 0, and a “slab” component, which is a higher variance normal distribution also centered at 0. We define indicator variables \(\gamma _{fns}\) that express which component \(\delta _{fns}\) belongs to, which allows us to specify the prior
for appropriately chosen c and \(\tau\). Note that without the \(\frac{\delta _{fns}^2}{c^2\tau ^2}\) term, this would match the specification of [14].
As a default, we set \(c = 10\) and \(\tau = 0.3\). To sample the coefficients, we leverage the results of [22], which show that nonlocal priors can be represented as mixtures of truncated normals, giving rise to a simple sampling scheme. In our case, each parameter can be updated in a Gibbs sampling step, except the coefficients from the slab component, for which we use adaptive rejection sampling as implemented in the R package armspp. Details are in Additional file 1: Section A.
Simulations
In order to build realistic simulations, in the setting without covariates, we based our simulated datasets on real data from the Pan-Cancer Analysis of Whole Genomes (PCAWG) study [18, 19]. In particular, we ran standard, single-study NMF approaches on 10 diverse cancer types with a range of sample sizes to identify signatures that are likely to be present in each. We then used the published signatures from COSMIC representing the closest matches to these estimates as ground-truth, and estimated exposures from the real data conditional on those signatures. With these signatures and exposures, we finally generated data under a Poisson likelihood according to our model.
In the 3-study simulation, we generated data in this way using the PCAWG data for lung adenocarcinoma, soft tissue leiomyosarcoma, and soft tissue liposarcoma, which resulted in data consisting of 38, 15, and 19 samples with 11, 8, and 5 ground-truth signatures respectively. Across these three simulated studies, five signatures were common, five were study-specific, and the remaining two were partially shared. In total, we generated 10 instances of this 3-study simulation with identical ground-truth parameters each time.
The 10-study simulation was built using these same three PCAWG datasets, as well as data for glioblastoma (41 samples and 5 ground-truth signatures), medullablastoma (146 samples and 6 ground-truth signatures), oligodendroglioma (18 samples and 4 ground-truth signatures), pilocytic astrocytoma (89 samples and 5 ground-truth signatures), renal cell carcinoma (144 samples and 8 ground-truth signatures), chromophobe renal cell carcinoma (45 samples and 8 ground-truth signatures), and squamous cell cancer (48 samples and 6 ground-truth signatures). Two signatures are common to all of the studies, nine are study-specific, and the rest, which comprises the majority, are partially shared with many various configurations. We again generated a total of 10 instances of this 10-study simulation.
In the setting with covariates, again to build realistic simulations, we based our simulated datasets on our colorectal cancer application, and generated three studies with 16, 62, and 328 samples, respectively. We applied single-study NMF approaches to these data to identify signatures with close matches to those in COSMIC, leading us to include a total of four signatures: two that are partially shared and two that are study-specific. We also looked for associations between covariates and exposures to these signatures, which motivated modeling effects from gender and age. Specifically, we generated gender indicators from a \(\text {Bernoulli}(0.5)\) distribution and standardized age from a \(\text {Normal}(0,1)\) distribution. We then generated exposures under our covariate model. For the study with 16 samples, male gender was positively associated (\(\delta = 1.5\)) with two signatures, and age was negatively associated (\(\delta = -1.5\)) with the same two signatures. For the other two studies, there were no associations for either covariate with any signature.
Further details are in Additional file 1: Section B and Additional file 1: Tables S1 and S2.
Evaluation and comparison to existing methods
In the setting without covariates, we ran these simulations under our discovery-only and recovery-discovery models as described in the previous sections, with the same default hyperparameters and procedure for selecting point estimates of the signatures and sharing pattern. We evaluated the performance of our approach through three metrics: sensitivity and precision for the sharing pattern, and cosine similarity for the signature estimates. To compute sensitivity and precision, we first computed the cosine similarity between each estimated signature and those in the COSMIC v3.2 catalogue. We labeled each signature by the name of its closest match in COSMIC, as long as the cosine similarity was over 0.70. If a signature estimate did not have any cosine similarities over that threshold, we considered that signature “unlabeled”. We then used these labels to compute sensitivity and precision. Sensitivity was reported as the total number of labeled signatures found in each study that should have been there according to ground truth, divided by the total number of, not necessarily unique, signatures that should be present across all studies. Precision was found using the same numerator, but divided by the total number of signatures, labeled and unlabeled, found in each study. Note that if multiple estimated signatures were given the same label within one study, at most one would be considered correct and the rest were considered incorrect. To compute the cosine similarities of the signature estimates reported in Fig. 2B and D, we took all signatures in each simulation run labeled in the process above as SBS1, SBS2, SBS5, SBS13, or SBS40 and reported their cosine similarities with the corresponding COSMIC signature. If more than one signature was given the same label, we used and reported the cosine similarities for all signatures of that label.
We also ran these same simulations under existing approaches and computed the same metrics on those results as a comparison. For HDP [11], we used a larger number of iterations than specified in the online tutorial in order to improve convergence. Specifically, we used four chains with 65,000 iterations each, of which the first 15,000 were considered burn-in and subsequently 250 posterior samples were taken at intervals of 200 iterations. For HDP-frozen, we again used a larger number of iterations, specifically four chains with 25,000 iterations, of which the first 15,000 were burn-in and subsequently 250 posterior samples were taken at intervals of 100 iterations. In both cases, prior specifications were otherwise as provided in the online tutorial. To follow the online tutorial, HDP-frozen was run in single-study mode on each study one at a time. Because it is not straightforward to produce point estimates in HDP or HDP-frozen, [11] suggests a post-hoc clustering algorithm on the MCMC samples to produce consensus signatures and corresponding estimates. Since our own results are reported using our posterior output without such clustering/merging, we chose not to apply this post-processing algorithm to the HDP output and to instead report the modal sensitivity and precision, i.e., the most frequently occurring pair of sensitivity and precision values when computed for each of the collected MCMC samples from the four chains in each run of HDP. For simplicity, however, we did label and use the post-processed signature estimates when reporting cosine similarities with the COSMIC signatures.
For the single-study approaches, signeR [7] and SigProfiler [3], we ran each approach in two ways. First, we applied these approaches to one study at a time within each simulation. Second, we applied these approaches to all of the studies concatenated together within each simulation. In both cases, these methods were run with default specifications (except we set estimate_hyper=TRUE for signeR), and we used their reported point estimates of the signatures when computing our evaluation metrics. In the latter case of concatenated studies, it was not straightforward to compute precision, since there is no clear notion of whether each signature belongs to a given study or not. For signeR, we computed the expected number of mutations attributed to each sample by each signature under the Poisson likelihood, and considered a signature to be present in a given study if at least one sample had at least one expected count from that signature. For SigProfiler, we analogously considered a signature to be present in a given study if at least one sample had non-zero activity for that signature. In all cases, when reporting cosine similarities with the COSMIC signatures, we again used all signatures from the point estimate(s) with the corresponding label.
Perturbation experiments
To assess the relative strengths of our informative priors in the recovery-discovery setting, we conducted a series of perturbation experiments. In particular, for each signature, we generated five “perturbed” versions of that signature by adding values randomly generated from a \(\text {Unifiorm}(0,a)\) distribution to each entry of the COSMIC v3.2 signature, where \(a = 0.01,0.03,0.05,0.07,\) or \(0.09\). We then generated a 50-sample dataset for each of these perturbed signatures under a Poisson model with exposures set to a constant value of 1000. Finally, we ran our recovery-discovery model on each of these datasets using the same default hyperparameters and procedure for selecting point estimates as previously described.
Real data analyses
Colorectal cancer analysis
We retrieved 406 colorectal cancer samples from TCGA using the TCGAmutations R package. The mutational counts values were tallied using the sig_tally function from the sigminer R package. Samples were divided into three groups – non-hypermutated, hypermutated, and ultra-hypermutated – based on Gaussian mixture models fit to their tumor mutational burdens as computed using the maftools R package. Covariates were then taken from the clinical data within the TCGAmutations R package; specifically, we set age values by z-scoring the CDR_age_at_initial_pathologic_diagnosis variable, and we set gender values by binarizing the CDR_gender variable. We then ran our recovery-discovery model with covariates on these data using the same default hyperparameters and procedure for selecting point estimates as specified previously.
Breast cancer analysis
Mutations from TCGA breast cancer samples were tallied using the sig_tally function from the sigminer R package. Samples were subsetted to just those in the 20–29 age range or 30–39 age range based on data available on the GDC portal. Mutational counts data from PCAWG was obtained from Synapse ID 11726620. Samples were subsetted to just those in the 30–39 age range or 55+ age range based on data available at the ICGC portal, and were excluded if they overlapped with TCGA samples. We ran our recovery-discovery model on these data using the same default hyperparameters and procedure for selecting point estimates as specified previously.
Data availability
Pan-Cancer Analysis of Whole Genomes (PCAWG) study data was retrieved from Synapse ID 11726620. Currently, the PCAWG data can be accessed at: https://docs.icgc-argo.org/docs/data-access/data-download. TCGA data for the colorectal cancer analysis were retrieved using the TCGAmutations R package.
Code availability
An R package implementing our methods can be found at www.github.com/igrabski/MultiStudyNMF [23] and is licensed under the MIT license. The version of the code used in the manuscript is also available at https://doiorg.publicaciones.saludcastillayleon.es/10.5281/zenodo.14991239 [24].
References
Alexandrov LB, Nik-Zainal S, Wedge DC, Aparicio SA, Behjati S, Biankin AV, et al. Signatures of mutational processes in human cancer. Nature. 2013;500(7463):415–21.
Nik-Zainal S, Alexandrov LB, Wedge DC, Van Loo P, Greenman CD, Raine K, et al. Mutational processes molding the genomes of 21 breast cancers. Cell. 2012;149(5):979–93.
Alexandrov LB, Kim J, Haradhvala NJ, Huang MN, Ng AWT, Wu Y, et al. The repertoire of mutational signatures in human cancer. Nature. 2020;578(7793):94–101.
Lal A, Liu K, Tibshirani R, Sidow A, Ramazzotti D. De novo mutational signature discovery in tumor genomes using SparseSignatures. PLoS Comput Biol. 2021;17(6):e1009119.
Li S, Crawford FW, Gerstein MB. Using sigLASSO to optimize cancer mutation signatures jointly with sampling likelihood. Nat Commun. 2020;11(1):3575.
Fischer A, Illingworth CJ, Campbell PJ, Mustonen V. EMu: probabilistic inference of mutational processes and their localization in the cancer genome. Genome Biol. 2013;14(4):R39.
Rosales RA, Drummond RD, Valieris R, Dias-Neto E, Da Silva IT. signeR: an empirical Bayesian approach to mutational signature discovery. Bioinformatics. 2017;33(1):8–16.
Kasar S, Kim J, Improgo R, Tiao G, Polak P, Haradhvala N, et al. Whole-genome sequencing reveals activation-induced cytidine deaminase signatures during indolent chronic lymphocytic leukaemia evolution. Nat Commun. 2015;6:8866.
Shiraishi Y, Tremmel G, Miyano S, Stephens M. A Simple Model-Based Approach to Inferring and Visualizing Cancer Mutation Signatures. PLoS Genet. 2015;11(12):e1005657.
Teh YW, Jordan MI, Beal MJ, Blei DM. Hierarchical Dirichlet Processes. J Am Stat Assoc. 2006;101(476):1566–81. https://doiorg.publicaciones.saludcastillayleon.es/10.1198/016214506000000302.
Roberts ND. Patterns of somatic genome rearrangement in human cancer [Ph.D. thesis]. Cambridge: University of Cambridge; 2018.
Robinson W, Sharan R, Leiserson MD. Modeling clinical and molecular covariates of mutational process activity in cancer. Bioinformatics. 2019;35(14):i492–500.
Park JE, Smith MA, Van Alsten SC, Walens A, Wu D, Hoadley KA, et al. Diffsig: Associating Risk Factors With Mutational Signatures. bioRxiv. 2023;33(5):2023–02.
George EI, McCulloch RE. Variable selection via Gibbs sampling. J Am Stat Assoc. 1993;88(423):881–9.
Díaz-Gay M, Alexandrov LB. Unraveling the genomic landscape of colorectal cancer through mutational signatures. Adv Cancer Res. 2021;151:385–424.
Puccini A, Berger MD, Zhang W, Lenz HJ. What we know about stage II and III colon cancer: it’s still not enough. Target Oncol. 2017;12(3):265–75.
Mealey NE, O’Sullivan DE, Pader J, Ruan Y, Wang E, Quan ML, et al. Mutational landscape differences between young-onset and older-onset breast cancer patients. BMC Cancer. 2020;20:1–18.
PCAWG. Pan-cancer analysis of whole genomes. Nature. 2020;578(7793):82–93.
ICGC. PCAWG mutation spectra. Synapse. 2018. https://doiorg.publicaciones.saludcastillayleon.es/10.7303/syn11726601.
Grabski IN, De Vito R, Trippa L, Parmigiani G. Bayesian combinatorial MultiStudy factor analysis. Ann Appl Stat. 2023;17(3):2212.
Avalos-Pacheco A, Rossell D, Savage RS. Heterogeneous Large Datasets Integration Using Bayesian Factor Regression. Bayesian Anal. 2022;17(1):33–66.
Rossell D, Telesca D. Nonlocal priors for high-dimensional estimation. J Am Stat Assoc. 2017;112(517):254–65.
Grabski I. MultiStudyNMF. GitHub. 2025. https://www.github.com/igrabski/MultiStudyNMF.
Grabski I. MultiStudyNMF Zenodo. 2025. https://doiorg.publicaciones.saludcastillayleon.es/10.5281/zenodo.14991239.
Acknowledgements
Not applicable.
Peer review information
Tim Sands was the primary editor of this article and managed its editorial process and peer review in collaboration with the rest of the editorial team.
Funding
The authors gratefully acknowledge that this work was supported by the NIH-NIGMS training grant T32GM135117, NIH-NCI grant 5R01 CA262710 - 03, and NSF-DMS grant 2113707. ING was supported by National Science Foundation Graduate Research Fellowship under grant no. DGE1745303.
Author information
Authors and Affiliations
Contributions
GP identified the problem. IG, LT, and GP proposed and developed the method. IG implemented the method. IG wrote the draft manuscript, and LT and GP made revisions. All authors approved the final manuscript.
Corresponding author
Ethics declarations
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
While he does not consider any of these activities to be in conflict with the research in this committee, Giovanni Parmigiani wishes to disclose that he holds equity in Phaeno Biotechnologies and currently consults for Delphi Diagnostics.
Additional information
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary Information
13059_2025_3563_MOESM1_ESM.pdf
Additional file 1: this includes further details on the methodology, simulated data generation, and supplementary figures and tables.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Grabski, I., Trippa, L. & Parmigiani, G. Bayesian multi-study non-negative matrix factorization for mutational signatures. Genome Biol 26, 98 (2025). https://doiorg.publicaciones.saludcastillayleon.es/10.1186/s13059-025-03563-0
Received:
Accepted:
Published:
DOI: https://doiorg.publicaciones.saludcastillayleon.es/10.1186/s13059-025-03563-0