SpaNorm: spatially-aware normalization for spatial transcriptomics data

Salim, Agus; Bhuva, Dharmesh D.; Chen, Carissa; Tan, Chin Wee; Yang, Pengyi; Davis, Melissa J.; Yang, Jean Y. H.

doi:10.1186/s13059-025-03565-y

Methodology
Open access
Published: 29 April 2025

Genome Biology volume 26, Article number: 109 (2025)

Abstract

Normalization of spatial transcriptomics data is challenging due to spatial association between region-specific library size and biology. We develop SpaNorm, the first spatially-aware normalization method that concurrently models library size effects and the underlying biology, segregates these effects, and thereby removes library size effects without removing biological information. Using 27 tissue samples from 6 datasets spanning 4 technological platforms, SpaNorm outperforms commonly used single-cell normalization approaches while retaining spatial domain information and detecting spatially variable genes. SpaNorm is versatile and works equally well for multicellular and subcellular spatial transcriptomics data with relatively robust performance under different segmentation methods.

Background

Advances in spatial profiling technology have transformed our comprehension of multicellular biological systems. The emergence of both spot-based spatial transcriptomics technologies (ST) such as 10x Genomics Visium [1] as well as subcellular spatial transcriptomics (SST) technologies, such as 10x Genomics Xenium [2], NanoString CosMx [3], BGI Stereo-seq [4], and Vizgen MERSCOPE [5], holds the promise to address previously inaccessible biological questions and enhance our understanding of intercellular communication by preserving tissue architecture. While these innovative spatial transcriptomics technologies offer the potential to uncover new insights into regional variations in cell density and composition, the challenge of effectively removing varying library sizes across regions (Fig. 1A and B) hinders our ability to detect spatial variation signals from the data. This can potentially impact downstream analyses such as clustering, regional segmentation, and identification of spatially variable genes (SVGs).

Currently, the removal of library size effects from spatial transcriptomics data is under debate. Those in favor argue that total molecule counts represent unwanted technical variation arising from imperfect molecule capture, and typically use methods originally developed for single-cell RNA-seq (scRNA-seq) data [6,7,8] that ignore the spatial information. Many of these library size normalization approaches use global scaling factors and may suffer when the data are confounded by spatial region-specific library size biases. In particular, these normalization methods tend to remove signals associated with the spatial domain (Fig. 1C), which has led to arguments that library size normalization should not be performed prior to downstream analyses [9], or at least not prior to spatial domain identification unless it is done using methods that take spatial information into account [10]. There is therefore a need for normalization techniques that leverage spatial information to eliminate this region-specific library size bias while retaining biological signals, as effective library size normalization can improve spatial domain identification and other downstream analyses.

Here, we develop SpaNorm, a normalization method that utilizes spatial information and gene expression simultaneously, allowing optimal identification of spatial domains (Fig. 1D) and SVGs. We achieve this through three key innovations: (1) optimally decomposing spatially-smooth variation into library size associated and library size independent variation via a generalized linear model (GLM); (2) computing spatially smooth location- and gene-specific scaling factors; and (3) using percentile-invariant adjusted counts (PAC) [11] as normalized data for downstream analyses. Figure 1E provides a detailed overview of the SpaNorm approach.

Results

Library size effects are region-specific in spatial transcriptomics data

We first establish evidence that library size effects vary across spatial domains. Comparing models with global and region-specific library size effects, we estimate the proportion of genes that exhibit spatial variations in their library size effects (see Methods section). From Fig. 2A, we can infer that the proportion of genes with region-specific library size effects varies from around 25% to almost 100% across datasets. Overall, the Xenium and STOmics datasets have the highest proportion of genes with region-specific effects, followed by the CosMx dataset, and finally the Visium dataset. To demonstrate that these results are not due to our manual region annotations, we performed a sensitivity analysis in which each dataset was split into rectangular grids and the proportion of genes that exhibit grid-specific library size effects was estimated. For the majority of the datasets, the results show an even higher proportion of genes that exhibit variation in their library size effects under the grid-based method (Additional file 1: Fig. S1).

SpaNorm preserves spatial domain signals

Next, we examine how the spatially-dependent library size effects (scaling factor) in SpaNorm can improve downstream analyses. For this purpose, we first compared SpaNorm to other normalization methods in terms of their ability to retain spatial domain information. We use the ratio of between-region to within-region variations to measure the strength of signals associated with the spatial domain in each gene. Comparing these ratios for the raw and differently normalized data (Additional file 1: Fig. S2), we found that SpaNorm retains the most signal (higher ratio), followed by scran and RUV-III-NB while sctransform and Giotto retain the least. Giotto particularly retains less signal for the Xenium datasets (Mouse Brain, ILC, and IDC).

Clustering has been used as one of the main tools to identify distinct spatial regions. To examine how better retention of spatial domain signals translates into improved identification of spatial regions, we benchmarked SpaNorm against several alternative normalization methods using our previously established benchmark [10]. Three clustering methods: graph-based, SpaGCN, and BayesSpace, were applied using a range of parameter settings (see Methods). Figure 2B shows that single-cell RNA-seq inspired graph-based methods that use expression data alone have lower clustering accuracy across all platforms compared to the spatially-aware methods BayesSpace and SpaGCN. Furthermore, with graph-based clustering, the choice of normalization method has little impact on clustering accuracy, with the exception of sctransform where lower performance is observed in the Mouse brain STOmics, Human DLPFC Visium, and Human NSCLC CosMx data.

Among the two spatially-aware clustering methods evaluated, BayesSpace produced significantly better clustering in 15 samples, compared to SpaGCN, which produced the best clustering in 6 samples. Across the different clustering methods, we observe that SpaNorm has the best performance (measured using maximum ARI) for 9 of the 25 samples, followed by a standard library size (LS) normalization, which works best for 7 samples (Additional file 1: Table S1). However, of these 7 samples, 6 were 10x Visium samples, showing that standard library size normalization (LS) is not effective in normalizing sub-cellular resolution datasets. On the other hand, SpaNorm had balanced performance across technologies and clustering algorithms (Fig. 2B, Additional file 1: Fig. S3).

As K, which controls the complexity of the splines, is a key parameter of SpaNorm, we separately evaluated the performance of SpaNorm upon varying K. As BayesSpace outperformed all other methods in clustering, we performed the benchmark using BayesSpace alone. The results showed that increasing K is beneficial but only up to a certain point, beyond which the benefits of smoothness begin to be lost (Additional file 1: Fig. S4). This is particularly clear for the CosMx samples where the best clustering is achieved when $K=12$ and is poorer with smaller or larger values.

Finally, as expected, not normalizing the data (none) never produces the best clustering (Additional file 1: Table S1) and is rarely the best even in combination with specific clustering algorithms. This result highlights the need for appropriate library size normalization for downstream analyses of spatial transcriptomics data.

SpaNorm improves SVG detection and concordance

Beyond spatial domain identification, we show the benefits of SpaNorm normalization in consistently detecting SVGs. We demonstrate performance using simulated datasets where the true SVGs are known, and using serial replicates from real world datasets where SVGs identified should be consistent. For the former experiment, we generated realistic simulated CosMx, Visium, and Xenium datasets using scDesign3 [12] where 100 genes were designated as true SVGs. Additional file 1: Fig. S5 shows that among the top 100 SVGs identified, SpaNorm consistently calls the highest or joint highest proportion of true SVGs correctly, which also means that SpaNorm controls false discoveries among top 100 SVGs better than other methods.

SpaNorm is also better at detecting true SVGs in real datasets. Figure 3B and Additional file 1: Figs. S6–S7 show the expression of six true SVGs from Xenium Mouse Brain datasets [13]. While Giotto and no Normalization produce stronger signals for general neuronal subtype markers that distinguish granule neurons in the dentate gyrus (Prox1) from pyramidal neurons in CA1–3 (Neurod6) [14], SpaNorm produces stronger signals for detecting specific markers of pyramidal neurons in different CA regions, namely Wfs1 in the CA1 region, Necab2 in the CA2 region, and Slit2 in the CA3 region (Fig. 3C).

Compared to raw data, normalization also produces a more meaningful SVG ranking. Figure 4 shows that for multicellular data, no Normalization results in higher concordance (Fig. 4A) and higher average relative ranking (Fig. 4B) for stably expressed genes compared to other genes, indicating the strong influence of library size effects on SVG signals of unnormalized data. Under the other methods, the concordance and average relative ranking of the stably expressed genes are lower than those of the other genes, which is expected given that these genes are unlikely to exhibit spatially variable expression. Overall, for multicellular data, RUV-III-NB and sctransform have lower concordance for both sets of genes. For subcellular data, the difference between no Normalization and the other methods is less striking. However, we still observed the stably expressed genes exhibiting higher concordance under no Normalization (CosMx replicate) and higher average relative ranking (Human BRCA Xenium replicate). A closer look at the SVG statistic also shows that the signals for these stably expressed genes are much stronger in the raw data (see Additional file 1: Fig. S8), suggesting that the top SVGs from the unnormalized data reflect stronger library size effects.

SpaNorm enhances biological signals from lowly expressed genes

Lower library sizes due to technical effects can make it difficult to detect marker genes that are essential for identifying spatial domains. Though library size normalization can adjust for these effects, lowly expressed genes are still difficult to detect. MOBP is one such marker gene; it marks oligodendrocytes, which are enriched in the white matter of the human brain (Fig. 5A) [15]. When analyzing the 10x Visium human DLPFC datasets, we saw that MOBP was lowly expressed in two of the twelve samples. In these datasets, spots from the white matter (WM) had particularly low library sizes (Fig. 5B). Not normalizing library size effects would lead to the conflicting conclusion that MOBP was excluded from the white matter (Fig. 5C). Giotto, scran, and sctransform were able to detect signals at the boundary of the white matter but not within. Only SpaNorm was able to detect signals both within and at the boundary of the white matter region; however, MOBP detection was relatively weaker at the core of the region. As SpaNorm models the expression of each gene spatially, it enables borrowing of information from surrounding regions, and this can be used to obtain a better region-specific estimate of each gene’s expression. Inspecting the mean estimate of MOBP, we saw that it was significantly higher in the white matter compared to other regions of the tissue (Mean Bio in Fig. 5C). This observation was also consistent across other samples from this dataset (Additional file 1: Fig. S9).

SpaNorm is robust to gene sampling, cell segmentation, and volume-based normalization

A common practice in spatial transcriptomics analysis is to filter out lowly expressed genes, and subsequently normalize data using cell volume rather than library sizes. As the cell volume is dependent on cell segmentation, volume-based normalization and differences in cell segmentation could alter downstream biological insights. To assess the robustness of SpaNorm to gene sampling, we ran a simulation experiment using the 10x Visium samples where we created sampled datasets and assessed the ranking of the top SVGs. Each simulated dataset was composed of 100 of the top SVGs identified using the whole dataset and 400 randomly sampled genes. Simulated datasets were then normalized using SpaNorm, and SVG calling was performed using MERINGUE. Across 10 repeats for each sample, we found that on average 83 of the 100 strongest SVGs ranked within the top 100 (Additional file 1: Fig. S10). These results showed that the gene sampling strategy did not strongly affect SpaNorm when studying true biological effects. This was expected as the only step where the gene composition matters is in the already robust empirical-Bayes approach used to estimate over-dispersion [16].
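The construction of one sampled dataset can be sketched as follows. This is a minimal illustration, not the authors' code: the function names, the fixed seed, and the choice to draw the 400 random genes from outside the top-SVG set are all assumptions made for the example.

```python
import random


def sample_gene_set(top_svgs, all_genes, n_random=400, seed=0):
    """Combine the top SVGs with randomly sampled background genes.

    Assumption: background genes are drawn without replacement from the
    genes that are not already among the top SVGs.
    """
    rng = random.Random(seed)
    top = set(top_svgs)
    background = [g for g in all_genes if g not in top]
    return list(top_svgs) + rng.sample(background, n_random)


def n_recovered(ranked_genes, top_svgs, n_top=100):
    """Count how many of the original top SVGs rank within the first n_top genes."""
    return len(set(ranked_genes[:n_top]) & set(top_svgs))
```

Repeating `sample_gene_set` with different seeds, re-running normalization and SVG calling on each sampled set, and averaging `n_recovered` gives a summary comparable in form to the figure of 83 out of 100 reported above.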

Next, we assessed the impact of cell segmentation and the difference in normalizing data using cell volume/area as opposed to library sizes. As cell segmentation differences are less likely to affect macroscopic effects such as spatial domains and SVGs, we focused our assessments on finer features such as cell types. We assessed the reproducibility of cell type proportions across the Xenium breast cancer samples processed using two segmentation approaches and using either library size normalization or cell volume/area normalization. Our results showed that cell type proportions were consistent when normalizing gene expression data using library size or cell volume/area (Additional file 1: Fig. S11). We noticed minor differences in proportion across the different segmentation choices. However, these trends were also seen across the two normalization approaches suggesting that the difference was not resulting from normalization but from the inclusion or exclusion of cell type markers through the adjustment of cellular boundaries.

Discussion

Here we present SpaNorm, a normalization method that recognizes the region-specific nature of library size effects and distributions. Using 27 real-world tissue samples, we benchmarked SpaNorm against other normalization methods and demonstrated that SpaNorm is better at retaining spatial domain signals for clustering and detecting true SVGs. SpaNorm’s running time increases only linearly as a function of the number of cells. For the datasets we used in our benchmarking study, the longest running time was around 9 min for Xenium Breast Cancer datasets with around 60,000 cells (Additional file 1: Fig. S12).

To maximize the potential of SpaNorm’s normalized data, we recommend using spatially-aware clustering algorithms such as BayesSpace and SpaGCN, for which the comparative advantage of SpaNorm is more pronounced. While SpaNorm can be used for both spot-based and subcellular spatial transcriptomics (SST) data, we observed that the relative benefit of using SpaNorm is greater for SST data such as those from the Xenium, STOmics, and CosMx platforms, for which the proportion of genes exhibiting region-specific library size effects is higher.

For data generated using SST technologies, in order to extract cell-level data, segmentation to detect cell boundaries can be carried out prior to downstream analysis. An alternative is to use grid-based methods [10] whereby no segmentation is performed and instead molecule counts that fall into each grid are simply summed up. Our benchmarking consists of 25 grid-based and 2 segmentation-based datasets. Our empirical evidence shows that SpaNorm’s performance is not sensitive to this decision and the algorithms work equally well for segmentation-based data or grid-based data that consist of counts from multiple cells.

Optimal normalization of spatial transcriptomics (ST) data has been difficult to achieve because library size effects and distribution are potentially region-specific. These two unique features of ST data do not exist in single-cell RNA-seq (scRNA-seq) data. It is thus not surprising that direct applications of normalization methods developed for scRNA-seq data often result in the removal of spatial domain signals in addition to removing the library size effects. SpaNorm decomposes the spatially smooth variation into components related and unrelated to log library size and subsequently retains only the variation unrelated to log library size. SpaNorm may become less effective in separating these two types of variation when the spatial autocorrelation in log library size is very high. However, for most datasets, we expect the spatial autocorrelation to be only moderate. For example, in our benchmarking datasets, the Moran’s I statistics for log library size are less than 0.2, with 10x Visium datasets exhibiting the highest spatial autocorrelation (Additional file 1: Fig. S13).
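For readers who wish to check the spatial autocorrelation of log library size in their own data, Moran's I can be computed as in the sketch below. The spatial weights are an assumption of this example (binary weights linking spots within a fixed `radius`); the weighting scheme used for Additional file 1: Fig. S13 is not stated here and may differ.

```python
import math


def morans_i(values, coords, radius):
    """Moran's I with binary weights: w_ij = 1 if spots i and j are within `radius`.

    values: one number per spot (for example, log library size).
    coords: one (x, y) tuple per spot.
    The double loop is quadratic in the number of spots; it is meant as an
    illustration, not for large datasets.
    """
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    cross = 0.0
    weight_total = 0.0
    for i in range(n):
        for j in range(n):
            if i != j and math.dist(coords[i], coords[j]) <= radius:
                cross += dev[i] * dev[j]
                weight_total += 1.0
    return (n / weight_total) * cross / sum(d * d for d in dev)
```

Values near 0 indicate little spatial autocorrelation, while values approaching 1 indicate that neighbouring spots have very similar library sizes.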

SpaNorm currently only deals with the library size effect but can be extended to handle other unwanted variation, such as the “batch” effect introduced when data are acquired through multiple fields of view [17]. The field of view (FOV) effect introduces discontinuity in the spatial patterns. Since SpaNorm relies on decomposing spatially smooth variation, the discontinuity could affect SpaNorm’s ability to separate real biology from the underlying unwanted variation. We are currently extending SpaNorm’s model to deal with and subsequently remove the FOV effect. More generally, our approach for decomposing smooth spatial variation can be extended to accommodate other types of spatial omics data such as imaging mass cytometry data [18], although it would likely require adaptation of the underlying models beyond the negative binomial distribution.

Conclusions

In conclusion, the development of both spot-based and subcellular spatial transcriptomics technologies is revolutionizing molecular biology. We identified strong spatial variation of library size across many ST datasets, which challenges standard normalization methods developed for scRNA-seq data. To address this, we introduced the first spatially-aware normalization approach that performs local regional library size adjustment, providing a level of flexibility that many global adjustment approaches lack. We illustrate that our novel method outperforms the current state-of-the-art normalization methods, allowing a more accurate identification of spatially variable genes as well as regional detection. Furthermore, SpaNorm works equally well with segmented cell-level data and spot-based data, where each spot contains multiple cells.

Methods

SpaNorm model

To develop SpaNorm, a normalization method that utilizes spatial information while allowing optimal identification of spatial domains and spatially variable genes (SVGs), we model the count data using a generalized linear model. Specifically, we assume that the count for gene g and spot (cell) c can be modeled as $z_{gc} \sim \text {Negative Binomial(NB)}(\mu _{gc},\psi _g)$ where $\psi _g$ is the gene-specific dispersion parameter. The library size (LS) and biology affect the mean parameter through a log-linear model

$$\begin{aligned} \log \mu _{gc} = \zeta _g + f_g(x_c,y_c; \beta _{\textbf{g}}) + \{\alpha + h_g(x_c,y_c; \gamma _{\textbf{g}})\} \log LS_c, \end{aligned}$$

where $(x_c, y_c)$ are the spatial coordinates and $LS_c$ is the library size for spot c. The two functions $f_g(x_c,y_c)$ and $h_g(x_c,y_c)$ are two-dimensional, gene-specific spatially-smooth functions constructed using 2D splines with K (default = 6) degrees of freedom in each dimension, expressed as

$$\begin{aligned} f_g(x_c,y_c; \beta _{\textbf{g}}) = \sum \limits _{i=1}^K \sum \limits _{j=1}^K \beta _{g,ij} B_i(x_c)B_j(y_c); \end{aligned}$$

and

$$\begin{aligned} h_g(x_c,y_c; \gamma _{\textbf{g}}) = \sum \limits _{i=1}^K \sum \limits _{j=1}^K \gamma _{g,ij} B_i(x_c)B_j(y_c), \end{aligned}$$

where $B_i(.)$ and $B_j(.)$ are B-spline basis functions.
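To make the tensor-product construction concrete, the following minimal sketch evaluates $f_g$, $h_g$, and the resulting log mean at a single location. It is an illustration rather than the SpaNorm implementation: the cubic degree, the clamped and evenly spaced knots, the rescaling of coordinates to [0, 1], and all function names are assumptions of this example.

```python
import math


def clamped_knots(k, degree=3):
    """Knot vector on [0, 1] that yields k B-spline basis functions."""
    n_interior = k - degree - 1
    interior = [(i + 1) / (n_interior + 1) for i in range(n_interior)]
    return [0.0] * (degree + 1) + interior + [1.0] * (degree + 1)


def bspline_basis(x, knots, degree=3):
    """Evaluate every basis function at x using the Cox-de Boor recursion."""
    n_spans = len(knots) - 1
    # Degree 0: indicator of the half-open knot span containing x.
    basis = [1.0 if knots[i] <= x < knots[i + 1] else 0.0 for i in range(n_spans)]
    if x == knots[-1]:
        # Close the last non-empty span so the right boundary is covered.
        last = max(i for i in range(n_spans) if knots[i] < knots[i + 1])
        basis[last] = 1.0
    for d in range(1, degree + 1):
        nxt = []
        for i in range(n_spans - d):
            left_den = knots[i + d] - knots[i]
            right_den = knots[i + d + 1] - knots[i + 1]
            left = (x - knots[i]) / left_den * basis[i] if left_den > 0 else 0.0
            right = (knots[i + d + 1] - x) / right_den * basis[i + 1] if right_den > 0 else 0.0
            nxt.append(left + right)
        basis = nxt
    return basis


def surface(x, y, coef, knots, degree=3):
    """Tensor-product spline: sum_ij coef[i][j] * B_i(x) * B_j(y)."""
    bx = bspline_basis(x, knots, degree)
    by = bspline_basis(y, knots, degree)
    return sum(coef[i][j] * bx[i] * by[j]
               for i in range(len(bx)) for j in range(len(by)))


def log_mean(x, y, zeta, beta, alpha, gamma, lib_size, knots):
    """log mu = zeta + f(x, y; beta) + (alpha + h(x, y; gamma)) * log(LS)."""
    f = surface(x, y, beta, knots)
    h = surface(x, y, gamma, knots)
    return zeta + f + (alpha + h) * math.log(lib_size)
```

With K = 6, `clamped_knots(6)` gives six basis functions per dimension, so each of the coefficient matrices `beta` and `gamma` has 6 × 6 = 36 entries per gene.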

Using this model, we decompose the smooth spatial variation in each gene into two components: $f_g(\textbf{x,y}; \beta _{\textbf{g}})$ which represents biologically-relevant smooth spatial variation and $(\alpha + h_g(\textbf{x,y}; \mathbf \gamma _g)) \log LS$ which represents smooth spatial variation related to (log) library size. Here, $\alpha$ represents the global effect of library size shared by all genes while $\mathbf \gamma _g$ is the parameter that determines the gene-specific library size effect. When $\mathbf \gamma _g=0$, there is no gene-specific library size effect. To improve the fit, we also found that it is beneficial to “regularize” the $\beta _{g,ij}$ and $\gamma _{g,ij}$ parameters using an $L_2$ penalty, with $\lambda = 10^{-4} N$ as the default penalty, where N is the number of cells. More details about the algorithm and parameter estimation can be found in Additional file 2.

Adjusted data: SpaNorm outputs a matrix of percentile-invariant adjusted counts (PAC) that can be used for downstream analyses. For gene g and spot (cell) c, the PAC is calculated as the quantile of a negative binomial distribution where the mean parameter does not contain library size effects,

$$\begin{aligned} F_{NB}^{-1}\left( \frac{l_{gc}+u_{gc} }{2}; \mu _{gc} = \exp \{\hat{\zeta }_g + f_g(x_c,y_c; \hat{\beta }_{\textbf{g}})\},\psi =\hat{\psi }_g\right) , \end{aligned}$$

where

$$\begin{aligned} l_{gc}&= F_{NB}(z_{gc}; \mu _{gc} = \exp \{\hat{\zeta }_g + f_g(x_c,y_c; \hat{\beta }_{\textbf{g}}) + \{\alpha + h_g(x_c,y_c; \hat{\gamma }_g)\} \log LS_c\},\psi =\hat{\psi }_g), \text { and} \\ u_{gc}&= F_{NB}(z_{gc}+1; \mu _{gc} = \exp \{\hat{\zeta }_g + f_g(x_c,y_c; \hat{\beta }_{\textbf{g}})+ \{\alpha + h_g(x_c,y_c; \hat{\gamma }_g)\} \log LS_c\},\psi =\hat{\psi }_g) \end{aligned}$$

are the cumulative distribution functions of the negative binomial distribution whose mean includes the library size effects. After obtaining the PAC, the log PAC was simply obtained as $\log (PAC + 1)$.
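The PAC calculation can be illustrated with a small pure-Python sketch. Several choices here are assumptions of the example rather than statements about SpaNorm's code: the negative binomial is parameterized with variance $\mu + \psi \mu ^2$; the observed count is bracketed by the interval from $P(Z < z)$ to $P(Z \le z)$, an indexing chosen so that a count is returned unchanged when the two means coincide (it should be checked against Additional file 2); and the function names are hypothetical.

```python
import math


def nb_pmf(k, mu, psi):
    """NB probability of count k, assuming variance mu + psi * mu**2."""
    size = 1.0 / psi
    prob = size / (size + mu)
    log_p = (math.lgamma(k + size) - math.lgamma(size) - math.lgamma(k + 1)
             + size * math.log(prob) + k * math.log1p(-prob))
    return math.exp(log_p)


def nb_cdf(k, mu, psi):
    """P(Z <= k); returns 0 for negative k."""
    total = 0.0
    for i in range(k + 1):
        total += nb_pmf(i, mu, psi)
    return min(total, 1.0)


def nb_quantile(q, mu, psi, max_count=100_000):
    """Smallest count k such that P(Z <= k) >= q."""
    total = 0.0
    for k in range(max_count + 1):
        total += nb_pmf(k, mu, psi)
        if total >= q:
            return k
    return max_count


def pac(count, mu_full, mu_bio, psi):
    """Percentile-invariant adjusted count.

    mu_full: fitted mean including the library size terms.
    mu_bio:  fitted mean with the library size terms removed.
    """
    lower = nb_cdf(count - 1, mu_full, psi)   # P(Z < count) under the full model
    upper = nb_cdf(count, mu_full, psi)       # P(Z <= count) under the full model
    return nb_quantile((lower + upper) / 2.0, mu_bio, psi)


def log_pac(count, mu_full, mu_bio, psi):
    """log(PAC + 1), the scale used for downstream analyses."""
    return math.log(pac(count, mu_full, mu_bio, psi) + 1)
```

In this sketch, a count observed in a spot whose fitted mean is inflated by a large library size is mapped to the count at the same percentile under the smaller, library-size-free mean.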

We use the iteratively reweighted least squares (IRLS) algorithm to estimate SpaNorm’s model parameters. More detailed information about the algorithm is provided in Additional file 2.

Datasets

We use 6 datasets (see Additional file 1: Table S2 for details) encompassing 27 samples (25 grid-based and 2 segmentation-based), four different platforms (Visium, Xenium, STOmics, and CosMx), three tissues (brain, breast, and lung), and two species (human and mouse) to compare the performance of SpaNorm against no Normalization and four other state-of-the-art normalization approaches namely, Giotto, scran, RUV-III-NB, and sctransform normalizations.

For the grid-based datasets, transcript detection tables for the 10x Xenium breast cancer dataset (IDC and ILC), 10x Xenium mouse brain, the NanoString CosMx non-small cell lung cancer, and the BGI STOmics mouse brain were obtained from [10]. Independently acquired region annotations were available for these datasets. These were obtained through image registration of DAPI images to reference tissue atlases, or through annotation of immunofluorescence or histology images. The 10x Visium human DLPFC dataset [15] was obtained through the SpatialLIBD R/Bioconductor package [19].

The two segmentation-based datasets (Xenium Human Breast Cancer datasets 1 and 2) were downloaded from https://www.10xgenomics.com/products/xenium-in-situ/preview-dataset-human-breast and subjected to further quality control (QC) steps that can be found in [13].

Data preprocessing

Measurements from all datasets, except the 10x Xenium breast cancer dataset with replicates, were allocated to regular hexagonal bins using the SubcellularSpatialData R/Bioconductor package. The bins parameter was set to 200 for the 10x Xenium breast cancer and mouse brain datasets and 100 for the BGI STOmics and NanoString CosMx datasets. Bins where measurements spanned multiple regions were annotated based on the most frequent region annotation.

For the Xenium Human Breast Cancer datasets, segmentation was performed using BIDCell [13]. Default parameter values from the exemplar file for Xenium and the provided single-cell reference file were used (both files were downloaded from the official BIDCell repository). The model was trained end-to-end from scratch for 4000 iterations (i.e., using 4000 training patches). This amounted to a maximum of 22% of the entire image, thereby leaving the rest of the image unseen by the model during inference. Weights of the convolutional layers were initialized using He and colleagues’ approach [20]. We employed standard on-the-fly image data augmentation by randomly applying a flip (horizontal or vertical) and rotation (of 90, 180, or 270 degrees) in the (x,y) plane. The order of training samples was randomized prior to training. We employed the Adam optimizer [21] to minimize the sum of all losses at a fixed learning rate of 0.00001, with a first moment estimate of 0.9, second moment estimate of 0.999, and weight decay of 0.0001.

Normalization methods

Each dataset was normalized using the following methods:

No Normalization: Raw counts were log transformed. A pseudo count of 1 was added to all observations to avoid taking a logarithm of zero count.
scran normalization: A minimum size factor of $10^{-8}$ was imposed to avoid negative and zero size factor estimates [6].
sctransform normalization [7].
RUV-III-NB normalization [11] with $K=1$. Details of negative control features used and selection of pseudo-replicates can be found in [10].
Giotto normalization [8]: performs library size normalization, followed by z-scoring of the normalized data across genes and/or cells.
SpaNorm normalization (see above for details).

All normalization methods were applied using their default parameters. More detailed descriptions of scran, sctransform, and RUV-III-NB can be found in [10].

Evaluation methods

Evaluating region-specific library size effects: annotation-based

For each dataset, we selected the top 1000 most abundant genes. For each gene g, we fitted the following two negative binomial (NB) regression models to the observed count:

Model 1 (M1): $\log \mu _{gc} = \zeta _g + \sum _{i=1}^R \beta _{g,i} I_{[c \in S_i]} + \alpha _g \log LS_c$ and
Model 2 (M2): $\log \mu _{gc} = \zeta _g + \sum _{i=1}^R \beta _{g,i} I_{[c \in S_i]} + \alpha _g \log LS_c + h_g(x_c,y_c; \gamma _{\textbf{g}}) \log LS_c$

where $h_g(x_c,y_c; \gamma _{\textbf{g}})$ is a smooth spatial function constructed using 2D B-splines in the same way as in the SpaNorm model, $\beta _{g,i}$ is the coefficient representing the relative biology of annotated region i in gene g, and $I_{[c \in S_i]}$ is the indicator function that cell c belongs to region i. We can see that $M_2$ is very similar to the SpaNorm model, except that in $M_2$ the biology is assumed to be constant within each region, while SpaNorm allows the biology to vary within as well as between regions.

The only additional parameter in $M_2$ relative to $M_1$ is $\mathbf \gamma _g$, which controls the spot-specific library size effect. If $\mathbf \gamma _g=0$, the two models are equivalent. Therefore, to test for evidence of a spot-specific library size effect, we compare the two models using the likelihood ratio test (LRT). We performed this test gene-by-gene and the associated p values were recorded. Using the p values from all genes as input, we estimate the proportion of null genes (genes in which the spot-specific library size effect is not needed) using the qvalue function from the qvalue Bioconductor package [22]. Finally, the proportion of non-null genes (genes in which the spot-specific library size effect is needed) is simply calculated as one minus the estimated proportion of null genes. This procedure for estimating the proportion of non-null genes does not directly place a threshold on the q value, which can be arbitrary. Instead, it considers the empirical distribution of the p values and compares this distribution to the theoretical distribution of p values when all genes are null genes, namely the uniform distribution. More details on the procedure can be found in [23].
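The idea of comparing the p value distribution with the uniform distribution can be illustrated with a fixed-threshold version of the null-proportion estimator. This is a simplified sketch, not the qvalue package's procedure (which may, for example, combine several thresholds); the threshold `lam = 0.5` and the function names are assumptions of the example, and the LRT p values are taken as given.

```python
def lrt_statistic(loglik_m1, loglik_m2):
    """Likelihood ratio statistic for nested models, with M1 nested in M2."""
    return 2.0 * (loglik_m2 - loglik_m1)


def estimate_null_proportion(p_values, lam=0.5):
    """Estimate the proportion of null genes from the p values above `lam`.

    Under the null, p values are uniform, so a fraction (1 - lam) of the
    null genes is expected to have p > lam.
    """
    m = len(p_values)
    n_above = sum(1 for p in p_values if p > lam)
    return min(1.0, n_above / (m * (1.0 - lam)))


def proportion_non_null(p_values, lam=0.5):
    """One minus the estimated proportion of null genes."""
    return 1.0 - estimate_null_proportion(p_values, lam)
```

For example, if 80 of 100 genes have very small p values and the remaining 20 are spread evenly over (0, 1), about 10 p values exceed 0.5, giving an estimated null proportion of 10 / (100 × 0.5) = 0.2 and a non-null proportion of 0.8.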

Evaluating region-specific library size effects: grid-based

Each dataset was split into rectangular grids with the size of the grids being dataset-specific because we require a minimum of 300 spots (cells) per grid. This split was only performed to designate each cell to a grid as a proxy for a region. Cells retain their individual observed counts and spatial coordinates and the grid information was only used during the model fitting process (see below).

For each gene g, we fitted the following NB model to the observed counts,

$$\begin{aligned} \log {\mu _{gc}} = \zeta _g + \sum \limits _{j=1}^K \beta _{g,j} I_{[c \in G_j]} + \alpha _g \log LS_c + \sum \limits _{j=1}^K \gamma _{g,j} I_{[c \in G_j]} \log LS_c \end{aligned}$$

where $\beta _{g,j}$ is the coefficient representing the relative biology of grid j in gene g and $I_{[c \in G_j]}$ is the indicator function that cell c belongs to grid j. We test for heterogeneity of the library size effect among the grids ($H_0: \gamma _{g,j} = 0 \forall j$) using Cochran’s Q test [24]. The resulting p values were recorded and the proportion of genes with heterogeneous library size effects was estimated in the same manner as in the region-based models above.
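Cochran's Q statistic itself is straightforward to compute once a library size slope and its standard error are available for each grid. The sketch below assumes exactly that input (per-grid slope estimates and standard errors for one gene); how these are extracted from the fitted NB model is not shown, and the function name is hypothetical.

```python
def cochran_q(estimates, std_errors):
    """Cochran's Q for heterogeneity among per-grid estimates.

    Returns the statistic and its degrees of freedom; under homogeneity,
    Q approximately follows a chi-square distribution with
    (number of grids - 1) degrees of freedom.
    """
    weights = [1.0 / se ** 2 for se in std_errors]
    pooled = sum(w * e for w, e in zip(weights, estimates)) / sum(weights)
    q = sum(w * (e - pooled) ** 2 for w, e in zip(weights, estimates))
    return q, len(estimates) - 1
```

A large Q relative to its degrees of freedom indicates that the library size effect differs among grids for that gene.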

Analysis of variance

One-way analysis of variance (ANOVA) was fitted to each gene with normalized data as a dependent variable and the manually annotated regions as a factor (treatment) variable. The between-treatment and within-treatment variance estimates without and with a particular normalization were compared in log-scale.
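For a single gene, the between-region and within-region variance estimates correspond to the two mean squares of a one-way ANOVA, as in the following minimal sketch (function and variable names are illustrative only).

```python
def anova_mean_squares(values, regions):
    """Return (between-region mean square, within-region mean square).

    values:  normalized expression of one gene, one value per spot.
    regions: region label of each spot, in the same order.
    """
    groups = {}
    for value, region in zip(values, regions):
        groups.setdefault(region, []).append(value)
    n = len(values)
    k = len(groups)
    grand_mean = sum(values) / n
    ss_between = 0.0
    ss_within = 0.0
    for members in groups.values():
        group_mean = sum(members) / len(members)
        ss_between += len(members) * (group_mean - grand_mean) ** 2
        ss_within += sum((v - group_mean) ** 2 for v in members)
    return ss_between / (k - 1), ss_within / (n - k)


def between_within_ratio(values, regions):
    """Ratio of between-region to within-region variance for one gene."""
    ms_between, ms_within = anova_mean_squares(values, regions)
    return ms_between / ms_within
```

A higher ratio means that more of the gene's variation is explained by the annotated regions, which is the sense in which a normalization method retains spatial domain signal.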

Simulation studies

We used the scDesign3 pipeline [12] for simulating SVGs (https://songdongyuan1994.github.io/scDesign3/docs/articles/scDesign3-DEanalysis-vignette.html) using Visium Human DLPFC dataset 1, Xenium Mouse Brain replicate 1, and CosMx Human NSCLC replicate 1 as the input datasets. For each dataset, we used scDesign3 to empirically estimate the SVG signals in the following manner:

We fitted two models for each gene using the scDesign3::fit_marginal function: the first model contains both smooth spatial effects representing the underlying biology and smoothly-varying library size effects (M1), while the second model only contains the smoothly-varying library size effects (M2). The deviance statistics of the two models were calculated, genes were sorted based on the difference of their deviance statistics (M2 deviance − M1 deviance), and the top 100 genes with the largest deviance difference were designated as the true SVGs.
We then simulated the ST data using the empirical M1 model for the designated true SVGs above and the empirical M2 model for the other, non-spatially variable genes.
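The designation of true SVGs from the two sets of deviances amounts to a sort on the deviance difference. A minimal sketch, assuming the per-gene deviances have already been extracted into two dictionaries keyed by gene name (the function name and data layout are hypothetical):

```python
def designate_true_svgs(deviance_m1, deviance_m2, n_top=100):
    """Rank genes by (M2 deviance - M1 deviance) and keep the n_top largest.

    A large difference means that adding the smooth biological spatial
    effect (M1) improves the fit substantially over library size alone (M2).
    """
    gain = {gene: deviance_m2[gene] - deviance_m1[gene] for gene in deviance_m1}
    return sorted(gain, key=gain.get, reverse=True)[:n_top]
```

Genes returned by this function would be simulated from their fitted M1 model and all remaining genes from their fitted M2 model.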

Stably expressed genes

For datasets without negative control probes (all Visium and the Human BRCA Xenium datasets), we used the list of stably expressed genes for humans and mice from the database of housekeeping genes and reference transcripts (https://housekeeping.unicamp.br/) [25]. For the other datasets, we used the negative control probes as stably expressed genes.

Spatial domain identification

The spatial domain identification benchmark outlined in [10] was performed to study the impact of SpaNorm normalization on spatial domain identification. Feature selection was performed on normalized datasets by identifying highly variable genes (HVGs). The top 1000, 2000, and 3000 genes were identified for datasets with genome-wide measurements. Where datasets were obtained using targeted panels, either genes with positive variance estimates from a fitted mean-variance trend or all genes were selected. Dimensionality reduction was performed using principal components analysis. Next, we used three types of clustering algorithm with the normalized data as input: graph-based algorithms (Leiden, Louvain, or Walktrap) from the igraph R package [26], BayesSpace [27], and SpaGCN [28].

For the graph-based algorithms, the graph was built using the buildSNN function from the scran package [6], setting the number of nearest neighbors to 10, 20, 30, or 50. For the Louvain and Leiden algorithms, 8 evenly spaced resolution parameters in the interval [0.1, 1] were assessed. BayesSpace and SpaGCN require the number of clusters to be pre-specified. As this is often unknown, we tested performance with the correct number of clusters and with over- and under-clustering, obtained by perturbing this number by 25%. SpaGCN was deployed from R using the reticulate and zellkonverter packages.
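A small sketch of how such a parameter space might be enumerated is given below. The helper names are hypothetical, and rounding the perturbed cluster numbers to the nearest integer is an assumption, since the rounding rule is not stated here.

```python
from itertools import product


def resolution_grid(low=0.1, high=1.0, n=8):
    """n evenly spaced resolution values covering [low, high]."""
    step = (high - low) / (n - 1)
    return [round(low + i * step, 6) for i in range(n)]


def cluster_counts(k_true, perturb=0.25):
    """Under-clustered, correct, and over-clustered numbers of clusters."""
    under = int(k_true * (1 - perturb) + 0.5)  # assumed: round to nearest integer
    over = int(k_true * (1 + perturb) + 0.5)
    return [under, k_true, over]


neighbors = [10, 20, 30, 50]
# Louvain and Leiden settings: algorithm x number of neighbors x resolution.
graph_settings = list(product(["louvain", "leiden"], neighbors, resolution_grid()))
print(len(graph_settings))
print(cluster_counts(8))
```

Crossing these settings with the normalization methods, feature selection choices, and datasets yields the full benchmark grid.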

The defined parameter space was assessed exhaustively by running all possible combinations (27,971, except for a few failed runs). The CellBench framework was used to deploy the benchmark. The ability of the clustering algorithms to recover spatial domains under different normalization strategies was compared by computing the Adjusted Rand Index (ARI), using the independently annotated spatial regions as ground truth.
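For reference, the ARI between a clustering and the annotated regions can be computed from their contingency table, as in the following sketch. The function name and the toy labels are illustrative only; the benchmark itself was run in R.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(truth, predicted):
    """Adjusted Rand Index between two labelings of the same cells."""
    if len(truth) != len(predicted) or len(truth) < 2:
        raise ValueError("labelings must have equal length of at least 2")
    # Pair counts within contingency cells, rows (truth), and columns (predicted).
    sum_cells = sum(comb(c, 2) for c in Counter(zip(truth, predicted)).values())
    sum_rows = sum(comb(c, 2) for c in Counter(truth).values())
    sum_cols = sum(comb(c, 2) for c in Counter(predicted).values())
    expected = sum_rows * sum_cols / comb(len(truth), 2)
    max_index = 0.5 * (sum_rows + sum_cols)
    if max_index == expected:
        return 1.0  # degenerate case, e.g. both labelings put all cells in one cluster
    return (sum_cells - expected) / (max_index - expected)


regions = ["L1", "L1", "L2", "L2", "WM", "WM"]
clusters = [2, 2, 0, 0, 1, 1]
print(adjusted_rand_index(regions, clusters))
```

The index is invariant to relabeling of clusters, equals 1 for a perfect match, and is close to 0 for a random assignment.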

SVG identification

MERINGUE [29] with default parameters was used to detect spatially variable genes. We selected MERINGUE because it is currently the only method that uses normalized data as input and does not perform any additional normalization, allowing an objective comparison of the impact of different normalization strategies on SVG identification. Additionally, it is based on nonparametric methods, which should not favor a particular normalization method on the basis of its parametric assumptions.

The strength of SVG signals was calculated using a statistic defined as $\frac{|\text{observed} - \text{expected}|}{\text{SD}}$. The concordance of these statistics between a pair of datasets belonging to the same experiment was then calculated using Spearman’s correlation coefficient, and the gene-specific average relative ranking was calculated as the average rank of the gene across the pair of datasets divided by the number of genes.
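The three quantities described above can be sketched as follows. The helper names and toy inputs are hypothetical; ranking in ascending order with tied values sharing their mean rank is an assumption, as the ranking direction and tie handling are not stated here.

```python
from math import sqrt


def svg_strength(observed, expected, sd):
    """|observed - expected| / SD for one gene."""
    return abs(observed - expected) / sd


def average_ranks(values):
    """Ascending ranks starting at 1; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for pos in range(start, end + 1):
            ranks[order[pos]] = mean_rank
        start = end + 1
    return ranks


def spearman(x, y):
    """Spearman correlation, computed as the Pearson correlation of the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / sqrt(var_x * var_y)


def average_relative_rank(stats_a, stats_b):
    """Mean rank of each gene across two datasets, divided by the number of genes."""
    n_genes = len(stats_a)
    ra, rb = average_ranks(stats_a), average_ranks(stats_b)
    return [((a + b) / 2) / n_genes for a, b in zip(ra, rb)]


# Toy statistics for three genes in two replicates of the same experiment.
rep1 = [svg_strength(o, 0.0, 0.1) for o in (0.05, 0.40, 0.20)]
rep2 = [svg_strength(o, 0.0, 0.1) for o in (0.10, 0.50, 0.15)]
print(spearman(rep1, rep2))
print(average_relative_rank(rep1, rep2))
```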

Replicates used to calculate concordance

Human DLPFC Visium set 1: Human DLPFC datasets 1–4
Human DLPFC Visium set 2: Human DLPFC datasets 7–8
Human DLPFC Visium set 3: Human DLPFC datasets 9–12
Mouse Brain Xenium: Mouse Brain Xenium datasets 1–3
Human NSCLC (Lung) CosMx: Human NSCLC (Lung) CosMx datasets 1–3
Human BRCA Xenium: Human BRCA Xenium datasets 1–2

Evaluating the impact of segmentation and volume-based normalization

The four Xenium breast cancer samples (from two datasets) were used to assess the impact of cell segmentation on SpaNorm normalization and to evaluate the difference between library size normalization and area/volume normalization. Data were preprocessed by removing empty cells and cells whose library size or number of detected genes fell in the lowest $10^{\text{th}}$ percentile of the data. The vendor-provided cell segmentation was compared against the probabilistic segmentation algorithm PROSEG [30]. Default parameters were used for PROSEG, with the number of components set to 23. In both cases, the cell area was available and was used as a proxy for volume to assess volume-based normalization.

Data from the two segmentation approaches were then normalized using SpaNorm with 18 degrees of freedom (the K parameter), except for the IDC sample, where K = 17 (the maximum possible for the data). The model was estimated using 5% of the cells in each sample. Standard scran-based size factors were used to adjust for library size effects using SpaNorm, while the volume-based size factor for each cell was computed as the cell area divided by the average area of all cells.
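The volume-based size factor described above amounts to a one-line computation, sketched here with a hypothetical function name and toy cell areas.

```python
def volume_size_factors(areas):
    """Size factor per cell: its area divided by the mean area of all cells."""
    if not areas or any(a <= 0 for a in areas):
        raise ValueError("areas must be a non-empty list of positive values")
    mean_area = sum(areas) / len(areas)
    return [a / mean_area for a in areas]


# Toy cell areas (arbitrary units) from a segmentation output.
factors = volume_size_factors([10.0, 20.0, 30.0])
print(factors)
```

By construction these factors average to 1 across cells, so they rescale counts relative to a cell of average area.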

An ensemble cell type annotation workflow that implements a majority consensus voting strategy was used. This approach used three annotation methods, namely Azimuth (v0.5.0 [31]), CelliD (v1.12.0 [32]), and CHETAH (v1.20.0 [33]). We used CIBERSORTx [34] to generate a single-cell reference for the three methods based on a breast cancer scRNA-seq dataset [35]. Cell typing was performed on the differently segmented and normalized datasets, and the consistency of the resulting cell type proportions was evaluated.
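A majority consensus vote across three annotation methods can be sketched as below. The function name is hypothetical, and labeling cells without a majority as "unassigned" is an assumption, since the handling of three-way disagreements is not described here.

```python
from collections import Counter


def consensus_label(labels, unresolved="unassigned"):
    """Return the label chosen by a strict majority of methods, if any."""
    top_label, top_count = Counter(labels).most_common(1)[0]
    if top_count > len(labels) / 2:
        return top_label
    return unresolved  # assumed fallback when no label wins a majority


# Toy per-cell calls from three methods (e.g. Azimuth, CelliD, CHETAH).
calls = [
    ("T cell", "T cell", "B cell"),
    ("Epithelial", "Stromal", "Myeloid"),
]
print([consensus_label(c) for c in calls])
```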

Data availability

SpaNorm is available as a Bioconductor package (https://doi.org/10.18129/B9.bioc.SpaNorm) [36]. The code and package are released under the GNU General Public License 3.0.

The original datasets used in this study are available from the following sources: Human DLPFC Visium Dataset [19], Xenium Mouse Brain Dataset [37], CosMx Human NSCLC Dataset [38], Stereo-seq Human Brain Dataset [39], and Xenium Human BC Dataset [40] (also see Additional file 1: Table S2).

The processed data are deposited in the Zenodo repositories [41, 42]. The processed data can also be accessed through the SubcellularSpatialData R/Bioconductor data package [43]. Analysis code is deposited in the Zenodo repository [42].

References

1. Stahl PL, Salmen F, Vickovic S, Lundmark A, Navarro JF, Magnusson J, et al. Visualization and analysis of gene expression in tissue sections by spatial transcriptomics. Science. 2016;353(6294):78–82.
2. Janesick A, Shelansky R, Gottscho AD, Wagner F, Williams SR, Rouault M, et al. High resolution mapping of the tumor microenvironment using integrated single-cell, spatial and in situ analysis. Nat Commun. 2023;14(1):8353. https://doi.org/10.1038/s41467-023-43458-x.
3. He S, Bhatt R, Brown C, Brown EA, Buhr DL, Chantranuvatana K, et al. High-plex imaging of RNA and proteins at subcellular resolution in fixed tissue by spatial molecular imaging. Nat Biotechnol. 2022;40(12):1794–806.
4. Chen A, Liao S, Cheng M, Ma K, Wu L, Lai Y, et al. Spatiotemporal transcriptomic atlas of mouse organogenesis using DNA nanoball-patterned arrays. Cell. 2022;185(10):1777–92.
5. Vahid MR, Brown EL, Steen CB, Zhang W, Jeon HS, Kang M, et al. High-resolution alignment of single-cell and spatial transcriptomes with CytoSPACE. Nat Biotechnol. 2023;41(11):1543–8. https://doi.org/10.1038/s41587-023-01697-9.
6. Lun AT, Bach K, Marioni JC. Pooling across cells to normalize single-cell RNA sequencing data with many zero counts. Genome Biol. 2016;17(1):75. https://doi.org/10.1186/s13059-016-0947-7.
7. Hafemeister C, Satija R. Normalization and variance stabilization of single-cell RNA-seq data using regularized negative binomial regression. Genome Biol. 2019;20(1):296. https://doi.org/10.1186/s13059-019-1874-1.
8. Dries R, Zhu Q, Dong R, Eng CHL, Li H, Liu K, et al. Giotto: a toolbox for integrative analysis and visualization of spatial expression data. Genome Biol. 2021;22(1):78. https://doi.org/10.1186/s13059-021-02286-2.
9. Atta L, Clifton K, Anant M, Aihara G, Fan J. Gene count normalization in single-cell imaging-based spatially resolved transcriptomics. bioRxiv. 2024. https://doi.org/10.1101/2023.08.30.555624.
10. Bhuva DD, Tan CW, Salim A, Marceaux C, Pickering MA, Chen J, et al. Library size confounds biology in spatial transcriptomics data. Genome Biol. 2024;25(1):99.
11. Salim A, Molania R, Wang J, De Livera A, Thijssen R, Speed T. RUV-III-NB: normalization of single cell RNA-seq data. Nucleic Acids Res. 2022;50(16):e96–e96. https://doi.org/10.1093/nar/gkac486.
12. Song D, Wang Q, Yan G, Liu T, Sun T, Li JJ. scDesign3 generates realistic in silico data for multimodal single-cell and spatial omics. Nat Biotechnol. 2024;42(2):247–52. https://doi.org/10.1038/s41587-023-01772-1.
13. Fu X, Lin Y, Lin DM, Mechtersheimer D, Wang C, Ameen F, et al. BIDCell: biologically-informed self-supervised learning for segmentation of subcellular spatial transcriptomics data. Nat Commun. 2024;15(1):509. https://doi.org/10.1038/s41467-023-44560-w.
14. Hamilton DJ, White CM, Rees CL, Wheeler DW, Ascoli GA. Molecular fingerprinting of principal neurons in the rodent hippocampus: a neuroinformatics approach. J Pharm Biomed Anal. 2017;144:269–78.
15. Maynard KR, Collado-Torres L, Weber LM, Uytingco C, Barry BK, Williams SR, et al. Transcriptome-scale spatial gene expression in the human dorsolateral prefrontal cortex. Nat Neurosci. 2021;24(3):425–36.
16. Chen Y, Lun AT, Smyth GK. From reads to genes to pathways: differential expression analysis of RNA-Seq experiments using Rsubread and the edgeR quasi-likelihood pipeline. F1000Research. 2016;5:1438.
17. Cook DP, Jensen KB, Wise K, Roach MJ, Dezem FS, Ryan NK, et al. A comparative analysis of imaging-based spatial transcriptomics platforms. bioRxiv. 2023. https://doi.org/10.1101/2023.12.13.571385.
18. Milosevic V. Different approaches to imaging mass cytometry data analysis. Bioinform Adv. 2023;3(1):vbad046.
19. Pardo B, Spangler A, Weber LM, Page SC, Hicks SC, Jaffe AE, et al. spatialLIBD: an R/Bioconductor package to visualize spatially-resolved transcriptomics data. BMC Genomics. 2022;23(1):434.
20. He K, Zhang X, Ren S, Sun J. Delving deep into rectifiers: surpassing human-level performance on ImageNet classification. In: 2015 IEEE International Conference on Computer Vision (ICCV). Los Alamitos: IEEE Computer Society; 2015. pp. 1026–1034. https://doi.org/10.1109/ICCV.2015.123.
21. Kingma DP, Ba J. Adam: a method for stochastic optimization. 2017. https://arxiv.org/abs/1412.6980.
22. Storey JD, Bass AJ, Dabney A, Robinson D. qvalue: Q-value estimation for false discovery rate control. 2023. R package version 2.34.0. https://doi.org/10.18129/B9.bioc.qvalue.
23. Storey JD, Tibshirani R. Statistical significance for genomewide studies. Proc Natl Acad Sci U S A. 2003;100(16):9440–5.
24. Cochran WG. The comparison of percentages in matched samples. Biometrika. 1950;37(3–4):256–66. https://doi.org/10.1093/biomet/37.3-4.256.
25. Hounkpe BW, Chenou F, de Lima F, De Paula E. HRT Atlas v1.0 database: redefining human and mouse housekeeping genes and candidate reference transcripts by mining massive RNA-seq datasets. Nucleic Acids Res. 2020;49(D1):D947–D955. https://doi.org/10.1093/nar/gkaa609.
26. Csárdi G, Nepusz T, Traag V, Horvát S, Zanini F, Noom D, et al. igraph: network analysis and visualization in R. 2024. R package version 4.2.2. https://doi.org/10.5281/zenodo.7682609.
27. Zhao E, Stone MR, Ren X, Guenthoer J, Smythe KS, Pulliam T, et al. Spatial transcriptomics at subspot resolution with BayesSpace. Nat Biotechnol. 2021;39(11):1375–84. https://doi.org/10.1038/s41587-021-00935-2.
28. Hu J, Li X, Coleman K, Schroeder A, Ma N, Irwin DJ, et al. SpaGCN: integrating gene expression, spatial location and histology to identify spatial domains and spatially variable genes by graph convolutional network. Nat Methods. 2021;18(11):1342–51. https://doi.org/10.1038/s41592-021-01255-8.
29. Miller BF, Bambah-Mukku D, Dulac C, Zhuang X, Fan J. Characterizing spatial gene expression heterogeneity in spatially resolved single-cell transcriptomic data with nonuniform cellular densities. Genome Res. 2021;31(10):1843–55.
30. Jones DC, Elz AE, Hadadianpour A, Ryu H, Glass DR, Newell EW. Cell simulation as cell segmentation. bioRxiv. 2024. https://doi.org/10.1101/2024.04.25.591218.
31. Hao Y, Hao S, Andersen-Nissen E, Mauck WM, Zheng S, Butler A, et al. Integrated analysis of multimodal single-cell data. Cell. 2021;184(13):3573–87.
32. Cortal A, Martignetti L, Six E, Rausell A. Gene signature extraction and cell identity recognition at the single-cell level with Cell-ID. Nat Biotechnol. 2021;39(9):1095–102.
33. De Kanter JK, Lijnzaad P, Candelli T, Margaritis T, Holstege FC. CHETAH: a selective, hierarchical cell type identification method for single-cell RNA sequencing. Nucleic Acids Res. 2019;47(16):e95–e95.
34. Newman AM, Steen CB, Liu CL, Gentles AJ, Chaudhuri AA, Scherer F, et al. Determining cell type abundance and expression from bulk tissues with digital cytometry. Nat Biotechnol. 2019;37(7):773–82.
35. Wu SZ, Al-Eryani G, Roden DL, Junankar S, Harvey K, Andersson A, et al. A single-cell and spatially resolved atlas of human breast cancers. Nat Genet. 2021;53(9):1334–47.
36. Bhuva DD, Salim A, Mohammed A. Spatially-aware normalisation for spatial transcriptomics data. 2024. Bioconductor. https://doi.org/10.18129/B9.bioc.SpaNorm.
37. Fresh frozen mouse brain replicates - in situ gene expression dataset by Xenium onboard analysis 1.0.2. 2023. https://www.10xgenomics.com/resources/datasets/fresh-frozen-mouse-brain-replicates-1-standard. Accessed June 2023.
38. CosMx SMI NSCLC FFPE dataset. 2023. https://staging.nanostring.com/products/cosmx-spatial-molecular-imager/ffpe-dataset/nsclc-ffpe-dataset/. Accessed June 2023.
39. Spatiotemporal transcriptomic atlas of mouse organogenesis using DNA nanoball patterned arrays. CNGB Database (Accession Number CNP0001543). 2023. https://doi.org/10.26036/CNP0001543.
40. Janesick A, Shelansky R, Gottscho AD, Wagner F, Williams SR, Rouault M, et al. High resolution mapping of the tumor microenvironment using integrated single-cell, spatial and in situ analysis. Gene Expression Omnibus Database (Accession Number GSE243280). 2023. https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE243280. Accessed June 2023.
41. Bhuva DD, Tan CW, Marceaux C, Pickering M, Salim A, Chen J, et al. Library size confounds biology in spatial transcriptomics data. Zenodo. 2024. https://doi.org/10.5281/zenodo.10516814.
42. Bhuva DD, Salim A. SpaNorm: spatially-aware normalisation for spatial transcriptomics data (accompanying code and data). Zenodo; 2024. https://zenodo.org/records/14387157. Accessed 11 Dec 2024.
43. Bhuva DD. SubcellularSpatialData: annotated spatial transcriptomics datasets from 10x Xenium, NanoString CosMx and BGI STOmics. 2024. Bioconductor. https://doi.org/10.18129/B9.bioc.SubcellularSpatialData.

Acknowledgements

The authors would like to thank Xiaohang Fu for processing the Xenium breast cancer datasets and Marni Torkel for assistance in the creation of Fig. 1E.

Peer review information

Zhana Duren, Veronique van den Berghe, and Kevin Pang were the primary editors of this article and managed its editorial process and peer review in collaboration with the rest of the editorial team. The peer-review history is available in the online version of this article.

Funding

A postgraduate scholarship from Australian Government Research Training Program and a Children’s Medical Research Institute postgraduate scholarship to CC.

Author information

Authors and Affiliations

Melbourne School of Population and Global Health, The University of Melbourne, Melbourne, 3010, VIC, Australia
Agus Salim
Bioinformatics Division, Walter and Eliza Hall Institute of Medical Research, Parkville, 3052, VIC, Australia
Agus Salim, Dharmesh D. Bhuva, Chin Wee Tan & Melissa J. Davis
School of Mathematics and Statistics, The University of Melbourne, Melbourne, 3010, VIC, Australia
Agus Salim
Baker Heart and Diabetes Institute, Melbourne, 3004, VIC, Australia
Agus Salim
South Australian Immunogenomics Cancer Institute, Faculty of Health and Medical Sciences, The University of Adelaide, Adelaide, 5005, SA, Australia
Dharmesh D. Bhuva
Precision Cancer Medicine, South Australian Health and Medical Research Institute (SAHMRI), Adelaide, 5000, SA, Australia
Dharmesh D. Bhuva
Frazer Institute, Faculty of Medicine, The University of Queensland, Woolloongabba, 4102, QLD, Australia
Dharmesh D. Bhuva & Chin Wee Tan
School of Medical Sciences, Faculty of Medicine and Health, The University of Sydney, Sydney, 2006, NSW, Australia
Carissa Chen
Computational Systems Biology Unit, Children’s Medical Research Institute, Westmead, 2145, NSW, Australia
Carissa Chen & Pengyi Yang
Sydney Precision Data Science Centre, The University of Sydney, Sydney, 2006, NSW, Australia
Carissa Chen, Pengyi Yang & Jean Y. H. Yang
Department of Medical Biology, Faculty of Medicine, Dentistry and Health Sciences, The University of Melbourne, Melbourne, 3010, VIC, Australia
Chin Wee Tan
School of Biomedicine, Faculty of Health and Medical Sciences, The University of Adelaide, Adelaide, 5005, SA, Australia
Melissa J. Davis
Isomorphic Labs, London, UK
Melissa J. Davis
School of Mathematics and Statistics, The University of Sydney, Sydney, 2006, NSW, Australia
Pengyi Yang & Jean Y. H. Yang
Charles Perkins Centre, The University of Sydney, Sydney, 2006, NSW, Australia
Pengyi Yang & Jean Y. H. Yang

Contributions

AS and DDB conceptualized the study with input from JYY and MJD. AS and DDB implemented the algorithm, developed the R package, and performed the benchmarking studies. CC and PY performed the simulation studies and contributed to the benchmarking studies. CWT performed the cell typing analysis. AS wrote the first draft of the manuscript with input from JYY. All authors read and approved the final manuscript.

Authors’ X handles

X handles: @asalim_hint (Agus Salim); @bhuva_dd (Dharmesh D. Bhuva); @carissaynchen (Carissa Chen); @chinwee10 (Chin Wee Tan); @jeanyang21 (Jean Yang).

Corresponding authors

Correspondence to Agus Salim or Dharmesh D. Bhuva.

Ethics declarations

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Competing interests

MJD is an employee of Isomorphic Labs, UK. The other authors declare no competing interests.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Agus Salim and Dharmesh D. Bhuva are joint first authors.

Supplementary information

Additional file 1. Supplementary Tables and Figures.

Additional file 2. Supplementary Methods.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.

About this article

Cite this article

Salim, A., Bhuva, D.D., Chen, C. et al. SpaNorm: spatially-aware normalization for spatial transcriptomics data. Genome Biol 26, 109 (2025). https://doi.org/10.1186/s13059-025-03565-y

Received: 31 May 2024
Accepted: 31 March 2025
Published: 29 April 2025
DOI: https://doi.org/10.1186/s13059-025-03565-y
