- Brief Report
- Open access
A fully phased octoploid strawberry genome reveals the evolutionary dynamism of centromeric satellites
Genome Biology volume 26, Article number: 17 (2025)
Abstract
We systematically examine the application of different phasing strategies to decrypt strawberry genome organization and produce a fully phased and accurate reference genome for Fragaria x ananassa cv. “EA78” (2n = 8x = 56). We identify 147 bp canonical centromeric repeats across 50 strawberry chromosomes and uncover the formation of six neocentromeres through centromere turnover. Our findings indicate strawberry genomes have diverged centromeric satellite arrays among chromosomes, particularly across homoeologs, while maintaining high sequence similarity between homologs. We trace the evolutionary dynamics of centromeric repeats and find substantial centromere size expansion in wild and cultivated octoploids compared to the diploid ancestor, F. vesca.
Background
The garden or cultivated strawberry, Fragaria x ananassa, is an allo-octoploid species (2n = 8x = 56) and one of the youngest domesticated fruit crops [1, 2]. Its genome is complex due to its high genomic heterozygosity and ploidy level. Achieving chromosome-scale phased assembly is crucial for understanding the genomic basis of haplotype-specific sequence diversity, long-range regulatory landscape, and molecular interactions, etc. In 2019, a chromosome-scale genome assembly of cultivated strawberry “Camarosa” was released by assembling Illumina reads and PacBio CLR reads [3]. While this assembly served as a valuable resource for studying polyploid evolution and strawberry genetics, it was relatively fragmented. Recent advances in sequencing technologies and phasing assembly algorithms have allowed the generation of more contiguous and even haplotype-resolved genome assemblies (Additional file 1: Table S1). However, most studies to date have given little consideration to the effects of phasing strategies on decrypting the complex strawberry genome organization. Since the recent prevailing phasing assembly algorithms were mainly developed for diploids, the effectiveness of integrating various sequencing data on sorting haplotype-specific sequences at chromosome-scale level has not been systematically determined in polyploid species [4]. Moreover, producing a high-quality fully phased genome reference facilitates the characterization of structurally complex and rapidly evolving regions, such as centromeric satellites, a crucial aspect of plant polyploid genetics.
Centromeres, often referred to as the “dark matter of the genome,” are essential chromosomal components that ensure the accurate separation of chromosomes during mitosis and meiosis, playing a critical role in maintaining genomic stability in eukaryotes [5,6,7]. Centromeric DNA sequences are primarily composed of highly repetitive tandem satellites and interspersed transposable elements, yet their sequences, monomer copy numbers, and array organization can vary substantially [8]. The genomic locations of centromeres are not fixed, and centromere turnover has been observed in several species, including maize [9], potato [10], and Brachypodium [8], etc. Despite the considerable variation in centromere position and DNA sequences, their function is remarkably conserved across species, a phenomenon known as the “centromere paradox” [11, 12].
In plants, centromere function is mediated by the histone H3 variant protein (CENH3), which marks the sites of kinetochore assembly [12]. Centromeric regions can be identified using CENH3 Chromatin Immunoprecipitation Sequencing (ChIP-seq), a technique that enables precise localization of centromeres within the genome [6]. In allopolyploid species, meiotic segregation is particularly complex, as it involves the suppression of multivalent formation between homologous and homoeologous chromosomes, ensuring disomic inheritance [13]. Given its complex genetic composition, high chromosome numbers, and polyploid nature, the octoploid strawberry presents an intriguing model for studying centromere evolution.
Results and discussion
In this study, we generated 43.8 Gb PacBio HiFi reads, 114.5 Gb Nanopore R10 and 17.9 Gb Ultra-long reads, 105.2 Gb Hi-C and 59.9 Gb Illumina paired-end reads for an F1 individual (hereafter referred to as “EA78”) (Additional file 2: Fig. S1; Additional file 1: Table S2) that was derived from synthetic allopolyploids by artificial hybridization between two genetically-distinct F. x ananassa cultivars, “Albion” (paternal) and “Akihime” (maternal) (Fig. 1A). With these comprehensive data, we evaluated the accuracy of different phasing strategies on octoploid strawberry assembly and generated a nearly complete and fully phased octoploid strawberry genome using a two-step assembly strategy.
Fig. 1 The octoploid “EA78” genome assembly and the turnover of centromeric satellites. A Plant materials and experimental design. The parental cultivars are both octoploids with four subgenomes (A, B, C, D), following the subgenome designation of Jin et al. (2023) [14]. Pat: paternal, Mat: maternal. B A two-step phased assembly strategy for achieving global phasing by integrating HiFi, Hi-C, ONT Ultra-long sequencing data, and pedigree information. Note: UL: ultra-long. PSK: paternal-specific k-mers. MSK: maternal-specific k-mers. HER: hamming error rate. C Construction of a nearly complete and fully phased assembly of the “EA78” genome. The genomic locations of putative centromeres are shown on each chromosome. Arrows highlight the six chromosomes lacking canonical centromeric repeats. D The characterization of canonical centromeric repeats. Left: sequence similarity (based on average nucleotide identity, ANI) among centromere repeats of F. vesca, paternal and maternal assemblies, and the sequence used for fluorescent probe design. Middle: “EA78” chromatin number statistics. Right: FISH assay using the type I centromeric satellites as probes (scale bar = 5 μm)
The haplotype phasing and Nanopore-assisted assembly capabilities were examined for octoploid strawberries. First, four assembly pipelines (three single-sample-based and one trio-based) were run to evaluate phasing accuracy (Additional file 2: Fig. S2). For HiFi-only assemblies, both primary/alternative and dual assembly modes generally yielded relatively fragmented phased assemblies with high hamming errors (17.311% | 0.039%; 11.250% | 15.062%, respectively) (Additional file 2: Fig. S2). This contrasted markedly with the HiFi + trio binning assembly, which produced much more continuous and accurate assemblies of both haplotypes (hamming error: 0.034% | 0.037%) (Additional file 2: Fig. S2). Notably, the phasing quality of the HiFi + Hi-C assembly (hamming error: 0.041% | 0.052%) was comparable to that of the HiFi + trio binning assembly (Additional file 2: Fig. S2). However, without pedigree information, one innate ambiguity of the HiFi + Hi-C assembly strategy is that some haplotigs (haplotype-resolved contigs) may not be correctly partitioned into parental phases in the initial assemblies (referred to as local phasing) (Additional file 2: Fig. S2). In contrast, haplotigs generated by the HiFi + trio-binning strategy could achieve fully phased assembly (referred to as global phasing) with continuity compensation (Additional file 2: Fig. S2). Second, the impact of integrating Nanopore ultra-long and R10 reads on genome continuity was examined. While both types of Nanopore long reads significantly enhanced assembly continuity without introducing many extra phasing errors (< 0.02%), the sequence continuity of the HiFi + Hi-C + ultra-long mode was nearly twice that of the HiFi + Trio-binning + ultra-long mode (N90: 21.32–22.55 Mb vs. 8.30–13.75 Mb) (Additional file 2: Fig. S3A).
Compared with HiFi + Hi-C + ultra-long, the assembly graph generated by HiFi + Trio-binning + ultra-long encountered more error-prone nodes when separating haplotypes, potentially leading to a dramatic reduction in the continuity of phased blocks (Additional file 2: Fig. S3B). The impact of HiFi read sequencing depths on assembly quality was also evaluated, showing that as little as 40 × data could ensure a high-level genome continuity (Additional file 2: Fig. S4).
To generate a nearly complete and fully phased genome representation for octoploid strawberry “EA78,” we developed a two-step integrative phasing assembly strategy by using the continuous HiFi + Hi-C + Ultra-long assembly as the backbone and reassigning haplotigs based on trio information, followed by pseudochromosome construction via Hi-C interaction signals (Fig. 1B; Additional file 2: Fig. S6A). The remaining seven gaps were all filled using phased HiFi reads. These efforts resulted in two phased-resolved assemblies with genome sizes very close to their parental estimates (pat: 797.4 vs. 797.1 Mb; mat: 801.4 vs. 806.9 Mb) (Additional file 2: Fig. S5). The two haplotypes exhibited a high level of gene collinearity to the diploid reference, Fragaria vesca [15] (Additional file 2: Fig. S6BC).
The “EA78” genome comprises 98,225 and 96,552 gene models for the paternal and maternal assemblies, with 42.5–42.7% of them being transposable elements (Additional file 1: Table S3). Specifically, Copia and Gypsy retrotransposons account for 5.25–5.26% and 14.02–14.17% of the genome, respectively (Additional file 1: Table S4). Telomeric repeat motifs (“CCCTAAA”) were found in 106 out of 112 chromosome ends (Additional file 2: Fig. S6D). We further performed a series of measures to assess the genome continuity, accuracy, and completeness of the “EA78” genome, revealing 16.1–16.8 LAI values, 65.1 K-mer-based QV, 98.6% K-mer-based completeness, and less than 0.05% hamming error rates (Additional file 1: Table S3; Fig. 1B; Additional file 2: Fig. S6). All metrics indicated that “EA78” had achieved nearly complete and fully phased accurate genome references (Additional file 1: Table S3; Additional file 2: Fig. S6). Mapping HiFi reads from the American cultivar “Royal Royce” and the Asian cultivar “Benihoppe” to the two haplotypes of “EA78” revealed that “Royal Royce” had higher mapping rates to the paternal haplotype, while “Benihoppe” had higher mapping rates to the maternal haplotype (Additional file 2: Fig. S7). This suggested that the “EA78” genomes could serve as good references for genetically diverse strawberry cultivars.
An important question in polyploid biology is the role of centromere divergence in the evolution of homologous and homoeologous chromosomes [16, 17]. Characterization of tandem repeat content of the “EA78” genome revealed seven distinct repeat classes, five of which were shared between the maternal and paternal haplotypes (Additional file 1: Table S5). These repeats vary substantially in size, abundance, and chromosome distribution (Additional file 2: Fig. S8A). The most abundant tandem repeats, totaling 1.04% (16.69 Mb) of the whole genome, are putative centromeric satellites (Additional file 2: Fig. S8A). This tandem repeat class is characterized by a consensus repeat monomer of 147 bp, which showed high sequence similarity to the identified centromeric satellites in Fragaria vesca [18] (ANI = 97.0–99.9%) (Additional file 2: Fig. S8A). Interestingly, six chromosomes (1B_mat, 1C_mat, 1C_pat, 1D_mat, 1D_pat, and 7C_mat) lack these canonical centromeric satellites (Fig. 1C; Additional file 2: Fig. S8B). Examination of the assembly graphs, Hi-C signals and phasing correctness confirmed that all putative centromere regions were traversed, thereby ruling out the possibility of assembly artifacts (Additional file 2: Figs. S9, S10). Fluorescence in situ hybridization (FISH) experiments further confirmed the absence of canonical centromeric satellites in six chromosomes of “EA78” (Fig. 1D) and in eight chromosomes of the maternal line “Akihime” (Additional file 2: Fig. S8C).
To verify the putative centromeric regions and identify potential neocentromeres for chromosomes lacking canonical repeats, we characterized eight CENH3 orthologs (identity: 90.0–95.2%) in “EA78” and designed peptide segments to produce a strawberry-specific CENH3 antibody (Fig. 2A; Additional file 2: Figs. S11, S12). CENH3-ChIP-seq experiments successfully identified all 56 centromeres in “EA78” (Fig. 2B; Additional file 2: Figs. S13, S14; Additional file 1: Tables S6). In most cases, CENH3 loading regions aligned well with the positions of canonical centromere satellites, with one notable exception: chromosome 4C_mat, where the centromeric satellite region (10.2–11.3 Mb) and CENH3 enrichment (11.7–11.8 Mb) were distinct, suggesting recent centromere repositioning (Additional file 2: Fig. S15). Six novel centromeric repeats were identified and grouped into three repeat classes (types II, III, and IV, with type I representing the canonical repeats) (Fig. 2D). As satellites 1B_mat and 7C_mat clustered within the same centromeric repeat class but exhibited relatively low sequence similarity (ANI = 87.9%), we further classified them as satellite IV-1 and satellite IV-2 (Fig. 2E; Additional file 1: Table S7). Interestingly, type I, II, III, and IV satellites showed some sequence similarity to one another (ANI: 80.0–90.0%) (Fig. 2E; Additional file 2: Fig. S17, S18), suggesting a gradual evolutionary trend in the centromeric satellites within octoploid strawberries.
Fig. 2 Phylogenomic comparisons of centromeric repeats among wild and cultivated strawberries. A Investigation and production of a specific antibody for CENH3 in octoploid strawberries. Eight CENH3 homologs were identified in octoploids using the F. vesca CENH3 protein sequence (LOC101294590). The eight CENH3 homologs were compared to determine the optimal polypeptide region for antibody production. B Multiple lines of evidence support the ability of the strawberry-specific CENH3 antibody to recognize centromere regions. Using paternal chromosome 1A as an example, from top to bottom: CENH3 enrichment [log2(ChIP/Input); bin size: 1000 bp] using a common antibody (light blue for original data, dark blue for smoothed data); CENH3 enrichment [log2(ChIP/Input); bin size: 1000 bp] using the CENH3-specific antibody (light red for original data, dark red for smoothed data); distribution of type I centromeric satellite repeats on chromosomes (bin size: 1000 bp); distribution of Hi-C interaction signals (high interaction densities in red, low densities in white). C Sequence similarity matrix for 10 representative centromeres (including six neocentromeres). D Identification and clustering of centromeric satellites. Letters A, B, C, and D represent subgenomes A, B, C, and D, respectively. Letters P and M represent paternal and maternal assemblies. E Average nucleotide identity (ANI) among four centromeric satellite types. Gray blocks indicate ANI values below 80. F ANI similarity matrix for centromere regions of homologous and homoeologous chromosomes. G Structural feature of the centromere region of Chr 5. The heatmap shows a high-resolution sequence similarity matrix of centromere regions (bin size: 2000 bp). H Identification and distribution of centromeric satellite repeats for each chromosome in F. vesca (FVES), F. chiloensis (FCHIL), F. virginiana (FVIRG), and “EA78” (EA). 
Different colors represent centromeric satellite types: red (type I), purple (type II), orange (type III), blue (type IV-1), and light blue (type IV-2). I FISH assay of F. chiloensis and F. virginiana using type I centromeric repeats as probes (scale bar = 5 μm). J Centromeric satellite similarity among F. chiloensis (Fchil), F. virginiana (Fvirg), and “EA78” (EA). Letters A, B, C, and D represent subgenome A, B, C, and D, respectively. Letters p and m represent paternal and maternal assembly, respectively. Numbers 1 and 2 represent haplotypes 1 and 2, respectively. K Ternary diagram showing the relative proportions of centromeric satellites of corresponding chromosomes in the three species. Haplotype 1 corresponds to the paternal assembly, haplotype 2 corresponds to the maternal assembly. Colors represent subgenomes: red (A), blue (B), orange (C), and cyan (D). Dot size represents the number of centromeric satellites. L Comparison of centromeric satellite numbers in the A subgenome of three octoploid species, with each data point representing one chromosome. F. vesca has only one consensus assembly (n = 7), while F. chiloensis, F. virginiana and “EA78” have both haplotypes (n = 14)
Centromeres in “EA78” exhibit striking variation in size among homologous and homoeologous chromosomes, ranging from 63 to 1101.7 kb, with a median size of 316.5 kb (Additional file 2: Fig. S16A). These regions are highly CG methylated (Additional file 2: Fig. S16D), consistent with typical sequence characteristics of centromere satellites observed in most plants [7, 16]. The maternal and paternal chromosome pairs (n = 28) comprise heterogeneous centromeric satellite arrays, with 22 pairs exhibiting size differences greater than 10 kb and 6 pairs differing by 1–10 kb (Additional file 1: Table S6). Centromere size correlates significantly (R² = 0.94, p < 2.2e−16) with the number of centromeric monomers (Additional file 2: Fig. S16B). Moreover, 25 intact LTR retrotransposons were identified across the centromere regions, over 80% of which belong to the Gypsy class. The majority of these retrotransposons showed high sequence similarity between maternal and paternal haplotypes (ANI > 90%) (Additional file 2: Fig. S16C). A bimodal distribution of sequence similarity densities between canonical and neo-centromeres was observed, indicating substantial differences in sequence conservation between these regions (Fig. 2C).
Additionally, centromeres generally exhibited higher sequence similarity between homologs than among homoeologous chromosomes, particularly in Chr3A-D, Chr4A-D, and Chr5A-D (Fig. 2F; Fig. 2G). However, this pattern did not hold for all chromosomes, such as Chr1B, Chr2A, Chr6C, and Chr7C (Fig. 2F), implying that centromeric region similarity alone does not govern exclusive parental chromosome pairing and disomic inheritance during meiosis. Notably, the centromeres of Chr 5B-C and 6B-D were more similar to each other than to the corresponding chromosome from the A subgenome (Fig. 2F; Fig. 2G; Fig. 2J), consistent with the proposed model that F. vesca is the ancestor of the A subgenome, while the B, C, and D subgenomes share a common ancestor with F. iinumae.
The evolutionary history of strawberry centromeric satellite turnover was studied using two wild octoploid progenitors (F. chiloensis and F. virginiana) and six F. ananassa cultivars or hybrids (Additional file 2: Fig. S20). To improve genome assembly for F. chiloensis and F. virginiana [14], we integrated ultra-long sequencing data, resulting in highly continuous assemblies (Additional file 2: Figs. S21, S22). Centromeric regions were identified (Additional file 2: Fig. S23, S24, S25; Additional file 1: Tables S8, S9), and complete centromere turnover was observed in four chromosomes of F. chiloensis and six chromosomes of F. virginiana (Fig. 2H). FISH experiments confirmed the absence of canonical centromeric repeats (Fig. 2I). The timing of centromere turnovers in wild and cultivated strawberries was then studied. In chromosome 1C, type III centromere repeats were detected in both haplotypes of F. chiloensis, F. virginiana, and “EA78,” suggesting centromere turnover in this chromosome occurred before the divergence of wild octoploids. In contrast, for chromosome 1D, type II centromeric repeats were detected in both haplotypes of F. chiloensis but not in F. virginiana (comprising type I repeats) (Fig. 2H). For chromosome 7C, no centromeric turnover was observed in either haplotype of F. chiloensis or F. virginiana (Fig. 2H), suggesting that the centromere turnover in 7C_mat (type IV repeats) likely occurred during strawberry breeding. Unexpectedly, for chromosome 1B, type IV centromere repeats were detected in F. chiloensis and F. virginiana and the maternal haplotype of “EA78” (Fig. 2H), suggesting shifts in centromere repeats in this chromosome. Examining centromere repeats in strawberry cultivars [19] further revealed that the turnover of canonical centromeric repeats in chromosome 1B appeared to be cultivar-dependent (Additional file 2: Fig. S19), suggestive of allelic variation among populations.
When homologous chromosomes were compared across strawberry species, most centromeric satellites were highly similar (ANI > 90%) but not identical, indicating minimal sequence divergence during strawberry polyploidization and domestication (Fig. 2J). In contrast, centromeric satellite amplification was evident in the cultivated strawberry “EA78” compared to the two wild octoploids (Fig. 2K). Notably, centromere length showed significant (p < 0.05) expansion in subgenome A of two wild octoploids relative to F. vesca, and this expansion persisted in cultivated strawberries [20,21,22] (Fig. 2L; Additional file 2: Figs. S26A, S27). This suggests that polyploidization and domestication might promote centromere expansion. Pairwise comparisons revealed that 15% of centromeric bins were identical in four species (Additional file 2: Fig. S28), further supporting the view that centromeric expansion followed strawberry polyploidization, hybridization, and domestication. The mechanisms and effects of this centromere expansion merit further investigation.
Altogether, despite the remarkable karyotypic stability in wild and cultivated octoploid strawberries [14], centromeric satellites represent hotspots of structural variations including monomer copy number changes and satellite turnovers. This genome feature might be linked to the rapid centromere-mediated rediploidization in octoploids.
Conclusions
In summary, by leveraging substantial newly generated sequencing data and developing a two-step integrative phasing assembly strategy, we generated fully phased genome sequences for F. x ananassa cv. “EA78” that are gapless, chromosome-level in contiguity, and have a low hamming error rate (0.042% | 0.042%). These assemblies could serve as good references in current strawberry genomics for population-level sequencing mapping and facilitate our understanding of the evolution of centromeric satellites in octoploid strawberries. However, since this approach relies on sequencing data from multiple platforms, further exploration is needed to develop single-platform and single-sample based genome assembly and to incorporate additional allelic information, such as methylation state, for haplotype phasing, particularly in polyploid species.
Methods
Genome sequencing
The “EA78” line was derived from a cross between two commercial strawberry cultivars, F. x ananassa cv. “Albion” (an American breeding accession with day-neutral flowering habits) and “Akihime” (an Asian breeding accession with short-day flowering habits). Genomic DNA was extracted from young leaves of “EA78” using the CTAB method for whole-genome sequencing across multiple platforms. DNA was also extracted from the parental lines, “Albion” and “Akihime” for Illumina sequencing, as well as from the two wild octoploid progenitors, Fragaria chiloensis and Fragaria virginiana, to improve the continuity of genome assemblies from our previous study [14].
PacBio circular consensus sequencing (CCS)
The genome of “EA78” was first sequenced using the PacBio HiFi Sequel II platform. Briefly, EA78 genomic DNA was size-selected for 15–40 kb with AMPure PB beads and used to construct a SMRTbell library according to the manufacturer's protocol for PacBio 15 kb library preparation (Pacific Biosciences, CA, USA). Sequencing was performed at Grandomics Biosciences Co. (Wuhan, China), generating approximately 43.8 Gb PacBio HiFi reads for downstream analyses (Additional file 1: Table S2).
Nanopore ultra-long and R10 sequencing
Libraries were constructed according to the standard protocols for Nanopore library preparation. Sequencing was performed on a PromethION sequencer (Oxford Nanopore Technologies, Oxford, UK) at Grandomics Biosciences Co. (Wuhan, China). Approximately 17.2–18.7 Gb Nanopore ultra-long reads were generated for “EA78,” F. chiloensis and F. virginiana, respectively. Additionally, approximately 114.5 Gb Nanopore R10 reads were generated for “EA78” and those long reads (> 40 Kb) were retrieved using Filtlong v0.2.1 (https://github.com/rrwick/Filtlong).
Genome size estimation
K-mer frequency and flow cytometry were used to estimate the genome sizes of the F. x ananassa cv. “Albion” and “Akihime” and their F1 hybrid “EA78.” For k-mer-based analysis, an Illumina sequencing library was constructed for each sample and sequenced on the Illumina HiSeq 2500 platform, generating 23.4 Gb, 24.0 Gb, and 59.9 Gb paired-end reads, respectively (Additional file 1: Table S2). The k-mer (21-mers) frequency was calculated using jellyfish v.2.2.10 [23], and genome size was estimated using GenomeScope v1 [23]. For flow cytometry, suspensions of each sample and the internal reference sample (Oryza sativa L. “Japonica”) were mixed, and a BD FACSCalibur flow cytometer was used to detect the stained cell nuclei in the suspension samples. The ploidy level of “EA78” was estimated using Smudgeplot [24].
Hi-C sequencing
Hi-C libraries were constructed from cross-linked chromatin of plant cells following a standard protocol. Briefly, fresh leaves were used for in vivo cross-linking with 2% formaldehyde supplemented into nuclei isolation buffer. The purified nuclei were digested with the HindIII enzyme, and the ligated DNA was sheared and size-selected into 300–600 bp fragments for library construction. Libraries were sequenced on an MGISEQ-T7 device, generating a total of 105.2 Gb Hi-C data (Additional file 1: Table S2).
Genome assembly
Genome assemblies with different sequencing depths
We randomly selected 10X, 20X, 30X, 40X, and 50X HiFi reads for genome assemblies using the HiFi + Hi-C mode implemented in Hifiasm v.0.19.5-r587 [4] with default parameters.
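As an illustration of depth-based subsampling, the following minimal Python sketch draws reads at random until a target depth is reached. It is not the authors' procedure: the function name `subsample_to_depth`, the use of a single genome-size value to define depth, and the fixed random seed are all assumptions made for this example.

```python
import random


def subsample_to_depth(read_lengths, genome_size, depth, seed=0):
    """Pick read indices at random until their summed length reaches depth * genome_size.

    read_lengths: list of read lengths in bp (hypothetical input).
    genome_size:  genome size in bp used to define 1X (assumption).
    depth:        target depth, e.g. 10 for 10X.
    Returns the list of selected read indices; if the data set is smaller
    than the target, every read is returned.
    """
    target_bases = depth * genome_size
    order = list(range(len(read_lengths)))
    random.Random(seed).shuffle(order)  # reproducible random order

    chosen, total = [], 0
    for idx in order:
        if total >= target_bases:
            break
        chosen.append(idx)
        total += read_lengths[idx]
    return chosen
```

Calling the function once per depth (10, 20, 30, 40, 50) with the same seed yields nested subsets, which keeps the depth series comparable; independent seeds would give independent draws instead.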
Genome assemblies with different phasing strategies
All sequenced “EA78” HiFi reads were used to perform de novo genome assembly using Hifiasm v.0.19.5-r587 [4] with four different haplotype phasing modes: mode 1 (the primary/alternative mode), with only the HiFi reads as input and mainly generating a continuous primary assembly and a fragmented alternative assembly; mode 2 (the dual assembly mode), with only the HiFi reads as input and generating two continuous assemblies; mode 3 (the HiFi plus Hi-C mode), with both the HiFi reads and Hi-C data as inputs [4]; mode 4 (the HiFi plus trio-binning mode), with the HiFi reads and trio data as inputs. All the assemblies were generated with default parameters.
Integration of Nanopore reads for phased assembly
We performed de novo genome assembly using Hifiasm v.0.19.5-r587 [4] with four different data combinations: combination 1, with the HiFi reads, Hi-C and Nanopore R10 data as inputs (HiFi + Hi-C + ONT R10); combination 2, with the HiFi reads, Hi-C and Nanopore ultra-long data as inputs (HiFi + Hi-C + ONT ULT); combination 3, with the HiFi reads, Trio-binning and Nanopore R10 data as inputs (HiFi + Trio + ONT R10); combination 4, with the HiFi reads, Trio-binning and Nanopore ultra-long data as inputs (HiFi + Trio + ONT ULT). All the assemblies were generated with default parameters.
Pseudochromosome construction
To investigate structural variations between homologs and homoeologs, we first classified Hi-C reads into three categories (paternal-specific, maternal-specific and shared) using the Hi-C trio-binning pipeline (https://github.com/BGI-Qingdao/HicTrioBinning). To construct the pseudochromosomes for “EA78,” haplotype-specific and shared Hi-C reads were mapped to the corresponding haplotype draft assembly by using BWA v0.7.14-r1188 [25], and the contigs were anchored into the 28 pseudochromosomes by using ALLHiC [26]. The Hi-C interaction signals in each chromosome were manually checked and adjusted with Juicebox [27]. The phased HiFi reads were used for final haplotype assembly gap filling by using LR_Gapcloser (https://github.com/CAFS-bioinformatics/LR_Gapcloser) and TGS-GapCloser [28].
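The three-way classification of reads (paternal-specific, maternal-specific, shared) can be pictured with the simplified Python sketch below. It is not the logic of the HicTrioBinning pipeline, whose internals are not described here: the majority-vote rule, the function names, and the treatment of ties as "shared" are assumptions chosen only to show the idea of binning reads by parent-specific k-mers.

```python
def kmer_set(seq, k):
    """Return the set of all k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def classify_read(seq, paternal_kmers, maternal_kmers, k):
    """Assign a read to 'paternal', 'maternal', or 'shared'.

    paternal_kmers / maternal_kmers: sets of k-mers found in only one parent
    (hypothetical inputs, e.g. derived from parental Illumina reads).
    A read goes to the parent contributing more specific k-mers; reads with
    no specific k-mers, or an equal number from each parent, are 'shared'.
    """
    read_kmers = kmer_set(seq.upper(), k)
    pat_hits = len(read_kmers & paternal_kmers)
    mat_hits = len(read_kmers & maternal_kmers)
    if pat_hits > mat_hits:
        return "paternal"
    if mat_hits > pat_hits:
        return "maternal"
    return "shared"
```

A production tool would also need to handle reverse-complement k-mers and sequencing errors, which this sketch ignores.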
Quality assessment of genome assemblies
Genome continuity, completeness and base accuracy
QUAST v.4.5 [29] was used to evaluate assembly continuity and report contig N50 size and gap numbers. Long terminal repeat retrotransposons (LTR-RTs) were identified by LTR_FINDER v.1.1 [30] and LTRharvest v.1.1 [31] and used to calculate the LTR Assembly Index [32, 33]. Merqury [34] was used to diagnose kmer spectra (19-mers) based on HiFi reads and report per-base quality values (QV) and genome completeness estimates.
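For readers unfamiliar with the continuity statistics used throughout (contig N50 here, N90 in the Results), the following short Python sketch computes a generic Nx value from a list of contig lengths. It follows the standard definition rather than QUAST's source code, and the function name `nx` is our own.

```python
def nx(contig_lengths, x):
    """Return the Nx value: the length L of the shortest contig in the smallest
    set of longest contigs that together cover at least x% of the assembly.

    contig_lengths: iterable of contig lengths in bp.
    x:              percentage, e.g. 50 for N50 or 90 for N90.
    """
    lengths = sorted(contig_lengths, reverse=True)
    total = sum(lengths)
    running = 0
    for length in lengths:
        running += length
        # Integer comparison avoids floating-point rounding at the threshold.
        if running * 100 >= total * x:
            return length
    raise ValueError("contig_lengths must not be empty")
```

Because N90 requires 90% of the assembly to be covered, it is always less than or equal to N50 and is more sensitive to the short-contig tail.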
Haplotype phasing estimates
Merqury [34] was used to identify the unique k-mers (19-mers) from the genetically distinct parental lines, F. x ananassa cv. “Albion” and “Akihime.” Then, the hamming error rate was calculated using the following formula: Hamming error rate = ∑_i min{p_i, m_i} / ∑_i (p_i + m_i), where p_i and m_i are the numbers of paternal- and maternal-specific k-mers on contig or chromosome i [4]. Additionally, HiFi reads were aligned back to their corresponding assemblies (comprising both haplotypes) using minimap2 v.2.17-r941 [35], and a coverage histogram was generated with purge_dups [36] to determine whether the removal of falsely duplicated contigs was necessary.
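The hamming error rate formula above translates directly into a few lines of Python. This is a minimal sketch: the input format (one pair of paternal- and maternal-specific k-mer counts per contig) and the function name are assumptions, and counting the k-mers themselves is left to a tool such as Merqury.

```python
def hamming_error_rate(kmer_counts):
    """Compute sum_i min(p_i, m_i) / sum_i (p_i + m_i).

    kmer_counts: iterable of (p_i, m_i) pairs, the numbers of paternal- and
    maternal-specific k-mers on contig (or chromosome) i.
    For each contig the minority parent's k-mers are treated as wrongly phased.
    """
    pairs = list(kmer_counts)  # allow generators to be traversed twice
    wrong = sum(min(p, m) for p, m in pairs)
    total = sum(p + m for p, m in pairs)
    if total == 0:
        raise ValueError("no parent-specific k-mers found")
    return wrong / total


# Toy example with two contigs (made-up counts, not data from this study):
# one mostly paternal, one mostly maternal.
example_rate = hamming_error_rate([(990, 10), (5, 995)])
```

A perfectly phased contig contributes zero to the numerator, while a contig that mixes both parents equally contributes half of its specific k-mers, so the rate ranges from 0 to 0.5.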
Subgenome assignment
Chromosomes were assigned to subgenomes using diagnostic k-mers, following a previously described method [14].
Genome annotation
A de novo transposable element (TE) library was constructed based on the “EA78” genome (including both haplotype assemblies) using RepeatModeler v.2.0.1 [37]. Repetitive sequences in the “EA78” genomes were then identified using RepeatMasker v.4.1.0 [38] with the combination of the de novo-built TE library and Repbase (v.20181026) with the parameter “-s.” The “EA78” genome was soft-masked before further analyses. The gene models of the “EA78” genome were annotated using Liftoff (v1.6.3) [39]. The F. × ananassa cv. “Royal Royce” reference annotation, which utilized 12.1 Gb of IsoSeq data and 508.0 Gb RNA-seq for transcriptome assembly and downstream annotation [19], was projected onto the assembly. BUSCO v.4.1.2 was used to evaluate the completeness of gene annotation based on the embryophyta_odb10 database [40].
Genome-wide synteny analysis and haplotype-specific variations
Genome-wide collinearity was conducted using the JCVI v.1.0.10 software package (https://github.com/tanghaibao/jcvi) with the default parameters. Whole genome alignment was performed using the minimap2 v.2.17-r941 [35], and the haplotype variations were identified by using SYRI v.1.6.3 [41] with the default parameters.
Identification of telomeres
Telomeric repeat satellites (3′-TTTAGGG/5′-CCCTAAA) were examined within 2-kb regions of “EA78” pseudochromosome ends using Tidk v.0.2.0 [42]. For chromosomes lacking telomeric repeats, we mapped phased long reads (HiFi, R10, Ultra-long reads) onto chromosomes, then extracted the mapped long reads close to the chromosome ends (200 bp region), and re-calculated the number of telomeric satellites. If reads were enriched for telomere satellites, we made attempts to extend the telomeric region using teloExtend.pl (https://github.com/tolkit/nemADSQ).
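The idea of scanning chromosome ends for the telomeric motif can be sketched in Python as follows. This is not how Tidk is implemented; it is a simple exact-match count under our own assumptions (the function names, counting both strands at each end, and ignoring degenerate repeat units).

```python
def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def count_telomere_repeats(chrom_seq, window=2000, motif="CCCTAAA"):
    """Count exact telomeric repeat units in the first and last `window` bp.

    Both the motif and its reverse complement (TTTAGGG for the default) are
    counted at each end. Returns (count_at_start, count_at_end).
    """
    seq = chrom_seq.upper()
    rc_motif = reverse_complement(motif)

    def count_units(region):
        # str.count returns non-overlapping occurrences, which matches
        # counting tandem repeat units.
        return region.count(motif) + region.count(rc_motif)

    return count_units(seq[:window]), count_units(seq[-window:])
```

A chromosome end with a count near zero would be a candidate for the read-based rescue step described above, in which long reads mapped near the end are inspected for telomeric repeats.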
Centromere analysis
Identification and clustering of tandem repeats
HiFi reads and haplotype assemblies were used as inputs to the computational pipeline Centromics, which is designed for identifying, clustering and displaying tandem repeats.
Identification of “EA78” centromeric satellites
A comparison of sequence similarity of the seven tandem repeat clusters (CL1-CL7) was conducted using Blast and FastANI [43] with the parameter “--fragLen 100.” The tandem repeats of the CL3 cluster (147 bp) were most abundant and exhibited the highest average nucleotide identity (ANI) to the identified centromere repeats in Fragaria vesca [18]. Based on this, we identified candidate centromere regions by analyzing the distribution of CL3 repeats across the genome, designating these canonical centromeric satellites as type I centromeric repeats.
Analysis of the structural features of centromeres
Each centromere region was cut into 100 bp bins, and the sequence similarity matrix was calculated using StainedGlass [44].
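The binning and all-versus-all comparison can be sketched in Python as follows. StainedGlass derives identities from sequence alignments; this sketch swaps in a much cruder k-mer Jaccard similarity purely to show the shape of the computation, and the function names and the choice of k = 8 are assumptions.

```python
from typing import List, Set


def split_bins(seq: str, bin_size: int = 100) -> List[str]:
    """Cut a sequence into consecutive bins; the last bin may be shorter."""
    return [seq[i:i + bin_size] for i in range(0, len(seq), bin_size)]


def kmer_set(seq: str, k: int = 8) -> Set[str]:
    """Return the set of k-mers in `seq` (empty if `seq` is shorter than k)."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard similarity of two k-mer sets; 0.0 if either set is empty."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def similarity_matrix(seq: str, bin_size: int = 100, k: int = 8) -> List[List[float]]:
    """Pairwise k-mer similarity between all bins of `seq` (a stand-in for alignment identity)."""
    ksets = [kmer_set(b, k) for b in split_bins(seq.upper(), bin_size)]
    return [[jaccard(x, y) for y in ksets] for x in ksets]
```

In a heatmap of such a matrix, blocks of high off-diagonal similarity correspond to homogenized satellite arrays, which is the kind of structure the identity heatmaps are meant to expose.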
Methylation analysis
Fruits of the cultivated strawberries “Albion” and “Akihime” at the red fruit stage were collected, and total DNA was extracted for whole-genome bisulfite sequencing (WGBS). Low-quality reads were filtered using fastp v.0.20.1 [45]. Genome indexes were constructed using the Bismark program and supporting scripts [46]. Reads were mapped to the corresponding haplotype reference genomes, duplicates arising from PCR over-amplification were removed, and methylation levels of the target regions were calculated using ViewBS [47].
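As an illustration of what a regional methylation level means, the Python sketch below computes one common definition, the weighted methylation level: methylated read calls divided by all read calls over the cytosines in a region. Whether the analysis here used exactly this definition is an assumption, and the input layout (position, methylated count, unmethylated count) is hypothetical.

```python
from typing import Iterable, Optional, Tuple

# (position, methylated_read_count, unmethylated_read_count) for one cytosine
Site = Tuple[int, int, int]


def weighted_methylation(sites: Iterable[Site], start: int, end: int) -> Optional[float]:
    """Weighted methylation level over the half-open region [start, end).

    Returns None when no cytosine in the region is covered by any read.
    """
    methylated = total = 0
    for pos, meth, unmeth in sites:
        if start <= pos < end:
            methylated += meth
            total += meth + unmeth
    return methylated / total if total else None
```

Because the ratio is taken over pooled read counts, deeply covered cytosines contribute more than poorly covered ones, which differs from averaging per-site methylation fractions.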
Fluorescence in situ hybridization (FISH) experiment
Chromosome preparation and FISH experiments for “EA78,” the maternal line “Akihime,” and the two wild octoploids, F. chiloensis and F. virginiana, were performed following an established protocol [48]. Briefly, the treated root tips were fixed in Carnoy’s fixative. The root tips were then digested using an enzyme mixture containing 4% cellulase, 1% pectolyase Y23, and 2% pectolyase at 37 °C for 1.5 h, followed by squashing with a cover slip on a slide. Slides containing well-spread mitotic metaphase chromosomes were selected for FISH experiments. Centromeric DNA was labeled with Dig-dUTP by nick translation. The hybridization signal was detected using rhodamine-conjugated anti-digoxigenin for the digoxigenin-labeled probe. FISH signals and chromosome images were captured using an Olympus DP80 CCD camera attached to an Olympus BX63 fluorescence microscope. All images were processed using the cellSens Dimension 1.9 software, and the final contrast of the grayscale images was further adjusted using Adobe Photoshop CC software.
CENH3 chromatin immunoprecipitation and sequencing (ChIP-seq)
Antigen identification
The CENH3 protein sequence of F. vesca (LOC101294590) was used as a reference to identify multiple homologs and homoeologs in octoploid “EA78” using BLAST v2.10.0 [49]. A comprehensive analysis of the CENH3 variants in “EA78” was performed, leading to the design of the polypeptide region “TGPPTQTQRKKRRNRPG” as an antigen. The polypeptide was evaluated for hydrophobicity, epitope exposure, and immunogenicity.
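A hydrophobicity check of a candidate peptide can be sketched in Python with the Kyte-Doolittle hydropathy scale, averaging the per-residue values into a GRAVY score. The text does not say which scale or tool was used, so the choice of Kyte-Doolittle and the function name are assumptions.

```python
# Kyte-Doolittle hydropathy values for the 20 standard amino acids.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def gravy(peptide: str) -> float:
    """Grand average of hydropathy: mean Kyte-Doolittle value over all residues.

    Raises KeyError on non-standard residues and ValueError on an empty peptide.
    """
    if not peptide:
        raise ValueError("peptide must not be empty")
    return sum(KYTE_DOOLITTLE[aa] for aa in peptide.upper()) / len(peptide)


# Antigen peptide given in the text.
ANTIGEN = "TGPPTQTQRKKRRNRPG"
```

A negative GRAVY score indicates a hydrophilic peptide, which is generally the property sought for a surface-exposed epitope; `gravy(ANTIGEN)` is negative for the peptide above.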
Antibody production and validation
The corresponding peptide was synthesized, emulsified with an adjuvant, and injected into rabbits for immunization. After 7–8 weeks, blood samples were collected, and serum was harvested. Antibodies were purified via peptide-specific affinity chromatography. Their specificity and reactivity against the target antigen were validated using an enzyme-linked immunosorbent assay (ELISA).
CENH3-ChIP-seq
Cells were crosslinked with formaldehyde, nuclei were extracted, and chromatin was sheared via sonication. Immunoprecipitation was performed using the CENH3-specific antibodies, with antibody-chromatin complexes captured using Protein A/G beads. Following multiple washes to remove non-specifically bound material, the complexes were de-crosslinked, and DNA was extracted for sequencing library construction, followed by high-throughput sequencing.
Data processing
Raw CENH3-ChIP-seq data were quality-filtered using fastp v.0.20.1 [45]. Reads were aligned to their corresponding haplotype assemblies using Bowtie2 [50] with the parameters “--very-sensitive --no-mixed --no-discordant --maxins 800”. The best-aligned reads were retained for downstream analysis. CENH3 enrichment in each 1000 bp bin was quantified using bamCompare from the deepTools package [51] with the parameters “--binSize 1000 --outFileFormat bedgraph --operation log2 -p 5 --extendReads”.
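The per-bin log2 enrichment can be illustrated with a minimal Python sketch. It is not the bamCompare implementation: it assumes the comparison is between a ChIP library and a control (input) library, takes plain lists of 0-based read start positions instead of BAM files, scales the ChIP counts by total library size, and adds a pseudocount of 1 to avoid division by zero. All names are hypothetical.

```python
import math
from collections import Counter
from typing import List, Tuple


def bin_counts(read_starts: List[int], bin_size: int = 1000) -> Counter:
    """Count reads per bin, keyed by bin index (read start // bin_size)."""
    return Counter(pos // bin_size for pos in read_starts)


def log2_enrichment(
    chip_starts: List[int],
    control_starts: List[int],
    chrom_len: int,
    bin_size: int = 1000,
    pseudocount: float = 1.0,
) -> List[Tuple[int, int, float]]:
    """Return bedGraph-like rows (start, end, log2 ratio) for one chromosome.

    ChIP counts are scaled so that both libraries have the same total read
    count (an assumed, simplified normalization).
    """
    chip = bin_counts(chip_starts, bin_size)
    control = bin_counts(control_starts, bin_size)
    scale = len(control_starts) / len(chip_starts) if chip_starts and control_starts else 1.0
    rows = []
    for b in range(math.ceil(chrom_len / bin_size)):
        ratio = (chip[b] * scale + pseudocount) / (control[b] + pseudocount)
        rows.append((b * bin_size, min((b + 1) * bin_size, chrom_len), math.log2(ratio)))
    return rows
```

Bins with strongly positive values are those where the ChIP library is over-represented relative to the control, which is how CENH3-occupied regions stand out along a chromosome.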
New centromeric repeats
CENH3 enriched regions were analyzed to identify novel centromeric repeats, resulting in the discovery of six new centromeric satellite sequences. These repeats were grouped into three classes: type II (146 bp), type III (158 bp), and type IV (147 bp). Since the centromeric satellite clusters 1B_mat and 7C_mat were categorized into the same type but exhibited lower sequence similarity (ANI = 87.9), we further subdivided them into repeat types IV-1 and IV-2.
Reassembly of the two wild octoploid progenitors
The newly generated ONT Ultra-long reads and previously sequenced HiFi reads [14] were used to de novo assemble the F. chiloensis and F. virginiana genomes using Hifiasm v.0.19.5-r587 [4] with default parameters, integrating the Hi-C reads [14] to achieve haplotype-resolved and highly continuous assemblies. The construction of pseudochromosomes and the gap-filling process followed the same approach as in the “EA78” assembly.
Phylogenomic comparisons of centromere repeats in diploid and octoploid strawberries
The reference genome assembly of diploid F. vesca [18] was retrieved from the Genome Database for Rosaceae (GDR). To identify centromeric satellites in the genome assemblies of F. chiloensis, F. virginiana, and F. vesca, five representative centromeric repeats (types I, II, III, IV-1, and IV-2) identified in “EA78” were used as references. Homology searches were conducted using MegaBLAST with the parameter “-evalue 1e-4”. The BLAST output was analyzed to quantify the percentage of each centromeric repeat across these three wild strawberry species. To infer centromeric regions, the genome assemblies were divided into 1000 bp windows using the BEDTools “makewindows” function [52], followed by the application of the BEDTools “intersect” function [52] to determine the distribution of centromeric satellites within each window.
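The windowing and overlap-counting step can be sketched in pure Python as follows. This is an illustration of the logic rather than a replacement for BEDTools; the function names are hypothetical, and counting every window that a hit overlaps is an assumed convention for the per-window tally.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # half-open (start, end), 0-based as in BED


def make_windows(chrom_len: int, size: int = 1000) -> List[Interval]:
    """Tile a chromosome with fixed-size windows; the last one may be shorter."""
    return [(s, min(s + size, chrom_len)) for s in range(0, chrom_len, size)]


def count_hits_per_window(
    chrom_len: int, hits: List[Interval], size: int = 1000
) -> List[Tuple[Interval, int]]:
    """Count, for each window, the satellite hits that overlap it by at least 1 bp."""
    windows = make_windows(chrom_len, size)
    counts = [0] * len(windows)
    for start, end in hits:
        # Clip the hit to the chromosome and skip anything empty or out of range.
        s, e = max(start, 0), min(end, chrom_len)
        if e <= s:
            continue
        for w in range(s // size, (e - 1) // size + 1):
            counts[w] += 1
    return list(zip(windows, counts))
```

Runs of consecutive windows with high counts then mark the satellite-dense stretches that serve as inferred centromeric regions.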
Data availability
All raw genome sequencing data and genome files have been submitted to the National Genomics Data Center (https://ngdc.cncb.ac.cn/) under BioProjects PRJCA015503 [53] and PRJCA032783 [54]. Public strawberry genome data were retrieved from the Genome Database for Rosaceae (GDR) [54]. The genome assembly and annotation files of F. x ananassa “EA78” and the updated assemblies of F. chiloensis and F. virginiana have also been submitted to GDR.
References
Liston A, et al. Fragaria: a genus with deep historical roots and ripe for evolutionary and ecological insights. Am J Bot. 2014;101(10):1686–99.
Whitaker VM, et al. A roadmap for research in octoploid strawberry. Hortic Res. 2020;7:33.
Edger PP, et al. Origin and evolution of the octoploid strawberry genome. Nat Genet. 2019;51:541–7.
Cheng H, et al. Haplotype-resolved assembly of diploid genomes without parental data. Nat Biotechnol. 2022;40:1332–5.
Comai L, et al. Plant centromeres. Curr Opin Plant Biol. 2017;36:158–67.
Liu Y, et al. Pan-centromere reveals widespread centromere repositioning of soybean genomes. Proc Natl Acad Sci U S A. 2023;120:e2310177120.
Naish M, Henderson IR. The structure, function, and evolution of plant centromeres. Genome Res. 2024;34:161–78.
Chen C, et al. Three near-complete genome assemblies reveal substantial centromere dynamics from diploid to tetraploid in Brachypodium genus. Genome Biol. 2024;25:63.
Chen J, et al. A complete telomere-to-telomere assembly of the maize genome. Nat Genet. 2023;55:1221–31.
Bao Z, et al. Genome architecture and tetrasomic inheritance of autotetraploid potato. Mol Plant. 2022;15:1211–26.
Henikoff S, et al. The centromere paradox: stable inheritance with rapidly evolving DNA. Science. 2001;293:1098–102.
Zhou J, et al. Centromeres: from chromosome biology to biotechnology applications and synthetic genomes in plants. Plant Biotechnol J. 2022;20:2051–63.
Lloyd A, Bomblies K. Meiosis in autopolyploid and allopolyploid Arabidopsis. Curr Opin Plant Biol. 2016;30:116–22.
Jin X, et al. Haplotype-resolved genomes of wild octoploid progenitors illuminate genomic diversifications from wild relatives to cultivated strawberry. Nat Plants. 2023;9:1252–66.
Edger PP, et al. Single-molecule sequencing and optical mapping yields an improved genome of woodland strawberry (Fragaria vesca) with chromosome-scale contiguity. Gigascience. 2018;7:gix124.
Soltis DE, et al. What we still don’t know about polyploidy. Taxon. 2010;59:1387–403.
Melters DP, et al. Comparative analysis of tandem repeats from hundreds of species reveals unique insights into centromere evolution. Genome Biol. 2013;14:1–20.
Zhou Y, et al. The telomere-to-telomere genome of Fragaria vesca reveals the genomic evolution of Fragaria and the origin of cultivated octoploid strawberry. Hortic Res. 2023;10:uhad027.
Hardigan MA, et al. Blueprint for phasing and assembling the genomes of heterozygous polyploids: application to the octoploid genome of strawberry. bioRxiv. 2021.
Cauret CMS, et al. Chromosome-scale assembly with a phased sex-determining region resolves features of early Z and W chromosome differentiation in a wild octoploid strawberry. G3. 2022;12:jkac139.
Song Y, et al. Phased gap-free genome assembly of octoploid cultivated strawberry illustrates the genetic and epigenetic divergence among subgenomes. Hortic Res. 2023;11:uhad252.
Mao J, et al. High-quality haplotype-resolved genome assembly of cultivated octoploid strawberry. Hortic Res. 2023;10:uhad002.
Vurture GW, et al. GenomeScope: fast reference-free genome profiling from short reads. Bioinformatics. 2017;33:2202–4.
Ranallo-Benavidez TR, et al. GenomeScope 2.0 and Smudgeplot for reference-free profiling of polyploid genomes. Nat Commun. 2020;11:1432.
Li H, Durbin R. Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics. 2009;25:1754–60.
Zhang X, et al. Assembly of allele-aware, chromosomal-scale autopolyploid genomes based on Hi-C data. Nat Plants. 2019;5:833–45.
Dudchenko O, et al. De novo assembly of the Aedes aegypti genome using Hi-C yields chromosome-length scaffolds. Science. 2017;356:92–5.
Xu M, et al. TGS-GapCloser: a fast and accurate gap closer for large genomes with low coverage of error-prone long reads. Gigascience. 2020;9:giaa094.
Gurevich A, et al. QUAST: quality assessment tool for genome assemblies. Bioinformatics. 2013;29:1072–5.
Ou S, Jiang N. LTR_FINDER_parallel: parallelization of LTR_FINDER enabling rapid identification of long terminal repeat retrotransposons. Mob DNA. 2019;10:48.
Ellinghaus D, et al. LTRharvest, an efficient and flexible software for de novo detection of LTR retrotransposons. BMC Bioinformatics. 2008;9:18.
Ou S, Jiang N. LTR_retriever: a highly accurate and sensitive program for identification of long terminal repeat retrotransposons. Plant Physiol. 2018;176:1410–22.
Ou S, et al. Assessing genome assembly quality using the LTR Assembly Index (LAI). Nucleic Acids Res. 2018;46:e126.
Rhie A, et al. Merqury: reference-free quality, completeness, and phasing assessment for genome assemblies. Genome Biol. 2020;21:245.
Li H. Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics. 2018;34:3094–100.
Guan D, et al. Identifying and removing haplotypic duplication in primary genome assemblies. Bioinformatics. 2020;36:2896–8.
Smit A, Hubley R. RepeatModeler Open-1.0. 2010. http://www.repeatmasker.org/RepeatModeler/.
Tarailo-Graovac M, Chen N. Using RepeatMasker to identify repetitive elements in genomic sequences. Curr Protoc Bioinformatics. 2009;25:4.10.1–4.10.14.
Shumate A, Salzberg SL. Liftoff: accurate mapping of gene annotations. Bioinformatics. 2021;37:1639–43.
Simão FA, et al. BUSCO: assessing genome assembly and annotation completeness with single-copy orthologs. Bioinformatics. 2015;31:3210–2.
Goel M, et al. SyRI: finding genomic rearrangements and local sequence differences from whole-genome assemblies. Genome Biol. 2019;20:277.
Brown M, González De la Rosa PM, Mark B. A telomere identification toolkit. Zenodo. 2023.
Hernández-Salmerón JE, Moreno-Hagelsieb G. FastANI, mash and dashing equally differentiate between Klebsiella species. PeerJ. 2022;10:e13784.
Vollger MR, et al. StainedGlass: interactive visualization of massive tandem repeat structures with identity heatmaps. Bioinformatics. 2022;38:2049–51.
Chen S, et al. fastp: an ultra-fast all-in-one FASTQ preprocessor. Bioinformatics. 2018;34:i884–90.
Krueger F, Andrews SR. Bismark: a flexible aligner and methylation caller for Bisulfite-Seq applications. Bioinformatics. 2011;27:1571–2.
Huang X, et al. ViewBS: a powerful toolkit for visualization of high-throughput bisulfite sequencing data. Bioinformatics. 2018;34:708–9.
Huang Y, et al. The formation and evolution of centromeric satellite repeats in Saccharum species. Plant J. 2021;106:616–29.
Altschul SF, et al. Basic local alignment search tool. J Mol Biol. 1990;215:403–10.
Langmead B, et al. Scaling read aligners to hundreds of threads on general-purpose processors. Bioinformatics. 2019;35:421–32.
Ramírez F, et al. deepTools: a flexible platform for exploring deep-sequencing data. Nucleic Acids Res. 2014;42:W187–91.
Quinlan AR, Hall IM. BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics. 2010;26:841–2.
Jin X, et al. A fully phased octoploid strawberry genome reveals the evolutionary dynamism of centromeric satellites. Datasets. 2024. https://ngdc.cncb.ac.cn/bioproject/browse/PRJCA015503.
Jung S, et al. 15 years of GDR: new data and functionality in the genome database for Rosaceae. Nucleic Acids Res. 2019;47:D1137–45. Datasets. https://www.rosaceae.org/species/fragaria/all.
Acknowledgements
The authors gratefully acknowledge two colleagues from Yunnan Academy of Agricultural Sciences, Dr. Jiwei Ruan and Hong Wang, for their assistance with crossing experiments, sample collection, and maintenance of strawberry germplasm.
Peer review information
Andrew Cosgrove was the primary editor of this article and managed its editorial process and peer review in collaboration with the rest of the editorial team. The peer-review history is available in the online version of this article.
Funding
This study was jointly supported by the High-Level Talent Program of Yunnan Province (YDYC20170138), the Key and Major Program for Basic Research Project (202401AS070094, 202401BC070001), and the National Natural Science Foundation of China (no. 32270245).
Author information
Contributions
A.Z. conceived the project and designed the research. X.J., A.Z., and H.D. performed experiments, conducted analyses, and interpreted results. M.C., Y.H., and X.Z. provided analysis tools. X.J. and A.D. wrote the manuscript with input from all authors.
Ethics declarations
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
The authors declare no competing interests.
Additional information
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary Information
13059_2025_3482_MOESM1_ESM.xlsx
Additional file 1: Supplementary Tables S1-S9. This file contains the supplementary tables referenced in the main text. Table S1. Summary of sequenced octoploid strawberry genomes. Table S2. Summary of sequencing platforms and raw data. Table S3. Summary of genome features and quality evaluation of the EA78 genome. Table S4. Summary of transposable elements identified in the “EA78” genome assemblies. Table S5. Identification and clustering of tandem repeats in “EA78.” Table S6. Identified centromere region of each chromosome of “EA78.” Table S7. Sequences of centromeric satellites. Table S8. Predicted F. chiloensis centromere region of each chromosome. Table S9. Predicted F. virginiana centromere region of each chromosome.
13059_2025_3482_MOESM2_ESM.pdf
Additional file 2: Supplementary Figs. S1-S28. This file contains the supplementary figures referenced in the main text. Fig. S1. “EA78” genome survey and sequencing. Fig. S2. Evaluation of different assembly strategies for phasing the octoploid “EA78” genome. Fig. S3. Evaluation of the effects of integrating Nanopore long reads on “EA78” phased assemblies. Fig. S4. Effects of different sequencing depths on assembly quality. Fig. S5. Two-step phased assembly strategy. Fig. S6. Evaluation of “EA78” genome assembly quality. Fig. S7. Mapping HiFi reads of American and Arian to the “EA78” reference assemblies. Fig. S8. Identification of putative centromeres in “EA78.” Fig. S9. Validation of canonical centromeric satellite turnover on chromosomes 1B, 1C, 1D, and 7C. Fig. S10. Validation of the absence of canonical centromeric satellite in maternal chromosome 1B and its lack of correlation with phasing errors. Fig. S11. Inspection of CENH3-occupied region assembly. Fig. S12. CENH3 antigen survey and antibody production. Fig. S13. Identification of centromere regions in the paternal assembly of “EA78.” Fig. S14. Identification of centromere regions in the maternal assembly of “EA78.” Fig. S15. Centromere repositioning. Fig. S16. Genomic features of “EA78” centromeres. Fig. S17. Sequence similarity and divergence of four centromeric satellite types. Fig. S18. Sequence similarity of centromeric satellites among “EA78” subgenomes. Fig. S19. Comparison of the main satellite types across chromosomes among Asian, Asian and European-American, European-American varieties. Fig. S20. Simplified evolutionary history of hybridization and domestication of cultivated strawberry. Fig. S21. Reassembly of the two wild octoploid progenitors (F. chiloensis and F. virginiana) to improve genome continuity. Fig. S22. Heat map of Hi-C interactions of the two updated haplotype assemblies of F. chiloensis and F. virginiana. Fig. S23. Centromere structural features of the wild octoploid strawberry F. chiloensis. Fig. S24. Centromere structural features of the wild octoploid strawberry F. virginiana. Fig. S25. Centromere structural features of the cultivated strawberry “EA78.” Fig. S26. Centromere length variations among wild and cultivated strawberry species. Fig. S27. Comparison of the number of centromeric satellite counts in F. vesca and the A subgenome of octoploid strawberries. Fig. S28. Identical centromeric bin analysis.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Jin, X., Du, H., Chen, M. et al. A fully phased octoploid strawberry genome reveals the evolutionary dynamism of centromeric satellites. Genome Biol 26, 17 (2025). https://doi.org/10.1186/s13059-025-03482-0
DOI: https://doi.org/10.1186/s13059-025-03482-0