TDFPS-Designer: an efficient toolkit for barcode design and selection in nanopore sequencing

Qi, Junhai; Li, Zhengyi; Zhang, Yao-zhong; Li, Guojun; Gao, Xin; Han, Renmin

doi:10.1186/s13059-024-03423-3

Software
Open access
Published: 04 November 2024

Junhai Qi¹^na1,
Zhengyi Li¹^na1,
Yao-zhong Zhang²,
Guojun Li¹,
Xin Gao³ &
…
Renmin Han ORCID: orcid.org/0000-0003-4761-6526¹

Genome Biology volume 25, Article number: 285 (2024)

Abstract

Oxford Nanopore Technologies (ONT) offers ultrahigh-throughput multi-sample sequencing but only provides barcode kits that enable up to 96-sample multiplexing. We present TDFPS-Designer, a new toolkit for nanopore sequencing barcode design, which creates significantly more barcodes: 137 with a length of 20 base pairs, 410 at 24 bp, and 1779 at 30 bp, far surpassing ONT’s offerings. It includes GPU-based acceleration for ultra-fast demultiplexing and designs robust barcodes suitable for high-error ONT data. TDFPS-Designer outperforms current methods, improving the demultiplexing recall rate by 20% relative to Guppy, without a reduction in precision.

Background

Recently, single-molecule sequencing based on ONT has emerged, offering long reads, point-of-care use, and freedom from polymerase chain reaction (PCR) amplification. ONT has been widely applied in various research fields, including genome assembly [1,2,3,4,5,6], transcriptome assembly [7,8,9], methylation research [10,11,12], and mutation identification [13,14,15]. To efficiently utilize sequencing capacity and reduce sequencing costs, multiple DNA/RNA samples can be tagged with unique barcodes and sequenced simultaneously on a flow cell [16]. After sequencing, demultiplexing is necessary to classify the sequences according to their corresponding barcodes. To address the demultiplexing problem, several methods have been introduced in recent years, such as DeepBinner [17] and DeePlexiCon [18]. These methods utilize convolutional neural networks (CNNs) to directly process the native nanopore signals for demultiplexing, improving upon traditional sequence-based tools like Porechop. However, they do not explore which barcodes are most conducive to effective demultiplexing. Currently, ONT provides a barcode kit (EXP-PBC096) that supports the simultaneous sequencing of up to 96 samples. As the number of samples increases, an additional strategy is needed for large-capacity multiple-sample sequencing [19]. A direct solution is to design dedicated barcodes for accurate and large-capacity sample demultiplexing.

Barcode design can be viewed as an error-correcting code design problem, and related theories have been developed since the 1970s [20, 21]. To address the needs of high-throughput next-generation sequencing, Hamming codes and Reed-Solomon code barcodes have been introduced into DNA barcode design. Hamady et al. [22] developed a new set of barcodes based on error-correcting codes. Zorita et al. [23] described an exact algorithm to determine which pairs of sequences lie within a given Levenshtein distance. Hawkins et al. [24] presented and experimentally validated filled/truncated right-end edit (FREE) barcodes, which corrected substitution, insertion, and deletion errors for next-generation sequencing. Although numerous barcode schemes have been proposed, these schemes are designed on the prerequisite that the sequencing error rate is very low (less than 1%), which means that these schemes are likely not applicable to third-generation sequencing data with higher sequencing error (∼6–15% [5]). In the context of nanopore sequencing, [25] utilized an evolutionary model to design 96 “Molbit barcodes” that ensured dissimilarity in nanopore electrical signals. A specially trained convolutional neural network (CNN) was employed to accurately demultiplex these barcodes. However, this approach did not produce a kit with a larger capacity than the ONT barcode kit, limiting its utility for multi-sample sequencing involving a greater number of samples.

Barcode design must observe two key principles, i.e., large barcode capacity and high sequence difference. For ONT sequence data, the measure of sequence difference could be based on either the raw current signal or base-called nucleotides. Edit distance [26] can effectively measure the similarity between two DNA sequences. However, relying solely on edit distance for demultiplexing can result in the loss of a significant amount of useful data. To further improve edit distance, some approaches take into account the quality score of each base obtained after sequencing. This score is highly correlated with the probability that the base has been correctly sequenced. Some quality-aware probabilistic methods that account for these quality scores have been applied to sequence error correction [27] and demultiplexing problems [28] in next-generation sequencing (NGS). Many alignment-free similarity measures have also been proposed [29,30,31,32]. In contrast, signal-based approaches [17, 18, 33, 34] have been widely utilized in direct nanopore sequence analysis, most of which are based on the dynamic time warping (DTW) algorithm to measure the signal difference [35,36,37]. Just as probabilistic methods account for substitution errors in NGS, DTW addresses inherent error profiles by directly comparing raw nanopore signals.

In this study, we propose a Designer for a barcode kit that employs a well-defined Threshold to reduce the sampling space of the DTW-based Farthest Point Sampling algorithm (TDFPS-Designer) for accurate barcoded sample demultiplexing in nanopore sequencing. TDFPS-Designer selects barcodes within a given sequence space by the farthest point sampling algorithm, directly based on the comparison of nanopore signals. Additionally, a DTW distance-based demultiplexing strategy is designed to ensure accurate sample label assignment. Three barcode kits with different barcode lengths were designed by TDFPS-Designer. Experiments demonstrated that TDFPS-Designer is capable of designing barcode sets with $\ge 99\%$ demultiplexing accuracy, superior to the randomly selected barcodes and ONT official strategy. Specifically, there is almost no “collision” during the demultiplexing of TDFPS-Designer’s barcode set. When demultiplexing large-capacity samples with high sequencing error rates, the demultiplexing recall of TDFPS-Designer’s barcode kit is approximately 20% higher than that of current official ONT tools, which provides an alternative for the demultiplexing of barcode kits with high sequencing error.

Results

Algorithms overview

TDFPS-Designer selects barcode candidates from a specified set of an entire k-mer space or user-defined sequences. The workflow of TDFPS-Designer is illustrated in Fig. 1. The sampling space is first reduced to a subset of sufficiently distinct sequences, such that the DTW distance between any two sequences in the subset is greater than the threshold r (Fig. 1a). Here, we begin by randomly selecting a sequence, and the relationship between the randomly selected sequence and the final set of barcodes is explored in detail in Additional File 1: S1. The demultiplexing strategy of TDFPS-Designer is depicted in Fig. 1b, where demultiplexing is performed directly from the DTW distance matrix. To further enhance the robustness of the initially designed barcodes, our method can automatically simulate the demultiplexing process and filter out barcodes with poor demultiplexing performance, resulting in the final set of designed barcodes (Fig. 1c). The use of TDFPS-Designer is described comprehensively in Additional File 1: S2.

Benchmark datasets

All simulated datasets were generated by DeepSimulator1.5, squigulator [38], Badread [39], and our own multisample sequencing simulator. The simulated electrical signals achieve ∼88–92% base-calling accuracy (Additional File 1: S3), while real-world electrical signals from MinION R9.4 achieve ∼85–94% sequencing accuracy [5]. The overlap of these ranges suggests that, in terms of base-calling accuracy, the simulated signals are a reasonable proxy for real signals. In addition, we carefully studied the construction process of the ONT multisample sequencing library to ensure maximum consistency between the simulated data and the real data. We also employed Badread to produce sequences with different sequencing error rates. This allowed us to examine the impact of sequencing error rates on demultiplexing, which influences both the capacity of the resulting barcode kit and the selection of demultiplexing strategies.

A few small datasets were generated for the initial evaluation of our barcode design strategy (Additional File 1: S4.1). On the other hand, we thoroughly investigated the library preparation process of ONT and integrated it into our data simulator. The detailed use of the simulator and the introduction of parameters are given in Additional File 1: S4.2. Based on our data simulator, we generated different types of datasets, and the details of these multisample sequencing datasets are shown in Table 1. The detailed data generation process can be found in Additional File 1: S4.3.

Table 1 Datasets for evaluating all methods

Evaluation metrics

The goal of our approach is to design a sufficient number of barcodes of different lengths that can be easily demultiplexed. For this purpose, we evaluate the demultiplexing performance of different demultiplexing algorithms (Guppy and our method) on our designed barcode kits using precision, recall, average accuracy, and F1-score, which reflect whether the barcodes we designed can be easily demultiplexed. Precision is the fraction of instances predicted as positive that are truly positive; in simple terms, it reflects the credibility of the model’s positive predictions. Recall, also known as the true positive rate or sensitivity, is the fraction of truly positive instances that the model captures. Each barcode has its own precision, recall, and F1-score. For example, suppose that after demultiplexing the reads $\{read_1, read_2, \ldots, read_n\}$ are assigned to a given barcode, and that B is the set of reads that actually carry this barcode; the recall for this barcode is then $\frac{\left| \{read_1, read_2, \ldots, read_n\} \cap B \right| }{\left| B \right| }$. Once we obtain the metrics for all barcodes, the average of all accuracy rates is recorded as the average accuracy, the smallest precision (recall, F1-score) across barcodes is recorded as the minimum precision (recall, F1-score), and the second-smallest value is recorded as the minimum-2 precision (recall, F1-score). When working with numerous barcodes, a high average accuracy does not necessarily mean a consistently high accuracy across all barcodes: the algorithm may perform well for most barcodes but poorly for specific ones, so average accuracy alone does not offer a complete assessment of demultiplexing effectiveness. By considering the minimum/minimum-2 precision (recall, F1-score) alongside the average accuracy, we gain a more comprehensive understanding of the algorithm’s performance. All metrics are calculated using the “sklearn” Python package.
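The per-barcode metrics and their minimum/minimum-2 summaries can be illustrated with a short sketch. It is written in plain Python rather than with sklearn so that it is self-contained; the function names `per_barcode_metrics` and `min_and_min2` are hypothetical and do not come from the TDFPS-Designer code base.

```python
def per_barcode_metrics(true_labels, pred_labels):
    """Return {barcode: (precision, recall, f1)} for every label seen."""
    labels = sorted(set(true_labels) | set(pred_labels))
    metrics = {}
    for b in labels:
        # Reads correctly assigned to barcode b.
        tp = sum(t == b and p == b for t, p in zip(true_labels, pred_labels))
        n_pred = sum(p == b for p in pred_labels)   # reads assigned to b
        n_true = sum(t == b for t in true_labels)   # reads that carry b (|B|)
        precision = tp / n_pred if n_pred else 0.0
        recall = tp / n_true if n_true else 0.0
        total = precision + recall
        f1 = 2 * precision * recall / total if total else 0.0
        metrics[b] = (precision, recall, f1)
    return metrics


def min_and_min2(metrics, index):
    """Smallest and second-smallest value of one metric across barcodes.

    index: 0 = precision, 1 = recall, 2 = F1-score.
    """
    values = sorted(m[index] for m in metrics.values())
    second = values[1] if len(values) > 1 else values[0]
    return values[0], second


if __name__ == "__main__":
    truth = ["b1", "b1", "b1", "b2", "b2", "b3"]
    called = ["b1", "b1", "b2", "b2", "b2", "b3"]
    m = per_barcode_metrics(truth, called)
    print(m["b1"], min_and_min2(m, 1))
```

Reporting the two smallest values next to the average makes a single poorly separated barcode visible even when hundreds of other barcodes are demultiplexed almost perfectly.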

Experimental environment

All the experiments were run on an Ubuntu 18.04.6 system with an Intel(R) Xeon(R) Platinum 8260 CPU, 1 TB of memory, and an NVIDIA A100-PCIE-40GB GPU.

TDFPS-Designer can effectively extract the barcode region from the raw nanopore signal to ensure accurate demultiplexing results

We assessed the effectiveness of our barcode extraction strategy by calculating the DTW distance between the extracted barcode signals and the standard barcode signals. To generate experimental data, we obtained 12,000 extracted barcode signals and 1000 randomly intercepted signals, from which we obtained two distance matrices (Fig. 2a). Based on these matrices, we generated two different distance distributions (Fig. 2c). As shown in Fig. 2c (right), the probability that the distance between a randomly intercepted signal and the standard barcode signal is less than 110 is very low (∼0.0061). In contrast, Fig. 2c (left) shows that 94.35% of the DTW distances between the extracted barcode signals and the standard barcode signals are less than 110, indicating that our extraction strategy is highly effective. In terms of efficiency, a single thread can extract the barcode regions of approximately 255 sequences per second.

TDFPS-Designer can design specialized barcodes for different sequencers

ONT offers various sequencers, such as the MinION and PromethION sequencers, which can be run with different chemistries, such as R9.4 and R10.4. R9.4 has been widely adopted and shows mature and stable performance, while R10.4 aims to further enhance sequencing accuracy; the two chemistries may generate different nanopore signals (Fig. 3a). For each sequencer, we designed 96 barcodes, each 20 bp in length, matching the capacity of the ONT barcode kit but with shorter lengths. We generated different types of nanopore signals (using squigulator) based on these barcodes (with 100 simulated signals per barcode) and used TDFPS-Designer for demultiplexing. The results showed that these barcodes could be accurately demultiplexed (Fig. 3b), suggesting that our algorithm can customize barcode kits for different sequencers. In the subsequent analysis, we primarily discuss the barcodes designed for the MinION R9.4.

Barcodes designed by TDFPS-Designer are easier to demultiplex than randomly selected barcodes

In biological experiments, barcodes are often randomly selected as short DNA fragments using various methods, such as random nucleic acid synthesis or selection from existing barcode libraries. We evaluated the effectiveness of our barcode design strategy based on the accuracy of demultiplexing. We first used both a random strategy and TDFPS-Designer to design 100 barcodes, each 15 bp in length, and evaluated their demultiplexing performance (Fig. 4a). Some barcodes generated by the random strategy could not be demultiplexed accurately, with a precision of less than 0.84, which is ∼10% lower than that of the TDFPS-Designer barcodes, whereas every barcode designed by TDFPS-Designer could be accurately demultiplexed. Additionally, we used both strategies to generate 96 barcodes, each 24 bp in length, and compared them with ONT barcodes (Fig. 4b). The results showed that all three types of barcodes had stable demultiplexing performance, which can be attributed to the large sequence space of the 24 bp barcodes, leading to a very low probability of collisions between randomly generated barcodes (i.e., one barcode being mistakenly demultiplexed as another barcode). To further validate our findings, we designed 500 barcodes, each 24 bp in length, using both the random strategy and TDFPS-Designer, and evaluated their demultiplexing performance (Fig. 4c). The results indicated that barcodes designed by TDFPS-Designer outperformed those generated randomly, suggesting a tendency for collisions between randomly generated barcodes in this case.

TDFPS-Designer can design large-capacity barcode kits with different lengths and ensure their stable demultiplexing

Based on TDFPS-Designer, we designed three final barcode kits, each derived from an initial kit that ensures the difference in DTW distance (see the “Methods” section). TDFPS-Designer provides demultiplexing functionality, and we conducted preliminary tests on the demultiplexing performance of TDFPS-Designer on these kits, comparing it with Guppy (Table 2). Both our demultiplexing method and Guppy achieve almost perfect demultiplexing results on the three datasets with ONT barcodes (S-ET_ONT12, S-ET_ONT24, and S-ET_ONT96). Guppy’s demultiplexing method is specially designed for ONT barcodes, so its minimum F1-score is slightly higher than that of TDFPS-Designer, by 1% to 4%, while the average accuracy is almost the same. In demultiplexing both the initial and final barcode kits, TDFPS-Designer demonstrated higher demultiplexing accuracy, exceeding Guppy by 4% to 9%. Additionally, Guppy classified a large number of reads as unclassified, which is costly; we analyze this aspect in more detail below. On the other hand, the minimum/minimum-2 F1-scores show that neither Guppy nor TDFPS-Designer performs well in demultiplexing certain barcodes in the initial kits. This indicates the necessity for TDFPS-Designer to further filter barcodes from the initial kits (see the “Methods” section, Fig. 1c). In the final designed barcode kits, TDFPS-Designer showed nearly perfect demultiplexing performance, with an accuracy greater than 99%, exceeding Guppy by 9%, and with minimum/minimum-2 F1-scores greater than 95%, surpassing Guppy by 8%. These results suggest that TDFPS-Designer can successfully demultiplex all barcodes in the final kits. It is worth noting that Guppy’s minimum F1-score was only ∼0.17, as it classified a large number of reads as “unclassified” when testing the final kits, leading to poor precision in the “noise class” (see Table 1) and resulting in a very low minimum F1-score. In terms of efficiency, once the barcode regions in all sequences have been extracted, our method is faster than Guppy, owing to its well-designed GPU acceleration mechanism [40].

Table 2 Classification performance of demultiplexing tools on benchmark datasets

TDFPS-Designer is more robust than Guppy in handling sequencing errors

We evaluated Guppy’s and TDFPS-Designer’s demultiplexing performance on datasets with different sequencing error rates (Fig. 5a). Figure 5b shows Guppy’s demultiplexing performance on three datasets with initial barcode kits. Sequencing errors severely impact the performance of Guppy: its minimum recall is less than 65% on M-ESH_TD795 (Guppy R9.4), implying that some barcodes were not successfully demultiplexed, whereas almost all barcodes are effectively demultiplexed when the sequencing error rate is lower (Guppy R10.4). In addition, Fig. 5c shows that both Guppy and TDFPS-Designer exhibit high demultiplexing precision. However, Guppy shows relatively low recall when demultiplexing data with high sequencing errors, with the recall for some samples falling below 80% (Fig. 5d), ∼20% lower than that of TDFPS-Designer. This further suggests that Guppy struggles to handle sequencing errors effectively. More in-depth analysis reveals that Guppy classifies a large number of samples as “unclassified” under both types of sequencing data (Fig. 5e). This is because Guppy retains only the least ambiguous data, which ensures precision but discards a large amount of data, whereas TDFPS-Designer effectively avoids this issue.

Despite the continuous improvements in ONT sequencing accuracy, uncertainties still shroud the sequencing error rate, particularly in the context of nonmodel organisms and RNA samples [41]. In these scenarios, our demultiplexing approach emerges as a viable alternative solution.

Discussion

In nanopore sequencing, pooling multiple samples together for sequencing can save time and cost. However, separating raw sequencing data from multiple samples can be challenging. Barcodes are crucial for this purpose, but the barcode kits provided by ONT support simultaneous sequencing of only up to 96 samples. To enable simultaneous sequencing of more samples, we propose TDFPS-Designer, a new tool for designing barcodes using the TDFPS algorithm. The TDFPS algorithm improves the farthest point sampling algorithm: it uses the DTW distance as a measurement and a well-designed threshold to reduce the sampling space. Based on the TDFPS algorithm, TDFPS-Designer selects sequences that are sufficiently different from each other in the sequence space to construct barcode sets of different lengths. For the resulting barcode kit, TDFPS-Designer provides an efficient demultiplexing strategy that works directly from the DTW distance matrix and ensures that the demultiplexing F1-score of all barcodes is above 95%. Additionally, TDFPS-Designer adopts a GPU acceleration mechanism to improve the efficiency of demultiplexing and barcode design.

Although Guppy is the current state-of-the-art tool for demultiplexing problems, experiments have shown that Guppy’s demultiplexing performance is very susceptible to sequencing errors. In contrast, our method effectively overcomes this challenge, offering users a dependable demultiplexing solution for handling extensive sample demultiplexing issues. Our proposed barcode design strategy can design more barcodes while ensuring a stable demultiplexing effect, indicating that TDFPS-Designer has great development potential. To further enhance the performance of TDFPS-Designer, we plan to investigate more accurate barcode extraction strategies that can improve the accuracy of demultiplexing. This will be a focus of our future work.

Conclusions

In this study, we developed TDFPS-Designer, a new tool for designing barcodes using the TDFPS algorithm. The TDFPS algorithm enhances the farthest point sampling algorithm by employing the DTW distance as a measurement and implementing a well-designed threshold to minimize the sampling space. This method ensures that the sequences selected for barcode kits are sufficiently different from one another, enabling the construction of barcode kits with various lengths. Notably, the barcode kits designed by TDFPS-Designer are nearly 1.4 to 18.5 times larger than those provided by ONT, supporting the design of barcodes with arbitrary lengths. Experimental results demonstrate that the barcodes designed by TDFPS-Designer exhibit greater robustness compared to randomly generated barcodes. Moreover, the demultiplexing strategy employed by TDFPS-Designer is more effective in handling sequencing errors. Notably, under the condition of maintaining high demultiplexing accuracy, the recall rate of TDFPS-Designer is approximately 20% higher than that of Guppy. This suggests that the DTW algorithm in TDFPS-Designer is well-suited for handling the more common insertions and deletions in ONT, thereby ensuring a higher recall rate. This improvement ensures the feasibility and reliability of current multi-sample sequencing applications in non-model organisms and direct RNA sequencing.

Methods

TDFPS-Designer is developed using Python and C++. The primary function of this software is to design barcodes for ONT sequencing, facilitating the barcoding of a larger number of samples and enabling efficient demultiplexing. The use of TDFPS-Designer is described comprehensively in Additional File 1: S2 and https://github.com/junhaiqi/TDFPSDesigner.git. Next, we provide details for each part of TDFPS-Designer.

Barcode design strategy: the maximum capacity of the barcode kit

Given a demultiplexing system S, a dataset D, and an accuracy value $p_{acc}$, let BK denote a barcode kit and $D_{BK}$ the dataset obtained by integrating BK into D for multisample sequencing. $p_{D_{BK}}^{min}$ represents the minimum accuracy of the demultiplexing system S on $D_{BK}$, where the minimum accuracy is defined in the “Evaluation metrics” section. For a dataset D, if the demultiplexing performance of S on D depends only on |BK| (the size of BK), then there is a maximum capacity in theory:

$$\begin{aligned} C(D, p_{acc}) = \max \left( \left\{ |BK| \; : \; p_{D_{BK}}^{min} > p_{acc} \right\} \right) . \end{aligned}$$

(1)

TDFPS-Designer tries to find a BK whose size is close to the maximum capacity. A brute-force scheme would need to enumerate all possible BK and calculate $p_{D_{BK}}^{min}$ for each of them, which is computationally intractable because the number of candidate kits grows exponentially with the size of the sequence space. We presume that the barcodes in the BK with the maximum capacity should differ substantially from one another to facilitate demultiplexing. TDFPS-Designer uses the DTW distance to quantify the differences between barcodes.

Barcode design strategy: selection of barcodes

Our barcode design strategy supports two input modes: the sequence length of the barcode kit, or a given set of sequences of the same length. The input mode determines the sequence space from which we pick sequences to serve as barcodes. Unfortunately, the sequence space can be very large. For example, there are over one million ($4^{10}$) choices for a barcode length of 10 bp and about 1.1 trillion ($4^{20}$) choices for a barcode length of 20 bp. To improve computational efficiency, we apply a simple initial selection scheme when the sequence space exceeds 1 million. In addition, our algorithm supports filtering out certain sequences when determining the sampling space, so that the designed barcodes meet specific biological criteria. These criteria include balanced guanine-cytosine (GC) content, minimal homopolymer runs, and no self-complementarity of more than two bases to reduce internal hairpin propensity [24]. Figure 1a shows an illustration of this scheme. We define a hash function H on the nucleotide alphabet $\Sigma =\{A,C,G,T\}$ of DNA sequences, where $H(A)=0$, $H(C)=1$, $H(G)=2$, and $H(T)=3$. We extend this function to DNA sequences, as defined in Eq. (2):

$$\begin{aligned} H(S) = H(s_1)\times 4^{n-1}+H(s_2)\times 4^{n-2}+\cdots +H(s_n), \end{aligned}$$

(2)

where $S=s_{1}s_{2}...s_{n}$ represents a DNA sequence of length n.
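As a quick illustration of Eq. (2), the hash is simply the base-4 value of the sequence under the digit assignment A=0, C=1, G=2, T=3. The sketch below evaluates it with Horner's rule; the function name `kmer_hash` is hypothetical and not taken from the TDFPS-Designer source.

```python
# Digit assignment from the text: H(A)=0, H(C)=1, H(G)=2, H(T)=3.
BASE_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def kmer_hash(seq):
    """Base-4 value of a DNA sequence, i.e. Eq. (2) evaluated by Horner's rule."""
    value = 0
    for base in seq:
        value = value * 4 + BASE_CODE[base]
    return value


if __name__ == "__main__":
    # ACGT -> 0*64 + 1*16 + 2*4 + 3 = 27
    print(kmer_hash("ACGT"))
```

Because the mapping is one-to-one for sequences of a fixed length n, sorting sequences by hash value is the same as sorting them lexicographically with A < C < G < T.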

Equation (2) reflects the relationship between sequences and their corresponding hash values. The greater the difference in hash values, the higher the probability that the two sequences have differences. We use this relationship to determine our initial selection strategy. We calculate and sort the hash values of all sequences and then use uniform random sampling to select one million items. We then select the sequences corresponding to these items to build the initial set of sequences. The final designed barcodes will all come from this initial set. Uniform random sampling selects samples across the entire range of sorted sequences, increasing the differences between the selected sequences. Uniform distribution in sampling reduces the probability of selecting similar (or adjacent) sequences, thereby enhancing the diversity of the sampled sequences.
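A minimal sketch of this initial selection step follows, under two assumptions that the text does not spell out: sampling is done without replacement, and the full k-mer space is the input (no user-supplied sequences or biological filters). Since the hash is a bijection between k-mers and the integers 0 to 4^k - 1, the sketch samples hash values directly and decodes them instead of materializing and sorting every sequence. The names `unhash` and `initial_candidates` are hypothetical.

```python
import random

BASES = "ACGT"  # digit i of the base-4 hash decodes to BASES[i]


def unhash(value, k):
    """Inverse of the base-4 hash: decode an integer into a k-mer."""
    chars = []
    for _ in range(k):
        value, digit = divmod(value, 4)
        chars.append(BASES[digit])
    return "".join(reversed(chars))


def initial_candidates(k, limit=1_000_000, seed=0):
    """Return at most `limit` k-mers, in increasing hash order.

    If the space is small enough, every k-mer is kept; otherwise hash
    values are drawn uniformly at random (without replacement, an assumption).
    """
    space = 4 ** k
    if space <= limit:
        hashes = range(space)
    else:
        hashes = sorted(random.Random(seed).sample(range(space), limit))
    return [unhash(h, k) for h in hashes]


if __name__ == "__main__":
    subset = initial_candidates(6, limit=100)
    print(len(subset), subset[:3])
```

Sampling integers from `range(4 ** k)` avoids enumerating the whole space, which matters for 20 bp barcodes where the space has about 1.1 trillion members.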

To select the initial barcode set from the initially screened sequence set, we combine the farthest point sampling algorithm with the DTW algorithm (we call this combination the TDFPS algorithm) and improve efficiency by incorporating a threshold r that is determined through experiments (Fig. 6). The goal is to ensure that the designed barcodes differ enough that sequencing errors do not affect the demultiplexing results. We measure the difference between barcodes using the DTW distance between their corresponding signals. Specifically, the DTW distance between any two barcode signals in the final set should be greater than the threshold r.

Algorithm 1 outlines the selection of the initial barcode set. First, we convert the DNA sequence collection into a set of standard nanopore signals using the function seq2sig. We define the procedure DTWSetVersion to calculate the minimum DTW distance between a signal and a set of signals. A new signal is identified as a barcode signal if and only if the DTW distance between this signal and the barcode signal set is large enough. Selecting the barcodes directly with the farthest point sampling algorithm would require running the DTW algorithm $\sim n^3$ times, where n is the size of the signal set. When the candidate barcode set is very large, this approach would require considerable computational resources. To overcome this limitation, we reduce the size of the signal set based on the threshold r. Whenever a new barcode signal is selected, every signal in the signal set whose DTW distance to this new barcode signal is less than the threshold r is deleted, which can greatly reduce the size of the signal set and improve the screening efficiency. We also accelerate the computation of the DTW matrix using CUDA and the diagonal parallel method, which speeds up the DTW calculation by $\sim 3$ orders of magnitude [40].
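The selection loop described above can be sketched on a toy problem. This is not the authors' Algorithm 1: the `seq2sig` here is a deliberately crude stand-in (one flat level per base instead of a real pore model), the DTW is the textbook CPU dynamic program rather than the CUDA implementation, and caching each candidate's minimum distance to the selected set is an implementation choice of this sketch. The function name `tdfps` and all parameter values are hypothetical.

```python
import itertools
import random

# Toy current levels; a real seq2sig would use a k-mer pore model.
LEVEL = {"A": 0.0, "C": 1.0, "G": 2.0, "T": 3.0}


def seq2sig(seq, samples_per_base=2):
    """Toy stand-in for seq2sig: repeat one flat level per base."""
    return [LEVEL[b] for b in seq for _ in range(samples_per_base)]


def dtw(x, y):
    """Textbook DTW distance with absolute-difference local cost."""
    inf = float("inf")
    prev = [0.0] + [inf] * len(y)
    for xi in x:
        cur = [inf]
        for j, yj in enumerate(y, 1):
            cur.append(abs(xi - yj) + min(prev[j], prev[j - 1], cur[j - 1]))
        prev = cur
    return prev[-1]


def tdfps(seqs, r, seed=0):
    """Threshold-pruned farthest point sampling over DTW distances."""
    rng = random.Random(seed)
    signals = {s: seq2sig(s) for s in seqs}
    first = rng.choice(seqs)            # random starting barcode
    selected = [first]
    # min_dist[s]: smallest DTW distance from candidate s to the selected set.
    min_dist = {}
    for s in seqs:
        if s != first:
            d = dtw(signals[s], signals[first])
            if d >= r:                  # candidates closer than r are deleted
                min_dist[s] = d
    while min_dist:                     # stop when no candidate survives
        nxt = max(min_dist, key=min_dist.get)   # farthest remaining point
        selected.append(nxt)
        del min_dist[nxt]
        for s in list(min_dist):
            d = dtw(signals[s], signals[nxt])
            if d < r:
                del min_dist[s]         # too close to the new barcode
            else:
                min_dist[s] = min(min_dist[s], d)
    return selected


if __name__ == "__main__":
    kmers = ["".join(p) for p in itertools.product("ACGT", repeat=4)]
    kit = tdfps(kmers, r=6.0)
    print(len(kit), kit[:5])
```

Each newly selected barcode costs one DTW computation per surviving candidate, and the pruning step shrinks the candidate pool as the kit grows, which is the efficiency gain the threshold is meant to provide.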

We selected the final barcode kits from the initial barcode set (Fig. 1c). The initial barcodes exhibited high DTW dissimilarity, ensuring they could be easily distinguished. To further enhance the robustness of demultiplexing these barcodes, TDFPS-Designer ultimately screened the final barcodes by simulating a demultiplexing pipeline. Specifically, after the user specifies the sequencing platform (e.g., MinION R9.4 or MinION R10.4) and the multi-sample sequencing library information (adapter sequences and flanking sequences), all barcodes in the initial set are automatically used to construct a multi-sample sequencing library and generate a small batch of sequencing data. This sequencing data is then automatically demultiplexed by TDFPS-Designer. Subsequently, TDFPS-Designer analyzes the demultiplexing results, calculating the demultiplexing precision, recall, and F1-Score for each barcode. Barcodes with low precision (recall and F1-score) suggest potential conflicts with other barcodes in the kit, and TDFPS-Designer filters these out to obtain the final barcode kit.

Barcode design strategy: threshold determination

In theory, the demultiplexing accuracy depends on the difference between barcodes. Here, we want to determine a DTW distance through experimentation so that under this distance, a simple demultiplexing scheme can achieve sufficient precision. The determined distance threshold r is used as the termination condition of the TDFPS algorithm (Fig. 1a).

We generate template sequences of different lengths (ranging from 10 bp to 20 bp) for a given DNA sequence. By specifying an edit distance d, we generate 1000 sequences from these templates, where the edit distance between each generated sequence and its corresponding template sequence is d. As the DTW distance is correlated with the edit distance, larger edit distances between DNA sequences correspond to larger DTW distances between the corresponding nanopore signals (Fig. 6a). For each template sequence, we generate a dataset containing subsets of sequences with different edit distances from the template sequence (Fig. 6b: (1) and (2)). Using DeepSimulator1.5 [42], we simulate nanopore signals from each template sequence and its corresponding dataset, calculate the DTW distance matrix between the template signal and signals in the dataset, and identify the demultiplexed result based on the row index of the smallest element in each column of the matrix. As shown in Fig. 6c, the demultiplexing accuracy exceeds 99% when the edit distance is 10 under different sequence lengths, indicating that the difference between barcodes is large enough. Moreover, we analyze the numerical distribution of the DTW distance for an edit distance of 10 under different sequence lengths and fit a linear function that gives the corresponding threshold for each length (Fig. 6d).
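One way to produce sequences at an exact edit distance from a template is rejection sampling: apply d random edits, then keep the result only if its Levenshtein distance to the template is exactly d. The text does not say how the authors generate these sequences, so the sketch below is an assumption; in particular, it allows insertions and deletions, so the generated sequences may differ in length from the template. The names `levenshtein`, `mutate`, and `sequences_at_distance` are hypothetical.

```python
import random


def levenshtein(a, b):
    """Edit distance (substitutions, insertions, deletions) between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # match or substitution
        prev = cur
    return prev[-1]


def mutate(seq, n_edits, rng):
    """Apply n_edits random single-base edits to seq."""
    s = list(seq)
    for _ in range(n_edits):
        op = rng.choice(("sub", "ins", "del"))
        if op == "sub":
            i = rng.randrange(len(s))
            s[i] = rng.choice([b for b in "ACGT" if b != s[i]])
        elif op == "del" and len(s) > 1:
            del s[rng.randrange(len(s))]
        else:  # insertion (also used when a deletion would empty the sequence)
            s.insert(rng.randrange(len(s) + 1), rng.choice("ACGT"))
    return "".join(s)


def sequences_at_distance(template, d, count, seed=0, max_tries=100_000):
    """Collect up to `count` sequences whose edit distance to template is exactly d."""
    rng = random.Random(seed)
    out = []
    for _ in range(max_tries):
        if len(out) == count:
            break
        candidate = mutate(template, d, rng)
        if levenshtein(template, candidate) == d:   # edits may cancel out
            out.append(candidate)
    return out


if __name__ == "__main__":
    for seq in sequences_at_distance("ACGTACGTTGCA", d=3, count=5):
        print(seq)
```

The acceptance check is needed because d random edits only give an upper bound on the edit distance: two edits can undo each other or be explained by a shorter edit script.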

Demultiplexing strategy

Figure 1b outlines our demultiplexing strategy. The first step involves detecting the barcode region in the nanopore signal. We design a heuristic strategy based on Oxford Nanopore’s official multisample sequencing library construction scheme and the semiglobal DTW algorithm [43] to extract the barcode signal. This strategy involves detecting the region of the adapter signal to determine the position of the barcode signal and estimating the length of the barcode signal. Specifically, we assume that the sequence length of the barcode is n (excluding flanking sequences), and the estimated barcode signal length is $10n+c$, where c defaults to 70, based on the structural division of the nanopore signal (see Fig. 2a and b).
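The adapter-detection step can be illustrated with a generic semiglobal (subsequence) DTW, in which the query must be matched in full but may start and end anywhere in the longer signal. This is a textbook formulation, not the authors' heuristic or the implementation from [43]. The helper `barcode_window` additionally assumes that the barcode region begins where the adapter match ends, which the text does not state explicitly; both function names are hypothetical.

```python
def semiglobal_dtw(query, signal):
    """Best match of the whole query inside signal.

    Returns (cost, start, end): the match covers signal[start:end].
    """
    # First query sample may align to any signal position at no extra cost.
    cost = [abs(query[0] - s) for s in signal]
    start = list(range(len(signal)))
    for q in query[1:]:
        new_cost, new_start = [], []
        for j, s in enumerate(signal):
            best = (cost[j], start[j])               # query advances, signal stays
            if j > 0:
                best = min(best,
                           (cost[j - 1], start[j - 1]),          # both advance
                           (new_cost[j - 1], new_start[j - 1]))  # signal advances
            new_cost.append(best[0] + abs(q - s))
            new_start.append(best[1])                # carry the match start along
        cost, start = new_cost, new_start
    end = min(range(len(signal)), key=cost.__getitem__)
    return cost[end], start[end], end + 1


def barcode_window(adapter_end, n, c=70):
    """Estimated barcode signal window of length 10n + c after the adapter."""
    return adapter_end, adapter_end + 10 * n + c


if __name__ == "__main__":
    read = [5, 5, 5, 1, 2, 3, 5, 5, 5, 5]
    adapter = [1, 2, 3]
    cost, begin, end = semiglobal_dtw(adapter, read)
    print(cost, begin, end)
    print(barcode_window(end, n=20))
```

With the default c = 70, the estimated window is 270 samples for a 20 bp barcode, 310 for 24 bp, and 370 for 30 bp.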

After extracting the barcode signals, we calculate the DTW distance matrix between these sequenced signals and the standard barcode signals; the row index of the minimum value in each column of the distance matrix corresponds to the demultiplexed result. After extracting this minimum distance for each sequenced signal, we apply the $5\sigma$ method to detect anomalies. Any signal whose distance exceeds the threshold $mean + 5 \times std$ is classified as anomalous data, potentially devoid of an associated barcode. Here, mean and std denote the mean and standard deviation of all distances, respectively.
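The assignment and anomaly-filtering steps can be sketched as follows. This is a plain quadratic DTW in pure Python, not the GPU-accelerated implementation described in the paper. Two choices here are assumptions of the sketch: the mean and standard deviation are taken over the per-signal minimum distances, and the population standard deviation is used.

```python
from statistics import mean, pstdev

def dtw_distance(x, y):
    """Classic DTW with absolute-difference cost, using two rolling rows."""
    inf = float("inf")
    prev = [0.0] + [inf] * len(y)
    for xi in x:
        curr = [inf]
        for j, yj in enumerate(y, start=1):
            cost = abs(xi - yj)
            curr.append(cost + min(prev[j], curr[j - 1], prev[j - 1]))
        prev = curr
    return prev[-1]

def demultiplex(barcode_signals, read_signals, k=5.0):
    """Assign each read signal to its nearest barcode; flag outliers as None.

    Rows of the distance matrix are barcodes, columns are reads.
    """
    matrix = [[dtw_distance(b, r) for r in read_signals] for b in barcode_signals]
    labels, min_dists = [], []
    for col in range(len(read_signals)):
        column = [row[col] for row in matrix]
        best = min(range(len(column)), key=column.__getitem__)
        labels.append(best)
        min_dists.append(column[best])
    # Assumed reading of the rule: threshold over the minimum distances.
    threshold = mean(min_dists) + k * pstdev(min_dists)
    return [lab if d <= threshold else None for lab, d in zip(labels, min_dists)]
```

One consequence of this rule is worth keeping in mind: with the population standard deviation, no value can lie more than $\sqrt{N-1}$ standard deviations from the mean of N values, so a $5\sigma$ cut-off can only flag a signal when at least 27 signals are processed together.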

Determination of final barcode kits

We used TDFPS-Designer to design final kits with barcodes of three lengths, 20 bp, 24 bp, and 30 bp, containing 137, 410, and 1779 barcodes, respectively. Specifically, we first designed 795, 1093, and 2120 barcodes of 20 bp, 24 bp, and 30 bp, respectively, using the TDFPS algorithm. These barcodes ensure sufficient DTW distance differences and form the initial barcode kits. We used the initial kits to generate three medium-sized datasets (M-ESH_TD795, M-ESH_TD1093, M-ESH_TD2120) and then demultiplexed them. Figure 7a shows the distribution of demultiplexing recall. Demultiplexing recall is positively correlated with barcode length, indicating that the maximum capacity of a barcode kit also increases with barcode length. We also examined the relationship between the number of barcodes and the minimum recall, which directly affects the estimate of the maximum capacity of the barcode kit (Fig. 7b). Once the number of barcodes exceeds a certain threshold, the minimum recall drops significantly. This drop indicates “collisions” between certain barcodes, that is, the demultiplexing system has difficulty distinguishing them accurately. To address this issue, TDFPS-Designer can simulate small batches of multi-sample sequencing data based on the initial barcode kit, automatically perform demultiplexing, and select the final barcodes from the initial kit based on the demultiplexing results. The selected barcodes achieve > 95% precision, recall, and F1-Score during this process and form the final barcode kits. All parameters and corresponding output files are available at [44].
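The selection criterion can be sketched as a per-barcode filter over the demultiplexing results of a simulated batch. This is a minimal single-pass illustration under stated assumptions: `per_barcode_metrics` and `select_barcodes` are hypothetical names, unassigned reads are represented by `None`, and the actual toolkit may apply the criterion differently (for example, over several simulated batches).

```python
from collections import Counter

def per_barcode_metrics(true_labels, pred_labels):
    """Return {barcode: (precision, recall, f1)} from paired true/predicted labels."""
    true_count = Counter(true_labels)
    pred_count = Counter(p for p in pred_labels if p is not None)
    tp = Counter(t for t, p in zip(true_labels, pred_labels) if t == p)
    metrics = {}
    for barcode, n_true in true_count.items():
        precision = tp[barcode] / pred_count[barcode] if pred_count[barcode] else 0.0
        recall = tp[barcode] / n_true
        total = precision + recall
        f1 = 2 * precision * recall / total if total else 0.0
        metrics[barcode] = (precision, recall, f1)
    return metrics

def select_barcodes(true_labels, pred_labels, threshold=0.95):
    """Keep barcodes whose precision, recall, and F1 all exceed the threshold."""
    metrics = per_barcode_metrics(true_labels, pred_labels)
    return sorted(b for b, scores in metrics.items() if min(scores) > threshold)
```

In this sketch, a barcode that collides with another one is removed either because its own reads are lost (low recall) or because it absorbs reads from the other barcode (low precision).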

Data availability

All Python/C++ code of TDFPS-Designer is published under the permissive MIT open source license and is available on GitHub at https://github.com/junhaiqi/TDFPSDesigner.git. Additionally, the source code for TDFPS-Designer has been deposited at Zenodo [45]. All sequences used to generate simulated data are available in [46,47,48], and the code used to generate them is available at https://github.com/junhaiqi/MSNANOSIM and https://github.com/JustLeeee/ONT-sequencing-data-library-preparation-pipeline.

References

1. Senol Cali D, Kim JS, Ghose S, Alkan C, Mutlu O. Nanopore sequencing technology and tools for genome assembly: computational analysis of the current state, bottlenecks and future directions. Brief Bioinforma. 2019;20(4):1542–59.
2. Choi JY, Lye ZN, Groen SC, Dai X, Rughani P, Zaaijer S, et al. Nanopore sequencing-based genome assembly and evolutionary genomics of circum-basmati rice. Genome Biol. 2020;21:1–27.
3. Shafin K, Pesout T, Lorig-Roach R, Haukness M, Olsen HE, Bosworth C, et al. Nanopore sequencing and the Shasta toolkit enable efficient de novo assembly of eleven human genomes. Nat Biotechnol. 2020;38(9):1044–53.
4. Moss EL, Maghini DG, Bhatt AS. Complete, closed bacterial genomes from microbiomes using nanopore sequencing. Nat Biotechnol. 2020;38(6):701–7.
5. Wang Y, Zhao Y, Bollas A, Wang Y, Au KF. Nanopore sequencing technology, bioinformatics and applications. Nat Biotechnol. 2021;39(11):1348–65.
6. Xie H, Li W, Hu Y, Yang C, Lu J, Guo Y, et al. De novo assembly of human genome at single-cell levels. Nucleic Acids Res. 2022;50(13):7479–92.
7. Fang Y, Chen G, Chen F, Hu E, Dong X, Li Z, et al. Accurate transcriptome assembly by Nanopore RNA sequencing reveals novel functional transcripts in hepatocellular carcinoma. Cancer Sci. 2021;112(9):3555–68.
8. Sahlin K, Medvedev P. Error correction enables use of Oxford Nanopore technology for reference-free transcriptome analysis. Nat Commun. 2021;12(1):2.
9. de la Rubia I, Srivastava A, Xue W, Indi JA, Carbonell-Sala S, Lagarde J, et al. RATTLE: reference-free reconstruction and quantification of transcriptomes from Nanopore sequencing. Genome Biol. 2022;23(1):153.
10. Liu Y, Rosikiewicz W, Pan Z, Jillette N, Wang P, Taghbalout A, et al. DNA methylation-calling tools for Oxford Nanopore sequencing: a survey and human epigenome-wide evaluation. Genome Biol. 2021;22(1):1–33.
11. Tourancheau A, Mead EA, Zhang XS, Fang G. Discovering multiple types of DNA methylation from bacteria and microbiome using nanopore sequencing. Nat Methods. 2021;18(5):491–8.
12. Sakamoto Y, Zaha S, Nagasawa S, Miyake S, Kojima Y, Suzuki A, et al. Long-read whole-genome methylation patterning using enzymatic base conversion and nanopore sequencing. Nucleic Acids Res. 2021;49(14):e81–e81.
13. Cumbo C, Minervini CF, Orsini P, Anelli L, Zagaria A, Minervini A, et al. Nanopore targeted sequencing for rapid gene mutations detection in acute myeloid leukemia. Genes. 2019;10(12):1026.
14. Goenka SD, Gorzynski JE, Shafin K, Fisk DG, Pesout T, Jensen TD, et al. Accelerated identification of disease-causing variants with ultra-rapid nanopore genome sequencing. Nat Biotechnol. 2022;40(7):1035–41.
15. Capraru ID, Romanescu M, Anghel FM, Oancea C, Marian C, Sirbu IO, et al. Identification of Genomic Variants of SARS-CoV-2 Using Nanopore Sequencing. Medicina. 2022;58(12):1841.
16. Church GM, Kieffer-Higgins S. Multiplex DNA sequencing. Science. 1988;240(4849):185–8.
17. Wick RR, Judd LM, Holt KE. Deepbinner: Demultiplexing barcoded Oxford Nanopore reads with deep convolutional neural networks. PLoS Comput Biol. 2018;14(11):e1006583.
18. Smith MA, Ersavas T, Ferguson JM, Liu H, Lucas MC, Begik O, et al. Molecular barcoding of native RNAs using nanopore sequencing and deep learning. Genome Res. 2020;30(9):1345–53.
19. Whitford W, Hawkins V, Moodley K, Grant MJ, Lehnert K, Snell RG, et al. Optimised multiplex amplicon sequencing for mutation identification using the MinION nanopore sequencer. bioRxiv. 2021;2021–09.
20. Peterson WW, Weldon EJ. Error-correcting codes, vol 2. Cambridge: MIT Press; 1972. p. 208–213.
21. MacWilliams FJ, Sloane NJA. The theory of error-correcting codes, vol 2. Elsevier Science Publishers BV; 1977. p. 9–47.
22. Hamady M, Walker JJ, Harris JK, Gold NJ, Knight R. Error-correcting barcoded primers for pyrosequencing hundreds of samples in multiplex. Nat Methods. 2008;5(3):235–7.
23. Zorita E, Cusco P, Filion GJ. Starcode: sequence clustering based on all-pairs search. Bioinformatics. 2015;31(12):1913–9.
24. Hawkins JA, Jones SK Jr, Finkelstein IJ, Press WH. Indel-correcting DNA barcodes for high-throughput sequencing. Proc Natl Acad Sci. 2018;115(27):E6217–26.
25. Doroschak K, Zhang K, Queen M, Mandyam A, Strauss K, Ceze L, et al. Rapid and robust assembly and decoding of molecular tags with DNA-based nanopore signatures. Nat Commun. 2020;11(1):5454.
26. Marzal A, Vidal E. Computation of normalized edit distance and applications. IEEE Trans Pattern Anal Mach Intell. 1993;15(9):926–32.
27. Kelley DR, Schatz MC, Salzberg SL. Quake: quality-aware detection and correction of sequencing errors. Genome Biol. 2010;11:1–13.
28. Galanti L, Shasha D, Gunsalus KC. Pheniqs 2.0: accurate, high-performance Bayesian decoding and confidence estimation for combinatorial barcode indexing. BMC Bioinforma. 2021;22:1–16.
29. Lu G, Zhang S, Fang X. An improved string composition method for sequence comparison. BMC Bioinforma. 2008;9(6):1–8.
30. Reinert G, Chew D, Sun F, Waterman MS. Alignment-free sequence comparison (I): statistics and power. J Comput Biol. 2009;16(12):1615–34.
31. Aita T, Husimi Y, Nishigaki K. A mathematical consideration of the word-composition vector method in comparison of biological sequences. BioSystems. 2011;106(2–3):67–75.
32. Dai Q, Liu X, Yao Y, Zhao F. Numerical characteristics of word frequencies and their application to dissimilarity measure for sequence comparison. J Theor Biol. 2011;276(1):174–80.
33. Papetti DM, Spolaor S, Nazari I, Tirelli A, Leonardi T, Caprioli C, et al. Barcode demultiplexing of nanopore sequencing raw signals by unsupervised machine learning. Front Bioinforma. 2023;3:1067113.
34. Guan X, Li Z, Zhou Y, Shao W, Zhang D. Active learning for efficient analysis of high-throughput nanopore data. Bioinformatics. 2023;39(1):btac764.
35. Loose M, Malla S, Stout M. Real-time selective sequencing using nanopore technology. Nat Methods. 2016;13(9):751–4.
36. Han R, Li Y, Gao X, Wang S. An accurate and rapid continuous wavelet dynamic time warping algorithm for end-to-end mapping in ultra-long nanopore sequencing. Bioinformatics. 2018;34(17):i722–31.
37. Han R, Wang S, Gao X. Novel algorithms for efficient subsequence searching and mapping in nanopore raw signals towards targeted sequencing. Bioinformatics. 2020;36(5):1333–43.
38. Gamaarachchi H, Ferguson JM, Samarakoon H, Liyanage K, Deveson IW. Simulation of nanopore sequencing signal data with tunable parameters. Genome Res. 2024;34(5):778–83.
39. Wick RR. Badread: simulation of error-prone long reads. J Open Source Softw. 2019;4(36):1316.
40. Han R, Qi J, Xue Y, Sun X, Zhang F, Gao X, et al. HycDemux: a hybrid unsupervised approach for accurate barcoded sample demultiplexing in nanopore sequencing. Genome Biol. 2023;24(1):1–29.
41. Liu-Wei W, van der Toorn W, Bohn P, Hölzer M, Smyth RP, von Kleist M. Sequencing accuracy and systematic errors of nanopore direct RNA sequencing. BMC Genomics. 2024;25(1):528.
42. Li Y, Wang S, Bi C, Qiu Z, Li M, Gao X. DeepSimulator1.5: a more powerful, quicker and lighter simulator for Nanopore sequencing. Bioinformatics. 2020;36(8):2578–80.
43. Boža V, Brejová B, Vinař T. Improving Nanopore Reads Raw Signal Alignment. arXiv preprint arXiv:1705.01620. 2017;2017-05.
44. Qi J. TDFPS-Designer: an efficient toolkit for barcode design and selection in nanopore sequencing. 2024. Zenodo. https://doi.org/10.5281/zenodo.13927379.
45. Qi J. TDFPS-Designer: an efficient toolkit for barcode design and selection in nanopore sequencing. 2024. Zenodo. https://doi.org/10.5281/zenodo.8260659.
46. Li Z. TDFPS-Designer: an efficient toolkit for barcode design and selection in nanopore sequencing. 2024. Zenodo. https://doi.org/10.5281/zenodo.13208175.
47. Li Z. TDFPS-Designer: an efficient toolkit for barcode design and selection in nanopore sequencing. 2024. Zenodo. https://doi.org/10.5281/zenodo.13203290.
48. Li Z. TDFPS-Designer: an efficient toolkit for barcode design and selection in nanopore sequencing. 2024. Zenodo. https://doi.org/10.5281/zenodo.13923770.

Acknowledgements

The authors thank the reviewers and editors for their valuable feedback.

Review history

The review history is available as Additional file 2.

Peer review information

Andrew Cosgrove was the primary editor of this article at Genome Biology and managed its editorial process and peer review in collaboration with the rest of the editorial team.

Funding

This research was supported by the National Key Research and Development Program of China (2020YFA0712400 and 2021YFF0704300), the National Natural Science Foundation of China Projects Grant (62072280, 11931008, 61771009, 32241027), the Natural Science Foundation of Shandong Province ZR2023YQ057, the King Abdullah University of Science and Technology (KAUST) Office of Research Administration (ORA) under Award No FCC/1/1976-44-01, FCC/1/1976-45-01, REI/1/5234-01-01, REI/1/5414-01-01, URF/1/4352-01-01, and the open project of BGI-Shenzhen BGIRSZ20220005.

Author information

Junhai Qi and Zhengyi Li are joint first authors.

Authors and Affiliations

Research Center for Mathematics and Interdisciplinary Sciences, Shandong University, Qingdao, 266237, China
Junhai Qi, Zhengyi Li, Guojun Li & Renmin Han
Division of Health Medical Intelligence, Human Genome Center, The Institute of Medical Science, The University of Tokyo, Minato-ku, Tokyo, 108-8639, Japan
Yao-zhong Zhang
Computational Bioscience Research Center (CBRC), Computer, Electrical and Mathematical Sciences and Engineering (CEMSE) Division, King Abdullah University of Science and Technology (KAUST), Thuwal, Makkah, 23955, Saudi Arabia
Xin Gao

Contributions

G. L., X. G., and R. H. conceived and managed the project. J. Q. implemented the algorithm. J. Q. and Z. L. collected all the datasets and performed all the analysis. R. H., G. L., X. G., and Y. Z. were involved in algorithm analysis and data analysis. All authors have read and approved the final manuscript.

Corresponding authors

Correspondence to Guojun Li, Xin Gao or Renmin Han.

Ethics declarations

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Competing interests

The authors declare that they have no competing interests.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

13059_2024_3423_MOESM1_ESM.docx

Additional file 1: S1. The relationship between the randomly selected sequence and the final set of barcodes. S2. Usage of TDFPS-Designer. S3. Evaluate simulated signals. S4. Detailed description of the dataset.

Additional file 2: Review history.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.

About this article

Cite this article

Qi, J., Li, Z., Zhang, Yz. et al. TDFPS-Designer: an efficient toolkit for barcode design and selection in nanopore sequencing. Genome Biol 25, 285 (2024). https://doi.org/10.1186/s13059-024-03423-3

Received: 14 November 2023
Accepted: 17 October 2024
Published: 04 November 2024
DOI: https://doi.org/10.1186/s13059-024-03423-3
