Fig. 3
From: Towards accurate and reliable resolution of structural variants for clinical diagnosis

The role of AI in promoting SV detection. AI-powered natural language processing (NLP) for SV calling. Given the suboptimal completeness and accuracy of existing SV calling algorithms, we suggest that deep learning is an alternative worth further exploration. Convolutional neural networks (CNNs), which treat BAM files as images, are the primary deep learning approach investigated for SV detection. The rapid development of deep learning methods such as AI-powered NLP has not only brought unprecedented innovation to information retrieval from free-text documents but has also been repurposed for other types of biological information, such as chemical structures and protein sequences [104]. Here, we hypothesize that chromosomes can be likened to paragraphs, sequence reads to sentences, and different combinations of A, T, G, and C (e.g., tandem repeats and microsatellites) to vocabulary. AI-powered language models such as transformers [105,106,107] could then be used to digest a genome sequence much as a human reads a book. The differences between the sample genome and the reference genome (i.e., variants) could be extracted by comparing transformer-based genome embeddings, a rationale very similar to that behind de novo assembly (A). Reinforcement learning optimizing meta-caller combination. Multiple callers could potentially be integrated using approaches more sophisticated than simple heuristic union/intersection rules to improve SV detection, and artificial intelligence (AI) may be a solution. The rapid evolution of emerging genomics technologies suggests that improved SV detection should be achievable. The ideal strategy for combining different SV callers is to exploit the strengths of each caller while eliminating false positives, which fits the concept of reinforcement learning well.
Reinforcement learning is a branch of machine learning that focuses on how intelligent agents ought to take actions in an environment to maximize cumulative reward [108]. AlphaGo is an excellent example of a reinforcement learning application [109]. Reinforcement learning could be used to develop intelligent ensemble SV callers that maximize SV detection performance (B). For each SV type, the combination of SV callers could be learned by minimizing a loss function that measures the divergence between the called SVs and the ground truth. Ultimately, reinforcement learning-based ensemble SV callers would allow the integration of any individual caller, accommodate different SV types, and incorporate the advantages of newly developed technologies. Generative adversarial network (GAN)-based SV simulation. A truth set is key to investigating the accuracy and reproducibility of SV detection. Unfortunately, the complexity of SV events and the associated genome properties, which are central to the whole picture of SV events in a sample, hampers objective evaluation. Many reports suggest that simulated ground truth cannot recapitulate genome and SV characteristics or mimic the actual patient situation. Therefore, simulated SV truth sets with high commutability are urgently needed. A GAN is a deep neural network framework that integrates a generative model and a discriminative model to generate new data similar to the statistical distribution of the training set [110]. GANs have been widely applied to image generation in fashion, art, and advertising and have attracted much attention in the scientific community. For example, one type of GAN model, named DeepFake, has been used to predict cell type-specific transcriptional states induced by drug treatment [111]. Here, we envision a GAN that simulates binary alignment map (BAM) files spiked with different SV types based on actual data (C).
The proposed GAN model takes high-quality BAM files with varying SV types and lengths from real data, such as SEQC-II and other consortium efforts, as a training set to generate the target SV-spiked BAM file. The potential benefit of the proposed GAN model is that the simulated SV-spiked BAM file could maximally preserve the original matrix effects of real data, such as VAF level and tumor purity.
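
As a rough illustration of the meta-caller idea in panel B, the sketch below learns one weight per SV caller by minimizing a loss between the combined call and a truth label. It is deliberately simplified: it uses supervised logistic regression with gradient descent rather than reinforcement learning, and the three callers, the vote matrix, and the function names are all hypothetical toy assumptions, not data or methods from the article.

```python
import math

# Toy, invented example: each row is one candidate SV site, each column is the
# 0/1 vote of a hypothetical caller. Caller 0 agrees with the truth, caller 1
# is uninformative, caller 2 calls every site (many false positives).
votes = [
    (1, 1, 1), (1, 0, 1), (0, 1, 1), (0, 0, 1),
    (1, 1, 1), (0, 1, 1), (1, 0, 1), (0, 0, 1),
]
truth = [1, 1, 0, 0, 1, 0, 1, 0]  # 1 = real SV in the (toy) truth set


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_caller_weights(votes, truth, epochs=500, lr=0.5):
    """Fit one weight per caller plus a bias by minimizing log loss."""
    n_sites = len(votes)
    n_callers = len(votes[0])
    weights = [0.0] * n_callers
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_callers
        grad_b = 0.0
        for row, label in zip(votes, truth):
            z = bias + sum(w * v for w, v in zip(weights, row))
            err = sigmoid(z) - label  # gradient of log loss w.r.t. z
            for j, v in enumerate(row):
                grad_w[j] += err * v
            grad_b += err
        # Average-gradient descent step
        weights = [w - lr * g / n_sites for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n_sites
    return weights, bias


def predict(weights, bias, row, threshold=0.5):
    """Return True if the weighted ensemble accepts the candidate SV."""
    z = bias + sum(w * v for w, v in zip(weights, row))
    return sigmoid(z) >= threshold


weights, bias = train_caller_weights(votes, truth)
```

In this toy setting the learned weights favor the caller that tracks the truth set over the one that calls every site, which is the behavior a simple union or intersection rule cannot express. A reinforcement learning formulation, as the caption proposes, would replace the fixed loss with a reward signal; that step is not shown here.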