- Correspondence
- Open access
Minimum information and guidelines for reporting a multiplexed assay of variant effect
Genome Biology volume 25, Article number: 100 (2024)
Abstract
Multiplexed assays of variant effect (MAVEs) have emerged as a powerful approach for interrogating thousands of genetic variants in a single experiment. The flexibility and widespread adoption of these techniques across diverse disciplines have led to a heterogeneous mix of data formats and descriptions, which complicates the downstream use of the resulting datasets. To address these issues and promote reproducibility and reuse of MAVE data, we define a set of minimum information standards for MAVE data and metadata and outline a controlled vocabulary aligned with established biomedical ontologies for describing these experimental designs.
Background
The emergence of high-throughput genomic technologies has revolutionized our ability to study the impact of genetic variants at a grand scale. A prominent example of these innovative methods is multiplexed assays of variant effect (MAVEs). MAVEs are a family of experimental methods combining saturation mutagenesis with a multiplexed assay to interrogate the effects of thousands of genetic variants in a given functional element in parallel [1, 2]. The output of a MAVE is a variant effect map quantifying the consequences of all single nucleotide (or single amino acid) variants in a target functional element, even variants not yet observed in the population. MAVEs have been applied to coding sequences as well as noncoding elements like splice sites and regulatory regions across various organisms. Variant effect maps have broad applications including clinical variant interpretation [2, 3], understanding sequence/structure/function relationships [4, 5], and investigating molecular mechanisms of evolution [6, 7]. The MAVE field is growing rapidly, leading to the formation of organizations such as the Atlas of Variant Effects (AVE). AVE consists of over 500 researchers from over 30 countries who perform, interpret, and apply MAVE experiments.
The rapid growth and adoption of MAVE technologies across many fields have led to an excess of overlapping definitions, complicating discovery and interpretation. Minimum information standards in other research areas have increased the reporting, archiving, and reuse of biological data [8,9,10,11]. To promote reuse and FAIR data sharing [12], minimum information standards and a controlled vocabulary for describing MAVE experiments and variant effect maps are needed. Here, we—members of the AVE Experimental Technology and Standards and AVE Data Coordination and Dissemination workstreams—provide a comprehensive structured vocabulary and recommendations for data release for MAVE datasets. Uptake of these recommendations by the MAVE community will greatly improve the usability and longevity of MAVE datasets, enabling novel insights and applications.
Results and discussion
All MAVEs share a core pipeline: generation of a variant library, delivery of the library into a model system, separation of variants based on function, quantification of variant frequency by high-throughput DNA sequencing, and data analysis with score calculation [1, 2, 13]. Accurate and consistent metadata describing each of these steps underpins the interpretability of MAVE functional scores and is required for any advanced quantitative analysis, such as comparing and combining scores. To systematize these metadata, we have defined and implemented a computable controlled vocabulary that covers the majority of current and emerging MAVE techniques (Fig. 1) [14]. This vocabulary captures the major steps of the MAVE experimental process including project scope, library generation, library integration/expression, assay type, and sequencing method. The vocabulary also contains terms to describe the biological and disease relevance of the assay. In addition to releasing scores and other datasets in published papers, we recommend sharing MAVE datasets through MaveDB, an open-source platform to distribute and interpret MAVE data [15, 16].
Fig. 1. A structured vocabulary of terms relevant to the technical development, execution and recording of multiplexed assays of variant effect (MAVEs). Each category of controlled vocabulary terms is depicted (top, gray boxes) along with three examples from published MAVE datasets. From left to right, the figure includes (green boxes) [17], (blue boxes) [18], and (red boxes) [19]. Example files for each of these examples are available in the GitHub repository (see Availability of data and materials).
Researchers should communicate the target sequence, the method used to generate library diversity, and the method of variant delivery into the assay system using terms from the controlled vocabulary. Metadata about the variant generation method should include terms for either editing at the endogenous locus or in vitro variant library generation. It should also specify the model system as defined by NCBI Taxonomy ID [20] and Cell Line Ontology (CLO) [21] terms where available.
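As a minimal sketch of how such metadata could be checked by machine, the following Python example validates a record against a small set of permitted terms. The field names, the term lists, and the accession are hypothetical placeholders chosen for illustration; they are not the terms defined in the published vocabulary [14]. The only real identifier is the NCBI Taxonomy ID for Homo sapiens (9606).

```python
# Hypothetical term lists for illustration only. The actual controlled
# vocabulary is defined in the repository cited as [14].
ALLOWED_TERMS = {
    "variant_generation": {
        "endogenous locus editing",
        "in vitro library construction",
    },
    "delivery_method": {
        "viral transduction",
        "plasmid transfection",
        "not applicable",
    },
}
REQUIRED_FIELDS = (
    "target_sequence_id",
    "variant_generation",
    "delivery_method",
    "taxonomy_id",
)


def validate_record(record: dict) -> list[str]:
    """Return a list of problems; an empty list means the record passed."""
    problems = []
    for field in REQUIRED_FIELDS:
        if field not in record:
            problems.append(f"missing required field: {field}")
    for field, allowed in ALLOWED_TERMS.items():
        value = record.get(field)
        if value is not None and value not in allowed:
            problems.append(f"{field}: {value!r} is not a controlled term")
    taxonomy_id = record.get("taxonomy_id")
    if taxonomy_id is not None and not (
        isinstance(taxonomy_id, int) and taxonomy_id > 0
    ):
        problems.append("taxonomy_id must be a positive integer")
    return problems


example = {
    "target_sequence_id": "EXAMPLE_ACCESSION.1",  # placeholder, not a real accession
    "variant_generation": "in vitro library construction",
    "delivery_method": "viral transduction",
    "taxonomy_id": 9606,  # NCBI Taxonomy ID for Homo sapiens
}
print(validate_record(example))
```

Returning a list of problems, rather than raising on the first one, lets a submitter see every missing or uncontrolled term in a single pass.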
It is essential for the target sequence to be linked to a reference genome database or similar by including a versioned stable identifier from a widely used resource such as RefSeq [22], Ensembl [23], or UniProt [24]. We also recommend that researchers designing a study choose a reference-identical allele when it does not otherwise affect the study design, particularly for clinically relevant targets. The entire target sequence used in the assay must be provided to allow MaveDB and other systems to generate globally unique identifiers (sha512t24u computed identifiers [25]) as used by the Global Alliance for Genomics and Health (GA4GH) [26] refget [27] and Variation Representation Specification (VRS) [28] standards.
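The sha512t24u computed identifier mentioned above can be sketched in a few lines of Python: take the SHA-512 digest of the sequence, keep the first 24 bytes, and encode them as URL-safe base64. The `ga4gh:SQ.` prefix follows the convention VRS uses for sequence identifiers. Stripping whitespace and upper-casing the sequence before hashing is an assumption of this sketch, as is the helper name `sequence_identifier`; production systems should rely on an established implementation of the standard.

```python
import base64
import hashlib


def sha512t24u(data: bytes) -> str:
    """Base64url-encode the first 24 bytes of the SHA-512 digest of data."""
    digest = hashlib.sha512(data).digest()
    # 24 bytes encode to exactly 32 base64 characters, so no padding appears.
    return base64.urlsafe_b64encode(digest[:24]).decode("ascii")


def sequence_identifier(sequence: str) -> str:
    """Build a GA4GH-style computed identifier for a target sequence.

    Assumption: the sequence is normalized by stripping surrounding
    whitespace and upper-casing before hashing.
    """
    normalized = sequence.strip().upper()
    return "ga4gh:SQ." + sha512t24u(normalized.encode("ascii"))


print(sequence_identifier("ACGT"))
```

Because the identifier is derived only from the sequence content, two groups that deposit the same target sequence obtain the same identifier without coordinating with each other. This is why the complete target sequence must be provided.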
We recommend that variant libraries be exchanged using VRS and stored using a VRS-compatible information model, including the aforementioned computed identifiers, inter-residue sequence location data, and VOCA-normalized allele representation [28, 29]. This allows variants to be defined in terms of both the variant on the target sequence and the homologous variant on the linked reference sequences with an appropriate variant mapping relation, such as the homologous_to relation from the Sequence Ontology [30]. Descriptions of variants on target sequences should follow the MAVE-HGVS nomenclature conventions [16]. Homologous variants on linked reference sequences should be described following conventions typical for the target organism, e.g., using the Human Genome Variation Society (HGVS) variant nomenclature [31] for variants on human reference sequences. An example of these sequence variant recommendations in practice is described in Arbesfeld et al. [32], where they enable interoperability with downstream resources including the Ensembl Variant Effect Predictor (VEP) [33], UCSC Genome Browser [34], the Genomics to Proteins resource [35], the ClinGen Allele Registry [36], and ClinGen Linked Data Hub.
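To illustrate what inter-residue sequence location data means in practice, the Python sketch below converts a single-nucleotide substitution written with 1-based residue numbering into a zero-based inter-residue interval. The `SimpleAllele` class and `to_inter_residue` function are hypothetical and much simpler than the VRS information model. The sketch handles only substitutions with an `n.` prefix and assumes that position 1 is the first residue of the target sequence.

```python
import re
from dataclasses import dataclass

# Simplified pattern: only single-nucleotide substitutions such as "n.3G>T".
_SUBSTITUTION = re.compile(r"^n\.(\d+)([ACGT])>([ACGT])$")


@dataclass(frozen=True)
class SimpleAllele:
    """Illustrative container; not the VRS Allele schema."""
    start: int  # inter-residue coordinate before the affected residue
    end: int    # inter-residue coordinate after the affected residue
    ref: str
    alt: str


def to_inter_residue(variant: str, target_sequence: str) -> SimpleAllele:
    """Convert a 1-based substitution into an inter-residue interval."""
    match = _SUBSTITUTION.match(variant)
    if match is None:
        raise ValueError(f"unsupported variant string: {variant!r}")
    position = int(match.group(1))
    ref, alt = match.group(2), match.group(3)
    if not 1 <= position <= len(target_sequence):
        raise ValueError(f"position {position} is outside the target sequence")
    # Residue p (1-based) lies between inter-residue coordinates p-1 and p.
    start, end = position - 1, position
    if target_sequence[start:end] != ref:
        raise ValueError(f"reference base {ref!r} does not match the target sequence")
    return SimpleAllele(start=start, end=end, ref=ref, alt=alt)


print(to_inter_residue("n.3G>T", "ACGTAC"))
```

Inter-residue coordinates coincide with Python slice indices, so the reference residue can be verified directly with `target_sequence[start:end]`.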
The phenotypic assay is the most distinctive aspect of a MAVE compared to other data types for which minimum information standards have been established. Functional assays are tremendously diverse in terms of both the assay readout and the biology the assay was designed to interrogate. For assay readout, we have identified a subset of phenotypic readouts in the Ontology for Biomedical Investigations (OBI) [37] that are commonly used in variant effect maps. Because OBI has over 2500 terms, we hope that this “short list” will help researchers identify the most relevant terms to describe their experiments. Nevertheless, we also welcome the use of other OBI terms if necessary to describe new assays. Assays that used variants with known effects to calibrate or validate the assay should include these variants, their effects, and the source of the information [38]. To promote interoperability, we suggest using a structured format such as a table or JSON document and applying the VRS standard as described above. Researchers should also detail any environmental variables (such as temperature or the addition of small molecules) in their experimental methods. We encourage experimenters to use publicly accessible resources like protocols.io to describe their assay protocols in detail and share them with the community.
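One possible shape for a structured record of calibration variants is sketched below in Python. The key names (`variant`, `expected_effect`, `source`) and the two entries are hypothetical placeholders, not a published schema or real calibration data. The sketch only shows that each control variant carries its known effect and the source of that information.

```python
import json

REQUIRED_KEYS = {"variant", "expected_effect", "source"}


def control_variants_to_json(controls: list[dict]) -> str:
    """Serialize calibration variants, rejecting incomplete entries."""
    for entry in controls:
        missing = REQUIRED_KEYS - entry.keys()
        if missing:
            raise ValueError(f"entry {entry!r} is missing: {sorted(missing)}")
    return json.dumps({"calibration_variants": controls}, indent=2, sort_keys=True)


# Placeholder entries for illustration; not real calibration data.
controls = [
    {
        "variant": "n.3G>T",
        "expected_effect": "loss of function",
        "source": "PLACEHOLDER: citation or database accession",
    },
    {
        "variant": "n.6C>T",
        "expected_effect": "normal function",
        "source": "PLACEHOLDER: citation or database accession",
    },
]
document = control_variants_to_json(controls)
print(document)
```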
Researchers should use the appropriate controlled vocabulary terms for describing the high-throughput sequencing method used for variant quantification. We strongly recommend that raw sequence reads be deposited in a suitable repository, such as the Sequence Read Archive (SRA) [39] or Gene Expression Omnibus (GEO) [40], along with a description of each file (e.g., time point and sample information).
We recommend that researchers investigating clinical phenotypes use terms from the Mondo Disease Ontology (Mondo) [41] or Online Mendelian Inheritance in Man (OMIM) [42] to help clinicians and other stakeholders retrieve relevant functional data. Particular care is needed for genes encoding proteins with multiple functional domains and where loss-of-function and gain-of-function variants are associated with different diseases. Ideally, each MAVE should be associated with a particular gene-disease entity that describes the mechanism of disease, such as those defined by G2P [43], together with a description of how the MAVE assay recapitulates or is relevant to that mechanism. Some genes or functional domains may require multiple MAVE assays, each probing a different function or attribute of the gene product, to accurately model different disease entities.
Although it is not within the scope of this controlled vocabulary, it is still crucial to detail the data analysis performed to produce a variant effect map. This includes steps to generate variant counts, including sequence read processing, quality filtering, alignment, and variant identification, as well as further statistical and bioinformatic processing to calculate scores and associated error estimates. Researchers should describe the analysis pipeline used for these calculations, including software versions. Several well-documented tools are available for this purpose and the field continues to advance rapidly [44,45,46,47]. Custom code should be shared using GitHub or a similar platform and archived using Zenodo or a similar archival service that mints a DOI.
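Recording software versions can be automated. The Python sketch below, which uses only the standard library, collects the interpreter version and the installed versions of a list of packages into a JSON document that could accompany a dataset. The package names passed in are examples supplied by the user, and the output layout is an assumption of this sketch rather than a prescribed format.

```python
import json
import platform
import sys
from importlib import metadata


def collect_versions(packages: list[str]) -> dict:
    """Return interpreter details plus the installed version of each package."""
    versions = {
        "python": platform.python_version(),
        "platform": sys.platform,
    }
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # Record the absence explicitly instead of silently skipping it.
            versions[name] = None
    return versions


# The package names here are examples; list whatever the pipeline imports.
print(json.dumps(collect_versions(["numpy", "pandas"]), indent=2))
```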
In addition to processed variant scores, we urge researchers to share raw counts for each dataset, as these have tremendous utility for downstream users who want to reanalyze datasets or develop new statistical methods. Similarly, we recommend that researchers report scores prior to normalization or imputation. MaveDB supports the deposition of counts, scores, normalized/imputed scores, and sequence metadata for the same dataset (Table 1).
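The value of raw counts for reanalysis can be seen in a small worked example. The Python sketch below computes one simple kind of score, a log ratio of a variant's change in frequency relative to wild type, from counts before and after selection. This is a generic illustration and not the method of any particular tool cited above. The pseudocount of 0.5 and the example counts are assumptions made for the sketch.

```python
import math


def log_ratio_score(
    variant_counts: tuple[int, int],
    wildtype_counts: tuple[int, int],
    pseudocount: float = 0.5,
) -> float:
    """Log enrichment of a variant relative to wild type.

    Each tuple holds (count before selection, count after selection).
    The pseudocount avoids taking the logarithm of zero.
    """
    v_before, v_after = variant_counts
    w_before, w_after = wildtype_counts
    variant_ratio = (v_after + pseudocount) / (v_before + pseudocount)
    wildtype_ratio = (w_after + pseudocount) / (w_before + pseudocount)
    return math.log(variant_ratio) - math.log(wildtype_ratio)


# Hypothetical counts: the variant is depleted while wild type is unchanged.
score = log_ratio_score(variant_counts=(100, 10), wildtype_counts=(1000, 1000))
print(score < 0)
```

A downstream user who has only the final scores cannot change the pseudocount, the reference variant, or the error model. With the counts, each of these choices can be revisited.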
Conclusions
Minimum data standards are important to guide researchers who want their datasets to be used and cited broadly. We anticipate that this document will enhance the readability and discoverability of current and future datasets by defining a vocabulary that can be adopted across the many fields where MAVEs are being performed and where the resulting datasets are being used. Ensuring a minimum set of available metadata that uses a shared set of terms enables new types of analysis, such as machine learning methods to combine large numbers of disparate, high-dimensional datasets like MAVEs. Large-scale meta-analyses of multiple MAVE datasets have already been implemented in several contexts, including computational prediction of variant effects [48, 49] and clinical variant reclassification [50]. In the near term, the controlled vocabulary will be implemented as part of MaveDB records, creating a large set of rich metadata annotations that can be searched and mined. We believe that the MAVE community should share datasets and resources responsibly and that accessibility is real only when it ensures usability and reproducibility.
Methods
The initial draft of the controlled vocabulary was developed collaboratively using Google Docs. The controlled vocabulary schema is defined using JSON Schema Draft 2020-12.
Availability of data and materials
The controlled vocabulary implementation is available from the AVE Data Coordination and Dissemination workstream GitHub repository located at https://github.com/ave-dcd/mave_vocabulary and stably archived using Zenodo at https://doi.org/10.5281/zenodo.10719897 [14]. The implementation is provided under the Creative Commons Attribution 4.0 International (CC BY 4.0) license.
References
Gasperini M, Starita L, Shendure J. The power of multiplexed functional analysis of genetic variants. Nat Protoc. 2016;11:1782–7.
Tabet D, Parikh V, Mali P, Roth FP, Claussnitzer M. Scalable functional assays for the interpretation of human genetic variation. Annu Rev Genet. 2022;56:441–65.
Starita LM, Ahituv N, Dunham MJ, Kitzman JO, Roth FP, Seelig G, et al. Variant interpretation: functional assays to the rescue. Am J Hum Genet. 2017;101:315–25.
Stein A, Fowler DM, Hartmann-Petersen R, Lindorff-Larsen K. Biophysical and mechanistic models for disease-causing protein variants. Trends Biochem Sci. 2019;44:575–88.
Kinney JB, McCandlish DM. Massively parallel assays and quantitative sequence–function relationships. Annu Rev Genomics Hum Genet. 2019;20:99–127.
Starr TN, Picton LK, Thornton JW. Alternative evolutionary histories in the sequence space of an ancient protein. Nature. 2017;549:409–13.
Gallego Romero I, Lea AJ. Leveraging massively parallel reporter assays for evolutionary questions. Genome Biol. 2023;24:26.
Brazma A, Hingamp P, Quackenbush J, Sherlock G, Spellman P, Stoeckert C, et al. Minimum information about a microarray experiment (MIAME)-toward standards for microarray data. Nat Genet. 2001;29:365–71.
Taylor CF, Paton NW, Lilley KS, Binz P-A, Julian RK, Jones AR, et al. The minimum information about a proteomics experiment (MIAPE). Nat Biotechnol. 2007;25:887–93.
Brazma A, Ball C, Bumgarner R, Furlanello C, Miller M, Quackenbush J, et al. MINSEQE: Minimum Information about a high-throughput Nucleotide SeQuencing Experiment - a proposal for standards in functional genomic data reporting. 2012. Cited 2023 Apr 23. Available from: https://zenodo.org/record/5706412.
Füllgrabe A, George N, Green M, Nejad P, Aronow B, Fexova SK, et al. Guidelines for reporting single-cell RNA-seq experiments. Nat Biotechnol. 2020;38:1384–6.
Wilkinson MD, Dumontier M, Aalbersberg IJJ, Appleton G, Axton M, Baak A, et al. The FAIR Guiding Principles for scientific data management and stewardship. Sci Data. 2016;3:160018.
Fowler DM, Fields S. Deep mutational scanning: a new style of protein science. Nat Meth. 2014;11:801–7.
Wagner AH, Rubin AF. Minimum information standards implementation for Multiplexed Assays of Variant Effect (MAVEs). Zenodo; 2024. Cited 2024 Feb 28. Available from: https://zenodo.org/record/10719897.
Esposito D, Weile J, Shendure J, Starita LM, Papenfuss AT, Roth FP, et al. MaveDB: an open-source platform to distribute and interpret data from multiplexed assays of variant effect. Genome Biol. 2019;20:223.
Rubin AF, Min JK, Rollins NJ, Da EY, Esposito D, Harrington M, et al. MaveDB v2: a curated community database with over three million variant effects from multiplexed functional assays. bioRxiv; 2021. 2021.11.29.470445. Cited 2023 Jun 30. Available from: https://www.biorxiv.org/content/10.1101/2021.11.29.470445v2.
Findlay GM, Daza RM, Martin B, Zhang MD, Leith AP, Gasperini M, et al. Accurate classification of BRCA1 variants with saturation genome editing. Nature. 2018;562:217–22.
Matreyek KA, Starita LM, Stephany JJ, Martin B, Chiasson MA, Gray VE, et al. Multiplex assessment of protein variant abundance by massively parallel sequencing. Nat Genet. 2018;50:874–82.
Seuma M, Lehner B, Bolognesi B. An atlas of amyloid aggregation: the impact of substitutions, insertions, deletions and truncations on amyloid beta fibril nucleation. Nat Commun. 2022;13:7084.
Schoch CL, Ciufo S, Domrachev M, Hotton CL, Kannan S, Khovanskaya R, et al. NCBI Taxonomy: a comprehensive update on curation, resources and tools. Database (Oxford). 2020;2020:baaa062.
Sarntivijai S, Lin Y, Xiang Z, Meehan TF, Diehl AD, Vempati UD, et al. CLO: the cell line ontology. J Biomed Semantics. 2014;5:37.
O’Leary NA, Wright MW, Brister JR, Ciufo S, Haddad D, McVeigh R, et al. Reference sequence (RefSeq) database at NCBI: current status, taxonomic expansion, and functional annotation. Nucleic Acids Res. 2016;44:D733–45.
Cunningham F, Allen JE, Allen J, Alvarez-Jarreta J, Amode MR, Armean IM, et al. Ensembl 2022. Nucleic Acids Res. 2022;50:D988–95.
UniProt Consortium. UniProt: the Universal Protein Knowledgebase in 2023. Nucleic Acids Res. 2023;51:D523–31.
Hart RK, Prlić A. SeqRepo: a system for managing local collections of biological sequences. PLoS One. 2020;15:e0239883.
Rehm HL, Page AJH, Smith L, Adams JB, Alterovitz G, Babb LJ, et al. GA4GH: international policies and standards for data sharing across genomic research and healthcare. Cell Genom. 2021;1:100029.
Yates AD, Adams J, Chaturvedi S, Davies RM, Laird M, Leinonen R, et al. Refget: standardized access to reference sequences. Bioinformatics. 2021;38:299–300.
Wagner AH, Babb L, Alterovitz G, Baudis M, Brush M, Cameron DL, et al. The GA4GH Variation Representation Specification: a computational framework for variation representation and federated identification. Cell Genom. 2021;1:100027.
Holmes JB, Moyer E, Phan L, Maglott D, Kattman B. SPDI: data model for variants and applications at NCBI. Bioinformatics. 2020;36:1902–7.
Eilbeck K, Lewis SE, Mungall CJ, Yandell M, Stein L, Durbin R, et al. The Sequence Ontology: a tool for the unification of genome annotations. Genome Biol. 2005;6:R44.
den Dunnen JT, Dalgleish R, Maglott DR, Hart RK, Greenblatt MS, McGowan-Jordan J, et al. HGVS recommendations for the description of sequence variants: 2016 update. Hum Mutat. 2016;37:564–9.
Arbesfeld JA, Da EY, Kuzma K, Paul A, Farris T, Riehle K, et al. Mapping MAVE data for use in human genomics applications. bioRxiv; 2023. 2023.06.20.545702. Cited 2023 Jun 30. Available from: https://www.biorxiv.org/content/10.1101/2023.06.20.545702v1.
McLaren W, Gil L, Hunt SE, Riat HS, Ritchie GRS, Thormann A, et al. The Ensembl Variant Effect Predictor. Genome Biol. 2016;17:122.
Nassar LR, Barber GP, Benet-Pagès A, Casper J, Clawson H, Diekhans M, et al. The UCSC Genome Browser database: 2023 update. Nucleic Acids Res. 2023;51:D1188–95.
Iqbal S, Pérez-Palma E, Jespersen JB, May P, Hoksza D, Heyne HO, et al. Comprehensive characterization of amino acid positions in protein structures reveals molecular effect of missense variants. Proc Natl Acad Sci U S A. 2020;117:28201–11.
Pawliczek P, Patel RY, Ashmore LR, Jackson AR, Bizon C, Nelson T, et al. ClinGen Allele Registry links information about genetic variants. Hum Mutat. 2018;39:1690–701.
Brinkman RR, Courtot M, Derom D, Fostel JM, He Y, Lord P, et al. Modeling biomedical experimental processes with OBI. J Biomed Semantics. 2010;1(Suppl 1):S7.
Gelman H, Dines JN, Berg J, Berger AH, Brnich S, Hisama FM, et al. Recommendations for the collection and use of multiplexed functional data for clinical variant interpretation. Genome Med. 2019;11:85.
Leinonen R, Sugawara H, Shumway M. The sequence read archive. Nucleic Acids Res. 2011;39:D19–21.
Barrett T, Wilhite SE, Ledoux P, Evangelista C, Kim IF, Tomashevsky M, et al. NCBI GEO: archive for functional genomics data sets—update. Nucleic Acids Res. 2013;41:D991–5.
Mungall CJ, McMurry JA, Köhler S, Balhoff JP, Borromeo C, Brush M, et al. The Monarch Initiative: an integrative data and analytic platform connecting phenotypes to genotypes across species. Nucleic Acids Res. 2017;45:D712–22.
Hamosh A, Amberger JS, Bocchini C, Scott AF, Rasmussen SA. Online Mendelian Inheritance in Man (OMIM®): Victor McKusick’s magnum opus. Am J Med Genet A. 2021;185:3259–65.
Thormann A, Halachev M, McLaren W, Moore DJ, Svinti V, Campbell A, et al. Flexible and scalable diagnostic filtering of genomic variants using G2P with Ensembl VEP. Nat Commun. 2019;10:2373.
Bloom JD. Software for the analysis and visualization of deep mutational scanning data. BMC Bioinformatics. 2015;16:168.
Rubin AF, Gelman H, Lucas N, Bajjalieh SM, Papenfuss AT, Speed TP, et al. A statistical framework for analyzing deep mutational scanning data. Genome Biol. 2017;18:150.
Faure AJ, Schmiedel JM, Baeza-Centurion P, Lehner B. DiMSum: an error model and pipeline for analyzing deep mutational scanning data and diagnosing common experimental pathologies. Genome Biol. 2020;21:207.
Soneson C, Bendel AM, Diss G, Stadler MB. mutscan-a flexible R package for efficient end-to-end analysis of multiplexed assays of variant effect data. Genome Biol. 2023;24:132.
Wu Y, Li R, Sun S, Weile J, Roth FP. Improved pathogenicity prediction for rare human missense variants. Am J Hum Genet. 2021;108:1891–906.
Frazer J, Notin P, Dias M, Gomez A, Min JK, Brock K, et al. Disease variant prediction with deep generative models of evolutionary data. Nature. 2021;599:91–5.
Fayer S, Horton C, Dines JN, Rubin AF, Richardson ME, McGoldrick K, et al. Closing the gap: systematic integration of multiplexed functional data resolves variants of uncertain significance in BRCA1, TP53, and PTEN. Am J Hum Genet. 2021;108:2248–58.
Acknowledgements
The authors would like to thank Michael Boettcher and Melissa S. Cline for thoughtful discussion of this work and comments on the manuscript. We would also like to thank Alex Hopkins for administrative support. The images in Fig. 1 were created in BioRender.
Review history
The review history is available as Additional file 1.
Peer review information
Veronique van den Berghe was the primary editor of this article and managed its editorial process and peer review in collaboration with the rest of the editorial team.
Funding
MC received funding from the Novo Nordisk Foundation (NNF21SA0072102) and NIH/NIDDK grant UM1DK126185. VNP received funding from NIH/NHLBI grants K08HL143185 and R01HL164675. AHW received funding from NIH/NHGRI grant R35HG011949. LAM, FPR, and AFR received funding from NIH/NHGRI grants UM1HG011969 and RM1HG010461. ANNB acknowledges funding from CIHR, NSERC, and the University of Toronto. KR receives funding from NIH/NHGRI grant U24HG009649. FPR received funding from a CIHR Foundation Grant. BB received funding from La Caixa Research Foundation grant LCF/PR/HR21/52410004 and the Spanish Ministry of Science, Innovation and Universities grants PID2021-127761OB-I00 and RYC2020-028861-I. AMG received funding from NIH/NHGRI grant R00HG010904 and NIH/NHLBI grant R01HL164675. This work was supported by the Australian government.
Author information
Contributions
MC, VNP, CJB, ANNB, FPR, DT, AMG, and AFR conceptualized the study. AHW, FPR, and AFR developed the methodology. AHW and KR developed the software. AHW performed the validation. MC, AHW, BB, AMG, and AFR performed the formal analysis. AHW conducted investigations. VNP and DT provided resources. VNP and ANNB curated data. MC, AHW, JAA, BB, AMG, and AFR wrote the original draft. MC, VNP, AHW, CJB, HVF, LAM, ANNB, DT, BB, AMG, and AFR reviewed and edited the manuscript. LAM, BB, AMG, and AFR supervised the team. VNP and LAM performed project administration.
Authors’ Twitter handles
Twitter handles: @MelinaClaussnit (Melina Claussnitzer), @handlerwagner (Alex H. Wagner), @vnparikh (Victoria N. Parikh), @LaraMuffley (Lara A. Muffley), @alex_nguyen_ba (Alex N. Nguyen Ba), @fproth (Frederick P. Roth), @Bennibolo (Benedetta Bolognesi), @amglazer (Andrew M. Glazer), @rubin_af (Alan F. Rubin).
Ethics declarations
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
The authors declare that they have no competing interests.
Additional information
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary Information
Additional file 1.
Review history.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
About this article
Cite this article
Claussnitzer, M., Parikh, V.N., Wagner, A.H. et al. Minimum information and guidelines for reporting a multiplexed assay of variant effect. Genome Biol 25, 100 (2024). https://doi.org/10.1186/s13059-024-03223-9
DOI: https://doi.org/10.1186/s13059-024-03223-9