A microbiome-wide association study (MWAS), also known as a metagenome-wide association study (MGWAS), is a statistical methodology used to examine the full metagenome of a defined microbiome in various organisms to determine whether some feature of the microbiome (for example, a gene or species) is associated with a host trait. MWAS was adapted by the field of metagenomics from the widely used genome-wide association study (GWAS).
Although MWAS is phonetically and conceptually tied to GWAS, there are several key differences:
There are roughly 150 times more genes in the microbiome than in the human genome.[2] A GWAS only needs to search for significantly associated genes along the fixed number of chromosomes of the species. An MWAS, by contrast, must analyze however many features are present in an undetermined number of microorganisms. As a result, far more hypotheses are tested and the multiple testing problem is far more likely to arise.[3]
While the genomes of hosts in a population contain a relatively similar collection of genes, the genetic makeup of a microbiome can vary significantly between hosts and environments.[4] The metagenome of a microbiome can also vary over time within a given host,[5] whereas the genome of a host in a GWAS is fixed across its lifespan.
The resulting microbiome datasets are inherently compositional[6] and interactional. The assumption that the features lie in a Euclidean space is violated by the non-linear nature of compositional data.[7]
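One widely used way to work with compositional data is a log-ratio transform such as the centered log-ratio (CLR). The sketch below is illustrative only and is not a method prescribed by the sources cited above. The function names, the pseudocount of 0.5, and the toy counts are all assumptions. It shows that two samples with the same proportions but different sequencing depths map to the same CLR coordinates.

```python
import math


def clr(values):
    """Centered log-ratio transform of one sample's feature abundances.

    Each value is log-transformed and centered on the mean log value,
    which is the log of the sample's geometric mean.
    """
    if any(v <= 0 for v in values):
        raise ValueError("CLR needs strictly positive values; replace zeros first")
    logs = [math.log(v) for v in values]
    mean_log = sum(logs) / len(logs)
    return [x - mean_log for x in logs]


def add_pseudocount(counts, pseudocount=0.5):
    """Shift all counts by a small constant so zeros can be log-transformed.

    The value 0.5 is an arbitrary illustrative choice, not a recommendation.
    """
    return [c + pseudocount for c in counts]


# Hypothetical counts: the same community sequenced at two depths.
shallow = [120, 30, 850]
deep = [1200, 300, 8500]

clr_shallow = clr(shallow)
clr_deep = clr(deep)

# A sample containing a zero count must be shifted before the transform.
clr_with_zero = clr(add_pseudocount([0, 5, 10]))
```

Because only ratios between features enter the transform, the total read count of a sample drops out. Adding a pseudocount breaks this exact invariance slightly, so the handling of zeros is a modeling decision in its own right.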
There are several ways to classify which feature of the microbiome will be used in an MWAS. An MWAS can be performed at a specific taxonomic level (species, genus,[8] phylum, etc.), on operational taxonomic units (OTUs)[1] or amplicon sequence variants (ASVs), or on the transcriptome,[9] proteome,[10] and more. The approach used depends on the research hypothesis, as each method will often give differing results.
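As a rough illustration of how the choice of feature level changes the data being tested, the sketch below collapses one table of OTU counts to two different taxonomic ranks. The OTU identifiers, counts, taxonomy assignments, and the helper name `collapse_to_rank` are hypothetical and chosen only for the example.

```python
from collections import defaultdict


def collapse_to_rank(counts, taxonomy, rank):
    """Sum per-OTU counts into one count per label at the given rank.

    OTUs with no assignment at that rank are pooled as "unclassified".
    """
    collapsed = defaultdict(int)
    for feature_id, count in counts.items():
        label = taxonomy.get(feature_id, {}).get(rank, "unclassified")
        collapsed[label] += count
    return dict(collapsed)


# Hypothetical counts for one sample.
otu_counts = {"OTU_1": 40, "OTU_2": 10, "OTU_3": 25, "OTU_4": 5}

# Hypothetical taxonomy assignments for those OTUs.
taxonomy = {
    "OTU_1": {"phylum": "Bacteroidetes", "genus": "Bacteroides"},
    "OTU_2": {"phylum": "Bacteroidetes", "genus": "Bacteroides"},
    "OTU_3": {"phylum": "Bacteroidetes", "genus": "Prevotella"},
    "OTU_4": {"phylum": "Firmicutes", "genus": "Faecalibacterium"},
}

genus_counts = collapse_to_rank(otu_counts, taxonomy, "genus")
phylum_counts = collapse_to_rank(otu_counts, taxonomy, "phylum")
```

The same sample yields four features at the OTU level, three at the genus level, and two at the phylum level, so an association visible at one level may be diluted or hidden at another.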
Often, a taxonomic level or OTU/ASV-based approach is used to determine the correlations between a specific microbiome feature and the phenotype of interest. Several methods can be employed, including machine learning approaches such as random forests[11] and deep learning.[12] Feature association can also be established with programs such as DESeq2 and ANCOM. However, correlations established by the wide array of available tools do not always translate into causality. Researchers determine causality through sequential testing.[13] Newer methods have explored inferring digital twins of microbial ecosystems to address modeling challenges arising from the diversity of microbes in such environments, inter-host variability, and the compositionality of measurements.[14]
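The general shape of a per-feature association scan can be sketched as follows. This is a minimal sketch using a generic permutation test on the difference in group means followed by Benjamini-Hochberg adjustment for multiple testing. It is not how DESeq2 or ANCOM work internally, and the feature names, values, number of permutations, and seed are assumptions made for the example.

```python
import random
from statistics import mean


def permutation_p_value(cases, controls, n_perm=2000, seed=0):
    """Two-sided permutation p-value for a difference in group means."""
    rng = random.Random(seed)
    observed = abs(mean(cases) - mean(controls))
    pooled = list(cases) + list(controls)
    n_cases = len(cases)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:n_cases]) - mean(pooled[n_cases:]))
        if diff >= observed:
            hits += 1
    # Add one to numerator and denominator so the p-value is never zero.
    return (hits + 1) / (n_perm + 1)


def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values, in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical transformed abundances: feature -> (cases, controls).
features = {
    "feature_A": ([5.1, 4.8, 5.5, 5.0, 5.3, 4.9], [1.0, 1.2, 0.8, 1.1, 0.9, 1.3]),
    "feature_B": ([1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.0, 4.0]),
}

names = list(features)
raw_p = [permutation_p_value(*features[name]) for name in names]
adj_p = benjamini_hochberg(raw_p)
results = dict(zip(names, adj_p))
```

With thousands of features instead of two, the adjustment step matters far more: unadjusted p-values would produce many false positives simply because so many tests are run. A small adjusted p-value still indicates association only, not causation.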