Comparison of finite and infinite mixture models for capturing compositional heterogeneity across sites
Phylogenetic modelling of the variation of the evolutionary process across sites from multi-species sequence alignments has garnered increasing attention over the last few decades. One of the main approaches, sometimes known as random effects modelling, adopts the view that the heterogeneity across observations is a result of the data set having been emitted from several different models, each drawn from a distribution. When little is known about the form of the across-site heterogeneity, finite mixture models provide discretizations of the unknown distribution into a pre-determined set of sub-models, or components. Choosing a level of discretization that is sufficiently fine-meshed to reflect the underlying heterogeneity is typically done through a set of likelihood-based model comparisons using different numbers of components. In the infinite mixture framework, the uncertainty regarding the number of components is accounted for by another layer built into the model formulation (i.e., a hierarchical modelling framework), providing a rich non-parametric fitting of the distribution of across-site heterogeneity. Here, we use Bayesian cross-validation to compare a wide range of finite mixture models, along with the infinite mixture modelling approach known as 'CAT' and the gamma-distributed rates-across-sites approach. We apply the comparison to both simulated and real multi-gene alignments. Our findings indicate that the potential improvement in model fit, as reported by cross-validation score, from finite mixture models is attained when the number of components of the mixture is between 20 and 60. In all cases that we considered, the fit of the CAT-GTR+Γ model matched or exceeded that of the best-fitting finite mixture model. When testing these models on data sets prone to long-branch attraction, our findings indicate that finite mixture models with low numbers of components result in topologies that align with those found from analyses using simple models. Finite mixture models with many components, on the other hand, produce topologies that match those obtained with the infinite mixture model CAT-GTR.
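
As a rough illustration of the two quantities compared above, the following is a generic sketch and not necessarily the exact formulation used in this study. All symbols are assumptions introduced here: $D_i$ is the alignment column at site $i$, $K$ the number of mixture components, $w_k$ the component weights, $\pi_k$ the equilibrium frequency profile of component $k$, $\theta$ the remaining parameters (topology, branch lengths, exchangeabilities, rate parameters), and $\theta^{(m)}$ the $m$-th of $M$ posterior samples obtained from the training sites.

```latex
% Site likelihood under a finite mixture of K frequency profiles
% (assumed notation; weights are non-negative and sum to one)
\[
  p(D_i \mid \theta, w, \pi) \;=\; \sum_{k=1}^{K} w_k \, p(D_i \mid \theta, \pi_k),
  \qquad w_k \ge 0, \quad \sum_{k=1}^{K} w_k = 1 .
\]

% Cross-validation score for one training/test split:
% log of the test-set likelihood averaged over posterior samples
% drawn conditional on the training set
\[
  \mathrm{CV} \;=\; \ln \left( \frac{1}{M} \sum_{m=1}^{M}
      p\!\left(D_{\mathrm{test}} \mid \theta^{(m)}\right) \right),
  \qquad \theta^{(m)} \sim p\!\left(\theta \mid D_{\mathrm{train}}\right).
\]
```

Under this sketch, a larger score averaged over several random splits indicates better predictive fit, and increasing $K$ refines the discretization of the across-site distribution of profiles.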