How to BEAST (Louis du Plessis)
- Build-a-BEAST model
- Crash course in MCMC
- BEAST2 workflow
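Bayes' Thm: P(M|D) = P(D|M) P(M) / P(D), where M = model (and its parameters), D = data.
We don't care about P(D): it does not depend on M, so it cancels whenever we compare two candidate values of M.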
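A minimal sketch of why P(D) can be ignored, using a toy coin-flip model rather than anything from BEAST. The function name `grid_posterior`, the flat prior, and the 7-heads-in-10-flips data are all assumptions made for illustration. The point is that the unnormalised product P(D|M) P(M) already has the same peak and the same ratios as the normalised posterior.

```python
import math

def grid_posterior(heads, flips, grid_size=101):
    """Posterior over a coin's heads-probability, evaluated on an even grid."""
    thetas = [i / (grid_size - 1) for i in range(grid_size)]
    prior = [1.0 for _ in thetas]  # flat prior P(M): an assumption of this sketch
    likelihood = [
        math.comb(flips, heads) * t ** heads * (1 - t) ** (flips - heads)
        for t in thetas
    ]  # binomial P(D|M)
    unnormalised = [lik * pri for lik, pri in zip(likelihood, prior)]
    evidence = sum(unnormalised)  # plays the role of P(D) on this grid
    posterior = [u / evidence for u in unnormalised]
    return thetas, unnormalised, posterior

thetas, unnorm, post = grid_posterior(heads=7, flips=10)
# Dividing by the evidence rescales every grid point by the same constant,
# so the location of the peak and all density ratios are unchanged.
peak = max(range(len(thetas)), key=lambda i: unnorm[i])
print("most probable grid value:", thetas[peak])
```

In a real BEAST analysis the parameter space is far too large for a grid, which is why sampling is used instead.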
The base data structure in BEAST is a rooted time tree.
What can go into a BEAST model?
- Genetic sequences - duh.
- Genealogy - what are the ancestral relationships between the sequences in our datasets?
- demographic model - describes how the tree grows over time, P(tree | demographic model). Different population dynamics generate different trees. Usually coalescent or birth-death. Coalescent (goes backwards in time) - assumes Wright-Fisher-like population dynamics, given effective population size (N_e). Birth-death (goes forward in time) - lambda = infection rate, delta = rate of becoming non-infectious, p = sampling probability.
- site model - how sequences evolve along the tree. We observe sequences at the tips, not their histories. Assumes every site evolves independently, and substitutions are Markovian. See the Chapman-Kolmogorov thm, for interest. Note: K80 (transition/transversion) always eventually converges on a uniform distribution of nucleotides, because its rate matrix is symmetric. More general models (e.g. HKY, with unequal base frequencies) do not have a symmetric matrix.
  Gamma-distributed rate variation is not flexible enough to model differences between different loci. Use a separate substitution model for each locus (partitions).
- molecular clock model - dates the tree. Scales branch lengths to calendar time. Different branches may have different clock rates. Priors on internal nodes can help to calibrate the clock.
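To make the site-model item concrete, here is a small sketch of the K80 transition probabilities, using the standard closed-form solution. The parameter names `alpha` (transition rate) and `beta` (rate of each transversion), their default values, and the helper names are assumptions of this sketch, not BEAST2 identifiers.

```python
import math

NUCS = "ACGT"
PURINES = {"A", "G"}

def k80_prob(i, j, t, alpha=2.0, beta=1.0):
    """P(nucleotide j at time t | nucleotide i at time 0) under K80."""
    e1 = math.exp(-4.0 * beta * t)
    e2 = math.exp(-2.0 * (alpha + beta) * t)
    if i == j:
        return 0.25 + 0.25 * e1 + 0.5 * e2
    if (i in PURINES) == (j in PURINES):
        return 0.25 + 0.25 * e1 - 0.5 * e2  # transition: A<->G or C<->T
    return 0.25 - 0.25 * e1                 # transversion

def k80_matrix(t, alpha=2.0, beta=1.0):
    """4x4 transition probability matrix, rows and columns ordered A, C, G, T."""
    return [[k80_prob(i, j, t, alpha, beta) for j in NUCS] for i in NUCS]

def matmul(a, b):
    """Plain 4x4 matrix product, used to check Chapman-Kolmogorov."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]

# After a long time every entry approaches 1/4 (the uniform distribution),
# and P(s + t) equals P(s) P(t), which is the Chapman-Kolmogorov property.
long_run = k80_matrix(50.0)
two_step = matmul(k80_matrix(0.2), k80_matrix(0.5))
one_step = k80_matrix(0.7)
```

The matrix returned by `k80_matrix` is symmetric for every `t`, which matches the note about the uniform equilibrium.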
Note that these 5 choices are not really independent of each other!
- site models don't have to be on nucleotide data - we can also use amino acid data, morphological traits, roots of words, etc.
- BEAST2 doesn't always use trees!
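Going back to the demographic-model item in the list above, the coalescent can be sketched in a few lines. This assumes a constant population and the haploid Wright-Fisher scaling, where each pair of lineages coalesces at rate 1/N_e per generation; the function name `coalescent_times` is hypothetical and BEAST2 does not expose anything by that name.

```python
import random

def coalescent_times(n_samples, n_e, rng):
    """Simulate coalescence times, going backwards from the present.

    Returns the n_samples - 1 times (in generations before the present)
    at which two lineages merge; the last one is the time of the root.
    """
    times = []
    t = 0.0
    for k in range(n_samples, 1, -1):
        # With k lineages there are k*(k-1)/2 pairs, each coalescing
        # at rate 1/n_e (scaling assumed for this sketch).
        rate = k * (k - 1) / (2.0 * n_e)
        t += rng.expovariate(rate)
        times.append(t)
    return times

rng = random.Random(42)
small_pop = coalescent_times(n_samples=10, n_e=100, rng=rng)
large_pop = coalescent_times(n_samples=10, n_e=10000, rng=rng)
print("root age, N_e = 100:  ", small_pop[-1])
print("root age, N_e = 10000:", large_pop[-1])
```

Under this scaling the expected root age is 2 N_e (1 - 1/n), so larger effective population sizes tend to give deeper trees. That dependence is what lets a sampled tree carry information about N_e.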
How does BEAST compute the posterior? (spoiler: MCMC)
- MCMC performs a random walk on the posterior, preferentially sampling high-density areas
- draws samples from the posterior, outputs a list of values that can approximate the posterior
- We need only compare which posterior density is higher, so we only need the ratio of the posterior densities at two states (P(D) cancels in that ratio)
- Alexei: if we leave this to run forever, the random walk would eventually explore everywhere.
- Alexei: MCMC runs so that the equilibrium distribution == the posterior distribution.
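The points above can be sketched with a bare-bones random-walk Metropolis sampler. This is a toy, not BEAST2's implementation: the target is a standard normal standing in for the posterior, and the names `log_target` and `metropolis` and the step size are assumptions. Note that only the difference of log-densities (i.e. the ratio of densities) is ever used.

```python
import math
import random

def log_target(x):
    """Unnormalised log-density of a standard normal (toy 'posterior')."""
    return -0.5 * x * x

def metropolis(log_density, start, n_steps, step_size, rng):
    """Random-walk Metropolis: returns the list of visited states."""
    x = start
    log_p = log_density(x)
    samples = []
    for _ in range(n_steps):
        proposal = x + rng.gauss(0.0, step_size)  # symmetric proposal
        log_p_new = log_density(proposal)
        # Accept with probability min(1, density ratio); the normalising
        # constant would cancel here, so it is never needed.
        accept_prob = math.exp(min(0.0, log_p_new - log_p))
        if rng.random() < accept_prob:
            x, log_p = proposal, log_p_new
        samples.append(x)  # a rejected step records the current state again
    return samples

rng = random.Random(0)
chain = metropolis(log_target, start=0.0, n_steps=50000, step_size=2.5, rng=rng)
mean = sum(chain) / len(chain)
var = sum((x - mean) ** 2 for x in chain) / len(chain)
print("sample mean:", mean, "sample variance:", var)
```

Uphill moves are always accepted and downhill moves only sometimes, which is what makes the walk spend more time in high-density areas.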
How good is an approximation? - If we knew what the posterior was, then we'd want approximately q% of the sampled points to fall inside the q% contour ("contours" are like "percentiles").
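A small sketch of that check for a case where the true distribution is known. It assumes a one-dimensional standard normal, where the q% contour is simply a symmetric interval around zero, and it uses independent draws as a stand-in for MCMC output; the function name `coverage` is hypothetical.

```python
import random
from statistics import NormalDist

def coverage(samples, q):
    """Fraction of samples inside the central region holding probability q."""
    # For a standard normal, the highest-density region with mass q
    # is the interval [-half_width, +half_width].
    half_width = NormalDist().inv_cdf(0.5 + q / 2.0)
    inside = sum(1 for x in samples if abs(x) <= half_width)
    return inside / len(samples)

rng = random.Random(7)
draws = [rng.gauss(0.0, 1.0) for _ in range(20000)]  # stand-in for a chain
for q in (0.5, 0.9, 0.95):
    print("q =", q, "fraction inside:", coverage(draws, q))
```

In a real analysis the posterior is unknown, so this check is only possible on simulated or toy problems.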
Target distribution - this is the posterior in BEAST2. MCMC steps through the parameter state space and samples the target distribution.
Proposal distribution - used to decide where to step to next. This choice affects only the efficiency of the algorithm, not the distribution it converges to.
Operators - BEAST2's name for the moves that propose new states. They are part of the MCMC algorithm, not of the model.
Input - MCMC chain length and sampling interval (how often to record samples, so that recorded samples are not strongly correlated). More than 10,000 recorded samples is a waste of space; the trick is to sample at the right frequency.
After - discard burn-in, then assess convergence and mixing. Is the chain mixing well? (i.e. does the trace look like a nice caterpillar? It should look like white noise, i.e. uncorrelated.) Are samples drawn from all over the stationary distribution, rather than from one region? The MCMC algo might be getting stuck on one hill, then making a jump to another hill.
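One way to put a number on "mixing well" is to look at autocorrelation and an effective sample size (ESS). The sketch below is a simple estimator, assuming a 10% burn-in and truncating the autocorrelation sum at the first non-positive lag; both choices, and all function names, are assumptions of this example rather than what any particular tool does.

```python
def discard_burn_in(samples, fraction=0.1):
    """Drop the first `fraction` of the chain (10% is an arbitrary default)."""
    return samples[int(len(samples) * fraction):]

def autocorrelation(xs, lag):
    """Sample autocorrelation of xs at the given lag."""
    n = len(xs)
    mean = sum(xs) / n
    var = sum((x - mean) ** 2 for x in xs) / n
    if var == 0.0:
        return 0.0
    cov = sum((xs[i] - mean) * (xs[i + lag] - mean) for i in range(n - lag)) / n
    return cov / var

def effective_sample_size(xs):
    """Rough ESS: n / (1 + 2 * sum of leading positive autocorrelations)."""
    n = len(xs)
    tau = 1.0
    for lag in range(1, n // 2):
        rho = autocorrelation(xs, lag)
        if rho <= 0.0:  # stop at the first non-positive lag
            break
        tau += 2.0 * rho
    return n / tau
```

A trace that looks like white noise gives an ESS close to the number of recorded samples. A sticky chain gives a much smaller ESS, which suggests a longer run or a larger sampling interval.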
Protip - do multiple independent runs, and see if they converge on the same posterior. The posterior may be bimodal, or not a nice bell shape!
BEAST Best Practices
(Only a guideline; each analysis is unique)
- Know thy data, to plan the five different inputs.
- Run analysis with multiple chains. Combine chains, assess convergence and autocorrelation.
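As one possible way to compare multiple chains numerically, here is a sketch of the Gelman-Rubin statistic (R-hat). The notes do not name this diagnostic; it is swapped in here as a common choice. The function name `gelman_rubin` is hypothetical, and the sketch assumes at least two chains of equal length with burn-in already removed.

```python
import math
import random

def gelman_rubin(chains):
    """Potential scale reduction factor (R-hat) for one parameter."""
    m = len(chains)     # number of chains (needs m >= 2)
    n = len(chains[0])  # samples per chain (assumed equal)
    means = [sum(c) / n for c in chains]
    grand_mean = sum(means) / m
    # Between-chain variance: how far apart the chain means are.
    between = n / (m - 1) * sum((mu - grand_mean) ** 2 for mu in means)
    # Within-chain variance: average sample variance of the chains.
    within = sum(
        sum((x - mu) ** 2 for x in c) / (n - 1) for c, mu in zip(chains, means)
    ) / m
    pooled = (n - 1) / n * within + between / n
    return math.sqrt(pooled / within)

rng = random.Random(3)
agree = [[rng.gauss(0.0, 1.0) for _ in range(1000)] for _ in range(2)]
disagree = [[rng.gauss(mu, 1.0) for _ in range(1000)] for mu in (0.0, 5.0)]
print("chains on the same hill:  ", gelman_rubin(agree))
print("chains on different hills:", gelman_rubin(disagree))
```

Values close to 1 suggest the chains agree; values well above 1 suggest they are sampling different regions, in which case combining them would be misleading.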