PAUP* activity

expected learning outcome

The objective of this activity is to help you become familiar with using PAUP* for several types of analysis, including parsimony, distance, and likelihood. You will also gain an understanding of several fundamental aspects of phylogenetic analysis and models of molecular evolution.

getting started

The data set we will use for this PAUP* exercise is called simulated-dna-data.nex. It is a simulated data set consisting of DNA sequences for 9 organisms and 2500 sites (taxon 9 is the outgroup). You can assume that the simulation was a homogeneous Markov process (i.e., all sites evolve according to the same model, and the model does not change over the tree).

exercise 1: basic data manipulation and parsimony analysis

  1. Start PAUP*, executing the simulated-dna-data.nex data file.

    $ paup simulated-dna-data.nex; [ "$" represents the Linux/MacOS prompt ]
    (or)
    $ paup
    paup> exec simulated-dna-data.nex;

  2. Conduct an exact tree search to find the most parsimonious tree under simple parsimony with equal weights.
    paup> bandb; (for example; could also do alltrees)
    What is the length of the optimal tree?
  3. Examine the topology of the tree.

    paup> showtrees; (for example; could also do describe)

  4. Change the outgroup status so that the only outgroup taxon is taxon 9 and re-examine the tree topology.

    paup> outgroup 9;
    (or)
    paup> outgroup taxon_9;

    1. What happens to the shape of the tree?
    2. Inspect the parsimony scores.
      paup> pscores; (could also do describe)

      What happens to the tree length?

  5. Do a heuristic search using the default settings in PAUP*. Then perform another heuristic search using the random-addition-sequence method with 100 replicates.

    paup> hsearch;

    paup> hsearch/addseq=random nrep=100;

    Does it look like the optimal trees are hard to find for this data set?

  6. "Describe" the most-parsimonious tree as a phylogram.

    paup> describe/plot=p;

  7. Perform a bootstrap analysis with 1,000 replicates and 10 random-addition-sequence replicates per bootstrap replicate.

    paup> bootstrap nreps=1000/addseq=random nreps=10;

    Which grouping on the most-parsimonious tree is least well supported?

  8. What is the shortest tree that makes taxa 1 and 6 monophyletic with respect to the remaining taxa? (hint: use a constraint tree)

    Equivalently, but more typing: paup> constraints 1and6 = ((1,6),2,4,5,7,8,9);

    paup> hsearch enforce constraints=1and6;

    Answer: 4273 steps
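
As background for the tree lengths reported in this exercise, the following is a minimal Python sketch of how a parsimony length can be counted on a fixed tree under equal weights, using Fitch's algorithm for unordered characters. It is only an illustration, not PAUP*'s implementation: the toy alignment, the taxon names t1 to t4, and the function names are hypothetical and have nothing to do with simulated-dna-data.nex.

```python
def fitch_site(tree, states):
    """Fitch down-pass for one site.

    tree   : a taxon name (leaf) or a pair (left, right) of subtrees
    states : dict mapping taxon name -> nucleotide at this site
    Returns (possible ancestral states, number of changes in the subtree).
    """
    if isinstance(tree, str):          # leaf: its observed state, no changes
        return {states[tree]}, 0
    left, right = tree
    left_set, left_cost = fitch_site(left, states)
    right_set, right_cost = fitch_site(right, states)
    shared = left_set & right_set
    if shared:                         # children agree: no extra change needed
        return shared, left_cost + right_cost
    # children disagree: take the union and count one change
    return left_set | right_set, left_cost + right_cost + 1


def tree_length(tree, alignment):
    """Sum the per-site Fitch counts over all sites (equal weights)."""
    n_sites = len(next(iter(alignment.values())))
    total = 0
    for i in range(n_sites):
        column = {taxon: seq[i] for taxon, seq in alignment.items()}
        total += fitch_site(tree, column)[1]
    return total


# Hypothetical 4-taxon, 4-site alignment (illustration only).
toy_alignment = {
    "t1": "ACGT",
    "t2": "ACGA",
    "t3": "ATGA",
    "t4": "TTGA",
}
tree_a = (("t1", "t2"), ("t3", "t4"))
tree_b = (("t1", "t3"), ("t2", "t4"))

print("length of tree_a:", tree_length(tree_a, toy_alignment))
print("length of tree_b:", tree_length(tree_b, toy_alignment))
```

On this toy alignment the first topology requires 3 steps and the second 4, so the first would be preferred under the parsimony criterion. An exact search such as bandb guarantees the minimum of this kind of score over all topologies without evaluating every one individually.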

exercise 2: distance and likelihood analyses

  1. Reset all program options to their "factory defaults".

    paup> factory;

  2. Perform neighbor-joining analyses using uncorrected ("p"), Jukes-Cantor, Kimura 2-parameter, HKY85, Tamura-Nei, and GTR distances. Examine the distance matrix in each case.

    paup> dset dist=p; showdist; nj;
    paup> dset dist=jc; showdist; nj;
    paup> dset dist=k2p; showdist; nj;
    paup> dset dist=hky; showdist; nj;
    paup> dset dist=tamnei; showdist; nj;
    paup> dset dist=gtr; showdist; nj;

    1. How does the distance chosen affect the tree found by NJ for this data set?
    2. How does the distance chosen affect the magnitude of the pairwise distance estimates?
    3. Are smaller distances less affected or more affected by the choice of a distance?
  3. Using the Tamura-Nei distance, search for a tree under the minimum evolution (distance) criterion.
    paup> set criterion=distance;
    paup> dset objective=me dist=tamnei;
    paup> hsearch;

    What is the tree score under this criterion?

    Now evaluate the tree using the Fitch-Margoliash weighted least-squares criterion (inverse squared weighting).
    paup> dscores/objective=ls power=2;
    What is the tree score under this criterion?

  4. Perform a heuristic search for an optimal tree under the likelihood criterion using the default model settings in PAUP*.
    paup> set criterion=likelihood;
    paup> hsearch;

    "Describe" the tree and examine the topology. How does it compare to the trees found previously using parsimony and distance methods?
  5. Set up a model with unequal base frequencies and two substitution types with ti/tv ratio estimated from the data (LSet command). Use the "Likelihood Scores" (LScores) command to estimate the ti/tv ratio on the tree found in the previous step.

    paup> lset tratio=estimate basefreq=estimate;
    paup> lscores;

    (or you can do it all in one command: lscores/tratio=estimate basefreq=estimate;)

    1. What is the estimate of the ti/tv ratio?
    2. How does it compare to the previous value (i.e., the default setting)?
    3. What effect did optimization of the ti/tv ratio have on the likelihood score for the tree?
  6. Fix the ti/tv ratio and base frequencies to the values you estimated in the step above. Perform another heuristic search using the modified settings.
    paup> lset tratio=previous basefreq=prev;
    paup> hsearch;
    (note: in a real analysis, you would probably want to work harder than this)
    paup> describe;
    Did the tree topology change?
  7. Explore several models that include among-site rate variation.
    paup> lset tratio=estimate;
    paup> lscores/rates=gamma shape=estimate basefreq=estimate pinv=0;

    gamma shape ==> 1.046, ln L = -18742.18

    paup> lscores/rates=equal pinv=estimate;
    pinv ==> 0.200, ln L = -18857.50

    paup> lscores/rates=gamma shape=estimate pinv=estimate;
    pinv ==> 9 × 10^-6, shape ==> 1.046, ln L = -18742.18
    Based on the tree topology found in the previous search, does it appear that all sites are evolving at the same rate?

    If not, what model of among-site rate variation do you think best explains the data?

  8. Try using simpler DNA substitution rate matrices (e.g., JC, F81, K2P).
    paup> lset rates=gamma shape=est pinv=0; (i.e., model for rates selected above)
    paup> lscores/nst=1 basefreq=equal;
    paup> lscores/nst=1 basefreq=estimate;
    paup> lscores/nst=2 basefreq=equal tratio=estimate;

    Can the data be explained adequately using a simpler DNA substitution rate matrix?
  9. Evaluate the Tamura-Nei and GTR models and perform likelihood ratio tests (a chi-square calculator is helpful for the P-values).
    paup> lscores/basefreq=estimate rates=gamma shape=estimate nst=6 rmatrix=estimate; (GTR)
    paup> lscores/rclass=(a b a a c a); (Tamura-Nei)

    GTR model: ln L = -18731.11
    TamNei model: ln L = -18732.66
    HKY model: ln L = -18742.18 (from above)

    delta(TamNei vs. HKY) = 2(18742.18 - 18732.66) = 19.04
    df = 1, P-value ≈ 0.000013

    delta(GTR vs. TamNei) = 2(18732.66 - 18731.11) = 3.10
    df = 3, P-value ≈ 0.376

    Is a more complex model than HKY85 (with ASRV) justified according to a likelihood ratio test?

  10. Because the data were simulated on a known phylogeny, we know what the correct tree topology is. Based on all of the previous analyses, which topology do you believe is the correct one? (hint: getting this right may require additional tree searching!)

    paup> lscores/nst=6 rclass=(abaaca) rmat=estimate basefreq=est rates=gamma shape=estimate;
    paup> lset basefreq=prev rmat=prev shape=prev;
    paup> set criterion=likelihood;
    paup> hsearch;

    This search gets the true tree that generated the data:

    true tree = (((((1,6),8),4),5),(2,(3,7)))

    (To continue the successive approximations approach, we re-estimate parameters on this new tree, and repeat the search using the new parameters.)

    paup> lscores/rmatrix=estimate basefreq=estimate shape=estimate;
    ln L = -18732.50

    paup> lset rmatrix=previous basefreq=previous shape=previous;
    paup> hsearch;

    Same tree is found, so we stop.
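
The likelihood ratio tests in step 9 can also be checked without an online calculator. The following is a minimal Python sketch, assuming the test statistic 2(ln L_complex - ln L_simple) is compared against a chi-square distribution with degrees of freedom equal to the difference in free parameters. The function names are hypothetical, and the closed-form tail probabilities used here cover only 1, 2, or 3 degrees of freedom, which is enough for the HKY85, Tamura-Nei, and GTR comparisons in this exercise.

```python
import math


def chi2_upper_tail(x, df):
    """Upper-tail probability P(X >= x) for a chi-square variable.

    Uses closed-form expressions, so only df = 1, 2, or 3 is supported.
    """
    if x < 0:
        raise ValueError("test statistic must be non-negative")
    half = x / 2.0
    if df == 1:
        return math.erfc(math.sqrt(half))
    if df == 2:
        return math.exp(-half)
    if df == 3:
        return math.erfc(math.sqrt(half)) + math.sqrt(2.0 * x / math.pi) * math.exp(-half)
    raise ValueError("this sketch only handles df = 1, 2, or 3")


def likelihood_ratio_test(lnl_simple, lnl_complex, df):
    """Return (test statistic, P-value) for two nested models."""
    statistic = 2.0 * (lnl_complex - lnl_simple)
    return statistic, chi2_upper_tail(statistic, df)


# Log-likelihoods reported in step 9 (all with gamma-distributed rates).
LNL_HKY = -18742.18
LNL_TAMNEI = -18732.66
LNL_GTR = -18731.11

# Tamura-Nei adds one rate parameter to HKY85; GTR adds three to Tamura-Nei.
for label, simple, complex_, df in [
    ("TamNei vs. HKY", LNL_HKY, LNL_TAMNEI, 1),
    ("GTR vs. TamNei", LNL_TAMNEI, LNL_GTR, 3),
]:
    stat, p = likelihood_ratio_test(simple, complex_, df)
    print(f"{label}: delta = {stat:.2f}, df = {df}, P = {p:.3g}")
```

A small P-value indicates that the extra parameters of the more complex model give a significantly better fit. A large P-value suggests the simpler model is adequate.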