Amino Acid Substitution Models
The divergence among sequences can be modeled with a mutation
matrix. The matrix, denoted by M, describes the probabilities
of amino acid mutations for a given period of evolution.
This corresponds to a model of evolution in which amino acids
mutate randomly and independently from one another but according to
some predefined probabilities depending on the amino acid itself. This
is a Markovian model of evolution and while simple, it is one of the
best models. Intrinsic properties of amino acids, like hydrophobicity,
size, charge, etc. can be modeled by appropriate mutation
matrices. Dependencies which relate one amino acid characteristic to
the characteristics of its neighbors cannot be modeled
through this mechanism. Amino acids appear in nature with different
frequencies. These frequencies are denoted by $f_i$ and correspond
to the steady state of the Markov process defined by the matrix
M, i.e., the vector f is any of the columns of $M^{\infty}$ or, equivalently, the eigenvector of M
whose corresponding eigenvalue is 1 (Mf = f). This model of
evolution is symmetric (time-reversible), i.e., the probability of having an i
which mutates to a j is the same as that of starting with a j
which mutates into an i: $f_i M_{ji} = f_j M_{ij}$.
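The steady state and the symmetry condition described above can be checked numerically. The following is a minimal sketch, assuming a hypothetical 3-letter alphabet in place of the 20 amino acids and a made-up matrix `TOY_M` whose column c holds the probabilities of mutating away from letter c. The helper names (`mat_vec`, `steady_state`, `is_reversible`) are illustrative only and do not come from the text.

```python
def mat_vec(M, p):
    """Return the product M p for a square matrix M given as a list of rows."""
    n = len(p)
    return [sum(M[r][c] * p[c] for c in range(n)) for r in range(n)]


def steady_state(M, iterations=1000):
    """Approximate f with M f = f by repeatedly applying M to a uniform vector."""
    n = len(M)
    f = [1.0 / n] * n
    for _ in range(iterations):
        f = mat_vec(M, f)
    return f


def is_reversible(M, f, tol=1e-9):
    """Check the symmetry condition f_i * M[j][i] == f_j * M[i][j] for all pairs."""
    n = len(M)
    return all(
        abs(f[i] * M[j][i] - f[j] * M[i][j]) <= tol
        for i in range(n)
        for j in range(n)
    )


# Hypothetical toy matrix (assumption, not real amino acid data).
# Column c lists the probabilities of ending in each letter when starting
# from letter c, so every column sums to 1.
TOY_M = [
    [0.95, 0.05, 0.05],
    [0.03, 0.93, 0.03],
    [0.02, 0.02, 0.92],
]

f = steady_state(TOY_M)
print([round(x, 6) for x in f])
print(is_reversible(TOY_M, f))
```

For this toy matrix the iteration approaches the frequencies (0.5, 0.3, 0.2), and the symmetry check passes because each off-diagonal entry was chosen proportional to the frequency of the target letter.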
The following is a list of amino acid substitution models which use matrices.
Empirical substitution models
In contrast to DNA substitution models, amino acid replacement
models have concentrated on the empirical approach. Dayhoff and
coworkers developed a model of protein evolution which resulted in the
development of a set of widely used replacement matrices (Dayhoff et
al. 1978). In the Dayhoff approach, replacement rates are derived
from alignments of protein sequences that are at least 85% identical;
this constraint ensures that the likelihood of a particular mutation
being the result of a set of successive mutations is low. One of the
main uses of the Dayhoff matrices has been in database search methods
where, for example, the matrices P(0.5), P(1) and P(2.5) (known as the
PAM50, PAM100 and PAM250 matrices) are used to assess the significance
of proposed matches between target and database sequences. However,
the implicit rate matrix has been used for phylogenetic
applications.
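The notation P(t) used above can be made explicit. As a hedged sketch, assuming the usual continuous-time Markov formulation (the rate matrix Q and the choice of time unit are assumptions that this section does not spell out), the replacement probabilities over a time t are given by a matrix exponential, and matrices for longer times follow by multiplication:

```latex
P(t) = e^{Qt} = \sum_{n=0}^{\infty} \frac{(Qt)^{n}}{n!},
\qquad
P(s + t) = P(s)\,P(t),
\qquad
P(2.5) = P(0.01)^{250}.
```

Under this assumption t is measured in expected replacements per site, so P(0.01) corresponds approximately to a 1-PAM matrix and P(2.5) to PAM250.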
PAM matrices
By definition, a mutation matrix M implies a certain
amount of mutation (measured in PAM units). A 1-PAM mutation matrix
describes an amount of evolution which will change, on average, 1%
of the amino acids. In mathematical terms this is expressed as a
matrix M such that
$$\sum_{i=1}^{20} f_i (1 - M_{ii}) = 0.01.$$
The diagonal
elements of M are the probabilities that a given amino acid
does not change, so $(1 - M_{ii})$ is the probability of mutating away
from i. If we have a probability or frequency vector
p, the product Mp gives the probability vector or the
expected frequency of p after an evolution equivalent to 1 PAM
unit. Or, if we start with amino acid i (a probability vector
which contains a 1 in position i and 0s in all others), $M_{*i}$
(the ith column of M) is the corresponding probability
vector after one unit of random evolution. Similarly, after k
units of evolution (what is called k-PAM evolution) a frequency
vector p will be changed into the frequency vector $M^k p$.
Notice that PAM distance does not translate linearly into chronological
time, because evolution rates may be very different for different species
and different proteins.
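The 1-PAM normalization and k-PAM evolution can be illustrated with a small numerical sketch. It assumes a hypothetical 3-letter alphabet with made-up frequencies `TOY_F`, and a deliberately simple construction `toy_one_pam` (every letter mutates to letter r with probability proportional to the frequency of r) that is scaled so that the frequency-weighted probability of change equals 0.01. None of these names or numbers come from the text, and real PAM matrices are estimated from data rather than built this way.

```python
def mat_vec(M, p):
    """Return the product M p for a square matrix M given as a list of rows."""
    n = len(p)
    return [sum(M[r][c] * p[c] for c in range(n)) for r in range(n)]


def expected_change(M, f):
    """Fraction of residues expected to change: sum_i f_i * (1 - M_ii)."""
    return sum(f[i] * (1.0 - M[i][i]) for i in range(len(f)))


def toy_one_pam(f):
    """Build a hypothetical 1-PAM matrix M = (1 - a) I + a f 1^T.

    The scale a is chosen so that expected_change(M, f) equals 0.01.
    """
    a = 0.01 / (1.0 - sum(x * x for x in f))
    n = len(f)
    return [
        [(1.0 - a if r == c else 0.0) + a * f[r] for c in range(n)]
        for r in range(n)
    ]


def evolve(M, p, k):
    """Apply k units of evolution to the frequency vector p, i.e. M^k p."""
    for _ in range(k):
        p = mat_vec(M, p)
    return p


TOY_F = [0.5, 0.3, 0.2]        # assumed frequencies, not real data
M1 = toy_one_pam(TOY_F)

print(expected_change(M1, TOY_F))   # close to 0.01 by construction
start = [1.0, 0.0, 0.0]             # start with letter 0 only
print(evolve(M1, start, 1))         # column 0 of M1
print(evolve(M1, start, 250))       # distribution after 250 PAM units
```

Repeated multiplication is used here for clarity; for large k, computing the matrix power by repeated squaring would need fewer operations.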
Dayhoff matrices
Dayhoff et
al. (1978) presented a method for estimating the matrix M
from the observation of 1572 accepted mutations between 34
superfamilies of closely related sequences. Their method was
pioneering in the field. A Dayhoff matrix is computed from a 250-PAM
mutation matrix and is used as the scoring matrix for the standard dynamic
programming method of sequence alignment. The Dayhoff matrix entries are related to
$M^{250}$ by
$$D_{ij} = 10 \log_{10} \frac{(M^{250})_{ij}}{f_i}.$$
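The step from a 1-PAM mutation matrix to Dayhoff scores can be sketched as follows, assuming the usual log-odds form D_ij = 10 log10((M^250)_ij / f_i). The 3-letter alphabet, the frequencies `TOY_F` and the matrix `TOY_M1` are hypothetical stand-ins for real amino acid data, and the function names are illustrative.

```python
import math


def mat_mul(A, B):
    """Multiply two square matrices given as lists of rows."""
    n = len(A)
    return [
        [sum(A[r][k] * B[k][c] for k in range(n)) for c in range(n)]
        for r in range(n)
    ]


def mat_pow(M, k):
    """Return M^k by repeated multiplication, starting from the identity."""
    n = len(M)
    result = [[1.0 if r == c else 0.0 for c in range(n)] for r in range(n)]
    for _ in range(k):
        result = mat_mul(result, M)
    return result


def dayhoff_scores(M1, f, pam=250):
    """Log-odds scores D_ij = 10 * log10((M1^pam)_ij / f_i)."""
    Mk = mat_pow(M1, pam)
    n = len(f)
    return [
        [10.0 * math.log10(Mk[i][j] / f[i]) for j in range(n)]
        for i in range(n)
    ]


# Hypothetical toy data (assumption): frequencies and a 1-PAM matrix whose
# column c gives the probabilities of mutating away from letter c.
TOY_F = [0.5, 0.3, 0.2]
A = 0.01 / (1.0 - sum(x * x for x in TOY_F))
TOY_M1 = [
    [(1.0 - A if r == c else 0.0) + A * TOY_F[r] for c in range(3)]
    for r in range(3)
]

D = dayhoff_scores(TOY_M1, TOY_F)
for row in D:
    print([round(x, 3) for x in row])
```

In this toy case the diagonal scores are positive (a letter aligned with itself is more likely than chance) and the off-diagonal scores are negative. The resulting matrix is symmetric because the toy mutation matrix is reversible.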
JTT matrices
More recently, Jones et
al. (1992) and Gonnet et
al. (1992) have used much the same methodology as Dayhoff, but
with modern databases. The Jones et al. model has been implemented for
phylogenetic analyses with some success. Jones et al. (1994) have also
calculated an amino acid replacement matrix specifically for membrane
spanning segments. This matrix has remarkably different values from
the Dayhoff matrices, which are known to be biased toward
water-soluble globular proteins.
Other empirical models
Adachi and
Hasegawa (1995, 1996) have
implemented a general reversible Markov model of amino acid
replacement that uses a matrix derived from the inferred replacements
in mitochondrial proteins of 20 vertebrate species. The authors show
that this model performs better than others when dealing with
mitochondrial protein phylogeny.
BLOSUM (Blocks substitution matrices)
A different approach was used by Henikoff and
Henikoff (1992). They used local, ungapped alignments of distantly
related sequences to derive the BLOSUM series of matrices. Matrices of
this series are identified by a number after the matrix
(e.g. BLOSUM50), which refers to the minimum percentage identity of
the blocks of multiple aligned amino acids used to construct the
matrix. It is noteworthy that these matrices are directly calculated
without extrapolations, and are analogous to transition probability
matrices P(T) for different values of T, estimated without reference
to any rate matrix Q. The BLOSUM matrices often perform better than
PAM matrices for local similarity searches, but have not been widely
used in phylogenetics.
Poisson models
A simple, non-empirical model of amino acid replacement was
proposed by Nei
(1987). This model implements a Poisson distribution, and gives
accurate estimates of the number of amino acid replacements when
species are closely related.
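A common way to apply a Poisson model is the Poisson correction d = -ln(1 - p), where p is the observed proportion of differing sites between two aligned sequences. The sketch below assumes that this is the estimator meant here (the text does not state the formula), and the function name `poisson_distance`, the gap symbol and the example sequences are illustrative.

```python
import math


def poisson_distance(seq_a, seq_b, gap="-"):
    """Estimate replacements per site as d = -ln(1 - p).

    p is the proportion of aligned, ungapped sites at which the two
    sequences differ.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    # Keep only columns where neither sequence has a gap.
    sites = [(a, b) for a, b in zip(seq_a, seq_b) if a != gap and b != gap]
    if not sites:
        raise ValueError("no ungapped sites to compare")
    p = sum(1 for a, b in sites if a != b) / len(sites)
    if p >= 1.0:
        return math.inf  # every site differs; the estimate is unbounded
    return -math.log(1.0 - p)


# Made-up aligned sequences differing at 1 of 10 sites (p = 0.1).
print(poisson_distance("ACDEFGHIKL", "ACDEFGHIKV"))
```

For small p the corrected distance stays close to p, which is consistent with the remark that the estimate is accurate for closely related species. As p grows, the correction increasingly exceeds the raw proportion to account for multiple replacements at the same site.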