A key element in evaluating the quality of a pairwise sequence alignment is the "substitution matrix", which assigns a score for aligning any possible pair of residues. The theory of amino acid substitution matrices is described in [1], and applied to DNA sequence comparison in [2]. In general, different substitution matrices are tailored to detecting similarities among sequences that have diverged to differing degrees [1-3]. A single matrix may nevertheless be reasonably efficient over a relatively broad range of evolutionary change [1-3]. Experimentation has shown that the BLOSUM-62 matrix [4] is among the best for detecting most weak protein similarities. For particularly long and weak alignments, the BLOSUM-45 matrix may prove superior. A detailed statistical theory for gapped alignments has not been developed, and the best gap costs to use with a given substitution matrix are determined empirically.
Short alignments need to be relatively strong (i.e., have a higher percentage of matching residues) to rise above background noise. Such short but strong alignments are more easily detected using a matrix with a higher "relative entropy" [1] than that of BLOSUM-62. In particular, short query sequences can only produce short alignments, and therefore database searches with short queries should use an appropriately tailored matrix. The BLOSUM series does not include any matrices with relative entropies suitable for the shortest queries, so the older PAM matrices [5,6] may be used instead. For proteins, a provisional table of recommended substitution matrices and gap costs for various query lengths is:

    Query length     Substitution matrix     Gap costs
    ------------     -------------------     ---------
    <35              PAM-30                  ( 9,1)
    35-50            PAM-70                  (10,1)
    50-85            BLOSUM-80               (10,1)
    >85              BLOSUM-62               (11,1)

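The table can be read as a simple lookup on query length. The following Python sketch is only an illustration: the function name recommend_scoring is hypothetical, and because the table lists 50 in two rows, the sketch assumes that a query of exactly 50 residues falls in the 35-50 row and a query of exactly 85 residues falls in the 50-85 row.

```python
def recommend_scoring(query_length):
    """Return (matrix_name, (gap_existence, gap_extension)) for a protein query.

    Hypothetical helper mirroring the provisional table. Boundary handling
    at 50 and 85 is an assumption, since the table's ranges share endpoints.
    """
    if query_length < 1:
        raise ValueError("query length must be a positive integer")
    if query_length < 35:
        return "PAM-30", (9, 1)
    if query_length <= 50:      # assumption: exactly 50 uses the 35-50 row
        return "PAM-70", (10, 1)
    if query_length <= 85:      # assumption: exactly 85 uses the 50-85 row
        return "BLOSUM-80", (10, 1)
    return "BLOSUM-62", (11, 1)


if __name__ == "__main__":
    for length in (20, 40, 70, 300):
        print(length, recommend_scoring(length))
```

Since the table is described as provisional, such a lookup is best treated as a starting point for choosing parameters rather than a fixed rule.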
The raw score of an alignment is the sum of the scores for aligning pairs of residues and the scores for gaps. Gapped BLAST and PSI-BLAST use "affine gap costs" which charge the score -a for the existence of a gap, and the score -b for each residue in the gap. Thus a gap of k residues receives a total score of -(a+bk); specifically, a gap of length 1 receives the score -(a+b).
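As a small worked illustration of the affine gap formula, the Python sketch below computes -(a+bk) for a single gap and adds gap scores to a list of residue-pair scores. The function names gap_score and raw_score, and the pair scores in the example, are made up for illustration; they are not part of BLAST.

```python
def gap_score(k, a, b):
    """Score of one gap of k residues under affine costs: -(a + b*k)."""
    if k < 1:
        raise ValueError("a gap must contain at least one residue")
    return -(a + b * k)


def raw_score(pair_scores, gap_lengths, a, b):
    """Sum of aligned-pair scores plus the (negative) scores of all gaps."""
    return sum(pair_scores) + sum(gap_score(k, a, b) for k in gap_lengths)


if __name__ == "__main__":
    # With gap costs (11,1): a gap of length 1 scores -12, length 3 scores -14.
    print(gap_score(1, 11, 1), gap_score(3, 11, 1))
    # Three aligned pairs (illustrative scores) and one gap of two residues.
    print(raw_score([4, 5, 6], [2], 11, 1))
```

With existence cost 11 and extension cost 1, lengthening a gap by one residue costs only 1, whereas opening a second gap costs at least 12, which is why affine costs favor a few long gaps over many short ones.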
To convert a raw score S into a normalized score S' expressed in bits, one uses the formula S' = (lambda*S - ln K)/(ln 2), where lambda and K are parameters dependent upon the scoring system (substitution matrix and gap costs) employed [7-9]. For determining S', the more important of these parameters is lambda. The "lambda ratio" quoted here is the ratio of the lambda for the given scoring system to that for one using the same substitution scores, but with infinite gap costs [8]. This ratio indicates what proportion of information in an ungapped alignment must be sacrificed in the hope of improving its score through extension using gaps. We have found empirically that the most effective gap costs tend to be those with lambda ratios in the range 0.8 to 0.9.
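The conversion to bits and the lambda ratio can be sketched in a few lines of Python. The function names are hypothetical, and the numeric values of lambda and K in the example are illustrative assumptions of the kind reported for BLOSUM-62 with gap costs (11,1); they are not taken from this text and should be replaced with the parameters of the scoring system actually in use.

```python
import math


def bit_score(raw, lam, k):
    """Normalized score in bits: S' = (lambda*S - ln K) / ln 2."""
    if lam <= 0 or k <= 0:
        raise ValueError("lambda and K must be positive")
    return (lam * raw - math.log(k)) / math.log(2)


def lambda_ratio(lam_gapped, lam_ungapped):
    """Ratio of the gapped lambda to the lambda with infinite gap costs."""
    return lam_gapped / lam_ungapped


if __name__ == "__main__":
    # Assumed, illustrative parameters (not from this document).
    lam_gapped, k_gapped = 0.267, 0.041
    lam_ungapped = 0.3176

    print(round(bit_score(100, lam_gapped, k_gapped), 2))
    print(round(lambda_ratio(lam_gapped, lam_ungapped), 2))
```

Under these assumed values the lambda ratio falls inside the 0.8 to 0.9 range described above.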
[1] Altschul, S.F. (1991) "Amino acid substitution matrices from an information theoretic perspective." J. Mol. Biol. 219:555-565.
[2] States, D.J., Gish, W. & Altschul, S.F. (1991) "Improved sensitivity of nucleic acid database searches using application-specific scoring matrices." Methods 3:66-70.
[3] Altschul, S.F. (1993) "A protein alignment scoring system sensitive at all evolutionary distances." J. Mol. Evol. 36:290-300.
[4] Henikoff, S. & Henikoff, J.G. (1992) "Amino acid substitution matrices from protein blocks." Proc. Natl. Acad. Sci. USA 89:10915-10919.
[5] Dayhoff, M.O., Schwartz, R.M. & Orcutt, B.C. (1978) "A model of evolutionary change in proteins." In "Atlas of Protein Sequence and Structure, vol. 5, suppl. 3," M.O. Dayhoff (ed.), pp. 345-352, Natl. Biomed. Res. Found., Washington, DC.
[6] Schwartz, R.M. & Dayhoff, M.O. (1978) "Matrices for detecting distant relationships." In "Atlas of Protein Sequence and Structure, vol. 5, suppl. 3," M.O. Dayhoff (ed.), pp. 353-358, Natl. Biomed. Res. Found., Washington, DC.
[7] Karlin, S. & Altschul, S.F. (1990) "Methods for assessing the statistical significance of molecular sequence features by using general scoring schemes." Proc. Natl. Acad. Sci. USA 87:2264-2268.
[8] Altschul, S.F. & Gish, W. (1996) "Local alignment statistics." Meth. Enzymol. 266:460-480.
[9] Altschul, S.F., Madden, T.L., Schäffer, A.A., Zhang, J., Zhang, Z., Miller, W. & Lipman, D.J. (1997) "Gapped BLAST and PSI-BLAST: a new generation of protein database search programs." Nucleic Acids Res. 25:3389-3402.