PhylogeneticTree - BioHackersNet

진화적연관관계규명을 위한 계통발생도. 여러개의 BioSequence들을 MultipleAlignment하고, 그 정보로 부터 PhylogeneticTree를 그리고자하는 BiologicalSequenceAnalysis문제의 한가지.

Introduction

MultipleAlignment의 문제는 서열들 간의 진화적 연관성을 설명한다. ProgressiveMulipleAlignmentAlgorithm에서 이용되는 tree를 [Clustering] process를 지시한다는 의미에서 guide tree 라고 한다.

building trees general method is

distance methods
parsimony

The tree of life

phylogeny : relationship of set of species --> presented by phylogentic tree(speciation)
gene duplication : another mechanism
- orthologous: genes diverged by speciation
- paralogous : genes diverged by gene duplication

Background on TreeGraph

edge : split into two daughter edges
node : endpoint of as edge
edge length(ti) ~= evolutionary time period

Counting and labelling trees

rooted tree :
- n leaves , (2n-1) nodes, (2n-2) edges
- (2n-3)!! rooted trees
unrooted tree :
- n leaves, (2n-2) nodes, (2n-3) edges --> (2n-3) rooted trees produced
- (2n-5)!! unrooted trees

making a tree from pairwise distances

JukeCanterDistance
- dij = -3/4log(1-4f/3)
- to infinity as equilibrium f value approached.(75 % of residues different)

[Clustering] Methods : [UPGMA]

define dij : distance between two cluster Ci and Cj
average distance between pairs of sequences from each cluster: |Ci| and |Cj| are the number of sequences in cluster i and j

            1
dij =  ---------     Sum dpq
        |Ci||Cj|pinCi,qinCj

Ck = Ci U Cj, another cluster cl

       dil|Ci|+djl|Cj|
 dkl =-------------------
        |Ci| + |Cj|

Molecular clocks and the ultrametric property of distances

Molecular clock
- edge length 는 같은 비율의 molecular clock에 의해 측정된 시간의 길이.
- 서열의 분화는 tree내 모든 지점에서 같은 비율로 일어난다는 가정.
- path choice 에 관계없이 any node 에서 leaves까지의 time sum은 같다.
- molecular clock 의 조건에서 edge length 의 합으로 distance data를 얻었다면, correct tree.
ultrametric condition - reconstruction이 옳은지 그른 지의 척도.
1. 가장 가까운 leaves가 neighbouring leaves가 아닌 경우 : 공통의 parent node를 가지지 않는다.
2. distance dij is ultrametric; any xi, xj, xk 에서 distance dij, djk, dik 의 거리가 모두 같거나, 두개가 같고 나머지는 작을 경우.

Additive and neighbor-joining

Additive :

distance between any pair of leaves = sum of the length of edges on the path
[UPGMA] tree가 만들어짐으로써 자동적으로 additivity가 만들어진다.
additive lengths {d.}를 가진 tree T가 주어졌다면 leaves {dij}의 pairwise distances 로부터 tree를 reconstruct할 수 있다.
- 같은 parent node k를 가지는 neighbouring leaves의 pair를 찾는다. (Fig 7.6)
- leaf node list에서 i, j를 삭제, k를 current lists of node에 첨가.
- k와 leaf m의 거리를
```
     dkm = 1/2(dim + djm - dij)
```
- leaves를 제거하고 한 쌍의 서열 pair만 남을 때 까지 반복.
- reconstruct a tree with additive lengths exactly.
closest pair leaves가 neighoring leaves가 아닌 경우(Fig 7.7) 이를 극복하기 위해, 모든 leaves에서 average distances를 빼는 trick을 이용하여 long edges를 보완.
```
     Dij = dij - (ri+rj), ri = 1/(|L|-2)Sum dik
```
Dij 가 minimal인 경우, i와 j는 neighbor leaves 이다.
만약에 length가 additive하지 않을지라도 neighbour-joining을 사용할 수 있다. 단지 정확한 tree를 확신하지는 못한다.
additivity의 test를 위해 four-point condition 이용: 네 개의 leaves i, j, k, ,l의 모든 set에서 dij + dkl, dik +djl과 dil + djk 의 길이중 두가지의 값은 같고, 나머지보다 커야한다.이 조건을 만족할 경우, pairwise distances 가 edge length의 합에 의한 것임을 나타냄. leaves의 쌍들을 연결하는 bridge의 길이를 포함하는 합계를 나타내는 것이 두 개있기 때문이다.

Rooting trees

neighbour-joining은 unrooted tree를 생성한다. root를 찾는 것은 outgroup이나 먼 연관성이 있는 species를 추가하여 수행함.(Fig 7.1 axolotl)
outgroup이 없을 경우, consecutive edges의 가장 긴 chain의 중간점을 선택.

Parsimony

tree building algorithm으로 가장 많이 이용됨.
최소의 substitution 으로 관찰 서열을 설명할 수 있는 tree를 찾는 방법.
tree를 만드는 대신에 주어진 tree에 cost를 정한다.
best tree를 만들기 위해서 가능한 모든 topologies를 찾는 것이 필요하다.
- 주어진 tree T 의 cost를 계산.
- 모든 trees를 검색하여 이 cost의 최소 값을 찾음.
각 site를 독립적으로 처리하고, 모든 site에 substitution추가.
기본적으로 주어진 topology의 한 site에서 필요한 변화의 최소수를 세고, 그 residue를 leaves에 할당한다.
weighted parsimony
- substitution의 수를 세는 대신 cost S(a,b)를 더한다.
  - post-order traversal(algorithm starts at the leaves up to root)
  - ancestal assignments of residues를 찾기 위해서는 recursion step에서 pointer를 저장하는 단계필요.(root에서 leaves로 pointers를 이용하여 trace back하여 ancestral residue assignment를 얻는다.)
    - lk(a) = argminb(Si(b) + S(a,b)), rk(a) = argminb(Sj(b) + S(a,b))
서열이 기존의 tree에 더해지는 순서에 따라 tree형성 결과가 달라진다. : Phylip :DNAPARS
minimum cost는 root의 위치에 관계없이 edge 독립적이 되므로 unrooted tree 만을 고려가능 : parsimony 로 search 해야하는 tree 수가 감소될 수 있다. 그러나 leaves 수가 증가할 수록 topology수 증가. --> 새로운 search strategy가 필요함.
- stochastically tree searching program : random chosen branch 의 위치를 바꾸어 원래의 tree와 score를 비교하여 우수한 것 선택.
- 한번에 하나씩의 edge를 더하여 tree를 더하는 strategy.
- 위의 방법들은 best tree를 보장하지 못함.
parsimony를 가지고 best tree를 확신할 수 있는 방법 : tree의 substitutions의 수는 extra edge를 더하는 것에 의해서만 증가할 수 있다는 사실에 근거.
- branch and bound :
  - leaves의 증가하는 수를 가지고 체계적으로 tree를 만들기 시작한다.
  - 현재 incomplete tree가 지금까지의 complete tree에 의해 얻은 최소 cost를 넘어가면 값을 버린다.
    - branching을 하여 tree를 만들되 일정 cost를 넘어가면 다음 tree를 고려.(branch and bound)

Assessing the trees: the bootstrap

Phylogenetic feature의 significance를 얻는 방법.
- alignment를 구성하는 data set에서 임의로 각 column 들을 뽑아 치환한 동일 크기의 artificial data set를 형성.(original dataset의 한 column 이 여러번 반복될 수도 있다.)
- 새로운 dataset에 tree building algorithm적용.
- 1000 번 이상 반복.
- phylogenic feature가 나타나는 확률을 그 feature를 얻었을 때의 신뢰도로 활용.
  - phylogenic festure F 의 bootstrap frequency는 posterior distribution P(F|data)으로 나타냄.

Simultaneous alignment and phylogeny

서열을 align하면서 동시에 적절한 phylogeny를 찾는 방법.
두가지 parsimony type algorithms

Sankoff & Cedergren's gap-substitution algorithm

ancestral sequences를 찾고 tree-based, parsimony type cost를 최소화하는 ancestral sequences와 leaf sequences의 정렬을 알아 냄.
- dynamic programming method for aligning a set of N sequences.
- p.141의 max를 min으로 바꿈.
- S(score)대신 σ(weighted parsimony cost)사용.
- Fig. 7.12 Dynamic programming matrix, 각 transition이 weighted parsimony에 의해 계산된 minimal cost를 가짐.
- N(2n)^N computation step필요.(n 서열의 길이, N 서열 set의 크기.)

Hein's affine cost algorithm

more fast and realistic
current practical algorithm to align sequences and find effective phylogenies.
ancestral서열들의 선택에 있어 가장 parsimonious choice를 항상 하지는 않는다는 가정.

아래조건을 만족하는, daughter node 의 두 서열인 x, y 모두에 align되는 z서열을 현재 node에서 찾는 것.

   S(x,z) + S(z,y) = S(x,y) , S는 total cost for a given alignment

gap의 처리를 위해 dynamic programming method 이용.

   V^M, V^X, V^Y in the (i, j)th cell, minimum cost for alignments
    V^M(i,j) = min{V^M(i-1, j-1), V^X(i-1,j-1), V^Y(i-1, j-i)} + S(xi, yj),
    V^X(i,j) = min{V^M(i-1,j)+d, V^X(i-1,j)+e},
    V^Y(i,j) = min{V^m(i,j-1)+d, V^Y(i,j-1)+e}.

Fig 7.13 CAC and CTCACA 의 서열이 주어었을 경우 ancestral sequences는 CTC 와 CACACA 가 가능함. (1+d+2e cost)

data type \ computational method	Optimality criterion	Clustering algorithm
Characters	Parsimony, MaximumLikelihood	.
Distances	Minimum evolution, LeastSquares	[UPGMA], NeighborJoining