HiddenMarkovModel - BioHackersNet

HMM. Extention to MarkovChain. StochasticRegularGrammar와 동등 하다.

MarkovChain을 이용한 CpG_island여부판별은 의미가 있다. 만들어진 모델을 가지고, unknown sequence에 대하여 해당값을 곱하여, 더해주기만 한다음, 양수냐, 음수냐만 본다면 CpG_island여부판별이 가능하다.

그러나, Window size를 얼마를 주어야 하는지를 모를 뿐 아니라, CpG_island가 명백한 boundary를 가지고 있을 때, 이를 가려내지 못한다. 따라서, 이를 보완하고자 HiddenMarkovModel을 lolpololpolop사용하며, MarkovChain과의 차이점은 각 Symbol의 State들이 가려져있다는 것이다.

http://www.cse.ucsc.edu/research/compbio/html_format_papers/hughkrogh96/node4.html

Contents

Hiddel Markov Model
1. Formal definition of an HMM
2. Find state path (decoding)
Parameter estimation for HMMs
HMM model structure
More complex Markov chains
Numerical stability of HMM algorithm

Hiddel Markov Model

Formal definition of an HMM

The i th state in the path is called πi, the chain is characterised by parameters (TransitionProbability)

a(kl) = P(πi=l | π(i-1)=k)

emitted symbol b from state k (EmissionProbability)

ek(b) = P(xi=b | πi=k)

The JointProbability of an observed sequence x and a state sequence π

P(x, π) = a(0 π1) Production(i=1,L) { eπi(xi) a(πi π(i+1)) }

위식은 path를 모르기 때문에 비실용적이다. 따라서 PosteriorDistribution의 방법으로 estimation한다.

Find state path (decoding)

Decoding : 관측된서열이 무엇을 의미하는지 underlying state를 고려함으로써 알아내는것. decoding의 방법으로 다음의 알고리즘들이 있다.

ViterbiAlgorithm을 사용하여 CpG_island를 예측할 수 있다. (+)state인것들만 엮으면, 그것이 CpG_island. 41 sequence set으로, putative CpG_island에 대해 돌려봤더니, 2개의 아닌것(FalseNegative)과 121개의 새로운 부위들(FalsePositive)들이 발견되었다. 실제 CpG_island는 길기 때문에 500미만인것들을 버렸더니 67개로 FalsePositive감소.

PosteriorDecoding방법으로는 236개의 FalsePositive. 500미만버리면, 83개로 감소.

따라서, PosteriorDecoding방법이 좀더 작은 island를 예측해낸다는것 말고는, 두 방법에 큰 차이가 없다. 여기서의 FalsePositive는 아마도, 잘못된 label이었을 수 있다.

실제 활용된 예로, PairHmm, ProfileHmm이 있다. 각 모델에서는 path를 찾아야 하고, 이를 위해 ViterbiAlgorithm등을 활용한다.

Parameter estimation for HMMs

HMM을 구현하는데 있어서의 가장 어려운 문제는 초기에 모델을 specify하는것이다. 두가지 방법이 있다.

design of structure : what states there are and how they are connected,
assignment of parameter values, TransitionProbability akl, EmissionProbability ek(b)

When the state sequence is known

Traning sequence로 부터, Akl과 Ek(b)를 얻고, MaximumLikelihood estimators for akl and ek(b) are given by

           Akl                          Ek(b)
akl = -------------  and  ek(b) = ---------------
      SUM(l'){AKl'}               SUM(b'){Ek(b')}

만일, MaximumLikelihood를 위한 data set이 불충분할 경우, overfitting될 수 있다. 이문제를 해결하기 위해 PseudoCount를 사용할 수도 있다.

Akl = number of transitions k to l in training data + rkl,
Ek(b) = number of emissions of b from k in training data + rk(b)

여기서의 PseudoCount rkl, rk(b)는 PriorKnowledge를 반영하고, DirichletDistribution에 의한다.

When paths are unknown : Baum-Welch and Viterbi traning

BaumWelchAlgorithm. 실제 활용된 예로, MultipleAlignment를 HiddenMarkovModel로 해내기 위해서는 모델의 각 parameter들(TransitionProbability, EmissionProbability)을 계산해야하며, 이때 이 알고리즘을 활용한다.

Modelling of labelled sequences

HMM model structure

Choice of model topology

이상의 fully connected HiddenMarkovModel은 실제로 traning data가 많음에도 불구하고, 나쁜모델을 만들기 쉬운데, 그 이유는 OverFitting이 아니라, LocalMaxima때문이다. 따라서, 특정 transition과, state를 더하거나 빼줌으로써, model topology를 적절히 바꿀 수 있다.

Duration modelling

자기 자신으로 되돌아오는 TransitionProbability p를 정의하면, 나가는 확률은 (1-p). 따라서,

P(l residues) = (1-p) p^(l-1).

여기에 몇몇 state들을 추가하면, 모든 가능한 path의 total Probability합은

P(l) = { l-1 } p^(l-n) (1-p)^n.
       { n-1 }

This is negative BinomialDistribution. 작은 서열길이에 대해서 모델을 만족하는 path의 갯수는 GeometricalDistribution decay보다 더욱 빨리 증가한다. 따라서, path의 갯수는 model topology에 의존하며, 더욱 일반적인 model을 만드는 것이 가능하다. 연속적인 Markov process는 ErlangDistribution, 혹은 더욱 일반적인 PhaseTypeDistribution에 의해 얻어질수 있다. (See example Asmussen, 1987)

Alternatively, DurationModeling

Silent states

SilentState : 어떤 symbol도 emit하지 않는 state. (예, begin and end states)

More complex Markov chains

See MarkovChain