ProfileHmm - BioHackersNet

In BiologicalSequenceAnalysis, the study of protein families given as a MultipleAlignment. This makes it possible to tell whether an unknown protein sequence is related to a particular pattern.

We are given a correct MultipleAlignment, from which we will build a model that can be used to find and score potential matches to new sequences.

Introduction

Functional biological sequences typically come in families.
Sequence analysis: based on identifying the relationship of an individual sequence to a sequence family.
PairwiseAlignment: may not find sequences that are only distantly related to the ones you already have.
Alternative approach: use statistical features of the whole set of sequences in the search, i.e. a MultipleAlignment.
Our approach to consensus modelling (ProbablisticModel): develop a particular type of HiddenMarkovModel, the "profile HMM".
Goal of Chapter 5: to build a model, given a correct MultipleAlignment.
Using this model to find potential match scores for new sequences: Chapter 6 (MultipleAlignment).

Ungapped score matrices

PositionSpecificScoreMatrix (PSSM)
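
As a minimal sketch of how a PositionSpecificScoreMatrix could be built from an ungapped alignment: the residue frequencies of each column are turned into log-odds scores against a background distribution, and a window of a new sequence is scored by adding up one entry per position. The function names, the uniform background, the +1 pseudocount and the toy alignment are illustrative assumptions, not part of the notes.

```python
import math
from collections import Counter

def build_pssm(alignment, alphabet, pseudocount=1.0):
    """Return one dict per column mapping residue -> log-odds score.

    Assumes an ungapped alignment (equal-length strings) and a uniform
    background distribution over `alphabet`; both are simplifications.
    """
    background = 1.0 / len(alphabet)
    pssm = []
    for column in zip(*alignment):
        counts = Counter(column)
        total = len(column) + pseudocount * len(alphabet)
        pssm.append({
            a: math.log((counts[a] + pseudocount) / total / background)
            for a in alphabet
        })
    return pssm

def score_window(pssm, window):
    """Sum the per-position log-odds scores for an ungapped window."""
    return sum(col[res] for col, res in zip(pssm, window))

# Toy DNA alignment, made up for illustration.
alignment = ["ACGT", "ACGA", "ACCT"]
pssm = build_pssm(alignment, "ACGT")
family_like = score_window(pssm, "ACGT")
unrelated = score_window(pssm, "TTTT")
```

A window that resembles the family gets a positive total, while a window dominated by residues rarely seen in the columns gets a negative one.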

Adding insert and delete states to obtain profile HMMs

gap
Position-specific gap scores: can be handled with an HMM.
Way to reduce the number of transitions: SilentState.
Successive D -> D transitions can have different probabilities, whereas I -> I transitions all have the same cost.

Profile HMMs generalise pairwise alignment

Deriving profile HMMs from multiple alignment

The model we want to build represents the consensus sequence of the family rather than any one particular sequence.

Non-probabilistic profiles

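Problems
1. Anomalies: certainty should increase as more sequences are observed, but here a residue observed in a single sequence gets the same score as one observed in 100 sequences.
2. Gaps: every deletion gets the same score.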

Basic profile HMM parametrisation

Goal of parametrisation: make the distribution that the model defines over the space of all sequences peak around the members of the family.
Parameters we can control
1. Values of the probabilities: when there are few sequences, zero probabilities become a problem. LaplaceRule.
2. Length of the model: which columns should be treated as match and which as insert? (This has to be decided because there is more than one sequence.) A simple heuristic rule that works well is to treat a column as insert when more than half of its entries are gaps.
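
The two choices above can be sketched in a few lines: columns are marked as match unless more than half of their entries are gaps, and match emission probabilities are estimated with Laplace's rule (one pseudocount per residue). The function names, the gap symbol "-" and the toy alignment are assumptions made for this illustration.

```python
from collections import Counter

def mark_match_columns(alignment, gap="-"):
    """Heuristic: a column is a match column unless more than half
    of its symbols are gaps."""
    n = len(alignment)
    return [sum(ch == gap for ch in col) * 2 <= n for col in zip(*alignment)]

def laplace_emissions(column, alphabet, gap="-"):
    """Laplace's rule: add one pseudocount per residue, then normalise,
    so that no residue ever gets probability zero."""
    counts = Counter(ch for ch in column if ch != gap)
    total = sum(counts.values()) + len(alphabet)
    return {a: (counts[a] + 1) / total for a in alphabet}

# Toy alignment, made up for illustration; the third column is mostly gaps.
alignment = ["AC-GT", "AC-GA", "A-TCT", "AC-GT"]
is_match = mark_match_columns(alignment)
columns = list(zip(*alignment))
emissions = [
    laplace_emissions(col, "ACGT")
    for col, m in zip(columns, is_match)
    if m
]
```

Here the third column becomes an insert column, so the resulting model has four match states.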

Searching with profile HMMs

Detecting the potential membership of a sequence in a family: all that is needed is a score.
Ways to score a match against the HMM
1. Viterbi equations: find the most probable path for sequence x (ViterbiAlgorithm)
2. Forward equations: sum over all paths to obtain P(x|M) (ForwardAlgorithm)

Viterbi equations

Emissions from insert states are assumed to follow the background distribution.
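
Written out in log-odds form, the recurrences could look as follows. The notation is an assumption introduced here: V^M_j(i), V^I_j(i) and V^D_j(i) are the best log-odds scores of a path that has consumed x_1..x_i and ends in state M_j, I_j or D_j, a denotes transition probabilities, e emission probabilities and q the background distribution. Because insert emissions equal the background, the emission term of the insert equation is log(q/q) = 0 and drops out.

```latex
\begin{aligned}
V^{M}_{j}(i) &= \log\frac{e_{M_j}(x_i)}{q_{x_i}} + \max
  \begin{cases}
    V^{M}_{j-1}(i-1) + \log a_{M_{j-1}M_j} \\
    V^{I}_{j-1}(i-1) + \log a_{I_{j-1}M_j} \\
    V^{D}_{j-1}(i-1) + \log a_{D_{j-1}M_j}
  \end{cases} \\[4pt]
V^{I}_{j}(i) &= \max
  \begin{cases}
    V^{M}_{j}(i-1) + \log a_{M_jI_j} \\
    V^{I}_{j}(i-1) + \log a_{I_jI_j} \\
    V^{D}_{j}(i-1) + \log a_{D_jI_j}
  \end{cases} \\[4pt]
V^{D}_{j}(i) &= \max
  \begin{cases}
    V^{M}_{j-1}(i) + \log a_{M_{j-1}D_j} \\
    V^{I}_{j-1}(i) + \log a_{I_{j-1}D_j} \\
    V^{D}_{j-1}(i) + \log a_{D_{j-1}D_j}
  \end{cases}
\end{aligned}
```

Delete states emit nothing, so the delete equation moves from column j-1 to column j without advancing i.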

Forward algorithm

Replace the max function in the Viterbi equations with a sum.
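
The relationship between the two algorithms can be made concrete with one dynamic-programming routine that takes the combining function as a parameter: passing max gives a Viterbi score, passing a log-sum-exp gives a forward score. This is a simplified sketch under stated assumptions: transition log-probabilities are shared by all positions (a real profile HMM has position-specific ones), insert emissions equal the background so they contribute 0 in log-odds, and all names and numbers are made up for illustration.

```python
import math

NEG_INF = float("-inf")

def logsumexp(terms):
    """log(sum(exp(t))) computed stably; returns -inf if all terms are -inf."""
    m = max(terms)
    if m == NEG_INF:
        return NEG_INF
    return m + math.log(sum(math.exp(t - m) for t in terms))

def profile_score(x, match_log_odds, log_a, combine):
    """Log-odds score of x under a toy profile HMM.

    match_log_odds[j-1][res] is the match log-odds emission of state M_j.
    log_a maps "MM", "MI", "MD", "IM", "II", "ID", "DM", "DI", "DD"
    to log transition probabilities (position-independent: an assumption).
    combine is max (Viterbi) or logsumexp (forward).
    """
    L, n = len(match_log_odds), len(x)
    M = [[NEG_INF] * (n + 1) for _ in range(L + 1)]
    I = [[NEG_INF] * (n + 1) for _ in range(L + 1)]
    D = [[NEG_INF] * (n + 1) for _ in range(L + 1)]
    M[0][0] = 0.0  # the begin state plays the role of M_0
    for j in range(L + 1):
        for i in range(n + 1):
            if j >= 1 and i >= 1:
                M[j][i] = match_log_odds[j - 1][x[i - 1]] + combine([
                    M[j - 1][i - 1] + log_a["MM"],
                    I[j - 1][i - 1] + log_a["IM"],
                    D[j - 1][i - 1] + log_a["DM"],
                ])
            if i >= 1:
                # Insert emission equals background, so its log-odds is 0.
                I[j][i] = combine([
                    M[j][i - 1] + log_a["MI"],
                    I[j][i - 1] + log_a["II"],
                    D[j][i - 1] + log_a["DI"],
                ])
            if j >= 1:
                # Delete states are silent: i does not advance.
                D[j][i] = combine([
                    M[j - 1][i] + log_a["MD"],
                    I[j - 1][i] + log_a["ID"],
                    D[j - 1][i] + log_a["DD"],
                ])
    # The end state is treated like one more match state without emission.
    return combine([
        M[L][n] + log_a["MM"],
        I[L][n] + log_a["IM"],
        D[L][n] + log_a["DM"],
    ])

probs = {"MM": 0.8, "MI": 0.1, "MD": 0.1,
         "IM": 0.6, "II": 0.3, "ID": 0.1,
         "DM": 0.6, "DI": 0.1, "DD": 0.3}
log_a = {k: math.log(p) for k, p in probs.items()}
match_log_odds = [{"A": 1.0, "C": -1.0}, {"A": -1.0, "C": 1.0}]

viterbi = profile_score("AC", match_log_odds, log_a, max)
forward = profile_score("AC", match_log_odds, log_a, logsumexp)
```

Since a sum over all paths always includes the best path, the forward score is never smaller than the Viterbi score.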

Alternatives to log-odds scoring

LL (log likelihood) score: it is length dependent, so it has to be divided by the length, and a Z-score normalised by the standard deviation is used.
Log-odds scoring works better.
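
A small sketch of the two corrections mentioned for the LL score: dividing by the sequence length, and expressing a score as a Z-score relative to the LL scores of a reference set. How the reference set is chosen (here simply a list of scores for sequences of similar length) is an assumption of this example, and all numbers are made up.

```python
import statistics

def per_residue_ll(ll, length):
    """Length-normalised log likelihood."""
    return ll / length

def z_score(ll, reference_lls):
    """Number of standard deviations by which `ll` exceeds the mean
    of the reference scores."""
    mean = statistics.mean(reference_lls)
    sd = statistics.stdev(reference_lls)
    return (ll - mean) / sd

# Made-up LL scores for unrelated sequences of similar length.
reference = [-100.0, -102.0, -98.0, -101.0, -99.0]
z = z_score(-80.0, reference)
normalised = per_residue_ll(-80.0, 40)
```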

Alignment

Trace back through the Viterbi variables.

Profile HMM variants for non-global alignments

local, repeat, overlap

Optimal model construction

model construction
There are 2^L ways to mark L columns, so there are 2^L different ProfileHmm models.
Marking methods
1. manual construction
2. heuristic construction
3. MAP construction by DynamicProgramming

MAP match-insert assignment

Given that column i is marked as a match state, let j run from the column after i to the end of the alignment and determine which choice of j as the next match state maximises the likelihood.
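
One way to write this step as a dynamic-programming recurrence is given below. The symbols are assumptions introduced for this sketch: S_j is the log probability of the best model for the alignment up to column j with column j marked as match, T_{ij} is the log probability of the transitions between match columns i and j, M_j that of the symbol emissions in match column j, I_{i+1,j-1} that of the insert emissions in the columns between them, and lambda is a log prior term charged per match column.

```latex
S_0 = 0, \qquad
S_j = \max_{0 \le i < j} \Bigl[\, S_i + T_{ij} + M_j + I_{i+1,\,j-1} + \lambda \,\Bigr],
\qquad j = 1, \dots, L+1
```

Keeping a pointer to the maximising i for each j and tracing back from j = L+1 recovers the set of columns marked as match.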

Weighting training sequences

Simple weighting schemes derived from a tree

Based on an evolutionary tree.
There are two methods.

Root weights from Gaussian parameters

The weights express the influence of the leaves on the root distribution -> the mean of the root distribution is governed by the weight of each leaf.
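
A sketch of how such weights could be computed, assuming a binary tree with known branch lengths and a Gaussian diffusion whose variance grows linearly with branch length. At each internal node the two subtrees are combined by inverse-variance weighting, and the weight of a leaf at the root is the product of the factors along its path. The tree encoding and the branch lengths are illustrative assumptions.

```python
def root_weights(tree):
    """Return (weights, variance) for the subtree `tree`.

    A leaf is a string name; an internal node is a pair
    ((left_child, left_branch_length), (right_child, right_branch_length)).
    `weights` maps leaf name -> coefficient of that leaf in the mean of
    the Gaussian at this node; `variance` is that Gaussian's variance.
    """
    if isinstance(tree, str):
        return {tree: 1.0}, 0.0
    (left, t_left), (right, t_right) = tree
    w_left, s_left = root_weights(left)
    w_right, s_right = root_weights(right)
    var_left = s_left + t_left      # uncertainty seen through the left edge
    var_right = s_right + t_right   # uncertainty seen through the right edge
    total = var_left + var_right
    # Inverse-variance weighting: the closer subtree counts more.
    v_left, v_right = var_right / total, var_left / total
    weights = {name: v_left * w for name, w in w_left.items()}
    weights.update({name: v_right * w for name, w in w_right.items()})
    return weights, var_left * var_right / total

# Illustrative tree: x1 and x2 are close siblings, x3 hangs off the root.
node4 = (("x1", 2.0), ("x2", 2.0))
tree = ((node4, 8.0), ("x3", 5.0))
weights, _ = root_weights(tree)
```

For this tree the two close siblings share a weight of 5/14 between them (5/28 each), while the isolated leaf x3 receives 9/14.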

Voronoi weights

A method that does not rely on a tree.
In sequence space, some sequences are clustered together and others are spread out.
Polygons (a Voronoi diagram) are constructed from neighbouring sequences. Each sequence is weighted in proportion to the amount of empty space around it.
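
Since the volume of a Voronoi cell in sequence space is hard to compute exactly, it can be estimated by sampling. The following is a sketch under stated assumptions: random sequences are drawn by picking, at each position, uniformly among the residues seen in that column; each sample is credited to its nearest family sequence by Hamming distance, with ties split evenly; the weight is the fraction of samples credited. Function name, sample size and toy alignment are made up for illustration.

```python
import random

def voronoi_weights(alignment, n_samples=2000, seed=0):
    """Monte Carlo estimate of Voronoi weights for an ungapped alignment."""
    rng = random.Random(seed)
    # At each position, sample only from residues seen in that column.
    choices = [sorted(set(col)) for col in zip(*alignment)]
    counts = [0.0] * len(alignment)
    for _ in range(n_samples):
        sample = [rng.choice(c) for c in choices]
        dists = [sum(a != b for a, b in zip(seq, sample)) for seq in alignment]
        best = min(dists)
        nearest = [k for k, d in enumerate(dists) if d == best]
        for k in nearest:
            counts[k] += 1.0 / len(nearest)  # split ties evenly
    return [c / n_samples for c in counts]

# Two identical sequences and one isolated sequence.
weights = voronoi_weights(["AAAA", "AAAA", "CCCC"])
```

The two identical sequences have to share the space around them, so each ends up with a smaller weight than the isolated one.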

Maximum discrimination weights

An indirect method.
Maximising the discrimination D has the effect of emphasising the distant members of the family.
The weight is 1 - P(M|xi).
A kind of ExpectationMaximisation.
Problematic when there are misclassifications.
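
A sketch of the weight computation alone, assuming each training sequence already has a natural-log log-odds score s = log P(x|M) / P(x|R) against a random model R, and assuming a prior P(M) for the family. Bayes' rule then gives P(M|x), and the weight is 1 - P(M|x). In the full method the weights and the model would be re-estimated in turn, which is not shown here; the names and numbers are hypothetical.

```python
import math

def md_weights(log_odds_scores, prior_model=0.5):
    """w_i = 1 - P(M | x_i), with P(M | x_i) obtained from Bayes' rule."""
    prior_log_odds = math.log(prior_model / (1.0 - prior_model))
    weights = []
    for s in log_odds_scores:
        # P(M|x) = P(x|M)P(M) / (P(x|M)P(M) + P(x|R)(1 - P(M)))
        posterior = 1.0 / (1.0 + math.exp(-(s + prior_log_odds)))
        weights.append(1.0 - posterior)
    return weights

# Made-up scores: a clear member, a weak member, a borderline sequence.
weights = md_weights([8.0, 2.0, 0.0])
```

Sequences that the model already recognises well get weights near 0, while poorly recognised ones get weights near 1. This is also why a misclassified sequence in the training set is a problem: it attracts a large weight.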

Maximum entropy weight

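A method that averages the weights computed for each column.
Entropy can be used to combine the information from different columns. The more uniform a distribution is, the higher its entropy, so the weights that maximise the entropy are chosen.
Similar to maximum discrimination weights, and likewise problematic when there are outliers.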
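
The column-averaging step can be sketched as follows. This covers only the simple per-column scheme, not a full entropy maximisation: within a column, a sequence carrying residue a receives 1 / (m * k), where m is the number of distinct residues in the column and k is the number of sequences carrying a, so every residue type present gets an equal share. The per-column weights are then averaged. The function name and the ungapped toy alignment are assumptions.

```python
from collections import Counter

def position_based_weights(alignment):
    """Average, over columns, of 1 / (m * k) for each sequence."""
    n_cols = len(alignment[0])
    totals = [0.0] * len(alignment)
    for col in zip(*alignment):
        counts = Counter(col)
        m = len(counts)  # number of distinct residues in this column
        for k, res in enumerate(col):
            totals[k] += 1.0 / (m * counts[res])
    return [t / n_cols for t in totals]

# Two identical sequences and one different sequence.
weights = position_based_weights(["AAA", "AAA", "CCC"])
```

The duplicated sequences each receive 0.25 and the distinct one 0.5, which makes the weighted residue distribution in every column uniform over the residues observed there.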