RnaSecondaryStructurePrediction

BiologicalSequenceAnalysis에서 RNA 서열을 이용한 secondary RnaStructure 예측문제. NoamChomsky의 TransformationalGrammar에서 ContextFreeGrammar를 적용할 수 있다.

RNA는 서열내에 자체의 2차구조를 보존하고 있기때문에, 이것으로인해 Protein과 DNA와는 다른 좀더 복잡한 문제이다. RNA와 관련된 문제는 크게 두가지.

RNA secondary structure prediction for single sequence
- DynamicProgramming : NussinovRnaFoldingAlgorithm, ZukerAlgorithm
- StochasticContextFreeGrammar(SCFG)
MultipleAlignment of families of related RNAs
- ProfileHmm(ProfileHMM), CovarianceModel(CMs)
- Comparative RNA sequence analysis : Consensus structure prediction from RNA MultipleAlignment

아직 SCFG-based RNA analysis는 널리 사용되지 않는다. ComputationalComplexity문제가 남아 있으며, 개선된 방법들이 연구중에 있다.

Contents

RNA
RNA secondary structure prediction
Covariance models: SCFG-based RNA profiles

RNA

Terminology of RNA secondary structure

complementary pair(canonical pair) : A-U, G-C
stem : contiguous stacked base pairs, generally form a regular DoubleHelix
hairpin loop : single stranded subsequences bounded by base pairs
bulge : single stranded bases occurring within a stem
multi-branched loops : from which three or more stems radiate
non-canonical pairs : G-U --> distort regular A-form RNA helix --> favoured target of proteins specialised for recognising RNA
nested fashion : if we draw arcs over an RNA sequence connecting the base pairs, none of the arcs need to cross each other
pseudoknot : base pairs between a loop and positions outside the enclosing stem
- ZukerAlgorithm, NussinovRnaFoldingAlgorithm등의 DynamicProgramming기반의 방법뿐만 아니라, StochasticContextFreeGrammar algorithm 역시 pseudoknot를 고려하지 않는다. 이것은 ContextSensitiveGrammar의 방법으로 가능하다. 2차구조까지는 pseudoknot없이도 의미있는 결과를 볼수 있으나, 3차구조를 보려면 반드시 필요하다.

RNA sequence evolution is constrained by structure

compensatory mutations에 의해, 서열유사성을 별로 공유하지 않아도, 구조적유사성을 지니는 RNA들이 많다. 그 예로, R17이라는 bacterial virus의 RNA의 consensus서열을 보면 알 수 있다. (LyticCycle일때, R17 coat protein이 이부분과 binding하여 translation을 repression한다.)

NNN'N'N'N'N'RN'N'ANYANYAN'N'N'N'N'N'N'

N	{A, C, G, U}
Y	{C, U}
R	{A, G}
N'	complementary base pair

위의 consensus sequence와 매치되는 서열을 SequenceAlignment방법으로 하는것은 의미가 없다. InformationTheory에 의하면,

{A, C, G, U} : 2 bit
{R, Y} : 1 bit
N : 0 bit
N' : 2 bit

따라서, 위의 consensus sequence는 총 6 bit의 정보를 지니며, 64(2⁶)개의 RNA서열과 매치시킬 수 있다. 여기에 핵산 일곱개가 추가되면 총 20 bit가 되어 2²⁰중 하나꼴로 매치확률을 줄일 수 있다. 이것을 실제 virus genome (GenBank MS2CG)로 검색해보면, 6bit일때 38개나 매치를 얻고 20 bit로 정보량을 늘려야만 하나의 매치를 얻는다.

이러한 RNA pattern-matching program으로 RnaMot가 있다.

Inferring structure by comparative sequence analysis

RNA 서열의 MultipleAlignment를 통해서, frequent correlated complementary mutation을 찾는 방법도 유용하다.

정확한 구조를 Comparative analysis로 유추하려면,

정확한 MultipleAlignment 정보가 필요하다.
그러기 위해선 정확한 구조를 알아야 한다.
따라서 iterative refinement process가 필요하다.

정량적인 pairwise sequence covariation을 측정하기 위해서 InformationTheory의 mutual information Mij가 사용된다.

Mij = SUM(xi, yi){ fxixj log(2, fxixj/fxifxj) }

fxi : frequency of one of the four bases observed in column i
fxixj : joint frequency observed in columns i and j

따라서, Mij는 두 열사이에얼마나 많은 joint가 생길 수 있는가를 측정한다. 0에서 2사이의 bit를 가지며, 완전랜덤일때 0, base pair일때 2 bit가 된다. Mij를 contour plot으로 그려보면 stem이 있을법한 부위가 예측된다. 더불어서, 추가적인 3차구조 폴딩역시 예측된다.

RNA secondary structure prediction

서열길이가 길어질수록 가능한 2차구조는 기하급수적으로 증가한다. 200 bp일때, 10⁵⁰개의 가능한 구조가 존재한다. 이중에서 생물학적으로 의미있는 것을 가려내는일이 중요하다.

Base pair maximisation and the Nussinov folding algorithm

NussinovRnaFoldingAlgorithm based in DynamicProgramming

An SCFG version of the Nussinov algorithm

ScfgNussinovRnaFoldingAlgorithm : StochasticContextFreeGrammar version of the NussinovRnaFoldingAlgorithm

Energy minimisation and the Zuker folding Algorithm

사실, RNA folding은 base pair의 갯수세기나 최대화라고 하기보다는 biophysics적인 현상이다. 물리학적 중요성을 반영한 ZukerAlgorithm은 현존하는 가장 복잡한 RNA 2차구조 예측 알고리즘이다.

Suboptimal RNA folding

ZukerAlgorithm은 최적의 구조를 찾는다. 그러나 생물학적인 정확한 구조는 계산된 최적의 구조가 아닌경우가 종종 있다. 따라서, suboptimal인것도 찾는 방법이 연구되었다.

ZukerSuboptimalFoldingAlgorithm은 CYK algorithm과 유사하며, inside and outside direction모두 포함한다.

One matrix : 모든 subsequence i, j with i, j paired에 대해 최적구조의 GibbsFreeEnergy를 찾는다.
Second matrix : i,j paired and i+1, j-1을 제외한 subsequence에 대해 최적구조의 GibbsFreeEnergy를 찾는다.

Base pair confidence estimates

energy minimisation folding algorithm을 위한 particlar base pairs or structure의 확률계산을 위해 Partition function calculation이 소개되었다. McCaskillAlgorithm은 GibbsFreeEnergy를 GibbsBoltzmannEquation으로 바꾸고, ForwardAlgorithm처럼, 한가지의 최적구조를 뽑는것이 아닌 모든 구조의 총합을 사용한다.

SCFG관점에서 볼때, McCaskillAlgorithm은 기본적으로 InsideOutsideAlgorithm이며, ZukerAlgorithm은 CYK algorithm이다.

Covariance models: SCFG-based RNA profiles

RNA family들 역시 공통된 consensus secondary structure를 공유할 것이다. Protein이나 DNA의 경우, ProfileHmm로 profile을 모델링할 수 있으나, RNA의 경우에는 앞에 소개된대로 잘 들어맞지 않는다. 따라서 StochasticContextFreeGrammar기반의 방법을 사용하며, 이것을 CovarianceModel(CMs)라고 한다.

ProfileHmm이 MultipleAlignment에 맞춰진 repetitive linear HMM architecture인데 비해, CovarianceModel은 repetitive tree-like SCFG architecture이다.

SCFG models of ungapped RNA alignments

Production rules :

transformation	name	description
P --> aWb	pairwise	16 pair emission probabilities)
L --> aW	leftwise	4 singlet emission probabilities
R --> Wa	rightwise	4 singlet emission probabilities
B --> SS	bifucation	probability 1
S --> W	start	probability 1
E --> e	end	probability 1

위 rule로 SCFG를 만들게 되면, CM nonterminal 을 HiddenMarkovModel의 state로 생각할 수 있고, 따라서 ungapped SCFG는 EmissionProbability per pairwise state, EmissionProbability per leftwise or rightwise state, TransitionProbability 1을 가진다. gap이 있을 경우, TransitionProbability를 조절함으로써 insertion, delection을 고려할 수 있다.

Design of covariance models

ProfileHmm is a linear string of a single type of three-state node.
CovarianceModel is a branched tree of nodes, where there are different types of nodes with different numbers of states.

Complete CM is a directed graph of states, main line of the CM is exactly the consensus tree

Construction of a CM from an RNA alignment

주어진 RNA MultipleAlignment에서 CovarianceModel만들기.

non-insertion column을 이용해서 consensus structure tree를 먼저 만든다.
tree의 node들은 CM states로 채워진다.
states는 state transition에 의해 연결된다.
Based on this assignment of alignment columns to tree nodes, individual symbols can be assigned a unique CM parse tree.
이들 parse tree에서 emission과 transition event가 각각 세어진다.
이 count가 symbol emission과 state TransitionProbability의 estimation에 사용된다. 주로 Dirichlet incorporating 혹은 mixture Dirichlet prior and doing mean posterior estimation의 방법이 사용된다.

CM structure는 consensus RNA structure를 정확하게 반영하기 때문에 이들 과정은 명백하고 매우 빠르다.

CM alignment algorithm

The seven state types of a CovarianceModel

State (sv)	Production	∆Lv	∆Rv	Emission	Transition
P	Wv --> xi Wy xj	1	1	ev(xi, xj)	tv(y)
L	Wv --> xi Wy	1	0	ev(xi)	tv(y)
R	Wv --> Wy xj	0	1	ev(xj)	tv(y)
D	Wv --> Wy	0	0	1	tv(y)
S	Wv --> Wy	0	0	1	tv(y)
B	Wv --> Wy Wz	0	0	1	1
E	Wv --> e	0	0	1	1

Scoring : InsideAlgorithm
OutsideAlgorithm

Inside-outside expectation maximisation for CM parameters

만일 모델의 구조는 알려져 있는데, probability parameters를 모를때 InsideOutsideAlgorithm이라 불리우는 ExpectationMaximisation 방법으로 estimation할 수 있다.

CM database search와 structural alignment에는 CykAlgorithm이 사용된다.

Automated comparative sequence analysis usings CMs

RNA family를 가지고 있는데, MultipleAlignment도 모르고, consensus secondary structure도 모를때, 두방법을 조합한 자동화된 Comparative sequence analysis 방법.

build an optimal (or nearly optimal) CM structure given the current alignments
build an optimal MultipleAlignment given the current CM.
iterate.

An example of a practical application of CM algorithms

대부분의 Genome에서, 가장 큰 gene family는 Protein gene family가 아니라, structural RNA genes family인 tRNA이다. 예를 들어, yeast에는 274 tRNA gene이 있으며, 인간의 경우 약 1500개의 서로다른 tRNA 유전자들이 존재한다. 이들 tRNA의 detection program은 주로 대규모 Genome annotation project의 일부가 된다.