BioinformaticsTheMachineLearningApproach Chap.1
Preface and Introduction
Preface
- full [Genome] sequencing
Advances in HighThroughPut / combinatorial technologies such as DnaMicroArray and MassSpectrometry
overwhelming data -> need for computer / statistical / MachineLearning techniques
Bioinformatics in the Post-genome Era
computational analysis : sequence analysis -> sophisticated integration of extremely diverse sets of data
main driving force of change : LargeScaleSequencing, DnaMicroArray
- growing importance of data interpretation
Our ability in the future to make new biological discoveries will depend strongly on our ability to combine and correlate diverse data sets along multiple dimensions and scales
system and integrative biology : (sequence + structure + functional + gene expression + pathways + phenotypic + clinical ...) * data
- needed software : storing, retrieving, networking, processing, analyzing, navigating, visualizing biological information
new concepts : GeneticAlgorithm, ArtificialNeuralNetwork, ComputerVirus, SyntheticImmuneSystem, DnaComputer, ArtificialLife, hybrid VLSI-DNA GeneChip.
Large databases of biological information -> data mining problems
inherent complexity of biological systems caused by evolutionary tinkering -> machine learning approach {{{Nature is a tinkerer and not an inventor (Jacob, 1977) }}}
- the growth of biological data has recently outpaced the growth in computer speed
To a beginner, MachineLearning methods look like unrelated techniques, but they are not. -> Bayesian probabilistic framework {{{On the theoretical side, a unifying framework for all machine-learning methods also has emerged since the late 1980s. This is the Bayesian probabilistic framework for modelling and inference }}}
- the need for probabilistic modelling of biological data
biological data are inherently noisy
complexity and variability of biological systems caused by evolutionary tinkering -> majority of the variables remain hidden
An often-met criticism of machine-learning techniques : black box approaches {{{One cannot always pin down exactly how a complex neural network, or hidden Markov model, reaches a particular answer }}}
-> but PolymeraseChainReaction, GelElectrophoresis, and pharmacological effects are black boxes too
Audience and Prerequisites
Technical prerequisites for the book are basic calculus, algebra, and discrete probability theory, at the level of an undergraduate course. Any prior knowledge of DNA, RNA, and proteins is of course helpful, but not required
Content and General Outlines of the Book
Chapter 2 : most important theoretical chapter
- The inevitable use of probability theory and sparse graphical models are really the two central ideas behind all the methods.
Chapter 5-9, Chapter 12 : the core of the book
Appendix
What Is New and What Is Omitted
At the theoretical level, we would have liked to be able to go more into higher levels of BayesianInference and BayesianNetwork.
Vocabulary and Notation
ComputationalMolecuarBiology == DNA computing
(artificial) NeuralNetwork
Chapter 1. Introduction
1.1 Biological Data in Digital Symbol Sequences
- chain molecule (digital symbol sequence) : DNA, RNA, [Protein]...
- digital nature of biological sequence data
Competent comparison of sequence patterns across species must take into account that biological sequences are inherently noisy, the variability resulting in part from random events amplified by evolution -> Because DNA or AminoAcid sequences with a given function or structure will differ (and be uncertain), sequence models must be probabilistic
1.1.1 Database Annotation Quality
public databases : diverse groups generate the data, and an even more diverse set of groups annotates them -> the error rate in information handling is likely to be higher than the initial experimental error
- Present-day computers are designed to handle numbers.
Computers do not like content-addressable procedures for annotating and retrieving information -> only perfect matches
- Biological sequence retrieval algorithm : fuzzy representation of their content
difficulties with annotation and data storage -> potential sources of error must be taken into account when a MachineLearning approach is used for prediction and classification.
Many kinds of error exist; some of them can be detected with prior knowledge or with MachineLearning techniques.
- discard data from unclear sources
One reason MachineLearning techniques have been successful is that they can cope with noise when given large corpora of sequences
1.1.2 Database Redundancy
Another problem is redundancy of the data : several groups may submit the same sequence -> entries that are not identical are often closely related
- in sequencing projects, redundancy can arise from different experimental approaches : genomic DNA vs. cDNA, single-pass sequence vs. tenfold repetition
- cDNA sequences reflect any alternative splicing that has taken place.
- different versions of the same [Gene] in the [Genome]
- it is not unlikely that at least 30-80% of the genes are alternatively spliced; in fact it may be the rule rather than the exception
data redundancy -> parallel gene expression -> noise
The use of a redundant data set -> at least three potential sources of error
- erroneous data must not enter the training set : they can be removed algorithmically, but it is preferable to remove them beforehand
avoid putting too closely related (redundant) sequences into the same data set -> trade-off between data set size and nonredundancy
- alternative strategy : assign weights according to sequence novelty
- major risk : erroneous data may receive a large weight and thus gain major influence
- Sequence profile : very productive way of exploiting database redundancy
- it does not contain information about the individual sequences.
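The two strategies noted above (dropping near-duplicates versus down-weighting them) can be sketched as follows. This is only an illustration: it assumes the sequences are already aligned to equal length, and the function names and the 0.8 identity cutoff are hypothetical choices, not values from the book.

```python
def identity(a, b):
    """Fraction of identical positions between two equal-length aligned sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be pre-aligned to the same length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def reduce_redundancy(seqs, max_identity=0.8):
    """Greedy filter: keep a sequence only if it is not too similar to any kept one."""
    kept = []
    for s in seqs:
        if all(identity(s, k) <= max_identity for k in kept):
            kept.append(s)
    return kept


def novelty_weights(seqs, max_identity=0.8):
    """Alternative: keep everything, but weight each sequence by
    1 / (number of sequences, itself included, that are close to it)."""
    weights = []
    for s in seqs:
        neighbours = sum(identity(s, t) > max_identity for t in seqs)
        weights.append(1.0 / neighbours)
    return weights


seqs = ["ACGTACGTAC", "ACGTACGTAT", "TTGCAAGGCC"]
print(reduce_redundancy(seqs))   # the second sequence is dropped as a near-duplicate
print(novelty_weights(seqs))     # the two near-duplicates share their weight
```

The filter shrinks the data set, while the weighting keeps every sequence; with weighting, an isolated erroneous entry still gets full weight, which is the major risk mentioned above.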
1.2 Genomes - Diversity, Size, and Structure
- [Genome] diversity : genomes vary widely in size, storage principle, and so on
- sense(positive), antisense(negative) / ambisense (bi-direction)
- prokaryote, eukaryote, archaeon
The chromosome in some organism is not stable (genome transposable element) -> major obstacle in determining the genomic sequence
- Some theories claim that a high number of chromosomal components is advantageous and increases the speed of evolution, but currently there is no final answer.
Figure 1.2 : nonoverlapping intervals of genome size -> genome size jumps discontinuously with the complexity of the life form.
- In eukaryotes, a few exceptional classes (e.g. mammals, birds, and reptiles) have genome sizes confined to a narrow interval.
- Vertebrates share a lot of basic machinery.
- amphibian genomes range from 700 to 80,000 Mbp. Nevertheless, they are surely less complex than most humans in their structure and behavior.
1.2.1 Gene Content in the Human [Genome] and other Genomes
- genes : one or several segments that constitute an expressible unit
- Genes may encode a protein product, or they may encode one of the many RNA molecules that are necessary for the processing of genetic material and for the proper functioning of the cell.
- noncoding regions : sequence segments that do not directly give rise to gene products. Noncoding regions can be parts of genes
- In some organisms, such as bacteria, where the genome size is a strong growth-limiting factor, almost the entire genome is covered with coding (protein and RNA) regions; in other, more slowly growing organisms the coding part may be as little as 1-2%
- The noncoding part of a genome will often contain many pseudo-genes.
The biggest surprise from the HumanGenomeProject : gene content may be as low as in the order of 30,000 genes.
- earlier estimates were 100,000-200,000 genes.
The fact that worms have almost as many genes as humans is somewhat irritating.
AlternativeSplicing, multiplexing the function of genes
biological complexity (linear, polynomial, exponential, factorial) : 2^30000 / 2^20000
1.1% of the human sequence seems to be coding. However, with an estimate of 40,000 genes, about 1/3 of the HumanGenome lies within a [Gene]
- C-value : mass of the nuclear DNA in an unreplicated haploid genome
- C-value paradox : C-values of eukaryotic genomes vary at least 80,000-fold across species.
- noncoding DNA just accumulates in the nuclear genome
- In plants, where some of the most exorbitant genomes have been identified, clear evidence for a correlation between genome size and climate has been established.
- the message length fails to be a good measure of the quality of the information exchanged.
- completely new sequencing approaches : data will grow even faster, but growth is expected to stagnate once mammalian sequencing is finished
- Today, the raw sequencing of a complete prokaryotic genome may take less than a day.
1.3 Proteins and Proteomes
1.3.1 From [Genome] to [Proteom]
- proteomes : contain the total protein expression of a set of chromosomes. In multicellular organisms the proteome differs with cell type and over time.
Proteins often undergo a large number of PostTranslationalModification
- glycosylation
- phosphorylation
- addition of fatty acids
- cleavage of signal peptides in the N-terminus
MachineLearning techniques are useful for proteome analysis as well.
1.3.2 Protein Length Distribution
- organisms have selected polypeptide chains that have a stable structure in the water or lipid environment in which they perform their function
protein folding : long-range interactions -> major obstacle to computational approach
A key question is whether the protein sequences we see today represent edited versions of sequences that were of essentially random composition when evolution started working on them. Alternatively, they could have been created early on with a considerable bias in their composition.
- random origin hypothesis : null hypothesis
- evidence for long-range order and regularity in protein primary structure is accumulating.
- Surprisingly, species-specific regularity exists. The typical length of prokaryotic proteins is consistently different from the typical length in eukaryotes.
concentration of disulfide bonds : cf. hair perms (permanent waves)
- Several other types of long-range regularities
- One quite surprising observation has been that proteins appear to be made out of different sequence units with characteristic lengths of ~125 amino acids in eukaryotes and ~150 amino acids in prokaryotes.
- the length distributions of the polypeptide chains may be more fundamental than what conventionally is known as the "primary" structure of proteins.
Annotated protein primary structure : protein sequence databases, SwissProt
ProteinDataBank, [PDB] : three-dimensional ProteinStructure determined by XrayCrystallography or [NMR]
1.3.3 Protein Function
- Many functional aspects of proteins are determined mainly by local sequence characteristics, and do not depend critically on a full 3D structure maintained in part by long-range interactions.
- Determining protein function in the laboratory can be difficult.
- Methods for determining protein function
- many protein characteristics can be inferred from the sequence.
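- DNA array, chip technology : genes belonging to the same cluster show similar expression across time or tissue type.
- Rosetta stone method : if proteins that are separate in one organism are found as one fused multidomain protein in another organism, a functional link between them can be inferred.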
- phylogenetic profiles
1.3.4 Protein Function and GeneOntology
GeneOntology Consortium : builds a dynamic controlled vocabulary based on molecular function, biological process, and cellular component.
1.4 On the Information Content of Biological Sequences
- Data-driven prediction methods should be able to extract essential features from individual examples and to discard unwanted information when present.
MachineLearning techniques are excellent for the task of discarding and compacting redundant sequence information.
NeuralNetwork : transforms a complex topology in the input sequence space into a simpler representation (the basic principle of all multivariate analysis)
- sequence space : related functional or structural categories end up clustered rather than scattered
amino acid segments of length 13 whose central residue is in a helical conformation -> 20^13 possible segments -> different structural categories are typically not found in separate regions of sequence space -> but machine-learning techniques are used because they can handle nonlinearities.
Some sequence segments may even have the ability to attain both the helix and the sheet conformation : prion (proteinaceous and infectious) -> hot debate
- Another issue : What fraction of a protein's amino acid sequence is sufficient to specify its structure?
- Paracelsus Challenge : The task is to convert one protein fold into another, while retaining 50% of the original sequence.
- According to recent studies, residues determine the fold in a highly nonlinear manner.
natural language <-> biological sequence (<-> 卦) : analogous
- error correction in the genetic code
In codon-anticodon recognition, the most frequent errors either yield the same AminoAcid or insert an amino acid with at least somewhat similar properties (hydrophobicity)
- Data-compression : in later chapters both implicit and explicit use of compression in connection with machine learning will be described
- conventional text-compression schemes are constructed so that they can recover the original data perfectly without losing a single bit.
- some sequences remain useful even when the original message is not recovered completely : sound data
- The study of the statistical properties of repeated segments in biological sequences, and especially their relation to the evolution of genomes, is highly informative.
- Such analysis provides much evidence for events more complex than the fixation and incorporation of single stochastically generated mutations.
- Combination of interacting genomes, both between individuals in the same species and by horizontal transfer of genetic information between species, represents intergenome communication, which makes the analysis of evolutionary pathways difficult.
- In biological sequences repeats are clearly good targets for compaction
- Lempel-Ziv algorithm
- Coding regions are more compressible, with lower randomness and information content.
- A Hidden Markov model is a generative model; because it is usually trained to capture the regularities of a set of sequences, most possible sequences receive a probability close to 0.
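As a rough illustration of why repeats are good targets for compaction, the sketch below compares how well a repetitive and a pseudo-random DNA string compress. It uses Python's `zlib` module (DEFLATE, which builds on an LZ77-style Lempel-Ziv scheme) as a stand-in for the Lempel-Ziv algorithm; the sequences, the seed, and the function name are made up for the example.

```python
import random
import zlib


def compression_ratio(seq):
    """Compressed size divided by raw size (smaller means more compressible)."""
    raw = seq.encode("ascii")
    return len(zlib.compress(raw, 9)) / len(raw)


rng = random.Random(0)                       # fixed seed, for reproducibility only
random_seq = "".join(rng.choice("ACGT") for _ in range(10_000))
repeat_seq = "ACGTTGCA" * 1250               # same length, built from one repeated unit

print(f"random : {compression_ratio(random_seq):.3f}")
print(f"repeat : {compression_ratio(repeat_seq):.3f}")
```

Even the random string shrinks, because a four-letter alphabet needs only about 2 bits per symbol while it is stored in 8-bit bytes; the repetitive string shrinks far more, which is the regularity a compressor (or a trained model) exploits.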
1.4.1 Information and Information Reduction
Classification and prediction algorithms are in general computational means for reducing the amount of information.
- The input is information-rich sequence data, and the output may be a single number or, in the simplest case, a yes or no representing a choice between two categories.
- The contractive character of these algorithms means that they cannot be inverted : e.g. a sum does not reveal its terms
- This is also true for much of the sequence-related information processing that takes place in the cell.
- other examples : removal of introns from pre-mRNA, RNA editing
Many other examples of logically and physically irreversible processes exist. This fact is of course related to the irreversible thermodynamic nature of most life processes.
Can you put your hand into the Han River and tell where the water came from? Solitons?
- The information reduction inherent in computational classification and prediction makes it easier to see why in general it does not help to add extra input data to a method processing a single data item.
- Protein secondary structure prediction normally works better when based on 13 amino acid segments instead of segment of size 23 or higher.
Machine-learning approaches may have some advantages over other methods. Weights in neural networks vanish during training unless positive or negative correlations keep them alive and put them into use. see also GracefulDegradation?
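To make the window-size remark concrete, here is a minimal sketch of how a sequence window might be turned into network input. It assumes a one-hot (sparse) encoding over the 20 standard amino acids with all-zero vectors for positions beyond the sequence ends; the function name, the padding choice, and the example sequence are illustrative assumptions, not the book's exact setup.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"   # the 20 standard residues, one-letter codes


def encode_window(seq, center, window=13):
    """One-hot encode the residues in a window centred on position `center`.

    Positions that fall outside the sequence are encoded as all zeros
    (an assumption made here for simplicity).
    """
    half = window // 2
    vec = []
    for pos in range(center - half, center + half + 1):
        one_hot = [0] * len(AMINO_ACIDS)
        if 0 <= pos < len(seq):
            one_hot[AMINO_ACIDS.index(seq[pos])] = 1
        vec.extend(one_hot)
    return vec


seq = "MKTAYIAKQRQISFVKSHFSRQ"           # made-up example sequence
print(len(encode_window(seq, 10, window=13)))   # 13 * 20 inputs
print(len(encode_window(seq, 10, window=23)))   # 23 * 20 inputs
```

Going from 13 to 23 residues nearly doubles the number of inputs, and hence the number of first-layer weights to be estimated from the same data, without necessarily adding information relevant to the central residue.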
- Information reduction is a key feature in the understanding of almost any kind of system.
- machine-learning algorithm will create a simpler representation of a sequence space that can be much more powerful and useful than the original data containing all details.
Yin-yang and the five elements, the 64 hexagrams
- Lewis Carroll : maps and mapping
1.4.2 Alignment Versus Prediction : When Are Alignments Reliable?
- When is the sequence similarity high enough that one may safely infer either a structural or a functional similarity from the pairwise alignment of two sequences?
- It is well known that proteins can be structurally very similar even if the sequence similarity is very low.
- The necessary and sufficient similarity threshold will be different for each task.
- In general, one may say that in the zone of safe inference, alignment should be preferred to prediction.
Sander and Schneider : length-dependent threshold function -> choose between alignment and prediction methods according to the threshold
- safe zone of inference : guideline only
- importance of coding [SNP]s : In many cases the change of a single amino acid is known to lead to a completely different, possibly unfolded and nonfunctional protein
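The length-dependent threshold idea can be sketched as a small decision helper. The constants below (290.15, -0.562, the 24.8% plateau above 80 aligned residues, and the lower limit of 10) are assumed to follow the commonly cited form of the Sander and Schneider curve and should be checked against the original paper before any real use; the function names are hypothetical.

```python
def identity_threshold(length):
    """Percent identity above which structural similarity may be inferred
    for an alignment of `length` residues (constants are assumptions, see above)."""
    if length < 10:
        raise ValueError("the curve is not defined for very short alignments")
    if length > 80:
        return 24.8
    return 290.15 * length ** -0.562


def choose_method(percent_identity, length):
    """Prefer alignment inside the safe zone of inference, prediction outside it."""
    if percent_identity >= identity_threshold(length):
        return "alignment"
    return "prediction"


for n in (20, 40, 80, 200):
    print(n, round(identity_threshold(n), 1))
print(choose_method(40.0, 100))
print(choose_method(20.0, 100))
```

Short alignments need a much higher identity before an inference is considered safe, and, as noted above, the resulting zone is a guideline only.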
1.4.3 Prediction of Functional Features
- The sequence identity threshold for structural problems cannot be used directly in sequence prediction problems involving functionality.
- zone-separating principle
- to split each sequence into a number of subsequences
1.4.4 GlobalAlignment and LocalAlignment and SubstitutionMatrix Entropies
- the matches produced by an alignment algorithm depend entirely on the parameters and on the type of alignment algorithm (global, local, etc.).
- Classical alignment algorithms are based on dynamic programming : Needleman-Wunsch, Smith-Waterman algorithm
dynamic programming suffers from combinatorial explosion -> heuristics are needed to reduce the search
- how "local" a local alignment is depends strongly on the choice of substitution matrix
- every substitution matrix can be viewed in terms of its substitution matrix entropy
- using the identity matrix as an example...
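A minimal sketch of local alignment by dynamic programming in the Smith-Waterman style is given below. It assumes a simple match/mismatch score in place of a real substitution matrix and a linear gap penalty; the scoring values and the function name are illustrative only.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Return (best score, aligned part of a, aligned part of b) for a local alignment."""
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]      # H[i][j]: best score ending at a[i-1], b[j-1]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(0,                    # start a new local alignment
                          H[i - 1][j - 1] + s,  # align a[i-1] with b[j-1]
                          H[i - 1][j] + gap,    # gap in b
                          H[i][j - 1] + gap)    # gap in a
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Trace back from the best cell until the score drops to zero.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        s = match if a[i - 1] == b[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + s:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif H[i][j] == H[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


print(smith_waterman("TTTACGTTT", "GGACGTGG"))
```

Changing `match`, `mismatch`, or `gap` changes which region is reported and how far it extends, which is the parameter dependence noted at the start of this section.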
1.4.5 Consensus Sequences and Sequence Logos
When studying the specificity of molecular binding sites, it is common to build a consensus sequence from an alignment, taking the most frequent nucleotide or amino acid as the representative of each position. -> loss of information
A graphical visualization technique based on the Shannon information content at each position is the sequence logo approach
- Fig 1.8
- If D is summed over the region of the site, one gets a measure of the accumulated information in a given type of site
two translation initiation patterns in the extremely thermophilic archaeon Sulfolobus solfataricus : logos are useful for examining different patterns in different parts of the data -> Fig 1.9
- Sequence logos are useful for a quick examination of the statistics in the context of functional sites or regions
Sequence logos using monomers will treat the positions in the context of a site independently
The visualization technique can easily handle the occurrence of dinucleotides or dipeptides -> Fig 1.10
- the logo formula can be rewritten in terms of relative entropy : reference probability distribution
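The difference between a consensus sequence and a logo can be sketched numerically. The example assumes the usual Shannon form D = log2(alphabet size) - H for each alignment column, with a uniform reference distribution and no small-sample correction; the alignment and the function names are made up for illustration.

```python
import math
from collections import Counter


def column_information(column, alphabet_size=4):
    """Information content D (in bits) of one alignment column."""
    n = len(column)
    entropy = -sum((c / n) * math.log2(c / n) for c in Counter(column).values())
    return math.log2(alphabet_size) - entropy


def logo_profile(alignment, alphabet_size=4):
    """Return the consensus string and the per-position information content."""
    columns = list(zip(*alignment))
    consensus = "".join(Counter(col).most_common(1)[0][0] for col in columns)
    info = [column_information(col, alphabet_size) for col in columns]
    return consensus, info


alignment = ["ACGT", "ACGA", "ACTC", "AGAG"]     # toy DNA site, 4 sequences
consensus, info = logo_profile(alignment)
print(consensus)
print([round(d, 3) for d in info])
print(round(sum(info), 3))                        # accumulated information over the site
```

The consensus gives every position one letter regardless of how conserved it is, whereas the per-position values separate the fully conserved first column from the uninformative last one; their sum is the accumulated information over the site.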
A NeuralNetwork goes beyond positionwise uncorrelated analysis of sequences, such as simple logo visualization. Neural networks have the ability to process the sequence data nonlinearly where correlation between different positions can be taken into account.
- O-glycosylation site
- dipeptide or more complex weight matrices
divide all the positive cases into two or more classes : nonlinear problem -> linear one
- drawback of linear technique is that it becomes impossible to subtract evidence.
- at least initially, changing the input representation through a neural network can produce a more connected topology of sequence space.
- Which method should be used? Many people's experience is that machine-learning methods are productive in that a near-optimal method can be developed quickly when the data are relatively clean
1.5 Prediction of Molecular Function and Structure
- What can you do with your sequence once you have it?
1.5.1 Sequence-based Analysis
- Intron splice sites and branch points in eukaryotic pre-mRNA
GeneFinding in prokaryote and eukaryotes
- Recognition of promoters - transcription initiation and termination
- Gene expression levels
- Prediction of DNA bending and bendability
- Nucleosome positioning signals
- Sequence clustering and cluster topology
- Prediction of RNA secondary structure
- Other functional sites and classes of DNA and RNA
- Protein structure prediction
- Protein function prediction
- Protein family classification
- Protein degradation