StatisticalMethodsInBioinformatics

[Chap2]

Chap.1 [Probability] Theory (1) : One random variable

1.1 Introduction

  • We wish to gauge whether the two sequences show significant similarity, to assess, for example, whether they have a remote common ancestor.
  • 11 matches out of 26 aligned sequence positions -> p = 0.04 (probability calculation)

  • This chapter provides an introduction to the probability theory relating to a single random variable.
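
The p = 0.04 figure can be reproduced as a binomial tail probability. This is a sketch under an assumed null model (each of the 26 aligned positions matches independently with probability 1/4; the concrete match probability is an assumption here, not stated in these notes):

```python
from math import comb

# Assumption: null model in which each of the n = 26 aligned positions
# matches independently with probability p = 1/4.
def binom_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, j) * p**j * (1 - p)**(n - j) for j in range(k, n + 1))

p_value = binom_tail(11, 26, 0.25)
print(round(p_value, 3))  # ≈ 0.04
```

Eleven or more matches would thus arise by chance only about 4% of the time under this null model.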

1.2 Discrete Random Variables, Definitions

  • 1.2.1 Probability Distributions and Parameters
    • discrete random variable : e.g. roll two six-sided dice; the random variable might be the sum of the two numbers showing on the dice.

    • By convention, random variables are written as uppercase symbols, often X, Y, Z, while the eventually observed values of a random variable are written in lowercase, for example x, y, and z.
    • The probability distribution of a discrete random variable Y is the set of values that this random variable can take, together with their associated probabilities.

    • The probability distribution is often presented in the form of a table.
    • There are two other frequently used methods of presenting a probability distribution : chart or diagram
    • However, a third method of presentation is more appropriate in theoretical work, namely through a mathematical function, P_Y(y).
    • Another important function is the distribution function, F_Y(y) of the discrete random variable Y.
    • P_Y(y) vs. F_Y(y) : Figure 1.2
  • 1.2.2 Independence
    • The concept of independence is central in probability and statistics.
    • Two or more events are independent if the outcome of one event does not affect in any way the outcome of any other event.
    • e.g. different rolls of a die

1.3 Six Important '''Discrete''' Probability Distributions

  • 1.3.1 One Bernoulli Trial
    • A Bernoulli trial is a single trial with two possible outcomes, often called "success" and "failure" : p / 1-p (or q)
  • 1.3.2 The Binomial Distribution
    • A binomial random variable is the number of successes in a fixed number n of independent Bernoulli trials with the same probability of success for each trial.
    • the four conditions that give rise to the binomial distribution
    • There are several important comments to make concerning Bernoulli trials and the binomial distribution : five points
  • 1.3.3 The Uniform Distribution
  • 1.3.4 The [GeometricDistribution]

    • Suppose that a sequence of independent Bernoulli trials is conducted, each trial having probability p of success.
    • The random variable of interest is the number Y of trials before but not including the first failure.
    • One example of the use of success runs occurs in the comparison of the two sequences in (1.1), where the length of the longest success run in this comparison is three. We might wish to use [GeometricDistribution] probabilities to assess whether this is a significantly long run.

    • Geometric-Like Random Variables : central to BLAST theory
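
A minimal sketch of the geometric distribution as defined above (Y = number of successes before, but not including, the first failure), with an assumed match probability p = 1/4 for illustration:

```python
# Geometric distribution: Y = number of successes before (but not
# including) the first failure in a sequence of Bernoulli(p) trials.
p = 0.25  # assumption: illustrative per-trial success probability

def geom_pmf(y, p):
    """P(Y = y) = p**y * (1 - p)."""
    return p**y * (1 - p)

def geom_tail(y, p):
    """P(Y >= y): the first y trials must all be successes."""
    return p**y

# Probability of a success run of length at least 3:
print(geom_tail(3, p))  # 0.015625
```

The tail probability p**y is what one would use to judge whether a run of three matches is significantly long.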
  • 1.3.5 The Negative Binomial and the Generalized Geometric Distributions
    • In some applications the roles of the number of trials and the number of successes are reversed, in that the number of successes is fixed in advance (at the value m), and the random variable N is the number of trials up to and including this mth success.

  • 1.3.6 The Poisson Distribution
    • "n large, p small, np moderate" condition

1.4 The Mean of a Discrete Random Variable

  • The mean of a random variable is often confused with the concept of an average, and it is important to keep the distinction between the two concepts clear.
  • There are several remarks to make regarding the mean of a discrete random variable.
    • mean : expected value
    • The word "average" is not an alternative for the word "mean," and has a quite different interpretation from that of "mean."

    • The mean is not necessarily a realizable value of a discrete random variable.

1.5 The Variance of a Discrete Random Variable

  • The variance is a measure of the dispersion of the probability distribution of the random variable around its mean.

  • The variance, like the mean, is often unknown to us.
  • summary table on p. 18
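
Both remarks can be checked on a fair die, a sketch using the standard identity Var(Y) = E[Y^2] - (E[Y])^2:

```python
from fractions import Fraction

# Fair six-sided die: P(Y = y) = 1/6 for y = 1, ..., 6.
dist = {y: Fraction(1, 6) for y in range(1, 7)}

mean = sum(y * pr for y, pr in dist.items())               # E[Y]
var = sum(y**2 * pr for y, pr in dist.items()) - mean**2   # E[Y^2] - (E[Y])^2

print(mean)  # 7/2 -- not a value the die can actually show
print(var)   # 35/12
```

The mean 7/2 illustrates that the mean need not be a realizable value of the random variable.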

1.6 General Moments of a Probability Distribution

  • The mean and variance are special cases of moments of a discrete probability distribution.
  • Equation 1.29

1.7 The Probability-Generating Function

  • The original purpose of the pgf is to generate probabilities, as the name suggests.
  • Another use of the pgf is to derive moments of a probability distribution.
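
A numerical sketch of the second use: for Y ~ Binomial(n, p) the pgf is G(s) = E[s^Y] = (1 - p + p*s)^n, and G'(1) equals the mean np. The parameters below are assumptions for illustration:

```python
# PGF of Y ~ Binomial(n, p): G(s) = E[s^Y] = (1 - p + p*s)**n.
# Its derivative at s = 1 gives the mean: G'(1) = n*p.
n, p = 10, 0.3  # assumption: illustrative parameters

def pgf(s):
    return (1 - p + p * s) ** n

# Approximate G'(1) with a central finite difference.
h = 1e-6
mean_from_pgf = (pgf(1 + h) - pgf(1 - h)) / (2 * h)
print(round(mean_from_pgf, 4))  # n*p = 3.0
```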

1.8 Continuous Random Variables

1.9 The Mean, Variance, and Median of a Continuous Random Variable

  • 1.9.1 Definitions
  • 1.9.2 Chebyshev's Inequality

1.10 Five Important '''Continuous''' Distributions

  • 1.10.1 The Uniform Distribution
  • 1.10.2 The Normal Distribution
    • standard normal distribution : mean 0, variance 1
    • standardized random variable Z = (X - mu)/sigma
    • within two standard deviations of the mean : probability ≈ 0.95
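
The two-standard-deviation figure can be computed from the standard normal CDF (the commonly quoted 0.95 is a rounding; exactly two standard deviations gives about 0.9545):

```python
from math import erf, sqrt

def Phi(z):
    """Standard normal CDF, written via the error function."""
    return 0.5 * (1 + erf(z / sqrt(2)))

# Standardization: if X ~ N(mu, sigma^2), then Z = (X - mu)/sigma ~ N(0, 1),
# so P(|X - mu| < 2*sigma) = P(|Z| < 2).
print(round(Phi(2) - Phi(-2), 4))  # 0.9545
```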
  • 1.10.3 The Normal Approximation to a Discrete Distribution
    • One of the many uses of the normal distribution is to provide approximations for probabilities for certain random variables.
    • The normal approximation to the binomial is a consequence of the central limit theorem.

    • For p=0.5 (the only case where the binomial probability distribution is symmetric) the approximation is good even for n as small as 20.
    • continuity correction
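
A sketch of the continuity correction for the n = 20, p = 0.5 case mentioned above (the choice of the cutoff k = 12 is an arbitrary illustration):

```python
from math import comb, erf, sqrt

def Phi(z):
    return 0.5 * (1 + erf(z / sqrt(2)))

# X ~ Binomial(n = 20, p = 0.5); approximate P(X <= 12) by a normal with
# the same mean n*p and variance n*p*(1-p), using the continuity correction.
n, p, k = 20, 0.5, 12
mu, sigma = n * p, sqrt(n * p * (1 - p))

exact = sum(comb(n, j) * p**j * (1 - p)**(n - j) for j in range(k + 1))
approx = Phi((k + 0.5 - mu) / sigma)  # continuity correction: k -> k + 0.5

print(round(exact, 4), round(approx, 4))  # both close to 0.868
```

Replacing k by k + 0.5 spreads the discrete mass at each integer over a unit interval, which is what makes the approximation good even at n = 20.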
  • 1.10.4 The Exponential Distribution
    • The exponential distribution is the continuous analogue of the geometric distribution.
  • 1.10.5 The Gamma Distribution
    • The exponential distribution is a special case of the gamma distribution where k=1.
    • The value of k need not be an integer, but if it is, then GAMMA(k) = (k-1)!
    • Another important special case is where ... -> chi-square distribution

    • It can be shown that if Z is a standard normal random variable, then Z^2 is a random variable having the chi-square distribution with one degree of freedom.
  • 1.10.6 The Beta Distribution
    • The uniform distribution is the special case of this distribution ...

1.11 The Moment-Generating Function

  • We conclude this section with two important properties of mgfs.
    • First, let X be any continuous random variable ... g(X) is itself a random variable.
    • The second property of mgfs is important for BLAST theory ...
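
A sketch checking one concrete mgf numerically: for X ~ Exponential(lam), M(t) = E[e^{tX}] = lam/(lam - t) for t < lam. The parameter values and the seed are assumptions for reproducibility of this illustration:

```python
import random
from math import exp

random.seed(0)  # assumption: arbitrary seed, for reproducibility

# MGF of X ~ Exponential(lam): M(t) = E[e^{tX}] = lam / (lam - t), t < lam.
lam, t = 2.0, 0.5  # assumption: illustrative values
N = 100_000
mc_estimate = sum(exp(t * random.expovariate(lam)) for _ in range(N)) / N

print(round(mc_estimate, 2), lam / (lam - t))  # Monte Carlo vs exact 4/3
```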

1.12 Events

  • 1.12.1 What Are Events?
    • In many contexts, however, it is more natural, or more convenient, to consider probabilities relating to events rather than to random variables.
    • We simply think of an event as something that either will or will not occur when some experiment is performed.
    • A certain event is an event that must happen.

  • 1.12.2 Complements, Unions, and Intersections
  • 1.12.3 Probabilities of Events
  • 1.12.4 Conditional Probabilities
  • 1.12.5 Independence of Events

1.13 The Memoryless Property of the Geometric and the Exponential Distributions
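
A quick numeric check of the memoryless property for both distributions, using arbitrary illustrative parameters:

```python
from math import exp, isclose

# Geometric (as in 1.3.4): P(Y >= y) = p**y, so
# P(Y >= a + b | Y >= a) = p**(a+b) / p**a = p**b = P(Y >= b).
p, a, b = 0.25, 2, 3  # assumption: illustrative values
print(isclose(p**(a + b) / p**a, p**b))  # True

# Exponential analogue: P(X > x) = exp(-lam * x), so
# P(X > s + t | X > s) = P(X > t).
lam, s, t = 1.5, 0.7, 2.0  # assumption: illustrative values
print(isclose(exp(-lam * (s + t)) / exp(-lam * s), exp(-lam * t)))  # True
```

In both cases the conditional tail probability "forgets" how long the process has already run.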

1.14 Entropy, Relative Entropy, Scores, and Support

  • 1.14.1 Entropy
    • The entropy of a probability distribution is a measure of the evenness of that distribution and thus, in a sense, of the unpredictability of any observed value of a random variable having that distribution.
  • 1.14.2 Relative Entropy
    • The relative entropy of two probability distributions measures in some sense the dissimilarity between them.
    • However, since H( P0 || P1) is not equal to H( P1 || P0) , ... -> divergence
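
A sketch of both quantities, with two assumed illustrative distributions, showing the asymmetry that motivates the symmetrized divergence:

```python
from math import log2

def entropy(P):
    """H(P) = -sum p*log2(p), with 0*log(0) taken as 0."""
    return -sum(p * log2(p) for p in P if p > 0)

def rel_entropy(P, Q):
    """H(P || Q) = sum p*log2(p/q); assumes q > 0 wherever p > 0."""
    return sum(p * log2(p / q) for p, q in zip(P, Q) if p > 0)

P0 = [0.25, 0.25, 0.25, 0.25]  # uniform: the most "even" distribution
P1 = [0.7, 0.1, 0.1, 0.1]      # assumption: an uneven example

print(entropy(P0), entropy(P1))                  # 2.0 bits is the maximum for 4 outcomes
print(rel_entropy(P0, P1), rel_entropy(P1, P0))  # unequal: relative entropy is asymmetric
```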

  • 1.14.3 Scores and Support

1.15 Transformations

1.16 Empirical Methods

  • One consequence of currently available computational power, however, is that we can simulate an experiment many times and obtain "data" from each simulated replication. We can then approximate probability distributions and their means and variances using these simulation "data."
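
A sketch of this empirical approach, applied to the claim in 1.10.5 that Z^2 has the chi-square distribution with one degree of freedom (mean 1, variance 2); the seed and sample size are assumptions:

```python
import random

random.seed(1)  # assumption: arbitrary seed, for reproducibility

# Simulate Z^2 for Z standard normal; by 1.10.5 this has the chi-square
# distribution with one degree of freedom, whose mean is 1 and variance is 2.
N = 200_000
samples = [random.gauss(0.0, 1.0) ** 2 for _ in range(N)]

mean = sum(samples) / N
var = sum((x - mean) ** 2 for x in samples) / N
print(round(mean, 2), round(var, 2))  # close to 1 and 2
```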
web biohackers.net