3. Aspects of Statistical Inference
- This chapter provides a selection of technical statistical tools for readers with little background in statistics.
Knowledge of computer programming is essential for anyone doing methodological work.
In this chapter, what is largely missing is a discussion of Bayesian inference methods - a reflection of my personal bias.
3.1 Likelihood and Lod Score
- Likelihood and lod score are measures of plausibility of the observed data (they differ only in scale - the lod score is on a log10 scale).
- Their values depend on the assumed value of theta. If a number of theta values are tried and the likelihood (or lod score) is computed for each of them, the plausibility of the data will be largest for one specific theta value, which is then taken to be the best estimate (the maximum likelihood estimate) of theta.
To explain a particular phenomenon in nature, scientists build a model (also called a hypothesis) of the phenomenon.
- Good models are those that explain and accurately predict a large number of properties. e.g. Mendelian laws
- Hypotheses are verified on the basis of observations. The better the data are in agreement with a hypothesis, the more readily we accept it as being true. But how are we to measure how well observations agree with a hypothesis, or whether observations agree better with hypothesis H1 or H2?
The statistical quantity that seems most suitable to serve as a measure for our belief in a particular hypothesis is the likelihood.
- The likelihood for a hypothesis H given a set of observations F is defined as the probability, L(H) = P(F; H), with which the observations have occurred, this probability being calculated under the hypothesis in question.
It is often different values of an unknown parameter whose likelihoods are of interest. Thus, the distinction between likelihood and probability is that the former is a function of a parameter (an unknown constant) whereas the latter is a function of an event.
- The two quantities, likelihood and probability, thus have different properties and follow different laws, but both are calculated by the laws of probability calculus.
- When comparing hypotheses (or values of an unknown parameter), the absolute values of their likelihoods are not generally meaningful, so they are often scaled by suitable constants.
The odds in favor of hypothesis H1 versus H2 are expressed by the likelihood ratio.
In linkage analysis for two loci, the two basic hypotheses are free recombination (H0) and linkage (H1). They are defined through the value of the recombination fraction, free recombination corresponding to theta = 1/2 and linkage to theta < 1/2.
- Logarithm of the likelihood ratio (lod score) is used as the measure of support for linkage versus absence of linkage.
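For the simplest phase-known situation, where recombinants and nonrecombinants can be counted directly, the lod score can be computed in a few lines (a sketch; the function name and example counts are made up for illustration):

```python
import math

def lod_score(theta, n_rec, n_nonrec):
    """Lod score for phase-known data: n_rec recombinants and n_nonrec
    nonrecombinants. L(theta) is proportional to theta^n_rec * (1-theta)^n_nonrec,
    and the lod score compares it against the null value theta = 1/2."""
    n = n_rec + n_nonrec
    log10_L = n_rec * math.log10(theta) + n_nonrec * math.log10(1 - theta)
    log10_L0 = n * math.log10(0.5)
    return log10_L - log10_L0

# 2 recombinants out of 10 meioses; the MLE of theta is 2/10 = 0.2
z = lod_score(0.2, 2, 8)
```

A positive lod score supports linkage over free recombination; at theta = 1/2 the score is zero by construction.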
3.2 Maximum Likelihood Estimation
Many models or hypotheses contain variables, called parameters, whose values are unknown and to be estimated on the basis of observations.
- Any function of random variables that does not depend on unknown parameters is termed a statistic.
- In particular, estimates or estimators are functions of observations and are constructed to estimate unknown parameter values.
Various statistical methods of parameter estimation exist. A very general one is the method of maximum likelihood.
An MLE of a parameter is usually symbolized by the parameter symbol with a hat on top.
Generally speaking, when we treat P(x|y) as a function of x we refer to it as a probability; when we treat it as a function of y we call it a likelihood. Note that a likelihood is not a probability distribution or density, but simply a function of the variable y.
-> When y varies, summing P(x|y) over all y can exceed 1, so the likelihood is not a probability distribution. e.g. P(male|soldier) = P(soldier|male)P(male) / P(soldier)
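The soldier example can be worked through numerically with Bayes' theorem (all probabilities below are invented for illustration):

```python
# Hypothetical numbers, for illustration only.
p_male = 0.5                    # P(male)
p_soldier_given_male = 0.04     # P(soldier | male)
p_soldier_given_female = 0.004  # P(soldier | female)

# Law of total probability: P(soldier)
p_soldier = (p_soldier_given_male * p_male
             + p_soldier_given_female * (1 - p_male))

# Bayes' theorem: P(male | soldier)
p_male_given_soldier = p_soldier_given_male * p_male / p_soldier
```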
In linkage analysis, as will be seen later, the likelihood cannot generally be maximized analytically. Instead, MLEs must be found numerically by varying the values of the parameters of interest and recomputing the likelihood for many trial values of the parameter until an approximate maximum is found.
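The numerical search described above can be sketched as a simple grid search over trial values of theta (phase-known binomial case; function names and counts are illustrative):

```python
import math

def log_likelihood(theta, n_rec, n_nonrec):
    # Log likelihood for phase-known meioses (binomial kernel).
    return n_rec * math.log(theta) + n_nonrec * math.log(1 - theta)

def grid_mle(n_rec, n_nonrec):
    # Trial values 0.001, 0.002, ..., 0.499 (linkage restricts theta to (0, 1/2]);
    # return the trial value with the largest likelihood.
    trial_thetas = [i / 1000 for i in range(1, 500)]
    return max(trial_thetas, key=lambda t: log_likelihood(t, n_rec, n_nonrec))

theta_hat = grid_mle(3, 17)  # 3 recombinants in 20 meioses
```

Here the grid search recovers the analytic answer 3/20 = 0.15; in real linkage problems the likelihood has no closed form and only the numerical route is available.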
- nuisance parameter : a parameter that is not of direct interest but must be included in the model (and estimated) for the analysis to be valid
3.3 Statistical Properties of Maximum Likelihood Estimates
- Maximum likelihood estimates have certain well-known properties.
- Function of a parameter
- It is often easier to find the MLE of a function of a parameter rather than of the parameter itself.
- Bias
- Maximum likelihood estimates are often biased.
- estimates of nonlinear functions of the class probabilities are generally biased.
Bias reduction techniques for this case were proposed by Huether and Murphy. A general method for bias reduction is the jackknife procedure.
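The jackknife idea can be sketched generically: recompute the estimate on each leave-one-out sample and combine, which removes the O(1/n) component of the bias (the squared-mean estimator below is just an illustrative example of a biased nonlinear estimate):

```python
def jackknife_bias_corrected(estimator, data):
    """Jackknife bias correction:
    theta_jack = n * theta_hat - (n - 1) * mean(leave-one-out estimates)."""
    n = len(data)
    theta_hat = estimator(data)
    loo = [estimator(data[:i] + data[i + 1:]) for i in range(n)]
    return n * theta_hat - (n - 1) * sum(loo) / n

# Example: the plug-in estimator of (mean)^2 is biased upward by var/n.
def squared_mean(xs):
    m = sum(xs) / len(xs)
    return m * m

data = [1.0, 2.0, 3.0, 4.0, 5.0]
corrected = jackknife_bias_corrected(squared_mean, data)
```

For these data the plug-in estimate is 9.0, while the jackknife returns 8.5, which here coincides with the unbiased estimate mean^2 - s^2/n.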
- Bolling and Murphy (1979) demonstrated strong biases in the estimate of the recombination fraction from certain family types. These biases are at least partly due to truncation.
The statistical bias due to truncation is not a matter of concern; the really serious biases are those arising from selected sampling, when only a portion of the data is analyzed.
In general, MLEs are asymptotically unbiased, that is, their bias tends to vanish when the number of observations becomes large, where observations refers to either the number of families or the number of individuals within a family.
- Consistency
MLEs generally are consistent, that is, they are asymptotically unbiased, and their variances approach zero in the limit for a large number of observations.
- In other words, the accuracy of MLEs increases with increasing sample size.
- In the presence of ascertainment biases, as will be seen in Chapter 10, MLEs are often inconsistent. This is a most unfortunate situation because with the accumulation of more and more data, one will find an increasingly precise estimate of a quantity that is different from the one to be estimated.
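A small simulation illustrates consistency in the simplest phase-known case, where the MLE of theta is just the observed proportion of recombinants (all settings below are illustrative):

```python
import random

random.seed(0)
theta_true = 0.2

def simulate_theta_hat(n_meioses):
    # Each meiosis is recombinant with probability theta_true;
    # the MLE is simply the observed proportion of recombinants.
    recs = sum(1 for _ in range(n_meioses) if random.random() < theta_true)
    return recs / n_meioses

def mean_squared_error(n_meioses, reps=2000):
    ests = [simulate_theta_hat(n_meioses) for _ in range(reps)]
    return sum((e - theta_true) ** 2 for e in ests) / reps

mse_small = mean_squared_error(20)    # few observations
mse_large = mean_squared_error(500)   # many observations
```

The mean squared error shrinks roughly as theta(1-theta)/n, so the estimate concentrates around the true value as the number of meioses grows, which is exactly what consistency asserts.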
- Classes of Data
- For the special case that the number of parameters is equal to the number of degrees of freedom, MLEs may simply be obtained by equating observations to their expected value (Bailey's rule).
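Bailey's rule in the simplest case: for k recombinants among n meioses, equating the observed count to its expectation n*theta gives theta-hat = k/n, which agrees with the maximum of the likelihood (illustrative numbers):

```python
import math

# Bailey's rule: with as many parameters as degrees of freedom, set the
# observed count equal to its expectation. E[count] = n * theta, so
# theta_hat = k / n.
k, n = 4, 25
theta_hat = k / n  # 0.16

# Cross-check: the binomial log likelihood is (numerically) stationary there.
def loglik(theta):
    return k * math.log(theta) + (n - k) * math.log(1 - theta)

eps = 1e-6
derivative = (loglik(theta_hat + eps) - loglik(theta_hat - eps)) / (2 * eps)
```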
3.4 Significance Tests and p-Values
- type 1 error : false-positive
- type 2 error : false-negative
- sensitivity : conditional probability that the test is positive given the disease is present
- specificity : conditional probability that the test is negative given the disease is absent
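These four quantities follow directly from a 2x2 table of true status versus test result (the counts below are hypothetical):

```python
# Hypothetical 2x2 counts: rows = true status, columns = test result.
true_pos, false_neg = 90, 10    # disease present
false_pos, true_neg = 30, 870   # disease absent

sensitivity = true_pos / (true_pos + false_neg)  # P(test+ | disease)
specificity = true_neg / (true_neg + false_pos)  # P(test- | no disease)
false_positive_rate = 1 - specificity            # type 1 error rate
false_negative_rate = 1 - sensitivity            # type 2 error rate
```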
- Simple hypothesis : single parameter value
- To test H0 on the basis of observations, one may use the likelihood ratio, T = L(theta1) / L(1/2), as a test statistic.
- The statistical test consists of the decision rule to reject H0 if T exceeds a critical point : likelihood ratio (LR) test
- To determine the error probabilities associated with this test, one must work out the distribution of the test statistic, ...
- Not knowing the distribution of the test statistic, one can give an upper bound alpha. With this, probability calculus leads to the Chebyshev-type inequality.
- In the tests considered thus far, the number of observations is viewed as a fixed constant.
- In the sequential probability ratio test (SPRT), which was developed by Wald in connection with acceptance sampling, data are accumulated until a stopping criterion is met. The number of observations is thus a random variable.
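A minimal sketch of Wald's SPRT for testing theta = 1/2 against a specific alternative theta1, assuming phase-known 0/1 recombination outcomes (the stopping boundaries use Wald's approximations A ≈ (1-beta)/alpha and B ≈ beta/(1-alpha)):

```python
import math

def sprt(observations, theta1, alpha=0.05, beta=0.05):
    """Wald's SPRT for H0: theta = 1/2 vs H1: theta = theta1.
    observations: iterable of 0/1 (1 = recombinant meiosis).
    Returns (decision, number of observations used)."""
    upper = math.log((1 - beta) / alpha)  # cross above: accept H1
    lower = math.log(beta / (1 - alpha))  # cross below: accept H0
    llr = 0.0
    for i, x in enumerate(observations, start=1):
        if x == 1:
            llr += math.log(theta1 / 0.5)
        else:
            llr += math.log((1 - theta1) / 0.5)
        if llr >= upper:
            return "H1", i
        if llr <= lower:
            return "H0", i
    return "continue", len(observations)

# A run with no recombinants quickly favors linkage (H1).
decision, n_used = sprt([0] * 30, theta1=0.1)
```

Unlike a fixed-sample test, the procedure may stop after only a handful of observations when the data are decisive, which is the point Wald emphasized.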
- Composite hypothesis : whole range of parameter values
In practice, of course, one wants to consider not just a single alternative but the whole range of values. A possible solution, proposed by Haldane and Smith, is to use as a test statistic the average value, T, of the likelihood ratio between 0 and 1/2. -> It is not used in current linkage analysis and is mentioned here only for historical reasons.
- Most tests in general use are LR tests. The lod score method as used today falls into this category of tests.
- Tests of a hypothesis H0 may be carried out in a one-sided or two-sided fashion.
- empirical significance level
3.5 The Likelihood Method
- The support (log likelihood) function may be used for finding an MLE, in which case only its properties at the MLE of the parameter are of interest.
- No significance test is carried out in the likelihood approach.
various approaches exist -> each of these approaches has advantages and disadvantages.
- As we will see in Chapter 4, human linkage analysis uses elements from more than just a single method.
3.6 Interval Estimation
- confidence level
- In linkage analysis, determining a proper confidence interval on theta is usually impossible because the distribution of the MLE is unknown.
- In many applications (such as multipoint linkage analysis) the accuracy of this approximation is unknown and may be quite unsatisfactory.
Another route often taken is to construct support intervals.
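A common convention is the 1-unit support interval: all theta values whose lod score lies within 1 of the maximum. A grid-based sketch for the phase-known binomial case (function names and counts are illustrative):

```python
import math

def lod(theta, n_rec, n_nonrec):
    # Lod score against theta = 1/2 for phase-known counts.
    n = n_rec + n_nonrec
    return (n_rec * math.log10(theta) + n_nonrec * math.log10(1 - theta)
            - n * math.log10(0.5))

def support_interval(n_rec, n_nonrec, drop=1.0):
    # All trial theta values whose lod score lies within `drop` units of
    # the maximum form the support interval (grid approximation).
    grid = [i / 1000 for i in range(1, 500)]
    z_max = max(lod(t, n_rec, n_nonrec) for t in grid)
    inside = [t for t in grid if lod(t, n_rec, n_nonrec) >= z_max - drop]
    return min(inside), max(inside)

lo, hi = support_interval(4, 36)  # 4 recombinants in 40 meioses
```

The interval brackets the MLE (here 0.1) and is asymmetric, reflecting the skewed shape of the likelihood near the boundary of the parameter space.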
3.7 Bayes Theorem