DSC 3091 - Advanced Statistics Applications I

Point Estimation and Confidence Intervals

Dr Jagath Senarathne

Department of Statistics and Computer Science

Point Estimation

Recall

  • Parameter: A characteristic that is used to describe the population.

  • Statistic: A function of the observable random variables in a sample that does not include any unknown quantities.

  • Estimator: A statistic that is used to estimate an unknown parameter.

Point Estimation Cont…

Parameter | Estimator
Population mean \(\mu\) | Sample mean \(\bar{x}\)
Population variance \(\sigma^2\) | Sample variance \(s^2\)
Population proportion \(p\) | Sample proportion \(\hat{p}\)

Maximum Likelihood Estimators

  • The maximum likelihood estimator (MLE) is the point in the parameter space that maximizes the likelihood function.

  • The likelihood function is given by \[L(x,\theta)=\prod_{i=1}^{n}f(x_i,\theta)\]

  • The idea of maximum likelihood estimation is to first assume our data come from a known family of distributions that contain parameters.

  • Then the maximum likelihood estimates (MLEs) of the parameters will be the parameter values that are most likely to have generated our data.

Example 1

  • Consider a simple coin-flipping example. Let’s say we flipped a coin 200 times and observed 103 heads and 97 tails. If the probability of “success” (i.e. getting a head) is \(p\),
    1. Define a function that will calculate the likelihood function for a given value of \(p\); then

    2. Search for the value of \(p\) that results in the highest likelihood.
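The two steps of Example 1 can be sketched in R as follows. This is a minimal sketch that assumes a binomial model for the 200 flips; the names `likelihood`, `p_grid`, `p_hat_grid` and `p_hat_opt` are illustrative and not part of the original slides.

```r
# Step 1: likelihood of p given 103 heads in 200 flips (binomial model assumed)
likelihood <- function(p, heads = 103, n = 200) {
  dbinom(heads, size = n, prob = p)
}

# Step 2a: grid search over candidate values of p
p_grid <- seq(0.001, 0.999, by = 0.001)
p_hat_grid <- p_grid[which.max(likelihood(p_grid))]

# Step 2b: one-dimensional numerical optimiser as a cross-check
p_hat_opt <- optimize(likelihood, interval = c(0, 1), maximum = TRUE)$maximum

p_hat_grid
p_hat_opt
```

Both searches should land close to the sample proportion 103/200, which is the closed-form MLE of \(p\) for binomial data.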

Example 2

Suppose we have data points representing the weight (in kg) of students in a class.

 [1] 59.001 38.267 41.025 35.555 46.690 20.994 39.407 52.780 57.495 52.416
[11] 60.062 48.149 40.182 50.929 49.472 49.197 43.459 40.493 60.196 58.590
[21] 53.645 53.837 61.134 62.115 46.517 41.404 56.500 53.281 44.821 47.610
[31] 51.178 58.315 34.411 47.795 41.828 60.767 60.797 51.421 51.570 48.313
[41] 47.310 58.078 38.753 35.692 50.604 42.070 53.403 47.405 36.952 53.682

This dataset appears to follow a normal distribution. Find the MLEs of the mean and standard deviation of this distribution.

Normal distribution - Maximum Likelihood Estimation

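  • The MLE of \(\mu\) is defined as \(\hat{\mu}_{MLE}=\arg\max_{\mu} L(x_1,\dots,x_n\mid\mu,\sigma^2)\); that is, \(\hat{\mu}_{MLE}\) is the value of \(\mu\) that maximizes the likelihood function.
  • If we maximise the above likelihood function, we get \(\hat{\mu}_{MLE}=\bar{x}.\)
  • Maximising over \(\sigma^2\) in the same way gives \(\hat{\sigma}^2_{MLE}=\frac{1}{n}\sum_{i=1}^{n}(x_i-\bar{x})^2\), which divides by \(n\) rather than \(n-1\).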

  • Since the MLE of \(\mu\) is the sample mean, computing the MLE in R becomes straightforward.
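For the weights in Example 2, the closed-form MLEs can be computed directly, with a numerical maximisation as a sanity check. This is a sketch under the normality assumption stated in the example; the variable names and the use of `optim()` on the negative log-likelihood are choices made here for illustration.

```r
# Weights (kg) from Example 2
weights <- c(59.001, 38.267, 41.025, 35.555, 46.690, 20.994, 39.407, 52.780, 57.495, 52.416,
             60.062, 48.149, 40.182, 50.929, 49.472, 49.197, 43.459, 40.493, 60.196, 58.590,
             53.645, 53.837, 61.134, 62.115, 46.517, 41.404, 56.500, 53.281, 44.821, 47.610,
             51.178, 58.315, 34.411, 47.795, 41.828, 60.767, 60.797, 51.421, 51.570, 48.313,
             47.310, 58.078, 38.753, 35.692, 50.604, 42.070, 53.403, 47.405, 36.952, 53.682)
n <- length(weights)

# Closed-form MLEs: sample mean, and standard deviation with divisor n (not n - 1)
mu_hat <- mean(weights)
sigma_hat <- sqrt(sum((weights - mu_hat)^2) / n)

# Numerical cross-check: minimise the negative log-likelihood.
# The sd is parameterised on the log scale so that it stays positive.
negloglik <- function(par) {
  -sum(dnorm(weights, mean = par[1], sd = exp(par[2]), log = TRUE))
}
fit <- optim(c(50, log(10)), negloglik)

c(mu_hat = mu_hat, sigma_hat = sigma_hat)
c(mu_num = fit$par[1], sigma_num = exp(fit$par[2]))
```

Note that `sd(weights)` uses the divisor \(n-1\), so it is slightly larger than the MLE of \(\sigma\).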

Interval Estimation

  • Point estimators are often used as sample measures for population parameters.

  • It is also helpful to know how reliable this estimate is, that is, how much sampling uncertainty is associated with it.

  • A useful way to express this uncertainty is to calculate an interval estimate or confidence interval for the population parameter.

  • In other words, the confidence interval is of the form “point estimate ± uncertainty”.

Confidence Interval for Mean

Case 1: When the data are normal or the sample is large, and \(\sigma\) is known.

\[\bar{x}\pm z_{\alpha/2}\sigma/\sqrt{n}\]
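The Case 1 formula can be computed by hand in base R. This is a minimal sketch: `z_interval` is a helper name introduced here, and the sample and the "known" value \(\sigma=8\) are simulated assumptions, not course data.

```r
# z-based confidence interval for the mean when sigma is known
z_interval <- function(x, sigma, conf.level = 0.95) {
  alpha <- 1 - conf.level
  z <- qnorm(1 - alpha / 2)                        # critical value z_{alpha/2}
  mean(x) + c(-1, 1) * z * sigma / sqrt(length(x))
}

set.seed(1)
x <- rnorm(40, mean = 50, sd = 8)   # hypothetical sample; sigma = 8 assumed known
z_interval(x, sigma = 8)
```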

Case 2: When the data are normal or the sample is large, and \(\sigma\) is unknown.

\[\bar{x}\pm t_{n-1,\alpha/2}\,s/\sqrt{n}\]
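For Case 2, the sample standard deviation replaces the unknown \(\sigma\) and the critical value comes from the t distribution with \(n-1\) degrees of freedom. The sketch below computes the interval by hand and compares it with the one reported by `t.test()`; `t_interval` and the simulated sample are illustrative assumptions.

```r
# t-based confidence interval for the mean when sigma is unknown
t_interval <- function(x, conf.level = 0.95) {
  n <- length(x)
  t_crit <- qt(1 - (1 - conf.level) / 2, df = n - 1)   # critical value t_{n-1, alpha/2}
  mean(x) + c(-1, 1) * t_crit * sd(x) / sqrt(n)
}

set.seed(2)
x <- rnorm(15, mean = 50, sd = 8)   # hypothetical small sample

manual_ci  <- t_interval(x)
builtin_ci <- t.test(x, conf.level = 0.95)$conf.int    # same interval from base R

manual_ci
builtin_ci
```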

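Case 3: When the data are non-normal and the sample is small.

  • For this case, a bootstrap approach can be used: resample the data with replacement many times, compute the mean of each resample, and take percentiles of those means as the interval limits.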
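A percentile bootstrap interval for the mean can be sketched in a few lines of base R. The skewed sample, the number of resamples `B` and the choice of the percentile method are assumptions made here for illustration; other bootstrap intervals (for example those in the boot package) are also possible.

```r
set.seed(42)
x <- rexp(20, rate = 1 / 5)   # hypothetical small, skewed sample

B <- 5000                     # number of bootstrap resamples (assumed)

# Resample with replacement and record the mean of each resample
boot_means <- replicate(B, mean(sample(x, replace = TRUE)))

# 95% percentile bootstrap confidence interval for the population mean
boot_ci <- quantile(boot_means, probs = c(0.025, 0.975))
boot_ci
```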

Confidence Intervals for Difference of Means

Case 1: Sampling from two independent normal distributions with known variances.

library("BSDA")
# For two samples, supply y and the known standard deviations sigma.x and sigma.y
z.test(x, y = NULL, alternative = "two.sided",
       sigma.x = NULL, sigma.y = NULL, conf.level = 0.95)
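With known standard deviations, the interval for \(\mu_x-\mu_y\) is \((\bar{x}-\bar{y})\pm z_{\alpha/2}\sqrt{\sigma_x^2/n_x+\sigma_y^2/n_y}\). The sketch below computes it by hand in base R so that it runs without BSDA; it should agree with the interval `z.test()` reports for the same inputs. The helper `z_diff_interval` and the simulated samples are assumptions for illustration.

```r
# z-based confidence interval for the difference of two means, known sigmas
z_diff_interval <- function(x, y, sigma.x, sigma.y, conf.level = 0.95) {
  z  <- qnorm(1 - (1 - conf.level) / 2)
  se <- sqrt(sigma.x^2 / length(x) + sigma.y^2 / length(y))   # standard error of xbar - ybar
  (mean(x) - mean(y)) + c(-1, 1) * z * se
}

set.seed(3)
x <- rnorm(30, mean = 52, sd = 6)   # hypothetical sample 1, sigma.x = 6 assumed known
y <- rnorm(35, mean = 49, sd = 5)   # hypothetical sample 2, sigma.y = 5 assumed known

z_diff_interval(x, y, sigma.x = 6, sigma.y = 5)
```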

Case 2: Sampling from two independent normal distributions with unknown variances (small samples).

  • when population variances are equal

    t.test(x, y, alternative = "two.sided",
           var.equal = TRUE, conf.level = 0.95)

  • when population variances are unequal

    t.test(x, y, alternative = "two.sided",
           var.equal = FALSE, conf.level = 0.95)
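The two calls above can be tried on simulated data, extracting the interval from the `conf.int` element of the result. The samples below are hypothetical; the hand computation of the pooled interval is included only to show what `var.equal = TRUE` does.

```r
set.seed(10)
x <- rnorm(12, mean = 20, sd = 3)   # hypothetical sample 1
y <- rnorm(10, mean = 18, sd = 3)   # hypothetical sample 2

# Equal variances: pooled two-sample t interval
pooled <- t.test(x, y, alternative = "two.sided", var.equal = TRUE, conf.level = 0.95)

# Unequal variances: Welch interval with approximate degrees of freedom
welch <- t.test(x, y, alternative = "two.sided", var.equal = FALSE, conf.level = 0.95)

# Pooled interval by hand, for comparison
nx <- length(x); ny <- length(y)
sp2 <- ((nx - 1) * var(x) + (ny - 1) * var(y)) / (nx + ny - 2)   # pooled variance
manual_pooled <- (mean(x) - mean(y)) +
  c(-1, 1) * qt(0.975, df = nx + ny - 2) * sqrt(sp2 * (1 / nx + 1 / ny))

pooled$conf.int
welch$conf.int
manual_pooled
```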

Confidence Interval Chart in R (Independent Means & CIs)

  • Example
set.seed(123456)                 # Create example data
data <- data.frame(x = c("A","B","C"),
                  y = round(runif(3, 10, 20),2),
                  lower = round(runif(3, 0, 10),2),
                  upper = round(runif(3, 20, 30),2))

library(ggplot2)
ggplot(data, aes(x, y)) +        # ggplot2 plot with confidence intervals
  geom_point() +
  geom_errorbar(aes(ymin = lower, ymax = upper))

Confidence Intervals for Proportion

Case 1: For large sample (Using Normal approximation)

\[ \hat{p}\pm Z_{\alpha/2}\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}\]
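As an illustration, the normal-approximation interval can be applied to the coin data from Example 1 (103 heads in 200 flips). This is a minimal sketch; the variable names are chosen here.

```r
x <- 103   # number of heads (from Example 1)
n <- 200   # number of flips

p_hat <- x / n
z <- qnorm(0.975)   # z_{alpha/2} for a 95% interval

# Normal-approximation (Wald) confidence interval for p
wald_ci <- p_hat + c(-1, 1) * z * sqrt(p_hat * (1 - p_hat) / n)
wald_ci
```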

Case 1: For large sample (Using Binomial Distribution)

  • Functions from the R package epitools (for example, binom.exact()) can be used for this case.

Case 2: For small sample (Using Binomial Distribution)

  • When the sample size is small, a confidence interval for the population proportion can be calculated using the binom.test() function.
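A short sketch of `binom.test()` on a small sample is given below. The counts (7 successes in 20 trials) are hypothetical. The interval it returns is the exact Clopper-Pearson interval, which is reproduced here from beta quantiles as a check.

```r
x <- 7    # hypothetical number of successes
n <- 20   # hypothetical number of trials

res <- binom.test(x, n, conf.level = 0.95)
exact_ci <- res$conf.int   # exact (Clopper-Pearson) interval

# The same limits written with beta quantiles
manual_ci <- c(qbeta(0.025, x, n - x + 1),
               qbeta(0.975, x + 1, n - x))

exact_ci
manual_ci
```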

Confidence Intervals for Variance

Case 1: Under normality assumption

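  • A user-defined function to obtain a confidence interval for the variance:
 # Chi-square confidence interval for a population variance (normal data)
 var.interval = function(data, conf.level = 0.95) {
   df = length(data) - 1                                          # degrees of freedom, n - 1
   chilower = qchisq((1 - conf.level)/2, df)                      # lower chi-square quantile
   chiupper = qchisq((1 - conf.level)/2, df, lower.tail = FALSE)  # upper chi-square quantile
   v = var(data)                                                  # sample variance
   c(df * v/chiupper, df * v/chilower)                            # (lower limit, upper limit)
 }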
 
 lizard = c(6.2, 6.6, 7.1, 7.4, 7.6, 7.9, 8, 8.3, 8.4, 8.5, 8.6, 8.8, 8.8, 9.1, 9.2, 9.4, 9.4, 9.7, 9.9, 10.2, 10.4, 10.8, 11.3, 11.9)

 var.interval(lizard)

Case 2: Under non-normality assumption

  • When no distributional assumption is made about the data, a bootstrap method can be used to obtain confidence intervals for the population variance.
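A percentile bootstrap interval for the variance can be sketched with the lizard data from Case 1. The number of resamples `B` and the use of the percentile method are assumptions made here; the slides do not specify which bootstrap interval is intended.

```r
lizard <- c(6.2, 6.6, 7.1, 7.4, 7.6, 7.9, 8, 8.3, 8.4, 8.5, 8.6, 8.8, 8.8, 9.1,
            9.2, 9.4, 9.4, 9.7, 9.9, 10.2, 10.4, 10.8, 11.3, 11.9)

set.seed(2024)
B <- 5000   # number of bootstrap resamples (assumed)

# Resample with replacement and record the sample variance of each resample
boot_vars <- replicate(B, var(sample(lizard, replace = TRUE)))

# 95% percentile bootstrap confidence interval for the population variance
boot_var_ci <- quantile(boot_vars, probs = c(0.025, 0.975))
boot_var_ci
```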