DSC 3091- Advanced Statistics Applications I

Categorical Data Analysis

Dr Jagath Senarathne

Department of Statistics and Computer Science

Categorical data

A categorical variable has a measurement scale consisting of a set of categories.
Examples:
- choice of accommodation: house, condominium, and apartment.
- political ideology: liberal, moderate, or conservative.
- Gender: male and female
Categorical variables can be classified as Binary, Nominal or Ordinal.
Probability distributions for categorical data:
- Binomial distribution
- Multinomial distribution

Summarizing Categorical data

Example 1

Let’s consider the knee dataset in catdata R-packge.

Description:

In a clinical study n=127 patients with sport related injuries have been treated with two different therapies (chosen by random design). After 3,7 and 10 days of treatment the pain occuring during knee movement was observed.

library(catdata)
data(knee)
head(knee)

  N Th Age Sex R1 R2 R3 R4
1 1  1  28   1  4  4  4  4
2 2  1  32   1  4  4  4  4
3 3  1  41   1  3  3  3  3
4 4  2  21   1  4  3  3  2
5 5  2  34   1  4  3  3  2
6 6  1  24   1  3  3  3  2

Check the structure of the data

str(knee)

'data.frame':   127 obs. of  8 variables:
 $ N  : int  1 2 3 4 5 6 7 8 9 10 ...
 $ Th : int  1 1 1 2 2 1 2 2 2 1 ...
 $ Age: int  28 32 41 21 34 24 28 40 24 39 ...
 $ Sex: num  1 1 1 1 1 1 1 1 0 0 ...
 $ R1 : int  4 4 3 4 4 3 4 3 4 4 ...
 $ R2 : int  4 4 3 3 3 3 3 2 4 4 ...
 $ R3 : int  4 4 3 3 3 3 3 2 4 4 ...
 $ R4 : int  4 4 3 2 2 2 2 2 3 3 ...

Convert into factor variables

knee$Th <- as.factor(knee$Th)
knee$Sex <- as.factor(knee$Sex)
str(knee)

'data.frame':   127 obs. of  8 variables:
 $ N  : int  1 2 3 4 5 6 7 8 9 10 ...
 $ Th : Factor w/ 2 levels "1","2": 1 1 1 2 2 1 2 2 2 1 ...
 $ Age: int  28 32 41 21 34 24 28 40 24 39 ...
 $ Sex: Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 1 1 ...
 $ R1 : int  4 4 3 4 4 3 4 3 4 4 ...
 $ R2 : int  4 4 3 3 3 3 3 2 4 4 ...
 $ R3 : int  4 4 3 3 3 3 3 2 4 4 ...
 $ R4 : int  4 4 3 2 2 2 2 2 3 3 ...

Changing factor levels

levels(knee$Th) <- c("Placebo","Treatment") 
levels(knee$Sex) <-c("Male","Female")
head(knee)

  N        Th Age    Sex R1 R2 R3 R4
1 1   Placebo  28 Female  4  4  4  4
2 2   Placebo  32 Female  4  4  4  4
3 3   Placebo  41 Female  3  3  3  3
4 4 Treatment  21 Female  4  3  3  2
5 5 Treatment  34 Female  4  3  3  2
6 6   Placebo  24 Female  3  3  3  2

Creating tabulated summaries

T1=table(knee$Th)
T1


  Placebo Treatment 
       63        64

prop.table(T1)


  Placebo Treatment 
 0.496063  0.503937

T2=table(knee$Th,knee$Sex)
T2

           
            Male Female
  Placebo     17     46
  Treatment   21     43

prop.table(T2)

           
                 Male    Female
  Placebo   0.1338583 0.3622047
  Treatment 0.1653543 0.3385827

Using CrossTable function in gmodels package

library(gmodels)
CrossTable(table(knee$Th,knee$Sex))


 
   Cell Contents
|-------------------------|
|                       N |
| Chi-square contribution |
|           N / Row Total |
|           N / Col Total |
|         N / Table Total |
|-------------------------|

 
Total Observations in Table:  127 

 
             |  
             |      Male |    Female | Row Total | 
-------------|-----------|-----------|-----------|
     Placebo |        17 |        46 |        63 | 
             |     0.182 |     0.078 |           | 
             |     0.270 |     0.730 |     0.496 | 
             |     0.447 |     0.517 |           | 
             |     0.134 |     0.362 |           | 
-------------|-----------|-----------|-----------|
   Treatment |        21 |        43 |        64 | 
             |     0.179 |     0.076 |           | 
             |     0.328 |     0.672 |     0.504 | 
             |     0.553 |     0.483 |           | 
             |     0.165 |     0.339 |           | 
-------------|-----------|-----------|-----------|
Column Total |        38 |        89 |       127 | 
             |     0.299 |     0.701 |           | 
-------------|-----------|-----------|-----------|

Categorical data visualization

library(ggplot2)
ggplot(knee, aes(x = R2, fill = Th)) + geom_bar(position = "dodge") +
  labs(x = "Pain after trteatment", 
       y = "Number of patients", 
       fill = "Treatment")

Chi-square goodness of fit test

A statistical hypothesis test used to determine whether a variable is likely to come from a specified distribution or not.
In Knee injuries dataset, let’s check whether the patients were randomly allocated to the treatment and placebo groups.
- Null hypothesis: \(P_{trt}=P_{plc}=0.5\)

probabilities <- c(Treatment = .5, Placebo = .5) 
probabilities

Treatment   Placebo 
      0.5       0.5

library(lsr)
goodnessOfFitTest(x=knee$Th) # No need to input probabilities if they are equal


     Chi-square test against specified probabilities

Data variable:   knee$Th 

Hypotheses: 
   null:        true probabilities are as specified
   alternative: true probabilities differ from those specified

Descriptives: 
          observed freq. expected freq. specified prob.
Placebo               63           63.5             0.5
Treatment             64           63.5             0.5

Test results: 
   X-squared statistic:  0.008 
   degrees of freedom:  1 
   p-value:  0.929

Chi-square test of Independence

A hypothesis test used to determine whether two categorical or nominal variables are likely to be related or not.
In Knee injuries dataset, let’s check whether the variables Th and R2 are independent or not.

library(lsr)
associationTest( formula = ~Th+R2, data = knee )


     Chi-square test of categorical association

Variables:   Th, R2 

Hypotheses: 
   null:        variables are independent of one another
   alternative: some contingency exists between variables

Observed contingency table:
           R2
Th           1  2  3  4  5
  Placebo   13  3 18 22  7
  Treatment 14  6 23 17  4

Expected contingency table under the null hypothesis:
           R2
Th             1    2    3    4    5
  Placebo   13.4 4.46 20.3 19.3 5.46
  Treatment 13.6 4.54 20.7 19.7 5.54

Test results: 
   X-squared statistic:  3.098 
   degrees of freedom:  4 
   p-value:  0.542 

Other information: 
   estimated effect size (Cramer's v):  0.156 
   warning: expected frequencies too small, results may be inaccurate

Another way to do chi-square tests in R

goodness of fit

chisq.test(x=table(knee$Th))


    Chi-squared test for given probabilities

data:  table(knee$Th)
X-squared = 0.007874, df = 1, p-value = 0.9293

Independence

T3=table(knee$Th,knee$R2)
chisq.test(T3)

Assumptions of chi-square test

Expected frequencies are sufficiently large.

If this assumption is violated

If your expected cell counts are too small, check out the Fisher exact test.
observations are independent.

If observations are not independent

It may be possible to use the McNemar test or the Cochran test.

Fisher exact test

The Fisher exact test works somewhat differently to the chi-square test (or in fact any of the other hypothesis tests)
As can be seen it does not calculate a test statistic.

T3=table(knee$Th,knee$R2)
fisher.test(T3)


    Fisher's Exact Test for Count Data

data:  T3
p-value = 0.5641
alternative hypothesis: two.sided

McNemar test

Suppose we want to check whether the two variables R2 and R3 are independent or not.
Here, both variables measure the pain of the same set of patients after the treatment.
Therefore, these observations can be correlated.

R2.merge=factor(ifelse(knee$R2==1 | knee$R2==2,1,2))
R3.merge=ifelse(knee$R3==1 | knee$R3==2,1,2)
T4=table(R2.merge,R3.merge)
mcnemar.test(T4)


    McNemar's Chi-squared test with continuity correction

data:  T4
McNemar's chi-squared = 9.0909, df = 1, p-value = 0.002569

Odds Ratio and 95% CI

library(vcd) # install the package first
T5 <-table(knee$R4,knee$Th)
odds.2cb <- oddsratio(T5,log=F) # computes the odds ratio
summary(odds.2cb) # summary displays the odds ratio


z test of coefficients:

                      Estimate Std. Error z value Pr(>|z|)  
1:2/Placebo:Treatment  2.90789    1.52468  1.9072  0.05649 .
2:3/Placebo:Treatment  0.24176    0.13799  1.7520  0.07978 .
3:4/Placebo:Treatment  0.38182    0.23506  1.6243  0.10430  
4:5/Placebo:Treatment  1.66667    1.63865  1.0171  0.30911  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

confint(odds.2cb) # displays the confidence intervals

                           2.5 %     97.5 %
1:2/Placebo:Treatment 1.04057283  8.1261509
2:3/Placebo:Treatment 0.07898152  0.7400092
3:4/Placebo:Treatment 0.11424274  1.2760997
4:5/Placebo:Treatment 0.24263538 11.4483623

Plot the odds ratio and their respective confidence intervals.

plot(odds.2cb, main = "Relative Odds of Placebo", xlab = "Pain after treatment", ylab = "Odds Ratio, 95% CI")

Kendall rank correlation

Kendall rank correlation is used to test the similarities in the ordering of data.
A better alternative to Spearman correlation (non-parametric) when your sample size is small and has many tied ranks.
Example: Customer satisfaction (e.g. Very Satisfied, Somewhat Satisfied, Neutral.) and delivery time (< 30 Minutes, 30 minutes - 1 Hour, >2 Hours)

res<-cor.test(knee$R3,knee$R4, method="kendall")
res


    Kendall's rank correlation tau

data:  knee$R3 and knee$R4
z = 12.086, p-value < 2.2e-16
alternative hypothesis: true tau is not equal to 0
sample estimates:
      tau 
0.8869367

DSC 3091- Advanced Statistics Applications I

Categorical data

Summarizing Categorical data

Categorical data visualization

Chi-square goodness of fit test

Chi-square test of Independence

Another way to do chi-square tests in R

Assumptions of chi-square test

Fisher exact test

McNemar test

Odds Ratio and 95% CI

Kendall rank correlation

Some Useful Links …