Analysis of Covariance

This section of notes will explore an analysis procedure classically known as Analysis of Covariance (ANCOVA). This methodology is aimed to answer the following question, Is there a difference in groups, after controlling for other covariates. Here, covariates refer to another attribute in the data that is thought to influence or explain variation in the outcome, therefore, this method is meant to remove, control, or partition the variance associated with the covariate (or multiple covariates) to evaluate the conditional or adjusted means.

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.4.4     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(ggformula)

## Loading required package: scales
## 
## Attaching package: 'scales'
## 
## The following object is masked from 'package:purrr':
## 
##     discard
## 
## The following object is masked from 'package:readr':
## 
##     col_factor
## 
## Loading required package: ggridges
## 
## New to ggformula?  Try the tutorials: 
## 	learnr::run_tutorial("introduction", package = "ggformula")
## 	learnr::run_tutorial("refining", package = "ggformula")

library(broom)

theme_set(theme_bw(base_size = 18))

baby <- read_csv("https://raw.githubusercontent.com/lebebr01/statthink/main/data-raw/baby.csv") |>
   mutate(gestational_mean = gestational_days - mean(gestational_days),
          maternal_weight_mean = maternal_pregnancy_weight - mean(maternal_pregnancy_weight),
          maternal_age_mean = maternal_age - mean(maternal_age),
          maternal_height_mean = maternal_height - mean(maternal_height))

## Rows: 1174 Columns: 6
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (5): birth_weight, gestational_days, maternal_age, maternal_height, mate...
## lgl (1): maternal_smoker
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

head(baby)

## # A tibble: 6 × 10
##   birth_weight gestational_days maternal_age maternal_height
##          <dbl>            <dbl>        <dbl>           <dbl>
## 1          120              284           27              62
## 2          113              282           33              64
## 3          128              279           28              64
## 4          108              282           23              67
## 5          136              286           25              62
## 6          138              244           33              62
## # ℹ 6 more variables: maternal_pregnancy_weight <dbl>, maternal_smoker <lgl>,
## #   gestational_mean <dbl>, maternal_weight_mean <dbl>,
## #   maternal_age_mean <dbl>, maternal_height_mean <dbl>

Let’s first explore the model that only includes gestational days in the model, that is, this is a simple linear regression model. Note: I mean centered gestational days for this model.

\[ weight = \beta_{0} + \beta_{1} days + \epsilon \]

baby_lm <- lm(birth_weight ~ gestational_mean, data = baby)

glance(baby_lm)

## # A tibble: 1 × 12
##   r.squared adj.r.squared sigma statistic  p.value    df logLik   AIC   BIC
##       <dbl>         <dbl> <dbl>     <dbl>    <dbl> <dbl>  <dbl> <dbl> <dbl>
## 1     0.166         0.165  16.7      233. 3.40e-48     1 -4973. 9953. 9968.
## # ℹ 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>

tidy(baby_lm) |> 
  mutate_if(is.double, round, 3)

## # A tibble: 2 × 5
##   term             estimate std.error statistic p.value
##   <chr>               <dbl>     <dbl>     <dbl>   <dbl>
## 1 (Intercept)       119.        0.489     244.        0
## 2 gestational_mean    0.467     0.031      15.3       0

Let’s next explore another simple linear regression model that only includes whether a mother smoked or not. This model is:

\[ weight = \beta_{0} + \beta_{2} smoker + \epsilon \]

baby_smoker <- lm(birth_weight ~ maternal_smoker, data = baby)

glance(baby_smoker)

## # A tibble: 1 × 12
##   r.squared adj.r.squared sigma statistic  p.value    df logLik    AIC    BIC
##       <dbl>         <dbl> <dbl>     <dbl>    <dbl> <dbl>  <dbl>  <dbl>  <dbl>
## 1    0.0609        0.0601  17.8      76.0 9.46e-18     1 -5043. 10092. 10107.
## # ℹ 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>

tidy(baby_smoker) |> 
  mutate_if(is.double, round, 3)

## # A tibble: 2 × 5
##   term                estimate std.error statistic p.value
##   <chr>                  <dbl>     <dbl>     <dbl>   <dbl>
## 1 (Intercept)           123.       0.665    185.         0
## 2 maternal_smokerTRUE    -9.27     1.06      -8.72       0

The idea behind ANCOVA is that we know that gestational days is an important predictor in understanding variation in the birth weight of the baby. This may not be a predictor that is of most interest to our research question of knowing whether or not there are differences in birth weight between those that smoked vs did not smoke when the baby was in the womb. There may be some sampling bias occurring, that is, those that smoked may have fewer gestational days compared to those that did not smoke. This fact may drive the differences in the mean difference of birth weight. As a results, we would want to adjust for the effect of gestational days prior to evaluating the mean difference of smoker status.

Below is a descriptive figure trying to show the relationship between birth weight and gestational days by smoker status.

gf_point(birth_weight ~ gestational_days, data = baby, color = ~maternal_smoker, size = 4) |>
  gf_smooth(method = 'lm', size = 1)

Below, let’s fit the traditional ANCOVA model, that contains a single continuous covariate (gestational days) and the categorical predictor. The model would be:

\[ weight = \beta_{0} + \beta_{1} days + \beta_{2} smoker + \epsilon \]

baby_ancova <- lm(birth_weight ~ gestational_mean + maternal_smoker, data = baby)

glance(baby_ancova)

## # A tibble: 1 × 12
##   r.squared adj.r.squared sigma statistic  p.value    df logLik   AIC   BIC
##       <dbl>         <dbl> <dbl>     <dbl>    <dbl> <dbl>  <dbl> <dbl> <dbl>
## 1     0.216         0.214  16.2      161. 1.71e-62     2 -4937. 9883. 9903.
## # ℹ 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>

tidy(baby_ancova) |> 
  mutate_if(is.double, round, 3)

## # A tibble: 3 × 5
##   term                estimate std.error statistic p.value
##   <chr>                  <dbl>     <dbl>     <dbl>   <dbl>
## 1 (Intercept)          123.        0.608    202.         0
## 2 gestational_mean       0.451     0.03      15.2        0
## 3 maternal_smokerTRUE   -8.37      0.973     -8.60       0

The model above can be partioned to better show the specific terms in the model.

\[ weight_{smoker} = (\beta_{0} + \beta_{2}) + \beta_{1} days + \epsilon \]

\[ weight_{non-smoker} = \beta_{0} + \beta_{1} days + \epsilon \]

More explicitly, the term, \(\beta_{2}\) above reflects a mean level change on the intercept after controlling for the effects of gestational days.

Adjusted Means

Let’s compare the means from this ANCOVA model compared to the unconditional means.

uncond_mean <- df_stats(birth_weight ~ maternal_smoker, data = baby, mean) |>
   mutate(maternal_smoker = as.character(maternal_smoker))

cond_mean <- data.frame(maternal_smoker = c('FALSE', 'TRUE'), cond_mean = c(122.74, 122.74 - 8.37))

combine_means <- full_join(uncond_mean, cond_mean)

## Joining with `by = join_by(maternal_smoker)`

combine_means

##       response maternal_smoker     mean cond_mean
## 1 birth_weight           FALSE 123.0853    122.74
## 2 birth_weight            TRUE 113.8192    114.37

Note, the conditional mean used the fact that the gestational days were mean centered, therefore, a value of 0 was included in the above equations.

Interaction between continuous and categorical predictor

One assumption in classical ANCOVA is to assume that the effect of the adjusting covariate is the same for each group. Although, this is a classical assumption, this is an empirical question that we can explore given the data. This can be explored by the introduction of an interaction.

This new model would be:

\[ weight = \beta_{0} + \beta_{1} days + \beta_{2} smoker + \beta_{3} days:smoker + \epsilon \]

where the \(:\) in the equation above represents the interaction effect between gestational days and smoker status. An interaction is testing if the slope for gestational days is the same for those that smoke vs those that did not smoke.

More explicitly, we can partition the model above into the following two regression equations.

\[ weight_{smoker} = (\beta_{0} + \beta_{2}) + (\beta_{1} + \beta_{3}) days + \epsilon \]

\[ weight_{non-smoker} = \beta_{0} + \beta_{1} days + \epsilon \]

baby_lm_int <- lm(birth_weight ~ gestational_mean * maternal_smoker, data = baby)
# baby_lm_int <- lm(birth_weight ~ gestational_mean + maternal_smoker + gestational_mean:maternal_smoker, 
#    data = baby)

glance(baby_lm_int)

## # A tibble: 1 × 12
##   r.squared adj.r.squared sigma statistic  p.value    df logLik   AIC   BIC
##       <dbl>         <dbl> <dbl>     <dbl>    <dbl> <dbl>  <dbl> <dbl> <dbl>
## 1     0.225         0.223  16.2      113. 2.41e-64     3 -4930. 9871. 9896.
## # ℹ 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>

tidy(baby_lm_int) |> 
  mutate_if(is.double, round, 3)

## # A tibble: 4 × 5
##   term                                 estimate std.error statistic p.value
##   <chr>                                   <dbl>     <dbl>     <dbl>   <dbl>
## 1 (Intercept)                           123.        0.605    203.         0
## 2 gestational_mean                        0.37      0.037     10.1        0
## 3 maternal_smokerTRUE                    -8.26      0.969     -8.52       0
## 4 gestational_mean:maternal_smokerTRUE    0.231     0.062      3.74       0

What do the 4 terms represent now?

\(\beta_{0}\) is the intercept for the reference group (non-smokers) when all other attributes in the model are equal to 0.
\(\beta_{1}\) is the slope for the reference group (non-smokers), controlling for the smoker status.
\(\beta_{2}\) is the average change for those that smoke, controlling for differences in gestational days.
\(\beta_{3}\) is the average slope difference for those that smoke compared to those that did not smoke.

Often, it is easier to combine the terms to be compared directly. For example, given the example above, we could combine \(\beta_{0}\) and \(\beta_{2}\) to get the estimated intercept for those that smoke. Then we could do something similar for the slope terms for those that smoke.

Hypotheses being tested

The hypotheses being tested here for the \(\beta_{2}\) and \(\beta_{3}\) are as follows:

\[ H_{0}: \beta_{2} = 0 \]

and

\[ H_{0}: \beta_{3} = 0 \]

These can be a little bit untenable, therefore stating them in words can help.

\(H_{0}: \beta_{2} = 0\), is really testing, is there a mean difference between the reference and focal groups, controlling for the other terms in the model. This is inherently a conditional or adjusted means question.
\(H_{0}: \beta_{3} = 0\), is really testing, is there a slope difference comparing the focal and reference groups, controlling for other terms in the model.

Including more than 1 covariate to adjust/condition

It is also possible to adjust for more than one covariate. For example, maybe we thought adjusting for the weight, height, and age of the mother would also be a good idea. This can be done to remove any variation associated with that attribute prior to making exploring the mean difference in the attributes of interest.

This model would now be:

\[ weight = \beta_{0} + \beta_{1} days + \beta_{4} mother\_weight + \beta_{5} mother\_height + \beta_{6} mother\_age + \beta_{2} smoker + \beta_{3} days:smoker + \epsilon \]

These can also be partitioned into two separate equations.

\[ weight_{smoker} = (\beta_{0} + \beta_{2}) + (\beta_{1} + \beta_{3}) days + \beta_{4} mother\_weight + \beta_{5} mother\_height + \beta_{6} mother\_age + \epsilon \]

\[ weight_{non-smoker} = \beta_{0} + \beta_{1} days + \beta_{4} mother\_weight + \beta_{5} mother\_height + \beta_{6} mother\_age + \epsilon \]

baby_lm_int_w <- lm(birth_weight ~ gestational_mean + maternal_pregnancy_weight + maternal_height + maternal_age + maternal_smoker + gestational_mean:maternal_smoker, 
    data = baby)

glance(baby_lm_int_w)

## # A tibble: 1 × 12
##   r.squared adj.r.squared sigma statistic  p.value    df logLik   AIC   BIC
##       <dbl>         <dbl> <dbl>     <dbl>    <dbl> <dbl>  <dbl> <dbl> <dbl>
## 1     0.260         0.256  15.8      68.3 6.14e-73     6 -4903. 9822. 9863.
## # ℹ 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>

tidy(baby_lm_int_w) |> 
  mutate_if(is.double, round, 3)

## # A tibble: 7 × 5
##   term                                 estimate std.error statistic p.value
##   <chr>                                   <dbl>     <dbl>     <dbl>   <dbl>
## 1 (Intercept)                            43.5      12.3       3.54    0    
## 2 gestational_mean                        0.365     0.036    10.1     0    
## 3 maternal_pregnancy_weight               0.052     0.025     2.06    0.039
## 4 maternal_height                         1.10      0.204     5.41    0    
## 5 maternal_age                            0.07      0.081     0.873   0.383
## 6 maternal_smokerTRUE                    -8.20      0.952    -8.62    0    
## 7 gestational_mean:maternal_smokerTRUE    0.208     0.061     3.44    0.001

How could the appropriate conditional means be computed here? Also, what happened to the intercept in this model, why is it so small?

Last updated on Apr 9, 2024