Analysis of Covariance
Analysis of Covariance
This section of notes will explore an analysis procedure classically known as Analysis of Covariance (ANCOVA). This methodology is aimed to answer the following question, Is there a difference in groups, after controlling for other covariates. Here, covariates refer to another attribute in the data that is thought to influence or explain variation in the outcome, therefore, this method is meant to remove, control, or partition the variance associated with the covariate (or multiple covariates) to evaluate the conditional or adjusted means.
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.4.4 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggformula)
## Loading required package: scales
##
## Attaching package: 'scales'
##
## The following object is masked from 'package:purrr':
##
## discard
##
## The following object is masked from 'package:readr':
##
## col_factor
##
## Loading required package: ggridges
##
## New to ggformula? Try the tutorials:
## learnr::run_tutorial("introduction", package = "ggformula")
## learnr::run_tutorial("refining", package = "ggformula")
library(broom)
theme_set(theme_bw(base_size = 18))
baby <- read_csv("https://raw.githubusercontent.com/lebebr01/statthink/main/data-raw/baby.csv") |>
mutate(gestational_mean = gestational_days - mean(gestational_days),
maternal_weight_mean = maternal_pregnancy_weight - mean(maternal_pregnancy_weight),
maternal_age_mean = maternal_age - mean(maternal_age),
maternal_height_mean = maternal_height - mean(maternal_height))
## Rows: 1174 Columns: 6
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (5): birth_weight, gestational_days, maternal_age, maternal_height, mate...
## lgl (1): maternal_smoker
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(baby)
## # A tibble: 6 × 10
## birth_weight gestational_days maternal_age maternal_height
## <dbl> <dbl> <dbl> <dbl>
## 1 120 284 27 62
## 2 113 282 33 64
## 3 128 279 28 64
## 4 108 282 23 67
## 5 136 286 25 62
## 6 138 244 33 62
## # ℹ 6 more variables: maternal_pregnancy_weight <dbl>, maternal_smoker <lgl>,
## # gestational_mean <dbl>, maternal_weight_mean <dbl>,
## # maternal_age_mean <dbl>, maternal_height_mean <dbl>
Let’s first explore the model that only includes gestational days in the model, that is, this is a simple linear regression model. Note: I mean centered gestational days for this model.
\[ weight = \beta_{0} + \beta_{1} days + \epsilon \]
baby_lm <- lm(birth_weight ~ gestational_mean, data = baby)
glance(baby_lm)
## # A tibble: 1 × 12
## r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0.166 0.165 16.7 233. 3.40e-48 1 -4973. 9953. 9968.
## # ℹ 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
tidy(baby_lm) |>
mutate_if(is.double, round, 3)
## # A tibble: 2 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 119. 0.489 244. 0
## 2 gestational_mean 0.467 0.031 15.3 0
Let’s next explore another simple linear regression model that only includes whether a mother smoked or not. This model is:
\[ weight = \beta_{0} + \beta_{2} smoker + \epsilon \]
baby_smoker <- lm(birth_weight ~ maternal_smoker, data = baby)
glance(baby_smoker)
## # A tibble: 1 × 12
## r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0.0609 0.0601 17.8 76.0 9.46e-18 1 -5043. 10092. 10107.
## # ℹ 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
tidy(baby_smoker) |>
mutate_if(is.double, round, 3)
## # A tibble: 2 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 123. 0.665 185. 0
## 2 maternal_smokerTRUE -9.27 1.06 -8.72 0
The idea behind ANCOVA is that we know that gestational days is an important predictor in understanding variation in the birth weight of the baby. This may not be a predictor that is of most interest to our research question of knowing whether or not there are differences in birth weight between those that smoked vs did not smoke when the baby was in the womb. There may be some sampling bias occurring, that is, those that smoked may have fewer gestational days compared to those that did not smoke. This fact may drive the differences in the mean difference of birth weight. As a results, we would want to adjust for the effect of gestational days prior to evaluating the mean difference of smoker status.
Below is a descriptive figure trying to show the relationship between birth weight and gestational days by smoker status.
gf_point(birth_weight ~ gestational_days, data = baby, color = ~maternal_smoker, size = 4) |>
gf_smooth(method = 'lm', size = 1)
Below, let’s fit the traditional ANCOVA model, that contains a single continuous covariate (gestational days) and the categorical predictor. The model would be:
\[ weight = \beta_{0} + \beta_{1} days + \beta_{2} smoker + \epsilon \]
baby_ancova <- lm(birth_weight ~ gestational_mean + maternal_smoker, data = baby)
glance(baby_ancova)
## # A tibble: 1 × 12
## r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0.216 0.214 16.2 161. 1.71e-62 2 -4937. 9883. 9903.
## # ℹ 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
tidy(baby_ancova) |>
mutate_if(is.double, round, 3)
## # A tibble: 3 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 123. 0.608 202. 0
## 2 gestational_mean 0.451 0.03 15.2 0
## 3 maternal_smokerTRUE -8.37 0.973 -8.60 0
The model above can be partioned to better show the specific terms in the model.
\[ weight_{smoker} = (\beta_{0} + \beta_{2}) + \beta_{1} days + \epsilon \]
\[ weight_{non-smoker} = \beta_{0} + \beta_{1} days + \epsilon \]
More explicitly, the term, \(\beta_{2}\) above reflects a mean level change on the intercept after controlling for the effects of gestational days.
Adjusted Means
Let’s compare the means from this ANCOVA model compared to the unconditional means.
uncond_mean <- df_stats(birth_weight ~ maternal_smoker, data = baby, mean) |>
mutate(maternal_smoker = as.character(maternal_smoker))
cond_mean <- data.frame(maternal_smoker = c('FALSE', 'TRUE'), cond_mean = c(122.74, 122.74 - 8.37))
combine_means <- full_join(uncond_mean, cond_mean)
## Joining with `by = join_by(maternal_smoker)`
combine_means
## response maternal_smoker mean cond_mean
## 1 birth_weight FALSE 123.0853 122.74
## 2 birth_weight TRUE 113.8192 114.37
Note, the conditional mean used the fact that the gestational days were mean centered, therefore, a value of 0 was included in the above equations.
Interaction between continuous and categorical predictor
One assumption in classical ANCOVA is to assume that the effect of the adjusting covariate is the same for each group. Although, this is a classical assumption, this is an empirical question that we can explore given the data. This can be explored by the introduction of an interaction.
This new model would be:
\[ weight = \beta_{0} + \beta_{1} days + \beta_{2} smoker + \beta_{3} days:smoker + \epsilon \]
where the \(:\) in the equation above represents the interaction effect between gestational days and smoker status. An interaction is testing if the slope for gestational days is the same for those that smoke vs those that did not smoke.
More explicitly, we can partition the model above into the following two regression equations.
\[ weight_{smoker} = (\beta_{0} + \beta_{2}) + (\beta_{1} + \beta_{3}) days + \epsilon \]
\[ weight_{non-smoker} = \beta_{0} + \beta_{1} days + \epsilon \]
baby_lm_int <- lm(birth_weight ~ gestational_mean * maternal_smoker, data = baby)
# baby_lm_int <- lm(birth_weight ~ gestational_mean + maternal_smoker + gestational_mean:maternal_smoker,
# data = baby)
glance(baby_lm_int)
## # A tibble: 1 × 12
## r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0.225 0.223 16.2 113. 2.41e-64 3 -4930. 9871. 9896.
## # ℹ 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
tidy(baby_lm_int) |>
mutate_if(is.double, round, 3)
## # A tibble: 4 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 123. 0.605 203. 0
## 2 gestational_mean 0.37 0.037 10.1 0
## 3 maternal_smokerTRUE -8.26 0.969 -8.52 0
## 4 gestational_mean:maternal_smokerTRUE 0.231 0.062 3.74 0
What do the 4 terms represent now?
- \(\beta_{0}\) is the intercept for the reference group (non-smokers) when all other attributes in the model are equal to 0.
- \(\beta_{1}\) is the slope for the reference group (non-smokers), controlling for the smoker status.
- \(\beta_{2}\) is the average change for those that smoke, controlling for differences in gestational days.
- \(\beta_{3}\) is the average slope difference for those that smoke compared to those that did not smoke.
Often, it is easier to combine the terms to be compared directly. For example, given the example above, we could combine \(\beta_{0}\) and \(\beta_{2}\) to get the estimated intercept for those that smoke. Then we could do something similar for the slope terms for those that smoke.
Hypotheses being tested
The hypotheses being tested here for the \(\beta_{2}\) and \(\beta_{3}\) are as follows:
\[ H_{0}: \beta_{2} = 0 \]
and
\[ H_{0}: \beta_{3} = 0 \]
These can be a little bit untenable, therefore stating them in words can help.
- \(H_{0}: \beta_{2} = 0\), is really testing, is there a mean difference between the reference and focal groups, controlling for the other terms in the model. This is inherently a conditional or adjusted means question.
- \(H_{0}: \beta_{3} = 0\), is really testing, is there a slope difference comparing the focal and reference groups, controlling for other terms in the model.
Including more than 1 covariate to adjust/condition
It is also possible to adjust for more than one covariate. For example, maybe we thought adjusting for the weight, height, and age of the mother would also be a good idea. This can be done to remove any variation associated with that attribute prior to making exploring the mean difference in the attributes of interest.
This model would now be:
\[ weight = \beta_{0} + \beta_{1} days + \beta_{4} mother\_weight + \beta_{5} mother\_height + \beta_{6} mother\_age + \beta_{2} smoker + \beta_{3} days:smoker + \epsilon \]
These can also be partitioned into two separate equations.
\[ weight_{smoker} = (\beta_{0} + \beta_{2}) + (\beta_{1} + \beta_{3}) days + \beta_{4} mother\_weight + \beta_{5} mother\_height + \beta_{6} mother\_age + \epsilon \]
\[ weight_{non-smoker} = \beta_{0} + \beta_{1} days + \beta_{4} mother\_weight + \beta_{5} mother\_height + \beta_{6} mother\_age + \epsilon \]
baby_lm_int_w <- lm(birth_weight ~ gestational_mean + maternal_pregnancy_weight + maternal_height + maternal_age + maternal_smoker + gestational_mean:maternal_smoker,
data = baby)
glance(baby_lm_int_w)
## # A tibble: 1 × 12
## r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0.260 0.256 15.8 68.3 6.14e-73 6 -4903. 9822. 9863.
## # ℹ 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
tidy(baby_lm_int_w) |>
mutate_if(is.double, round, 3)
## # A tibble: 7 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 43.5 12.3 3.54 0
## 2 gestational_mean 0.365 0.036 10.1 0
## 3 maternal_pregnancy_weight 0.052 0.025 2.06 0.039
## 4 maternal_height 1.10 0.204 5.41 0
## 5 maternal_age 0.07 0.081 0.873 0.383
## 6 maternal_smokerTRUE -8.20 0.952 -8.62 0
## 7 gestational_mean:maternal_smokerTRUE 0.208 0.061 3.44 0.001
How could the appropriate conditional means be computed here? Also, what happened to the intercept in this model, why is it so small?