Model Comparisons
This chapter on multiple regression explores model comparison: using statistical evidence to guide the selection of the best-fitting model from a set of competing models.
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.3.6 ✔ purrr 0.3.4
## ✔ tibble 3.1.8 ✔ dplyr 1.0.10
## ✔ tidyr 1.2.1 ✔ stringr 1.4.1
## ✔ readr 2.1.2 ✔ forcats 0.5.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
library(ggformula)
## Loading required package: ggstance
##
## Attaching package: 'ggstance'
##
## The following objects are masked from 'package:ggplot2':
##
## geom_errorbarh, GeomErrorbarh
##
## Loading required package: scales
##
## Attaching package: 'scales'
##
## The following object is masked from 'package:purrr':
##
## discard
##
## The following object is masked from 'package:readr':
##
## col_factor
##
## Loading required package: ggridges
##
## New to ggformula? Try the tutorials:
## learnr::run_tutorial("introduction", package = "ggformula")
## learnr::run_tutorial("refining", package = "ggformula")
theme_set(theme_bw(base_size = 18))
college <- read_csv("https://raw.githubusercontent.com/lebebr01/statthink/main/data-raw/College-scorecard-4143.csv") %>%
mutate(act_mean = actcmmid - mean(actcmmid, na.rm = TRUE),
cost_mean = costt4_a - mean(costt4_a, na.rm = TRUE)) %>%
drop_na(act_mean, cost_mean)
## Rows: 7058 Columns: 16
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (6): instnm, city, stabbr, preddeg, region, locale
## dbl (10): adm_rate, actcmmid, ugds, costt4_a, costt4_p, tuitionfee_in, tuiti...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(college)
## # A tibble: 6 × 18
## instnm city stabbr preddeg region locale adm_r…¹ actcm…² ugds costt…³
## <chr> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Alabama A & … Norm… AL Bachel… South… City:… 0.903 18 4824 22886
## 2 University o… Birm… AL Bachel… South… City:… 0.918 25 12866 24129
## 3 University o… Hunt… AL Bachel… South… City:… 0.812 28 6917 22108
## 4 Alabama Stat… Mont… AL Bachel… South… City:… 0.979 18 4189 19413
## 5 The Universi… Tusc… AL Bachel… South… City:… 0.533 28 32387 28836
## 6 Auburn Unive… Mont… AL Bachel… South… City:… 0.825 22 4211 19892
## # … with 8 more variables: costt4_p <dbl>, tuitionfee_in <dbl>,
## # tuitionfee_out <dbl>, debt_mdn <dbl>, grad_debt_mdn <dbl>, female <dbl>,
## # act_mean <dbl>, cost_mean <dbl>, and abbreviated variable names ¹adm_rate,
## # ²actcmmid, ³costt4_a
dim(college)
## [1] 1281 18
adm_mult_reg <- lm(adm_rate ~ act_mean + cost_mean, data = college)
summary(adm_mult_reg)
##
## Call:
## lm(formula = adm_rate ~ act_mean + cost_mean, data = college)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.65484 -0.12230 0.02291 0.13863 0.37054
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.834e-01 6.346e-03 107.681 < 2e-16 ***
## act_mean -1.669e-02 1.650e-03 -10.114 < 2e-16 ***
## cost_mean -2.304e-06 4.047e-07 -5.693 1.55e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1813 on 1278 degrees of freedom
## Multiple R-squared: 0.1836, Adjusted R-squared: 0.1824
## F-statistic: 143.8 on 2 and 1278 DF, p-value: < 2.2e-16
Omnibus Hypothesis
In the linear regression chapter, I omitted a portion of the model output that typically accompanies most statistical software: a test statistic that evaluates whether the model explains variation over and above a simple mean. More specifically, this omnibus hypothesis tests the following:
$$ H_{0}: \text{All}\ \beta = 0 \\[10pt] H_{A}: \text{Any}\ \beta \neq 0 $$
This hypothesis can be formally tested with an F-statistic, which follows an F distribution with \(p\) and \(n - p - 1\) degrees of freedom, where \(p\) is the number of predictor attributes and \(n\) is the sample size.
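The omnibus F-statistic reported in the summary output above can be reconstructed by hand: it is the model (regression) mean square divided by the residual mean square. A minimal sketch, using the rounded sums of squares that `anova(adm_mult_reg)` prints later in this chapter:

```r
# Sums of squares, rounded to the values printed by anova(adm_mult_reg)
ss_model <- 8.386 + 1.0655   # explained SS, summed across the two predictors
ss_resid <- 42.014           # residual SS
ms_model <- ss_model / 2     # numerator df: p = 2 predictors
ms_resid <- ss_resid / 1278  # denominator df: n - p - 1 = 1281 - 2 - 1
f_stat <- ms_model / ms_resid
f_stat                       # approximately 143.8, matching the summary output
pf(f_stat, df1 = 2, df2 = 1278, lower.tail = FALSE)  # p-value, effectively zero
```

The small discrepancy from the printed 143.8 comes only from rounding the sums of squares.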
f_data <- data.frame(value = seq(0, 5, by = .01)) %>%
mutate(dens = df(value, 2, 1278))
gf_line(dens ~ value, data = f_data, size = 2)
f_data <- data.frame(value = seq(0, 5, by = .01)) %>%
mutate(dens = df(value, 5, 50))
gf_line(dens ~ value, data = f_data, size = 2)
Adjusted R-squared
The adjusted R-squared is typically used when comparing models. Because R-squared is the ratio of explained to total variance, it will always increase when a new attribute is added, even if that attribute is not helpful. The adjusted R-squared instead penalizes model complexity. There are several ways to make this adjustment; the most common is defined here.
$$ \bar{R}^2 = 1 - (1 - R^2) \frac{n - 1}{n - p - 1} $$
or
$$
\bar{R}^2 = 1 - \frac{SS_{res} / df_{e}}{SS_{tot} / df_{t}}
$$
where \(p\) is the number of predictors (excluding the intercept), \(n\) is the sample size, \(SS_{res}\) and \(SS_{tot}\) are the residual and total sums of squares respectively, and \(df_{e}\) and \(df_{t}\) are the degrees of freedom for the error (\(n - p - 1\)) and total (\(n - 1\)) respectively.
summary(adm_mult_reg)
##
## Call:
## lm(formula = adm_rate ~ act_mean + cost_mean, data = college)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.65484 -0.12230 0.02291 0.13863 0.37054
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.834e-01 6.346e-03 107.681 < 2e-16 ***
## act_mean -1.669e-02 1.650e-03 -10.114 < 2e-16 ***
## cost_mean -2.304e-06 4.047e-07 -5.693 1.55e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1813 on 1278 degrees of freedom
## Multiple R-squared: 0.1836, Adjusted R-squared: 0.1824
## F-statistic: 143.8 on 2 and 1278 DF, p-value: < 2.2e-16
1 - (1 - .1836) * (1281 - 1) / (1281 - 2 - 1)
## [1] 0.1823224
anova(adm_mult_reg)
## Analysis of Variance Table
##
## Response: adm_rate
## Df Sum Sq Mean Sq F value Pr(>F)
## act_mean 1 8.386 8.3861 255.092 < 2.2e-16 ***
## cost_mean 1 1.065 1.0655 32.411 1.546e-08 ***
## Residuals 1278 42.014 0.0329
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
1 - (42.01 / 1278) / ((8.39 + 1.066 + 42.014) / 1280)
## [1] 0.1825191
Model Comparison
There are a variety of statistics used to provide statistical evidence for competing models. If the models are nested, the variance decomposition can be used to determine whether the added predictors explain significant variation over and above the simpler (reduced) model.
In this situation, another F-statistic can be derived, where \(\Delta p\) is the number of added predictors and \(df_{F}\) is the residual degrees of freedom of the full model:
$$ F = \frac{(SS_{res}^{R} - SS_{res}^{F}) / \Delta p}{SS_{res}^{F} / df_{F}} $$
act_lm <- lm(adm_rate ~ act_mean, data = college)
summary(act_lm)
##
## Call:
## lm(formula = adm_rate ~ act_mean, data = college)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.71229 -0.12483 0.01999 0.14317 0.37289
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.661593 0.005128 129.02 <2e-16 ***
## act_mean -0.021905 0.001388 -15.78 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1835 on 1279 degrees of freedom
## Multiple R-squared: 0.1629, Adjusted R-squared: 0.1623
## F-statistic: 249 on 1 and 1279 DF, p-value: < 2.2e-16
anova(act_lm, adm_mult_reg)
## Analysis of Variance Table
##
## Model 1: adm_rate ~ act_mean
## Model 2: adm_rate ~ act_mean + cost_mean
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 1279 43.079
## 2 1278 42.014 1 1.0655 32.411 1.546e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
((43.08 - 42.01) / 1)
## [1] 1.07
1.07 / (42.01 / 1278)
## [1] 32.55082
Non-nested models
For non-nested models, the F-statistic defined above does not apply. Instead, other statistics are needed to evaluate which model fits best. My preference is the AIC (Akaike information criterion) or its related small-sample form, the AICc. The equations for these are not particularly illuminating; software is the best way to compute them. In general, smaller AIC values indicate a better-fitting model.
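Although the equations are rarely computed by hand, they are short: using the standard definitions, \(AIC = 2K - 2\log\hat{L}\) and \(AICc = AIC + 2K(K + 1)/(n - K - 1)\), where \(K\) is the number of estimated parameters (for these models, the intercept, the slope, and the residual variance). As a sanity check, a sketch that reproduces the AICc for the ACT-only model from the log-likelihood (LL column) that `aictab()` reports below:

```r
# Recompute AICc for the act model from its reported log-likelihood
ll <- 355.14   # LL column from aictab() output
k  <- 3        # intercept + slope + residual variance
n  <- 1281     # number of institutions after drop_na()
aic  <- 2 * k - 2 * ll
aicc <- aic + (2 * k * (k + 1)) / (n - k - 1)
round(aicc, 2)  # -704.26, matching the AICc column
```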
library(AICcmodavg)
cost_lm <- lm(adm_rate ~ cost_mean, data = college)
aictab(list(cost_lm, act_lm),
modnames = c('cost', 'act'))
##
## Model selection based on AICc:
##
## K AICc Delta_AICc AICcWt Cum.Wt LL
## act 3 -704.26 0.00 1 1 355.14
## cost 3 -637.70 66.56 0 1 321.86
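The AICcWt column can also be reconstructed. Akaike weights rescale the Delta_AICc values into relative evidence for each model via \(w_i = \exp(-\Delta_i / 2) / \sum_j \exp(-\Delta_j / 2)\). A sketch using the deltas printed above:

```r
# Akaike weights from the Delta_AICc values in the table above
delta <- c(act = 0, cost = 66.56)
w <- exp(-delta / 2) / sum(exp(-delta / 2))
round(w, 2)  # the act model carries essentially all of the weight
```

A delta of 66.56 is enormous on this scale; differences larger than about 10 are conventionally taken as strong evidence against the poorer model.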