Regression Estimates

Understanding Regression Parameters

This section of notes aims to dig a bit more into what the simple linear regression (ie., regression with a single continuous covariate/attribute) parameter estimates mean. We will consider the estimation formulas in part of this to gain a sense of how these can be computed.

New Example Data

The new data for this section of notes will explore data from the Environmental Protection Agency on Air Quality collected for the state of Iowa in 2021. The data are daily values for PM 2.5 particulates. The attributes included in the data are shown below with a short description.

Variable Description
date Date of observation
id Site ID
poc Parameter Occurrence Code (POC)
pm2.5 Average daily pm 2.5 particulate value, in (ug/m3; micrograms per meter cubed)
daily_aqi Average air quality index
site_name Site Name
aqs_parameter_desc Text Description of Observation
cbsa_code Core Based Statistical Area (CBSA) ID
cbsa_name CBSA Name
county County in Iowa
avg_wind Average daily wind speed (in knots)
max_wind Maximum daily wind speed (in knots)
max_wind_hours Time of maximum daily wind speed

Guiding Question

How is average daily wind speed related to the daily air quality index?

Bivariate Figure

Note, below I do a bit of post-processing to combine data from different POC values within a single CBSA.

library(tidyverse)
## Warning: package 'ggplot2' was built under R version 4.3.2
## Warning: package 'tidyr' was built under R version 4.3.2
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.0     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggformula)
## Loading required package: scales
## 
## Attaching package: 'scales'
## 
## The following object is masked from 'package:purrr':
## 
##     discard
## 
## The following object is masked from 'package:readr':
## 
##     col_factor
## 
## Loading required package: ggridges
## Warning: package 'ggridges' was built under R version 4.3.2
## 
## New to ggformula?  Try the tutorials: 
## 	learnr::run_tutorial("introduction", package = "ggformula")
## 	learnr::run_tutorial("refining", package = "ggformula")
library(mosaic)
## Warning: package 'mosaic' was built under R version 4.3.2
## Registered S3 method overwritten by 'mosaic':
##   method                           from   
##   fortify.SpatialPolygonsDataFrame ggplot2
## 
## The 'mosaic' package masks several functions from core packages in order to add 
## additional features.  The original behavior of these functions should not be affected by this.
## 
## Attaching package: 'mosaic'
## 
## The following object is masked from 'package:Matrix':
## 
##     mean
## 
## The following object is masked from 'package:scales':
## 
##     rescale
## 
## The following objects are masked from 'package:dplyr':
## 
##     count, do, tally
## 
## The following object is masked from 'package:purrr':
## 
##     cross
## 
## The following object is masked from 'package:ggplot2':
## 
##     stat
## 
## The following objects are masked from 'package:stats':
## 
##     binom.test, cor, cor.test, cov, fivenum, IQR, median, prop.test,
##     quantile, sd, t.test, var
## 
## The following objects are masked from 'package:base':
## 
##     max, mean, min, prod, range, sample, sum
theme_set(theme_bw(base_size = 18))

airquality <- readr::read_csv("https://raw.githubusercontent.com/lebebr01/psqf_6243/main/data/iowa_air_quality_2021.csv")
## Rows: 6917 Columns: 10
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (5): date, site_name, aqs_parameter_desc, cbsa_name, county
## dbl (5): id, poc, pm2.5, daily_aqi, cbsa_code
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
wind <- readr::read_csv("https://raw.githubusercontent.com/lebebr01/psqf_6243/main/data/daily_WIND_2021-iowa.csv")
## Rows: 1537 Columns: 5
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): date, cbsa_name
## dbl (3): avg_wind, max_wind, max_wind_hours
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
airquality <- airquality %>%
   left_join(wind, by = c('cbsa_name', 'date')) %>% 
   drop_na()
## Warning in left_join(., wind, by = c("cbsa_name", "date")): Detected an unexpected many-to-many relationship between `x` and `y`.
## ℹ Row 21 of `x` matches multiple rows in `y`.
## ℹ Row 1 of `y` matches multiple rows in `x`.
## ℹ If a many-to-many relationship is expected, set `relationship =
##   "many-to-many"` to silence this warning.
head(airquality)
## # A tibble: 6 × 13
##   date           id   poc pm2.5 daily_aqi site_name aqs_parameter_desc cbsa_code
##   <chr>       <dbl> <dbl> <dbl>     <dbl> <chr>     <chr>                  <dbl>
## 1 1/1/21  190130009     1  15.1        57 Water To… PM2.5 - Local Con…     47940
## 2 1/4/21  190130009     1  13.3        54 Water To… PM2.5 - Local Con…     47940
## 3 1/7/21  190130009     1  20.5        69 Water To… PM2.5 - Local Con…     47940
## 4 1/10/21 190130009     1  14.3        56 Water To… PM2.5 - Local Con…     47940
## 5 1/13/21 190130009     1  13.7        54 Water To… PM2.5 - Local Con…     47940
## 6 1/16/21 190130009     1   5.3        22 Water To… PM2.5 - Local Con…     47940
## # ℹ 5 more variables: cbsa_name <chr>, county <chr>, avg_wind <dbl>,
## #   max_wind <dbl>, max_wind_hours <dbl>
dim(airquality)
## [1] 4821   13
gf_point(daily_aqi ~ avg_wind, data = airquality, size = 4, alpha = .15) %>%
  gf_labs(x = "Average daily wind speed (in knots)",
          y = "Daily Air Quality") %>%
  gf_smooth() %>%
  gf_smooth(method = 'lm', color = 'lightblue', linetype = 2)
## `geom_smooth()` using method = 'gam'

cor(daily_aqi ~ avg_wind, data = airquality) |> round(3)
## [1] -0.292
air_lm <- lm(daily_aqi ~ avg_wind, data = airquality)
coef(air_lm) |> round(3)
## (Intercept)    avg_wind 
##      48.223      -2.212
summary(air_lm)$r.square |> round(3)
## [1] 0.085
summary(air_lm)$sigma |> round(3)
## [1] 18.055

Interpreting these estimates

What do these parameter estimates mean in this context?

Intercept: This is the model implied ______ when the ________ equals 0.
Slope: For each 1 unit change in ____ there is a -2.2 unit decrease in .
R-Square: The
__ in ____ is explained by ____.
Sigma: The _____ distance each point is from ____.

Centering predictors

There are times when centering of predictors can be helpful for interpretation of the model parameters. This can be helpful when 0 is not a practically useful characteristic of the attribute or for more specific tests of certain elements of the X attribute.

Mean Centering

Mean centering is where the mean of the attribute is subtracted from each value. This is a linear transformation where each data point is subtracted by a constant, the mean. This means that the distance between points do not change.

airquality <- airquality %>%
  mutate(avg_wind_mc = avg_wind - mean(avg_wind),
         avg_wind_maxc = avg_wind - max(avg_wind),
         avg_wind_10 = avg_wind - 10)

gf_point(daily_aqi ~ avg_wind_mc, data = airquality, size = 4, alpha = .15) %>%
  gf_labs(x = "Average daily wind speed (in knots)",
          y = "Daily Air Quality") %>%
  gf_smooth() %>%
  gf_smooth(method = 'lm', color = 'lightblue', linetype = 2)
## `geom_smooth()` using method = 'gam'

air_lm_mc <- lm(daily_aqi ~ avg_wind_mc, data = airquality)
coef(air_lm_mc)
## (Intercept) avg_wind_mc 
##   38.788011   -2.211798
summary(air_lm_mc)$r.square
## [1] 0.08528019
summary(air_lm_mc)$sigma
## [1] 18.05479
air_lm_maxc <- lm(daily_aqi ~ avg_wind_maxc, data = airquality)
coef(air_lm_maxc)
##   (Intercept) avg_wind_maxc 
##      5.968391     -2.211798
summary(air_lm_maxc)$r.square
## [1] 0.08528019
summary(air_lm_maxc)$sigma
## [1] 18.05479
air_lm_10 <- lm(daily_aqi ~ avg_wind_10, data = airquality)
coef(air_lm_10)
## (Intercept) avg_wind_10 
##   26.104968   -2.211798
summary(air_lm_10)$r.square
## [1] 0.08528019
summary(air_lm_10)$sigma
## [1] 18.05479

Standardized Regression

Another type of regression that can be done is one in which the attributes are standardized prior to estimating the linear regression. What is meant by standardizing? This is converting the attributes into z-scores:

\[ Z_{api} = \frac{(aqi - \bar{aqi})}{s_{aqi}} \]

airquality <- airquality %>%
  mutate(z_aqi = scale(daily_aqi),
         z_aqi2 = (daily_aqi - mean(daily_aqi)) / sd(daily_aqi),
         z_wind = scale(avg_wind))

head(airquality)
## # A tibble: 6 × 19
##   date           id   poc pm2.5 daily_aqi site_name aqs_parameter_desc cbsa_code
##   <chr>       <dbl> <dbl> <dbl>     <dbl> <chr>     <chr>                  <dbl>
## 1 1/1/21  190130009     1  15.1        57 Water To… PM2.5 - Local Con…     47940
## 2 1/4/21  190130009     1  13.3        54 Water To… PM2.5 - Local Con…     47940
## 3 1/7/21  190130009     1  20.5        69 Water To… PM2.5 - Local Con…     47940
## 4 1/10/21 190130009     1  14.3        56 Water To… PM2.5 - Local Con…     47940
## 5 1/13/21 190130009     1  13.7        54 Water To… PM2.5 - Local Con…     47940
## 6 1/16/21 190130009     1   5.3        22 Water To… PM2.5 - Local Con…     47940
## # ℹ 11 more variables: cbsa_name <chr>, county <chr>, avg_wind <dbl>,
## #   max_wind <dbl>, max_wind_hours <dbl>, avg_wind_mc <dbl>,
## #   avg_wind_maxc <dbl>, avg_wind_10 <dbl>, z_aqi <dbl[,1]>, z_aqi2 <dbl>,
## #   z_wind <dbl[,1]>
air_lm_s <- lm(z_aqi ~ z_wind, data = airquality)
coef(air_lm_s)
##   (Intercept)        z_wind 
## -6.712023e-16 -2.920277e-01
summary(air_lm_s)$r.square
## [1] 0.08528019
summary(air_lm_s)$sigma
## [1] 0.9565091

We can also use this formula to convert any unstandardized regression coefficients into a standardized metric.

\[ b^{'}_{k} = b_{k} * \frac{s_{x_{k}}}{s_{y}} \]

-2.211 * sd(airquality$avg_wind) / sd(airquality$daily_aqi)
## [1] -0.2919224
cor(daily_aqi ~ avg_wind, data = airquality)
## [1] -0.2920277

Parameter Estimation

Now that we looked how the parameters are impacted by some changes in the model specification, how are these parameters actually estimated? I will show two ways, one is general, the other is specific to this simple case with a single predictor/covariate attribute. In general, linear regression (or more generally the general linear model) uses least square estimation. This means that the the parameters in the model minimize the squared distance between the observed and predicted values. That is, least squares estimates minimize this criterion:

\[ \sum (Y - \hat{Y})^2 \]

Specific example

Calculus can be used to show that these two equations can be solved simultanuously to get estimates for \(\beta_{0}\) and \(\beta_{1}\) that minimize the criterion above. These formulas are:

\[ b_{1} = \frac{\sum(X - \bar{X})(Y - \bar{Y})}{\sum(X - \bar{X})^2} \] \[ b_{0} = \bar{Y} - b_{1}\bar{X} \]

Let’s use R to get these quantities.

b1 <- with(airquality, 
      sum((avg_wind - mean(avg_wind)) * (daily_aqi - mean(daily_aqi))) / sum((avg_wind - mean(avg_wind))^2)
)
b0 <- with(airquality, 
      mean(daily_aqi) - b1 * mean(avg_wind)
)
b0
## [1] 48.22295
b1
## [1] -2.211798
coef(air_lm)
## (Intercept)    avg_wind 
##   48.222946   -2.211798

General Approach

When there are more than one predictor, the number of equations gets a bit unyieldy, therefore, there is a general analytic approach that works for any set of predictor attributes. The general approach uses matrix algebra (anyone take linear algebra?), to achieve their estimates. This general form is:

\[ \mathbf{b} = \left( \mathbf{X}^{`}\mathbf{X} \right)^{-1} \left( \mathbf{X}^{`} \mathbf{Y} \right). \] Where \(\mathbf{b}\) is a vector of estimated regression coefficients, \(\mathbf{X}\) is a matrix of covariate/predictor attributes (called the design matrix), and \(\mathbf{Y}\) is a vector of the outcome attribute.

Below, I show what these would look like for the air quality example that has been used and solve for the regression coefficients.

X <- model.matrix(air_lm)
head(X)
##   (Intercept) avg_wind
## 1           1 2.941667
## 2           1 2.445833
## 3           1 1.995833
## 4           1 3.445833
## 5           1 1.116667
## 6           1 6.091667
Y <- as.matrix(airquality$daily_aqi)
head(Y)
##      [,1]
## [1,]   57
## [2,]   54
## [3,]   69
## [4,]   56
## [5,]   54
## [6,]   22
X_X <- solve(t(X) %*% X)
X_X
##               (Intercept)      avg_wind
## (Intercept)  0.0008152474 -1.424894e-04
## avg_wind    -0.0001424894  3.340328e-05
X_Y <- t(X) %*% Y
X_Y
##               [,1]
## (Intercept) 186997
## avg_wind    731464
X_X %*% X_Y
##                  [,1]
## (Intercept) 48.222946
## avg_wind    -2.211798
coef(air_lm)
## (Intercept)    avg_wind 
##   48.222946   -2.211798
Previous
Next