# Understanding Regression Parameters

This section of notes aims to dig a bit more into what the simple linear regression (i.e., regression with a single continuous covariate/attribute) parameter estimates mean. We will consider the estimation formulas in part of this to gain a sense of how these can be computed.

## New Example Data

The new data for this section of notes will explore data from the Environmental Protection Agency on Air Quality collected for the state of Iowa in 2021. The data are daily values for PM 2.5 particulates. The attributes included in the data are shown below with a short description.

Variable Description
date Date of observation
id Site ID
poc Parameter Occurrence Code (POC)
pm2.5 Average daily pm 2.5 particulate value, in (ug/m3; micrograms per meter cubed)
daily_aqi Average air quality index
site_name Site Name
aqs_parameter_desc Text Description of Observation
cbsa_code Core Based Statistical Area (CBSA) ID
cbsa_name CBSA Name
county County in Iowa
avg_wind Average daily wind speed (in knots)
max_wind Maximum daily wind speed (in knots)
max_wind_hours Time of maximum daily wind speed

### Guiding Question

How is average daily wind speed related to the daily air quality index?

## Bivariate Figure

Note, below I do a bit of post-processing to combine data from different POC values within a single CBSA.

library(tidyverse)

## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.3.6      ✔ purrr   0.3.4
## ✔ tibble  3.1.8      ✔ dplyr   1.0.10
## ✔ tidyr   1.2.1      ✔ stringr 1.4.1
## ✔ readr   2.1.2      ✔ forcats 0.5.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──

library(ggformula)

## Loading required package: ggstance
##
## Attaching package: 'ggstance'
##
## The following objects are masked from 'package:ggplot2':
##
##     geom_errorbarh, GeomErrorbarh
##
##
## Attaching package: 'scales'
##
## The following object is masked from 'package:purrr':
##
##
##
##     col_factor
##
##
## New to ggformula?  Try the tutorials:
## 	learnr::run_tutorial("introduction", package = "ggformula")
## 	learnr::run_tutorial("refining", package = "ggformula")

library(mosaic)

## Registered S3 method overwritten by 'mosaic':
##   method                           from
##   fortify.SpatialPolygonsDataFrame ggplot2
##
## The 'mosaic' package masks several functions from core packages in order to add
## additional features.  The original behavior of these functions should not be affected by this.
##
## Attaching package: 'mosaic'
##
## The following object is masked from 'package:Matrix':
##
##     mean
##
## The following object is masked from 'package:scales':
##
##     rescale
##
## The following objects are masked from 'package:dplyr':
##
##     count, do, tally
##
## The following object is masked from 'package:purrr':
##
##     cross
##
## The following object is masked from 'package:ggplot2':
##
##     stat
##
## The following objects are masked from 'package:stats':
##
##     binom.test, cor, cor.test, cov, fivenum, IQR, median, prop.test,
##     quantile, sd, t.test, var
##
## The following objects are masked from 'package:base':
##
##     max, mean, min, prod, range, sample, sum

theme_set(theme_bw(base_size = 18))


## Rows: 6917 Columns: 10
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (5): date, site_name, aqs_parameter_desc, cbsa_name, county
## dbl (5): id, poc, pm2.5, daily_aqi, cbsa_code
##
## ℹ Use spec() to retrieve the full column specification for this data.
## ℹ Specify the column types or set show_col_types = FALSE to quiet this message.

wind <- readr::read_csv("https://raw.githubusercontent.com/lebebr01/psqf_6243/main/data/daily_WIND_2021-iowa.csv")

## Rows: 1537 Columns: 5
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): date, cbsa_name
## dbl (3): avg_wind, max_wind, max_wind_hours
##
## ℹ Use spec() to retrieve the full column specification for this data.
## ℹ Specify the column types or set show_col_types = FALSE to quiet this message.

airquality <- airquality %>%
left_join(wind, by = c('cbsa_name', 'date')) %>%
drop_na()

head(airquality)

## # A tibble: 6 × 13
##   date           id   poc pm2.5 daily_aqi site_…¹ aqs_p…² cbsa_…³ cbsa_…⁴ county
##   <chr>       <dbl> <dbl> <dbl>     <dbl> <chr>   <chr>     <dbl> <chr>   <chr>
## 1 1/1/21  190130009     1  15.1        57 Water … PM2.5 …   47940 Waterl… Black…
## 2 1/4/21  190130009     1  13.3        54 Water … PM2.5 …   47940 Waterl… Black…
## 3 1/7/21  190130009     1  20.5        69 Water … PM2.5 …   47940 Waterl… Black…
## 4 1/10/21 190130009     1  14.3        56 Water … PM2.5 …   47940 Waterl… Black…
## 5 1/13/21 190130009     1  13.7        54 Water … PM2.5 …   47940 Waterl… Black…
## 6 1/16/21 190130009     1   5.3        22 Water … PM2.5 …   47940 Waterl… Black…
## # … with 3 more variables: avg_wind <dbl>, max_wind <dbl>,
## #   max_wind_hours <dbl>, and abbreviated variable names ¹​site_name,
## #   ²​aqs_parameter_desc, ³​cbsa_code, ⁴​cbsa_name

dim(airquality)

## [1] 4821   13

gf_point(daily_aqi ~ avg_wind, data = airquality, size = 4, alpha = .15) %>%
gf_labs(x = "Average daily wind speed (in knots)",
y = "Daily Air Quality") %>%
gf_smooth() %>%
gf_smooth(method = 'lm', color = 'lightblue', linetype = 2)

## geom_smooth() using method = 'gam'


cor(daily_aqi ~ avg_wind, data = airquality) |> round(3)

## [1] -0.292

air_lm <- lm(daily_aqi ~ avg_wind, data = airquality)
coef(air_lm) |> round(3)

## (Intercept)    avg_wind
##      48.223      -2.212

summary(air_lm)$r.square |> round(3)  ## [1] 0.085  summary(air_lm)$sigma |> round(3)

## [1] 18.055


### Interpreting these estimates

What do these parameter estimates mean in this context?

Intercept: This is the model implied ______ when the ________ equals 0.
Slope: For each 1 unit change in ____ there is a -2.2 unit decrease in ___.
R-Square: The _____ in ____ is explained by ____.
Sigma: The _____ distance each point is from ____.

## Centering predictors

There are times when centering of predictors can be helpful for interpretation of the model parameters. This can be helpful when 0 is not a practically useful characteristic of the attribute or for more specific tests of certain elements of the X attribute.

### Mean Centering

Mean centering is where the mean of the attribute is subtracted from each value. This is a linear transformation where each data point is subtracted by a constant, the mean. This means that the distance between points do not change.

airquality <- airquality %>%
mutate(avg_wind_mc = avg_wind - mean(avg_wind),
avg_wind_maxc = avg_wind - max(avg_wind),
avg_wind_10 = avg_wind - 10)

gf_point(daily_aqi ~ avg_wind_mc, data = airquality, size = 4, alpha = .15) %>%
gf_labs(x = "Average daily wind speed (in knots)",
y = "Daily Air Quality") %>%
gf_smooth() %>%
gf_smooth(method = 'lm', color = 'lightblue', linetype = 2)

## geom_smooth() using method = 'gam'


air_lm_mc <- lm(daily_aqi ~ avg_wind_mc, data = airquality)
coef(air_lm_mc)

## (Intercept) avg_wind_mc
##   38.788011   -2.211798

summary(air_lm_mc)$r.square  ## [1] 0.08528019  summary(air_lm_mc)$sigma

## [1] 18.05479

air_lm_maxc <- lm(daily_aqi ~ avg_wind_maxc, data = airquality)
coef(air_lm_maxc)

##   (Intercept) avg_wind_maxc
##      5.968391     -2.211798

summary(air_lm_maxc)$r.square  ## [1] 0.08528019  summary(air_lm_maxc)$sigma

## [1] 18.05479

air_lm_10 <- lm(daily_aqi ~ avg_wind_10, data = airquality)
coef(air_lm_10)

## (Intercept) avg_wind_10
##   26.104968   -2.211798

summary(air_lm_10)$r.square  ## [1] 0.08528019  summary(air_lm_10)$sigma

## [1] 18.05479


## Standardized Regression

Another type of regression that can be done is one in which the attributes are standardized prior to estimating the linear regression. What is meant by standardizing? This is converting the attributes into z-scores:

$$Z_{api} = \frac{(aqi - \bar{aqi})}{s_{aqi}}$$

airquality <- airquality %>%
mutate(z_aqi = scale(daily_aqi),
z_aqi2 = (daily_aqi - mean(daily_aqi)) / sd(daily_aqi),
z_wind = scale(avg_wind))


## # A tibble: 6 × 19
##   date           id   poc pm2.5 daily_aqi site_…¹ aqs_p…² cbsa_…³ cbsa_…⁴ county
##   <chr>       <dbl> <dbl> <dbl>     <dbl> <chr>   <chr>     <dbl> <chr>   <chr>
## 1 1/1/21  190130009     1  15.1        57 Water … PM2.5 …   47940 Waterl… Black…
## 2 1/4/21  190130009     1  13.3        54 Water … PM2.5 …   47940 Waterl… Black…
## 3 1/7/21  190130009     1  20.5        69 Water … PM2.5 …   47940 Waterl… Black…
## 4 1/10/21 190130009     1  14.3        56 Water … PM2.5 …   47940 Waterl… Black…
## 5 1/13/21 190130009     1  13.7        54 Water … PM2.5 …   47940 Waterl… Black…
## 6 1/16/21 190130009     1   5.3        22 Water … PM2.5 …   47940 Waterl… Black…
## # … with 9 more variables: avg_wind <dbl>, max_wind <dbl>,
## #   max_wind_hours <dbl>, avg_wind_mc <dbl>, avg_wind_maxc <dbl>,
## #   avg_wind_10 <dbl>, z_aqi <dbl[,1]>, z_aqi2 <dbl>, z_wind <dbl[,1]>, and
## #   abbreviated variable names ¹​site_name, ²​aqs_parameter_desc, ³​cbsa_code,
## #   ⁴​cbsa_name

air_lm_s <- lm(z_aqi ~ z_wind, data = airquality)
coef(air_lm_s)

##   (Intercept)        z_wind
## -6.712023e-16 -2.920277e-01

summary(air_lm_s)$r.square  ## [1] 0.08528019  summary(air_lm_s)$sigma

## [1] 0.9565091


We can also use this formula to convert any unstandardized regression coefficients into a standardized metric.

$$b^{'}{k} = b{k} * \frac{s_{x_{k}}}{s_{y}}$$

-2.211 * sd(airquality$avg_wind) / sd(airquality$daily_aqi)

## [1] -0.2919224

cor(daily_aqi ~ avg_wind, data = airquality)

## [1] -0.2920277


## Parameter Estimation

Now that we looked how the parameters are impacted by some changes in the model specification, how are these parameters actually estimated? I will show two ways, one is general, the other is specific to this simple case with a single predictor/covariate attribute. In general, linear regression (or more generally the general linear model) uses least square estimation. This means that the the parameters in the model minimize the squared distance between the observed and predicted values. That is, least squares estimates minimize this criterion:

$$\sum (Y - \hat{Y})^2$$

### Specific example

Calculus can be used to show that these two equations can be solved simultanuously to get estimates for $$\beta_{0}$$ and $$\beta_{1}$$ that minimize the criterion above. These formulas are:

$$b_{1} = \frac{\sum(X - \bar{X})(Y - \bar{Y})}{\sum(X - \bar{X})^2}$$ $$b_{0} = \bar{Y} - b_{1}\bar{X}$$

Let’s use R to get these quantities.

b1 <- with(airquality,
sum((avg_wind - mean(avg_wind)) * (daily_aqi - mean(daily_aqi))) / sum((avg_wind - mean(avg_wind))^2)
)
b0 <- with(airquality,
mean(daily_aqi) - b1 * mean(avg_wind)
)
b0

## [1] 48.22295

b1

## [1] -2.211798

coef(air_lm)

## (Intercept)    avg_wind
##   48.222946   -2.211798


### General Approach

When there are more than one predictor, the number of equations gets a bit unyieldy, therefore, there is a general analytic approach that works for any set of predictor attributes. The general approach uses matrix algebra (anyone take linear algebra?), to achieve their estimates. This general form is:

$$\mathbf{b} = \left( \mathbf{X}^{}\mathbf{X} \right)^{-1} \left( \mathbf{X}^{} \mathbf{Y} \right).$$ Where $$\mathbf{b}$$ is a vector of estimated regression coefficients, $$\mathbf{X}$$ is a matrix of covariate/predictor attributes (called the design matrix), and $$\mathbf{Y}$$ is a vector of the outcome attribute.

Below, I show what these would look like for the air quality example that has been used and solve for the regression coefficients.

X <- model.matrix(air_lm)

##   (Intercept) avg_wind
## 1           1 2.941667
## 2           1 2.445833
## 3           1 1.995833
## 4           1 3.445833
## 5           1 1.116667
## 6           1 6.091667

Y <- as.matrix(airquality\$daily_aqi)

##      [,1]
## [1,]   57
## [2,]   54
## [3,]   69
## [4,]   56
## [5,]   54
## [6,]   22

X_X <- solve(t(X) %*% X)
X_X

##               (Intercept)      avg_wind
## (Intercept)  0.0008152474 -1.424894e-04
## avg_wind    -0.0001424894  3.340328e-05

X_Y <- t(X) %*% Y
X_Y

##               [,1]
## (Intercept) 186997
## avg_wind    731464

X_X %*% X_Y

##                  [,1]
## (Intercept) 48.222946
## avg_wind    -2.211798

coef(air_lm)

## (Intercept)    avg_wind
##   48.222946   -2.211798