# Data Conditions for Valid Inference

# Conditions for Linear Regression

The conditions surrounding linear regression typically surround the residuals. Residuals are defined as:

$$ Y - \hat{Y} $$

These are the deviations in the observed scores from the predicted scores from the linear regression. Recall, through least square estimation that these residauls will sum to 0, therefore, their mean would also be equal to 0. However, there are certain conditions about these residuals that are made for the linear regression model to have the inferences be appropriate. We’ll talk more about what the implications for violating these conditions will have on the linear regression model, but first, the conditions.

- Approximately Normally distributed residuals
- Homogeneity of variance
- Uncorrelated residuals
- Error term is uncorrelated with the predictor attribute
- Linearity and additivity of the model.

Each of these will be discussed in turn.

Data conditions that are not directly testable:

- Validity
- Representativeness

## Approximately Normally distributed residuals

The first assumption is that the residuals are at least approximately Normally distributed. This assumption is really only much of a concern when the sample size is small. If the sample size is larger, the Central Limit Thereom (CLT) states that the distribution of the statstics will be approximately normally distributed. The threshold for the CLT to be properly invoked is about 30. Larger then this, the residuals do not need to be approximately normally distributed. Even still, exploring the distribution of the residuals can still be helpful and can also be helpful to identify potential extreme values.

This example will make use of the air quality data one more time.

### Data

The data for this section of notes will explore data from the Environmental Protection Agency on Air Quality collected for the state of Iowa in 2021. The data are daily values for PM 2.5 particulates. The attributes included in the data are shown below with a short description.

Variable | Description |
---|---|

date | Date of observation |

id | Site ID |

poc | Parameter Occurrence Code (POC) |

pm2.5 | Average daily pm 2.5 particulate value, in (ug/m3; micrograms per meter cubed) |

daily_aqi | Average air quality index |

site_name | Site Name |

aqs_parameter_desc | Text Description of Observation |

cbsa_code | Core Based Statistical Area (CBSA) ID |

cbsa_name | CBSA Name |

county | County in Iowa |

avg_wind | Average daily wind speed (in knots) |

max_wind | Maximum daily wind speed (in knots) |

max_wind_hours | Time of maximum daily wind speed |

#### Guiding Question

How is average daily wind speed related to the daily air quality index?

```
library(tidyverse)
```

```
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.3.6 ✔ purrr 0.3.4
## ✔ tibble 3.1.8 ✔ dplyr 1.0.10
## ✔ tidyr 1.2.1 ✔ stringr 1.4.1
## ✔ readr 2.1.2 ✔ forcats 0.5.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
```

```
library(ggformula)
```

```
## Loading required package: ggstance
##
## Attaching package: 'ggstance'
##
## The following objects are masked from 'package:ggplot2':
##
## geom_errorbarh, GeomErrorbarh
##
## Loading required package: scales
##
## Attaching package: 'scales'
##
## The following object is masked from 'package:purrr':
##
## discard
##
## The following object is masked from 'package:readr':
##
## col_factor
##
## Loading required package: ggridges
##
## New to ggformula? Try the tutorials:
## learnr::run_tutorial("introduction", package = "ggformula")
## learnr::run_tutorial("refining", package = "ggformula")
```

```
library(mosaic)
```

```
## Registered S3 method overwritten by 'mosaic':
## method from
## fortify.SpatialPolygonsDataFrame ggplot2
##
## The 'mosaic' package masks several functions from core packages in order to add
## additional features. The original behavior of these functions should not be affected by this.
##
## Attaching package: 'mosaic'
##
## The following object is masked from 'package:Matrix':
##
## mean
##
## The following object is masked from 'package:scales':
##
## rescale
##
## The following objects are masked from 'package:dplyr':
##
## count, do, tally
##
## The following object is masked from 'package:purrr':
##
## cross
##
## The following object is masked from 'package:ggplot2':
##
## stat
##
## The following objects are masked from 'package:stats':
##
## binom.test, cor, cor.test, cov, fivenum, IQR, median, prop.test,
## quantile, sd, t.test, var
##
## The following objects are masked from 'package:base':
##
## max, mean, min, prod, range, sample, sum
```

```
theme_set(theme_bw(base_size = 18))
airquality <- readr::read_csv("https://raw.githubusercontent.com/lebebr01/psqf_6243/main/data/iowa_air_quality_2021.csv")
```

```
## Rows: 6917 Columns: 10
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (5): date, site_name, aqs_parameter_desc, cbsa_name, county
## dbl (5): id, poc, pm2.5, daily_aqi, cbsa_code
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
```

```
wind <- readr::read_csv("https://raw.githubusercontent.com/lebebr01/psqf_6243/main/data/daily_WIND_2021-iowa.csv")
```

```
## Rows: 1537 Columns: 5
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): date, cbsa_name
## dbl (3): avg_wind, max_wind, max_wind_hours
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
```

```
airquality <- airquality %>%
left_join(wind, by = c('cbsa_name', 'date')) %>%
drop_na()
air_lm <- lm(daily_aqi ~ avg_wind, data = airquality)
coef(air_lm)
```

```
## (Intercept) avg_wind
## 48.222946 -2.211798
```

The residuals can be saved with the `resid()`

function. These can also be added to the original data, which are particularly helpful.

```
head(resid(air_lm))
```

```
## 1 2 3 4 5 6
## 15.283427 11.186742 25.191433 15.398540 8.246896 -12.749410
```

```
airquality$residuals <- resid(air_lm)
head(airquality)
```

```
## # A tibble: 6 × 14
## date id poc pm2.5 daily_aqi site_…¹ aqs_p…² cbsa_…³ cbsa_…⁴ county
## <chr> <dbl> <dbl> <dbl> <dbl> <chr> <chr> <dbl> <chr> <chr>
## 1 1/1/21 190130009 1 15.1 57 Water … PM2.5 … 47940 Waterl… Black…
## 2 1/4/21 190130009 1 13.3 54 Water … PM2.5 … 47940 Waterl… Black…
## 3 1/7/21 190130009 1 20.5 69 Water … PM2.5 … 47940 Waterl… Black…
## 4 1/10/21 190130009 1 14.3 56 Water … PM2.5 … 47940 Waterl… Black…
## 5 1/13/21 190130009 1 13.7 54 Water … PM2.5 … 47940 Waterl… Black…
## 6 1/16/21 190130009 1 5.3 22 Water … PM2.5 … 47940 Waterl… Black…
## # … with 4 more variables: avg_wind <dbl>, max_wind <dbl>,
## # max_wind_hours <dbl>, residuals <dbl>, and abbreviated variable names
## # ¹site_name, ²aqs_parameter_desc, ³cbsa_code, ⁴cbsa_name
```

```
gf_density(~ residuals, data = airquality) %>%
gf_labs(x = "Residuals")
```

```
ggplot(airquality, aes(sample = residuals)) +
stat_qq(size = 5) +
stat_qq_line(size = 2)
```

### Standardized Residuals

Standardized residuals can be another way to explore the residuals. These will now be standardized to have a variance of 1, similar to that of a Z-score. These can be computed as:

$$ standardized\ residuals = \frac{\epsilon_{i}}{SD_{\epsilon}} $$

Within R, these can be computed using the function `rstandard()`

. Furthermore, these can be computed from another package called broom with the `augment()`

function.

```
head(rstandard(air_lm))
```

```
## 1 2 3 4 5 6
## 0.8466153 0.6196983 1.3955421 0.8529766 0.4568937 -0.7062638
```

```
library(broom)
resid_diagnostics <- augment(air_lm)
head(resid_diagnostics)
```

```
## # A tibble: 6 × 8
## daily_aqi avg_wind .fitted .resid .hat .sigma .cooksd .std.resid
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 57 2.94 41.7 15.3 0.000266 18.1 0.0000953 0.847
## 2 54 2.45 42.8 11.2 0.000318 18.1 0.0000611 0.620
## 3 69 2.00 43.8 25.2 0.000380 18.1 0.000370 1.40
## 4 56 3.45 40.6 15.4 0.000230 18.1 0.0000836 0.853
## 5 54 1.12 45.8 8.25 0.000539 18.1 0.0000563 0.457
## 6 22 6.09 34.7 -12.7 0.000319 18.1 0.0000795 -0.706
```

```
gf_density(~ .std.resid, data = resid_diagnostics) %>%
gf_labs(x = "Standardized Residuals")
```

```
ggplot(resid_diagnostics, aes(sample = .std.resid)) +
stat_qq(size = 5) +
stat_qq_line(size = 2)
```

## Homogeneity of variance

Homoegeneity of variance is an assumption that is of larger concern compared to normality of the residuals. Homoogeneity of variance is an assumption that states that the variance of the residuals are similar across the predicted or fitted values from the regression line. This assumption can be explored by looking at the residuals (standardized or raw residuals), by the fitted or predicted values. Within this plot, the range of residuals should be similar across the range of fitted or predicted values.

```
gf_point(.resid ~ .fitted, data = resid_diagnostics, size = 5, alpha = .15) %>%
gf_smooth(method = 'loess', size = 2) %>%
gf_labs(x = 'Fitted Values',
y = 'Residuals')
```

```
gf_point(.std.resid ~ .fitted, data = resid_diagnostics, size = 5, alpha = .15) %>%
gf_smooth(method = 'loess', size = 2) %>%
gf_labs(x = 'Fitted Values',
y = 'Standardized Residuals')
```

Another figure that can also be helpful for the homogeneity of variance assumption is one that rescales the residuals on the y-axis. The rescaling makes all the standardized residuals positive (takes the absolute value) and then takes the square root of this.

```
resid_diagnostics %>%
mutate(sqrt_abs_sresid = sqrt(abs(.std.resid))) %>%
gf_point(sqrt_abs_sresid ~ .fitted, size = 5, alpha = .15) %>%
gf_smooth(method = 'loess', size = 2) %>%
gf_labs(x = 'Fitted Values',
y = 'Sqrt Abs Standardized Residuals')
```

## Data with high leverage

Data with high leverage are extreme values that may significantly impact the regression estimates. These statistics include extreme values for the outcome or predictor attributes. Cook’s distance is one statistic that can help to identify points with high impact/leverage for the regression estimates. Cook’s distance is a statistic that represents how much change there would be in the fitted values if the point was removed when estimating the regression coefficients. There is some disagreement between what type of thresholds to use for Cook’s distance, but one rule of thumb is Cook’s distance greater than 1. There has also been some research showing that Cook’s distance follows an F-distribution, so a more specific value could be computed. The rule of thumb for greater than 1 comes from the F distribution for large samples.

```
resid_diagnostics %>%
mutate(obs_num = 1:n()) %>%
gf_col(.cooksd ~ obs_num, fill = 'black', color = 'black') %>%
gf_labs(x = "Observation Number",
y = "Cook's Distance")
```

Another leave one out statistic is the studentized deleted residuals. These are computed by removing a data point, refitting the regression model, then generate a predicted value for the X value for the data point removed. Then the residual is computed the same as before and is standardized like the standardized residuals above. The function in R to compute these is `rstudent()`

.

```
head(rstudent(air_lm))
```

```
## 1 2 3 4 5 6
## 0.8465904 0.6196587 1.3956793 0.8529524 0.4568561 -0.7062271
```

```
airquality$student_residuals <- rstudent(air_lm)
head(airquality)
```

```
## # A tibble: 6 × 15
## date id poc pm2.5 daily_aqi site_…¹ aqs_p…² cbsa_…³ cbsa_…⁴ county
## <chr> <dbl> <dbl> <dbl> <dbl> <chr> <chr> <dbl> <chr> <chr>
## 1 1/1/21 190130009 1 15.1 57 Water … PM2.5 … 47940 Waterl… Black…
## 2 1/4/21 190130009 1 13.3 54 Water … PM2.5 … 47940 Waterl… Black…
## 3 1/7/21 190130009 1 20.5 69 Water … PM2.5 … 47940 Waterl… Black…
## 4 1/10/21 190130009 1 14.3 56 Water … PM2.5 … 47940 Waterl… Black…
## 5 1/13/21 190130009 1 13.7 54 Water … PM2.5 … 47940 Waterl… Black…
## 6 1/16/21 190130009 1 5.3 22 Water … PM2.5 … 47940 Waterl… Black…
## # … with 5 more variables: avg_wind <dbl>, max_wind <dbl>,
## # max_wind_hours <dbl>, residuals <dbl>, student_residuals <dbl>, and
## # abbreviated variable names ¹site_name, ²aqs_parameter_desc, ³cbsa_code,
## # ⁴cbsa_name
```

```
airquality %>%
mutate(obs_num = 1:n()) %>%
gf_point(student_residuals ~ obs_num, size = 5, alpha = .15) %>%
gf_hline(yintercept = ~ 3, color = 'blue', size = 2) %>%
gf_labs(x = "Observation Number",
y = "Studentized Residuals")
```

Leverage can be another measure to help detect outliers in X values. The hat values that were computed from the `augment()`

function above and can be interpreted as the distance the X scores are from the center of all X predictors. In the case of a single predictor, the hat values are the distance the X score is from the mean of X. The hat values will also sum up to the number of predictors and will always range between 0 and 1.

```
resid_diagnostics %>%
mutate(obs_num = 1:n()) %>%
gf_col(.hat ~ obs_num, fill = 'black', color = 'black') %>%
gf_labs(x = "Observation Number",
y = "Hat Values (leverage)")
```

```
plot(air_lm, which = 1:5)
```

## Solutions to overcome data conditions not being met

There are some strategies that can be done when a variety of data conditions for linear regression have not been met. In general, the strategies stem around generalizing the model and adding some complexity.

The following are options that would be possible:

- Add interactions or other non-linear terms
- Use more complicated model
- weighted least squares for homogeneity of variance concerns
- models that include measurement error
- mixed models for correlated data

- Transform the data.