Modeling Non-linearity
Modeling non-linearity
Although linear regression has the word, linear, in the name, this is really is misnomer. Regression is able to be more flexible, but the idea of linear regression is that the coeffecients, the \(\beta\) terms, are additive. For example,
\[ Y = \beta_{0} + \beta_{1} X + \beta_{2} X^2 + \beta_{3} X^3 + \epsilon \]
and
\[ Y = \beta_{0} + \beta_{1} log(X) + \epsilon \]
are both additive in the coefficients, therefore are special cases of regression for terms that would result in predictive curves that would be non-linear.
Non-linear in the parameters
There are a series of models that can be non-linear in the parameters. An example is the logistic population-growth model (Shryock, Siegel, et al., 1973):
\[ Y = \frac{\beta_{1}}{1 + exp(\beta_{2} + \beta_{3}X)} + \epsilon \]
We are going to focus on an example for the first example, modeling non-linear associations while keeping the regression model additive.
Non-linear example
Let’s explore some weather data from Australia.
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.4.0 ✔ purrr 0.3.5
## ✔ tibble 3.1.8 ✔ dplyr 1.0.10
## ✔ tidyr 1.2.1 ✔ stringr 1.4.1
## ✔ readr 2.1.3 ✔ forcats 0.5.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
library(mosaic)
## Registered S3 method overwritten by 'mosaic':
## method from
## fortify.SpatialPolygonsDataFrame ggplot2
##
## The 'mosaic' package masks several functions from core packages in order to add
## additional features. The original behavior of these functions should not be affected by this.
##
## Attaching package: 'mosaic'
##
## The following object is masked from 'package:Matrix':
##
## mean
##
## The following objects are masked from 'package:dplyr':
##
## count, do, tally
##
## The following object is masked from 'package:purrr':
##
## cross
##
## The following object is masked from 'package:ggplot2':
##
## stat
##
## The following objects are masked from 'package:stats':
##
## binom.test, cor, cor.test, cov, fivenum, IQR, median, prop.test,
## quantile, sd, t.test, var
##
## The following objects are masked from 'package:base':
##
## max, mean, min, prod, range, sample, sum
library(broom)
library(ggformula)
library(lubridate)
## Loading required package: timechange
##
## Attaching package: 'lubridate'
##
## The following objects are masked from 'package:base':
##
## date, intersect, setdiff, union
theme_set(theme_bw(base_size = 16))
aus_weather <- readr::read_csv("https://raw.githubusercontent.com/lebebr01/statthink/main/data-raw/weatherAUS.csv") |>
filter(Location %in% c('Adelaide', 'Canberra')) |>
mutate(day_of_year = yday(Date))
## Rows: 142193 Columns: 24
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (6): Location, WindGustDir, WindDir9am, WindDir3pm, RainToday, RainTom...
## dbl (17): MinTemp, MaxTemp, Rainfall, Evaporation, Sunshine, WindGustSpeed,...
## date (1): Date
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
count(aus_weather, Location)
## # A tibble: 2 × 2
## Location n
## <chr> <int>
## 1 Adelaide 3090
## 2 Canberra 3418
head(aus_weather)
## # A tibble: 6 × 25
## Date Location MinTemp MaxTemp Rainfall Evapora…¹ Sunsh…² WindG…³ WindG…⁴
## <date> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <dbl>
## 1 2007-11-01 Canberra 8 24.3 0 3.4 6.3 NW 30
## 2 2007-11-02 Canberra 14 26.9 3.6 4.4 9.7 ENE 39
## 3 2007-11-03 Canberra 13.7 23.4 3.6 5.8 3.3 NW 85
## 4 2007-11-04 Canberra 13.3 15.5 39.8 7.2 9.1 NW 54
## 5 2007-11-05 Canberra 7.6 16.1 2.8 5.6 10.6 SSE 50
## 6 2007-11-06 Canberra 6.2 16.9 0 5.8 8.2 SE 44
## # … with 16 more variables: WindDir9am <chr>, WindDir3pm <chr>,
## # WindSpeed9am <dbl>, WindSpeed3pm <dbl>, Humidity9am <dbl>,
## # Humidity3pm <dbl>, Pressure9am <dbl>, Pressure3pm <dbl>, Cloud9am <dbl>,
## # Cloud3pm <dbl>, Temp9am <dbl>, Temp3pm <dbl>, RainToday <chr>,
## # RISK_MM <dbl>, RainTomorrow <chr>, day_of_year <dbl>, and abbreviated
## # variable names ¹Evaporation, ²Sunshine, ³WindGustDir, ⁴WindGustSpeed
gf_point(Temp3pm ~ day_of_year, data = aus_weather) |>
gf_labs(x = "Day of Year",
y = "Temperature (Celsius)")
## Warning: Removed 11 rows containing missing values (`geom_point()`).
gf_point(Temp3pm ~ day_of_year, data = aus_weather, color = ~ Location) |>
gf_labs(x = "Day of Year",
y = "Temperature (Celsius)")
## Warning: Removed 11 rows containing missing values (`geom_point()`).
gf_point(Temp3pm ~ day_of_year, data = aus_weather, color = ~ Location, alpha = .3) |>
gf_smooth(method = 'lm', size = 4) |>
gf_labs(x = "Day of Year",
y = "Temperature (Celsius)")
## Warning: Removed 11 rows containing non-finite values (`stat_smooth()`).
## Warning: Removed 11 rows containing missing values (`geom_point()`).
gf_point(Temp3pm ~ day_of_year, data = aus_weather, color = ~ Location, alpha = .3) |>
gf_smooth(method = '', size = 4) |>
gf_labs(x = "Day of Year",
y = "Temperature (Celsius)")
## Warning: Removed 11 rows containing non-finite values (`stat_smooth()`).
## Warning: Computation failed in `stat_smooth()`
## Caused by error in `get()`:
## ! invalid first argument
## Warning: Removed 11 rows containing missing values (`geom_point()`).
linear_mod <- lm(Temp3pm ~ day_of_year + Location, data = aus_weather)
summary(linear_mod)
##
## Call:
## lm(formula = Temp3pm ~ day_of_year + Location, data = aus_weather)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.012 -5.489 -0.940 4.646 22.684
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 23.9680663 0.1900207 126.13 <2e-16 ***
## day_of_year -0.0128942 0.0007959 -16.20 <2e-16 ***
## LocationCanberra -2.1539876 0.1679972 -12.82 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.762 on 6494 degrees of freedom
## (11 observations deleted due to missingness)
## Multiple R-squared: 0.06113, Adjusted R-squared: 0.06084
## F-statistic: 211.4 on 2 and 6494 DF, p-value: < 2.2e-16
plot(linear_mod, which = 1:5)
resid_linear <- broom::augment(linear_mod)
gf_point(.resid ~ .fitted, data = resid_linear, color = ~ Location) |>
gf_hline(yintercept = 0, size = 1.5)
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
quad_model <- lm(Temp3pm ~ poly(day_of_year, 2) + Location,
data = aus_weather)
summary(quad_model)
##
## Call:
## lm(formula = Temp3pm ~ poly(day_of_year, 2) + Location, data = aus_weather)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.8321 -2.9490 -0.5725 2.6229 17.4332
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 21.66025 0.08304 260.84 <2e-16 ***
## poly(day_of_year, 2)1 -109.89896 4.61906 -23.79 <2e-16 ***
## poly(day_of_year, 2)2 399.00995 4.61940 86.38 <2e-16 ***
## LocationCanberra -2.22383 0.11461 -19.40 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.613 on 6493 degrees of freedom
## (11 observations deleted due to missingness)
## Multiple R-squared: 0.5631, Adjusted R-squared: 0.5629
## F-statistic: 2790 on 3 and 6493 DF, p-value: < 2.2e-16
resid_quad <- broom::augment(quad_model)
head(resid_quad)
## # A tibble: 6 × 9
## .rownames Temp3pm poly(day_of…¹ Locat…² .fitted .hat .sigma .cooksd .std.…³
## <chr> <dbl> <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 23.6 0.0144 Canber… 19.8 5.29e-4 4.61 8.92e-5 0.821
## 2 2 25.7 0.0146 Canber… 19.9 5.35e-4 4.61 2.10e-4 1.25
## 3 3 20.2 0.0147 Canber… 20.0 5.42e-4 4.61 1.74e-7 0.0359
## 4 4 14.1 0.0148 Canber… 20.1 5.49e-4 4.61 2.36e-4 -1.31
## 5 5 15.4 0.0149 Canber… 20.3 5.56e-4 4.61 1.55e-4 -1.05
## 6 6 14.8 0.0150 Canber… 20.4 5.64e-4 4.61 2.06e-4 -1.21
## # … with 1 more variable: `poly(day_of_year, 2)`[2] <dbl>, and abbreviated
## # variable names ¹`poly(day_of_year, 2)`[,"1"], ²Location, ³.std.resid
gf_point(.std.resid ~ .fitted, data = resid_quad, color = ~ Location) |>
gf_hline(yintercept = 0, size = 1.5)
quad_model_2 <- lm(Temp3pm ~ day_of_year + I(day_of_year^2) + Location,
data = aus_weather)
summary(quad_model_2)
##
## Call:
## lm(formula = Temp3pm ~ day_of_year + I(day_of_year^2) + Location,
## data = aus_weather)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.8321 -2.9490 -0.5725 2.6229 17.4332
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.502e+01 1.821e-01 192.28 <2e-16 ***
## day_of_year -1.944e-01 2.170e-03 -89.57 <2e-16 ***
## I(day_of_year^2) 4.979e-04 5.764e-06 86.38 <2e-16 ***
## LocationCanberra -2.224e+00 1.146e-01 -19.40 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.613 on 6493 degrees of freedom
## (11 observations deleted due to missingness)
## Multiple R-squared: 0.5631, Adjusted R-squared: 0.5629
## F-statistic: 2790 on 3 and 6493 DF, p-value: < 2.2e-16
resid_quad_2 <- broom::augment(quad_model_2)
head(resid_quad_2)
## # A tibble: 6 × 11
## .rowna…¹ Temp3pm day_o…² I(day…³ Locat…⁴ .fitted .resid .hat .sigma .cooksd
## <chr> <dbl> <dbl> <I<dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 23.6 305 93025 Canber… 19.8 3.79 5.29e-4 4.61 8.92e-5
## 2 2 25.7 306 93636 Canber… 19.9 5.78 5.35e-4 4.61 2.10e-4
## 3 3 20.2 307 94249 Canber… 20.0 0.165 5.42e-4 4.61 1.74e-7
## 4 4 14.1 308 94864 Canber… 20.1 -6.05 5.49e-4 4.61 2.36e-4
## 5 5 15.4 309 95481 Canber… 20.3 -4.86 5.56e-4 4.61 1.55e-4
## 6 6 14.8 310 96100 Canber… 20.4 -5.57 5.64e-4 4.61 2.06e-4
## # … with 1 more variable: .std.resid <dbl>, and abbreviated variable names
## # ¹.rownames, ²day_of_year, ³`I(day_of_year^2)`, ⁴Location
gf_point(Temp3pm ~ day_of_year, data = resid_quad_2, color = ~ Location, alpha = .3) |>
gf_line(.fitted ~ day_of_year, color = ~ Location, data = resid_quad_2, size = 4)
quad_model_3 <- lm(Temp3pm ~ poly(day_of_year, 2) * Location,
data = aus_weather)
summary(quad_model_3)
##
## Call:
## lm(formula = Temp3pm ~ poly(day_of_year, 2) * Location, data = aus_weather)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.2571 -2.9927 -0.5634 2.5809 17.7627
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 21.65852 0.08291 261.220 < 2e-16
## poly(day_of_year, 2)1 -114.77043 6.71386 -17.095 < 2e-16
## poly(day_of_year, 2)2 376.21298 6.72202 55.967 < 2e-16
## LocationCanberra -2.22306 0.11443 -19.428 < 2e-16
## poly(day_of_year, 2)1:LocationCanberra 9.03072 9.23824 0.978 0.328
## poly(day_of_year, 2)2:LocationCanberra 43.03473 9.24005 4.657 3.27e-06
##
## (Intercept) ***
## poly(day_of_year, 2)1 ***
## poly(day_of_year, 2)2 ***
## LocationCanberra ***
## poly(day_of_year, 2)1:LocationCanberra
## poly(day_of_year, 2)2:LocationCanberra ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.606 on 6491 degrees of freedom
## (11 observations deleted due to missingness)
## Multiple R-squared: 0.5646, Adjusted R-squared: 0.5643
## F-statistic: 1684 on 5 and 6491 DF, p-value: < 2.2e-16
quad_model_3 <- lm(Temp3pm ~ day_of_year * Location + I(day_of_year^2) * Location,
data = aus_weather)
resid_quad_3 <- broom::augment(quad_model_3)
gf_point(Temp3pm ~ day_of_year, data = resid_quad_3, color = ~ Location, alpha = .3) |>
gf_line(.fitted ~ day_of_year, color = ~ Location, data = resid_quad_3, size = 4)