Modeling Non-linearity

Modeling non-linearity

Although linear regression has the word, linear, in the name, this is really is misnomer. Regression is able to be more flexible, but the idea of linear regression is that the coeffecients, the \(\beta\) terms, are additive. For example,

\[ Y = \beta_{0} + \beta_{1} X + \beta_{2} X^2 + \beta_{3} X^3 + \epsilon \]

and

\[ Y = \beta_{0} + \beta_{1} log(X) + \epsilon \]

are both additive in the coefficients, therefore are special cases of regression for terms that would result in predictive curves that would be non-linear.

Non-linear in the parameters

There are a series of models that can be non-linear in the parameters. An example is the logistic population-growth model (Shryock, Siegel, et al., 1973):

\[ Y = \frac{\beta_{1}}{1 + exp(\beta_{2} + \beta_{3}X)} + \epsilon \]

We are going to focus on an example for the first example, modeling non-linear associations while keeping the regression model additive.

Non-linear example

Let’s explore some weather data from Australia.

library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.4.0      ✔ purrr   0.3.5 
## ✔ tibble  3.1.8      ✔ dplyr   1.0.10
## ✔ tidyr   1.2.1      ✔ stringr 1.4.1 
## ✔ readr   2.1.3      ✔ forcats 0.5.2 
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
library(mosaic)
## Registered S3 method overwritten by 'mosaic':
##   method                           from   
##   fortify.SpatialPolygonsDataFrame ggplot2
## 
## The 'mosaic' package masks several functions from core packages in order to add 
## additional features.  The original behavior of these functions should not be affected by this.
## 
## Attaching package: 'mosaic'
## 
## The following object is masked from 'package:Matrix':
## 
##     mean
## 
## The following objects are masked from 'package:dplyr':
## 
##     count, do, tally
## 
## The following object is masked from 'package:purrr':
## 
##     cross
## 
## The following object is masked from 'package:ggplot2':
## 
##     stat
## 
## The following objects are masked from 'package:stats':
## 
##     binom.test, cor, cor.test, cov, fivenum, IQR, median, prop.test,
##     quantile, sd, t.test, var
## 
## The following objects are masked from 'package:base':
## 
##     max, mean, min, prod, range, sample, sum
library(broom)
library(ggformula)
library(lubridate)
## Loading required package: timechange
## 
## Attaching package: 'lubridate'
## 
## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union
theme_set(theme_bw(base_size = 16))

aus_weather <- readr::read_csv("https://raw.githubusercontent.com/lebebr01/statthink/main/data-raw/weatherAUS.csv") |> 
    filter(Location %in% c('Adelaide', 'Canberra')) |>
    mutate(day_of_year = yday(Date))
## Rows: 142193 Columns: 24
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr   (6): Location, WindGustDir, WindDir9am, WindDir3pm, RainToday, RainTom...
## dbl  (17): MinTemp, MaxTemp, Rainfall, Evaporation, Sunshine, WindGustSpeed,...
## date  (1): Date
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
count(aus_weather, Location)
## # A tibble: 2 × 2
##   Location     n
##   <chr>    <int>
## 1 Adelaide  3090
## 2 Canberra  3418
head(aus_weather)
## # A tibble: 6 × 25
##   Date       Location MinTemp MaxTemp Rainfall Evapora…¹ Sunsh…² WindG…³ WindG…⁴
##   <date>     <chr>      <dbl>   <dbl>    <dbl>     <dbl>   <dbl> <chr>     <dbl>
## 1 2007-11-01 Canberra     8      24.3      0         3.4     6.3 NW           30
## 2 2007-11-02 Canberra    14      26.9      3.6       4.4     9.7 ENE          39
## 3 2007-11-03 Canberra    13.7    23.4      3.6       5.8     3.3 NW           85
## 4 2007-11-04 Canberra    13.3    15.5     39.8       7.2     9.1 NW           54
## 5 2007-11-05 Canberra     7.6    16.1      2.8       5.6    10.6 SSE          50
## 6 2007-11-06 Canberra     6.2    16.9      0         5.8     8.2 SE           44
## # … with 16 more variables: WindDir9am <chr>, WindDir3pm <chr>,
## #   WindSpeed9am <dbl>, WindSpeed3pm <dbl>, Humidity9am <dbl>,
## #   Humidity3pm <dbl>, Pressure9am <dbl>, Pressure3pm <dbl>, Cloud9am <dbl>,
## #   Cloud3pm <dbl>, Temp9am <dbl>, Temp3pm <dbl>, RainToday <chr>,
## #   RISK_MM <dbl>, RainTomorrow <chr>, day_of_year <dbl>, and abbreviated
## #   variable names ¹​Evaporation, ²​Sunshine, ³​WindGustDir, ⁴​WindGustSpeed
gf_point(Temp3pm ~ day_of_year, data = aus_weather) |>
  gf_labs(x = "Day of Year",
          y = "Temperature (Celsius)")
## Warning: Removed 11 rows containing missing values (`geom_point()`).

gf_point(Temp3pm ~ day_of_year, data = aus_weather, color = ~ Location) |>
  gf_labs(x = "Day of Year",
          y = "Temperature (Celsius)")
## Warning: Removed 11 rows containing missing values (`geom_point()`).

gf_point(Temp3pm ~ day_of_year, data = aus_weather, color = ~ Location, alpha = .3) |>
gf_smooth(method = 'lm', size = 4) |> 
  gf_labs(x = "Day of Year",
          y = "Temperature (Celsius)")
## Warning: Removed 11 rows containing non-finite values (`stat_smooth()`).
## Warning: Removed 11 rows containing missing values (`geom_point()`).

gf_point(Temp3pm ~ day_of_year, data = aus_weather, color = ~ Location, alpha = .3) |>
gf_smooth(method = '', size = 4) |> 
  gf_labs(x = "Day of Year",
          y = "Temperature (Celsius)")
## Warning: Removed 11 rows containing non-finite values (`stat_smooth()`).
## Warning: Computation failed in `stat_smooth()`
## Caused by error in `get()`:
## ! invalid first argument
## Warning: Removed 11 rows containing missing values (`geom_point()`).

linear_mod <- lm(Temp3pm ~ day_of_year + Location, data = aus_weather)
summary(linear_mod)
## 
## Call:
## lm(formula = Temp3pm ~ day_of_year + Location, data = aus_weather)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -16.012  -5.489  -0.940   4.646  22.684 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      23.9680663  0.1900207  126.13   <2e-16 ***
## day_of_year      -0.0128942  0.0007959  -16.20   <2e-16 ***
## LocationCanberra -2.1539876  0.1679972  -12.82   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.762 on 6494 degrees of freedom
##   (11 observations deleted due to missingness)
## Multiple R-squared:  0.06113,    Adjusted R-squared:  0.06084 
## F-statistic: 211.4 on 2 and 6494 DF,  p-value: < 2.2e-16
plot(linear_mod, which = 1:5)

resid_linear <- broom::augment(linear_mod)

gf_point(.resid ~ .fitted, data = resid_linear, color = ~ Location) |> 
  gf_hline(yintercept = 0, size = 1.5)
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.

quad_model <- lm(Temp3pm ~ poly(day_of_year, 2) + Location, 
                 data = aus_weather)

summary(quad_model)
## 
## Call:
## lm(formula = Temp3pm ~ poly(day_of_year, 2) + Location, data = aus_weather)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -15.8321  -2.9490  -0.5725   2.6229  17.4332 
## 
## Coefficients:
##                         Estimate Std. Error t value Pr(>|t|)    
## (Intercept)             21.66025    0.08304  260.84   <2e-16 ***
## poly(day_of_year, 2)1 -109.89896    4.61906  -23.79   <2e-16 ***
## poly(day_of_year, 2)2  399.00995    4.61940   86.38   <2e-16 ***
## LocationCanberra        -2.22383    0.11461  -19.40   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.613 on 6493 degrees of freedom
##   (11 observations deleted due to missingness)
## Multiple R-squared:  0.5631, Adjusted R-squared:  0.5629 
## F-statistic:  2790 on 3 and 6493 DF,  p-value: < 2.2e-16
resid_quad <- broom::augment(quad_model)

head(resid_quad)
## # A tibble: 6 × 9
##   .rownames Temp3pm poly(day_of…¹ Locat…² .fitted    .hat .sigma .cooksd .std.…³
##   <chr>       <dbl>         <dbl> <chr>     <dbl>   <dbl>  <dbl>   <dbl>   <dbl>
## 1 1            23.6        0.0144 Canber…    19.8 5.29e-4   4.61 8.92e-5  0.821 
## 2 2            25.7        0.0146 Canber…    19.9 5.35e-4   4.61 2.10e-4  1.25  
## 3 3            20.2        0.0147 Canber…    20.0 5.42e-4   4.61 1.74e-7  0.0359
## 4 4            14.1        0.0148 Canber…    20.1 5.49e-4   4.61 2.36e-4 -1.31  
## 5 5            15.4        0.0149 Canber…    20.3 5.56e-4   4.61 1.55e-4 -1.05  
## 6 6            14.8        0.0150 Canber…    20.4 5.64e-4   4.61 2.06e-4 -1.21  
## # … with 1 more variable: `poly(day_of_year, 2)`[2] <dbl>, and abbreviated
## #   variable names ¹​`poly(day_of_year, 2)`[,"1"], ²​Location, ³​.std.resid
gf_point(.std.resid ~ .fitted, data = resid_quad, color = ~ Location) |>
gf_hline(yintercept = 0, size = 1.5)

quad_model_2 <- lm(Temp3pm ~ day_of_year + I(day_of_year^2) + Location, 
                 data = aus_weather)

summary(quad_model_2)
## 
## Call:
## lm(formula = Temp3pm ~ day_of_year + I(day_of_year^2) + Location, 
##     data = aus_weather)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -15.8321  -2.9490  -0.5725   2.6229  17.4332 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       3.502e+01  1.821e-01  192.28   <2e-16 ***
## day_of_year      -1.944e-01  2.170e-03  -89.57   <2e-16 ***
## I(day_of_year^2)  4.979e-04  5.764e-06   86.38   <2e-16 ***
## LocationCanberra -2.224e+00  1.146e-01  -19.40   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.613 on 6493 degrees of freedom
##   (11 observations deleted due to missingness)
## Multiple R-squared:  0.5631, Adjusted R-squared:  0.5629 
## F-statistic:  2790 on 3 and 6493 DF,  p-value: < 2.2e-16
resid_quad_2 <- broom::augment(quad_model_2)

head(resid_quad_2)
## # A tibble: 6 × 11
##   .rowna…¹ Temp3pm day_o…² I(day…³ Locat…⁴ .fitted .resid    .hat .sigma .cooksd
##   <chr>      <dbl>   <dbl> <I<dbl> <chr>     <dbl>  <dbl>   <dbl>  <dbl>   <dbl>
## 1 1           23.6     305   93025 Canber…    19.8  3.79  5.29e-4   4.61 8.92e-5
## 2 2           25.7     306   93636 Canber…    19.9  5.78  5.35e-4   4.61 2.10e-4
## 3 3           20.2     307   94249 Canber…    20.0  0.165 5.42e-4   4.61 1.74e-7
## 4 4           14.1     308   94864 Canber…    20.1 -6.05  5.49e-4   4.61 2.36e-4
## 5 5           15.4     309   95481 Canber…    20.3 -4.86  5.56e-4   4.61 1.55e-4
## 6 6           14.8     310   96100 Canber…    20.4 -5.57  5.64e-4   4.61 2.06e-4
## # … with 1 more variable: .std.resid <dbl>, and abbreviated variable names
## #   ¹​.rownames, ²​day_of_year, ³​`I(day_of_year^2)`, ⁴​Location
gf_point(Temp3pm ~ day_of_year, data = resid_quad_2, color = ~ Location, alpha = .3) |> 
  gf_line(.fitted ~ day_of_year, color = ~ Location, data = resid_quad_2, size = 4)

quad_model_3 <- lm(Temp3pm ~ poly(day_of_year, 2) * Location, 
                 data = aus_weather)

summary(quad_model_3)
## 
## Call:
## lm(formula = Temp3pm ~ poly(day_of_year, 2) * Location, data = aus_weather)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -16.2571  -2.9927  -0.5634   2.5809  17.7627 
## 
## Coefficients:
##                                          Estimate Std. Error t value Pr(>|t|)
## (Intercept)                              21.65852    0.08291 261.220  < 2e-16
## poly(day_of_year, 2)1                  -114.77043    6.71386 -17.095  < 2e-16
## poly(day_of_year, 2)2                   376.21298    6.72202  55.967  < 2e-16
## LocationCanberra                         -2.22306    0.11443 -19.428  < 2e-16
## poly(day_of_year, 2)1:LocationCanberra    9.03072    9.23824   0.978    0.328
## poly(day_of_year, 2)2:LocationCanberra   43.03473    9.24005   4.657 3.27e-06
##                                           
## (Intercept)                            ***
## poly(day_of_year, 2)1                  ***
## poly(day_of_year, 2)2                  ***
## LocationCanberra                       ***
## poly(day_of_year, 2)1:LocationCanberra    
## poly(day_of_year, 2)2:LocationCanberra ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.606 on 6491 degrees of freedom
##   (11 observations deleted due to missingness)
## Multiple R-squared:  0.5646, Adjusted R-squared:  0.5643 
## F-statistic:  1684 on 5 and 6491 DF,  p-value: < 2.2e-16
quad_model_3 <- lm(Temp3pm ~ day_of_year * Location + I(day_of_year^2) * Location, 
                 data = aus_weather)
resid_quad_3 <- broom::augment(quad_model_3)

gf_point(Temp3pm ~ day_of_year, data = resid_quad_3, color = ~ Location, alpha = .3) |> 
  gf_line(.fitted ~ day_of_year, color = ~ Location, data = resid_quad_3, size = 4)

Previous
Next