Least Squares Simulation

Example to show least squares minimization

This short example is meant to show that least squares really does minimize the criterion, $ \sum (Y - \hat{Y})^2 $.

In this example, we will generate data so that we know what the truth is. Then, after generating the data, we will compute a grid of candidate values for the slope and y-intercept. For each combination of y-intercept and slope, we will compute the sum of squares error shown above.

Simulate Some Data

The following example simulates data based on the following linear regression formula:

\[ Y = 5 + 0.5 X + \epsilon \]

More explicitly, the simulation allows us to specify what the intercept and slope are in the population. These are specified below in the reg_weights simulation argument.

library(tidyverse)
library(simglm)

theme_set(theme_bw(base_size = 18))

set.seed(2023)

sim_arguments <- list(
    formula = y ~ x,
    fixed = list(x = list(var_type = 'continuous', mean = 100, sd = 20)),
    error = list(variance = 100),
    sample_size = 1000,
    reg_weights = c(5, .5)
)

sim_data <- simulate_fixed(data = NULL, sim_arguments) |>
  simulate_error(sim_arguments) |>
  generate_response(sim_arguments)

head(sim_data)
##   X.Intercept.         x level1_id      error fixed_outcome random_effects        y
## 1            1  98.32431         1 -2.6862147      54.16216              0 51.47594
## 2            1  80.34113         2 -1.1886232      45.17056              0 43.98194
## 3            1  62.49865         3  0.6545456      36.24933              0 36.90387
## 4            1  96.27711         4 12.9091527      53.13855              0 66.04771
## 5            1  87.33029         5 -4.5144816      48.66514              0 44.15066
## 6            1 121.81595         6 19.5705749      65.90797              0 85.47855

Visualize the Simulated Data

The following code visualizes the simulated data from above. What would you estimate the correlation to be?

library(ggformula)
gf_point(y ~ x, data = sim_data, size = 4) |>
  gf_smooth(method = 'lm')
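
If you want to check your guess numerically, the correlation can be computed directly; this quick check is an addition to these notes:

cor(sim_data$x, sim_data$y)

Given the simulation settings, the population correlation is $ 0.5 \times 20 / \sqrt{(0.5 \times 20)^2 + 100} \approx 0.71 $, so the sample correlation should land near that value.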

Estimate Regression Coefficients

Even though we know what the truth is, there is random error involved in the simulation process; therefore, the population values specified above will not exactly equal the estimated regression coefficients. Below, we estimate those regression coefficients.

sim_lm <- lm(y ~ x, data = sim_data)
coef(sim_lm)
## (Intercept)           x 
##   5.1260093   0.5002076
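
For simple linear regression, these estimates also have a well-known closed form, found by setting the partial derivatives of the criterion to zero:

\[ \hat{\beta}_1 = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sum_i (x_i - \bar{x})^2}, \qquad \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x} \]

The grid search below reaches the same answer numerically, which is the point of the exercise.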

Create Different Combinations of Intercept and Slope Coefficients

The following code generates a grid of y-intercept and slope combinations. For each combination, we will compute the sum of squares error shown at the top of the notes, demonstrating that the regression estimates are the values that minimize it. The grid contains 61 intercept values and 151 slope values, for 9,211 combinations; the coefficients estimated above are appended as a final row, giving the 9,212 rows reported below.

y_intercept <- seq(0, 15, by = .25)
slope <- seq(0, 1.5, by = .01)

conditions <- rbind(expand.grid(y_intercept = y_intercept,
                                slope = slope),
                    coef(sim_lm))

tail(conditions)
##      y_intercept     slope
## 9207   14.000000 1.5000000
## 9208   14.250000 1.5000000
## 9209   14.500000 1.5000000
## 9210   14.750000 1.5000000
## 9211   15.000000 1.5000000
## 9212    5.126009 0.5002076
dim(conditions)
## [1] 9212    2

Showing Two Combinations

Here we visualize two of the candidate intercept and slope combinations, along with the least squares fit. Which one seems better for the data?

gf_point(y ~ x, data = sim_data, size = 4) |>
  gf_smooth(method = 'lm') |>
  gf_abline(slope = ~slope, intercept = ~y_intercept, data = slice(conditions, 1), linetype = 2, linewidth = 2) |>
  gf_abline(slope = ~slope, intercept = ~y_intercept, data = slice(conditions, 855), linetype = 2, color = 'lightgreen', linewidth = 2) |>
  gf_refine(coord_cartesian(xlim = c(0, 160), ylim = c(0, 120)))

Compute Sum of Squares Error

The following code creates a function that computes the error for a candidate regression line. The function takes two arguments: a single row of intercept and slope values, and the simulated data. The output is the residual standard error (sigma), the square root of the sum of squared deviations divided by the residual degrees of freedom, so smaller values indicate a better-fitting line. The first code chunk below performs the computation for a single condition and compares it to the sigma from the fitted model; the second code chunk does it for all of the conditions.

sum_square_error <- function(conditions, sim_data) {
    # predicted values implied by the candidate intercept and slope
    fitted <- conditions[['y_intercept']] + conditions[['slope']] * sim_data[['x']]

    # deviations of the observed outcome from the candidate line
    deviation <- sim_data[['y']] - fitted

    # residual standard error: sqrt(SSE / (n - 2))
    sqrt(sum(deviation^2) / (nrow(sim_data) - 2))
}

sum_square_error(conditions[1892, ], sim_data)
## [1] 26.25721
summary(sim_lm)$sigma
## [1] 9.738423
library(future.apply)

# a multisession plan runs everywhere, including from within RStudio
plan(multisession)

# compute the error for every candidate line; future_sapply() evaluates
# in parallel under the plan set above
conditions$sse <- future_sapply(seq_len(nrow(conditions)),
                                function(xx) sum_square_error(conditions[xx, ], sim_data))

head(conditions)
##   y_intercept slope      sse
## 1        0.00     0 56.72588
## 2        0.25     0 56.48344
## 3        0.50     0 56.24108
## 4        0.75     0 55.99878
## 5        1.00     0 55.75655
## 6        1.25     0 55.51440
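
As a final check, and as an addition to these notes, the combination with the smallest error should be the row holding the regression estimates that was appended to the grid earlier:

conditions[which.min(conditions$sse), ]  # row with the smallest error

A heatmap of the error surface makes the same point visually. The following is a minimal sketch using ggplot2 (already loaded with the tidyverse); the appended off-grid row is dropped so the tiles line up:

ggplot(slice(conditions, -n()), aes(x = slope, y = y_intercept, fill = sse)) +
  geom_tile()

The smallest sse values should concentrate around the intercept of roughly 5 and slope of roughly 0.5 estimated above.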