Least Squares Simulation
Example to show least squares minimization
This short example shows that least squares estimation really does minimize the criterion, $ \sum (Y - \hat{Y})^2 $.
In this example, we will generate data ourselves so that we know what the truth is. After generating the data, we will compute a grid of candidate values for the linear slope and y-intercept. For each combination of y-intercept and slope, we will compute the sum of squares error shown above.
Simulate some data
The following example simulates data based on the following linear regression formula:
\[ Y = 5 + 0.5 X + \epsilon \]
More explicitly, the simulation allows us to specify what the intercept and slope are in the population. These are specified below in the reg_weights simulation argument.
library(tidyverse)
## Warning: package 'ggplot2' was built under R version 4.3.2
## Warning: package 'tidyr' was built under R version 4.3.2
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.0 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(simglm)
theme_set(theme_bw(base_size = 18))
set.seed(2023)
sim_arguments <- list(
formula = y ~ x,
fixed = list(x = list(var_type = 'continuous', mean = 100, sd = 20)),
error = list(variance = 100),
sample_size = 1000,
reg_weights = c(5, .5)
)
sim_data <- simulate_fixed(data = NULL, sim_arguments) |>
simulate_error(sim_arguments) |>
generate_response(sim_arguments)
head(sim_data)
## X.Intercept. x level1_id error fixed_outcome random_effects
## 1 1 98.32431 1 -2.6862147 54.16216 0
## 2 1 80.34113 2 -1.1886232 45.17056 0
## 3 1 62.49865 3 0.6545456 36.24933 0
## 4 1 96.27711 4 12.9091527 53.13855 0
## 5 1 87.33029 5 -4.5144816 48.66514 0
## 6 1 121.81595 6 19.5705749 65.90797 0
## y
## 1 51.47594
## 2 43.98194
## 3 36.90387
## 4 66.04771
## 5 44.15066
## 6 85.47855
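If simglm is not available, the same data-generating process can be sketched in base R. This is a minimal stand-in, not the simglm implementation: it assumes only the population values given above (intercept 5, slope 0.5, error variance 100) and will not reproduce the exact rows shown, since the random draws differ.

```r
# Base-R sketch of the same data-generating process:
# y = 5 + 0.5 * x + error, with x ~ N(100, 20^2) and error ~ N(0, 100).
set.seed(2023)
n <- 1000
x <- rnorm(n, mean = 100, sd = 20)
error <- rnorm(n, mean = 0, sd = sqrt(100))  # error variance of 100
y <- 5 + 0.5 * x + error
sim_data_base <- data.frame(x = x, y = y)
head(sim_data_base)
```

Fitting lm(y ~ x, data = sim_data_base) to this version recovers coefficients close to the population values of 5 and 0.5, just as with the simglm data.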
Visualize the Simulated Data
The following code visualizes the simulated data from above. What would you estimate the correlation to be?
library(ggformula)
## Loading required package: scales
##
## Attaching package: 'scales'
## The following object is masked from 'package:purrr':
##
## discard
## The following object is masked from 'package:readr':
##
## col_factor
## Loading required package: ggridges
## Warning: package 'ggridges' was built under R version 4.3.2
##
## New to ggformula? Try the tutorials:
## learnr::run_tutorial("introduction", package = "ggformula")
## learnr::run_tutorial("refining", package = "ggformula")
gf_point(y ~ x, data = sim_data, size = 4) |>
gf_smooth(method = 'lm')
Estimate Regression Coefficients
Even though we know what the truth is, there is error involved in the simulation process; therefore, the population values specified above will not exactly equal the estimated regression coefficients. Below, we estimate those regression coefficients.
sim_lm <- lm(y ~ x, data = sim_data)
coef(sim_lm)
## (Intercept) x
## 5.1260093 0.5002076
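For simple regression, the estimates that lm() returns can also be recovered from the closed-form least squares formulas: the slope is the covariance of x and y divided by the variance of x, and the intercept follows from the means. A self-contained sketch (using its own small simulated data set rather than the sim_data object above, so the numbers differ):

```r
# Closed-form least squares for simple regression:
#   slope     = cov(x, y) / var(x)
#   intercept = mean(y) - slope * mean(x)
set.seed(2023)
x <- rnorm(1000, mean = 100, sd = 20)
y <- 5 + 0.5 * x + rnorm(1000, sd = 10)

slope_hat <- cov(x, y) / var(x)
intercept_hat <- mean(y) - slope_hat * mean(x)

# These match coef(lm(y ~ x)) up to floating-point error.
c(intercept = intercept_hat, slope = slope_hat)
```

This is the same minimizer the grid search below converges toward, just computed analytically.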
Create different combinations of intercept and slope coefficients
The following code generates a grid of candidate intercept and slope values. For each intercept/slope pair, we will compute the sum of squares error shown at the top of the notes, to show that the regression estimates are optimal in that they minimize the sum of squares error.
y_intercept <- seq(0, 15, by = .25)
slope <- seq(0, 1.5, by = .01)
conditions <- rbind(expand.grid(y_intercept = y_intercept,
slope = slope),
coef(sim_lm))
tail(conditions)
## y_intercept slope
## 9207 14.000000 1.5000000
## 9208 14.250000 1.5000000
## 9209 14.500000 1.5000000
## 9210 14.750000 1.5000000
## 9211 15.000000 1.5000000
## 9212 5.126009 0.5002076
dim(conditions)
## [1] 9212 2
Showing Two Combinations
Here we visualize two of the candidate intercept and slope combinations alongside the fitted regression line. Which one seems better for the data?
gf_point(y ~ x, data = sim_data, size = 4) |>
gf_smooth(method = 'lm') |>
gf_abline(slope = ~slope, intercept = ~y_intercept, data = slice(conditions, 1), linetype = 2, size = 2) |>
gf_abline(slope = ~slope, intercept = ~y_intercept, data = slice(conditions, 855), linetype = 2, color = 'lightgreen', size = 2) |>
gf_refine(coord_cartesian(xlim = c(0, 160), ylim = c(0, 120)))
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
Compute Sum of Squares Error
The following code creates a new function that computes the sum of squares error. The function takes two arguments: a combination of intercept and slope values, and the simulated data. The output is sigma, the residual standard error of the line (the square root of the sum of squared deviations divided by n minus 2). The first code chunk below performs the computation for a single condition; the second does it for all of the conditions.
sum_square_error <- function(conditions, sim_data) {
fitted <- conditions[['y_intercept']] + conditions[['slope']] * sim_data[['x']]
deviation <- sim_data[['y']] - fitted
sqrt((sum(deviation^2) / (nrow(sim_data) - 2)))
}
sum_square_error(conditions[1892, ], sim_data)
## [1] 26.25721
summary(sim_lm)$sigma
## [1] 9.738423
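As a sanity check, evaluating sum_square_error() at the fitted lm coefficients should reproduce summary(sim_lm)$sigma exactly, since both compute sqrt(SSE / (n - 2)). A self-contained sketch (its own small data set stands in for sim_data, so the numeric value differs from the 9.74 above):

```r
# Same function as in the notes.
sum_square_error <- function(conditions, sim_data) {
  fitted <- conditions[['y_intercept']] + conditions[['slope']] * sim_data[['x']]
  deviation <- sim_data[['y']] - fitted
  sqrt(sum(deviation^2) / (nrow(sim_data) - 2))
}

# Stand-in data following the same model as the simulation above.
set.seed(1)
sim_data <- data.frame(x = rnorm(200, mean = 100, sd = 20))
sim_data$y <- 5 + 0.5 * sim_data$x + rnorm(200, sd = 10)
sim_lm <- lm(y ~ x, data = sim_data)

# Plug the lm coefficients into the function: the result equals
# the residual standard error reported by summary().
at_lm <- data.frame(y_intercept = coef(sim_lm)[1], slope = coef(sim_lm)[2])
all.equal(sum_square_error(at_lm, sim_data), summary(sim_lm)$sigma)
```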
library(future)
plan(multicore)
## Warning in supportsMulticoreAndRStudio(...): [ONE-TIME WARNING] Forked
## processing ('multicore') is not supported when running R from RStudio because
## it is considered unstable. For more details, how to control forked processing
## or not, and how to silence this warning in future R sessions, see
## ?parallelly::supportsMulticore
conditions$sse <- unlist(lapply(1:nrow(conditions), function(xx) sum_square_error(conditions[xx, ], sim_data)))
head(conditions)
## y_intercept slope sse
## 1 0.00 0 56.72588
## 2 0.25 0 56.48344
## 3 0.50 0 56.24108
## 4 0.75 0 55.99878
## 5 1.00 0 55.75655
## 6 1.25 0 55.51440
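Once the sse column is filled in, confirming that the lm coefficients minimize the criterion is a one-line lookup with which.min(): the row with the smallest error should be the lm coefficients appended as the final row of the grid. A self-contained sketch of that confirmation step (stand-in data in place of the sim_data object above):

```r
# Stand-in data following the same model as the simulation above.
set.seed(2023)
sim_data <- data.frame(x = rnorm(500, mean = 100, sd = 20))
sim_data$y <- 5 + 0.5 * sim_data$x + rnorm(500, sd = 10)
sim_lm <- lm(y ~ x, data = sim_data)

# Grid of candidate coefficients, with the lm estimates as the last row.
conditions <- rbind(
  expand.grid(y_intercept = seq(0, 15, by = .25), slope = seq(0, 1.5, by = .01)),
  coef(sim_lm)
)

sum_square_error <- function(cond, d) {
  sqrt(sum((d$y - (cond$y_intercept + cond$slope * d$x))^2) / (nrow(d) - 2))
}
conditions$sse <- vapply(seq_len(nrow(conditions)),
                         function(i) sum_square_error(conditions[i, ], sim_data),
                         numeric(1))

# The row with the smallest error is the appended lm row: least
# squares really did minimize the criterion over the whole grid.
conditions[which.min(conditions$sse), ]
```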