Bootstrap

Bootstrap - class

The bootstrap is an alternative to the NHST framework already discussed. The primary benefit of the bootstrap is that it comes with fewer assumptions then the NHST framework. The only real assumption when doing a bootstrap approach is that the sample is obtained randomly from the population, an assumption already made in the NHST framework. The primary drawback of the bootstrap approach is that it is computationally expensive, therefore, it can take time to perform the procedure.

Bootstrapping Steps

The following are the steps when performing a bootstrap.

Treat the sample data as the population.
Resample, with replacement, from the sample data, ensuring the new sample is the same size as the original.
Estimate the model using the resampled data from step 2.
Repeat steps 2 and 3 many many times (eg, 10,000 or more).
Visualize distribution of effect of interest

Resampling with replacement

What is meant by sampling with replacement? Let’s do an example.

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.4.4     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

fruit <- data.frame(name = c('watermelon', 'apple', 'orange', 
                             'kumquat', 'grapes', 'canteloupe', 
                             'kiwi', 'banana')) |>
    mutate(obs_num = 1:n())

fruit

##         name obs_num
## 1 watermelon       1
## 2      apple       2
## 3     orange       3
## 4    kumquat       4
## 5     grapes       5
## 6 canteloupe       6
## 7       kiwi       7
## 8     banana       8

slice_sample(fruit, n = nrow(fruit), replace = FALSE)

##         name obs_num
## 1      apple       2
## 2     grapes       5
## 3     banana       8
## 4 canteloupe       6
## 5    kumquat       4
## 6     orange       3
## 7       kiwi       7
## 8 watermelon       1

slice_sample(fruit, n = nrow(fruit), replace = TRUE)

##         name obs_num
## 1     banana       8
## 2     grapes       5
## 3    kumquat       4
## 4      apple       2
## 5       kiwi       7
## 6      apple       2
## 7 canteloupe       6
## 8     orange       3

8 ^ 8

## [1] 16777216

4000 ^ 4000

## [1] Inf

More practical example

Let’s load some data to do a more practical example. The following data come from a TIMSS on science achievement. The data provided is a subset of the available data and is not intended to be representative. Below is a short description of the data. Please don’t hesitate to send any data related questions, happy to provide additional help on interpreting the data appropriately.

IDSTUD: A unique student ID
ITBIRTHM: The birth month of the student
ITBIRTHY: The birth year of the student
ITSEX: The birth sex of the student
ASDAGE: The age of the student (at time of test)
ASSSCI01: Overall science scale score
ASSEAR01: Earth science scale score
ASSLIF01: Life science scale score
ASSPHY01: Physics scale score
ASSKNO01: Science knowing scale score
ASSAPP01: Science applying scale score
ASSREA01: Science reasoning scale score

library(tidyverse)

timss <- readr::read_csv('https://raw.githubusercontent.com/lebebr01/psqf_6243/main/data/timss_grade4_science.csv') |> 
   filter(ASDAGE < 15)

## Rows: 10029 Columns: 12
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (12): IDSTUD, ITBIRTHM, ITBIRTHY, ITSEX, ASDAGE, ASSSCI01, ASSEAR01, ASS...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

head(timss)

## # A tibble: 6 × 12
##   IDSTUD ITBIRTHM ITBIRTHY ITSEX ASDAGE ASSSCI01 ASSEAR01 ASSLIF01 ASSPHY01
##    <dbl>    <dbl>    <dbl> <dbl>  <dbl>    <dbl>    <dbl>    <dbl>    <dbl>
## 1  10101        6     2005     2   9.92     505.     445.     408.     458.
## 2  10102       10     2005     2   9.58     401.     291.     323.     297.
## 3  10103        1     2005     1  10.3      542.     440.     529.     529.
## 4  10104        6     2005     2   9.92     544.     534.     552.     557.
## 5  10105        6     2005     2   9.92     588.     555.     500.     559.
## 6  10106       12     2005     1   9.42     583.     556.     585.     562.
## # ℹ 3 more variables: ASSKNO01 <dbl>, ASSAPP01 <dbl>, ASSREA01 <dbl>

Two Guiding Questions

Is the age of a student related to their overall science scale score?
Is the life science scale score related to the overall science scale score?

Bivariate exploration

It is important to explore relationships bivariately before going to the model phase. To do this, fill in the outcome of interest in place of “%%” below and fill in the appropriate predictor in place of “^^”. You may also want to fill in an appropriate axis labels in place of “@@” below.

Summarize the bivariate association

library(ggformula)

## Loading required package: scales

## 
## Attaching package: 'scales'

## The following object is masked from 'package:purrr':
## 
##     discard

## The following object is masked from 'package:readr':
## 
##     col_factor

## Loading required package: ggridges

## 
## New to ggformula?  Try the tutorials: 
## 	learnr::run_tutorial("introduction", package = "ggformula")
## 	learnr::run_tutorial("refining", package = "ggformula")

theme_set(theme_bw(base_size = 18))

gf_point(ASSSCI01 ~ ASDAGE, data = timss, size = 4, alpha = .15) |> 
  gf_smooth(method = 'lm', size = 1.5) |>
  gf_smooth(method = 'loess', size = 1.5, color = 'green') |>
  gf_labs(x = "Age",
          y = "Science Score")

mosaic::cor(ASSSCI01 ~ ASDAGE, data = timss) |> round(5)

## Registered S3 method overwritten by 'mosaic':
##   method                           from   
##   fortify.SpatialPolygonsDataFrame ggplot2

## [1] -0.03999

mosaic::cor(ASSSCI01 ~ ASDAGE, data = filter(timss, ASDAGE >= 8 & ASDAGE <= 12)) |> round(5)

## [1] -0.04989

filter(timss, ASDAGE >= 8 & ASDAGE <= 12) |> nrow()

## [1] 9962

nrow(timss)

## [1] 10027

Estimate Parameters

To do this, fill in the outcome of interest in place of “%%” below and fill in the appropriate predictor in place of “^^”.

Interpret the following parameter estimates in the context of the current problem. (ie., what do these parameter estimates mean?)

timss_model <- lm(ASSSCI01 ~ ASDAGE, data = timss)

summary(timss_model)

## 
## Call:
## lm(formula = ASSSCI01 ~ ASDAGE, data = timss)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -305.769  -51.969    4.919   56.707  272.217 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  613.563     17.064  35.956  < 2e-16 ***
## ASDAGE        -6.687      1.669  -4.007 6.18e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 80.89 on 10025 degrees of freedom
## Multiple R-squared:  0.001599,	Adjusted R-squared:  0.0015 
## F-statistic: 16.06 on 1 and 10025 DF,  p-value: 6.183e-05

confint(timss_model)

##                  2.5 %     97.5 %
## (Intercept) 580.114047 647.012616
## ASDAGE       -9.957345  -3.415915

plot(timss_model, which = 1:5)

Resample these data

For the astronauts data, to resample, a similar idea can be made. Essentially, we are treating these data as a random sample of some population of space missions. We again, would resample, with replacement, which means that multiple missions would likely show up in the resampling procedure.

slice_sample(timss, n = nrow(timss), replace = TRUE) |> 
  count(IDSTUD) |> 
  arrange(-n)

## # A tibble: 6,332 × 2
##     IDSTUD     n
##      <dbl> <int>
##  1  490109     6
##  2  600202     6
##  3  820316     6
##  4 1010227     6
##  5 1420102     6
##  6 1770214     6
##  7   20410     5
##  8   40417     5
##  9  130204     5
## 10  170205     5
## # ℹ 6,322 more rows

resamp_lm <- slice_sample(timss, n = nrow(timss), replace = TRUE) %>%
  lm(ASSSCI01 ~ ASDAGE, data = .)

summary(resamp_lm)

## 
## Call:
## lm(formula = ASSSCI01 ~ ASDAGE, data = .)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -305.374  -49.817    4.064   55.371  272.573 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  612.640     17.499  35.010  < 2e-16 ***
## ASDAGE        -6.633      1.712  -3.874 0.000108 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 80.26 on 10025 degrees of freedom
## Multiple R-squared:  0.001495,	Adjusted R-squared:  0.001395 
## F-statistic: 15.01 on 1 and 10025 DF,  p-value: 0.0001077

confint(resamp_lm)

##                  2.5 %     97.5 %
## (Intercept) 578.338710 646.941301
## ASDAGE       -9.989728  -3.277094

Questions to consider

Do the parameter estimates differ from before? Why or why not?
Would you come to substantially different conclusions from the original analysis? Why or why not?

Let’s now repeat this a bunch of times.

set.seed(2022)

resample_timss <- function(data, model_equation) {
  timss_resample <- slice_sample(data, n = nrow(data), replace = TRUE)
    
  lm(model_equation, data = timss_resample) |>
    coef()
}

Run this function a bunch of times manually, what happens to the estimates? Why is this happening?

resample_timss(data = timss, model_equation = ASSSCI01 ~ ASDAGE)

## (Intercept)      ASDAGE 
##  593.768824   -4.636762

Make sure future.apply is installed

The following code chunk makes sure the future.apply function is installed for parallel processing in R. If you get an error, you can uncomment (delete the # symbol) in the first line of code to (hopefully) install the package yourself.

# install.packages("future.apply")
library(future.apply)

## Loading required package: future

plan(multisession)

Replicate

The following code replicates the analysis many times. Pick an initial value for N and fill in the model equation to match your code above. I will ask you to change this N value later.

timss_coef <- future.apply::future_replicate(10000, resample_timss(data = timss, 
                        model_equation = ASSSCI01 ~ ASDAGE), simplify = FALSE) |>
                        dplyr::bind_rows() |> 
                        tidyr::pivot_longer(cols = everything())

head(timss_coef)

## # A tibble: 6 × 2
##   name         value
##   <chr>        <dbl>
## 1 (Intercept) 610.  
## 2 ASDAGE       -6.56
## 3 (Intercept) 603.  
## 4 ASDAGE       -5.67
## 5 (Intercept) 620.  
## 6 ASDAGE       -7.25

Visualize results

The following code visualizes the results of the analysis above. Explore the following questions.

What does this distribution show/represent?
What are key features of this distribution?
How do these values compare to the original linear regression results?
- Are there comparable statistics here compared to those?

gf_density(~ value, data = timss_coef) |>
  gf_facet_wrap(~ name, scales = 'free') |>
  gf_labs(x = "Regression Estimates")

df_stats(value ~ name, data = timss_coef, mean, median, sd, IQR, length)

##   response        name       mean     median        sd       IQR length
## 1    value (Intercept) 613.863905 614.021876 20.493175 27.714280  10000
## 2    value      ASDAGE  -6.714967  -6.740487  2.005943  2.728819  10000

df_stats(value ~ name, data = timss_coef, quantile(c(0.025, 0.975)))

##   response        name      2.5%      97.5%
## 1    value (Intercept) 574.15367 653.425218
## 2    value      ASDAGE -10.57731  -2.829617

confint(timss_model)

##                  2.5 %     97.5 %
## (Intercept) 580.114047 647.012616
## ASDAGE       -9.957345  -3.415915

summary(timss_model)

## 
## Call:
## lm(formula = ASSSCI01 ~ ASDAGE, data = timss)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -305.769  -51.969    4.919   56.707  272.217 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  613.563     17.064  35.956  < 2e-16 ***
## ASDAGE        -6.687      1.669  -4.007 6.18e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 80.89 on 10025 degrees of freedom
## Multiple R-squared:  0.001599,	Adjusted R-squared:  0.0015 
## F-statistic: 16.06 on 1 and 10025 DF,  p-value: 6.183e-05

(-6.687 - 0) / 2.0075

## [1] -3.331009

pt(-3.331, df = 10025, lower.tail = TRUE) * 2

## [1] 0.0008684728

Change the N value

Now, change the N value for the replicate step (you can either add a new cell to copy/paste the code or just change it in the code above).

What happens to the resulting figure when the N increases?
What value for N seems to be reasonable? Why?

Last updated on Mar 5, 2024