Bootstrap

Bootstrap

The bootstrap is an alternative to the NHST framework already discussed. The primary benefit of the bootstrap is that it comes with fewer assumptions then the NHST framework. The only real assumption when doing a bootstrap approach is that the sample is obtained randomly from the population, an assumption already made in the NHST framework. The primary drawback of the bootstrap approach is that it is computationally expensive, therefore, it can take time to perform the procedure.

Bootstrapping Steps

The following are the steps when performing a bootstrap.

  1. Treat the sample data as the population.
  2. Resample, with replacement, from the sample data, ensuring the new sample is the same size as the original.
  3. Estimate the model using the resampled data from step 2.
  4. Repeat steps 2 and 3 many many times (eg, 10,000 or more).
  5. Visualize distribution of effect of interest

Resampling with replacement

What is meant by sampling with replacement? Let’s do an example.

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.4.4     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
fruit <- data.frame(name = c('watermelon', 'apple', 'orange', 'kumquat', 'grapes', 'canteloupe', 'kiwi', 'banana')) |>
    mutate(obs_num = 1:n())

fruit
##         name obs_num
## 1 watermelon       1
## 2      apple       2
## 3     orange       3
## 4    kumquat       4
## 5     grapes       5
## 6 canteloupe       6
## 7       kiwi       7
## 8     banana       8
slice_sample(fruit, n = nrow(fruit), replace = FALSE)
##         name obs_num
## 1    kumquat       4
## 2     orange       3
## 3 watermelon       1
## 4       kiwi       7
## 5     grapes       5
## 6 canteloupe       6
## 7     banana       8
## 8      apple       2
slice_sample(fruit, n = nrow(fruit), replace = TRUE)
##         name obs_num
## 1       kiwi       7
## 2 canteloupe       6
## 3       kiwi       7
## 4    kumquat       4
## 5 watermelon       1
## 6 canteloupe       6
## 7     orange       3
## 8 watermelon       1

More practical example

Let’s load some data to do a more practical example. The following data come from a TIMSS on science achievement. The data provided is a subset of the available data and is not intended to be representative. Below is a short description of the data. Please don’t hesitate to send any data related questions, happy to provide additional help on interpreting the data appropriately.

  • IDSTUD: A unique student ID
  • ITBIRTHM: The birth month of the student
  • ITBIRTHY: The birth year of the student
  • ITSEX: The birth sex of the student
  • ASDAGE: The age of the student (at time of test)
  • ASSSCI01: Overall science scale score
  • ASSEAR01: Earth science scale score
  • ASSLIF01: Life science scale score
  • ASSPHY01: Physics scale score
  • ASSKNO01: Science knowing scale score
  • ASSAPP01: Science applying scale score
  • ASSREA01: Science reasoning scale score
library(tidyverse)

timss <- readr::read_csv('https://raw.githubusercontent.com/lebebr01/psqf_6243/main/data/timss_grade4_science.csv') |> 
   filter(ASDAGE < 15)
## Rows: 10029 Columns: 12
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (12): IDSTUD, ITBIRTHM, ITBIRTHY, ITSEX, ASDAGE, ASSSCI01, ASSEAR01, ASS...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(timss)
## # A tibble: 6 × 12
##   IDSTUD ITBIRTHM ITBIRTHY ITSEX ASDAGE ASSSCI01 ASSEAR01 ASSLIF01 ASSPHY01
##    <dbl>    <dbl>    <dbl> <dbl>  <dbl>    <dbl>    <dbl>    <dbl>    <dbl>
## 1  10101        6     2005     2   9.92     505.     445.     408.     458.
## 2  10102       10     2005     2   9.58     401.     291.     323.     297.
## 3  10103        1     2005     1  10.3      542.     440.     529.     529.
## 4  10104        6     2005     2   9.92     544.     534.     552.     557.
## 5  10105        6     2005     2   9.92     588.     555.     500.     559.
## 6  10106       12     2005     1   9.42     583.     556.     585.     562.
## # ℹ 3 more variables: ASSKNO01 <dbl>, ASSAPP01 <dbl>, ASSREA01 <dbl>

Two Guiding Questions

  1. Is the age of a student related to their overall science scale score?
  2. Is the life science scale score related to the overall science scale score?

Bivariate exploration

It is important to explore relationships bivariately before going to the model phase. To do this, fill in the outcome of interest in place of “%%” below and fill in the appropriate predictor in place of “^^”. You may also want to fill in an appropriate axis labels in place of “@@” below.

  • Summarize the bivariate association
library(ggformula)

theme_set(theme_bw(base_size = 18))

gf_point(%% ~ ^^, data = timss, size = 4) |> 
  gf_smooth(method = 'lm', size = 1.5) |>
  gf_smooth(method = 'loess', size = 1.5, color = 'green') |>
  gf_labs(x = "@@",
          y = "@@")

Estimate Parameters

To do this, fill in the outcome of interest in place of “%%” below and fill in the appropriate predictor in place of “^^”.

  • Interpret the following parameter estimates in the context of the current problem. (ie., what do these parameter estimates mean?)
timss_model <- lm(%% ~ ^^, data = timss)

summary(timss_model)
confint(timss_model)

Resample these data

For the astronauts data, to resample, a similar idea can be made. Essentially, we are treating these data as a random sample of some population of space missions. We again, would resample, with replacement, which means that multiple missions would likely show up in the resampling procedure.

slice_sample(timss, n = nrow(timss), replace = TRUE) |> 
  count(IDSTUD)
resamp_lm <- slice_sample(timss, n = nrow(timss), replace = TRUE) |> 
  lm(%% ~ ^^, data = .)

summary(resamp_lm)
confint(resamp_lm)

Questions to consider

  1. Do the parameter estimates differ from before? Why or why not?
  2. Would you come to substantially different conclusions from the original analysis? Why or why not?

Let’s now repeat this a bunch of times.

set.seed(2022)

resample_timss <- function(data, model_equation) {
  timss_resample <- slice_sample(data, n = nrow(data), replace = TRUE)

  
  lm(model_equation, data = timss_resample) |>
    coef()
}

Run this function a bunch of times manually, what happens to the estimates? Why is this happening?

resample_timss(data = timss, model_equation = %% ~ ^^)

Make sure future.apply is installed

The following code chunk makes sure the future.apply function is installed for parallel processing in R. If you get an error, you can uncomment (delete the # symbol) in the first line of code to (hopefully) install the package yourself.

# install.packages("future.apply")
library(future.apply)

plan(multisession)

Replicate

The following code replicates the analysis many times. Pick an initial value for N and fill in the model equation to match your code above. I will ask you to change this N value later.

timss_coef <- future.apply::future_replicate(N, resample_timss(data = timss, 
                        model_equation = %% ~ ^^), simplify = FALSE) |>
                        dplyr::bind_rows() |> 
                        tidyr::pivot_longer(cols = everything())

head(timss_coef)

Visualize results

The following code visualizes the results of the analysis above. Explore the following questions.

  1. What does this distribution show/represent?
  2. What are key features of this distribution?
  3. How do these values compare to the original linear regression results?
    • Are there comparable statistics here compared to those?
gf_density(~ value, data = timss_coef) |>
  gf_facet_wrap(~ name, scales = 'free') |>
  gf_labs(x = "Regression Estimates")

Change the N value

Now, change the N value for the replicate step (you can either add a new cell to copy/paste the code or just change it in the code above).

  • What happens to the resulting figure when the N increases?
  • What value for N seems to be reasonable? Why?
Previous
Next