Bootstrap

The bootstrap is an alternative to the NHST framework already discussed. Its primary benefit is that it comes with fewer assumptions than the NHST framework: the only real assumption of the bootstrap is that the sample was obtained randomly from the population, an assumption the NHST framework already makes. Its primary drawback is that it is computationally expensive, so the procedure can take time to perform.

Bootstrapping Steps

The following are the steps when performing a bootstrap.

  1. Treat the sample data as the population.
  2. Resample, with replacement, from the sample data, ensuring the new sample is the same size as the original.
  3. Estimate the model using the resampled data from step 2.
  4. Repeat steps 2 and 3 many times (e.g., 10,000 or more).
  5. Visualize the distribution of the effect of interest.
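The steps above can be sketched directly in R. The following is a minimal sketch using the built-in mtcars data as a stand-in; the model, the object names, and the 1,000 replications are illustrative only, not part of the example that follows.

```r
set.seed(1)

boot_est <- replicate(1000, {
  # Step 2: resample row indices, with replacement, same size as the original
  idx <- sample(nrow(mtcars), replace = TRUE)

  # Step 3: estimate the model on the resampled data
  coef(lm(mpg ~ wt, data = mtcars[idx, ]))[["wt"]]
})

# Step 5: visualize the distribution of the effect of interest (the slope)
hist(boot_est, main = "Bootstrap distribution of the slope")
```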

Resampling with replacement

What is meant by sampling with replacement? Let’s do an example.

library(tidyverse)
fruit <- data.frame(name = c('watermelon', 'apple', 'orange', 'kumquat', 'grapes', 'cantaloupe', 'kiwi', 'banana')) %>%
    mutate(obs_num = 1:n())

fruit
##         name obs_num
## 1 watermelon       1
## 2      apple       2
## 3     orange       3
## 4    kumquat       4
## 5     grapes       5
## 6 cantaloupe       6
## 7       kiwi       7
## 8     banana       8
slice_sample(fruit, n = nrow(fruit), replace = FALSE)
##         name obs_num
## 1     orange       3
## 2       kiwi       7
## 3 watermelon       1
## 4    kumquat       4
## 5 cantaloupe       6
## 6     grapes       5
## 7      apple       2
## 8     banana       8
slice_sample(fruit, n = nrow(fruit), replace = TRUE)
##         name obs_num
## 1 watermelon       1
## 2     banana       8
## 3       kiwi       7
## 4 cantaloupe       6
## 5       kiwi       7
## 6       kiwi       7
## 7       kiwi       7
## 8 watermelon       1
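The same behavior can be seen with base R’s sample() function; this is a small sketch (the seed is arbitrary):

```r
set.seed(42)

# Without replacement: a permutation, every value appears exactly once
sample(1:8, size = 8, replace = FALSE)

# With replacement: some values can repeat, others can be left out entirely
sample(1:8, size = 8, replace = TRUE)
```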

More practical example

Let’s load some data for a more practical example. The following data come from TIMSS (Trends in International Mathematics and Science Study) on science achievement. The data provided are a subset of the available data and are not intended to be representative. Below is a short description of the data. Please don’t hesitate to send any data-related questions; I am happy to provide additional help on interpreting the data appropriately.

  • IDSTUD: A unique student ID
  • ITBIRTHM: The birth month of the student
  • ITBIRTHY: The birth year of the student
  • ITSEX: The birth sex of the student
  • ASDAGE: The age of the student (at time of test)
  • ASSSCI01: Overall science scale score
  • ASSEAR01: Earth science scale score
  • ASSLIF01: Life science scale score
  • ASSPHY01: Physics scale score
  • ASSKNO01: Science knowing scale score
  • ASSAPP01: Science applying scale score
  • ASSREA01: Science reasoning scale score
library(tidyverse)

timss <- readr::read_csv('https://raw.githubusercontent.com/lebebr01/psqf_6243/main/data/timss_grade4_science.csv') |> 
   filter(ASDAGE < 15)
## Rows: 10029 Columns: 12
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (12): IDSTUD, ITBIRTHM, ITBIRTHY, ITSEX, ASDAGE, ASSSCI01, ASSEAR01, ASS...
head(timss)
## # A tibble: 6 × 12
##   IDSTUD ITBIRTHM ITBIRTHY ITSEX ASDAGE ASSSCI01 ASSEA…¹ ASSLI…² ASSPH…³ ASSKN…⁴
##    <dbl>    <dbl>    <dbl> <dbl>  <dbl>    <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
## 1  10101        6     2005     2   9.92     505.    445.    408.    458.    530.
## 2  10102       10     2005     2   9.58     401.    291.    323.    297.    379.
## 3  10103        1     2005     1  10.3      542.    440.    529.    529.    521.
## 4  10104        6     2005     2   9.92     544.    534.    552.    557.    538.
## 5  10105        6     2005     2   9.92     588.    555.    500.    559.    533.
## 6  10106       12     2005     1   9.42     583.    556.    585.    562.    526.
## # … with 2 more variables: ASSAPP01 <dbl>, ASSREA01 <dbl>, and abbreviated
## #   variable names ¹​ASSEAR01, ²​ASSLIF01, ³​ASSPHY01, ⁴​ASSKNO01

Two Guiding Questions

  1. Is the age of a student related to their overall science scale score?
  2. Is the life science scale score related to the overall science scale score?

Bivariate exploration

It is important to explore relationships bivariately before moving to the modeling phase. To do this, fill in the outcome of interest in place of “%%” below and fill in the appropriate predictor in place of “^^”. You may also want to fill in appropriate axis labels in place of “@@” below.

  • Summarize the bivariate association
library(ggformula)
theme_set(theme_bw(base_size = 18))

gf_point(ASSSCI01 ~ ASDAGE, data = timss, size = 4) |> 
  gf_smooth(method = 'lm', size = 1.5) |>
  gf_smooth(method = 'loess', size = 1.5, color = 'green') |>
  gf_labs(x = "Age",
          y = "Overall Science Score")

Estimate Parameters

To do this, fill in the outcome of interest in place of “%%” below and fill in the appropriate predictor in place of “^^”.

  • Interpret the following parameter estimates in the context of the current problem. (i.e., what do these parameter estimates mean?)
timss_model <- lm(ASSSCI01 ~ ASDAGE, data = timss)

summary(timss_model)
## 
## Call:
## lm(formula = ASSSCI01 ~ ASDAGE, data = timss)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -305.769  -51.969    4.919   56.707  272.217 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  613.563     17.064  35.956  < 2e-16 ***
## ASDAGE        -6.687      1.669  -4.007 6.18e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 80.89 on 10025 degrees of freedom
## Multiple R-squared:  0.001599,	Adjusted R-squared:  0.0015 
## F-statistic: 16.06 on 1 and 10025 DF,  p-value: 6.183e-05
confint(timss_model)
##                  2.5 %     97.5 %
## (Intercept) 580.114047 647.012616
## ASDAGE       -9.957345  -3.415915
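The same workflow applies to the second guiding question. The following is a sketch, assuming the timss data loaded above; the object name timss_model_life is illustrative, and the output is not shown here.

```r
# Guiding question 2: is the life science scale score related to
# the overall science scale score?
timss_model_life <- lm(ASSSCI01 ~ ASSLIF01, data = timss)

summary(timss_model_life)
confint(timss_model_life)
```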

Resample these data

To resample the TIMSS data, the same idea applies. Essentially, we are treating these data as a random sample from some population of students. We again resample, with replacement, which means that some students would likely show up more than once in the resampling procedure (and others not at all).

slice_sample(timss, n = nrow(timss), replace = TRUE) %>% 
  count(IDSTUD)
## # A tibble: 6,356 × 2
##    IDSTUD     n
##     <dbl> <int>
##  1  10105     3
##  2  10108     1
##  3  10110     1
##  4  10111     1
##  5  10112     2
##  6  10113     1
##  7  10116     2
##  8  10117     1
##  9  10118     1
## 10  10119     1
## # … with 6,346 more rows
resamp_lm <- slice_sample(timss, n = nrow(timss), replace = TRUE) %>% 
  lm(ASSSCI01 ~ ASDAGE, data = .)

summary(resamp_lm)
## 
## Call:
## lm(formula = ASSSCI01 ~ ASDAGE, data = .)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -303.996  -51.161    4.522   57.275  271.654 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  580.886     17.169  33.834   <2e-16 ***
## ASDAGE        -3.571      1.680  -2.126   0.0335 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 80.45 on 10025 degrees of freedom
## Multiple R-squared:  0.0004508,	Adjusted R-squared:  0.0003511 
## F-statistic: 4.521 on 1 and 10025 DF,  p-value: 0.03351
confint(resamp_lm)
##                  2.5 %      97.5 %
## (Intercept) 547.232078 614.5395568
## ASDAGE       -6.863595  -0.2788938

Questions to consider

  1. Do the parameter estimates differ from before? Why or why not?
  2. Would you come to substantially different conclusions from the original analysis? Why or why not?

Let’s now repeat this a bunch of times.

set.seed(2022)

resample_timss <- function(data, model_equation) {
  timss_resample <- slice_sample(data, n = nrow(data), replace = TRUE)

  
  lm(model_equation, data = timss_resample) %>%
    coef()
}

Run this function a bunch of times manually. What happens to the estimates? Why is this happening?

resample_timss(data = timss, model_equation = ASSSCI01 ~ ASDAGE)
## (Intercept)      ASDAGE 
##  593.768824   -4.636762

Make sure future.apply is installed

The following code chunk makes sure the future.apply package is installed for parallel processing in R. If you get an error, you can uncomment (delete the # symbol in) the first line of code to (hopefully) install the package yourself.

# install.packages("future.apply")
library(future.apply)
plan(multisession)

Replicate

The following code replicates the analysis many times. Pick an initial value for N and fill in the model equation to match your code above. I will ask you to change this N value later.

timss_coef <- future.apply::future_replicate(10000, resample_timss(data = timss, 
                        model_equation = ASSSCI01 ~ ASDAGE), simplify = FALSE) |>
                        dplyr::bind_rows() |> 
                        tidyr::pivot_longer(cols = everything())

head(timss_coef)
## # A tibble: 6 × 2
##   name         value
##   <chr>        <dbl>
## 1 (Intercept) 610.  
## 2 ASDAGE       -6.56
## 3 (Intercept) 603.  
## 4 ASDAGE       -5.67
## 5 (Intercept) 620.  
## 6 ASDAGE       -7.25

Visualize results

The following code visualizes the results of the analysis above. Explore the following questions.

  1. What does this distribution show/represent?
  2. What are key features of this distribution?
  3. How do these values compare to the original linear regression results?
    • Are there comparable statistics here compared to those?
gf_density(~ value, data = timss_coef) %>%
  gf_facet_wrap(~ name, scales = 'free') %>%
  gf_labs(x = "Regression Estimates")
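To get statistics comparable to the earlier summary() and confint() output, one sketch (assuming the timss_coef object created in the replicate step) summarizes each bootstrap distribution: the standard deviation approximates the coefficient’s standard error, and the 2.5th and 97.5th percentiles form a 95% percentile interval.

```r
# Bootstrap standard error and 95% percentile interval per coefficient
timss_coef |>
  dplyr::group_by(name) |>
  dplyr::summarise(
    boot_se = sd(value),
    lower = quantile(value, 0.025),
    upper = quantile(value, 0.975)
  )
```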

Change the N value

Now, change the N value for the replicate step (you can either add a new cell to copy/paste the code or just change it in the code above).

  • What happens to the resulting figure when the N increases?
  • What value for N seems to be reasonable? Why?