Multiple Categorical Attributes

Add more than one categorical predictor

Let’s explore the ability to add more than one categorical predictor to our model. This will then build into thinking through interactions. Let’s load some new data. These are bike data from Philadelphia program, called Indego. The data loaded below are from Q3 of 2021.

Here is information about the columns in the data.

  • trip_id: Locally unique integer that identifies the trip
  • duration: Length of trip in minutes
  • start_time: The date/time when the trip began, presented in ISO 8601 format in local time
  • end_time: The date/time when the trip ended, presented in ISO 8601 format in local time
  • start_station: The station ID where the trip originated (for station name and more information on each station see the Station Table)
  • start_lat: The latitude of the station where the trip originated
  • start_lon: The longitude of the station where the trip originated
  • end_station: The station ID where the trip terminated (for station name and more information on each station see the Station Table)
  • end_lat: The latitude of the station where the trip terminated
  • end_lon: The longitude of the station where the trip terminated
  • bike_id: Locally unique integer that identifies the bike
  • plan_duration: The number of days that the plan the passholder is using entitles them to ride; 0 is used for a single ride plan (Walk-up)
  • trip_route_category: “Round Trip” for trips starting and ending at the same station or “One Way” for all other trips
  • passholder_type: The name of the passholder’s plan
  • bike_type: The kind of bike used on the trip, including standard pedal-powered bikes or electric assist bikes
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.4.0      ✔ purrr   0.3.5 
## ✔ tibble  3.1.8      ✔ dplyr   1.0.10
## ✔ tidyr   1.2.1      ✔ stringr 1.4.1 
## ✔ readr   2.1.3      ✔ forcats 0.5.2 
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
library(ggformula)
## Loading required package: ggstance
## 
## Attaching package: 'ggstance'
## 
## The following objects are masked from 'package:ggplot2':
## 
##     geom_errorbarh, GeomErrorbarh
## 
## Loading required package: scales
## 
## Attaching package: 'scales'
## 
## The following object is masked from 'package:purrr':
## 
##     discard
## 
## The following object is masked from 'package:readr':
## 
##     col_factor
## 
## Loading required package: ggridges
## 
## New to ggformula?  Try the tutorials: 
##  learnr::run_tutorial("introduction", package = "ggformula")
##  learnr::run_tutorial("refining", package = "ggformula")
theme_set(theme_bw(base_size = 18))

temp <- tempfile()
download.file("https://bicycletransit.wpenginepowered.com/wp-content/uploads/2021/10/indego-trips-2021-q3.zip", temp)
bike <- readr::read_csv(unz(temp, "indego-trips-2021-q3.csv")) %>%
   filter(duration <= 120 & passholder_type != 'Walk-up')
## Warning: One or more parsing issues, call `problems()` on your data frame for details,
## e.g.:
##   dat <- vroom(...)
##   problems(dat)
## Rows: 300432 Columns: 15
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (5): start_time, end_time, trip_route_category, passholder_type, bike_type
## dbl (10): trip_id, duration, start_station, start_lat, start_lon, end_statio...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
unlink(temp)

head(bike)
## # A tibble: 6 × 15
##     trip_id duration start_time  end_t…¹ start…² start…³ start…⁴ end_s…⁵ end_lat
##       <dbl>    <dbl> <chr>       <chr>     <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
## 1 398698761       11 7/1/2021 0… 7/1/20…    3045    39.9   -75.2    3030    39.9
## 2 398698759        4 7/1/2021 0… 7/1/20…    3052    39.9   -75.2    3238    39.9
## 3 398698757       56 7/1/2021 0… 7/1/20…    3192    40.0   -75.1    3161    40.0
## 4 398698755       55 7/1/2021 0… 7/1/20…    3192    40.0   -75.1    3161    40.0
## 5 398698753        5 7/1/2021 0… 7/1/20…    3052    39.9   -75.2    3046    40.0
## 6 398698750       13 7/1/2021 0… 7/1/20…    3213    39.9   -75.2    3039    40.0
## # … with 6 more variables: end_lon <dbl>, bike_id <dbl>, plan_duration <dbl>,
## #   trip_route_category <chr>, passholder_type <chr>, bike_type <chr>, and
## #   abbreviated variable names ¹​end_time, ²​start_station, ³​start_lat,
## #   ⁴​start_lon, ⁵​end_station
dim(bike)
## [1] 297792     15

Suppose we were interested in the following research questions.

  • Does the trip route, type of pass, and bike type explain variation in the duration the bike was rented?

Let’s explore this descriptively.

gf_density(~ duration, data = bike) %>%
  gf_labs(x = "Duration, in minutes")
## Warning: `stat(density)` was deprecated in ggplot2 3.4.0.
## ℹ Please use `after_stat(density)` instead.

gf_violin(bike_type ~ duration, data = bike, fill = 'gray85', draw_quantiles = c(0.1, 0.5, 0.9)) %>%
  gf_labs(x = "Duration, in minutes", y = "") 

gf_violin(passholder_type ~ duration, data = bike, fill = 'gray85', draw_quantiles = c(0.1, 0.5, 0.9)) %>%
  gf_labs(x = "Duration, in minutes", y = "") 

gf_violin(trip_route_category ~ duration, data = bike, fill = 'gray85', draw_quantiles = c(0.1, 0.5, 0.9)) %>%
  gf_labs(x = "Duration, in minutes", y = "") 

Regression model with multiple categorical predictors

Similar to multiple linear regression, including multiple categorical predictors is similar to that case. The simplest model is the additive model, also commonly referred to as the main effect model. Let’s start with two categorical attributes that take on two different values.

\[ duration = \beta_{0} + \beta_{1} bike\_type + \beta_{2} trip\_route\_category + \epsilon \]

bike_lm <- lm(duration ~ bike_type + trip_route_category, data = bike)

broom::glance(bike_lm)
## # A tibble: 1 × 12
##   r.squared adj.r.sq…¹ sigma stati…² p.value    df  logLik    AIC    BIC devia…³
##       <dbl>      <dbl> <dbl>   <dbl>   <dbl> <dbl>   <dbl>  <dbl>  <dbl>   <dbl>
## 1    0.0177     0.0176  15.2   2676.       0     2 -1.23e6 2.47e6 2.47e6  6.90e7
## # … with 2 more variables: df.residual <int>, nobs <int>, and abbreviated
## #   variable names ¹​adj.r.squared, ²​statistic, ³​deviance
broom::tidy(bike_lm) |>
  mutate_if(is.double, round, 3)
## # A tibble: 3 × 5
##   term                          estimate std.error statistic p.value
##   <chr>                            <dbl>     <dbl>     <dbl>   <dbl>
## 1 (Intercept)                     14.9       0.048     310.        0
## 2 bike_typestandard                0.375     0.059       6.4       0
## 3 trip_route_categoryRound Trip    7.25      0.099      73.0       0
two_way_predict <- broom::augment(bike_lm) %>%
   distinct(bike_type, trip_route_category, .fitted)

two_way_predict
## # A tibble: 4 × 3
##   bike_type trip_route_category .fitted
##   <chr>     <chr>                 <dbl>
## 1 standard  One Way                15.3
## 2 electric  One Way                14.9
## 3 electric  Round Trip             22.2
## 4 standard  Round Trip             22.6
mean_duration <- df_stats(duration ~ bike_type + trip_route_category, data =bike, mean, length)
mean_duration
##   response bike_type trip_route_category     mean length
## 1 duration  electric             One Way 15.21481  93807
## 2 duration  standard             One Way 15.16518 178219
## 3 duration  electric          Round Trip 19.52470   9819
## 4 duration  standard          Round Trip 24.19370  15947
gf_point(.fitted ~ bike_type, color = ~ trip_route_category, data = two_way_predict, size = 5) %>%
  gf_line(size = 1.5, group = ~ trip_route_category) %>%
  gf_labs(x = "", y = "Model Predicted Values", color = 'Trip Route') %>%
  gf_point(mean ~ bike_type, color = ~ trip_route_category, data = mean_duration, shape = 15, size = 5)
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.

Back to the coefficients to really understand what these are.

broom::tidy(bike_lm) |>
  mutate_if(is.double, round, 3)
## # A tibble: 3 × 5
##   term                          estimate std.error statistic p.value
##   <chr>                            <dbl>     <dbl>     <dbl>   <dbl>
## 1 (Intercept)                     14.9       0.048     310.        0
## 2 bike_typestandard                0.375     0.059       6.4       0
## 3 trip_route_categoryRound Trip    7.25      0.099      73.0       0
  • Intercept: The average bike rental duration for the reference group (electric bikes and one-way trips).
  • bike_typestandard: This is like a slope, so for a one unit change in bike type (ie, moving from an electric bike to a standard bike), the estimated mean change in bike duration decreased by 0.375 minutes, holding other attributes constant.
  • trip_route_categoryRound Trip: This is again like a slope, for a one unit change in trip route (ie, moving from a one-way trip to a round trip), the estimated mean change in bike duraction increased by 7.246 minutes, holder other attributes constant.

These are like weighted marginal means, where the weighting is occuring due to different sample sizes among the groups.

count(bike, bike_type, trip_route_category)
## # A tibble: 4 × 3
##   bike_type trip_route_category      n
##   <chr>     <chr>                <int>
## 1 electric  One Way              93807
## 2 electric  Round Trip            9819
## 3 standard  One Way             178219
## 4 standard  Round Trip           15947
df_stats(duration ~ bike_type + trip_route_category, data = bike, mean) %>%
  pivot_wider(names_from = 'trip_route_category', values_from = "mean")
## # A tibble: 2 × 4
##   response bike_type `One Way` `Round Trip`
##   <chr>    <chr>         <dbl>        <dbl>
## 1 duration electric       15.2         19.5
## 2 duration standard       15.2         24.2

Interaction Model

The interaction model expands on the main effect only model, by allowing the effect of one category to differ based on elements of the other category. One way to think about this in the case with two categorical attributes, is that the mean change differs based on levels of the second attribute. This model, in the case where the attributes are each two groups, adds one additional term.

\[ duration = \beta_{0} + \beta_{1} bike\_type + \beta_{2} trip\_route\_category + \beta_{3} bike\_type:trip\_route\_category + \epsilon \]

bike_lm_int <- lm(duration ~ bike_type + trip_route_category + bike_type:trip_route_category, data = bike)

broom::glance(bike_lm_int)
## # A tibble: 1 × 12
##   r.squared adj.r.sq…¹ sigma stati…² p.value    df  logLik    AIC    BIC devia…³
##       <dbl>      <dbl> <dbl>   <dbl>   <dbl> <dbl>   <dbl>  <dbl>  <dbl>   <dbl>
## 1    0.0194     0.0194  15.2   1964.       0     3 -1.23e6 2.47e6 2.47e6  6.89e7
## # … with 2 more variables: df.residual <int>, nobs <int>, and abbreviated
## #   variable names ¹​adj.r.squared, ²​statistic, ³​deviance
broom::tidy(bike_lm_int)  |>
  mutate_if(is.double, round, 3)
## # A tibble: 4 × 5
##   term                                           estim…¹ std.e…² stati…³ p.value
##   <chr>                                            <dbl>   <dbl>   <dbl>   <dbl>
## 1 (Intercept)                                      15.2    0.05  306.      0    
## 2 bike_typestandard                                -0.05   0.061  -0.809   0.419
## 3 trip_route_categoryRound Trip                     4.31   0.161  26.7     0    
## 4 bike_typestandard:trip_route_categoryRound Tr…    4.72   0.205  23.1     0    
## # … with abbreviated variable names ¹​estimate, ²​std.error, ³​statistic
two_way_predict_int <- broom::augment(bike_lm_int) %>%
   distinct(bike_type, trip_route_category, .fitted)

two_way_predict_int
## # A tibble: 4 × 3
##   bike_type trip_route_category .fitted
##   <chr>     <chr>                 <dbl>
## 1 standard  One Way                15.2
## 2 electric  One Way                15.2
## 3 electric  Round Trip             19.5
## 4 standard  Round Trip             24.2
gf_point(.fitted ~ bike_type, color = ~ trip_route_category, data = two_way_predict_int, size = 5) %>%
  gf_line(size = 1.5, group = ~ trip_route_category) %>%
  gf_labs(x = "", y = "Model Predicted Values", color = 'Trip Route') %>%
  gf_point(mean ~ bike_type, color = ~ trip_route_category, data = mean_duration, shape = 15, size = 5)

Back to the coefficients and what do these mean here.

broom::tidy(bike_lm_int)  |>
  mutate_if(is.double, round, 3)
## # A tibble: 4 × 5
##   term                                           estim…¹ std.e…² stati…³ p.value
##   <chr>                                            <dbl>   <dbl>   <dbl>   <dbl>
## 1 (Intercept)                                      15.2    0.05  306.      0    
## 2 bike_typestandard                                -0.05   0.061  -0.809   0.419
## 3 trip_route_categoryRound Trip                     4.31   0.161  26.7     0    
## 4 bike_typestandard:trip_route_categoryRound Tr…    4.72   0.205  23.1     0    
## # … with abbreviated variable names ¹​estimate, ²​std.error, ³​statistic
  • Intercept: The average bike rental duration for the reference group (electric bikes and one-way trips).
  • bike_typestandard: This is like a slope, so for a one unit change in bike type (ie, moving from an electric bike to a standard bike), the estimated mean change in bike duration decreased by .05 minutes, holding other attributes constant.
  • trip_route_categoryRound Trip: This is again like a slope, for a one unit change in trip route (ie, moving from a one-way trip to a round trip), the estimated mean change in bike duraction increased by 4.3 minutes, holder other attributes constant.
  • bike_typestandard:trip_route_categoryRound Trip: This is the interaction effect and is the additional mean level change for standard bikes and round trips, holding other attributes constant.

We can get the estimated means for the 4 groups as follows:

\[ \hat{\mu}_{elec-1way} = 15.2 \] \[ \hat{\mu}_{elec-RT} = 15.2 + 4.3 \] \[ \hat{\mu}_{stand-1way} = 15.2 - .05 \] \[ \hat{\mu}_{stand-RT} = 15.2 - 0.05 + 4.3 + 4.7 \]

Example

What about the example with two categorical attributes bike type and passholder type? Fit this model below and interpret the parameter estimates, both one without and with interaction effects.

Second order (three-way) interaction

The more attributes that interact with one another makes the model more complicated and difficult to interpret. Still, let’s try a second order or three way interaction.

bike_lm_3way <- lm(duration ~ bike_type * trip_route_category * passholder_type, data = bike)

broom::glance(bike_lm_3way)
## # A tibble: 1 × 12
##   r.squared adj.r.sq…¹ sigma stati…² p.value    df  logLik    AIC    BIC devia…³
##       <dbl>      <dbl> <dbl>   <dbl>   <dbl> <dbl>   <dbl>  <dbl>  <dbl>   <dbl>
## 1    0.0617     0.0617  14.9   1780.       0    11 -1.23e6 2.45e6 2.45e6  6.59e7
## # … with 2 more variables: df.residual <int>, nobs <int>, and abbreviated
## #   variable names ¹​adj.r.squared, ²​statistic, ³​deviance
broom::tidy(bike_lm_3way) |>
  mutate_if(is.double, round, 3)
## # A tibble: 12 × 5
##    term                                          estim…¹ std.e…² stati…³ p.value
##    <chr>                                           <dbl>   <dbl>   <dbl>   <dbl>
##  1 (Intercept)                                    23.9     0.132 181.      0    
##  2 bike_typestandard                              -0.8     0.173  -4.63    0    
##  3 trip_route_categoryRound Trip                   0.426   0.314   1.36    0.174
##  4 passholder_typeIndego30                        -9.67    0.144 -67.3     0    
##  5 passholder_typeIndego365                      -11.9     0.185 -64.0     0    
##  6 bike_typestandard:trip_route_categoryRound T…   8.04    0.424  19.0     0    
##  7 bike_typestandard:passholder_typeIndego30       1.42    0.187   7.60    0    
##  8 bike_typestandard:passholder_typeIndego365      0.738   0.233   3.17    0.002
##  9 trip_route_categoryRound Trip:passholder_typ…   3.64    0.369   9.86    0    
## 10 trip_route_categoryRound Trip:passholder_typ…  -0.249   0.661  -0.377   0.706
## 11 bike_typestandard:trip_route_categoryRound T…  -3.78    0.49   -7.72    0    
## 12 bike_typestandard:trip_route_categoryRound T…  -2.47    0.801  -3.08    0.002
## # … with abbreviated variable names ¹​estimate, ²​std.error, ³​statistic

Interpreting these coefficients can be challenging, visualizing them can be a more effective way to undersand the impact these may have. The following steps will be used to visualize these model results:

  1. Generate model-implied or predicted values for each combination of model values
  2. Plot those model implied means
three_way_predict_int <- broom::augment(bike_lm_3way) %>%
   distinct(bike_type, trip_route_category, passholder_type, .fitted)

three_way_predict_int
## # A tibble: 12 × 4
##    bike_type trip_route_category passholder_type .fitted
##    <chr>     <chr>               <chr>             <dbl>
##  1 standard  One Way             Indego30           14.8
##  2 electric  One Way             Indego30           14.2
##  3 standard  One Way             Indego365          12.0
##  4 electric  Round Trip          Day Pass           24.3
##  5 standard  One Way             Day Pass           23.1
##  6 standard  Round Trip          Indego30           23.2
##  7 standard  Round Trip          Day Pass           31.5
##  8 electric  One Way             Day Pass           23.9
##  9 electric  Round Trip          Indego30           18.3
## 10 electric  One Way             Indego365          12.0
## 11 standard  Round Trip          Indego365          17.7
## 12 electric  Round Trip          Indego365          12.2
gf_point(.fitted ~ bike_type, color = ~ trip_route_category, data = three_way_predict_int, size = 5) %>%
  gf_line(size = 1.5, group = ~ trip_route_category) %>%
  gf_facet_wrap(~ passholder_type) %>%
  gf_labs(y = "", x = "Model Predicted Values", color = 'Trip Route')

Previous
Next