Multiple Categorical Attributes
Add more than one categorical predictor
Let’s explore the ability to add more than one categorical predictor to our model. This will then build into thinking through interactions. Let’s load some new data. These are bike data from Philadelphia program, called Indego. The data loaded below are from Q3 of 2021.
Here is information about the columns in the data.
- trip_id: Locally unique integer that identifies the trip
- duration: Length of trip in minutes
- start_time: The date/time when the trip began, presented in ISO 8601 format in local time
- end_time: The date/time when the trip ended, presented in ISO 8601 format in local time
- start_station: The station ID where the trip originated (for station name and more information on each station see the Station Table)
- start_lat: The latitude of the station where the trip originated
- start_lon: The longitude of the station where the trip originated
- end_station: The station ID where the trip terminated (for station name and more information on each station see the Station Table)
- end_lat: The latitude of the station where the trip terminated
- end_lon: The longitude of the station where the trip terminated
- bike_id: Locally unique integer that identifies the bike
- plan_duration: The number of days that the plan the passholder is using entitles them to ride; 0 is used for a single ride plan (Walk-up)
- trip_route_category: “Round Trip” for trips starting and ending at the same station or “One Way” for all other trips
- passholder_type: The name of the passholder’s plan
- bike_type: The kind of bike used on the trip, including standard pedal-powered bikes or electric assist bikes
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.4.0 ✔ purrr 0.3.5
## ✔ tibble 3.1.8 ✔ dplyr 1.0.10
## ✔ tidyr 1.2.1 ✔ stringr 1.4.1
## ✔ readr 2.1.3 ✔ forcats 0.5.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
library(ggformula)
## Loading required package: ggstance
##
## Attaching package: 'ggstance'
##
## The following objects are masked from 'package:ggplot2':
##
## geom_errorbarh, GeomErrorbarh
##
## Loading required package: scales
##
## Attaching package: 'scales'
##
## The following object is masked from 'package:purrr':
##
## discard
##
## The following object is masked from 'package:readr':
##
## col_factor
##
## Loading required package: ggridges
##
## New to ggformula? Try the tutorials:
## learnr::run_tutorial("introduction", package = "ggformula")
## learnr::run_tutorial("refining", package = "ggformula")
theme_set(theme_bw(base_size = 18))
temp <- tempfile()
download.file("https://bicycletransit.wpenginepowered.com/wp-content/uploads/2021/10/indego-trips-2021-q3.zip", temp)
bike <- readr::read_csv(unz(temp, "indego-trips-2021-q3.csv")) %>%
filter(duration <= 120 & passholder_type != 'Walk-up')
## Warning: One or more parsing issues, call `problems()` on your data frame for details,
## e.g.:
## dat <- vroom(...)
## problems(dat)
## Rows: 300432 Columns: 15
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (5): start_time, end_time, trip_route_category, passholder_type, bike_type
## dbl (10): trip_id, duration, start_station, start_lat, start_lon, end_statio...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
unlink(temp)
head(bike)
## # A tibble: 6 × 15
## trip_id duration start_time end_t…¹ start…² start…³ start…⁴ end_s…⁵ end_lat
## <dbl> <dbl> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 398698761 11 7/1/2021 0… 7/1/20… 3045 39.9 -75.2 3030 39.9
## 2 398698759 4 7/1/2021 0… 7/1/20… 3052 39.9 -75.2 3238 39.9
## 3 398698757 56 7/1/2021 0… 7/1/20… 3192 40.0 -75.1 3161 40.0
## 4 398698755 55 7/1/2021 0… 7/1/20… 3192 40.0 -75.1 3161 40.0
## 5 398698753 5 7/1/2021 0… 7/1/20… 3052 39.9 -75.2 3046 40.0
## 6 398698750 13 7/1/2021 0… 7/1/20… 3213 39.9 -75.2 3039 40.0
## # … with 6 more variables: end_lon <dbl>, bike_id <dbl>, plan_duration <dbl>,
## # trip_route_category <chr>, passholder_type <chr>, bike_type <chr>, and
## # abbreviated variable names ¹end_time, ²start_station, ³start_lat,
## # ⁴start_lon, ⁵end_station
dim(bike)
## [1] 297792 15
Suppose we were interested in the following research questions.
- Does the trip route, type of pass, and bike type explain variation in the duration the bike was rented?
Let’s explore this descriptively.
gf_density(~ duration, data = bike) %>%
gf_labs(x = "Duration, in minutes")
## Warning: `stat(density)` was deprecated in ggplot2 3.4.0.
## ℹ Please use `after_stat(density)` instead.
gf_violin(bike_type ~ duration, data = bike, fill = 'gray85', draw_quantiles = c(0.1, 0.5, 0.9)) %>%
gf_labs(x = "Duration, in minutes", y = "")
gf_violin(passholder_type ~ duration, data = bike, fill = 'gray85', draw_quantiles = c(0.1, 0.5, 0.9)) %>%
gf_labs(x = "Duration, in minutes", y = "")
gf_violin(trip_route_category ~ duration, data = bike, fill = 'gray85', draw_quantiles = c(0.1, 0.5, 0.9)) %>%
gf_labs(x = "Duration, in minutes", y = "")
Regression model with multiple categorical predictors
Similar to multiple linear regression, including multiple categorical predictors is similar to that case. The simplest model is the additive model, also commonly referred to as the main effect model. Let’s start with two categorical attributes that take on two different values.
\[ duration = \beta_{0} + \beta_{1} bike\_type + \beta_{2} trip\_route\_category + \epsilon \]
bike_lm <- lm(duration ~ bike_type + trip_route_category, data = bike)
broom::glance(bike_lm)
## # A tibble: 1 × 12
## r.squared adj.r.sq…¹ sigma stati…² p.value df logLik AIC BIC devia…³
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0.0177 0.0176 15.2 2676. 0 2 -1.23e6 2.47e6 2.47e6 6.90e7
## # … with 2 more variables: df.residual <int>, nobs <int>, and abbreviated
## # variable names ¹adj.r.squared, ²statistic, ³deviance
broom::tidy(bike_lm) |>
mutate_if(is.double, round, 3)
## # A tibble: 3 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 14.9 0.048 310. 0
## 2 bike_typestandard 0.375 0.059 6.4 0
## 3 trip_route_categoryRound Trip 7.25 0.099 73.0 0
two_way_predict <- broom::augment(bike_lm) %>%
distinct(bike_type, trip_route_category, .fitted)
two_way_predict
## # A tibble: 4 × 3
## bike_type trip_route_category .fitted
## <chr> <chr> <dbl>
## 1 standard One Way 15.3
## 2 electric One Way 14.9
## 3 electric Round Trip 22.2
## 4 standard Round Trip 22.6
mean_duration <- df_stats(duration ~ bike_type + trip_route_category, data =bike, mean, length)
mean_duration
## response bike_type trip_route_category mean length
## 1 duration electric One Way 15.21481 93807
## 2 duration standard One Way 15.16518 178219
## 3 duration electric Round Trip 19.52470 9819
## 4 duration standard Round Trip 24.19370 15947
gf_point(.fitted ~ bike_type, color = ~ trip_route_category, data = two_way_predict, size = 5) %>%
gf_line(size = 1.5, group = ~ trip_route_category) %>%
gf_labs(x = "", y = "Model Predicted Values", color = 'Trip Route') %>%
gf_point(mean ~ bike_type, color = ~ trip_route_category, data = mean_duration, shape = 15, size = 5)
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
Back to the coefficients to really understand what these are.
broom::tidy(bike_lm) |>
mutate_if(is.double, round, 3)
## # A tibble: 3 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 14.9 0.048 310. 0
## 2 bike_typestandard 0.375 0.059 6.4 0
## 3 trip_route_categoryRound Trip 7.25 0.099 73.0 0
- Intercept: The average bike rental duration for the reference group (electric bikes and one-way trips).
- bike_typestandard: This is like a slope, so for a one unit change in bike type (ie, moving from an electric bike to a standard bike), the estimated mean change in bike duration decreased by 0.375 minutes, holding other attributes constant.
- trip_route_categoryRound Trip: This is again like a slope, for a one unit change in trip route (ie, moving from a one-way trip to a round trip), the estimated mean change in bike duraction increased by 7.246 minutes, holder other attributes constant.
These are like weighted marginal means, where the weighting is occuring due to different sample sizes among the groups.
count(bike, bike_type, trip_route_category)
## # A tibble: 4 × 3
## bike_type trip_route_category n
## <chr> <chr> <int>
## 1 electric One Way 93807
## 2 electric Round Trip 9819
## 3 standard One Way 178219
## 4 standard Round Trip 15947
df_stats(duration ~ bike_type + trip_route_category, data = bike, mean) %>%
pivot_wider(names_from = 'trip_route_category', values_from = "mean")
## # A tibble: 2 × 4
## response bike_type `One Way` `Round Trip`
## <chr> <chr> <dbl> <dbl>
## 1 duration electric 15.2 19.5
## 2 duration standard 15.2 24.2
Interaction Model
The interaction model expands on the main effect only model, by allowing the effect of one category to differ based on elements of the other category. One way to think about this in the case with two categorical attributes, is that the mean change differs based on levels of the second attribute. This model, in the case where the attributes are each two groups, adds one additional term.
\[ duration = \beta_{0} + \beta_{1} bike\_type + \beta_{2} trip\_route\_category + \beta_{3} bike\_type:trip\_route\_category + \epsilon \]
bike_lm_int <- lm(duration ~ bike_type + trip_route_category + bike_type:trip_route_category, data = bike)
broom::glance(bike_lm_int)
## # A tibble: 1 × 12
## r.squared adj.r.sq…¹ sigma stati…² p.value df logLik AIC BIC devia…³
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0.0194 0.0194 15.2 1964. 0 3 -1.23e6 2.47e6 2.47e6 6.89e7
## # … with 2 more variables: df.residual <int>, nobs <int>, and abbreviated
## # variable names ¹adj.r.squared, ²statistic, ³deviance
broom::tidy(bike_lm_int) |>
mutate_if(is.double, round, 3)
## # A tibble: 4 × 5
## term estim…¹ std.e…² stati…³ p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 15.2 0.05 306. 0
## 2 bike_typestandard -0.05 0.061 -0.809 0.419
## 3 trip_route_categoryRound Trip 4.31 0.161 26.7 0
## 4 bike_typestandard:trip_route_categoryRound Tr… 4.72 0.205 23.1 0
## # … with abbreviated variable names ¹estimate, ²std.error, ³statistic
two_way_predict_int <- broom::augment(bike_lm_int) %>%
distinct(bike_type, trip_route_category, .fitted)
two_way_predict_int
## # A tibble: 4 × 3
## bike_type trip_route_category .fitted
## <chr> <chr> <dbl>
## 1 standard One Way 15.2
## 2 electric One Way 15.2
## 3 electric Round Trip 19.5
## 4 standard Round Trip 24.2
gf_point(.fitted ~ bike_type, color = ~ trip_route_category, data = two_way_predict_int, size = 5) %>%
gf_line(size = 1.5, group = ~ trip_route_category) %>%
gf_labs(x = "", y = "Model Predicted Values", color = 'Trip Route') %>%
gf_point(mean ~ bike_type, color = ~ trip_route_category, data = mean_duration, shape = 15, size = 5)
Back to the coefficients and what do these mean here.
broom::tidy(bike_lm_int) |>
mutate_if(is.double, round, 3)
## # A tibble: 4 × 5
## term estim…¹ std.e…² stati…³ p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 15.2 0.05 306. 0
## 2 bike_typestandard -0.05 0.061 -0.809 0.419
## 3 trip_route_categoryRound Trip 4.31 0.161 26.7 0
## 4 bike_typestandard:trip_route_categoryRound Tr… 4.72 0.205 23.1 0
## # … with abbreviated variable names ¹estimate, ²std.error, ³statistic
- Intercept: The average bike rental duration for the reference group (electric bikes and one-way trips).
- bike_typestandard: This is like a slope, so for a one unit change in bike type (ie, moving from an electric bike to a standard bike), the estimated mean change in bike duration decreased by .05 minutes, holding other attributes constant.
- trip_route_categoryRound Trip: This is again like a slope, for a one unit change in trip route (ie, moving from a one-way trip to a round trip), the estimated mean change in bike duraction increased by 4.3 minutes, holder other attributes constant.
- bike_typestandard:trip_route_categoryRound Trip: This is the interaction effect and is the additional mean level change for standard bikes and round trips, holding other attributes constant.
We can get the estimated means for the 4 groups as follows:
\[ \hat{\mu}_{elec-1way} = 15.2 \] \[ \hat{\mu}_{elec-RT} = 15.2 + 4.3 \] \[ \hat{\mu}_{stand-1way} = 15.2 - .05 \] \[ \hat{\mu}_{stand-RT} = 15.2 - 0.05 + 4.3 + 4.7 \]
Example
What about the example with two categorical attributes bike type and passholder type? Fit this model below and interpret the parameter estimates, both one without and with interaction effects.
Second order (three-way) interaction
The more attributes that interact with one another makes the model more complicated and difficult to interpret. Still, let’s try a second order or three way interaction.
bike_lm_3way <- lm(duration ~ bike_type * trip_route_category * passholder_type, data = bike)
broom::glance(bike_lm_3way)
## # A tibble: 1 × 12
## r.squared adj.r.sq…¹ sigma stati…² p.value df logLik AIC BIC devia…³
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0.0617 0.0617 14.9 1780. 0 11 -1.23e6 2.45e6 2.45e6 6.59e7
## # … with 2 more variables: df.residual <int>, nobs <int>, and abbreviated
## # variable names ¹adj.r.squared, ²statistic, ³deviance
broom::tidy(bike_lm_3way) |>
mutate_if(is.double, round, 3)
## # A tibble: 12 × 5
## term estim…¹ std.e…² stati…³ p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 23.9 0.132 181. 0
## 2 bike_typestandard -0.8 0.173 -4.63 0
## 3 trip_route_categoryRound Trip 0.426 0.314 1.36 0.174
## 4 passholder_typeIndego30 -9.67 0.144 -67.3 0
## 5 passholder_typeIndego365 -11.9 0.185 -64.0 0
## 6 bike_typestandard:trip_route_categoryRound T… 8.04 0.424 19.0 0
## 7 bike_typestandard:passholder_typeIndego30 1.42 0.187 7.60 0
## 8 bike_typestandard:passholder_typeIndego365 0.738 0.233 3.17 0.002
## 9 trip_route_categoryRound Trip:passholder_typ… 3.64 0.369 9.86 0
## 10 trip_route_categoryRound Trip:passholder_typ… -0.249 0.661 -0.377 0.706
## 11 bike_typestandard:trip_route_categoryRound T… -3.78 0.49 -7.72 0
## 12 bike_typestandard:trip_route_categoryRound T… -2.47 0.801 -3.08 0.002
## # … with abbreviated variable names ¹estimate, ²std.error, ³statistic
Interpreting these coefficients can be challenging, visualizing them can be a more effective way to undersand the impact these may have. The following steps will be used to visualize these model results:
- Generate model-implied or predicted values for each combination of model values
- Plot those model implied means
three_way_predict_int <- broom::augment(bike_lm_3way) %>%
distinct(bike_type, trip_route_category, passholder_type, .fitted)
three_way_predict_int
## # A tibble: 12 × 4
## bike_type trip_route_category passholder_type .fitted
## <chr> <chr> <chr> <dbl>
## 1 standard One Way Indego30 14.8
## 2 electric One Way Indego30 14.2
## 3 standard One Way Indego365 12.0
## 4 electric Round Trip Day Pass 24.3
## 5 standard One Way Day Pass 23.1
## 6 standard Round Trip Indego30 23.2
## 7 standard Round Trip Day Pass 31.5
## 8 electric One Way Day Pass 23.9
## 9 electric Round Trip Indego30 18.3
## 10 electric One Way Indego365 12.0
## 11 standard Round Trip Indego365 17.7
## 12 electric Round Trip Indego365 12.2
gf_point(.fitted ~ bike_type, color = ~ trip_route_category, data = three_way_predict_int, size = 5) %>%
gf_line(size = 1.5, group = ~ trip_route_category) %>%
gf_facet_wrap(~ passholder_type) %>%
gf_labs(y = "", x = "Model Predicted Values", color = 'Trip Route')