Changing names in the tidyverse: An example for many regressions

A collaborator posed an interesting R question to me today. She wanted to do several regressions using different outcomes, with models being computed on different strata defined by a combination of experimental design variables. She then just wanted to extract the p-values for the slopes for each of the models, and then filter the strata based on p-value levels.

This seems straighforward, right? Let’s set up a toy example:

library(tidyverse)
## Warning: package 'tibble' was built under R version 3.6.2
## Warning: package 'purrr' was built under R version 3.6.2
dat <- as_tibble(expand.grid(letters[1:4], 1:5))
d <- vector('list', nrow(dat))
set.seed(102)
for(i in 1:nrow(dat)){
  x <- rnorm(100)
  d[[i]] <- tibble(x = x, y1 = 3 - 2*x + rnorm(100), y2 = -4+5*x+rnorm(100))
}
dat <- as_tibble(bind_cols(dat, tibble(dat=d))) %>% unnest()
## Warning: `cols` is now required.
## Please use `cols = c(dat)`
knitr::kable(head(dat), digits=3, format='html')
Var1 Var2 x y1 y2
a 1 0.181 4.260 -3.005
a 1 0.785 0.002 -2.105
a 1 -1.353 3.171 -9.157
a 1 1.983 -0.714 5.966
a 1 1.238 0.352 2.131
a 1 1.201 0.627 1.752

Now we’re going to perform two regressions, one using y1 and one using y2 as the dependent variables, for each stratum defined by Var1 and Var2.

out <- dat %>%
  nest(-Var1, -Var2) %>%
  mutate(model1 = map(data, ~lm(y1~x, data=.)),
         model2 = map(data, ~lm(y2~x, data=.)))
## Warning: All elements of `...` must be named.
## Did you want `data = c(x, y1, y2)`?

Now conceptually, all we do is tidy up the output for the models using the broom package, filter on the rows containg the slope information, and extract the p-values, right? Not quite….

library(broom)
out_problem <- out %>% mutate(output1 = map(model1, ~tidy(.)),
                              output2 = map(model2, ~tidy(.))) %>%
  select(-data, -model1, -model2) %>%
  unnest()
## Warning: `cols` is now required.
## Please use `cols = c(output1, output2)`
names(out_problem)
##  [1] "Var1"       "Var2"       "term"       "estimate"   "std.error" 
##  [6] "statistic"  "p.value"    "term1"      "estimate1"  "std.error1"
## [11] "statistic1" "p.value1"

We’ve got two sets of output, but with the same column names!!! This is a problem! An easy solution would be to preface the column names with the name of the response variable. I struggled with this today until I discovered the secret function.

out_nice <- out %>% mutate(output1 = map(model1, ~tidy(.)),
                           output2 = map(model2, ~tidy(.)),
                           output1 = map(output1, ~setNames(., paste('y1', names(.), sep='_'))),
                           output2 = map(output2, ~setNames(., paste('y2', names(.), sep='_')))) %>%
  select(-data, -model1, -model2) %>%
  unnest()
## Warning: `cols` is now required.
## Please use `cols = c(output1, output2)`

This is a compact representation of the results of both regressions by strata, and we can extract the information we would like very easily. For example, to extract the stratum-specific slope estimates:

out_nice %>% filter(y1_term=='x') %>%
  select(Var1, Var2, ends_with('estimate')) %>%
  knitr::kable(digits=3, format='html')
Var1 Var2 y1_estimate y2_estimate
a 1 -1.897 5.036
b 1 -2.000 5.022
c 1 -1.988 4.888
d 1 -2.089 5.089
a 2 -2.052 5.015
b 2 -1.922 5.004
c 2 -1.936 4.969
d 2 -1.961 4.959
a 3 -2.043 5.017
b 3 -2.045 4.860
c 3 -1.996 5.009
d 3 -1.922 4.894
a 4 -2.000 4.942
b 4 -2.000 4.932
c 4 -2.033 5.042
d 4 -2.165 5.049
a 5 -2.094 5.010
b 5 -1.961 5.122
c 5 -2.106 5.153
d 5 -1.974 5.009
comments powered by Disqus