Changing names in the tidyverse: An example for many regressions

Mar 9, 2017 3 min read R

A collaborator posed an interesting R question to me today. She wanted to do several regressions using different outcomes, with models being computed on different strata defined by a combination of experimental design variables. She then just wanted to extract the p-values for the slopes for each of the models, and then filter the strata based on p-value levels.

This seems straighforward, right? Let’s set up a toy example:

library(tidyverse)

## Warning: package 'tibble' was built under R version 3.6.2

## Warning: package 'purrr' was built under R version 3.6.2

dat <- as_tibble(expand.grid(letters[1:4], 1:5))
d <- vector('list', nrow(dat))
set.seed(102)
for(i in 1:nrow(dat)){
  x <- rnorm(100)
  d[[i]] <- tibble(x = x, y1 = 3 - 2*x + rnorm(100), y2 = -4+5*x+rnorm(100))
}
dat <- as_tibble(bind_cols(dat, tibble(dat=d))) %>% unnest()

## Warning: `cols` is now required.
## Please use `cols = c(dat)`

knitr::kable(head(dat), digits=3, format='html')

Var1	Var2	x	y1	y2
a	1	0.181	4.260	-3.005
a	1	0.785	0.002	-2.105
a	1	-1.353	3.171	-9.157
a	1	1.983	-0.714	5.966
a	1	1.238	0.352	2.131
a	1	1.201	0.627	1.752

Now we’re going to perform two regressions, one using y1 and one using y2 as the dependent variables, for each stratum defined by Var1 and Var2.

out <- dat %>%
  nest(-Var1, -Var2) %>%
  mutate(model1 = map(data, ~lm(y1~x, data=.)),
         model2 = map(data, ~lm(y2~x, data=.)))

## Warning: All elements of `...` must be named.
## Did you want `data = c(x, y1, y2)`?

Now conceptually, all we do is tidy up the output for the models using the broom package, filter on the rows containg the slope information, and extract the p-values, right? Not quite….

library(broom)
out_problem <- out %>% mutate(output1 = map(model1, ~tidy(.)),
                              output2 = map(model2, ~tidy(.))) %>%
  select(-data, -model1, -model2) %>%
  unnest()

## Warning: `cols` is now required.
## Please use `cols = c(output1, output2)`

names(out_problem)

##  [1] "Var1"       "Var2"       "term"       "estimate"   "std.error" 
##  [6] "statistic"  "p.value"    "term1"      "estimate1"  "std.error1"
## [11] "statistic1" "p.value1"

We’ve got two sets of output, but with the same column names!!! This is a problem! An easy solution would be to preface the column names with the name of the response variable. I struggled with this today until I discovered the secret function.

out_nice <- out %>% mutate(output1 = map(model1, ~tidy(.)),
                           output2 = map(model2, ~tidy(.)),
                           output1 = map(output1, ~setNames(., paste('y1', names(.), sep='_'))),
                           output2 = map(output2, ~setNames(., paste('y2', names(.), sep='_')))) %>%
  select(-data, -model1, -model2) %>%
  unnest()

## Warning: `cols` is now required.
## Please use `cols = c(output1, output2)`

This is a compact representation of the results of both regressions by strata, and we can extract the information we would like very easily. For example, to extract the stratum-specific slope estimates:

out_nice %>% filter(y1_term=='x') %>%
  select(Var1, Var2, ends_with('estimate')) %>%
  knitr::kable(digits=3, format='html')

Var1	Var2	y1_estimate	y2_estimate
a	1	-1.897	5.036
b	1	-2.000	5.022
c	1	-1.988	4.888
d	1	-2.089	5.089
a	2	-2.052	5.015
b	2	-1.922	5.004
c	2	-1.936	4.969
d	2	-1.961	4.959
a	3	-2.043	5.017
b	3	-2.045	4.860
c	3	-1.996	5.009
d	3	-1.922	4.894
a	4	-2.000	4.942
b	4	-2.000	4.932
c	4	-2.033	5.042
d	4	-2.165	5.049
a	5	-2.094	5.010
b	5	-1.961	5.122
c	5	-2.106	5.153
d	5	-1.974	5.009