19 Univariate and multivariable regression
This page demonstrates the use of base R regression functions such as glm()
and the gtsummary package to
look at associations between variables (e.g. odds ratios, risk ratios and hazard
ratios). It also uses functions like tidy() from the broom package to clean up regression outputs.
- Univariate: two-by-two tables
- Stratified: Mantel-Haenszel estimates
- Multivariable: variable selection, model selection, final table
- Forest plots
For Cox proportional hazard regression, see the Survival analysis page.
NOTE: We use the term multivariable to refer to a regression with multiple explanatory variables. In this sense a multivariate model would be a regression with several outcomes - see this editorial for detail
19.1 Preparation
Load packages
This code chunk shows the loading of packages required for the analyses. In this handbook we emphasize p_load()
from pacman, which installs the package if necessary and loads it for use. You can also load installed packages with library()
from base R. See the page on R basics for more information on R packages.
pacman::p_load(
rio, # File import
here, # File locator
tidyverse, # data management + ggplot2 graphics,
stringr, # manipulate text strings
purrr, # loop over objects in a tidy way
gtsummary, # summary statistics and tests
broom, # tidy up results from regressions
lmtest, # likelihood-ratio tests
parameters, # alternative to tidy up results from regressions
see # alternative to visualise forest plots
)
Import data
We import the dataset of cases from a simulated Ebola epidemic. If you want to follow along, click to download the “clean” linelist (as .rds file). Import your data with the import()
function from the rio package (it accepts many file types like .xlsx, .rds, .csv - see the Import and export page for details).
# import the linelist
linelist <- import("linelist_cleaned.rds")
The first 50 rows of the linelist are displayed below.
Clean data
Store explanatory variables
We store the names of the explanatory columns as a character vector. This will be referenced later.
## define variables of interest
explanatory_vars <- c("gender", "fever", "chills", "cough", "aches", "vomit")
Convert to 1’s and 0’s
Below we convert the explanatory columns from “yes”/“no”, “m”/“f”, and “dead”/“alive” to 1 / 0, to meet the expectations of logistic regression models. To do this efficiently we use across() from dplyr to transform multiple columns at once. The function we apply to each column is case_when() (also from dplyr), which applies logic to convert specified values to 1’s and 0’s. See the sections on across() and case_when() in the Cleaning data and core functions page.
Note: the “.” below represents the column that is being processed by across() at that moment.
## convert dichotomous variables to 0/1
linelist <- linelist %>%
mutate(across(
.cols = all_of(c(explanatory_vars, "outcome")), ## for each column listed and "outcome"
.fns = ~case_when(
. %in% c("m", "yes", "Death") ~ 1, ## recode male, yes and death to 1
. %in% c("f", "no", "Recover") ~ 0, ## female, no and recover to 0
TRUE ~ NA_real_) ## otherwise set to missing
)
)
Drop rows with missing values
To drop rows with missing values, you can use the tidyr function drop_na(). However, we only want to do this for rows that are missing values in the columns of interest.
The first thing we must do is make sure our explanatory_vars vector includes the column age_cat (age_cat would have produced an error in the previous case_when() operation, which was only for dichotomous variables). Then we pipe the linelist to drop_na() to remove any rows with missing values in the outcome column or any of the explanatory_vars columns.
Before running the code, you can check the number of rows in the linelist with nrow(linelist).
## add in age_category to the explanatory vars
explanatory_vars <- c(explanatory_vars, "age_cat")
## drop rows with missing information for variables of interest
linelist <- linelist %>%
drop_na(any_of(c("outcome", explanatory_vars)))
The number of rows remaining in the linelist can be checked again with nrow(linelist) - 4,167 rows remain in these data, as reflected in the regression outputs below.
19.2 Univariate
Just like in the page on Descriptive tables, your use case will determine which R package you use. We present two options for doing univariate analysis:
- Use functions available in base R to quickly print results to the console. Use the broom package to tidy up the outputs.
- Use the gtsummary package to model and get publication-ready outputs
base R
Linear regression
The base R function lm() performs linear regression, assessing the relationship between a numeric response variable and explanatory variables that are assumed to have a linear relationship.
Provide the equation as a formula, with the response and explanatory column names separated by a tilde ~. Also, specify the dataset to data =. Define the model results as an R object, to use later.
lm_results <- lm(ht_cm ~ age, data = linelist)
You can then run summary()
on the model results to see the coefficients (Estimates), P-value, residuals, and other measures.
summary(lm_results)
##
## Call:
## lm(formula = ht_cm ~ age, data = linelist)
##
## Residuals:
## Min 1Q Median 3Q Max
## -128.579 -15.854 1.177 15.887 175.483
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 69.9051 0.5979 116.9 <2e-16 ***
## age 3.4354 0.0293 117.2 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 23.75 on 4165 degrees of freedom
## Multiple R-squared: 0.7675, Adjusted R-squared: 0.7674
## F-statistic: 1.375e+04 on 1 and 4165 DF, p-value: < 2.2e-16
Alternatively, you can use the tidy() function from the broom package to pull the results into a table. The results tell us that for each year increase in age, height increases by about 3.4 cm, and this association is statistically significant.
tidy(lm_results)
## # A tibble: 2 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 69.9 0.598 117. 0
## 2 age 3.44 0.0293 117. 0
You can also add this regression to a ggplot. To do this, first pull the observed points and the fitted line into one data frame using the augment() function from broom.
## pull the regression points and observed data in to one dataset
points <- augment(lm_results)
## plot the data using age as the x-axis
ggplot(points, aes(x = age)) +
## add points for height
geom_point(aes(y = ht_cm)) +
## add your regression line
geom_line(aes(y = .fitted), colour = "red")
It is also possible to add a simple linear regression line directly in ggplot using the geom_smooth() function.
## add your data to a plot
ggplot(linelist, aes(x = age, y = ht_cm)) +
## show points
geom_point() +
## add a linear regression
geom_smooth(method = "lm", se = FALSE)
## `geom_smooth()` using formula 'y ~ x'
See the Resource section at the end of this chapter for more detailed tutorials.
Logistic regression
The function glm()
from the stats package (part of base R) is used to fit Generalized Linear Models (GLM).
glm()
can be used for univariate and multivariable logistic regression (e.g. to get Odds Ratios). Here are the core parts:
# arguments for glm()
glm(formula, family, data, weights, subset, ...)
- formula = The model is provided to glm() as an equation, with the outcome on the left and explanatory variables on the right of a tilde ~.
- family = This determines the type of model to run. For logistic regression, use family = "binomial"; for Poisson, use family = "poisson". Other examples are in the table below.
- data = Specify your data frame.
If necessary, you can also specify the link function via the syntax family = familytype(link = "linkfunction"). You can read more in the documentation about other families and optional arguments such as weights = and subset = (?glm).
Family | Default link function
---|---
"binomial" | (link = "logit")
"gaussian" | (link = "identity")
"Gamma" | (link = "inverse")
"inverse.gaussian" | (link = "1/mu^2")
"poisson" | (link = "log")
"quasi" | (link = "identity", variance = "constant")
"quasibinomial" | (link = "logit")
"quasipoisson" | (link = "log")
When running glm()
it is most common to save the results as a named R object. Then you can print the results to your console using summary()
as shown below, or perform other operations on the results (e.g. exponentiate).
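For example, you can exponentiate the log odds of a saved logistic model into odds ratios with base R functions (here model is simply a placeholder name for a saved glm() object, such as the one fit in the next section):
exp(coef(model))      # exponentiated coefficients (odds ratios)
exp(confint(model))   # corresponding 95% confidence intervals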
If you need to run a negative binomial regression you can use the MASS package; the glm.nb() function uses the same syntax as glm().
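A minimal sketch, assuming MASS is installed; the count outcome days_onset_hosp below is purely illustrative and should be swapped for a count column in your own data:
## negative binomial regression with MASS::glm.nb() (same syntax as glm())
nb_model <- MASS::glm.nb(days_onset_hosp ~ age_cat, data = linelist)
summary(nb_model)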
For a walk-through of different regressions, see the UCLA stats page.
Univariate glm()
In this example we are assessing the association between different age categories and the outcome of death (coded as 1 in the Preparation section). Below is a univariate model of outcome
by age_cat
. We save the model output as model
and then print it with summary()
to the console. Note the estimates provided are the log odds and that the baseline level is the first factor level of age_cat
(“0-4”).
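The code to fit and print this model (matching the Call shown in the output below) is:
model <- glm(outcome ~ age_cat, family = "binomial", data = linelist)
summary(model)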
##
## Call:
## glm(formula = outcome ~ age_cat, family = "binomial", data = linelist)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.339 -1.278 1.024 1.080 1.354
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.233738 0.072805 3.210 0.00133 **
## age_cat5-9 -0.062898 0.101733 -0.618 0.53640
## age_cat10-14 0.138204 0.107186 1.289 0.19726
## age_cat15-19 -0.005565 0.113343 -0.049 0.96084
## age_cat20-29 0.027511 0.102133 0.269 0.78765
## age_cat30-49 0.063764 0.113771 0.560 0.57517
## age_cat50-69 -0.387889 0.259240 -1.496 0.13459
## age_cat70+ -0.639203 0.915770 -0.698 0.48518
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 5712.4 on 4166 degrees of freedom
## Residual deviance: 5705.1 on 4159 degrees of freedom
## AIC: 5721.1
##
## Number of Fisher Scoring iterations: 4
To alter the baseline level of a given variable, ensure the column is class Factor and move the desired level to the first position with fct_relevel()
(see page on Factors). For example, below we take column age_cat
and set “20-29” as the baseline before piping the modified data frame into glm()
.
linelist %>%
mutate(age_cat = fct_relevel(age_cat, "20-29", after = 0)) %>%
glm(formula = outcome ~ age_cat, family = "binomial") %>%
summary()
##
## Call:
## glm(formula = outcome ~ age_cat, family = "binomial", data = .)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.339 -1.278 1.024 1.080 1.354
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.26125 0.07163 3.647 0.000265 ***
## age_cat0-4 -0.02751 0.10213 -0.269 0.787652
## age_cat5-9 -0.09041 0.10090 -0.896 0.370220
## age_cat10-14 0.11069 0.10639 1.040 0.298133
## age_cat15-19 -0.03308 0.11259 -0.294 0.768934
## age_cat30-49 0.03625 0.11302 0.321 0.748390
## age_cat50-69 -0.41540 0.25891 -1.604 0.108625
## age_cat70+ -0.66671 0.91568 -0.728 0.466546
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 5712.4 on 4166 degrees of freedom
## Residual deviance: 5705.1 on 4159 degrees of freedom
## AIC: 5721.1
##
## Number of Fisher Scoring iterations: 4
Printing results
For most uses, several modifications must be made to the above outputs. The function tidy()
from the package broom is convenient for making the model results presentable.
Here we demonstrate how to combine model outputs with a table of counts.
- Get the exponentiated log odds ratio estimates and confidence intervals by passing the model to tidy() and setting exponentiate = TRUE and conf.int = TRUE.
model <- glm(outcome ~ age_cat, family = "binomial", data = linelist) %>%
tidy(exponentiate = TRUE, conf.int = TRUE) %>% # exponentiate and produce CIs
mutate(across(where(is.numeric), round, digits = 2)) # round all numeric columns
Below is the output tibble model:
- Combine these model results with a table of counts. Below, we create a counts cross-table with the tabyl() function from janitor, as covered in the Descriptive tables page.
counts_table <- linelist %>%
janitor::tabyl(age_cat, outcome)
Here is what this counts_table
data frame looks like:
Now we can bind the counts_table
and the model
results together horizontally with bind_cols()
(dplyr). Remember that with bind_cols()
the rows in the two data frames must be aligned perfectly. In this code, because we are binding within a pipe chain, we use .
to represent the piped object counts_table
as we bind it to model
. To finish the process, we use select()
to pick the desired columns and their order, and finally apply the base R round()
function across all numeric columns to specify 2 decimal places.
combined <- counts_table %>% # begin with table of counts
bind_cols(., model) %>% # combine with the outputs of the regression
select(term, 2:3, estimate, # select and re-order cols
conf.low, conf.high, p.value) %>%
mutate(across(where(is.numeric), round, digits = 2)) ## round to 2 decimal places
Here is what the combined data frame looks like, printed nicely as an image with a function from flextable. The Tables for presentation page explains how to customize such tables with flextable, or you can use numerous other packages such as knitr or GT.
combined <- combined %>%
flextable::qflextable()
Looping multiple univariate models
Below we present a method using glm() and tidy(). For a simpler approach, see the section on gtsummary.
To run the models on several exposure variables to produce univariate odds ratios (i.e. not controlling for each other), you can use the approach below. It uses str_c()
from stringr to create univariate formulas (see Characters and strings), runs the glm()
regression on each formula, passes each glm()
output to tidy()
and finally collapses all the model outputs together with bind_rows() from dplyr. This approach uses map() from the package purrr to iterate - see the page on Iteration, loops, and lists for more information on this tool.
1. Create a vector of column names of the explanatory variables. We already have this as explanatory_vars from the Preparation section of this page.
2. Use str_c() to create multiple string formulas, with outcome on the left, and a column name from explanatory_vars on the right. The period . substitutes for the column name in explanatory_vars.
explanatory_vars %>% str_c("outcome ~ ", .)
## [1] "outcome ~ gender" "outcome ~ fever" "outcome ~ chills" "outcome ~ cough" "outcome ~ aches" "outcome ~ vomit"
## [7] "outcome ~ age_cat"
3. Pass these string formulas to map() and set ~glm() as the function to apply to each input. Within glm(), set the regression formula as as.formula(.x), where .x will be replaced by the string formula defined in the step above. map() will loop over each of the string formulas, running regressions for each one.
4. The outputs of this first map() are passed to a second map() command, which applies tidy() to the regression outputs.
5. Finally, the output of the second map() (a list of tidied data frames) is condensed with bind_rows(), resulting in one data frame with all the univariate results.
models <- explanatory_vars %>% # begin with variables of interest
str_c("outcome ~ ", .) %>% # combine each variable into formula ("outcome ~ variable of interest")
# iterate through each univariate formula
map(
.f = ~glm( # pass the formulas one-by-one to glm()
formula = as.formula(.x), # within glm(), the string formula is .x
family = "binomial", # specify type of glm (logistic)
data = linelist)) %>% # dataset
# tidy up each of the glm regression outputs from above
map(
.f = ~tidy(
.x,
exponentiate = TRUE, # exponentiate
conf.int = TRUE)) %>% # return confidence intervals
# collapse the list of regression outputs in to one data frame
bind_rows() %>%
# round all numeric columns
mutate(across(where(is.numeric), round, digits = 2))
This time, the end object models is longer because it now represents the combined results of several univariate regressions. Click through to see all the rows of models.
As before, we can create a counts table from the linelist for each explanatory variable, bind it to models, and make a nice table. We begin with the variables and iterate through them with map(), applying a user-defined function that builds a counts table with dplyr functions. Then the results are combined and bound with the models regression results.
## for each explanatory variable
univ_tab_base <- explanatory_vars %>%
map(.f =
~{linelist %>% ## begin with linelist
group_by(outcome) %>% ## group data set by outcome
count(.data[[.x]]) %>% ## produce counts for variable of interest
pivot_wider( ## spread to wide format (as in cross-tabulation)
names_from = outcome,
values_from = n) %>%
drop_na(.data[[.x]]) %>% ## drop rows with missings
rename("variable" = .x) %>% ## change variable of interest column to "variable"
mutate(variable = as.character(variable))} ## convert to character, else non-dichotomous (categorical) variables come out as factor and cant be merged
) %>%
## collapse the list of count outputs in to one data frame
bind_rows() %>%
## merge with the outputs of the regression
bind_cols(., models) %>%
## only keep columns interested in
select(term, 2:3, estimate, conf.low, conf.high, p.value) %>%
## round decimal places
mutate(across(where(is.numeric), round, digits = 2))
Below is what the data frame looks like. See the page on Tables for presentation for ideas on how to further convert this table to pretty HTML output (e.g. with flextable).
gtsummary package
Below we present the use of tbl_uvregression()
from the gtsummary package. Just like in the page on Descriptive tables, gtsummary functions do a good job of running statistics and producing professional-looking outputs. This function produces a table of univariate regression results.
We select only the necessary columns from the linelist
(explanatory variables and the outcome variable) and pipe them into tbl_uvregression()
. We are going to run univariate regression on each of the columns we defined as explanatory_vars
in the data Preparation section (gender, fever, chills, cough, aches, vomit, and age_cat).
Within the function itself, we provide the method =
as glm
(no quotes), the y =
outcome column (outcome
), specify to method.args =
that we want to run logistic regression via family = binomial
, and we tell it to exponentiate the results.
The output is HTML and contains the counts.
univ_tab <- linelist %>%
dplyr::select(explanatory_vars, outcome) %>% ## select variables of interest
tbl_uvregression( ## produce univariate table
method = glm, ## define regression want to run (generalised linear model)
y = outcome, ## define outcome variable
method.args = list(family = binomial), ## define what type of glm want to run (logistic)
exponentiate = TRUE ## exponentiate to produce odds ratios (rather than log odds)
)
## view univariate results table
univ_tab
Characteristic | N | OR1 | 95% CI1 | p-value
---|---|---|---|---
gender | 4167 | 1.00 | 0.88, 1.13 | >0.9
fever | 4167 | 1.00 | 0.85, 1.17 | >0.9
chills | 4167 | 1.03 | 0.89, 1.21 | 0.7
cough | 4167 | 1.15 | 0.97, 1.37 | 0.11
aches | 4167 | 0.93 | 0.76, 1.14 | 0.5
vomit | 4167 | 1.09 | 0.96, 1.23 | 0.2
age_cat | 4167 | | |
0-4 | | — | — |
5-9 | | 0.94 | 0.77, 1.15 | 0.5
10-14 | | 1.15 | 0.93, 1.42 | 0.2
15-19 | | 0.99 | 0.80, 1.24 | >0.9
20-29 | | 1.03 | 0.84, 1.26 | 0.8
30-49 | | 1.07 | 0.85, 1.33 | 0.6
50-69 | | 0.68 | 0.41, 1.13 | 0.13
70+ | | 0.53 | 0.07, 3.20 | 0.5
1 OR = Odds Ratio, CI = Confidence Interval
There are many modifications you can make to this table output, such as adjusting the text labels, bolding rows by their p-value, etc. See tutorials here and elsewhere online.
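For example, two small adjustments you could make with gtsummary helper functions (the p-value threshold below is just an example):
## bold the variable labels and any p-values below 0.05
univ_tab %>% 
  bold_labels() %>% 
  bold_p(t = 0.05)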
19.3 Stratified
Stratified analysis is currently still being worked on for gtsummary; this page will be updated in due course.
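In the meantime, a minimal base R sketch of a Mantel-Haenszel estimate uses mantelhaen.test() from the stats package; the choice of exposure and stratifying variable below is purely illustrative:
## Mantel-Haenszel odds ratio for outcome by fever, stratified by gender
stats::mantelhaen.test(
  x = factor(linelist$fever),    # exposure
  y = factor(linelist$outcome),  # outcome
  z = factor(linelist$gender))   # stratifying variable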
19.4 Multivariable
For multivariable analysis, we again present two approaches:
- glm() and tidy()
- gtsummary package
The workflow is similar for each and only the last step of pulling together a final table is different.
Conduct multivariable
Here we use glm()
but add more variables to the right side of the equation, separated by plus symbols (+
).
To run the model with all of our explanatory variables we would run:
mv_reg <- glm(outcome ~ gender + fever + chills + cough + aches + vomit + age_cat, family = "binomial", data = linelist)
summary(mv_reg)
##
## Call:
## glm(formula = outcome ~ gender + fever + chills + cough + aches +
## vomit + age_cat, family = "binomial", data = linelist)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.383 -1.279 1.029 1.078 1.346
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.069054 0.131726 0.524 0.600
## gender 0.002448 0.065133 0.038 0.970
## fever 0.004309 0.080522 0.054 0.957
## chills 0.034112 0.078924 0.432 0.666
## cough 0.138584 0.089909 1.541 0.123
## aches -0.070705 0.104078 -0.679 0.497
## vomit 0.086098 0.062618 1.375 0.169
## age_cat5-9 -0.063562 0.101851 -0.624 0.533
## age_cat10-14 0.136372 0.107275 1.271 0.204
## age_cat15-19 -0.011074 0.113640 -0.097 0.922
## age_cat20-29 0.026552 0.102780 0.258 0.796
## age_cat30-49 0.059569 0.116402 0.512 0.609
## age_cat50-69 -0.388964 0.262384 -1.482 0.138
## age_cat70+ -0.647443 0.917375 -0.706 0.480
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 5712.4 on 4166 degrees of freedom
## Residual deviance: 5700.2 on 4153 degrees of freedom
## AIC: 5728.2
##
## Number of Fisher Scoring iterations: 4
If you want to include two variables and an interaction between them you can separate them with an asterisk *
instead of a +
. Separate them with a colon :
if you are only specifying the interaction. For example:
glm(outcome ~ gender + age_cat * fever, family = "binomial", data = linelist)
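For illustration, the same model written out with main effects plus the colon interaction syntax would be:
glm(outcome ~ gender + age_cat + fever + age_cat:fever, family = "binomial", data = linelist)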
Optionally, you can use this code to leverage the pre-defined vector of column names and re-create the above command using str_c()
. This might be useful if your explanatory variable names are changing, or you don’t want to type them all out again.
## run a regression with all variables of interest
mv_reg <- explanatory_vars %>% ## begin with vector of explanatory column names
str_c(collapse = "+") %>% ## combine all names of the variables of interest separated by a plus
str_c("outcome ~ ", .) %>% ## combine the names of variables of interest with outcome in formula style
glm(family = "binomial", ## define type of glm as logistic,
data = linelist) ## define your dataset
Building the model
You can build your model step-by-step, saving various models that include certain explanatory variables. You can compare these models with likelihood-ratio tests using lrtest()
from the package lmtest, as below:
NOTE: Using base anova(model1, model2, test = "Chisq") produces the same results.
model1 <- glm(outcome ~ age_cat, family = "binomial", data = linelist)
model2 <- glm(outcome ~ age_cat + gender, family = "binomial", data = linelist)
lmtest::lrtest(model1, model2)
## Likelihood ratio test
##
## Model 1: outcome ~ age_cat
## Model 2: outcome ~ age_cat + gender
## #Df LogLik Df Chisq Pr(>Chisq)
## 1 8 -2852.6
## 2 9 -2852.6 1 0.0002 0.9883
Another option is to take the model object and apply the step() function from the stats package. Specify which variable selection direction you want to use when building the model.
## choose a model using forward selection based on AIC
## you can also do "backward" or "both" by adjusting the direction
final_mv_reg <- mv_reg %>%
step(direction = "forward", trace = FALSE)
You can also turn off scientific notation in your R session, for clarity:
options(scipen=999)
As described in the section on univariate analysis, pass the model output to tidy()
to exponentiate the log odds and CIs. Finally we round all numeric columns to two decimal places. Scroll through to see all the rows.
mv_tab_base <- final_mv_reg %>%
broom::tidy(exponentiate = TRUE, conf.int = TRUE) %>% ## get a tidy dataframe of estimates
mutate(across(where(is.numeric), round, digits = 2)) ## round
Here is what the resulting data frame looks like:
Combine univariate and multivariable
Combine with gtsummary
The gtsummary package provides the tbl_regression() function, which will take the outputs from a regression (glm() in this case) and produce a nice summary table.
## show results table of final regression
mv_tab <- tbl_regression(final_mv_reg, exponentiate = TRUE)
Let’s see the table:
mv_tab
Characteristic | OR1 | 95% CI1 | p-value |
---|---|---|---|
gender | 1.00 | 0.88, 1.14 | >0.9 |
fever | 1.00 | 0.86, 1.18 | >0.9 |
chills | 1.03 | 0.89, 1.21 | 0.7 |
cough | 1.15 | 0.96, 1.37 | 0.12 |
aches | 0.93 | 0.76, 1.14 | 0.5 |
vomit | 1.09 | 0.96, 1.23 | 0.2 |
age_cat | |||
0-4 | — | — | |
5-9 | 0.94 | 0.77, 1.15 | 0.5 |
10-14 | 1.15 | 0.93, 1.41 | 0.2 |
15-19 | 0.99 | 0.79, 1.24 | >0.9 |
20-29 | 1.03 | 0.84, 1.26 | 0.8 |
30-49 | 1.06 | 0.85, 1.33 | 0.6 |
50-69 | 0.68 | 0.40, 1.13 | 0.14 |
70+ | 0.52 | 0.07, 3.19 | 0.5 |
1 OR = Odds Ratio, CI = Confidence Interval
You can also combine several different output tables produced by gtsummary with
the tbl_merge()
function. We now combine the multivariable results with the gtsummary univariate results that we created above:
## combine with univariate results
tbl_merge(
tbls = list(univ_tab, mv_tab), # combine
tab_spanner = c("**Univariate**", "**Multivariable**")) # set header names
Characteristic | Univariate | | | | Multivariable | |
---|---|---|---|---|---|---|---
| N | OR1 | 95% CI1 | p-value | OR1 | 95% CI1 | p-value
gender | 4167 | 1.00 | 0.88, 1.13 | >0.9 | 1.00 | 0.88, 1.14 | >0.9
fever | 4167 | 1.00 | 0.85, 1.17 | >0.9 | 1.00 | 0.86, 1.18 | >0.9
chills | 4167 | 1.03 | 0.89, 1.21 | 0.7 | 1.03 | 0.89, 1.21 | 0.7
cough | 4167 | 1.15 | 0.97, 1.37 | 0.11 | 1.15 | 0.96, 1.37 | 0.12
aches | 4167 | 0.93 | 0.76, 1.14 | 0.5 | 0.93 | 0.76, 1.14 | 0.5
vomit | 4167 | 1.09 | 0.96, 1.23 | 0.2 | 1.09 | 0.96, 1.23 | 0.2
age_cat | 4167 | | | | | |
0-4 | | — | — | | — | — |
5-9 | | 0.94 | 0.77, 1.15 | 0.5 | 0.94 | 0.77, 1.15 | 0.5
10-14 | | 1.15 | 0.93, 1.42 | 0.2 | 1.15 | 0.93, 1.41 | 0.2
15-19 | | 0.99 | 0.80, 1.24 | >0.9 | 0.99 | 0.79, 1.24 | >0.9
20-29 | | 1.03 | 0.84, 1.26 | 0.8 | 1.03 | 0.84, 1.26 | 0.8
30-49 | | 1.07 | 0.85, 1.33 | 0.6 | 1.06 | 0.85, 1.33 | 0.6
50-69 | | 0.68 | 0.41, 1.13 | 0.13 | 0.68 | 0.40, 1.13 | 0.14
70+ | | 0.53 | 0.07, 3.20 | 0.5 | 0.52 | 0.07, 3.19 | 0.5
1 OR = Odds Ratio, CI = Confidence Interval
Combine with dplyr
An alternative way of combining the glm()
/tidy()
univariate and multivariable outputs is with the dplyr join functions.
- Join the univariate results from earlier (univ_tab_base, which contains counts) with the tidied multivariable results mv_tab_base
- Use select() to keep only the columns we want, specify their order, and rename them
- Use round() with two decimal places on all the columns that are class Double
## combine univariate and multivariable tables
left_join(univ_tab_base, mv_tab_base, by = "term") %>%
## choose columns and rename them
select( # new name = old name
"characteristic" = term,
"recovered" = "0",
"dead" = "1",
"univ_or" = estimate.x,
"univ_ci_low" = conf.low.x,
"univ_ci_high" = conf.high.x,
"univ_pval" = p.value.x,
"mv_or" = estimate.y,
"mvv_ci_low" = conf.low.y,
"mv_ci_high" = conf.high.y,
"mv_pval" = p.value.y
) %>%
mutate(across(where(is.double), round, 2))
## # A tibble: 20 x 11
## characteristic recovered dead univ_or univ_ci_low univ_ci_high univ_pval mv_or mvv_ci_low mv_ci_high mv_pval
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 909 1168 1.28 1.18 1.4 0 1.07 0.83 1.39 0.6
## 2 gender 916 1174 1 0.88 1.13 0.97 1 0.88 1.14 0.97
## 3 (Intercept) 340 436 1.28 1.11 1.48 0 1.07 0.83 1.39 0.6
## 4 fever 1485 1906 1 0.85 1.17 0.99 1 0.86 1.18 0.96
## 5 (Intercept) 1472 1877 1.28 1.19 1.37 0 1.07 0.83 1.39 0.6
## 6 chills 353 465 1.03 0.89 1.21 0.68 1.03 0.89 1.21 0.67
## 7 (Intercept) 272 309 1.14 0.97 1.34 0.13 1.07 0.83 1.39 0.6
## 8 cough 1553 2033 1.15 0.97 1.37 0.11 1.15 0.96 1.37 0.12
## 9 (Intercept) 1636 2114 1.29 1.21 1.38 0 1.07 0.83 1.39 0.6
## 10 aches 189 228 0.93 0.76 1.14 0.51 0.93 0.76 1.14 0.5
## 11 (Intercept) 931 1144 1.23 1.13 1.34 0 1.07 0.83 1.39 0.6
## 12 vomit 894 1198 1.09 0.96 1.23 0.17 1.09 0.96 1.23 0.17
## 13 (Intercept) 338 427 1.26 1.1 1.46 0 1.07 0.83 1.39 0.6
## 14 age_cat5-9 365 433 0.94 0.77 1.15 0.54 0.94 0.77 1.15 0.53
## 15 age_cat10-14 273 396 1.15 0.93 1.42 0.2 1.15 0.93 1.41 0.2
## 16 age_cat15-19 238 299 0.99 0.8 1.24 0.96 0.99 0.79 1.24 0.92
## 17 age_cat20-29 345 448 1.03 0.84 1.26 0.79 1.03 0.84 1.26 0.8
## 18 age_cat30-49 228 307 1.07 0.85 1.33 0.58 1.06 0.85 1.33 0.61
## 19 age_cat50-69 35 30 0.68 0.41 1.13 0.13 0.68 0.4 1.13 0.14
## 20 age_cat70+ 3 2 0.53 0.07 3.2 0.49 0.52 0.07 3.19 0.48
19.5 Forest plot
This section shows how to produce a plot with the outputs of your regression. There are two options: you can build a plot yourself using ggplot2, or use a meta-package called easystats (a package that includes many packages).
See the page on ggplot basics if you are unfamiliar with the ggplot2 plotting package.
ggplot2 package
You can build a forest plot with ggplot()
by plotting elements of the multivariable regression results. Add the layers of the plots using these “geoms”:
- estimates with geom_point()
- confidence intervals with geom_errorbar()
- a vertical line at OR = 1 with geom_vline()
Before plotting, you may want to use fct_relevel()
from the forcats package to set the order of the variables/levels on the y-axis. ggplot()
may display them in alpha-numeric order which would not work well for these age category values (“30” would appear before “5”). See the page on Factors for more details.
## remove the intercept term from your multivariable results
mv_tab_base %>%
#set order of levels to appear along y-axis
mutate(term = fct_relevel(
term,
"vomit", "gender", "fever", "cough", "chills", "aches",
"age_cat5-9", "age_cat10-14", "age_cat15-19", "age_cat20-29",
"age_cat30-49", "age_cat50-69", "age_cat70+")) %>%
# remove "intercept" row from plot
filter(term != "(Intercept)") %>%
## plot with variable on the y axis and estimate (OR) on the x axis
ggplot(aes(x = estimate, y = term)) +
## show the estimate as a point
geom_point() +
## add in an error bar for the confidence intervals
geom_errorbar(aes(xmin = conf.low, xmax = conf.high)) +
## show where OR = 1 is for reference as a dashed line
geom_vline(xintercept = 1, linetype = "dashed")
easystats packages
An alternative, if you do not want the fine level of control that ggplot2 provides, is to use a combination of easystats packages.
The function model_parameters()
from the parameters package does the equivalent
of the broom package function tidy()
. The see package then accepts those outputs
and creates a default forest plot as a ggplot()
object.
pacman::p_load(easystats)
## produce a default forest plot of the final multivariable model
final_mv_reg %>%
model_parameters(exponentiate = TRUE) %>%
plot()