PIPING HOT DATA: Polished summary tables in R with gtsummary

Shannon Pileggi

A happy R lounging in a beach chair with tropical scenery holding a cocktail.

Figure 1: Happy R adapted from artwork by @allison_horst; the beach and cocktail images are from pngtree.com,

TL; DR

The gtsummary package in R creates amazing publication / presentation / whatever-you-need-it-for ready tables of summary statistics. Try it out!

Background

A colleague learning R just told me that he spent 45 minutes searching for a summary table function and couldn’t quite find anything that met his needs. When I showed him gtsummary in 5 minutes, his reaction was all

via GIPHY

This blog post is to promote gtsummary and make it more searchable for those still seeking the one table to rule them all. The gtsummary documentation is excellent so I won’t cover all of its awesome functionality, but I will add a bit of my specific experience.

If you are still searching for your favorite table package, here are two round up resources:

How to make beautiful tables in R by R for the Rest of Us (2019).
My favourite R package for: summarising data by Dabbling with data (2018)

The data

I’ll demonstrate with the Youth Risk Behavior Surveillance System (YRBSS) data; my previous post Leveraging labelled data in R has more background details. You can download the .sav data from my github repo.

# import data ----
dat_raw <- haven::read_sav(here::here( "_posts", "2020-12-23-leveraging-labelled-data-in-r", "data", "sadc_2017_national.sav"))

Getting started

This material was developed using:

Software / package	Version
R	4.0.5
RStudio	1.4.1103
`tidyverse`	1.3.0
`here`	1.0.1
`haven`	2.3.1
`labelled`	2.5.0
`gtsummary`	1.3.5
`writexl`	1.4.0

library(tidyverse)  # general use ----
library(here)       # file paths  ----
library(haven)      # import .sav files ----  
library(labelled)   # tools for labelled data ----
library(gtsummary)  # produce summary tables ----
library(writexl)    # export excel tables ----

Data preparation

First, I import the data with haven::read_sav().

# import data ----
dat_raw <- haven::read_sav(here::here( "_posts", "2020-12-23-leveraging-labelled-data-in-r", "data", "sadc_2017_national.sav"))

Then I keep just two years of data for comparison on three question.

dat <- dat_raw %>% 
  # two most recent years ----
  dplyr::filter(year %in% c(2015, 2017)) %>% 
  # a few variables
  dplyr::select(record, year, stheight, q12, q69)

Here is a quick preview of the data.

# print data for viewing----
dat

# A tibble: 30,389 x 5
    record  year stheight              q12                       q69
     <dbl> <dbl>    <dbl>        <dbl+lbl>                 <dbl+lbl>
 1 1349684  2015       NA NA                3 [Stay the same weight]
 2 1349685  2015       NA  1 [0 days]       3 [Stay the same weight]
 3 1349686  2015       NA NA               NA                       
 4 1349687  2015       NA  1 [0 days]       1 [Lose weight]         
 5 1349688  2015       NA NA                2 [Gain weight]         
 6 1349689  2015       NA  1 [0 days]       3 [Stay the same weight]
 7 1349690  2015       NA NA                2 [Gain weight]         
 8 1349691  2015       NA  1 [0 days]       1 [Lose weight]         
 9 1349692  2015       NA  3 [2 or 3 days]  1 [Lose weight]         
10 1349693  2015       NA  1 [0 days]       3 [Stay the same weight]
# ... with 30,379 more rows

I recommend reading Leveraging labelled data in R for more details on labelled data, but for now here is a quick preview of the metadata.

# create data dictionary ----
dictionary <- labelled::look_for(dat, details = TRUE)

# view select items from dictionary ----
dictionary %>% 
  dplyr::select(variable, label, value_labels) %>% 
  knitr::kable()

variable	label	value_labels
record	Record ID
year	4-digit Year of survey
stheight	Height in meters
q12	Weapon carrying	[1] 0 days; [2] 1 day; [3] 2 or 3 days; [4] 4 or 5 days; [5] 6 or more days
q69	Weight loss	[1] Lose weight; [2] Gain weight; [3] Stay the same weight; [4] Not trying to do anything

The best, quickest table

To get a quick summary table, first retain only the variables you want to see in the summary table. In this case, I removed record to avoid summary statistics of an ID. Next, because the imported data has value labels from the .sav data file, I converted those variables to a factor for summarizing.

dat %>% 
  # remove from summary table ----
  dplyr::select(-record) %>% 
  # convert labelled values to a factor ----
  mutate_at(vars(matches("q")), haven::as_factor) %>% 
  # create a table with columns for year ----
  gtsummary::tbl_summary(by = year)

Characteristic	2015, N = 15,624¹	2017, N = 14,765¹
Height in meters	1.68 (1.60, 1.75)	1.68 (1.60, 1.75)
Unknown	1,266	1,619
Weapon carrying
0 days	11,897 (82%)	10,027 (85%)
1 day	494 (3.4%)	349 (3.0%)
2 or 3 days	580 (4.0%)	398 (3.4%)
4 or 5 days	202 (1.4%)	166 (1.4%)
6 or more days	1,250 (8.7%)	798 (6.8%)
Unknown	1,201	3,027
Weight loss
Lose weight	6,676 (48%)	5,462 (47%)
Gain weight	2,439 (18%)	2,247 (19%)
Stay the same weight	2,354 (17%)	1,833 (16%)
Not trying to do anything	2,375 (17%)	1,999 (17%)
Unknown	1,780	3,224
¹ Statistics presented: Median (IQR); n (%)

So easy to obtain, and so readable!

And wait - did you see that?! The raw data had variable names of q12, stheight, and q69 but the table printed the variable label! (I previously tweeted about the awesome package pairing of haven and gtsummary.) If your data does not come with handy labels, you can create them with the label option in tbl_summary or with the var_label function in the labelled package.

A few modifications

Here are a few modifications you might be interested in trying to customize your table, including adding an overall column, custom statistic formatting, and table styling. Note that there is an overall N that corresponds to the number of observations, and each each variable can have its own N that corresponds to the number of non-missing observations for that variable.

dat %>% 
  # remove from summary table
  dplyr::select(-record) %>% 
  # covert labelled values to a factor ----
  mutate_at(vars(matches("q")), haven::as_factor) %>% 
  tbl_summary(by = year,
    # custom statistic formats ----
    statistic = list(all_continuous() ~ "{mean} ({sd})",
                     all_categorical() ~ "{p}% ({n} / {N})")
  ) %>%
  add_overall() %>% # add an overall column
  add_n() %>% # add column with total number of non-missing observations
  add_p() %>% # test for a difference between groups
  bold_labels()

Characteristic	N	Overall, N = 30,389¹	2015, N = 15,624¹	2017, N = 14,765¹	p-value²
Height in meters	27,504	1.69 (0.10)	1.69 (0.10)	1.69 (0.10)	0.12
Unknown		2,885	1,266	1,619
Weapon carrying	26,161				<0.001
0 days		84% (21,924 / 26,161)	82% (11,897 / 14,423)	85% (10,027 / 11,738)
1 day		3.2% (843 / 26,161)	3.4% (494 / 14,423)	3.0% (349 / 11,738)
2 or 3 days		3.7% (978 / 26,161)	4.0% (580 / 14,423)	3.4% (398 / 11,738)
4 or 5 days		1.4% (368 / 26,161)	1.4% (202 / 14,423)	1.4% (166 / 11,738)
6 or more days		7.8% (2,048 / 26,161)	8.7% (1,250 / 14,423)	6.8% (798 / 11,738)
Unknown		4,228	1,201	3,027
Weight loss	25,385				<0.001
Lose weight		48% (12,138 / 25,385)	48% (6,676 / 13,844)	47% (5,462 / 11,541)
Gain weight		18% (4,686 / 25,385)	18% (2,439 / 13,844)	19% (2,247 / 11,541)
Stay the same weight		16% (4,187 / 25,385)	17% (2,354 / 13,844)	16% (1,833 / 11,541)
Not trying to do anything		17% (4,374 / 25,385)	17% (2,375 / 13,844)	17% (1,999 / 11,541)
Unknown		5,004	1,780	3,224
¹ Statistics presented: Mean (SD); % (n / N) ² Statistical tests performed: Wilcoxon rank-sum test; chi-square test of independence

In addition, gtsummary makes an educated guess on how to summarize your data and which statistical test to use. Pay attention to the footnote on the statistical tests performed and adjust if needed with the test argument in the add_p function..

Discrete numeric values

One default I frequently correct is treatment of discrete numeric values. For example, consider a rating scale with possible values of 1, 2, 3, … 7, but in which respondents only select values of 3, 4, 5. Making an educated guess and only seeing three unique values, gtsummary will treat this as a categorical variable and return frequencies of those values; however, you may still want a mean. You can override these potentially undesirable defaults in gtsummary. 🙌

This data set does not have a great example of this, so I’ll make one. I am going to pretend that q12 is numeric for demonstration,

dat %>% 
  # keep q12 for summary table ----
  dplyr::select(q12) %>% 
  # convert labelled values to numeric by removing value labels ----
  mutate_at(vars(q12), haven::zap_labels) %>% 
  # subtract one to make pretend example more realistic ----
  mutate_at(vars(q12), ~ . - 1) %>% 
  # create a table with columns for year ----
  gtsummary::tbl_summary()

Characteristic	N = 30,389¹
Weapon carrying
0	21,924 (84%)
1	843 (3.2%)
2	978 (3.7%)
3	368 (1.4%)
4	2,048 (7.8%)
Unknown	4,228
¹ Statistics presented: n (%)

In this table, a 4 represents carrying a weapon to school 4 days in the past month. Notice the default is to provide a frequency summary as if it is a categorical variable due the few values present. If instead you want a mean to summarize this variable, specify the variable as continuous with the type argument in tbl_summary.

dat %>% 
  # keep q12 for summary table ----
  dplyr::select(q12) %>% 
  # convert labelled values to numeric by removing value labels ----
  mutate_at(vars(q12), haven::zap_labels) %>% 
  # subtract one to make pretend example more realistic ----
  mutate_at(vars(q12), ~ . - 1) %>% 
  # create a table with columns for year ----
  gtsummary::tbl_summary(
    # treat q12 variable as continuous ----
    type = list(q12 ~ "continuous"),
    # custom statistic format ----
    statistic = list(all_continuous() ~ "{mean} ({sd})")
  )

Characteristic	N = 30,389¹
Weapon carrying	0.46 (1.16)
Unknown	4,228
¹ Statistics presented: Mean (SD)

Missing data

Before we part, let’s make sure we understand how the package handles missing data and our options.

dat %>% 
  dplyr::select(q69) %>% 
  # convert labelled values to a factor ----
  mutate_at(vars(matches("q")), haven::as_factor) %>% 
  gtsummary::tbl_summary()

Characteristic	N = 30,389¹
Weight loss
Lose weight	12,138 (48%)
Gain weight	4,686 (18%)
Stay the same weight	4,187 (16%)
Not trying to do anything	4,374 (17%)
Unknown	5,004
¹ Statistics presented: n (%)

This summary table tells us that there are 30,389 records in the years 2015 and 2017; 25,385 of them have non-missing values and 5,004 have missing values. The percents shown in the table are the percent of the non-missing base N; that is, 12,138 / 25,385 is 48%.

You can suppress printing the count of unknown values if you like, which does not change any of other numbers in your table.

dat %>% 
  dplyr::select(q69) %>% 
  # convert labelled values to a factor ----
  mutate_at(vars(q69), haven::as_factor) %>% 
  gtsummary::tbl_summary(
    # supress printing count of unknown values ----
    missing = "no"
  )

Characteristic	N = 30,389¹
Weight loss
Lose weight	12,138 (48%)
Gain weight	4,686 (18%)
Stay the same weight	4,187 (16%)
Not trying to do anything	4,374 (17%)
¹ Statistics presented: n (%)

If you want the N at the top of the column to reflect the N for non-missing observations, I would remove those in your data cleaning process.

dat %>% 
  dplyr::select(q69) %>% 
  # convert labelled values to a factor ----
  mutate_at(vars(q69), haven::as_factor) %>% 
  # remove all observations with missing values ----
  drop_na() %>% 
  gtsummary::tbl_summary()

Characteristic	N = 25,385¹
Weight loss
Lose weight	12,138 (48%)
Gain weight	4,686 (18%)
Stay the same weight	4,187 (16%)
Not trying to do anything	4,374 (17%)
¹ Statistics presented: n (%)

On the other hand, if the missing data represents a valid category that you want counted, you could replace the missing values (which would shift the percentages).

dat %>% 
  dplyr::select(q69) %>% 
  # convert labelled values to a factor ----
  mutate_at(vars(q69), haven::as_factor) %>% 
  # replace missing values for factor ----
  mutate_at(vars(q69), forcats::fct_explicit_na, "Missing weight action") %>% 
  gtsummary::tbl_summary()

Characteristic	N = 30,389¹
Weight loss
Lose weight	12,138 (40%)
Gain weight	4,686 (15%)
Stay the same weight	4,187 (14%)
Not trying to do anything	4,374 (14%)
Missing weight action	5,004 (16%)
¹ Statistics presented: n (%)

Outputs

gtsummay has a variety of supported outputs, including html and word. I work with many who prefer to see tables in excel, and this an output not directly supported in gtsummary. (No shade here, I’m waiting patiently for this to happen in the gt package. 😁) However, gtsummary does have a work around available with the gtsummary::as_tibble() function. Save the table as an object, convert to a tibble, and then export.

# 1 - save table ----
gt_table <- dat %>% 
  dplyr::select(q69) %>% 
  # convert labelled values to a factor ----
  mutate_at(vars(matches("q")), haven::as_factor) %>% 
  gtsummary::tbl_summary() %>% 
  # 2 convert to tibble ----
  gtsummary::as_tibble()

And now you can export to excel.

# 3 - export to excel ----
writexl::write_xlsx(gt_table, here::here( "_posts", "2021-07-14-polished-summary-tables-in-r-with-gtsummary",  "example_gtsummary.xlsx"))

You can download the exported output example_gtsummary.xlsx.

Conclusion

The gtsummary package has readable output that is easily customizable. My quest for the best presentation ready R table is over! 🥳

Acknowledgements

Daniel Sjoberg + team rock for their work on gtsummary! Thanks to my colleague Patrick Freeman for reviewing this post and providing feedback, and thanks to new learneRs who inspire me to write. 💜

Polished summary tables in R with gtsummary