Also plays well with labelled data
The gtsummary
package in R creates amazing publication / presentation / whateveryouneeditfor ready tables of summary statistics. Try it out!
A colleague learning R just told me that he spent 45 minutes searching for a summary table function and couldn’t quite find anything that met his needs. When I showed him gtsummary
in 5 minutes, his reaction was all
This blog post is to promote gtsummary
and make it more searchable for those still seeking the one table to rule them all. The gtsummary
documentation is excellent so I won’t cover all of its awesome functionality, but I will add a bit of my specific experience.
If you are still searching for your favorite table package, here are two round up resources:
How to make beautiful tables in R by R for the Rest of Us (2019).
My favourite R package for: summarising data by Dabbling with data (2018)
I’ll demonstrate with the Youth Risk Behavior Surveillance System (YRBSS) data; my previous post Leveraging labelled data in R has more background details. You can download the .sav
data from my github repo.
This material was developed using:
Software / package  Version 

R  4.0.5 
RStudio  1.4.1103 
tidyverse 
1.3.0 
here 
1.0.1 
haven 
2.3.1 
labelled 
2.5.0 
gtsummary 
1.3.5 
writexl 
1.4.0 
First, I import the data with haven::read_sav()
.
Then I keep just two years of data for comparison on three question.
Here is a quick preview of the data.
# print data for viewing
dat
# A tibble: 30,389 x 5
record year stheight q12 q69
<dbl> <dbl> <dbl> <dbl+lbl> <dbl+lbl>
1 1349684 2015 NA NA 3 [Stay the same weight]
2 1349685 2015 NA 1 [0 days] 3 [Stay the same weight]
3 1349686 2015 NA NA NA
4 1349687 2015 NA 1 [0 days] 1 [Lose weight]
5 1349688 2015 NA NA 2 [Gain weight]
6 1349689 2015 NA 1 [0 days] 3 [Stay the same weight]
7 1349690 2015 NA NA 2 [Gain weight]
8 1349691 2015 NA 1 [0 days] 1 [Lose weight]
9 1349692 2015 NA 3 [2 or 3 days] 1 [Lose weight]
10 1349693 2015 NA 1 [0 days] 3 [Stay the same weight]
# ... with 30,379 more rows
I recommend reading Leveraging labelled data in R for more details on labelled data, but for now here is a quick preview of the metadata.
# create data dictionary 
dictionary < labelled::look_for(dat, details = TRUE)
# view select items from dictionary 
dictionary %>%
dplyr::select(variable, label, value_labels) %>%
knitr::kable()
variable  label  value_labels 

record  Record ID  
year  4digit Year of survey  
stheight  Height in meters  
q12  Weapon carrying  [1] 0 days; [2] 1 day; [3] 2 or 3 days; [4] 4 or 5 days; [5] 6 or more days 
q69  Weight loss  [1] Lose weight; [2] Gain weight; [3] Stay the same weight; [4] Not trying to do anything 
To get a quick summary table, first retain only the variables you want to see in the summary table. In this case, I removed record
to avoid summary statistics of an ID. Next, because the imported data has value labels from the .sav
data file, I converted those variables to a factor for summarizing.
dat %>%
# remove from summary table 
dplyr::select(record) %>%
# convert labelled values to a factor 
mutate_at(vars(matches("q")), haven::as_factor) %>%
# create a table with columns for year 
gtsummary::tbl_summary(by = year)
Characteristic  2015, N = 15,624^{1}  2017, N = 14,765^{1} 

Height in meters  1.68 (1.60, 1.75)  1.68 (1.60, 1.75) 
Unknown  1,266  1,619 
Weapon carrying  
0 days  11,897 (82%)  10,027 (85%) 
1 day  494 (3.4%)  349 (3.0%) 
2 or 3 days  580 (4.0%)  398 (3.4%) 
4 or 5 days  202 (1.4%)  166 (1.4%) 
6 or more days  1,250 (8.7%)  798 (6.8%) 
Unknown  1,201  3,027 
Weight loss  
Lose weight  6,676 (48%)  5,462 (47%) 
Gain weight  2,439 (18%)  2,247 (19%) 
Stay the same weight  2,354 (17%)  1,833 (16%) 
Not trying to do anything  2,375 (17%)  1,999 (17%) 
Unknown  1,780  3,224 
^{
1
}
Statistics presented: Median (IQR); n (%)

So easy to obtain, and so readable!
And wait  did you see that?! The raw data had variable names of q12
, stheight
, and q69
but the table printed the variable label! (I previously tweeted about the awesome package pairing of haven
and gtsummary
.) If your data does not come with handy labels, you can create them with the label
option in tbl_summary
or with the var_label
function in the labelled
package.
Here are a few modifications you might be interested in trying to customize your table, including adding an overall column, custom statistic formatting, and table styling. Note that there is an overall N that corresponds to the number of observations, and each each variable can have its own N that corresponds to the number of nonmissing observations for that variable.
dat %>%
# remove from summary table
dplyr::select(record) %>%
# covert labelled values to a factor 
mutate_at(vars(matches("q")), haven::as_factor) %>%
tbl_summary(by = year,
# custom statistic formats 
statistic = list(all_continuous() ~ "{mean} ({sd})",
all_categorical() ~ "{p}% ({n} / {N})")
) %>%
add_overall() %>% # add an overall column
add_n() %>% # add column with total number of nonmissing observations
add_p() %>% # test for a difference between groups
bold_labels()
Characteristic  N  Overall, N = 30,389^{1}  2015, N = 15,624^{1}  2017, N = 14,765^{1}  pvalue^{2} 

Height in meters  27,504  1.69 (0.10)  1.69 (0.10)  1.69 (0.10)  0.12 
Unknown  2,885  1,266  1,619  
Weapon carrying  26,161  <0.001  
0 days  84% (21,924 / 26,161)  82% (11,897 / 14,423)  85% (10,027 / 11,738)  
1 day  3.2% (843 / 26,161)  3.4% (494 / 14,423)  3.0% (349 / 11,738)  
2 or 3 days  3.7% (978 / 26,161)  4.0% (580 / 14,423)  3.4% (398 / 11,738)  
4 or 5 days  1.4% (368 / 26,161)  1.4% (202 / 14,423)  1.4% (166 / 11,738)  
6 or more days  7.8% (2,048 / 26,161)  8.7% (1,250 / 14,423)  6.8% (798 / 11,738)  
Unknown  4,228  1,201  3,027  
Weight loss  25,385  <0.001  
Lose weight  48% (12,138 / 25,385)  48% (6,676 / 13,844)  47% (5,462 / 11,541)  
Gain weight  18% (4,686 / 25,385)  18% (2,439 / 13,844)  19% (2,247 / 11,541)  
Stay the same weight  16% (4,187 / 25,385)  17% (2,354 / 13,844)  16% (1,833 / 11,541)  
Not trying to do anything  17% (4,374 / 25,385)  17% (2,375 / 13,844)  17% (1,999 / 11,541)  
Unknown  5,004  1,780  3,224  
^{
1
}
Statistics presented: Mean (SD); % (n / N)
^{
2
}
Statistical tests performed: Wilcoxon ranksum test; chisquare test of independence

In addition, gtsummary
makes an educated guess on how to summarize your data and which statistical test to use. Pay attention to the footnote on the statistical tests performed and adjust if needed with the test
argument in the add_p
function..
One default I frequently correct is treatment of discrete numeric values. For example, consider a rating scale with possible values of 1, 2, 3, … 7, but in which respondents only select values of 3, 4, 5. Making an educated guess and only seeing three unique values, gtsummary
will treat this as a categorical variable and return frequencies of those values; however, you may still want a mean. You can override these potentially undesirable defaults in gtsummary
. 🙌
This data set does not have a great example of this, so I’ll make one. I am going to pretend that q12
is numeric for demonstration,
dat %>%
# keep q12 for summary table 
dplyr::select(q12) %>%
# convert labelled values to numeric by removing value labels 
mutate_at(vars(q12), haven::zap_labels) %>%
# subtract one to make pretend example more realistic 
mutate_at(vars(q12), ~ .  1) %>%
# create a table with columns for year 
gtsummary::tbl_summary()
Characteristic  N = 30,389^{1} 

Weapon carrying  
0  21,924 (84%) 
1  843 (3.2%) 
2  978 (3.7%) 
3  368 (1.4%) 
4  2,048 (7.8%) 
Unknown  4,228 
^{
1
}
Statistics presented: n (%)

In this table, a 4 represents carrying a weapon to school 4 days in the past month. Notice the default is to provide a frequency summary as if it is a categorical variable due the few values present. If instead you want a mean to summarize this variable, specify the variable as continuous with the type
argument in tbl_summary
.
dat %>%
# keep q12 for summary table 
dplyr::select(q12) %>%
# convert labelled values to numeric by removing value labels 
mutate_at(vars(q12), haven::zap_labels) %>%
# subtract one to make pretend example more realistic 
mutate_at(vars(q12), ~ .  1) %>%
# create a table with columns for year 
gtsummary::tbl_summary(
# treat q12 variable as continuous 
type = list(q12 ~ "continuous"),
# custom statistic format 
statistic = list(all_continuous() ~ "{mean} ({sd})")
)
Characteristic  N = 30,389^{1} 

Weapon carrying  0.46 (1.16) 
Unknown  4,228 
^{
1
}
Statistics presented: Mean (SD)

Before we part, let’s make sure we understand how the package handles missing data and our options.
dat %>%
dplyr::select(q69) %>%
# convert labelled values to a factor 
mutate_at(vars(matches("q")), haven::as_factor) %>%
gtsummary::tbl_summary()
Characteristic  N = 30,389^{1} 

Weight loss  
Lose weight  12,138 (48%) 
Gain weight  4,686 (18%) 
Stay the same weight  4,187 (16%) 
Not trying to do anything  4,374 (17%) 
Unknown  5,004 
^{
1
}
Statistics presented: n (%)

This summary table tells us that there are 30,389 records in the years 2015 and 2017; 25,385 of them have nonmissing values and 5,004 have missing values. The percents shown in the table are the percent of the nonmissing base N; that is, 12,138 / 25,385 is 48%.
You can suppress printing the count of unknown values if you like, which does not change any of other numbers in your table.
dat %>%
dplyr::select(q69) %>%
# convert labelled values to a factor 
mutate_at(vars(q69), haven::as_factor) %>%
gtsummary::tbl_summary(
# supress printing count of unknown values 
missing = "no"
)
Characteristic  N = 30,389^{1} 

Weight loss  
Lose weight  12,138 (48%) 
Gain weight  4,686 (18%) 
Stay the same weight  4,187 (16%) 
Not trying to do anything  4,374 (17%) 
^{
1
}
Statistics presented: n (%)

If you want the N at the top of the column to reflect the N for nonmissing observations, I would remove those in your data cleaning process.
dat %>%
dplyr::select(q69) %>%
# convert labelled values to a factor 
mutate_at(vars(q69), haven::as_factor) %>%
# remove all observations with missing values 
drop_na() %>%
gtsummary::tbl_summary()
Characteristic  N = 25,385^{1} 

Weight loss  
Lose weight  12,138 (48%) 
Gain weight  4,686 (18%) 
Stay the same weight  4,187 (16%) 
Not trying to do anything  4,374 (17%) 
^{
1
}
Statistics presented: n (%)

On the other hand, if the missing data represents a valid category that you want counted, you could replace the missing values (which would shift the percentages).
dat %>%
dplyr::select(q69) %>%
# convert labelled values to a factor 
mutate_at(vars(q69), haven::as_factor) %>%
# replace missing values for factor 
mutate_at(vars(q69), forcats::fct_explicit_na, "Missing weight action") %>%
gtsummary::tbl_summary()
Characteristic  N = 30,389^{1} 

Weight loss  
Lose weight  12,138 (40%) 
Gain weight  4,686 (15%) 
Stay the same weight  4,187 (14%) 
Not trying to do anything  4,374 (14%) 
Missing weight action  5,004 (16%) 
^{
1
}
Statistics presented: n (%)

gtsummay
has a variety of supported outputs, including html and word. I work with many who prefer to see tables in excel, and this an output not directly supported in gtsummary
. (No shade here, I’m waiting patiently for this to happen in the gt
package. 😁) However, gtsummary
does have a work around available with the gtsummary::as_tibble()
function. Save the table as an object, convert to a tibble, and then export.
And now you can export to excel.
# 3  export to excel 
writexl::write_xlsx(gt_table, here::here( "_posts", "20210714polishedsummarytablesinrwithgtsummary", "example_gtsummary.xlsx"))
You can download the exported output example_gtsummary.xlsx.
The gtsummary
package has readable output that is easily customizable. My quest for the best presentation ready R table is over! 🥳
Daniel Sjoberg + team rock for their work on gtsummary
! Thanks to my colleague Patrick Freeman for reviewing this post and providing feedback, and thanks to new learneRs who inspire me to write. 💜
Text and figures are licensed under Creative Commons Attribution CC BY 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".
For attribution, please cite this work as
Pileggi (2021, July 14). PIPING HOT DATA: Polished summary tables in R with gtsummary. Retrieved from https://www.pipinghotdata.com/posts/20210714polishedsummarytablesinrwithgtsummary/
BibTeX citation
@misc{pileggi2021polished, author = {Pileggi, Shannon}, title = {PIPING HOT DATA: Polished summary tables in R with gtsummary}, url = {https://www.pipinghotdata.com/posts/20210714polishedsummarytablesinrwithgtsummary/}, year = {2021} }