21 Standardised rates
This page will show you two ways to standardize an outcome, such as hospitalizations or mortality, by characteristics such as age and sex.
- Using dsr package
- Using PHEindicatormethods package
We begin by extensively demonstrating the processes of data preparation/cleaning/joining, as this is common when combining population data from multiple countries, standard population data, deaths, etc.
21.1 Overview
There are two main ways to standardize: direct and indirect standardization. Let’s say we would like to the standardize mortality rate by age and sex for country A and country B, and compare the standardized rates between these countries.
- For direct standardization, you will have to know the number of the at-risk population and the number of deaths for each stratum of age and sex, for country A and country B. One stratum in our example could be females between ages 15-44.
- For indirect standardization, you only need to know the total number of deaths and the age- and sex structure of each country. This option is therefore feasible if age- and sex-specific mortality rates or population numbers are not available. Indirect standardization is furthermore preferable in case of small numbers per stratum, as estimates in direct standardization would be influenced by substantial sampling variation.
21.2 Preparation
To show how standardization is done, we will use fictitious population counts and death counts from country A and country B, by age (in 5 year categories) and sex (female, male). To make the datasets ready for use, we will perform the following preparation steps:
- Load packages
- Load datasets
- Join the population and death data from the two countries
- Pivot longer so there is one row per age-sex stratum
- Clean the reference population (world standard population) and join it to the country data
In your scenario, your data may come in a different format. Perhaps your data are by province, city, or other catchment area. You may have one row for each death and information on age and sex for each (or a significant proportion) of these deaths. In this case, see the pages on Grouping data, Pivoting data, and Descriptive tables to create a dataset with event and population counts per age-sex stratum.
We also need a reference population, the standard population. For the purposes of this exercise we will use the world_standard_population_by_sex
. The World standard population is based on the populations of 46 countries and was developed in 1960. There are many “standard” populations - as one example, the website of NHS Scotland is quite informative on the European Standard Population, World Standard Population and Scotland Standard Population.
Load packages
This code chunk shows the loading of packages required for the analyses. In this handbook we emphasize p_load()
from pacman, which installs the package if necessary and loads it for use. You can also load installed packages with library()
from base R. See the page on [R basics] for more information on R packages.
pacman::p_load(
rio, # import/export data
here, # locate files
tidyverse, # data management and visualization
stringr, # cleaning characters and strings
frailtypack, # needed for dsr, for frailty models
dsr, # standardise rates
PHEindicatormethods) # alternative for rate standardisation
CAUTION: If you have a newer version of R, the dsr package cannot be directly downloaded from CRAN. However, it is still available from the CRAN archive. You can install and use this one.
For non-Mac users:
packageurl <- "https://cran.r-project.org/src/contrib/Archive/dsr/dsr_0.2.2.tar.gz"
install.packages(packageurl, repos=NULL, type="source")
# Other solution that may work
require(devtools)
devtools::install_version("dsr", version="0.2.2", repos="http:/cran.us.r.project.org")
For Mac users:
require(devtools)
devtools::install_version("dsr", version="0.2.2", repos="https://mac.R-project.org")
Load population data
See the [Download handbook and data] page for instructions on how to download all the example data in the handbook. You can import the Standardisation page data directly into R from our Github repository by running the following import()
commands:
# import demographics for country A directly from Github
A_demo <- import("https://github.com/epirhandbook/Epi_R_handbook/raw/master/data/standardization/country_demographics.csv")
# import deaths for country A directly from Github
A_deaths <- import("https://github.com/epirhandbook/Epi_R_handbook/raw/master/data/standardization/deaths_countryA.csv")
# import demographics for country B directly from Github
B_demo <- import("https://github.com/epirhandbook/Epi_R_handbook/raw/master/data/standardization/country_demographics_2.csv")
# import deaths for country B directly from Github
B_deaths <- import("https://github.com/epirhandbook/Epi_R_handbook/raw/master/data/standardization/deaths_countryB.csv")
# import demographics for country B directly from Github
standard_pop_data <- import("https://github.com/epirhandbook/Epi_R_handbook/raw/master/data/standardization/world_standard_population_by_sex.csv")
First we load the demographic data (counts of males and females by 5-year age category) for the two countries that we will be comparing, “Country A” and “Country B”.
# Country A
A_demo <- import("country_demographics.csv")
# Country B
B_demo <- import("country_demographics_2.csv")
Load death counts
Conveniently, we also have the counts of deaths during the time period of interest, by age and sex. Each country’s counts are in a separate file, shown below.
Deaths in Country A
Deaths in Country B
Clean populations and deaths
We need to join and transform these data in the following ways:
- Combine country populations into one dataset and pivot “long” so that each age-sex stratum is one row
- Combine country death counts into one dataset and pivot “long” so each age-sex stratum is one row
- Join the deaths to the populations
First, we combine the country populations datasets, pivot longer, and do minor cleaning. See the page on Pivoting data for more detail.
pop_countries <- A_demo %>% # begin with country A dataset
bind_rows(B_demo) %>% # bind rows, because cols are identically named
pivot_longer( # pivot longer
cols = c(m, f), # columns to combine into one
names_to = "Sex", # name for new column containing the category ("m" or "f")
values_to = "Population") %>% # name for new column containing the numeric values pivoted
mutate(Sex = recode(Sex, # re-code values for clarity
"m" = "Male",
"f" = "Female"))
The combined population data now look like this (click through to see countries A and B):
And now we perform similar operations on the two deaths datasets.
deaths_countries <- A_deaths %>% # begin with country A deaths dataset
bind_rows(B_deaths) %>% # bind rows with B dataset, because cols are identically named
pivot_longer( # pivot longer
cols = c(Male, Female), # column to transform into one
names_to = "Sex", # name for new column containing the category ("m" or "f")
values_to = "Deaths") %>% # name for new column containing the numeric values pivoted
rename(age_cat5 = AgeCat) # rename for clarity
The deaths data now look like this, and contain data from both countries:
We now join the deaths and population data based on common columns Country
, age_cat5
, and Sex
. This adds the column Deaths
.
country_data <- pop_countries %>%
left_join(deaths_countries, by = c("Country", "age_cat5", "Sex"))
We can now classify Sex
, age_cat5
, and Country
as factors and set the level order using fct_relevel()
function from the forcats package, as described in the page on Factors. Note, classifying the factor levels doesn’t visibly change the data, but the arrange()
command does sort it by Country, age category, and sex.
country_data <- country_data %>%
mutate(
Country = fct_relevel(Country, "A", "B"),
Sex = fct_relevel(Sex, "Male", "Female"),
age_cat5 = fct_relevel(
age_cat5,
"0-4", "5-9", "10-14", "15-19",
"20-24", "25-29", "30-34", "35-39",
"40-44", "45-49", "50-54", "55-59",
"60-64", "65-69", "70-74",
"75-79", "80-84", "85")) %>%
arrange(Country, age_cat5, Sex)
CAUTION: If you have few deaths per stratum, consider using 10-, or 15-year categories, instead of 5-year categories for age.
Load reference population
Lastly, for the direct standardisation, we import the reference population (world “standard population” by sex)
# Reference population
standard_pop_data <- import("world_standard_population_by_sex.csv")
Clean reference population
The age category values in the country_data
and standard_pop_data
data frames will need to be aligned.
Currently, the values of the column age_cat5
from the standard_pop_data
data frame contain the word “years” and “plus”, while those of the country_data
data frame do not. We will have to make the age category values match. We use str_replace_all()
from the stringr package, as described in the page on Characters and strings, to replace these patterns with no space ""
.
Furthermore, the package dsr expects that in the standard population, the column containing counts will be called "pop"
. So we rename that column accordingly.
# Remove specific string from column values
standard_pop_clean <- standard_pop_data %>%
mutate(
age_cat5 = str_replace_all(age_cat5, "years", ""), # remove "year"
age_cat5 = str_replace_all(age_cat5, "plus", ""), # remove "plus"
age_cat5 = str_replace_all(age_cat5, " ", "")) %>% # remove " " space
rename(pop = WorldStandardPopulation) # change col name to "pop", as this is expected by dsr package
CAUTION: If you try to use str_replace_all()
to remove a plus symbol, it won’t work because it is a special symbol. “Escape” the specialnes by putting two back slashes in front, as in str_replace_call(column, "\\+", "")
.
Create dataset with standard population
Finally, the package PHEindicatormethods, detailed below, expects the standard populations joined to the country event and population counts. So, we will create a dataset all_data
for that purpose.
all_data <- left_join(country_data, standard_pop_clean, by=c("age_cat5", "Sex"))
This complete dataset looks like this:
21.3 dsr package
Below we demonstrate calculating and comparing directly standardized rates using the dsr package. The dsr package allows you to calculate and compare directly standardized rates (no indirectly standardized rates!).
In the data Preparation section, we made separate datasets for country counts and standard population:
- the
country_data
object, which is a population table with the number of population and number of deaths per stratum per country
- the
standard_pop_clean
object, containing the number of population per stratum for our reference population, the World Standard Population
We will use these separate datasets for the dsr approach.
Standardized rates
Below, we calculate rates per country directly standardized for age and sex. We use the dsr()
function.
Of note - dsr()
expects one data frame for the country populations and event counts (deaths), and a separate data frame with the reference population. It also expects that in this reference population dataset the unit-time column name is “pop” (we assured this in the data Preparation section).
There are many arguments, as annotated in the code below. Notably, event =
is set to the column Deaths
, and the fu =
(“follow-up”) is set to the Population
column. We set the subgroups of comparison as the column Country
and we standardize based on age_cat5
and Sex
. These last two columns are not assigned a particular named argument. See ?dsr
for details.
# Calculate rates per country directly standardized for age and sex
mortality_rate <- dsr::dsr(
data = country_data, # specify object containing number of deaths per stratum
event = Deaths, # column containing number of deaths per stratum
fu = Population, # column containing number of population per stratum
subgroup = Country, # units we would like to compare
age_cat5, # other columns - rates will be standardized by these
Sex,
refdata = standard_pop_clean, # reference population data frame, with column called pop
method = "gamma", # method to calculate 95% CI
sig = 0.95, # significance level
mp = 100000, # we want rates per 100.000 population
decimals = 2) # number of decimals)
# Print output as nice-looking HTML table
knitr::kable(mortality_rate) # show mortality rate before and after direct standardization
Subgroup | Numerator | Denominator | Crude Rate (per 100000) | 95% LCL (Crude) | 95% UCL (Crude) | Std Rate (per 100000) | 95% LCL (Std) | 95% UCL (Std) |
---|---|---|---|---|---|---|---|---|
A | 11344 | 86790567 | 13.07 | 12.83 | 13.31 | 23.57 | 23.08 | 24.06 |
B | 9955 | 52898281 | 18.82 | 18.45 | 19.19 | 19.33 | 18.46 | 20.22 |
Above, we see that while country A had a lower crude mortality rate than country B, it has a higher standardized rate after direct age and sex standardization.
Standardized rate ratios
# Calculate RR
mortality_rr <- dsr::dsrr(
data = country_data, # specify object containing number of deaths per stratum
event = Deaths, # column containing number of deaths per stratum
fu = Population, # column containing number of population per stratum
subgroup = Country, # units we would like to compare
age_cat5,
Sex, # characteristics to which we would like to standardize
refdata = standard_pop_clean, # reference population, with numbers in column called pop
refgroup = "B", # reference for comparison
estimate = "ratio", # type of estimate
sig = 0.95, # significance level
mp = 100000, # we want rates per 100.000 population
decimals = 2) # number of decimals
# Print table
knitr::kable(mortality_rr)
Comparator | Reference | Std Rate (per 100000) | Rate Ratio (RR) | 95% LCL (RR) | 95% UCL (RR) |
---|---|---|---|---|---|
A | B | 23.57 | 1.22 | 1.17 | 1.27 |
B | B | 19.33 | 1.00 | 0.94 | 1.06 |
The standardized mortality rate is 1.22 times higher in country A compared to country B (95% CI 1.17-1.27).
Standardized rate difference
# Calculate RD
mortality_rd <- dsr::dsrr(
data = country_data, # specify object containing number of deaths per stratum
event = Deaths, # column containing number of deaths per stratum
fu = Population, # column containing number of population per stratum
subgroup = Country, # units we would like to compare
age_cat5, # characteristics to which we would like to standardize
Sex,
refdata = standard_pop_clean, # reference population, with numbers in column called pop
refgroup = "B", # reference for comparison
estimate = "difference", # type of estimate
sig = 0.95, # significance level
mp = 100000, # we want rates per 100.000 population
decimals = 2) # number of decimals
# Print table
knitr::kable(mortality_rd)
Comparator | Reference | Std Rate (per 100000) | Rate Difference (RD) | 95% LCL (RD) | 95% UCL (RD) |
---|---|---|---|---|---|
A | B | 23.57 | 4.24 | 3.24 | 5.24 |
B | B | 19.33 | 0.00 | -1.24 | 1.24 |
Country A has 4.24 additional deaths per 100.000 population (95% CI 3.24-5.24) compared to country A.
21.4 PHEindicatormethods package
Another way of calculating standardized rates is with the PHEindicatormethods package. This package allows you to calculate directly as well as indirectly standardized rates. We will show both.
This section will use the all_data
data frame created at the end of the Preparation section. This data frame includes the country populations, death events, and the world standard reference population. You can view it here.
Directly standardized rates
Below, we first group the data by Country and then pass it to the function phe_dsr()
to get directly standardized rates per country.
Of note - the reference (standard) population can be provided as a column within the country-specific data frame or as a separate vector. If provided within the country-specific data frame, you have to set stdpoptype = "field"
. If provided as a vector, set stdpoptype = "vector"
. In the latter case, you have to make sure the ordering of rows by strata is similar in both the country-specific data frame and the reference population, as records will be matched by position. In our example below, we provided the reference population as a column within the country-specific data frame.
See the help with ?phr_dsr
or the links in the References section for more information.
# Calculate rates per country directly standardized for age and sex
mortality_ds_rate_phe <- all_data %>%
group_by(Country) %>%
PHEindicatormethods::phe_dsr(
x = Deaths, # column with observed number of events
n = Population, # column with non-standard pops for each stratum
stdpop = pop, # standard populations for each stratum
stdpoptype = "field") # either "vector" for a standalone vector or "field" meaning std populations are in the data
# Print table
knitr::kable(mortality_ds_rate_phe)
Country | total_count | total_pop | value | lowercl | uppercl | confidence | statistic | method |
---|---|---|---|---|---|---|---|---|
A | 11344 | 86790567 | 23.56686 | 23.08107 | 24.05944 | 95% | dsr per 100000 | Dobson |
B | 9955 | 52898281 | 19.32549 | 18.45516 | 20.20882 | 95% | dsr per 100000 | Dobson |
Indirectly standardized rates
For indirect standardization, you need a reference population with the number of deaths and number of population per stratum. In this example, we will be calculating rates for country A using country B as the reference population, as the standard_pop_clean
reference population does not include number of deaths per stratum.
Below, we first create the reference population from country B. Then, we pass mortality and population data for country A, combine it with the reference population, and pass it to the function phe_isr()
, to get indirectly standardized rates. Of course, you can do it also vice versa.
Of note - in our example below, the reference population is provided as a separate data frame. In this case, we make sure that x =
, n =
, x_ref =
and n_ref =
vectors are all ordered by the same standardization category (stratum) values as that in our country-specific data frame, as records will be matched by position.
See the help with ?phr_isr
or the links in the References section for more information.
# Create reference population
refpopCountryB <- country_data %>%
filter(Country == "B")
# Calculate rates for country A indirectly standardized by age and sex
mortality_is_rate_phe_A <- country_data %>%
filter(Country == "A") %>%
PHEindicatormethods::phe_isr(
x = Deaths, # column with observed number of events
n = Population, # column with non-standard pops for each stratum
x_ref = refpopCountryB$Deaths, # reference number of deaths for each stratum
n_ref = refpopCountryB$Population) # reference population for each stratum
# Print table
knitr::kable(mortality_is_rate_phe_A)
observed | expected | ref_rate | value | lowercl | uppercl | confidence | statistic | method |
---|---|---|---|---|---|---|---|---|
11344 | 15847.42 | 18.81914 | 13.47123 | 13.22446 | 13.72145 | 95% | isr per 100000 | Byars |
21.5 Resources
If you would like to see another reproducible example using dsr please see this vignette
For another example using PHEindicatormethods, please go to this website
See the PHEindicatormethods reference pdf file