Integrating survival curves

In this blog post, we will see how to use numerical integration to integrate survival curves in R.

Published

March 29, 2020

A survival curve $S(t)$ represents the probability of a given event to happen at a given time point $t$. It is often estimated using the Kaplan-Meier (product-limit) estimator.

However, sometimes it is useful to calculate the integral of a survival function, e.g. when computing restricted mean survival time.

The restricted mean is a measure of average survival from time 0 to a specified time point, and may be estimated as the area under the survival curve up to that point.

Formally, the restricted mean survival time $\mu$ is the mean of the survival time, limited to some horizon time $t^*$. It equals the area under the survival curve $S(t)$ from time zero to $t^*$:

\[ \mu = \int_0^{t^*} S(u) \ du \]

Most interestingly, the difference in restricted mean survival time between groups provides an alternative measure of treatment effect. Remember: the hazard ratio is not (in general) an appropriate measure of treatment effect in randomised trials with time to event outcomes.

The restricted mean survival time can be easily computed analytically if $S(t)$ follows a (piecewise) exponential distribution. Otherwise, we can always compute the integral above numerically. The goal of this post is to show how to do so.

Data

Every example needs an example dataset: here I’ll be using data from the German breast cancer study, which comes bundled with Stata.

We can import it in R directly from the web:

library(haven)
brcancer <- read_dta("https://www.stata-press.com/data/r16/brcancer.dta")
brcancer

# A tibble: 686 × 14
      id hormon    x1    x2    x3    x4    x5    x6    x7 rectime censrec   x4a
   <dbl>  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>   <dbl>   <dbl> <dbl>
 1     1      0    70     2    21     2     3    48    66    1814       1     1
 2     2      1    56     2    12     2     7    61    77    2018       1     1
 3     3      1    58     2    35     2     9    52   271     712       1     1
 4     4      1    59     2    17     2     4    60    29    1807       1     1
 5     5      0    73     2    35     2     1    26    65     772       1     1
 6     6      0    32     1    57     3    24     0    13     448       1     1
 7     7      1    59     2     8     2     2   181     0    2172       0     1
 8     8      0    65     2    16     2     1   192    25    2161       0     1
 9     9      0    80     2    39     2    30     0    59     471       1     1
10    10      0    66     2    18     2     7     0     3    2014       0     1
# ℹ 676 more rows
# ℹ 2 more variables: x4b <dbl>, x5e <dbl>

For simplicity, I will only include the treatment of interest (hormon, which represents hormonal therapy) as a covariate in the model. The survival time is rectime (in days), and the event indicator variable is censrec.

I am also creating a new variable hormon2 that will be useful for plotting:

brcancer$hormon2 <- ifelse(brcancer$hormon == 0, "Control arm", "Treatment arm")

Modelling

I fit a flexible parametric model using the rstpm2 package and assuming 5 degrees of freedom for the baseline hazard:

library(rstpm2)
fit <- stpm2(Surv(rectime, censrec) ~ hormon, data = brcancer, df = 5)
summary(fit)

Maximum likelihood estimation

Call:
stpm2(formula = Surv(rectime, censrec) ~ hormon, data = brcancer, 
    df = 5)

Coefficients:
                           Estimate Std. Error z value     Pr(z)    
(Intercept)                -6.82308    0.78776 -8.6613 < 2.2e-16 ***
hormon                     -0.36445    0.12493 -2.9173   0.00353 ** 
nsx(log(rectime), df = 5)1  5.35779    0.77579  6.9062 4.976e-12 ***
nsx(log(rectime), df = 5)2  5.88972    0.78948  7.4602 8.639e-14 ***
nsx(log(rectime), df = 5)3  5.01742    0.50802  9.8764 < 2.2e-16 ***
nsx(log(rectime), df = 5)4  9.94593    1.53864  6.4641 1.019e-10 ***
nsx(log(rectime), df = 5)5  4.92988    0.36824 13.3875 < 2.2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

-2 log L: 5213.05

The estimated hazard ratio is 0.695… but we don’t care about this measure, right?

A benefit of modelling the baseline hazard (as compared to a Cox model) is that we can easily obtain smooth fitted survival curves for each combination of covariates of interest. For instance, the fitted survival curves for the two treatment arms are:

# Prediction:
fitted_curves <- predict(fit, type = "surv", se.fit = TRUE)
brcancer$S_hat <- fitted_curves$Estimate
brcancer$S_hat_lower <- fitted_curves$lower
brcancer$S_hat_upper <- fitted_curves$upper

# Plot:
ggplot(brcancer, aes(x = rectime, y = S_hat, linetype = hormon2)) +
  geom_ribbon(aes(ymin = S_hat_lower, ymax = S_hat_upper), alpha = 0.25) +
  geom_line() +
  coord_cartesian(ylim = c(0, 1)) +
  scale_x_continuous(breaks = 365 * seq(7)) +
  theme(legend.position = c(1, 1), legend.justification = c(1, 1)) +
  labs(x = "Follow-up time (days)", y = "Fitted survival", linetype = "")

So easy!

Computing restricted mean survival time

Now, let’s get down to business.

The fitted model is a flexible parametric model; in this setting, it is not straightforward to obtain an analytical solution.

We can, however, use spline-interpolation followed by numerical integration to compute the above integral very easily, and without needing to install any extra package. It will be a three-steps procedure:

We obtain a fine grid of predictions for each survival curve that we want to integrate (over follow-up time);
We obtain a function representing the spline interpolation of each curve;
We integrate the function that we just obtained.
Piece of cake!

Ok, there are four steps above, but don’t we deserve some cake?

Step number one:

data_grid <- expand.grid(
  rectime = seq(.Machine$double.eps, 365 * 5, length.out = 100),
  hormon = unique(brcancer$hormon)
)

Here we set $t^*$ as 5 years, hence creating an appropriate grid for time and each treatment arm. Also, we start from .Machine$double.eps (roughly 2e-16) instead of zero as survival times need to be strictly positive; nevertheless, the difference will be negligible.

Then, onto step two. We predict the survival curve based on this grid:

data_grid$fitted <- predict(fit, newdata = data_grid, type = "surv")

Done.

Step three consists of fitting an interpolating spline for each fitted survival curve:

data_grid_0 <- data_grid[data_grid$hormon == 0, ]
int_spline_0 <- splinefun(
  x = data_grid_0$rectime,
  y = data_grid_0$fitted,
  method = "natural"
)

data_grid_1 <- data_grid[data_grid$hormon == 1, ]
int_spline_1 <- splinefun(
  x = data_grid_1$rectime,
  y = data_grid_1$fitted,
  method = "natural"
)

Here I am using a natural spline for the interpolation, but other methods work well too (in my experience) given that the function we are trying to interpolate is quite regular.

The cool thing about splinefun is that it returns a function that we can call. For instance, to obtain fitted (interpolated) survival at one year:

int_spline_0(365)

[1] 0.9072905

int_spline_1(365)

[1] 0.9346556

Finally, step four consist of using the built-in function integrate to (what a surprise!) integrate the two functions we just obtained:

RMST_0 <- integrate(f = int_spline_0, lower = 0, upper = 365 * 5)$value
RMST_1 <- integrate(f = int_spline_1, lower = 0, upper = 365 * 5)$value

We now have restricted mean survival time (in days) for both treatment arms:

RMST_0

[1] 1266.97

RMST_1

[1] 1406.486

In years, that will be:

RMST_0 / 365

[1] 3.471151

RMST_1 / 365

[1] 3.853386

Calculating the difference between the two is now trivial:

RMST_1 - RMST_0

[1] 139.5155

(RMST_1 - RMST_0) / 365

[1] 0.3822341

Turns out, women treated with hormonal therapy have a 0.382-years longer time to recurrence compared to controls (and focussing on the first five years of follow-up only). Cool!

We would need to quantify the uncertainty in our estimate (and there is a closed-form formula for the standard error of the difference), but that’s a lesson for another day.

What are we actually calculating?

The two fitted survival curves that we are comparing are the following:

data_grid$hormon2 <- ifelse(data_grid$hormon == 0, "Control arm", "Treatment arm")

ggplot(data_grid, aes(x = rectime, y = fitted, linetype = hormon2)) +
  geom_line() +
  coord_cartesian(ylim = c(0, 1)) +
  scale_x_continuous(breaks = 365 * seq(5)) +
  theme(legend.position = c(1, 1), legend.justification = c(1, 1)) +
  labs(x = "Follow-up time (days)", y = "Fitted survival", linetype = "")

RMST_0 is the area below the survival curve of the control arm:

ggplot() +
  geom_ribbon(data = data_grid_0, aes(x = rectime, ymin = 0, ymax = fitted), alpha = 0.2) +
  geom_line(data = data_grid, aes(x = rectime, y = fitted, linetype = hormon2)) +
  coord_cartesian(ylim = c(0, 1)) +
  scale_x_continuous(breaks = 365 * seq(5)) +
  theme(legend.position = c(1, 1), legend.justification = c(1, 1)) +
  labs(x = "Follow-up time (days)", y = "Fitted survival", linetype = "")

RMST_1 is the area below the survival curve of the treatment arm:

ggplot() +
  geom_ribbon(data = data_grid_1, aes(x = rectime, ymin = 0, ymax = fitted), alpha = 0.2) +
  geom_line(data = data_grid, aes(x = rectime, y = fitted, linetype = hormon2)) +
  coord_cartesian(ylim = c(0, 1)) +
  scale_x_continuous(breaks = 365 * seq(5)) +
  theme(legend.position = c(1, 1), legend.justification = c(1, 1)) +
  labs(x = "Follow-up time (days)", y = "Fitted survival", linetype = "")

The difference between RMST_1 and RMST_0 is the area between the curves:

library(tidyr)
data_grid_diff <- pivot_wider(data_grid[, -4], names_from = hormon, values_from = fitted)

ggplot() +
  geom_line(data = data_grid, aes(x = rectime, y = fitted, linetype = hormon2)) +
  geom_ribbon(data = data_grid_diff, aes(x = rectime, ymin = `0`, ymax = `1`), alpha = 0.2) +
  coord_cartesian(ylim = c(0, 1)) +
  scale_x_continuous(breaks = 365 * seq(5)) +
  theme(legend.position = c(1, 1), legend.justification = c(1, 1)) +
  labs(x = "Follow-up time (days)", y = "Fitted survival", linetype = "")

data_grid_diff is a dataset that was reshaped for plotting purposes, but otherwise contains the same information of data_grid.

Wrapping up

This post showed how to compute restricted mean survival time after fitting a flexible parametric model, but it can be replicated with every model that yields predictions for the survival function $S(t)$.

I started writing this blog post with the idea of describing how to integrate survival functions in general terms but got carried away with the estimation of restricted mean survival time. Nevertheless, knowing how to integrate a survival function can be broadly useful e.g. when comparing different predictions from competing models or when comparing with the truth (if known, as in simulation studies). And it’s terribly easy!

This example used functions that come bundled with R for simplicity, but other functions and packages can be used as well. For instance, I prefer using the quadinf function from the pracma package for numerical integration, as I found it (empirically, of course) to be more robust and reliable than integrate. Other approaches for numerical integration can be used as well, such as Gaussian quadrature or more simple trapezoidal rules (and so on).

As always, let me know if I got something horribly wrong. Cheers!