# Modelling Longitudinal Electronic Health Records Data with an Informative Observation Process and Drop-Out
### Alessandro Gasparini
### 2019-10-31

---

# Introduction

---

# Background

Electronic Health Records [EHRs] are medical records generated when patients attend medical care (e.g. visiting the GP) and stored in a digital format.

We can construct data cohorts for research use by extracting and linking:

* EHRs from primary, specialist, and hospital care;

* nationwide registries;

* any other data source that could be linked to the above.

This kind of data (sometimes referred to as _health care consumption data_) is being increasingly used in medical research.

For instance:

* Kidney disease;

* Cardiovascular disease;

* End-of-life healthcare.

---

# Background

Health care consumption data cohorts have thousands - if not millions - of individuals with hundreds of measurements each.

The availability to researchers of such a vast amount of data allows answering more relevant and detailed clinical questions but poses new (methodological) challenges.
Among others:

1. Informative censoring (drop-out);

1. Informative observation process;

1. Reporting (RECORD guidelines, Benchimol _et al_., 2015).

In health care records:

1. Observation times are likely correlated with disease severity;

1. Dropout (censoring) is likely informative.

<!---
1. is generally not true in clinical trials and observational studies when visit times occur at random.

Individuals tend to have irregular observation times as patients with more severe conditions (or showing early symptoms of a disease) tend to visit their GP or go to the hospital more often than those with milder conditions (and no symptoms).
Their worse disease status is also likely to be reflected in worse biomarker values being recorded at such visits, causing abnormal values of such biomarkers to be overrepresented and normal values to be underrepresented.
-->

---

# Informative Observation / Drop-Out

Common assumptions with traditional methods for analysing longitudinal data:

> The mechanism that controls the observation times is independent of disease severity;
> 
> The drop-out process is independent of disease severity.

* Joint models for longitudinal-survival data can account for informative drop-out by modelling the censoring process;

* Research is scarce on whether inference is valid when the observation process is informative.

If the observation plan is dynamic, we must account for it in the analysis.

Otherwise, two types of bias can arise: selection bias and confounding (more details elsewhere<sup>1</sup>).

.footnote[
[1] MA Hernan, M McAdams, N McGrath, E Lanoy, D Costagliola (2009).
_Observation plans in longitudinal studies with time-varying treatments._ 
Statistical Methods in Medical Research 18(1):27-52
]

---

## We focus on the observation process (for now)

---

# Informative Observation

Despite the potential for bias, there is evidence of a lack of awareness of this issue in longitudinal studies that use healthcare data collected irregularly over time.

> A recent review<sup>2</sup> showed that 86% of the included studies did not report enough information to evaluate whether the visiting process was informative.

.footnote[
[2] D Farzanfar, A Abumuamar, J Kim, E Sirotich, Y Wang, EM Pullenayegum (2017).
_Longitudinal studies that use data collected as part of usual care risk reporting biased results: a systematic review._ 
BMC Medical Research Methodology 17(1):133
]

---

# Characteristics of the Observation Process

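Let the observation process for individual `$i$` be the counting process `$N_i(t)$`, i.e. the number of visits up to time `$t$`, with `$Y_i$` denoting the outcome, `$X_i$` the covariates, and an overbar the history of a process.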

When the visiting pattern is irregular, `$N_i(t)$` can be defined to be completely at random when visit times and outcome(s) are independent:

$$
E[\Delta N_i(t) | \bar{Y}_i(\infty), \bar{X}_i(\infty)] = E[\Delta N_i(t)]
$$

The observation process can be deemed _informative_ when it is not completely at random:

* Observation process at random:

$$
E[\Delta N_i(t) | \bar{X}_i(t), \bar{N}_i(t^{-}), \bar{Y}_i^{\text{obs}}(t^{-}), Y_i(t)] = E[\Delta N_i(t) | \bar{X}_i^{\text{obs}}(t), \bar{N}_i(t^{-}), \bar{Y}_i^{\text{obs}}(t^{-})]
$$

* Observation process not at random:

$$
E[\Delta N_i(t) | \bar{X}_i(t), \bar{N}_i(t^{-}), \bar{Y}_i^{\text{obs}}(t^{-}), Y_i(t)] \neq E[\Delta N_i(t) | \bar{X}_i^{\text{obs}}(t), \bar{N}_i(t^{-}), \bar{Y}_i^{\text{obs}}(t^{-})]
$$

---

# Observation Models

Gruger _et al_.<sup>3</sup> illustrate four possible models that could be linked to the above-mentioned scenarios:

1. The _examination at regular intervals_ model, consisting of observation times that are pre-defined and equal for all patients (as in clinical trials);

1. The _random sampling_ model, consisting of a sampling scheme (i.e. an observation process) that is not pre-defined, but still independent of the disease history of the study subjects;

1. The _doctor's care_ model, consisting of an observation process that depends on the characteristics of the patient at the moment of the current doctor's examination;

1. The _patient self-selection_ model, yielding observations that are triggered by the patients themselves.

.footnote[
[3] J Gruger, R Kay, M Schumacher (1991).
_The validity of inferences based on incomplete observations in disease state models._
Biometrics 47(2):595-605
]

Models (1) and (2) could be characterised as _observation completely at random_; model (3) could be characterised as _observation at random_; finally, model (4) could be characterised as _observation not at random_.

---

# What Can We Do About It?

---

# Inverse Intensity of Visiting Weighting [IIVW]

This approach accommodates an informative observation process in a marginal regression model by weighting each observation by the inverse of the intensity with which that measurement was recorded.

This approach creates a pseudo-population in which the observation process is static (i.e. completely at random) and can, therefore, be ignored.

Inverse intensity of visiting weighting (IIVW) was first proposed by Lin _et al_.<sup>5</sup> and Robins _et al_.<sup>4</sup> and further extended by Buzkova and Lumley<sup>6</sup>.

.footnote[
[4] JM Robins, A Rotnitzky, LP Zhao (1995).
_Analysis of Semiparametric Regression Models for Repeated Outcomes in the Presence of Missing Data._
Journal of the American Statistical Association 90(429):106-121

[5] H Lin, DO Scharfstein, R Rosenheck (2004).
_Analysis of longitudinal data with irregular, outcome-dependent follow-up._
Journal of the Royal Statistical Society: Series B (Statistical Methodology) 66(3):791-813

[6] P Buzkova, T Lumley (2007).
_Longitudinal data analysis for generalized linear models with follow-up dependent on outcome-related variables._
Canadian Journal of Statistics 35(4):485-500
]

---

# IIVW

The longitudinal model is:

$$
g[\mu_i(t)] = X_i(t) \beta
$$

We assume that we can identify a set of auxiliary variables `$Z_i(t)$` such that the visiting process is independent of the current outcome given such variables.

The model for the weights is a proportional hazards model for the intensity of visiting:

$$
h(t, Z_i(t)) = h_0(t) \exp(Z_i(t) \gamma),
$$

which yields the following weights:

$$
w_i(t) = \frac{s(t)}{h_0(t) \exp(Z_i(t) \gamma)}
$$

Here `$s(t)$` is a stabilising function, e.g. the baseline hazard function `$h_0(t)$`, in which case the weights simplify to `$\exp(-Z_i(t) \gamma)$`.
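
A minimal sketch of the weight formula, assuming the coefficients `$\gamma$` of the visit intensity model have already been estimated (the values of `gamma_hat` and `z_rows` below are hypothetical) and taking `$s(t) = h_0(t)$`, so that the baseline hazard cancels:

```python
import math


def iivw_weight(z, gamma):
    """Stabilised inverse-intensity weight with s(t) = h_0(t).

    w = h_0(t) / (h_0(t) * exp(z . gamma)) = exp(-(z . gamma)),
    so the baseline hazard never needs to be estimated.
    """
    linear_predictor = sum(z_k * g_k for z_k, g_k in zip(z, gamma))
    return math.exp(-linear_predictor)


# Hypothetical estimates of gamma for two auxiliary variables
# (e.g. treatment and the previously observed outcome); illustration only.
gamma_hat = [0.5, 0.2]

# One row of auxiliary variables Z_i(t) per recorded measurement.
z_rows = [[0.0, 0.0], [1.0, 0.0], [1.0, 2.5]]
weights = [iivw_weight(z, gamma_hat) for z in z_rows]
```

Measurements recorded when the visit intensity is high receive weights below 1, so over-represented visits are down-weighted in the marginal model.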

---

# Joint Modelling

We can fit a generalised multi-equation joint model for the informative visit times and the longitudinal outcome:

$$
`\begin{align*}
r_i(t) &= r_0(t) \exp(w_i \beta + u_i) \tag{1} \\
y_{ij} | N_i(t) &= z_{ij} \alpha + \gamma u_i + v_i + \epsilon_{ij} \tag{2}
\end{align*}`
$$

* `$i$` and `$j$` index individuals and observations, respectively;

* `$y_{ij}$` are the observed values of the longitudinal outcome;

* `$z_{ij}$` and `$w_i$` are covariate vectors;

* `$u_i$` and `$v_i$` are normally distributed random effects with `$E(u) = E(v) = 0$`;

* `$\gamma$` is the association parameter.

This model follows from Liu _et al_.<sup>7</sup> and can be fitted with readily available software.

.footnote[
[7] L Liu, X Huang, J O'Quigley (2008).
_Analysis of longitudinal data in presence of informative observational times and a dependent terminal event, with application to medical cost data._
Biometrics 64:950-958
]

---

# Adjusting for the Number of Measurements

Methods of this kind are based on work by Goldstein _et al_. (2016).

They investigate what they term _informed presence bias_ and show that:

1. By conditioning on the number of health-care encounters, it is possible to remove bias due to an informative observation process;

2. Such an approach can result in selection bias under some settings.

Anecdotally, this approach seems to be quite popular in practice.

Adjusting for the cumulative number of observations seems to be popular as well.

Intuitively, responses from individuals with several previous visits would differ from individuals with only a few visits, and including the number of prior visits as a covariate could control for these differences.
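
Both covariates discussed above (the total and the cumulative number of measurements) can be derived directly from long-format data. A minimal sketch, assuming records arrive as `(id, time)` pairs, which is a hypothetical layout chosen for illustration:

```python
from collections import Counter, defaultdict


def visit_count_covariates(records):
    """Return (id, time, total_visits, prior_visits) for each record.

    total_visits: total number of measurements for that individual.
    prior_visits: number of measurements recorded strictly before
    this one (the cumulative count at the time of the visit).
    """
    totals = Counter(pid for pid, _ in records)
    seen = defaultdict(int)
    rows = []
    # Sort by individual and time so the running count is chronological.
    for pid, time in sorted(records):
        rows.append((pid, time, totals[pid], seen[pid]))
        seen[pid] += 1
    return rows


records = [("a", 0.0), ("b", 0.0), ("a", 1.5), ("a", 0.7)]
rows = visit_count_covariates(records)
```

Note that the total count is only known at the end of follow-up, whereas the cumulative count uses past visits only.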

---

# Do Nothing...

Neuhaus _et al_.<sup>8</sup> showed that in their settings the standard mixed model analyses had essentially no bias for covariates that did not have associated random effects in the model and little bias otherwise.

They also give the following advice:

> Combining a small number of regular visits with the irregular (and highly outcome dependent) visits greatly reduced even this small bias.

.footnote[
[8] JM Neuhaus, CE McCulloch, RD Boylan (2018).
_Analysis of longitudinal data from outcome-dependent visit processes: Failure of proposed methods in realistic settings and potential improvements_.
Statistics in Medicine, 37(29):4457-4471
]

---

# Does It Really Matter?

---

# A Monte Carlo Simulation Study

Comprehensive comparisons of the performance of different methods are (very) scarce in the literature.
The only paper I could find was Neuhaus _et al_. (2018), which was mentioned before.

There is low awareness of the potential for bias and no guidance on how to address it (Farzanfar _et al_., 2017).

The aims of this simulation study are:

1. Comparing the performance of the methods previously described;

1. Studying the consequences of ignoring the visiting process.

---

# Data-Generating Mechanism (1)

Simulating data from the joint model:

$$
`\begin{align*}
r_i(t) &= r_0(t) \exp(Z_i \beta + u_i) \\
y_{ij} | N_i(t) &= \alpha_0 + Z_i \alpha_1 + t_{ij} \alpha_2 + \gamma u_i + v_i + \epsilon_{ij}
\end{align*}`
$$

* binary treatment `$Z_i$`;

* `$\beta$` = 1, `$\alpha_0$` = 0, `$\alpha_1$` = 1, `$\alpha_2$` = 0.2;

* `$\sigma^2_u$` = 1, `$\sigma^2_v$` = 0.5, `$\sigma^2_{\epsilon}$` = 1;

* `$r_0(t)$`: Weibull with shape `$p$` = 1.05 and scale `$\lambda$` = {0.10, 0.30, 1.00};

* `$\gamma$` = {0.00, 1.50};

* 200 individuals, with independent censoring from Unif(6, 12).
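
One possible way to simulate from this mechanism is sketched below. It rests on assumptions that the slides do not state: the Weibull baseline is parameterised as `$r_0(t) = \lambda p t^{p - 1}$`, visits follow a Poisson process on the follow-up time scale (simulated by inverting the cumulative intensity), and treatment is allocated by a fair coin flip. The function name `simulate_individual` is illustrative.

```python
import math
import random


def simulate_individual(rng, z, lam, gamma, beta=1.0, alpha=(0.0, 1.0, 0.2),
                        p=1.05, var_u=1.0, var_v=0.5, var_eps=1.0):
    """Simulate visit times and outcomes for one individual.

    Assumptions of this sketch (not stated in the slides): the baseline
    intensity is r_0(t) = lam * p * t**(p - 1), and visits follow a
    Poisson process on the follow-up time scale, simulated by inverting
    the cumulative intensity lam * t**p * exp(z * beta + u).
    """
    u = rng.gauss(0.0, math.sqrt(var_u))  # shared random effect
    v = rng.gauss(0.0, math.sqrt(var_v))  # outcome-only random effect
    censoring = rng.uniform(6.0, 12.0)    # independent censoring time
    rate = lam * math.exp(z * beta + u)

    visits, t = [], 0.0
    while True:
        # Next event: the cumulative intensity increases by an Exp(1) draw.
        t = (t ** p + rng.expovariate(1.0) / rate) ** (1.0 / p)
        if t > censoring:
            break
        eps = rng.gauss(0.0, math.sqrt(var_eps))
        y = alpha[0] + z * alpha[1] + t * alpha[2] + gamma * u + v + eps
        visits.append((t, y))
    return visits


rng = random.Random(2019)
# 200 individuals; treatment is assigned by a fair coin flip
# (the allocation mechanism is an assumption of this sketch).
cohort = [simulate_individual(rng, z=rng.randint(0, 1), lam=0.30, gamma=1.50)
          for _ in range(200)]
```

With `gamma=0.0` the shared random effect `u` no longer enters the outcome, so the visit times carry no information about it.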

---

# Data-Generating Mechanism (2)

Simulating observation times from a `$\Gamma$` distribution with a given shape (= 2.0) and scale defined as `$\exp(-\beta \theta Z_i + \rho Y_{i, j - 1} + \xi_i)$`.

`$\xi_i$` is random noise from a Normal distribution.

Scenarios:

1. `$\theta$` = 0.00 and `$\rho$` = 0.00;

2. `$\theta$` = 2.00 and `$\rho$` = 0.00;

3. `$\theta$` = 2.00 and `$\rho$` = 0.20.

After generating the observation process, the longitudinal process is simulated from the same model as before.

Finally, we simulate a scenario from a joint model with a sparse observation process strongly associated with the outcome, to which we add scheduled measurements every year.
Assumed parameters: `$\lambda$` = 0.05, `$\gamma$` = 3.00.
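
The Gamma-based mechanism can be sketched as follows for a single individual. Several details are assumptions of this sketch rather than facts from the slides: the previous outcome is set to 0 before the first visit, `$\xi_i$` has a standard deviation of 0.1, censoring is again Unif(6, 12), and the random effects have the same variances as in the first mechanism.

```python
import math
import random


def simulate_gamma_visits(rng, z, theta, rho, beta=1.0, gamma=0.0,
                          alpha=(0.0, 1.0, 0.2), sd_xi=0.1):
    """Simulate outcome-dependent visit times for one individual.

    Gap times are Gamma(shape = 2, scale = exp(-beta * theta * z
    + rho * y_prev + xi)). Assumptions of this sketch: y_prev = 0 before
    the first visit, and xi has standard deviation sd_xi (hypothetical).
    """
    u = rng.gauss(0.0, 1.0)             # sigma^2_u = 1
    v = rng.gauss(0.0, math.sqrt(0.5))  # sigma^2_v = 0.5
    xi = rng.gauss(0.0, sd_xi)          # individual-level noise
    censoring = rng.uniform(6.0, 12.0)  # assumed, as in mechanism (1)

    visits, t, y_prev = [], 0.0, 0.0
    while True:
        scale = math.exp(-beta * theta * z + rho * y_prev + xi)
        t += rng.gammavariate(2.0, scale)  # arguments: (shape, scale)
        if t > censoring:
            break
        y_prev = (alpha[0] + z * alpha[1] + t * alpha[2]
                  + gamma * u + v + rng.gauss(0.0, 1.0))
        visits.append((t, y_prev))
    return visits


rng = random.Random(1)
# Scenario 3: visit times depend on treatment and on the previous outcome.
treated = [simulate_gamma_visits(rng, z=1, theta=2.0, rho=0.2) for _ in range(200)]
control = [simulate_gamma_visits(rng, z=0, theta=2.0, rho=0.2) for _ in range(200)]
```

With `theta=2.0`, treated individuals have a smaller scale and hence shorter gap times, so they accumulate more measurements than controls over the same follow-up.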

---

# Estimands

The estimands of interest are the regression coefficients of the longitudinal model:

1. `$\alpha_0$`, the intercept;

2. `$\alpha_1$`, the treatment effect;

3. `$\alpha_2$`, the effect of time.

.Large[
> The main estimand of interest will be the treatment effect `$\alpha_1$`.
]

---

# Models Included in this Comparison

1. The joint model used to simulate data;

2. A mixed-effects model disregarding the observation process;

3. A mixed-effects model, adjusting for the total number of measurements;

4. A mixed-effects model, adjusting for the cumulative number of measurements;

5. A model fit using generalised estimating equations [GEE] and IIVW, following the approach outlined in Van Ness _et al_.<sup>9</sup>.

.footnote[
[9] PH Van Ness, HG Allore, T Fried _et al_. (2009).
_Inverse intensity weighting in generalized linear models as an option for analyzing longitudinal data with triggered observations_.
American Journal of Epidemiology 171(1):105-112
]

---

# Performance Measures and Number of Replications

We focus on the following performance measures:

1. bias, i.e. whether an estimator targets the true value on average;

2. coverage, i.e. the proportion of times that a confidence interval around each estimated value contains the true value.
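
Both measures can be computed from the per-replication estimates and standard errors. A minimal sketch, assuming Wald-type 95% confidence intervals (an assumption; the interval type is not stated here) and using made-up numbers:

```python
def bias_and_coverage(estimates, std_errors, true_value, z_crit=1.96):
    """Bias and coverage of Wald-type intervals across replications."""
    n_sim = len(estimates)
    bias = sum(estimates) / n_sim - true_value
    covered = sum(
        1 for est, se in zip(estimates, std_errors)
        if est - z_crit * se <= true_value <= est + z_crit * se
    )
    return bias, covered / n_sim


# Toy replications of an estimator of alpha_1 = 1 (made-up numbers).
estimates = [0.9, 1.1, 1.3, 0.7]
std_errors = [0.1, 0.1, 0.1, 0.1]
bias, coverage = bias_and_coverage(estimates, std_errors, true_value=1.0)
```

In this toy example the estimator is unbiased on average, yet only half of the intervals contain the true value, which shows why both measures are reported.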

We run 1,000 replications:

1. Assuming (1) a variance of each estimate of 0.1 or lower and (2) a Monte Carlo standard error for bias of 0.01 or lower, we require 1,000 replications;

1. The expected Monte Carlo standard error for coverage, assuming a worst case scenario of coverage = 0.50, would be 0.02.
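
These numbers follow from the standard formulas `$\text{MCSE}(\text{bias}) = \sqrt{\text{Var} / n_{\text{sim}}}$` and `$\text{MCSE}(\text{coverage}) = \sqrt{\text{cov} (1 - \text{cov}) / n_{\text{sim}}}$`. A quick check of the arithmetic (the function names are illustrative):

```python
import math


def n_sim_for_bias(variance, target_mcse):
    """Replications needed so that sqrt(variance / n_sim) <= target_mcse."""
    # Round before taking the ceiling to absorb floating-point error.
    return math.ceil(round(variance / target_mcse ** 2, 6))


def mcse_coverage(coverage, n_sim):
    """Monte Carlo standard error of an estimated coverage probability."""
    return math.sqrt(coverage * (1.0 - coverage) / n_sim)


n_sim = n_sim_for_bias(variance=0.1, target_mcse=0.01)
worst_case = mcse_coverage(coverage=0.50, n_sim=n_sim)
```

Here `n_sim` is 1,000 and `worst_case` is approximately 0.016, i.e. 0.02 when rounded to two decimal places.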

---

# Results

---

# Bias of Treatment Effect

---

# Bias of Intercept

---

# Bias of Time Coefficient

---

# Coverage of Treatment Effect

---

# Coverage of Intercept

---

# Coverage of Time Coefficient

---

# Application

---

# Data

We illustrate the results of this simulation study using data from the PSP-CKD study<sup>10</sup>.

* Outcome: estimated renal function (eGFR);

* Intervention: enhanced CKD care compared to routine care, with randomisation at the practice level;

* Covariates: age at baseline, sex;

* 187,671 observations;

* 35,822 individuals;

* Approximately 3 years of follow-up.

.footnote[
[10] RW Major, C Brown, D Shepherd _et al_. (2019).
_The Primary-Secondary Care Partnership to Improve Outcomes in Chronic Kidney Disease (PSP-CKD) Study: A Cluster Randomized Trial in Primary Care_.
Journal of the American Society of Nephrology 30(7):1261-1270
]

---

# Informative Observation?

There is no formal test for the hypothesis of an informative observation process.
However:

Correlations between observation times (gap times between observations) and treatment, age, sex, and eGFR are statistically significant.

Fitting a mixed model for gap times:

1. Females had 10.51 days longer gap times (95% C.I.: 7.79 to 13.23);

1. Treated individuals had 3.80 days shorter gap times (95% C.I.: 1.14 to 6.46);

1. Each 5-year age difference was associated with 0.43 days shorter gap times (95% C.I.: -0.14 to 1.00);

1. Each 5-unit eGFR difference was associated with 12.31 days longer gap times (95% C.I.: 11.94 to 12.68).

Fitting an Andersen-Gill model for the observation process yielded similar associations.

---
class: inverse, middle, center

# We Can Extend the Joint Model

---

# Conclusions

---

# Take-Home Messages

* In the settings of electronic health records, the observation process could be informative;

* Failing to account for that in the analysis can yield biased results;

* There is a variety of methods that can be utilised, but they are severely underutilised (as highlighted by Farzanfar _et al_.);

* The joint modelling approach seems to perform best in the settings of these simulations, and can accommodate several extensions (more on this later);

* Interestingly, overmodelling the observation process did not seem to induce any spurious association.

---

# Extensions

---

# What About Drop-Out?

Drop-out can be easily incorporated in the joint modelling framework as well:

$$
`\begin{align*}
y &= X_y \beta_y + Z_y b_y + \epsilon_y \tag{1} \\
h(t) &= h_0(t) \exp(X_h \gamma + W_h(t | \beta_y, b_y) \eta_h) \tag{2} \\
r(t) &= r_0(t) \exp(X_r \alpha + W_r(t | \beta_y, b_y) \eta_r) \tag{3}
\end{align*}`
$$

.Large[
> This model can be fitted using readily available statistical software in Stata and R.
]

---

# Future Work

* The IIVW approach performed poorly in our simulations, somewhat surprisingly. This should be studied further (e.g. via ad-hoc simulations);

* As illustrated before, drop-out can be incorporated and modelled as well. The performance of the tri-variate joint model needs to be assessed and compared with competing approaches;

* More general formulations of the time-to-event sub-models;

* Incorporating multivariate longitudinal outcomes;

* Incorporating multiple (distinct) observation and drop-out processes.