class: bottom, left, inverse, title-slide

# An Introduction to Monte Carlo Simulation Studies for Statisticians

### Alessandro Gasparini
### alessandro.gasparini@ki.se
### 2021-05-06

---
class: outline

<script type="text/x-mathjax-config">
MathJax.Hub.Config({
  TeX: { Macros: { '\\_': '_' } }
});
</script>

# Outline

## 1. Introduction
## 2. Planning a simulation study
## 3. Coding a simulation study
## 4. Analysing a simulation study
## 5. Reporting and visualising results
## 6. Wrap-up

---
class: center, middle, inverse

# 1. Introduction

---

# > `whoami`

* Currently a post-doctoral researcher in biostatistics at the Department of Medical Epidemiology and Biostatistics, Karolinska Institutet, Stockholm, Sweden. Now mostly working on developing novel statistical models for breast cancer growth and spread;

--

* Completed my PhD in January 2020 at the University of Leicester, UK. My thesis focussed on multilevel models for healthcare utilisation data;

--

* Graduated from the MSc at Milano-Bicocca in 2015, after spending a few months at KI to work on my thesis (supervised by Prof. Bellocco);

--

* (Some) other research interests: survival analysis (simple and complex), multilevel and longitudinal models, simulation methods, statistical issues with healthcare utilisation data, tools development in R, kidney disease epidemiology, ...

---
class: center, middle

## What are Monte Carlo simulation studies?

---
class: center, middle

# Monte Carlo simulation studies are computer experiments that involve generating data by pseudo-random sampling.

---
class: center, middle

.large[
From now on, I will be using the terms _Monte Carlo simulation study_ and _simulation study_ interchangeably.
]

---

# The Process™

In a simulation study, we design an experiment in which we:

--

1. Generate data from a known distribution/model (so that we know the _truth_);

--

2. Analyse the data;

--

3. Compare the analysis results with the truth;

--

4. Then (generally) repeat steps 1-3 _B_ times.

--

> After running the experiment _B_ times, we collect and analyse the results to draw our conclusions (we will see a minimal code sketch of this loop shortly).

---

# Simulation studies are useful...

Monte Carlo simulation studies provide an invaluable tool for statistical and biostatistical research — and are great for learning/understanding.

--

They can help to answer questions such as:

* Is an estimator biased in a finite sample?
* Do confidence intervals for a given parameter achieve the desired nominal level of coverage?
* How does a newly developed method compare to an established one?
* What is the power to detect a desired effect size under complex experimental settings and analysis methods?
* ...you name it.

---

# ...and increasingly popular!

<img src="20210506-simulation-studies_files/figure-html/wos-1.png" width="90%" style="display: block; margin: auto;" />

.footnote[.tiny[
Web of Science search key: `TS = ("simulation study") AND WC = ("Statistics & Probability")`
]]

---
class: center, middle

# Throughout this seminar we will see how to plan, code, analyse, and report/visualise simulation studies.

---

# Tutorial paper

<img src="tutorial-paper.png" width="90%" style="display: block; margin: auto;" />

---
class: center, middle, inverse

# 2. Planning a simulation study

---

# There are issues...

1. Monte Carlo standard errors of performance measures are often not computed/reported;

--

2. Reproducibility of results;

--

3. Dissemination of results.

--

> As statisticians, we should practise what we preach — and could do better — when it comes to the design, reporting, and reproducibility of our own _experiments_.

--

> The problem is: _typical_ statistical training does not teach _enough_ about practical aspects of simulation studies.
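---

# A bare-bones simulation loop

Before introducing a framework for planning, here is what The Process™ from the introduction can look like as a minimal (and entirely hypothetical) R sketch: generate data under a known truth, analyse, compare with the truth, and repeat _B_ times. The distribution, sample size, and number of replications below are arbitrary choices for illustration only, not part of the example developed later.

```r
set.seed(875093)                  # make the whole experiment reproducible
B <- 1000                         # number of replications
theta <- 5                        # the truth: mean of an Exponential(rate = 1/5)
res <- data.frame(theta_hat = rep(NA, B), se_hat = rep(NA, B))

for (i in seq_len(B)) {
  x <- rexp(n = 50, rate = 1 / theta)       # 1. generate data (we know the truth)
  res$theta_hat[i] <- mean(x)               # 2. analyse the data...
  res$se_hat[i] <- sd(x) / sqrt(length(x))  #    ...storing estimate and standard error
}

mean(res$theta_hat) - theta                 # 3.-4. compare with the truth: bias
```

Even a toy experiment like this involves many decisions (the truth, the sample size, the estimator, the number of replications, ...): a structured framework helps to make them explicit.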
---

# Enter the ADEMP framework

* .cyclamen[A] is for _Aims_
* .cyclamen[D] is for _Data-generating mechanisms_
* .cyclamen[E] is for _Estimands_
* .cyclamen[M] is for _Methods_
* .cyclamen[P] is for _Performance measures_

--

> Carefully thinking about and planning a simulation study makes your life easier (and your work more reproducible) in the long run.

---

# ADEMP: Aims

Before starting a simulation study, we need to _know what we want to learn_. For instance:

* Whether/when an estimation procedure converges?
* Consistency? (large-sample properties)
* Small-sample issues?
* Correct variance formula? (e.g. when derived from asymptotic theory)
* Comparison with other approaches?
* Robustness to model misspecification?

---

# ADEMP: Data-generating mechanisms

The data-generating mechanisms (DGMs) are the processes that we assume generate the data under study. For instance, the process underlying a clinical trial (eligibility, randomisation, etc.) or an observational study.

Defining each DGM is aim-specific, but we often want to vary several factors at once: sample size, true value of the estimand of interest, additional DGM-related parameters, ...

We can vary them factorially, or use e.g. fractional designs for greater efficiency: we do need to be pragmatic, after all.

---

# ADEMP: Estimands

The _estimand_ is what we are estimating, the target of our inference. It is typically a parameter (or a function of parameters), but it might be a summary measure (e.g. predictive performance) or a null hypothesis.

For instance, the estimand of interest in a simulation study mimicking a clinical trial might be the treatment effect.

This depends once again on the aims of the simulation study, hence the importance of defining those from the start.

---

# ADEMP: Methods

Methods are often what we are studying, e.g. a newly-developed method or a method that we aim to apply in unusual settings.

A _comparator_ is more often than not included, e.g. a gold standard. However, it is not always necessary to include one.

It is also recommended to include familiar methods as a check.

---

# ADEMP: Performance measures

Let's define the chosen estimand as `\(\theta\)`, which is estimated by a method as `\(\hat{\theta}\)` with standard error `\(\widehat{\text{se}}(\hat{\theta})\)`.

--

Performance measures of interest might target:

* Properties of the estimator, such as _bias_, _empirical standard error_, _mean squared error_
* Properties of the standard error of the estimator, such as _average model-based standard error_
* Properties of a confidence interval for `\(\hat{\theta}\)`, such as _coverage probability_ and _power_
* Convergence
* Computational speed

---

# In brief...

_Bias_ estimates how close the average estimate `\(\hat{\theta}\)` is to the true value `\(\theta\)`: `\(E(\hat{\theta}) - \theta\)`. For instance, the maximum likelihood estimator is asymptotically unbiased.

--

The _empirical standard error_ describes the precision of an estimator, and is defined as `\(\sqrt{\text{var}(\hat{\theta})}\)`. We are generally happier with _precise_ estimators, i.e. those with low variance.

--

The _mean squared error_ (MSE) combines bias and empirical standard error into an _overall_ performance measure. It is defined as `\((\text{Bias})^2 + (\text{empirical standard error})^2\)`.

--

_Coverage probability_ of a confidence interval is defined as the proportion of confidence intervals that contain the true value: `\(P(\hat{\theta}_{\text{low}} \le \theta \le \hat{\theta}_{\text{upp}})\)`.
For a 95% confidence interval, this probability should be 95%; poor coverage might arise because of (1) a biased estimator, (2) biased model-based standard errors, or (3) poor distributional properties.

---
class: center, middle

## All of these performance measures are described in more detail in the tutorial paper.<br/>
## Make sure to check that out and read it carefully before planning your first simulation study.

---
class: center, middle, inverse

# Example

---

# Confounding

.pull-left[
We have a study which can be summarised by the following DAG:

<img src="20210506-simulation-studies_files/figure-html/unnamed-chunk-1-1.png" width="100%" style="display: block; margin: auto;" />

Here `\(X\)` is the exposure, `\(Y\)` the outcome, and `\(Z\)` a confounder.
]

--

.pull-right[
> Note that there is no edge between `\(X\)` and `\(Y\)`: the exposure has no direct effect on the outcome.
]

---

# Aim

The aim of this simulation study is pedagogical. We know that there should be no association between `\(X\)` and `\(Y\)` once we adjust for `\(Z\)` in our analysis.

--

> However:

1. Is it actually true that once we adjust for `\(Z\)` we observe no association between `\(X\)` and `\(Y\)`?
1. What happens if we fail to adjust for `\(Z\)` in our analysis?

---

# Data-generating mechanisms

* `\(X\)` is a binary treatment variable, `\(Z\)` is a continuous confounder, and `\(Y\)` is a continuous outcome;
* `\(Z\)` is simulated from a standard normal distribution;
* `\(X\)` depends on `\(Z\)` according to a logistic model:
$$\log \frac{P(X = 1)}{P(X = 0)} = \gamma_0 + \gamma_1 Z$$
* `\(Y\)` depends on `\(Z\)` according to a linear model:
$$Y = \alpha_0 + \alpha_1 Z + \varepsilon$$
with `\(\varepsilon\)` following a standard normal distribution.

---

# Data-generating mechanisms

We fix the parameters `\(\gamma_0 = 1\)`, `\(\gamma_1 = 3\)`, `\(\alpha_0 = 10\)`.

--

For `\(\alpha_1\)`, we simulate two scenarios:

1. `\(\alpha_1 = 5\)`: in this scenario (.cyclamen[DGM = 1]), `\(Z\)` is actually a confounder;

--

1. `\(\alpha_1 = 0\)`: in the second scenario (.cyclamen[DGM = 2]), we remove the edge between `\(Z\)` and `\(Y\)` as well. Here `\(Z\)` is not a confounder anymore.

--

Finally, we generate `\(N = 200\)` independent subjects for each simulated dataset.

---

# Estimands

We have a single estimand of interest here: the regression coefficient `\(\beta_1\)` quantifying the association between `\(X\)` and `\(Y\)` (how we estimate it is defined on the next slide).

Under certain structural assumptions (see e.g. [John et al. (2019), DOI: 10.1186/s12874-019-0858-x](https://doi.org/10.1186/s12874-019-0858-x)), the regression coefficient `\(\beta_1\)` can be interpreted as the average causal effect (ACE) of the exposure `\(X\)` on the outcome `\(Y\)`.

--

> Remember that, according to our data-generating mechanisms, we expect _no_ effect of `\(X\)` on `\(Y\)`.

---

# Methods

We estimate `\(\beta_1\)` using linear regression. Specifically, we fit the following two models:

1. A model that fails to adjust for the observed confounder `\(Z\)` (.cyclamen[model 1]):
$$Y = \beta_0 + \beta_1 X + \varepsilon$$
1. A model that properly adjusts for `\(Z\)` (.cyclamen[model 2]):
$$Y = \beta_0 + \beta_1 X + \beta_2 Z + \varepsilon$$

---

# Performance measures

The key performance measure of interest is .cyclamen[bias], defined as
$$E(\hat{\beta_1}) - \beta_1,$$
which here reduces to `\(E(\hat{\beta_1})\)` as we know that, according to our DGMs, the true `\(\beta_1 = 0\)`.
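--

Once all replications have been run, bias and its Monte Carlo standard error could be computed along these lines (a minimal sketch, assuming a hypothetical vector `b1_hat` holding the estimates of `\(\beta_1\)` across replications):

```r
bias <- mean(b1_hat) - 0                    # true beta_1 is zero under both DGMs
mcse <- sd(b1_hat) / sqrt(length(b1_hat))   # Monte Carlo standard error of the bias
```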
--

We might also be interested in estimating the .cyclamen[power] of a significance test for `\(\beta_1\)` at the `\(\alpha = 0.05\)` level. In our example, think of that as a test of the null hypothesis `\(\beta_1 = 0\)`: since the true `\(\beta_1\)` is zero, this rejection rate is really the empirical type I error rate of each model.

---

# Number of replications

The rationale is that _we need enough replications to be able to estimate the key performance measures with good precision_.

--

For instance, our example is simple, so we don't need to worry too much about computational time. We can then run a large number of replications, e.g. `\(B = 10000\)` or more, depending on how long we are willing to wait for the results to be ready.

---

# It's complicated...

In general, choosing the number of replications `\(B = n_{\text{sim}}\)` is not trivial. Here is an example from one of our papers (currently under review):

<img src="nsim.png" width="90%" style="display: block; margin: auto;" />

---
class: center, middle, inverse

# 3. Coding a simulation study

---

# Coding a simulation study

The most important thing to remember is that _there is no right way_ of coding a simulation study. However, your code _should_ be:

* Reproducible;
* Immune to the [bus factor](https://en.wikipedia.org/wiki/Bus_factor);
* Openly available (as much as possible).

---
class: center, middle

# It's time to switch to a live demo.

### We are going to use R for now, but code is available in Stata as well. All material is available online: <br/> https://github.com/ellessenne/simulation-study-example

---
class: center, middle, inverse

# 4. Analysing a simulation study

---

# Analysing a simulation study

The analysis of a simulation study is no different from that of a clinical study.

--

You should start with descriptive statistics:

1. Inspect your data;
1. Compute summary statistics (mean, median, etc.);
1. Create some descriptive plots, e.g. to check for outliers.

--

Then, you can move on to estimating the performance measures of interest: _bias_, _coverage_, etc.

--

Now, if you feel fancy, you can go into regression modelling — but that's a whole big topic for another day. It is also a much less established practice.

---
class: center, middle

# Let's go back to the example.

---
class: center, middle, inverse

# 5. Reporting and visualising results

---

# Reporting results

There are potentially 4 dimensions to report (DGMs × estimands × methods × performance measures), so complexity can grow quickly. You want to think about:

* What is the comparison of interest?
* Across DGMs or across methods?
* Can my reporting strategy _effectively_ tell the story of the study?

--

As with coding before, there is no _right_ way to do this. Generally, I recommend using plots rather than tables, as tables can easily become daunting.

---

# Example 1

<img src="tab1.png" width="90%" style="display: block; margin: auto;" />

---

# Example 2

See the supplementary material for the paper titled _Impact of model misspecification in shared frailty survival models_ by Gasparini _et al._, available online at:

* https://github.com/ellessenne/frailtymcsim/blob/master/suppl.pdf

.footnote[
*Warning:* it's not pretty!
]

---
class: center, middle

# Plots are (more often than not) a better option, and {rsimsum} can help with that.

# Let's go back to the example.

---

# Interactivity?

With large simulation studies, we really want/need interactive graphics that allow the user to control what is displayed (or not).

--

Furthermore, more information (e.g. estimated values) could appear when hovering over plots.

--

Frameworks that let you do that are (among others) d3.js, Shiny, Plotly, etc.
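--

For instance, a static {ggplot2} summary of the results could be made hover-able with {plotly} via `ggplotly()`. A minimal sketch, assuming a hypothetical data frame `summ` with estimated bias and its Monte Carlo standard error for each DGM and method:

```r
library(ggplot2)
library(plotly)

p <- ggplot(summ, aes(x = method, y = bias)) +
  geom_pointrange(aes(ymin = bias - 1.96 * mcse, ymax = bias + 1.96 * mcse)) +
  facet_wrap(~ dgm)

ggplotly(p)  # hovering over each point now shows the underlying values
```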
--

We actually did develop an interactive Shiny app that does some of that:

* Repository: https://github.com/ellessenne/interest
* Paper: https://arxiv.org/abs/1909.03813

---
class: center, middle, inverse

# 6. Wrap-up

---

# Replicability

Compared to e.g. clinical trials or observational studies, simulation studies offer one of the most straightforward opportunities for replication.

--

Therefore:

1. Always write up your study assuming that other people will try to replicate it. Many actually do: see e.g. the [Simulation Replication Challenge](https://www.replisims.org/). The _ADEMP_ framework can help with this.

--

1. Another suggestion is to release the _full code_, e.g. on GitHub or other repositories. This allows direct replicability, and forces you to check (and organise, and document) your code thoroughly. Some journals even require you to share the analysis code.

---

# Key points

* Plan your simulation study on paper before starting to code it;
* Structure and describe your study assuming that other people will try to reproduce it;
* Build your code slowly: start small and only then generalise/expand it;
* Report Monte Carlo standard errors (which are displayed automatically by {rsimsum} in R and `simsum` in Stata; a minimal {rsimsum} sketch is included in the appendix slide at the end);
* Make sure you're _selling the story_ of your study well, with engaging/effective visualisations and reporting.

---

# Material

All material you have seen so far is openly available online.

* Slides: https://tinyurl.com/simulation-studies-05-06
* Code: https://github.com/ellessenne/simulation-study-example

Let me know if you spot typos/errors or have any questions. I am always happy to chat about simulation studies.

--

.LARGE[
> That's all from me for today, thank you for listening!
]
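---

# Appendix: a minimal {rsimsum} sketch

For reference, the analysis of the confounding example could look roughly like this with {rsimsum}. This is a sketch, not the code from the companion repository: it assumes a data frame `res` with one row per replication and columns `b1` (the estimate of `\(\beta_1\)`), `se1` (its standard error), `model`, and `dgm`.

```r
library(rsimsum)
library(ggplot2)  # for autoplot()

s <- simsum(
  data = res,            # one row per replication and method
  estvarname = "b1",     # point estimates of beta_1
  se = "se1",            # their standard errors
  true = 0,              # true value of the estimand
  methodvar = "model",   # model 1 vs model 2
  by = "dgm"             # the two data-generating mechanisms
)

summary(s)                                   # performance measures with Monte Carlo SEs
autoplot(s, type = "lolly", stats = "bias")  # one of several built-in plot types
```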