class: bottom, left, inverse, title-slide

# An Introduction to Monte Carlo Simulation Studies for Statisticians

### Alessandro Gasparini
### alessandro.gasparini@ki.se
### 2021-05-06

---
class: outline

<script type="text/x-mathjax-config">
MathJax.Hub.Config({
  TeX: { Macros: { '\\_': '_' } }
});
</script>

# Outline

## 1. Introduction
## 2. Planning a simulation study
## 3. Coding a simulation study
## 4. Analysing a simulation study
## 5. Reporting and visualising results
## 6. Wrap-up

---
class: center, middle, inverse

# 1. Introduction

---

# > `whoami`

* Currently a post-doctoral researcher in biostatistics at the Department of Medical Epidemiology and Biostatistics, Karolinska Institutet, Stockholm, Sweden. Now mostly working on developing novel statistical models for breast cancer growth and spread;

--

* Completed my PhD in January 2020 at the University of Leicester, UK. My thesis focussed on multilevel models for healthcare utilisation data;

--

* Graduated from the MSc at Milano-Bicocca in 2015, after spending a few months at KI to work on my thesis (supervised by Prof. Bellocco);

--

* (Some) other research interests: survival analysis (simple and complex), multilevel and longitudinal models, simulation methods, statistical issues with healthcare utilisation data, tools development in R, kidney disease epidemiology, ...

---
class: center, middle

## What are Monte Carlo simulation studies?

---
class: center, middle

# Monte Carlo simulation studies are computer experiments that involve generating data by pseudo-random sampling.

---
class: center, middle

.large[
From now on, I will be using the terms _Monte Carlo simulation study_ and _simulation study_ interchangeably.
]

---

# The Process™

In a simulation study, we design an experiment in which we:

--

1. Generate data from a known distribution/model (so that we know the _truth_);

--

2. Analyse the data;

--

3. Compare the analysis results with the truth;

--

4. Then (generally) repeat steps 1-3 _B_ times.

--

> After running the experiment _B_ times, we collect and analyse the results to draw our conclusions (we will see a minimal code sketch of this loop shortly).

---

# Simulation studies are useful...

Monte Carlo simulation studies provide an invaluable tool for statistical and biostatistical research — and are great for learning/understanding.

--

They can help to answer questions such as:

* Is an estimator biased in a finite sample?
* Do confidence intervals for a given parameter achieve the desired nominal level of coverage?
* How does a newly developed method compare to an established one?
* What is the power to detect a desired effect size under complex experimental settings and analysis methods?
* ...you name it.

---

# ...and increasingly popular!

<img src="20210506-simulation-studies_files/figure-html/wos-1.png" width="90%" style="display: block; margin: auto;" />

.footnote[.tiny[
Web of Science search key: `TS = ("simulation study") AND WC = ("Statistics & Probability")`
]]

---
class: center, middle

# Throughout this seminar we will see how to plan, code, analyse, and report/visualise simulation studies.

---

# Tutorial paper

<img src="tutorial-paper.png" width="90%" style="display: block; margin: auto;" />

---
class: center, middle, inverse

# 2. Planning a simulation study

---

# There are issues...

1. Monte Carlo standard errors of performance measures are often not computed/reported;

--

2. Reproducibility of results;

--

3. Dissemination of results.

--

> As statisticians, we should practise what we preach — and could do better — when it comes to the design, reporting, and reproducibility of our own _experiments_.

--

> The problem is: _typical_ statistical training does not teach _enough_ about practical aspects of simulation studies.
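---

# A bare-bones simulation loop

Before introducing a framework for planning, here is what The Process™ from the introduction can look like as a minimal (and entirely hypothetical) R sketch: generate data under a known truth, analyse, compare with the truth, and repeat _B_ times. The distribution, sample size, and number of replications below are arbitrary choices for illustration only, not part of the example developed later.

```r
set.seed(875093)                  # make the whole experiment reproducible
B <- 1000                         # number of replications
theta <- 5                        # the truth: mean of an Exponential(rate = 1/5)
res <- data.frame(theta_hat = rep(NA, B), se_hat = rep(NA, B))

for (i in seq_len(B)) {
  x <- rexp(n = 50, rate = 1 / theta)       # 1. generate data (we know the truth)
  res$theta_hat[i] <- mean(x)               # 2. analyse the data...
  res$se_hat[i] <- sd(x) / sqrt(length(x))  #    ...storing estimate and standard error
}

mean(res$theta_hat) - theta                 # 3.-4. compare with the truth: bias
```

Even a toy experiment like this involves many decisions (the truth, the sample size, the estimator, the number of replications, ...): a structured framework helps to make them explicit.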
---

# Enter the ADEMP framework

* .cyclamen[A] is for _Aims_
* .cyclamen[D] is for _Data-generating mechanisms_
* .cyclamen[E] is for _Estimands_
* .cyclamen[M] is for _Methods_
* .cyclamen[P] is for _Performance measures_

--

> Carefully thinking about and planning a simulation study makes your life easier (and your work more reproducible) in the long run.

---

# ADEMP: Aims

Before starting a simulation study, we need to _know what we want to learn_. For instance:

* Whether/when an estimation procedure converges?
* Consistency? (large-sample properties)
* Small-sample issues?
* Correct variance formula? (e.g. when derived from asymptotic theory)
* Comparison with other approaches?
* Robustness to model misspecification?

---

# ADEMP: Data-generating mechanisms

The data-generating mechanisms (DGMs) are the processes that we assume generate the data under study. For instance, the process underlying a clinical trial (eligibility, randomisation, etc.) or an observational study.

Defining each DGM is aim-specific, but we often want to vary several factors at once: sample size, true value of the estimand of interest, additional DGM-related parameters, ...

We can vary them factorially, or use e.g. fractional designs for greater efficiency: we do need to be pragmatic, after all.

---

# ADEMP: Estimands

The _estimand_ is what we are estimating, the target of our inference. It is typically a parameter (or a function of parameters), but it might be a summary measure (e.g. predictive performance) or a null hypothesis.

For instance, the estimand of interest in a simulation study mimicking a clinical trial might be the treatment effect.

This depends once again on the aims of the simulation study, hence the importance of defining those from the start.

---

# ADEMP: Methods

Methods are often what we are studying, e.g. a newly-developed method or a method that we aim to apply in unusual settings.

A _comparator_ is more often than not included, e.g. a gold standard. However, it is not always necessary to include one.

It is also recommended to include familiar methods as a check.

---

# ADEMP: Performance measures

Let's define the chosen estimand as `\(\theta\)`, which is estimated by a method as `\(\hat{\theta}\)` with standard error `\(\widehat{\text{se}}(\hat{\theta})\)`.

--

Performance measures of interest might target:

* Properties of the estimator, such as _bias_, _empirical standard error_, _mean squared error_
* Properties of the standard error of the estimator, such as _average model-based standard error_
* Properties of a confidence interval for `\(\hat{\theta}\)`, such as _coverage probability_ and _power_
* Convergence
* Computational speed

---

# In brief...

_Bias_ estimates how close the average estimate `\(\hat{\theta}\)` is to the true value `\(\theta\)`: `\(E(\hat{\theta}) - \theta\)`. For instance, the maximum likelihood estimator is asymptotically unbiased.

--

The _empirical standard error_ describes the precision of an estimator, and is defined as `\(\sqrt{\text{var}(\hat{\theta})}\)`. We are generally happier with _precise_ estimators, i.e. those with low variance.

--

The _mean squared error_ (MSE) combines bias and empirical standard error into an _overall_ performance measure. It is defined as `\((\text{Bias})^2 + (\text{empirical standard error})^2\)`.

--

_Coverage probability_ of a confidence interval is defined as the proportion of confidence intervals that contain the true value: `\(P(\hat{\theta}_{\text{low}} \le \theta \le \hat{\theta}_{\text{upp}})\)`.
For a 95% confidence interval, this probability should be 95%; poor coverage might arise because of (1) a biased estimator, (2) biased model-based standard errors, or (3) poor distributional properties.

---
class: center, middle

## All of these performance measures are described in more detail in the tutorial paper.<br/>
## Make sure to check that out and read it carefully before planning your first simulation study.

---
class: center, middle, inverse

# Example

---

# Confounding

.pull-left[
We have a study which can be summarised by the following DAG:

<img src="20210506-simulation-studies_files/figure-html/unnamed-chunk-1-1.png" width="100%" style="display: block; margin: auto;" />

Here `\(X\)` is the exposure, `\(Y\)` the outcome, and `\(Z\)` a confounder.
]

--

.pull-right[
> Note that there is no edge between `\(X\)` and `\(Y\)`: the exposure has no direct effect on the outcome.
]

---

# Aim

The aim of this simulation study is pedagogical. We know that there should be no association between `\(X\)` and `\(Y\)` once we adjust for `\(Z\)` in our analysis.

--

> However:

1. Is it actually true that once we adjust for `\(Z\)` we observe no association between `\(X\)` and `\(Y\)`?
1. What happens if we fail to adjust for `\(Z\)` in our analysis?

---

# Data-generating mechanisms

* `\(X\)` is a binary treatment variable, `\(Z\)` is a continuous confounder, and `\(Y\)` is a continuous outcome;
* `\(Z\)` is simulated from a standard normal distribution;
* `\(X\)` depends on `\(Z\)` according to a logistic model:
$$\log \frac{P(X = 1)}{P(X = 0)} = \gamma_0 + \gamma_1 Z$$
* `\(Y\)` depends on `\(Z\)` according to a linear model:
$$Y = \alpha_0 + \alpha_1 Z + \varepsilon$$
with `\(\varepsilon\)` following a standard normal distribution.

---

# Data-generating mechanisms

We fix the parameters `\(\gamma_0 = 1\)`, `\(\gamma_1 = 3\)`, `\(\alpha_0 = 10\)`.

--

For `\(\alpha_1\)`, we simulate two scenarios:

1. `\(\alpha_1 = 5\)`: in this scenario (.cyclamen[DGM = 1]), `\(Z\)` is actually a confounder;

--

1. `\(\alpha_1 = 0\)`: in the second scenario (.cyclamen[DGM = 2]), we remove the edge between `\(Z\)` and `\(Y\)` as well. Here `\(Z\)` is not a confounder anymore.

--

Finally, we generate `\(N = 200\)` independent subjects for each simulated dataset.

---

# Estimands

We have a single estimand of interest here: the regression coefficient `\(\beta_1\)` quantifying the association between `\(X\)` and `\(Y\)` (how we estimate it is defined on the next slide).

Under certain structural assumptions (see e.g. [John et al. (2019), DOI: 10.1186/s12874-019-0858-x](https://doi.org/10.1186/s12874-019-0858-x)), the regression coefficient `\(\beta_1\)` can be interpreted as the average causal effect (ACE) of the exposure `\(X\)` on the outcome `\(Y\)`.

--

> Remember that, according to our data-generating mechanisms, we expect _no_ effect of `\(X\)` on `\(Y\)`.

---

# Methods

We estimate `\(\beta_1\)` using linear regression. Specifically, we fit the following two models:

1. A model that fails to adjust for the observed confounder `\(Z\)` (.cyclamen[model 1]):
$$Y = \beta_0 + \beta_1 X + \varepsilon$$
1. A model that properly adjusts for `\(Z\)` (.cyclamen[model 2]):
$$Y = \beta_0 + \beta_1 X + \beta_2 Z + \varepsilon$$

---

# Performance measures

The key performance measure of interest is .cyclamen[bias], defined as
$$E(\hat{\beta_1}) - \beta_1,$$
which here reduces to `\(E(\hat{\beta_1})\)` as we know that, according to our DGMs, the true `\(\beta_1 = 0\)`.
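--

Once all replications have been run, bias and its Monte Carlo standard error could be computed along these lines (a minimal sketch, assuming a hypothetical vector `b1_hat` holding the estimates of `\(\beta_1\)` across replications):

```r
bias <- mean(b1_hat) - 0                    # true beta_1 is zero under both DGMs
mcse <- sd(b1_hat) / sqrt(length(b1_hat))   # Monte Carlo standard error of the bias
```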
--

We might also be interested in estimating the .cyclamen[power] of a significance test for `\(\beta_1\)` at the `\(\alpha = 0.05\)` level. In our example, think of that as a test of the null hypothesis `\(\beta_1 = 0\)`: since the true `\(\beta_1\)` is zero, this rejection rate is really the empirical type I error rate of each model.

---

# Number of replications

The rationale is that _we need enough replications to be able to estimate the key performance measures with good precision_.

--

For instance, our example is simple, so we don't need to worry too much about computational time. We can then run a large number of replications, e.g. `\(B = 10000\)` or more, depending on how long we are willing to wait for the results to be ready.

---

# It's complicated...

In general, choosing the number of replications `\(B = n_{\text{sim}}\)` is not trivial. Here is an example from one of our papers (currently under review):

<img src="nsim.png" width="90%" style="display: block; margin: auto;" />

---
class: center, middle, inverse

# 3. Coding a simulation study

---

# Coding a simulation study

The most important thing to remember is that _there is no right way_ of coding a simulation study. However, your code _should_ be:

* Reproducible;
* Immune to the [bus factor](https://en.wikipedia.org/wiki/Bus_factor);
* Openly available (as much as possible).

---
class: center, middle

# It's time to switch to a live demo.

### We are going to use R for now, but code is available in Stata as well. All material is available online: <br/> https://github.com/ellessenne/simulation-study-example

---
class: center, middle, inverse

# 4. Analysing a simulation study

---

# Analysing a simulation study

The analysis of a simulation study is no different from that of a clinical study.

--

You should start with descriptive statistics:

1. Inspect your data;
1. Compute summary statistics (mean, median, etc.);
1. Create some descriptive plots, e.g. to check for outliers.

--

Then, you can move on to estimating the performance measures of interest: _bias_, _coverage_, etc.

--

Now, if you feel fancy, you can go into regression modelling — but that's a whole big topic for another day. It is also a much less established practice.

---
class: center, middle

# Let's go back to the example.

---
class: center, middle, inverse

# 5. Reporting and visualising results

---

# Reporting results

There are potentially 4 dimensions to report (DGMs × estimands × methods × performance measures), so complexity can grow quickly. You want to think about:

* What is the comparison of interest?
* Across DGMs or across methods?
* Can my reporting strategy _effectively_ tell the story of the study?

--

As with coding before, there is no _right_ way to do this. Generally, I recommend using plots rather than tables, as tables can easily become daunting.

---

# Example 1

<img src="tab1.png" width="90%" style="display: block; margin: auto;" />

---

# Example 2

See the supplementary material for the paper titled _Impact of model misspecification in shared frailty survival models_ by Gasparini _et al._, available online at:

* https://github.com/ellessenne/frailtymcsim/blob/master/suppl.pdf

.footnote[
*Warning:* it's not pretty!
]

---
class: center, middle

# Plots are (more often than not) a better option, and {rsimsum} can help with that.

# Let's go back to the example.

---

# Interactivity?

With large simulation studies, we really want/need interactive graphics that allow the user to control what is displayed (or not).

--

Furthermore, more information (e.g. estimated values) could appear when hovering over plots.

--

Frameworks that let you do that are (among others) d3.js, Shiny, Plotly, etc.
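--

For instance, a static {ggplot2} summary of the results could be made hover-able with {plotly} via `ggplotly()`. A minimal sketch, assuming a hypothetical data frame `summ` with estimated bias and its Monte Carlo standard error for each DGM and method:

```r
library(ggplot2)
library(plotly)

p <- ggplot(summ, aes(x = method, y = bias)) +
  geom_pointrange(aes(ymin = bias - 1.96 * mcse, ymax = bias + 1.96 * mcse)) +
  facet_wrap(~ dgm)

ggplotly(p)  # hovering over each point now shows the underlying values
```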
--

We actually did develop an interactive Shiny app that does some of that:

* Repository: https://github.com/ellessenne/interest
* Paper: https://arxiv.org/abs/1909.03813

---
class: center, middle, inverse

# 6. Wrap-up

---

# Replicability

Compared to e.g. clinical trials or observational studies, simulation studies offer one of the most straightforward opportunities for replication.

--

Therefore:

1. Always write up your study assuming that other people will try to replicate it. Many actually do: see e.g. the [Simulation Replication Challenge](https://www.replisims.org/). The _ADEMP_ framework can help with this.

--

1. Another suggestion is to release the _full code_, e.g. on GitHub or other repositories. This allows direct replicability, and forces you to check (and organise, and document) your code thoroughly. Some journals even require you to share the analysis code.

---

# Key points

* Plan your simulation study on paper before starting to code it;
* Structure and describe your study assuming that other people will try to reproduce it;
* Build your code slowly: start small and only then generalise/expand it;
* Report Monte Carlo standard errors (which are displayed automatically by {rsimsum} in R and `simsum` in Stata; a minimal {rsimsum} sketch is included in the appendix slide at the end);
* Make sure you're _selling the story_ of your study well, with engaging/effective visualisations and reporting.

---

# Material

All material you have seen so far is openly available online.

* Slides: https://tinyurl.com/simulation-studies-05-06
* Code: https://github.com/ellessenne/simulation-study-example

Let me know if you spot typos/errors or have any questions. I am always happy to chat about simulation studies.

--

.LARGE[
> That's all from me for today, thank you for listening!
]
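---

# Appendix: a minimal {rsimsum} sketch

For reference, the analysis of the confounding example could look roughly like this with {rsimsum}. This is a sketch, not the code from the companion repository: it assumes a data frame `res` with one row per replication and columns `b1` (the estimate of `\(\beta_1\)`), `se1` (its standard error), `model`, and `dgm`.

```r
library(rsimsum)
library(ggplot2)  # for autoplot()

s <- simsum(
  data = res,            # one row per replication and method
  estvarname = "b1",     # point estimates of beta_1
  se = "se1",            # their standard errors
  true = 0,              # true value of the estimand
  methodvar = "model",   # model 1 vs model 2
  by = "dgm"             # the two data-generating mechanisms
)

summary(s)                                   # performance measures with Monte Carlo SEs
autoplot(s, type = "lolly", stats = "bias")  # one of several built-in plot types
```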