A Statistical Methods Manuscript¶
About this case study¶
The purpose of this case study is to discuss the different components of research reproducibility implemented in designing and conducting a statistical study. With the help of their manuscript, the authors provide a catalog of methods used in their research and cross-reference them to the respective sections discussed in this Guide for Reproducible Research.
About the Manuscript¶
Title: A review of Bayesian perspectives on sample size derivation for confirmatory trials.[KGL+20]
Authors: Kevin Kunzmann, Michael J. Grayling, Kim May Lee, David S. Robertson, Kaspar Rufibach, James M. S. Wason
Publication month & year: June 2020
Overview¶
The manuscript[KGL+20] itself is concerned with the problem of deriving a suitable sample size for a clinical trial. This is a classical problem in statistics and particularly important in medical statistics where collecting trial data is extremely expensive and ethical considerations need to be addressed. The manuscript reviews and extends methods to systematically incorporate planning uncertainty into the sample size derivation.
Citation summary¶
The manuscript can be cited in plain text APA format:
Kunzmann, K., Grayling, M. J., Lee, K. M., Robertson, D. S., Rufibach, K., & Wason, J. (2020). A review of Bayesian perspectives on sample size derivation for confirmatory trials. arXiv preprint arXiv:2006.15715.
Bibtex format:
@article{kunzmann2020,
  title = {A review of Bayesian perspectives on sample size derivation for confirmatory trials},
  author = {Kunzmann, Kevin and Grayling, Michael J and Lee, Kim May and Robertson, David S and Rufibach, Kaspar and Wason, James},
  journal = {arXiv preprint arXiv:2006.15715},
  year = {2020}
}
Catalog of different methods for reproducible research¶
Version control¶
The git repository https://github.com/kkmann/sample-size-calculation-under-uncertainty contains all code required to produce the manuscript arXiv:2006.15715 from scratch. For an in-depth explanation of the importance of version control for reproducible research, see Version Control Systems.
Research data management¶
In this particular case, data management aspects are not an issue since the manuscript is exclusively based on hypothetical examples and no external, protected data is required.
Literate programming¶
The manuscript[KGL+20] itself is written in and built with LaTeX.
The source files are contained in the subfolder latex/.
Plain TeX files were preferred over literate programming solutions like knitr for R to facilitate the use of dedicated LaTeX editors like Overleaf.
This means, however, that all figures used in the manuscript need to be created separately.
A dedicated Jupyter notebook, notebooks/figures-for-manuscript.ipynb, which combines code with rudimentary descriptions, is provided to that end.
Reproducible software environment¶
Although this means that all code required to compile the manuscript from scratch is available in a self-contained environment, it is not yet sufficient to ensure reproducibility. Installing LaTeX, Jupyter, and R in exactly the versions needed to run all code can still be challenging for less experienced users. To prevent this from keeping interested readers from experimenting with the code, a combination of the Python package repo2docker and a free BinderHub hosting service is used. For details on these techniques, see the chapters on Binder and BinderHub. This allows interested individuals to start an interactive version of the repository with all required software preinstalled, in exactly the right versions!
Note that it is possible to provide version-stable Binder links.
The first badge points to the state of the repository at a specific point in time (via the git tagging feature).
This means that the links will remain valid and unchanged even if there are later corrections to the contents of the repository!
Binder supports multiple user interfaces.
This is leveraged to provide a JupyterLab Integrated Development Environment view of the repository, which can be used to explore files, run the Jupyter notebook, or open a shell for further commands.
The second badge directly opens an interactive Shiny app that illustrates some of the points discussed in the manuscript and requires no familiarity with programming at all.
All relevant configurations for Binder are located in the subfolder .binder.
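A version-stable link of the kind described above can be assembled from the repository name and a git tag. The following is a minimal sketch, assuming the public mybinder.org service; the helper name binder_url and the tag v1.0.0 are hypothetical and used only for illustration, so substitute a tag that actually exists in the repository.

```shell
#!/bin/sh
# Build a mybinder.org link for a GitHub repository pinned to a git ref.
# Arguments: <owner/repo> <git ref (tag, branch, or commit)> [urlpath]
# binder_url is a hypothetical helper, not part of the repository.
binder_url() {
    repo=$1
    ref=$2
    urlpath=${3:-}
    url="https://mybinder.org/v2/gh/${repo}/${ref}"
    # An optional urlpath selects the user interface, e.g. "lab" for JupyterLab.
    if [ -n "$urlpath" ]; then
        url="${url}?urlpath=${urlpath}"
    fi
    printf '%s\n' "$url"
}

# "v1.0.0" is a placeholder tag name (assumption).
binder_url kkmann/sample-size-calculation-under-uncertainty v1.0.0 lab
```

Because the ref is a tag rather than a branch name, the link keeps pointing at the same snapshot after later commits.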
Workflow management using Snakemake¶
Since JupyterLab also allows opening a shell in the repository instance started via a Binder link, another feature of the repository can be used to reproduce the entire manuscript from scratch.
The Python workflow manager Snakemake was used to define all required steps in a Snakefile.
To execute this workflow, open the online JupyterLab session via the Binder link.
Once the user interface has finished loading, open a new terminal and type
snakemake -F --cores 1 manuscript
This will execute all the required steps in turn:
create all plots by executing the Jupyter notebook file
compile the actual latex/main.pdf file from the LaTeX sources
You should then see a main.pdf file in the latex subfolder.
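For orientation, the two steps can be approximated by hand. The sketch below is an assumption about what such a workflow does, not a copy of the repository's Snakefile: the helper names run and build_manuscript are hypothetical, and the nbconvert and latexmk invocations (as well as the main.tex file name, inferred from latex/main.pdf) may differ from the rules the authors actually defined.

```shell
#!/bin/sh
# Run a command, or only print it when DRY_RUN=1 is set.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        printf '+ %s\n' "$*"
    else
        "$@"
    fi
}

# Hypothetical manual equivalent of the "manuscript" target.
build_manuscript() {
    # Step 1: execute the notebook that creates the figures (assumed invocation).
    run jupyter nbconvert --to notebook --execute --inplace \
        notebooks/figures-for-manuscript.ipynb &&
    # Step 2: compile the LaTeX sources; -cd makes latexmk work inside latex/.
    run latexmk -pdf -cd latex/main.tex
}
```

Setting DRY_RUN=1 before calling build_manuscript prints both commands without running them. Snakemake offers a comparable preview through its -n (dry-run) flag, and unlike this sketch it tracks which outputs are out of date.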
Support for local instantiation of the software environment¶
The Python package repo2docker can also be used locally to reproduce the same computing environment. To this end, you will need to have Python and Docker installed. For details on Docker and container technologies in general, please see the chapter on reproducible environments and containers. Then simply clone the repository on your local machine using the commands
git clone git@github.com:kkmann/sample-size-calculation-under-uncertainty.git
cd sample-size-calculation-under-uncertainty
After cloning the repository, you can build and run a Docker container locally from the configuration files provided in the .binder/ folder with the following command
jupyter-repo2docker -E .
The container is started automatically after the build completes and you can use the usual Jupyter interface in your browser by following the link printed by repo2docker to explore the repository locally.
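If the build fails immediately, it may help to confirm that the prerequisites are on the PATH. This is a small sketch with a hypothetical helper, check_tools; it only tests whether each command can be found and says nothing about whether the Docker daemon is running.

```shell
#!/bin/sh
# Report which of the given command-line tools cannot be found on PATH.
# Returns 0 if all are present, 1 otherwise. check_tools is a hypothetical helper.
check_tools() {
    missing=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            printf 'missing: %s\n' "$tool" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Example call for the tools used in this section:
#   check_tools git docker jupyter-repo2docker
# repo2docker is distributed on PyPI and can be installed with:
#   python3 -m pip install jupyter-repo2docker
```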
Use of continuous integration¶
Although not necessary for the reproducibility of this manuscript, the repository also makes use of continuous integration (CI) using GitHub Actions. GitHub Actions is similar in spirit to Continuous integration with Travis, but the runners are provided directly by GitHub.
The repository defines two workflows in the .github/workflows directory.
The first one, .github/workflows/build_and_run.yml, is activated whenever the master branch of the repository is updated and the specifications in .binder are changed.
It builds the container, pushes it to the public container registry Docker Hub, and then checks that the Snakemake workflow runs through without problems.
The second one, .github/workflows/run.yml, runs when the folder .binder was not changed and uses the pre-built Docker container to run the Snakemake workflow.
The latter saves a lot of computing time since the computational environment will change much less often than the contents of the repository.
The use of CI thus facilitates checking contributions by pull requests for technical integrity and makes the latest version of the required container available for direct download.
This means that instead of building the container locally using repo2docker, you can download it directly and execute the workflow using the following commands
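# Start the pre-built container in the background (the image is pulled from Docker Hub if needed)
docker run -d --name mycontainer kkmann/sample-size-calculation-under-uncertainty
# Run the Snakemake workflow inside the running container
docker exec mycontainer \
    snakemake -F --cores 1 manuscript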
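The split between the two CI workflows described in this section comes down to one decision: did anything under .binder change? In GitHub Actions this is normally expressed through path filters in the workflow files. The shell sketch below only illustrates the decision logic; select_workflow is a hypothetical helper and is not taken from the repository.

```shell
#!/bin/sh
# Read changed file paths from stdin (one per line) and print which
# workflow applies: "build_and_run" if anything under .binder/ changed,
# otherwise "run". Illustrative only; select_workflow is hypothetical.
select_workflow() {
    if grep -q '^\.binder/'; then
        echo build_and_run
    else
        echo run
    fi
}

# Locally, the paths touched by the latest commit could be fed in with:
#   git diff --name-only HEAD~1 HEAD | select_workflow
```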
Long term archiving and citeability¶
The GitHub repository is also linked with zenodo.org to ensure long-term archiving, see How to make software citeable?
Note that a DOI provided by Zenodo can also be used with BinderHub to turn a repository snapshot backed up on Zenodo into an interactive environment (see this blog post).
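As a minimal sketch of such a DOI-based link, assuming the public mybinder.org service: the helper name zenodo_binder_url is hypothetical, and the DOI shown is a placeholder rather than the DOI of this repository.

```shell
#!/bin/sh
# Build a mybinder.org link that launches a snapshot archived on Zenodo.
# Argument: <Zenodo DOI>. zenodo_binder_url is a hypothetical helper.
zenodo_binder_url() {
    doi=$1
    printf 'https://mybinder.org/v2/zenodo/%s/\n' "$doi"
}

# "10.5281/zenodo.0000000" is a placeholder DOI (assumption).
zenodo_binder_url 10.5281/zenodo.0000000
```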