8 Reproduction
The whole concept behind open science is that other researchers or students will be able to reproduce, replicate, and extend scientific studies after they are published. This chapter is devoted to that epilogue: how to engage with a published open science study. The repurposing of a scientific study will normally start with reproducing the original study results - a process that will differ depending on the computational environment and original materials. We hope that by standardizing a research template, we will have eased the work of reproducing the study.
This chapter provides some guidance for reproducing a study that has been published with an executable research compendium. We first review procedures for studies conforming to version 2.0 or greater of the HEGSRR-Template. Then, we provide some options for studies published in other formats. For each scenario, we review options for two commonly used computational notebook formats: R with R Markdown, and Python with Jupyter.
8.1 Our template
Our reproducibility template, which can be found at HEGSRR/HEGSRR-Template on GitHub, is designed to reduce common frustrations when trying to execute a compendium for computational geography. Here are some tips on setting up the computational environments equivalent to those used in our template.
Refer to the procedure/environment/
folder for a README.md that may contain quick start instructions.
8.1.1 R
R is “a free software environment for statistical computing and graphics”. To install R, navigate to https://cloud.r-project.org/. Additionally, install RStudio (https://posit.co/downloads/), the most popular IDE (Integrated development environment) for R. 1
Those looking for a “cloud” solution may want to try out Posit Cloud.
8.1.1.1 The groundhog
package
Our template (HEGSRR/HEGSRR-Template) uses the package groundhog
(CredibilityLab/groundhog) to achieve reproducibility of R scripts.
Given a date and a list of packages needed, groundhog
can retrieve the most up-to-date version of both R and these packages on that date.
To recover the list of packages, in the relevant .Rmd
file, look for references to groundhog
.
This is typically in a code chunk that deals with setup tasks.
As recommended by groundhog
’s developers, a sensible first step is to change the specified groundhog.day
to a reasonably recent date - a date corresponding to your version of R, or a very recent date such as last week.
Then, we can execute the main function, groundhog.library()
.
Depending on the date you selected, you may need to install another version of R that was up-to-date on that date.
This is expected, because changes in R itself may impact computational results as well.
You may also need to give groundhog
permission to install its packages to a certain folder; for our template, this has been set to data/scratch/groundhog/
.
If the code functions as expected, then you are all set!
If not, then we can amend the groundhog day back to the original one;
after installing the correct version of R, groundhog
will proceed to install the packages on that date.
See the article Changing R version for the RStudio Desktop IDE for how to tell RStudio to use a specific version of R.
8.1.2 Python
Installing Python locally can be somewhat complicated.
In addition to Python, Jupyter is also needed to have interactive notebooks.
Tutorials found on the Internet also frequently mention conda, a package manager;
additional packages such as ipykernel
or nb_conda_kernels
may also be needed.
As an alternative, many platforms offer a “cloud” notebook that requires minimal setup. Additional benefits of these cloud notebooks may include:
- Disposable environment (see the Disposable environment section below)
- No hardware requirement for the user
- Access to premium hardware for more computational power, usually at a cost
- Share and collaborate with others
8.1.2.1 Disposable environment
The Python package manager, pip
, installs packages into a global library.
This may lead to conflicts, such as when different projects require different versions of the same package.
Fortunately, if a machine is disposable such that it never runs more than one project, then one does not have to worry about managing packages - one can simply install-and-forget. One example of such a “disposable environment” is Google Colab, where one is given a fresh machine whenever one connects to a Jupyter runtime. Another example is https://mybinder.org/, which provides a free virtual machine from a repository of Jupyter notebooks.
8.1.2.2 Package management
However, if you are working on a personal laptop, departmental machine, or another non-disposable environment, then keeping the system tidy is desirable.
In this case, before installing all the packages listed inside requirements.txt
, a “virtual environment” should be created.
With a virtual environment, no changes are made at the system-level.
Python’s own venv and conda are two common approaches.
Here is a tutorial from the website RealPython that walks you through using, as well as motivates, venv: Python Virtual Environments: A Primer
8.1.2.3 The pipenv
package
For Python, our template (HEGSRR/HEGSRR-Template) may have used the package pipenv
for a virtual environment.
This is indicated by a Pipfile
and a Pipfile.lock
file, found under the procedure/environment/
folder.
To recover the computational environment, first install pipenv
:
!pip install --user pipenv
Then, navigate to the procedure/environment/
folder.
If a Pipfile
and a Pipfile.lock
file are present, run:
!cd ../environment
!pipenv sync
If a Pipfile.lock
file is not present, run:
!cd ../environment
!pipenv install
!pipenv sync
8.1.2.4 The pigar
package
If no Pipfile
or Pipfile.lock
file is found under procedure/environment/
,
then the project in question may have used pigar
(damnever/pigar) to generate requirements.txt
,
a list of the packages needed.
If you are using a disposable environment, then simply run the following code (in Jupyter) to install the required packages:
!pip install -r /path/to/requirements.txt
For non-disposable environments, we recommend installing pipenv
for virtual environments:
!pip install --user pipenv
Then, utilize pipenv
to import the requirements file:
!pipenv install -r path/to/requirements.txt
8.2 Other compendiums
Here are some general tips for how to recover from a research compendium one might find on the Internet.
8.2.1 R
If you encounter a list of packages and their respective versions, you will have to install these versions manually.
You can do this with the remotes::install_version()
function documented here.
However, older version of packages on CRAN are usually only available in “source” form, and not an immediately usable “binary” form.
This means that you will need to have an R development environment installed for the package to be compiled from source to binary.
You can check if you do by running devtools::has_devel()
.
On Windows, the “development environment” is RTools; on MacOS the process is slightly more complicated.
One further complication is that packages often require system-level dependencies. These dependencies are additional software that are not always available, or consistent, across different machines. Thus, installing “source” packages is often prone to failure.
8.2.1.1 Recovering with groundhog
Fortunately, the groundhog
package (CredibilityLab/groundhog) can be used to attempt recovery of a computational environment, especially when the original versions are unknown.
Given a list of packages, the next step is to guess the “groundhog day” for these packages to be up-to-date. A good guess is when the research in question was conducted, or the date of publication. Once a date (for which the packages function correctly) is identified, that date can be recorded, and a reproducible compedium is produced.
8.2.1.2 The renv
package
Analogous to venv
in Python, renv
records the versions of R and its packages for a certain project, and isolates these packages in a separate virtual environment.
Per the documentation, a project with renv
will contain a renv
directory, as well as a renv.lock
file.
By opening the project, renv
will automatically activate itself and prompt you to run renv::restore()
, which restores the virtual environment.
Note that renv
usually has to install from “source” when recovering, so this will often fail.
See renv
’s Caveats page for details.
8.2.2 Python
Projects using virtual environments (venv
, virtualenv
, conda
, Poetry, etc.) will usually at least mention the method used.
One can also look for hints in the folder structure, for example a venv
folder.
Once the particluar solution is located, the documentation for restoring that environment can then be readily found online.
8.2.2.1 Manually create requirements.txt
If you encounter a list of package names in Python, it is possible to convert that list into a reproducible requirements.txt
file.
Simply find a package version that works, and record it into the requirements file.
Here is an example requirements file.
Using the package scipy
as an example:
-
scipy
: install the latest version from the Python Package Index (PyPI). -
scipy == 1.9.1
: install version 1.9.1 from PyPI. -
scipy == 1.9.*
: install the latest version that starts with 1.9.1. This installs 1.9.3, unlessscipy
decides to release version 1.9.4.
To ensure repeatability, it is best practice to specify an exact version of packages; this is called “version pinning”.
8.2.3 Docker
You may encounter a compedium in the form of a Docker container or Dockerfile. In this case, you will need to run Docker.
Docker is software that can package software into “containers”. These containers then run inside a virtual machine, within a “host” machine that runs the Docker program.
Docker containers can be distributed directly, usually through a “container registry” that you can “pull” from, and occasionally through a large file.
They can also come in the form of a Dockerfile, which is the “recipe” for producing a Docker container. The Dockerfile will need to be accompanied with files (data, software, etc.) that Docker uses as “ingredients” to produce the Docker container.
Installing Docker (Desktop) on Windows or MacOS is possible, but complications and caveats exist. The preferred way to use Docker (Server) is through a server running a supported Linux distribution.
A Dockerfile contains instructions on how to create (“build”) a Docker container; run docker build to turn a project-with-Dockerfile into a container. Alternatively, the Docker container may have already been built; in this case, docker pull the container onto the host machine. Once we have a container, we can use docker run to execute the container.
Note that Docker images are essentially virtual machines inside your system, so it is not usually possible to directly access the virtual machine. Often, the communication happens through a browser. For example, the Docker image may run RStudio Server (or JupyterLab), in which case you can access the service with a browser.
8.2.3.1 Cloud services
Alternatively, you can utilize an online service to run Docker.
Many platforms offer “virtual machines” that run Linux distributions, so you can install and run Docker inside these machines. Another commonly-seen approach is for the platform to run Docker for you.
When using cloud services, especially “virtual machines”, it is important to note that there is a risk of someone hacking into your cloud machine, possibly from anywhere in the world. Therefore, when using a cloud service, it is desirable to follow common best practices (e.g. use a strong password, use SSH keys, set up a firewall), and take extra precautions to guard against unauthorized access to your machine.