4 Data Observation and Curation

4.1 Data Management

Store all of your research data in the data subdirectories. It is recommended that raw data not be altered once downloaded or collected. Maintaining separate raw data files facilitates reproducibility by preserving a common point of analytical origin. It is similarly recommended that, whenever possible, data processing, transformation, or manipulation be completed with code, as this practice facilitates re-analysis and reduces opportunities for confusion.
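
For example, a minimal sketch of a code-based transformation, assuming Python with pandas; the file and column names are hypothetical, and the paths follow the raw/public and derived/public subfolders described below:

# Read from the raw folder and write only to the derived folder,
# leaving the raw file untouched. File and column names are placeholders.
import pandas as pd

raw = pd.read_csv("data/raw/public/observations.csv")
clean = raw.dropna(subset=["site_id"])
clean.to_csv("data/derived/public/observations_clean.csv", index=False)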

Complete the data_metadata.csv file, indexing each raw and derived data file with the following fields; a sketch of filling in this file with code follows the list:

  • path: the path to the data folder, likely one of: raw/private, raw/public, derived/private, or derived/public
  • name: the file name, including extension
  • metadata: a list of metadata files for this data source, stored in the data/metadata folder. These may include ISO-191** or FGDC standard XML files, data dictionaries, licenses or attributions, user guides, webpage printouts, etc.
  • status: one of included for data included in the repository, create or acquire for data that must be created or acquired, derived for data that will be generated by code from other data files, simulated for data that replaces the true research data with simulated data due to confidentiality or legal constraints, or unavailable for data that cannot be shared or reproduced in any way.
  • description: a very brief description of the dataset.
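
As an illustration, a minimal sketch of filling in data_metadata.csv with Python's csv module; the example rows, file names, and descriptions are hypothetical placeholders, and the location of the file in the data folder is assumed:

import csv
from pathlib import Path

fields = ["path", "name", "metadata", "status", "description"]
rows = [
    # Hypothetical entries for illustration only
    {"path": "raw/public", "name": "observations.csv",
     "metadata": "observations_dictionary.csv", "status": "included",
     "description": "Field observations from a public data portal"},
    {"path": "derived/public", "name": "observations_clean.csv",
     "metadata": "", "status": "derived",
     "description": "Cleaned observations produced by code"},
]

Path("data").mkdir(exist_ok=True)  # assumed location of data_metadata.csv
with open("data/data_metadata.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=fields)
    writer.writeheader()
    writer.writerows(rows)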

Researchers are strongly encouraged to include additional metadata in the metadata folder. Further information about the procedures used to create data with ‘status = derived’ should be maintained in the procedure_metadata.csv.

See more about metadata in the engaging with data section of the previous chapter.

4.2 Collect preliminary data

  • metadata!
  • code/scripts for data acquisition
  • directory structure for data (see the sketch after the table below)
    • scratch (not tracked)
    • raw / public
    • raw / private (not tracked)
    • derived / public
    • derived / private (not tracked)
  • file size limits for GitHub / GitLab

                 Access
  Processing     Private    Public
  Raw            RPri       RPub
  Derived        DPri       DPub
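
A minimal sketch, assuming Python, of creating the data directory structure listed above; the top-level data path is an assumption:

from pathlib import Path

# Create the data directory tree. Untracked folders (scratch and the
# private folders) are still created locally even though Git ignores them.
base = Path("data")
subfolders = [
    "scratch",
    "metadata",
    "raw/public",
    "raw/private",
    "derived/public",
    "derived/private",
]
for sub in subfolders:
    (base / sub).mkdir(parents=True, exist_ok=True)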

4.2.1 Raw private data

Store raw data in this folder as it is collected or downloaded if the data cannot be publicly redistributed. For example, data versioning and sharing may be restricted because of large file sizes, licensing, ethics, privacy, or confidentiality. Best practice is to include code that automates downloading or simulating raw private data as the first step of the methods, or to include instructions here for accessing any private or restricted-access data.
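
For instance, a minimal sketch, assuming Python with numpy and pandas, of a script that simulates a confidential raw dataset so downstream code can run without the real file; the column names, sample size, and distributions are hypothetical:

import numpy as np
import pandas as pd
from pathlib import Path

# Generate a simulated stand-in for a confidential raw table.
rng = np.random.default_rng(seed=12345)  # fixed seed for reproducibility
simulated = pd.DataFrame({
    "respondent_id": range(1, 501),
    "age": rng.integers(18, 90, size=500),
    "response": rng.choice(["agree", "neutral", "disagree"], size=500),
})

out = Path("data/raw/private/survey_simulated.csv")
out.parent.mkdir(parents=True, exist_ok=True)
simulated.to_csv(out, index=False)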

This folder is ignored by Git versioning, with the exception of this readme.md file, by the following lines in .gitignore:

# Ignore contents of private folder, with the exception of its readme file
private/**
!private/readme.md

4.2.2 Caution: Dealing with large files

Files can come in two flavors: plain text, like source code, Markdown, or system logs; and binary files, like images, videos, or shapefiles.

As a version management tool, Git is designed to track changes in plain text files; it can store changes in binary files as well, but it can only record that the whole file changed.

GitHub will warn you of files larger than 50 MiB, and reject files larger than 100 MiB. Therefore, large files should generally be placed in private directories so that they are not tracked by Git or uploaded to GitHub.
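
To catch oversized files before committing, a minimal sketch, assuming Python, that scans the repository for files above GitHub's 50 MiB warning and 100 MiB rejection thresholds:

from pathlib import Path

WARN = 50 * 1024 ** 2     # GitHub warns above 50 MiB
BLOCK = 100 * 1024 ** 2   # GitHub rejects above 100 MiB

# Report any file approaching or exceeding the limits, skipping Git internals.
for path in Path(".").rglob("*"):
    if path.is_file() and ".git" not in path.parts:
        size = path.stat().st_size
        if size > BLOCK:
            print(f"Too large for GitHub: {path} ({size / 1024 ** 2:.1f} MiB)")
        elif size > WARN:
            print(f"Warning size: {path} ({size / 1024 ** 2:.1f} MiB)")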

OSF and Figshare both allow for larger file storage options, so you may store large files on those services and write code for downloading those files to private directories as the analysis runs. Significant data sources could be registered with their own DOI links.
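
A minimal sketch, assuming Python with the requests library and a hypothetical download URL, that fetches a large file into a private (untracked) directory only if it is not already present:

import requests
from pathlib import Path

# Hypothetical direct-download URL, e.g. from an OSF or Figshare deposit.
URL = "https://example.org/files/large_dataset.zip"
destination = Path("data/raw/private/large_dataset.zip")

destination.parent.mkdir(parents=True, exist_ok=True)
if not destination.exists():
    # Stream the download so the whole file is never held in memory.
    with requests.get(URL, stream=True, timeout=60) as response:
        response.raise_for_status()
        with open(destination, "wb") as f:
            for chunk in response.iter_content(chunk_size=1024 * 1024):
                f.write(chunk)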

If version management of large files is required, GitHub provides paid hosting for the Git LFS (Large File Storage) program. However, we still suggest saving large files as separate data resources, so that downstream researchers attempting to reproduce or replicate your work are not required to modify your code, or install and pay for the same large file storage options that you have used.

If you have already committed changes with large files, follow GitHub’s instructions here: Removing files from a repository’s history. On GitHub Desktop, go to the History tab of your repository and undo the last commit. The Changes tab will repopulate with the changes from that commit, where you should be able to uncheck any large files from inclusion in the commit. Meanwhile, move the large files into the appropriate private directory, and the .gitignore file should take over and make them disappear from the list of changes in GitHub Desktop.

4.3 Updating the analysis plan

You will likely encounter unexpected challenges and the need to change your original pre-analysis registration plan. This is normal: just be diligent about updating your analysis plan, cataloguing deviations from the original plan, and committing changes to the repository.

Document unplanned deviations in the analysis plan as they occur. If the study is a metascience study, then categorize unplanned deviations as for reproduction if the aim of the deviation is still to reproduce the original methodology and original results. Categorize deviations as for reanalysis if the aim is to alter a methodological parameter of the study in order to compare results, e.g. as a test of sensitivity, uncertainty, or robustness. Categorize deviations as for replication if the aim is to alter the spatial-temporal coverage of the study or to otherwise repeat the study methodology with new data or observations.

For full transparency, document both the rationale and the form of each deviation.