Defining Datasets: Best Practices

From OSF Wiki

Datasets are one of the fundamental dimensions for organizing content within OSF. This article discusses best practices for bounding and scoping a given dataset.

Datasets are the basis for organizing structured data within the system. Each contains one or more records from a single source, all representing the same type of instance. Datasets may originate from the Web or from local sources, and they carry standard metadata. At minimum, a dataset has a simple structure of attribute-value pairs for each instance record.
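The structure described above can be illustrated with a minimal sketch. This is not an OSF API: the `Dataset` class, its field names, and the example URIs are all hypothetical, chosen only to show one source, one record type, some metadata, and records held as attribute-value pairs.

```python
from dataclasses import dataclass, field


@dataclass
class Dataset:
    """Hypothetical in-memory model of a dataset (not an OSF API)."""
    uri: str                  # identifier for the dataset (assumed name)
    source: str               # the single source the records come from
    record_type: str          # the one type of instance the records represent
    metadata: dict = field(default_factory=dict)   # standard descriptive metadata
    records: list = field(default_factory=list)    # one dict of attribute-value pairs per record

    def add_record(self, record_type: str, attributes: dict) -> None:
        # Reject records of a different type so the dataset stays uniform.
        if record_type != self.record_type:
            raise ValueError(
                f"expected records of type {self.record_type!r}, got {record_type!r}"
            )
        self.records.append(dict(attributes))


# Example values are invented for illustration.
schools = Dataset(
    uri="http://example.org/datasets/schools",
    source="http://example.org/exports/schools.csv",
    record_type="School",
    metadata={"title": "Local schools"},
)
schools.add_record("School", {"name": "Lincoln Elementary", "enrollment": 412})
```

Enforcing a single record type at insertion time is one possible design choice; a real system might instead validate on import.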

Though technically a 'dataset' may be any collection of one or more records, it is best to create datasets in a uniform way, so that all records in a dataset have as much in common as possible. Consider creating separate datasets whenever any of these factors differ:

  • Source - does the data vary by publisher or source location? For example, provenance, download location, or format may be an important distinguishing factor.
  • When created - is the data created or updated on a periodic schedule? For example, it may be important to distinguish between preliminary data and final data.
  • Access rights - are there any differences in how users may see or act upon the data? For example, privileged budget information may be put in a different dataset from public financial information.
  • Type - does the data vary by class or kind? For example, it may be desirable to keep records about schools separate from records about churches, even though at a more general level both may be considered buildings.
  • Attributes - are there differences in the fields or attributes that describe the data? For example, a portion of the records may have complete attribute descriptions, while the majority contain only a few descriptive fields.

Any of these differences may warrant creating a separate dataset. There are no limits to the number of datasets that may be managed by a given OSF instance.
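As a rough illustration of these factors, the sketch below groups incoming records into candidate datasets, starting a new group whenever source, creation status, access rights, type, or the set of attribute names differs. The record layout (`source`, `status`, `access`, `type`, `attributes`) and the function names are assumptions made for this example, not OSF structures. Splitting on the exact set of attribute names is deliberately strict; in practice a looser comparison may be more appropriate.

```python
from collections import defaultdict


def dataset_key(record: dict) -> tuple:
    """Build a grouping key from the five factors (hypothetical record layout)."""
    return (
        record["source"],                  # Source
        record["status"],                  # When created (e.g. preliminary vs. final)
        record["access"],                  # Access rights
        record["type"],                    # Type
        frozenset(record["attributes"]),   # Attributes: the set of field names
    )


def partition(records: list) -> dict:
    """Group records so that each group shares all five factors."""
    groups = defaultdict(list)
    for record in records:
        groups[dataset_key(record)].append(record)
    return dict(groups)


# Invented sample data: two public budget items and one restricted one.
records = [
    {"source": "city-budget.csv", "status": "final", "access": "public",
     "type": "BudgetItem", "attributes": {"name": "Parks", "amount": 120000}},
    {"source": "city-budget.csv", "status": "final", "access": "public",
     "type": "BudgetItem", "attributes": {"name": "Roads", "amount": 450000}},
    {"source": "city-budget.csv", "status": "final", "access": "restricted",
     "type": "BudgetItem", "attributes": {"name": "Legal reserve", "amount": 80000}},
]

datasets = partition(records)
```

Here the restricted record lands in its own group because its access rights differ, mirroring the budget example given under access rights.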