Dataset Specifications and Metadata

From OSF Wiki

Within the Open Semantic Framework (OSF), datasets are objects in their own right and are accessed and managed as such. This article describes how to characterize a dataset and provide metadata for it. Note: Much of the information herein is drawn from the irON specification.

A dataset is used to document information about the creation of instance records and to link external resources to them (such as the linkage and structure schemas; more about these below).

A dataset can be seen as an aggregation of instance records used to keep a reference between the instance records and their source (provenance). A dataset can be split into multiple dataset slices. Each slice can be written in a separate file. Each slice of a dataset shares the same <id> of the dataset.
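
The slice mechanism can be illustrated with a minimal sketch. It assumes each slice has already been parsed into a dictionary that carries the dataset id and a list of instance records; the "records" key and the sample values are hypothetical and are not part of the irON specification.

```python
from collections import defaultdict


def merge_slices(slices):
    """Group dataset slices by their shared id and pool their instance records."""
    merged = defaultdict(list)
    for dataset_slice in slices:
        # Every slice carries the id of the dataset it belongs to.
        # "records" is a hypothetical key holding the slice's instance records.
        merged[dataset_slice["id"]].extend(dataset_slice.get("records", []))
    return dict(merged)


# Hypothetical slices, for example parsed from three separate files.
slice_a = {"id": "cities", "records": [{"id": "paris"}]}
slice_b = {"id": "cities", "records": [{"id": "rome"}, {"id": "oslo"}]}
slice_c = {"id": "rivers", "records": [{"id": "seine"}]}

datasets = merge_slices([slice_a, slice_b, slice_c])
print(sorted(datasets))
print([record["id"] for record in datasets["cities"]])
```

Because the id is the only thing tying slices together, the grouping key is the id alone; the file a slice came from plays no role in this sketch.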


Dataset Description

A dataset description, known in irON as the Core Dataset Attributes, is the set of attributes suggested for inclusion with any dataset or dataset slice specification. Some of these attributes are required, some are recommended, and others are optional.

Each attribute/keyword is listed with its requirement, note, allowed type(s), allowed value(s), and definition.

id
  Requirement: Required
  Allowed Type(s): Object
  Allowed Value(s): String
  Definition: Identifier string used to uniquely identify the dataset.

prefLabel
  Requirement: Recommended
  Note: If the prefLabel of a dataset is not specified, systems displaying information about datasets have no good way to present the dataset to users. In that case, the system has to fall back to displaying the id of the dataset or some generic label.
  Allowed Type(s): Object
  Allowed Value(s): String
  Definition: Human-readable label used to refer to this dataset. The prefLabel is the preferred label for the dataset and the preferred string in the user interface. If not specified, the id is used as the label.

metaFile
  Requirement: Optional
  Note: Not used if the actual metadata attributes are embedded in the dataset specification.
  Allowed Type(s): Object
  Allowed Value(s): String [format: uri (as a file reference)]
  Definition: Reference to an external record object; see the next table for suggested dataset metadata. The reference to the file has to be a URI. If the file is local to a file system, the "file:" scheme should be used; if the file is on the Web, the "http:" scheme has to be used, and so on.

schema
  Requirement: Optional
  Note: The structure schema can be embedded in a dataset file or linked from the dataset description. If the structure schema is embedded and a URL is also specified, the system uses the schema with the largest version number (most recent). If both versions are the same, the system may use either one. If no structure schema is accessible, the system will ignore the specification.
  Allowed Type(s): Dataset
  Allowed Value(s):
    • String [format: url]
    • embedded schema linkage definition [object]
    • Array(String [format: url])
    • Array(embedded schema linkage definition [object])
  Definition: URL reference where the structure schema can be retrieved from the Web. Otherwise, a user can put the description of the schema in an object as the value of this attribute. More about this below.

linkage
  Requirement: Optional
  Note: The schema linkage can be embedded in a dataset file or linked from the dataset description. If the schema linkage is embedded and a URL is also specified, the system uses the schema linkage with the largest version number (most recent). If both versions are the same, the system may use either one. If no schema linkage is accessible, the system ignores the extended capabilities it gives.
  Allowed Type(s): Dataset
  Allowed Value(s):
    • String [format: url]
    • embedded schema linkage definition [object]
    • Array(String [format: url])
    • Array(embedded schema linkage definition [object])
  Definition: URL reference where the schema linkage can be retrieved from the Web. Otherwise, a user can put the description of the schema linkage in an object as the value of this attribute.
There are a couple of important points regarding this listing:

  1. If the attributes or resources are already in the ontology, only the linkage information is necessary to match the source data to the OSF ontology or ontologies.
  2. Any attribute of your own choosing may be added to this list to accommodate your own organization's requirements and workflows.
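
The rules attached to the core attributes can be sketched as a few small checks. This is only an illustration under stated assumptions: the dataset description is held in a plain dictionary, version numbers are dotted integers such as "1.10", and the function names (display_label, check_core_attributes, pick_schema) are hypothetical helpers rather than part of any OSF or irON API.

```python
from urllib.parse import urlparse


def display_label(dataset):
    """Return the prefLabel when present, otherwise fall back to the id."""
    return dataset.get("prefLabel") or dataset["id"]


def check_core_attributes(dataset):
    """Return a list of problems found in the core attributes of a dataset."""
    problems = []
    if not dataset.get("id"):
        problems.append("id is required")
    if not dataset.get("prefLabel"):
        problems.append("prefLabel is recommended; the id will be shown instead")
    meta_file = dataset.get("metaFile")
    # The metaFile reference has to be a URI, so it needs a scheme (file:, http:, ...).
    if meta_file is not None and not urlparse(meta_file).scheme:
        problems.append("metaFile must be a URI with a scheme such as file: or http:")
    return problems


def version_key(version):
    """Assumption: versions are dotted integers, so '1.10' sorts after '1.9'."""
    return tuple(int(part) for part in version.split("."))


def pick_schema(embedded, linked):
    """Choose between an embedded schema and one retrieved from a URL.

    The larger version wins; on a tie either is acceptable, and this
    sketch happens to return the embedded one.
    """
    candidates = [schema for schema in (embedded, linked) if schema is not None]
    if not candidates:
        return None
    return max(candidates, key=lambda schema: version_key(schema["version"]))


# Hypothetical dataset description with a relative path instead of a URI.
dataset = {"id": "cities", "metaFile": "meta/cities.xml"}
print(display_label(dataset))
print(check_core_attributes(dataset))
print(pick_schema({"version": "1.9"}, {"version": "1.10"}))
```

Comparing versions as tuples of integers rather than as strings avoids ranking "1.9" above "1.10"; a different version format would need a different key function.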

Abstract Dataset Specification Example

Here is an example of an abstract dataset specification, with additional attributes beyond the core.

<dataset>
   <id />
   <prefLabel />
   <description />
   <source />
   <createDate />
   <creator />
   <curator />
   <maintainer />
   <prefURL />
   <ref />
   <linkage />
   <schema />
</dataset>
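
To make the abstract specification more concrete, the sketch below fills a few of its elements with sample values. It assumes the angle-bracket notation maps directly onto XML elements, which the abstract example does not confirm; only the element names come from the example, and every value is hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical values; only the element names come from the abstract example.
values = {
    "id": "cities",
    "prefLabel": "World cities",
    "description": "Instance records describing cities.",
    "createDate": "2011-03-01",
}


def build_dataset(attributes):
    """Build a <dataset> element with one child element per attribute."""
    root = ET.Element("dataset")
    for name, text in attributes.items():
        child = ET.SubElement(root, name)
        child.text = text
    return root


xml_text = ET.tostring(build_dataset(values), encoding="unicode")
print(xml_text)
```

Attributes that are left out of the dictionary are simply omitted from the output, which matches the optional status of everything except id.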

Metadata

Metadata may be added to the dataset specification via the optional metaFile attribute (see above) or by embedding it in the dataset specification itself.

Suggested Metadata Attributes

These attributes follow the general instance record object specification for irON, so any additional attributes of your choosing may be included as desired. As noted, the same attributes and values may either be embedded within the dataset specification or placed in the separate file referenced by metaFile.

Each attribute is listed with its requirement, note, allowed type(s), cardinality, allowed value(s), and definition.

id
  Requirement: Required
  Allowed Type(s): Object
  Cardinality: [1]
  Allowed Value(s): primitive: Id
  Definition: Identifier string used to uniquely identify the dataset.

prefLabel
  Requirement: Recommended
  Note: If the prefLabel of a dataset is not specified, systems displaying information about datasets have no good way to present the dataset to users. In that case, the system has to fall back to displaying the id of the dataset or some generic label.
  Allowed Type(s): Object
  Cardinality: [0-1]
  Allowed Value(s): primitive: String
  Definition: Human-readable label used to refer to this dataset. The prefLabel is the preferred label for the dataset and the preferred string in the user interface. If not specified, the id is used as the label.

description
  Requirement: Recommended
  Allowed Type(s): Object
  Cardinality: [0-1]
  Allowed Value(s): primitive: String
  Definition: Human-readable description of the dataset.

source
  Requirement: Optional
  Allowed Type(s): Dataset
  Cardinality: [0-*]
  Allowed Value(s): primitive: Id
  Definition: An attribute describing the source of the dataset. The description of the record is available within the array of records of the dataset.

createDate
  Requirement: Optional
  Allowed Type(s): Dataset
  Cardinality: [0-1]
  Allowed Value(s): primitive: Datetime
  Definition: Date of the creation of the dataset.

creator
  Requirement: Optional
  Allowed Type(s): Dataset
  Cardinality: [0-*]
  Allowed Value(s): primitive: Id
  Definition: An attribute describing the creator of the dataset. The description of the record is available within the array of records of the dataset.

curator
  Requirement: Optional
  Allowed Type(s): Dataset
  Cardinality: [0-*]
  Allowed Value(s): primitive: Id
  Definition: An attribute describing the curator of the dataset. The description of the record is available within the array of records of the dataset.

maintainer
  Requirement: Optional
  Allowed Type(s): Dataset
  Cardinality: [0-*]
  Allowed Value(s): primitive: Id
  Definition: An attribute describing the maintainer of the dataset. The description of the record is available within the array of records of the dataset.
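
The cardinalities listed for the suggested metadata attributes lend themselves to a simple check. The sketch below assumes the metadata has been parsed into a dictionary in which multi-valued attributes are lists; the check_cardinality helper and the sample records are hypothetical, and attributes of your own choosing are deliberately not checked.

```python
# Cardinalities copied from the suggested metadata attributes:
# (minimum, maximum), where None stands for the unbounded "*".
CARDINALITY = {
    "id": (1, 1),
    "prefLabel": (0, 1),
    "description": (0, 1),
    "source": (0, None),
    "createDate": (0, 1),
    "creator": (0, None),
    "curator": (0, None),
    "maintainer": (0, None),
}


def as_list(value):
    """Normalize a missing, single, or multiple value into a list."""
    if value is None:
        return []
    return list(value) if isinstance(value, (list, tuple)) else [value]


def check_cardinality(metadata):
    """Return one message per attribute whose value count is out of range."""
    errors = []
    for name, (low, high) in CARDINALITY.items():
        count = len(as_list(metadata.get(name)))
        if count < low:
            errors.append(f"{name}: expected at least {low}, found {count}")
        elif high is not None and count > high:
            errors.append(f"{name}: expected at most {high}, found {count}")
    return errors


# Hypothetical metadata records.
good = {"id": "cities", "creator": ["alice", "bob"]}
bad = {"prefLabel": ["Cities", "Towns"]}
print(check_cardinality(good))
print(check_cardinality(bad))
```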

Once the dataset and its metadata have been defined, the next step is to prepare and import the dataset.