Basic Dataset Management

From OSF Wiki

Basic Intro and Groundings

Information within the Open Semantic Framework may reside in many forms, but it largely exists as one of the following:

  • Schema -- the organizational structures or ontologies that govern and define the relationships between concepts and data,
  • Documents -- unstructured data, from which tagging extracts concepts and entities that can then be related to the governing schema or data, and
  • Datasets -- the structured data components that provide and describe the instance records and attributes about individual things and entities.

This article is a basic entry point for how to interact with the last of these, datasets.

Datasets contain one or more data records from a single source representing the same type of instance(s). Datasets may reside on the Web as well as be stored locally. Each dataset is uniquely identified with standard metadata characterizations.

At minimum, datasets have a simple structure of attribute-value pairs for each instance record. However, they may also have more complex structure via schemas (ontologies) that also describe the relationships between concepts and attributes and may even relate those to external schema.
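As a minimal sketch of the simplest case, a dataset can be pictured as identifying metadata plus a set of instance records, each a flat mapping of attributes to values. The Python below is illustrative only: the field names (`id`, `title`, `source`, `records`), the sample records, and the `attribute_values` helper are hypothetical and do not reflect how OSF stores datasets internally.

```python
# Illustrative only: a dataset as identifying metadata plus instance records,
# where each record is a flat set of attribute-value pairs.
# All names and values here are hypothetical, not an OSF format.
dataset = {
    "id": "http://example.org/datasets/parks/",  # hypothetical unique identifier
    "title": "City Parks",
    "source": "example municipal open data",
    "records": [
        {"id": "park-1", "type": "Park", "name": "Riverside Park", "area_ha": 12.5},
        {"id": "park-2", "type": "Park", "name": "Hilltop Green", "area_ha": 3.2},
    ],
}


def attribute_values(dataset, attribute):
    """Return the value of one attribute for every record that has it."""
    return [r[attribute] for r in dataset["records"] if attribute in r]


names = attribute_values(dataset, "name")
```

A schema or ontology would add what this flat form lacks: statements about how `Park`, `name`, and `area_ha` relate to other concepts and attributes.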

All OSF tools operate against one or more datasets, which can be selected for these operations. For each dataset, an individual user may or may not be assigned access rights, and may or may not have CRUD (create-read-update-delete) permissions.

The combination of access rights and permissions then defines which tools and what operations are available to a given user for each dataset. See the Managing Permissions document for further details.
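The interplay of access and CRUD permissions can be sketched as follows. This is a hypothetical model, not the OSF permission API: the `Permission` class, the `grants` table, the `allowed_operations` helper, and the user and dataset names are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Permission:
    """Hypothetical per-user, per-dataset CRUD flags (not an OSF data structure)."""
    create: bool = False
    read: bool = False
    update: bool = False
    delete: bool = False


# Hypothetical grants keyed by (user, dataset). A missing key means no access.
grants = {
    ("alice", "parks"): Permission(create=True, read=True, update=True, delete=True),
    ("bob", "parks"): Permission(read=True),
}


def allowed_operations(user, dataset):
    """Return the set of CRUD operations the user may perform on the dataset."""
    perm = grants.get((user, dataset))
    if perm is None:
        return set()  # no access rights to this dataset at all
    return {op for op in ("create", "read", "update", "delete") if getattr(perm, op)}
```

In this sketch a tool that edits records would be offered to a user only when `"update"` appears in the returned set for the selected dataset.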

The irON Notation

Though there are a variety of ways to get data into OSF, a key one, especially for converted or authored datasets, is based on irON, the 'instance record and Object Notation'. irON is an abstract notation and associated vocabulary for specifying RDF (Resource Description Framework) triples and schema in non-RDF forms. irON's purpose is to allow users and tools working in non-RDF formats to stage interoperable datasets using RDF.

irON is specifically designed for dataset description and characterization. irON datasets can be created or authored in any of three serialization forms: XML (irXML), JSON (irJSON), or comma-separated values (CSV), the last using spreadsheets as the authoring and management environment. The CSV serialization is called commON and is the more prevalent use case in the various examples below.
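To illustrate the general idea of turning spreadsheet-style rows into RDF-like triples, here is a minimal sketch. It does not implement irON or commON parsing: the column names, the convention that an `id` column names the subject, and the `http://example.org/` base URI are assumptions made for the example. The irON specification defines the actual notation.

```python
import csv
import io

# Hypothetical CSV content; real commON files follow the irON specification.
CSV_TEXT = """id,name,area_ha
park-1,Riverside Park,12.5
park-2,Hilltop Green,3.2
"""

BASE = "http://example.org/"  # assumed base URI for subjects and predicates


def rows_to_triples(csv_text, base=BASE):
    """Turn each non-empty cell into a (subject, predicate, object) triple."""
    triples = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        subject = base + "record/" + row["id"]
        for column, value in row.items():
            if column == "id" or not value:
                continue  # the id names the subject; empty cells add nothing
            triples.append((subject, base + "attr/" + column, value))
    return triples


triples = rows_to_triples(CSV_TEXT)
```

Each row becomes one subject and each remaining column one predicate, which is the basic mapping that lets tabular data be staged as RDF.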

The irON specification should be read and understood in its entirety before beginning to author datasets using the irON notation.

Creating and Importing New Datasets

Adding a new dataset is the proper place to start when contemplating adding new data to the system. This document describes the basic approaches and trade-offs with the various input mechanisms available.

When authoring a new dataset or making many changes to an existing one, it is advisable to follow the dataset import approach.

A fairly detailed step-by-step set of guidelines for how to create and import datasets is provided in the CommON case study.

Editing or Modifying Existing Datasets

When only one or a few records need to be updated or modified, it is advisable to use the OSF-Drupal update record tool.

Displaying Datasets

For basic guidance on how to set up dataset edit and create forms, see the instance record forms format.