Archive 1.x:Datasets: Define Dataset Specifications

From OSF Wiki

The natural follow-on to the analysis of the dataset(s) is to define the actual dataset specifications. The specifications must be expressed in terms of properties and classes that the system recognizes. As a result, these steps need to be carried out while closely inspecting the existing ontology(ies) available to the system. The major tool to assist these workflow steps is structOntology (though Protégé or a similar ontology editing platform may also be used).

In general terms, the steps in this portion of the dataset workflow process are geared to achieve these specifications:

  • What are the kinds or types ("classes") within the dataset(s)? Do definitions of these classes already exist in the system? If so, assign their existing class name (URI); if not, flag them for addition to the existing ontology(ies) (see next sequence of steps). Depending on the analysis, multiple types in the source data may also be mapped to a single class
  • What are the relations or attributes ("properties") within the dataset(s)? Do definitions of these properties already exist in the system? If so, assign their existing property name (URI); if not, flag for addition to the existing ontology(ies) (see next sequence of steps)
  • Does the dataset have geography-related information? If so, particular specifications need to be defined depending on whether the dataset has point, polyline or polygon characteristics
  • Do any of the attributes refer to object values (such as references to internal or external records or classes)? If so, assign their existing class name (URI); if not, flag for addition to the existing ontology(ies) (see next sequence of steps).

The determinations you make in these workflow steps will be applied to the actual preparation and conversion of your datasets, whether they are based on the commON format or drawn from a relational database.

Base URI and Metadata

Determine the base dataset URI. The base URI is where the ontology that represents your dataset vocabulary is found. It is also used to generate the core portion of the records' URIs.

In setting this base URI, try to assign short, logical names to the namespaces used for your vocabularies, such as foaf:XXX, umbel:XXX or skos:XXX, with a maximum of five letters preferred.
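To make the role of the base URI concrete, here is a minimal sketch of how a base URI and a short prefix might be used to build and expand record URIs. The base URI `http://example.org/datasets/parks/`, the prefix `parks` and all function names are hypothetical and do not come from OSF; the five-letter check only mirrors the preference stated above.

```python
from urllib.parse import quote

# Hypothetical values for illustration; substitute your own dataset's base URI.
BASE_URI = "http://example.org/datasets/parks/"
PREFIX = "parks"


def check_prefix(prefix, max_len=5):
    """Return True if the prefix is lowercase alphabetic and within the preferred length."""
    return prefix.isalpha() and prefix.islower() and len(prefix) <= max_len


def record_uri(base_uri, record_id):
    """Build a record URI by appending a percent-encoded local ID to the base URI."""
    if not base_uri.endswith(("/", "#")):
        base_uri += "/"
    return base_uri + quote(record_id, safe="")


def expand_prefixed_name(name, namespaces):
    """Expand a prefixed name such as 'parks:123' into a full URI."""
    prefix, _, local = name.partition(":")
    return namespaces[prefix] + local
```

For example, `expand_prefixed_name("parks:123", {PREFIX: BASE_URI})` would yield a URI under the hypothetical base. Percent-encoding the local ID is a design choice of this sketch, not a documented OSF rule.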

In addition, set the metadata specifications for your dataset. These possible specifications provide a good starting list of candidates from which to draw.

Ontology Lookups

As you identify the concepts (classes) or attributes (properties) within your dataset, you need to determine whether matching classes or properties already exist in the system. You do so by using the structOntology tool (or via Protégé or a similar ontology editing application).

Using structOntology or its advanced search functionality (for details, see the Individual OSF-Drupal Ontology (structOntology) Tool manual), you first search for matching classes or properties. Use the auto-complete functionality to suggest possible matches, and then inspect these options according to these criteria:

  1. Does the definition of the object match your understanding of the target class or property in the source dataset?
  2. Do the range and domain (if used) for the object apply to the same populations as you understand for your target class or property?
  3. Do the parent and child relationships in the ontology also match your understanding of the target class or property?

If you answer Yes to all of these questions, then you have an excellent match for your object. Note the name and URI of the match for later inclusion in the dataset linkage file (see below).

If you cannot answer Yes to all of these questions, then you will need to update the ontology with the new object (see next).
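The decision rule above can be recorded per candidate so that the outcomes feed the later linkage step. The following is a small sketch only: the three answers remain human judgments made while inspecting the ontology, and the `MatchReview` structure, its field names and the example URI are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class MatchReview:
    """A reviewer's answers for one candidate match (all names are hypothetical)."""
    source_name: str            # name as it appears in the source dataset
    candidate_uri: str          # URI of the candidate found in the ontology
    definition_matches: bool    # criterion 1: definition agrees
    domain_range_matches: bool  # criterion 2: range and domain agree (if used)
    hierarchy_matches: bool     # criterion 3: parent and child relationships agree


def decide(review):
    """Return 'reuse' only when all three criteria are answered Yes."""
    answers = (
        review.definition_matches,
        review.domain_range_matches,
        review.hierarchy_matches,
    )
    return "reuse" if all(answers) else "add-to-ontology"


# Usage: one full match and one partial match.
full = MatchReview("Park", "http://example.org/onto#Park", True, True, True)
partial = MatchReview("Trail", "http://example.org/onto#Path", True, False, True)
outcomes = {r.source_name: decide(r) for r in (full, partial)}
```

Objects marked `reuse` keep the noted name and URI; the others go through the ontology update steps.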

Variable Naming and Assigning URIs

If the dataset object (class or attribute) is not already in the system, you will need to name and add it. You should follow the naming conventions and guidelines for including labels and definitions as contained in the Ontology and Vocabulary Design document.

Please follow the directions under the Datasets: Update Ontology(ies) document.

Attribute Assignments

As with concepts (classes), a similar process needs to be followed for new attributes (properties).

If the dataset includes attributes with literal values (such as names), then:

  • Check the existing ontology(ies) to see whether a datatype property already exists for that attribute
  • If there is none, a new datatype property will need to be created and then used.

If the dataset includes attributes with object values (such as references to internal or external records or classes), then:

  • Check the existing ontology(ies) to see whether an object property already exists for that attribute
  • If there is none, a new object property will need to be created and its range will need to be properly defined.
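The two branches above can be sketched as a small planning helper. This is an illustration under stated assumptions: treating absolute http(s) URIs as object values is a heuristic chosen for this example, not an OSF rule, and the `existing` lookup table stands in for the manual check of the ontology.

```python
from urllib.parse import urlparse


def looks_like_reference(value):
    """Heuristic (assumption): absolute http(s) URIs are references to records or classes."""
    parsed = urlparse(value.strip())
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)


def property_kind(values):
    """Suggest 'object' when every non-empty value is a reference, else 'datatype'."""
    non_empty = [v for v in values if v.strip()]
    if non_empty and all(looks_like_reference(v) for v in non_empty):
        return "object"
    return "datatype"


def plan_attribute(name, values, existing):
    """Plan one attribute.

    `existing` is a hypothetical dict mapping (kind, attribute name) to the URI
    of a property already defined in the ontology.
    """
    kind = property_kind(values)
    uri = existing.get((kind, name))
    return {
        "attribute": name,
        "kind": kind,
        "uri": uri,
        "needs_new_property": uri is None,
        # A new object property also needs its range defined.
        "needs_range": kind == "object" and uri is None,
    }
```

Mixed columns (some literals, some URIs) fall back to `datatype` here; in practice such attributes deserve a closer look before a property type is fixed.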

Controlled Vocabulary Listings

In some cases, the literals that serve as values for a given attribute tend to fall into a fixed list of items. Where such patterns occur, it is best to treat the possible attribute values as a controlled vocabulary list rather than a free-form literal (string).

This circumstance may warrant:

  1. Defining values within the ontology as a list
  2. Altering dataset authoring or conversion tools to work off of controlled vocabularies.
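One way to spot such patterns before deciding is to profile the distinct values of an attribute. The sketch below is a rough heuristic only; the thresholds `max_distinct` and `min_repeat_ratio` are arbitrary assumptions and should be tuned to the dataset at hand.

```python
from collections import Counter


def suggest_controlled_vocabulary(values, max_distinct=20, min_repeat_ratio=0.5):
    """Return the sorted distinct values if the attribute looks like a closed list, else None.

    The attribute qualifies when it has few distinct values and most of its
    non-empty values belong to a value that occurs more than once.
    """
    cleaned = [v.strip() for v in values if v and v.strip()]
    if not cleaned:
        return None
    counts = Counter(cleaned)
    repeated = sum(n for n in counts.values() if n > 1)
    if len(counts) <= max_distinct and repeated / len(cleaned) >= min_repeat_ratio:
        return sorted(counts)
    return None
```

A column of status values such as `open` and `closed` would be returned as a candidate list, while a column of unique names would not.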

Geo-enabling the Data

Geo-enabling datasets means adding geographical location specifications to the records indexed in a dataset. If the dataset has geography-related information, you must choose which of the following specifications to include in your dataset specification:

  • wsg84:lat and wsg84:long for single geographical points
  • sco:polylineCoordinates for line segments (multiple values allowed)
  • sco:polygonCoordinates for area outlines (bounded regions).

Precise specifications and examples are in the Geo-enabling Datasets document.
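As a rough illustration of the choice among the three specifications, the sketch below picks the property names listed above from the shape of a record's coordinates. The classification rule (one pair is a point, a closed ring of at least four pairs is a polygon, anything else is a polyline) is an assumption of this example, and the sketch deliberately does not serialize coordinate values, since their exact encoding is defined in the Geo-enabling Datasets document.

```python
# Property names are taken from the list above; everything else is hypothetical.
GEO_PROPERTIES = {
    "point": ["wsg84:lat", "wsg84:long"],
    "polyline": ["sco:polylineCoordinates"],
    "polygon": ["sco:polygonCoordinates"],
}


def geometry_kind(coords):
    """Classify a list of (lat, long) pairs as 'point', 'polyline' or 'polygon'."""
    if not coords:
        raise ValueError("at least one coordinate pair is required")
    if len(coords) == 1:
        return "point"
    # Assumption: a polygon is given as a closed ring (first pair repeated last).
    if len(coords) >= 4 and coords[0] == coords[-1]:
        return "polygon"
    return "polyline"


def properties_for(coords):
    """Return the property names to include for a record's coordinates."""
    return GEO_PROPERTIES[geometry_kind(coords)]
```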

Define Linkage Specification

The relation of the dataset specifications to the ontology specifications is provided via a linkage file. The simple linkage file relates each string value used for an attribute or concept name in the dataset to its URI (of a record, class or property) within the ontology. A linkage file may be written either as XML or as CSV (see the commON specification).

To create that linkage file:

  1. List all the new, unique values for records, types (classes) or attributes (properties) in the source dataset
  2. Link these values to a particular URI (a record, class or property) in the target ontology
    • (Remember: new objects may need to have been created in advance in the ontology for this purpose)
  3. This new linkage schema will then be used to get string values for these objects when the actual data import process takes place.
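The steps above can be sketched as follows. Note that the two-column layout (`source_name`, `uri`) is a simplified, hypothetical format used only for illustration; it is not the linkage layout defined by the commON specification, which should be consulted for the actual file structure.

```python
import csv
import io


def build_linkage(source_names, known_uris):
    """Link each unique source name to a URI.

    Returns (rows, unmapped): `rows` holds (name, uri) pairs, and `unmapped`
    lists names with no URI yet, which must first be added to the ontology.
    """
    rows, unmapped = [], []
    for name in dict.fromkeys(source_names):  # de-duplicate, keep first-seen order
        uri = known_uris.get(name)
        if uri is None:
            unmapped.append(name)
        else:
            rows.append((name, uri))
    return rows, unmapped


def to_csv(rows):
    """Serialize linkage rows to CSV text with a header line (hypothetical layout)."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["source_name", "uri"])
    writer.writerows(rows)
    return buffer.getvalue()
```

Keeping the unmapped names separate makes the reminder in step 2 explicit: those objects need to exist in the ontology before the linkage file is complete.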