Datasets: Prepare Small (commON) Datasets

Small datasets are frequently needed to collate related records and information. These datasets may originate as spreadsheets or may use spreadsheets as the compilation medium. In these circumstances, the commON dataset notation is the preferred approach.

Overview

A general workflow for preparing a commON dataset is shown below:

[Figure: commON dataset workflow]

The workflow assumes that the dataset specifications and ontology update steps have already been done.

The columns in the spreadsheet are assigned the attribute names, with the standard commON & (ampersand) prefix. All types, attributes and object values are then also assigned their appropriate URIs; these mappings are appended as the "linkage" specification at the bottom of the spreadsheet.

Each row in the instance record portion of the spreadsheet (as denoted by the &&recordList convention) is an individual record entry. The actual values entered into these cells must conform to the commON dataset notation.

At the conclusion of the instance record portion it is then necessary to list the &&linkage specifications, particularly the &mapTo requirements for relating the object names above to their specific URIs (again, see the commON dataset specifications).
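
To make the overall layout concrete, here is a skeletal sketch of how these portions sit together in a single CSV file. It uses only the conventions named above (the &&dataset, &&recordList and &&linkage section markers, the single & attribute prefix, and &mapTo); the attribute names and values are hypothetical, and the exact cell layout should be confirmed against the commON dataset specifications:

  &&dataset
  (dataset metadata: base dataset URI, &prefLabel, &description, ...)

  &&recordList
  &id,&type,&prefLabel,&streetType
  "1","Street","Main Street","avenue"
  "2","Street","Oak Court","court"

  &&linkage
  (&mapTo entries relating the &type and attribute names above to their full URIs)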

Some sample commON formats are available that show how these datasets are constructed.

If the information in these source spreadsheets is to be updated periodically or used for general tracking purposes, you may also want to style the spreadsheet or use validation fields for controlled vocabulary entries. Other spreadsheet checks can also be applied to this source.

There is also a commON case study available that may provide further guidance.

Summary of Specific Steps

  1. The first step is to create an empty spreadsheet, which we will populate as outlined below. The output of this spreadsheet is a CSV serialization according to the commON specifications
  2. We next create the &&dataset specification, which includes:
    • A base dataset URI; this is the base URI used to generate all subsequent record URIs within the dataset
    • A &prefLabel (title) for naming the dataset
    • A &description that provides a short characterization of the dataset
    • Other dataset metadata as appropriate and provided for in the specifications
  3. We next create the &&recordList portion of the specification, which is the container for all record data and generally the largest portion of the dataset:
    • All object specifications follow the single & prefix according to the naming conventions in the specifications
  4. Each individual record is placed on a new row; each record must be given a unique &id:
    • This specific field from the dataset is used as the unique identifier of the record in the dataset; this unique identifier is appended to the end of the base dataset URI to compose the final, unique, identifier for that record (URI)
  5. Each individual record is assigned a &type (class, type or kind):
    • The existing ontology(ies) need to be checked to see if the type (class) already exists for this kind of record (using structOntology or Protégé). If it does exist, then that existing URI is put into the linkage section (see below; see further the define dataset specification)
    • In some cases, it may be possible that different types will be used to describe different kinds of records in the same dataset. This may happen, for example, when a characteristic determines the type of the record. One example is a streets dataset, where multiple street types (such as court, lane, road, avenue or street) all need to be mapped to a single "street" type. In this case, the "linkage file" procedure described under #10 below may need to be used
      • If the type doesn't exist, it must be created in the existing ontology(ies) prior to loading the dataset (see Datasets:_Update_Ontology(ies)), and then it should be used in the linkage section
  6. Each record attribute is provided in its own column, following the initial-lowercase CamelCase format (e.g., &newAttribute)
  7. If the dataset has geographically related information, the appropriate attribute type should be put into its own attribute (column) (see further Geo-enabling Datasets for the specific steps to add geographical location specifications to records in a dataset). Depending on type, these attributes must then be mapped in the later linkage section to:
    • wsg84:lat and wsg84:long for single geographical points
    • sco:polylineCoordinates for line outlines
    • sco:polygonCoordinates for area outlines
  8. If the dataset includes attributes with literal values (such as names), then you should:
    • Check in the existing ontology(ies) to see if a datatype property already exists for that attribute
    • If there is none, then a new datatype property will need to be created and used in the dataset
  9. If the dataset includes attributes with object values (such as references to internal, or external records or classes), then you should:
    • Check in the existing ontology(ies) to see if an object property already exists for that attribute; reference the target record using the @ prefix convention
    • If there is none, then a new object property will need to be created and its range will need to be properly defined
  10. Then, create a "linkage" file or add a &&linkage section to the CSV file. Information in this &&linkage section relates the object values in the &&recordList section (that is, those objects prefixed by either a single & or @) to the full URIs of their objects (as defined in the ontology(ies))
    • All such object values MUST be mapped to their corresponding full URIs
    • In the step above, you may have to create new classes or properties in the appropriate ontology(ies) prior to loading (importing) the dataset
  11. The dataset file is now complete. At this point, the records are properly defined in CSV, and the dataset's attribute columns, records and types have been properly mapped to the classes and properties in the system ontologies (a schematic example of a completed file follows this list)
  12. The next step is to actually load (import) the dataset (see the Datasets: Import Dataset Files document)
  13. If you get errors or notice that attributes are not defined, either update the ontology or fix the attribute names or &&linkage specifications. Repeat as necessary until there are no errors.
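
As a consolidated illustration of the steps above, the following purely hypothetical sketch shows a completed file: a &&dataset section carrying the base URI, &prefLabel and &description; a &&recordList section whose records each have an &id, a &type, literal attributes and point coordinates; and a &&linkage section using &mapTo to relate those names to full URIs. Every attribute name, value and URI here is an example only (the lat/long URIs shown are the W3C WGS84 vocabulary, assumed as the target of the wsg84: prefix in step 7), and the precise row layout of each section, including the &attributeList and &typeList header rows shown, should be verified against the commON dataset specifications and the sample commON formats:

  &&dataset
  &id,&prefLabel,&description
  "http://example.org/datasets/streets/","Streets (sample)","A small sample dataset of street records"

  &&recordList
  &id,&type,&prefLabel,&streetType,&lat,&long
  "1","Street","Main Street","avenue","46.7251","-92.1005"
  "2","Street","Oak Court","court","46.7302","-92.0958"

  &&linkage
  &attributeList,&mapTo
  prefLabel,http://www.w3.org/2004/02/skos/core#prefLabel
  streetType,http://example.org/ontology/streetType
  lat,http://www.w3.org/2003/01/geo/wgs84_pos#lat
  long,http://www.w3.org/2003/01/geo/wgs84_pos#long
  &typeList,&mapTo
  Street,http://example.org/ontology/Street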

Leveraging Spreadsheet Options

Because spreadsheets can be exported as CSV files, you can also use the full capabilities of spreadsheets during data entry or validation. For example, you can use the facilities of a spreadsheet for:

  • controlled vocabularies - providing set lists of allowable literal assignments using the spreadsheet Validation functions
  • validation - other checks on valid entries such as ranges, counts, etc.
  • styling - highlighting new entries or missing ones, or bolding totals and so forth for easier reading and management, or
  • cross-checks - doing things like column or row counts to ensure complete data entries or other cross-checks.

Then, when done, the spreadsheet can be exported as CSV for actual dataset import into the OSF instance.

More explanation of how to leverage a spreadsheet is offered in the commON case study.