Datasets Workflow

Important: This page needs to be updated to OSF v. 3.0; terminology and perhaps images are out of date.

Creating datasets and integrating them into the ontology basis of an installation is one of the central activities of the OSF system. This document provides an overview of the overall process; most of the subsections point to specific details in other articles. You may also want to review the document on basic dataset management and terminology; there is also a glossary of dataset and related semantic technologies terms.

The dataset workflow embraces the individual workflows of small and large (internal) datasets, as well as ontologies, as the common workflow diagram highlights:

[Figure: common dataset workflow diagram]

To understand this workflow in more detail, we first provide a general overview and then attend to the specific parts. Alternatively, you can try jumping ahead by using the Quick Start with New Data.

Workflow Overview

The dataset workflow begins by identifying a suitable candidate set of records, or dataset. The structure and description of this candidate is then analyzed for how to properly characterize the dataset in OSF terms and requirements. A key aspect of this analysis is to match the new dataset structure with the existing ontology(ies) in the system:

[Figure: overall dataset workflow]

If structure and characterizations are available within the existing ontology(ies), then these resource names and URIs are used in subsequent conversion steps; if such structure is not available, then additions must be made to the ontology(ies). This forms a separate set of bypass steps in the workflow. However, for proper importation of the new dataset into the system, these ontology resource placeholders must already be in place.

Once analyzed, the dataset preparation and conversion may take one of two paths. For smaller datasets or datasets supplied by external parties, one of the irON notations is the suitable vehicle. In the sample workflow below, we use the commON notation for our example.

For larger datasets or those arising from a local relational database management system (RDBMS), a different set of conversions and pathways is followed. In the example cases herein, we use an Oracle RDBMS as our exemplar, with the assistance of the Safe Software FME conversion/ETL tool.

Once properly converted, the datasets must then be imported into the system. (There is also an automated side branch that involves updates, most often scheduled as cron submissions of updated data.) The system administrator must also determine access rights for other users to the new dataset.

Optionally, the system administrator may also choose to create various display templates or other visualization modifications so as to better show the new types of data in the new dataset. Such modifications may also require some minor modification to the administrative ontologies used by the system, since OSF is premised on the idea of an ontology-driven application.

Once all of these steps are completed, the new dataset is ready for use and integration within the broader OSF framework and OSF-Drupal.

Identify the Dataset

The first step in any dataset workflow is to identify the candidate dataset. Clearly, this must be a set of data or records of interest to the enterprise. Additional guidance on how to bound and package these datasets is provided on the Datasets: Identify page.

Analyze the Dataset

(NOTE: Need Toad screen captures and other assistance from Brian.)

These steps in the dataset workflow process are geared to analyzing and understanding the structure of the source data. There is a set of generic questions that must be answered irrespective of the source of the data:

  • Should the data be classified or organized into one or more datasets? Guidance for this question comes from the Datasets: Identify document
  • For the dataset(s), what is the schema or structure of the data? The tools for how this is analyzed may differ by the type of source data (for example, a relational data table vs. a spreadsheet)
  • Does the dataset(s) contain geographical information? If so, the analysis and set-up paths differ depending on whether that information represents:
    • Single geographical points (as might be represented by a marker or thumbtack on a map)
    • Polylines (routes or roads or paths)
    • Polygons (bounded areas or regions)
  • Does the dataset(s) include literal (string) values?
  • Does the dataset(s) characterize things via lists or controlled vocabularies?
  • Does the dataset(s) include reference to other records?

Depending on the answers to these questions, set-up and conversion approaches will differ. Further, the tools for answering these questions may differ by the formalism of the data source.
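
For a simple delimited export, many of these questions can be answered with a quick profiling pass over the source. The following is a minimal sketch using the Python standard library; the file name and the column heuristics (numeric detection, geographic column names, low-cardinality controlled vocabularies) are illustrative assumptions rather than OSF requirements.

  import csv
  from collections import Counter

  # Hypothetical source file; substitute the actual export being analyzed.
  SOURCE = "candidate_dataset.csv"

  with open(SOURCE, newline="", encoding="utf-8") as f:
      reader = csv.DictReader(f)
      fields = reader.fieldnames or []
      rows = list(reader)

  print("Schema (columns):", fields)
  print("Record count:", len(rows))

  for field in fields:
      values = [row[field] for row in rows if row[field]]
      distinct = set(values)

      # Crude heuristics for the questions above: numeric-looking columns,
      # likely geographic coordinates, and low-cardinality columns that may
      # indicate a controlled vocabulary.
      numeric = bool(values) and all(
          v.replace("-", "", 1).replace(".", "", 1).isdigit() for v in values
      )
      looks_geo = field.lower() in {"lat", "latitude", "long", "lng", "longitude"}
      controlled = len(values) > 20 and len(distinct) <= 20

      print(f"{field}: distinct={len(distinct)}, numeric={numeric}, "
            f"geographic?={looks_geo}, controlled vocabulary?={controlled}")
      print("  most common values:", Counter(values).most_common(5))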

More detail on these dataset workflow steps is provided in the Datasets: Analyze and Structure Requirements document.

Determine Attribute and Concept Requirements

The natural follow-on to this analysis of the dataset(s) is to define the actual dataset specifications. The specifications must be expressed in terms of properties and classes recognizable by the system. As a result, these steps need to be carried out in close consultation with the existing ontology(ies) available to the system. The major tool to assist these workflow steps is structOntology (though Protégé or a similar ontology editing platform may also be used).

In general terms, the steps in this portion of the dataset workflow process are geared to achieve these specifications:

  • What are the kinds or types ("classes") within the dataset(s)? Do definitions of these classes already exist in the system? If so, assign their existing class name (URI); if not, flag for addition to the existing ontology(ies) (multiple types in the source data may also be mapped to a single class, depending on the analysis) (see next sequence of steps)
  • What are the relations or attributes ("properties") within the dataset(s)? Do definitions of these properties already exist in the system? If so, assign their existing property name (URI); if not, flag for addition to the existing ontology(ies) (see next sequence of steps)
  • Does the dataset have geographically related information? If so, particular specifications need to be defined depending on whether the dataset has point, polyline or polygon characteristics
  • Do any of the attributes refer to object values (such as references to internal or external records or classes)? If so, assign their existing class name (URI); if not, flag for addition to the existing ontology(ies) (see next sequence of steps).

More detail on these dataset workflow steps is provided in the Datasets: Define Dataset Specifications document.
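
In practical terms, the outcome of these specification steps is a mapping from source types and fields to ontology URIs, with gaps flagged for the ontology update described next. The sketch below records such a mapping in a simple Python structure; all of the field names and URIs are hypothetical placeholders rather than prescribed OSF identifiers.

  # Hypothetical mapping from source types/fields to ontology URIs.
  # A value of None flags an item that must be added to the existing
  # ontology(ies) before the dataset can be imported.
  ONTOLOGY_BASE = "http://example.org/ontology/"   # placeholder namespace

  class_map = {
      "park":  ONTOLOGY_BASE + "Park",
      "trail": None,                                # flag for ontology update
  }

  property_map = {
      "name":      "http://purl.org/dc/terms/title",
      "surface":   None,                            # flag for ontology update
      "latitude":  "http://www.w3.org/2003/01/geo/wgs84_pos#lat",
      "longitude": "http://www.w3.org/2003/01/geo/wgs84_pos#long",
  }

  missing = [name for name, uri in {**class_map, **property_map}.items() if uri is None]
  if missing:
      print("Flag for addition to the ontology(ies):", ", ".join(missing))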

Update the Ontology(ies)

Any properties or classes needed by the dataset structural analysis as noted above, but which are not yet in the system, need to be added to the existing ontology(ies). Like the inspection steps above, this updating of the ontology(ies) is based on the structOntology tool (though, as before, Protégé or a similar ontology editing platform may also be used). Specific use instructions are provided in the Individual OSF-Drupal Ontology (structOntology) Tool manual.

Updates to the existing ontologies may occur under any of these circumstances:

  • A new understanding of the domain, which requires extension or enhancement of the existing structure
  • Adding a new, local dataset
  • Updating or expanding a local dataset, or
  • Incorporating a remote dataset accessible via your OSF Web Service network.

In any case, the same workflow steps apply.

The basic process of updating an existing ontology has these steps:

  • Define the new class or property; make sure to provide a prefLabel for the object, add as many altLabels as applicable and useful, and define the object with a textual description sufficient to bound and scope the new object
  • Define the relationships of this new object to other classes or properties, and
  • Periodically test your updated ontology for logic consistency using a reasoner.

Specific steps and guidance for this portion of the dataset workflow are provided by the Datasets: Update Ontology(ies) document.
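
While structOntology or Protégé would normally make these additions through their editing interfaces, the following minimal sketch shows what such an addition amounts to at the RDF level, using the rdflib Python library; the namespace, class names, labels and definition text are hypothetical placeholders.

  from rdflib import Graph, Literal, Namespace
  from rdflib.namespace import OWL, RDF, RDFS, SKOS

  EX = Namespace("http://example.org/ontology/")   # placeholder namespace

  g = Graph()
  g.bind("ex", EX)
  g.bind("skos", SKOS)

  # Define the new class with a preferred label, an alternative label, and a
  # textual definition sufficient to bound and scope it.
  g.add((EX.Trail, RDF.type, OWL.Class))
  g.add((EX.Trail, SKOS.prefLabel, Literal("Trail", lang="en")))
  g.add((EX.Trail, SKOS.altLabel, Literal("Path", lang="en")))
  g.add((EX.Trail, SKOS.definition,
         Literal("A marked route intended for walking or cycling.", lang="en")))

  # Relate the new class to the existing structure.
  g.add((EX.Trail, RDFS.subClassOf, EX.RecreationalFeature))

  print(g.serialize(format="turtle"))

Consistency testing with a reasoner, per the last step above, is carried out in the ontology editing environment rather than in a snippet like this.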

Prepare/Convert the Dataset

For the most part, the dataset workflow steps above have been generic, and not related specifically to the kind of source data being brought into the system. This portion of the documentation deals with specific steps for specific kinds of sources.

There are basically two paths for getting structured data into the system. The first, involving (generally) smaller datasets, is the manual conversion of the source data to one of the pre-configured OSF import formats of RDF, JSON, XML or CSV. These are based on the irON notation; a good case study for using spreadsheets is also available.

The second path (bottom branch) is the conversion of internal structured data, often from a relational data store. Various converters and templates are available for these transformations. One excellent tool is FME from Safe Software (used in the example shown, which utilizes a spatial data infrastructure (SDI) data store), though a very large number of options exist for extract, transform and load (ETL).

In the latter case, procedures for polling for updates, triggering notice of updates, and only extracting the deltas for the specific information changed can help reduce network traffic and upload/conversion/indexing times.
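
One common pattern for extracting only the deltas is to keep a watermark of the last successful extraction and select just the rows modified since then. The sketch below illustrates the pattern with Python's built-in sqlite3 module and a hypothetical records table with an updated_at column; an Oracle source would be queried through its own driver, but the logic is the same.

  import sqlite3
  from datetime import datetime, timezone

  # In-memory stand-in for the source database; a real deployment would read
  # from the Oracle (or other) RDBMS through its own driver, but the
  # watermark-and-delta pattern is the same.
  conn = sqlite3.connect(":memory:")
  conn.execute("CREATE TABLE records (id INTEGER, name TEXT, updated_at TEXT)")
  conn.executemany(
      "INSERT INTO records VALUES (?, ?, ?)",
      [(1, "Central Park", "2023-01-05T10:00:00+00:00"),
       (2, "Riverside Trail", "2023-02-11T16:30:00+00:00")],
  )

  def extract_deltas(conn, since):
      """Select only the records changed since the previous extraction."""
      cur = conn.execute(
          "SELECT id, name, updated_at FROM records WHERE updated_at > ?",
          (since,),
      )
      return cur.fetchall()

  # Watermark from the last successful run; it should be persisted (and only
  # advanced) after the downstream conversion and import complete successfully.
  watermark = "2023-02-01T00:00:00+00:00"
  changed = extract_deltas(conn, watermark)
  print(f"{len(changed)} changed record(s) since {watermark}")

  next_watermark = datetime.now(timezone.utc).isoformat()
  print("Next watermark:", next_watermark)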

[Figure: structured data conversion paths]

The information below contrasts two use cases that capture the spectrum of these structured data possibilities. The first example is for smaller datasets using a spreadsheet (commON notation) basis. The second example is based on the transfer of multiple, large datasets from an existing Oracle relational database management system with spatial capabilities.

Small (commON) Datasets

Frequently small datasets are desired to collate related records and information. These small datasets may originate as spreadsheets or may use spreadsheets as the compilation medium. Under these circumstances, use of the commON dataset notation is the preferred approach.

A general workflow for how to prepare a commON dataset is shown as follows:

[Figure: commON dataset workflow]

The workflow assumes that the dataset specifications and ontology update steps (see above) have already been done.

The columns in the spreadsheet are assigned the attribute names, with the standard commON & (ampersand) prefix. Then, all types, attributes and values are also assigned the appropriate URI names. These are appended as the "linkage" specification at the bottom of the spreadsheet.

Each row in the instance record portion of the spreadsheet (as denoted by the &&recordList convention) gets an individual record entry. Actual values entered into these cells must conform to the commON dataset notations.

At the conclusion of the instance record portion it is then necessary to list the &&linkage specifications, particularly the &mapTo requirements for relating the object names above to their specific URIs (again, see the commON dataset specifications).
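
To make this layout concrete, the sketch below writes a schematic commON-style spreadsheet (as CSV) with an &&recordList section of &-prefixed attribute columns followed by an &&linkage section of &mapTo assignments. The attribute names and URIs are hypothetical, and the precise syntax of each section should be confirmed against the commON specifications referenced above rather than taken from this sketch.

  import csv

  # Schematic only: the attribute names and URIs are hypothetical, and the
  # exact cell-level layout of each section should be taken from the commON
  # specification rather than from this sketch.
  records = [
      {"&id": "park-1", "&type": "Park", "&prefLabel": "Central Park"},
      {"&id": "park-2", "&type": "Park", "&prefLabel": "Riverside Park"},
  ]
  linkage = {
      "type":      "http://example.org/ontology/Park",
      "prefLabel": "http://www.w3.org/2004/02/skos/core#prefLabel",
  }

  with open("common_dataset.csv", "w", newline="", encoding="utf-8") as f:
      writer = csv.writer(f)

      # Instance record section, with &-prefixed attribute names as headers.
      writer.writerow(["&&recordList"])
      writer.writerow(list(records[0].keys()))
      for record in records:
          writer.writerow(list(record.values()))

      # Linkage section relating local names to their URIs via &mapTo.
      writer.writerow(["&&linkage"])
      for local_name, uri in linkage.items():
          writer.writerow(["&mapTo", local_name, uri])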

Some sample commON formats are available as examples of how these datasets are constructed.

If the information in these source spreadsheets is to be updated on a periodic basis or used for generalized tracking purposes, you may also want to style the spreadsheet or use validation fields for controlled vocabulary entries. Other spreadsheet tests can also be applied to this source.
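
One such test, sketched below with hypothetical column values and a hypothetical controlled vocabulary, is to confirm that every entry in a controlled-vocabulary column belongs to the allowed list before the spreadsheet is converted.

  # Hypothetical controlled vocabulary and sample column values.
  ALLOWED_SURFACES = {"paved", "gravel", "dirt"}

  surface_column = ["paved", "gravel", "Dirt ", "boardwalk"]

  for row_number, value in enumerate(surface_column, start=2):   # row 1 is the header
      normalized = value.strip().lower()
      if normalized not in ALLOWED_SURFACES:
          print(f"Row {row_number}: '{value}' is not in the controlled vocabulary")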

For an explanation of the detailed steps in this workflow, see Datasets: Prepare Small (commON) Datasets. There is also a commON case study available that may provide further guidance.

Internal (RDB) Datasets

Larger datasets, especially those from existing databases, are often updated and often have related variants that also deserve to be incorporated into the system as datasets. For these reasons, it is often useful to have various scripts that can be modified or revised in order to capture the transformation pathway from source to OSF.

Initial Dataset Load

(NOTE: Need substantial screen captures and other assistance from Brian.)

The combination of scripting and volume, plus the utility of an approach geared to relational databases, warrants a different slate of workflow steps than what might be applied to a standard dataset (see above). Though the specific example here is based on the Oracle RDBMS with the assistance of the Safe Software FME conversion/ETL tool, any conversion from an existing relational DB shares these similar steps:

[Figure: RDB dataset workflow]

Once the dataset(s) has been analyzed and slots for new objects (classes and properties) have been added to the existing ontology(ies), it is time to prepare the transformation process. A "view" file derived from the existing database table (schema) is created that provides all of the defined mappings between relational table variables and OSF object specifications. In addition, if controlled vocabularies or lists are the values for specific relational table attributes, these are separately converted into a file readable by the transformation tool, FME.

These two files are the input bases to the FME transformation tool. Once these input files are imported and joined in the FME tool, there is a series of steps, or transformations, that may need to be applied to make actual data values consistent. By using small subset extractions of the data, these specifications can be tested and then modified as needed until the transformation output is validated as suitable for import into OSF Web Service.

Setting the transformation rules and process in the FME tool is aided by a graphical user interface that makes it straightforward to map source-to-target paths as well as to swap in or out various transformation filters to achieve the conversion objectives consistently.

Individual filters within a transformation pathway are themselves reusable objects that may be applied to other source-to-target conversion pathways. The combinations of all of these specifications can also then be saved and reused as transformation templates. An example transformation template is available for inspection that shows this building block design.

Thus, one early step in the transformation process might be to import an existing template to use as the basis for refinement for the current dataset. Once these changes are made, the new script may then be saved on its own and used again for the same transformation or used as a template for a still newer transformation of another dataset.

For details on the workflow steps, see Datasets: Prepare Internal (RDB) Datasets.

Updating Datasets

(NOTE: Need substantial screen captures and other assistance from Brian.)

To be completed.

For details on the workflow steps, see Datasets: Update Internal (RDB) Datasets.

Import Dataset Files

Once the dataset is prepared, it must then be imported into the framework via OSF Web Service. There are multiple ways to import datasets into an OSF Web Service node, which are explained in the Datasets: Import Dataset Files document. It is also possible to append to a dataset when updates are relatively minor for smaller- to medium-sized datasets.

Assign Access Rights

Recall that one criterion for bounding a dataset is its intended uses and rights by various groups (see Datasets: Identify for example). Upon import, it is time to make those assignments according to the Datasets: Assign Rights document, which itself implements the more general considerations laid out in the Datasets and Access Rights (OSF Web Service) document.
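
Conceptually, these assignments form a small matrix of datasets, groups and permissions, often expressed as Create/Read/Update/Delete flags. The sketch below simply records such a matrix under hypothetical group names and a hypothetical dataset URI; the actual registration is carried out through the mechanisms described in the Datasets: Assign Rights and Datasets and Access Rights (OSF Web Service) documents.

  # Hypothetical dataset URI and groups; each group is given a set of
  # Create / Read / Update / Delete flags for this dataset.
  DATASET_URI = "http://example.org/datasets/parks/"

  access_rights = {
      "administrators": {"create": True,  "read": True, "update": True,  "delete": True},
      "curators":       {"create": True,  "read": True, "update": True,  "delete": False},
      "public":         {"create": False, "read": True, "update": False, "delete": False},
  }

  for group, permissions in access_rights.items():
      granted = "".join(flag[0].upper() for flag, allowed in permissions.items() if allowed)
      print(f"{DATASET_URI} -> {group}: {granted or 'no access'}")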

Structure Results Displays and Pages

The incorporation of new datasets often means new structures, attributes and perspectives. There are a variety of modifications and tailorings that might be done to present the new data in tabular, visualization or widget form.

The entire topic of structure results displays and pages is covered under a separate workflow section, Configuration Workflow, especially under the layouts, templates and component theming sub-sections.

Use the Data

There is much written in many places throughout this wiki on how to use the data once loaded. A good place to start is Using Datasets in OSF-Drupal, which provides basic steps for how to add and then integrate a new dataset into your local instance using various Open Semantic Framework tools.