Datasets: Prepare Internal (RDB) Datasets


Larger datasets, especially those drawn from existing databases, are often updated and often have related variants that also merit incorporation into the system as datasets. For these reasons, it is often useful to maintain scripts that can be modified or revised to capture the transformation pathway from source to OSF.

It is important that the steps outlined in Datasets: Define Dataset Specifications be completed before moving on to this step.

Overview

The combination of scripting and volume, plus the utility of an approach geared to relational databases, warrants a different slate of workflow steps than is common for a standard dataset. Though the specific example here is based on the Oracle RDBMS with the assistance of the Safe Software FME conversion/ETL tool, any conversion from an existing relational DB shares these same general steps:

[Figure: RDB dataset workflow]

Once the dataset(s) has been analyzed and slots for new objects (classes and properties) have been added to the existing ontology(ies), it is time to prepare the transformation process. A "view" file derived from the existing database table (schema) is created that provides all of the defined mappings between relational table variables and OSF object specifications. In addition, if controlled vocabularies or lists are the values for specific relational table attributes, these are separately converted into a file readable by the transformation tool, FME.

These two files are the inputs to the FME transformation tool. Once these input files are imported and joined in the FME tool, a series of steps, or transformations, may need to be applied to make the actual data values consistent. By using small subset extractions of the data, these specifications can be tested and then modified as needed until the transformation output is validated as suitable for import into OSF Web Service.

Setting the transformation rules and process in the FME tool is aided by a graphical user interface that makes it straightforward to map source-to-target paths as well as to swap in or out various transformation filters to achieve the conversion objectives consistently.

Individual filters within a transformation pathway are themselves reusable objects that may be applied to other source-to-target conversion pathways. The combinations of all of these specifications can also then be saved and reused as transformation templates. An example transformation template is available for inspection that shows this building block design.

Thus, one early step in the transformation process might be to import an existing template to use as the basis for refinement for the current dataset. Once these changes are made, the new script may then be saved on its own and used again for the same transformation or used as a template for a still newer transformation of another dataset.

Summary of Specific Steps

  1. Scope and define the target dataset
  2. Define the base dataset URI, which is used to generate all subsequent record URIs within the dataset
  3. Analyze the existing data tables and begin transformation scripts
  4. Update the existing ontology(ies); determine if required types (classes) already exist; if not, add them
  5. Create an empty FME XML Template
    • See below
  6. Define the type of each record
    • See below
  7. Add geographically related information, if applicable
    • See below
  8. Add other transformation factories, if applicable
    • See below
  9. Account for attributes with literal values (such as names)
    • See below
  10. Account for attributes with object values (all assignments besides strings)
    • See below
  11. Create the linkage file
    • See below
  12. Initially convert and inspect the FME XML template file
    • See below
  13. Import the converted data

Create the FME XML Template

Once the preliminaries of dataset definition, initial structure analysis, and simple transformations are complete (see Steps #1-4 above), it is time to begin working with the FME tool directly.

We will do so by creating an empty FME XML template. We will test this template with a single-record conversion until the process below is vetted and validated. The output of this template is an RDF+XML serialization.

We first begin by exporting some sample records from our source relational data table (Oracle, in this instance):

[Figure: Oracle view]

We then apply any of our simple pre-processing transformations against it (see Datasets: Analyze and Structure Requirements). This creates a starting text file of sample input data, against which we can design our FME templates.
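
For illustration, here is a minimal Python sketch of such a pre-processing step, assuming the exported sample is a tab-delimited text file; the file names and cleanup rules are hypothetical:

import csv

# Illustrative pre-processing of an exported sample file: trim stray
# whitespace and drop completely empty rows so the FME template can be
# designed against clean input.
def preprocess(src_path, dst_path):
    with open(src_path, newline="", encoding="utf-8") as src, \
         open(dst_path, "w", newline="", encoding="utf-8") as dst:
        reader = csv.reader(src, delimiter="\t")
        writer = csv.writer(dst, delimiter="\t")
        for row in reader:
            cleaned = [field.strip() for field in row]
            if any(cleaned):  # skip blank rows
                writer.writerow(cleaned)

preprocess("oracle_export.txt", "sample_input.txt")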

We start the FME Workbench and load this initial data file:

[Figure: FME main window]

FME Workbench is a graphical transformation workflow editor with which we create our transformation templates (or "factories"). A workflow comprises one or more pipeline steps, with each step involving some form of transformation. Pipeline steps may themselves be compound, combining a number of sub-steps for more complicated conversions.

Good design is to organize repeated conversions into their own templates, which can then be invoked as pipeline steps in other templates.
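
The same building-block idea can be mimicked in plain Python, purely to illustrate the design (all step and field names here are hypothetical): each filter is a small reusable function, and a "template" is just an ordered list of filters that can itself be embedded in larger pipelines.

def strip_fields(record):
    # remove stray whitespace from every field
    return {key: value.strip() for key, value in record.items()}

def default_missing_funding(record):
    # supply a default for a missing controlled-vocabulary value
    record.setdefault("FUNDING", "other")
    return record

def run_pipeline(records, steps):
    # apply each reusable step, in order, to every record
    for record in records:
        for step in steps:
            record = step(record)
        yield record

# the same step list can be reused as a sub-pipeline in other templates
basic_cleanup = [strip_fields, default_missing_funding]
cleaned = list(run_pipeline([{"SERVICE_NAME": " Grassroots News "}], basic_cleanup))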

As we work through the generic steps below, we will eventually create a complete FME Workbench template file, as this example template file shows.

Define Record Types

Due to our earlier analysis and resulting ontology updates, we are now ready to assign source values to these specifications.

Each desired field and record identifier in the source table (as they may have been redefined during the analysis and earlier transition steps) is appended to the end of the base dataset URI in order to compose the final, unique identifier for that field or record.

(Note: because of desired changes in case, underscores, or other transformations, a more complicated filter may need to be added to the FME workflow to generate the identifier at conversion time.)
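
A minimal Python sketch of this composition logic, using the base dataset URI from the samples below (the normalization rules shown are assumptions that will vary by project):

import re

BASE_URI = "http://SERVER_URL/datasets/Aboriginal_services/"

# Compose a record URI by normalizing the source identifier (spaces to
# underscores, stray characters removed) and appending it to the base
# dataset URI. The normalization rules here are illustrative only.
def record_uri(identifier):
    slug = identifier.strip().replace(" ", "_")
    slug = re.sub(r"[^A-Za-z0-9_-]", "", slug)
    return BASE_URI + slug

print(record_uri("15"))  # http://SERVER_URL/datasets/Aboriginal_services/15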

Add Geographically Related Information

A strength of FME as an ETL tool is its native ability to handle geographically related information. Thus, one common set of filters or transformations may be needed to prepare the target OSF requirements for geo-enabling datasets, specifically for:

  • wgs84_pos:lat and wgs84_pos:long for single geographical points
  • sco:polylineCoordinates for line outlines
  • sco:polygonCoordinates for area outlines.

These transformations are rather complicated on their own; see further the separate geo-enabling document.
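
While those details are deferred to the geo-enabling document, a minimal Python sketch can illustrate the shape of the target data; note that the pair-and-delimiter serialization used for sco:polylineCoordinates below is an assumption for illustration, not the confirmed OSF format:

# Shape source geometry into the geo-enabling properties named above.
def point_properties(x, y):
    # a single geographical point becomes a long/lat pair
    return {"wgs84_pos:long": str(x), "wgs84_pos:lat": str(y)}

def polyline_coordinates(points):
    # points: iterable of (long, lat) tuples; the delimiter choice is
    # hypothetical; consult the geo-enabling document for the exact format
    return " ".join("%s,%s" % (lon, lat) for lon, lat in points)

print(point_properties(-97.1326238490838, 49.9027129851638))
print(polyline_coordinates([(-97.13, 49.90), (-97.14, 49.88)]))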

Add Transformation 'Factories'

The earlier analysis step overviewed many of the possible transformations involved in moving from source data tables to target datasets. Some may be simple enough for pre-processing scripts to handle; others may be as complicated as geo-coordinate or geo-format conversions. Unit conversions, changes in data aggregation levels, or the extraction and use of controlled vocabularies may also be warranted.
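
A couple of these transformation kinds can be sketched in Python for illustration; the column names are hypothetical, though the funding terms echo the sample output later in this document:

FEET_PER_METRE = 3.280839895

def feet_to_metres(record):
    # a simple unit conversion on a hypothetical elevation column
    record["ELEVATION_M"] = float(record.pop("ELEVATION_FT")) / FEET_PER_METRE
    return record

# map free-text source values onto the controlled vocabulary terms
# added to the ontology, with a catch-all default
FUNDING_VOCAB = {
    "self supporting": "self supporting",
    "multi year": "multi-year",
    "multi-year": "multi-year",
}

def normalize_funding(record):
    key = record["FUNDING"].strip().lower()
    record["FUNDING"] = FUNDING_VOCAB.get(key, "other")
    return record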

The FME Workbench tools and interface are specifically designed to aid in creating workflows graphically to capture such transformations (for non-geo-enabled data, there are some excellent open source tools[1]). The result of working with the FME interface is the creation of the template files that are actually applied during the transformation process (see an example here).

Depending on the kind of transformation and its complexity, individual steps may vary. But in this series of screen captures, we highlight a few aspects of the FME interface that make development of these filters easier.

A multi-paned view, similar to other standard IDEs, enables us to pick the specific resources we want to work with. The main activity occurs in the upper-right design pane, where we place and link resources together into processing pipelines:

[Figure: FME transformer]

Once defined, we can then name the overall pipeline (template) and indicate where it is stored and to what files it applies. These same templates can themselves be building blocks in subsequent templates, enabling quite complicated overall pipelines to be built:

[Figure: FME transformer parameters]

For individual pipeline steps, we can also specify filters or other processing steps that might be applied, including warnings and messages to be embedded at time of processing:

[Figure: FME spatial join]

And, throughout, we can provide detailed notes and explanations for what each step does, as well as other annotations such as when a step was created and by whom.

See further the Example FME Workbench File for what eventually gets created with our example conversion.

The specifics of each step and how they are configured together obviously differ by project and by source and target data requirements. Nonetheless, FME and tools like it are an excellent means to codify and document these automated conversion pipelines.

Process Literal Attributes

When processing attributes with literal values (such as names), it is important to refer to existing datatype properties in the pipeline where they already exist, or to define them in advance in the ontology and then refer to these new names in the XML template.
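
As a rough Python sketch of what this mapping amounts to (outside of FME, and using the column-to-property pairs visible in the sample template below):

from xml.sax.saxutils import escape

# Literal (datatype) properties: the source column value is written as
# element text under an existing datatype property from the ontology.
LITERAL_MAP = {
    "SERVICE_NAME": "iron:prefLabel",
    "DESCRIPTION": "iron:description",
    "ADDRESS": "now:streetAddress",
}

def literal_elements(record):
    return ["<%s>%s</%s>" % (prop, escape(record[col]), prop)
            for col, prop in LITERAL_MAP.items() if record.get(col)]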

Process Object Attributes

When processing attributes with object values (such as references to internal or external records or classes), it is important to refer to existing object properties in the pipeline where they already exist, or to define them in advance in the ontology and then refer to these new names in the XML template.
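
The object-valued counterpart of the sketch above differs in one key way: the value is a URI reference, written as an rdf:resource attribute rather than as element text (pairs again taken from the sample template below):

from xml.sax.saxutils import quoteattr

# Object properties: the source column holds a URI, so the value is
# emitted as an rdf:resource attribute on an existing object property.
OBJECT_MAP = {
    "serviceCategory_RESOURCE": "now:serviceCategory",
    "neighbourhood_RESOURCE": "geo:locatedIn",
}

def object_elements(record):
    return ["<%s rdf:resource=%s />" % (prop, quoteattr(record[col]))
            for col, prop in OBJECT_MAP.items() if record.get(col)]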

Create Linkage File

A linkage file needs to accompany the dataset at time of import and creation. The linkage file relates the attribute names in the dataset being created to the specific object names (with full URIs) within the ontology. The linkage file may be created separately or included as a specification within the overall FME template.

The simple linkage file relates a string value (literal) found in the relational database to a URI (of a record or class) added earlier to the existing ontology(ies). To create the linkage file, a separate list of all unique values must be produced from the corresponding field in the relational database table; each unique value is then linked to a particular URI (a record or a class, as appropriate). If the target is a class, it should have been defined in advance in the ontology.
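
A minimal Python sketch of preparing that linkage data (the file layout and the slug rule mapping values onto ontology URIs are assumptions for illustration):

import csv

ONTOLOGY_BASE = "http://purl.org/ontology/now#"

def unique_values(path, column, delimiter="\t"):
    # collect the distinct values of a controlled-vocabulary column
    with open(path, newline="", encoding="utf-8") as f:
        return sorted({row[column] for row in csv.DictReader(f, delimiter=delimiter)
                       if row[column]})

def linkage_rows(values):
    # pair each value with the URI of the class added to the ontology,
    # e.g. "Education and skills" -> http://purl.org/ontology/now#Education_and_skills
    return [(value, ONTOLOGY_BASE + value.replace(" ", "_")) for value in values]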

Here is a sample input file that relates attributes:

<now:Aboriginal_services rdf:about="http://SERVER_URL/datasets/Aboriginal_services/">
  <ID>{fme:get-attribute("ID")}</ID>
  <iron:prefLabel>{fme:get-attribute("SERVICE_NAME")}</iron:prefLabel>
  <iron:description>{fme:get-attribute("DESCRIPTION")}</iron:description>
  <now:streetAddress>{fme:get-attribute("ADDRESS")}</now:streetAddress>
  <now:postalCode>{fme:get-attribute("POSTAL_CODE")}</now:postalCode>
  <now:phoneNumber>{fme:get-attribute("CONTACT")}</now:phoneNumber>
  <now:serviceCategory>{fme:get-attribute("serviceCategory_RESOURCE")}</now:serviceCategory>
  <now:servicePrimaryActivity>{fme:get-attribute("servicePrimaryActivity_RESOURCE")}</now:servicePrimaryActivity>
  <now:serviceTargetUser>{fme:get-attribute("serviceTargetUser_RESOURCE")}</now:serviceTargetUser>
  <now:serviceIntendedImpact>{fme:get-attribute("INTENDED_IMPACT")}</now:serviceIntendedImpact>
  <now:typeOfFunding>{fme:get-attribute("FUNDING")}</now:typeOfFunding>
  <now:affiliatedWith>{fme:get-attribute("AFFILIATION")}</now:affiliatedWith>
  <geo:locatedIn>{fme:get-attribute("neighbourhood_RESOURCE")}</geo:locatedIn>
  <wgs84_pos:long>{fme:get-attribute("_x")}</wgs84_pos:long>
  <wgs84_pos:lat>{fme:get-attribute("_y")}</wgs84_pos:lat>
  <wgs84_pos:alt>{fme:get-attribute("_z")}</wgs84_pos:alt>
  <sco:namedEntity>true</sco:namedEntity>
  <wsf:crudAction>{fme:get-attribute("deltatype")}</wsf:crudAction>
</now:Aboriginal_services>

Here is a sample input file that relates those attributes to their full namespaces:

<now:Aboriginal_services xmlns:now="http://purl.org/ontology/now#" rdf:about="http://SERVER_URL/datasets/Aboriginal_services/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
  <ID>{fme:get-attribute("ID")}</ID>
  <iron:prefLabel xmlns:iron="http://purl.org/ontology/iron#">{fme:get-attribute("SERVICE_NAME")}</iron:prefLabel>
  <iron:description xmlns:iron="http://purl.org/ontology/iron#">{fme:get-attribute("DESCRIPTION")}</iron:description>
  <now:streetAddress xmlns:now="http://purl.org/ontology/now#">{fme:get-attribute("ADDRESS")}</now:streetAddress>
  <now:postalCode xmlns:now="http://purl.org/ontology/now#">{fme:get-attribute("POSTAL_CODE")}</now:postalCode>
  <now:phoneNumber xmlns:now="http://purl.org/ontology/now#">{fme:get-attribute("CONTACT")}</now:phoneNumber>
  <now:serviceCategory xmlns:now="http://purl.org/ontology/now#">{fme:get-attribute("serviceCategory_RESOURCE")}</now:serviceCategory>
  <now:servicePrimaryActivity xmlns:now="http://purl.org/ontology/now#">{fme:get-attribute("servicePrimaryActivity_RESOURCE")}</now:servicePrimaryActivity>
  <now:serviceTargetUser xmlns:now="http://purl.org/ontology/now#">{fme:get-attribute("serviceTargetUser_RESOURCE")}</now:serviceTargetUser>
  <now:serviceIntendedImpact xmlns:now="http://purl.org/ontology/now#">{fme:get-attribute("INTENDED_IMPACT")}</now:serviceIntendedImpact>
  <now:typeOfFunding xmlns:now="http://purl.org/ontology/now#">{fme:get-attribute("FUNDING")}</now:typeOfFunding>
  <now:affiliatedWith xmlns:now="http://purl.org/ontology/now#">{fme:get-attribute("AFFILIATION")}</now:affiliatedWith>
  <geo:locatedIn xmlns:geo="http://www.geonames.org/ontology#">{fme:get-attribute("neighbourhood_RESOURCE")}</geo:locatedIn>
  <wgs84_pos:long xmlns:wgs84_pos="http://www.w3.org/2003/01/geo/wgs84_pos#">{fme:get-attribute("_x")}</wgs84_pos:long>
  <wgs84_pos:lat xmlns:wgs84_pos="http://www.w3.org/2003/01/geo/wgs84_pos#">{fme:get-attribute("_y")}</wgs84_pos:lat>
  <wgs84_pos:alt xmlns:wgs84_pos="http://www.w3.org/2003/01/geo/wgs84_pos#">{fme:get-attribute("_z")}</wgs84_pos:alt>
  <sco:namedEntity xmlns:sco="http://purl.org/ontology/sco#">true</sco:namedEntity>
  <wsf:crudAction xmlns:wsf="http://purl.org/ontology/wsf#">{fme:get-attribute("deltatype")}</wsf:crudAction>
</now:Aboriginal_services>

Test Convert and Inspect the Template File

Upon completion of the various pipeline steps (conversions and transformations), the FME XML template file is ready for testing. At this point, the records are properly defined in RDF+XML, and the relational database's table columns are properly mapped (via the linkage files) to the RDF datatype and object properties in the template.

Running this script against a small subset of records will enable quality checks to ensure that all conversions are processing properly. For example, here is an FME XML template output file, with only three records shown:

<?xml version="1.0" encoding="utf-8"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
xmlns:owl="http://www.w3.org/2002/07/owl#"
xmlns:foaf="http://xmlns.com/foaf/0.1/"
xmlns:now="http://purl.org/ontology/now#"
xmlns:sco="http://purl.org/ontology/sco#"
xmlns:iron="http://purl.org/ontology/iron#"
xmlns:wgs84_pos="http://www.w3.org/2003/01/geo/wgs84_pos#"
xmlns:geo="http://www.geonames.org/ontology#"
xmlns:wsf="http://purl.org/ontology/wsf#">
 
  <now:Aboriginal_services rdf:about="http://npidev.structureddynamics.com/datasets/Aboriginal_services/15">
    <iron:prefLabel>Grassroots News</iron:prefLabel>
    <iron:description>Aboriginal newspaper</iron:description>
    <now:streetAddress>150 Henry Ave</now:streetAddress>
    <now:postalCode>R3B 0J7</now:postalCode>
    <now:phoneNumber>589-7495</now:phoneNumber>
    <now:serviceCategory rdf:resource="http://purl.org/ontology/now#Culture_and_creativity" />
    <now:serviceCategory rdf:resource="http://purl.org/ontology/now#Heritage" />
    <now:servicePrimaryActivity rdf:resource="http://purl.org/ontology/now#Printing_services" />
    <now:serviceTargetUser rdf:resource="http://purl.org/ontology/now#People" />
    <now:serviceIntendedImpact>strength through
    knowledge</now:serviceIntendedImpact>
    <now:typeOfFunding>self supporting</now:typeOfFunding>
    <now:affiliatedWith>Grassroots News</now:affiliatedWith>
    <geo:locatedIn rdf:resource="http://npidev.structureddynamics.com/datasets/neighbourhoods/1022" />
    <wgs84_pos:long>-97.1326238490838</wgs84_pos:long>
    <wgs84_pos:lat>49.9027129851638</wgs84_pos:lat>
    <wgs84_pos:alt>0.0</wgs84_pos:alt>
    <sco:namedEntity>true</sco:namedEntity>
    <wsf:crudAction>create</wsf:crudAction>
  </now:Aboriginal_services>
 
  <now:Aboriginal_services rdf:about="http://npidev.structureddynamics.com/datasets/Aboriginal_services/198">
    <iron:prefLabel>Yellowquill College</iron:prefLabel>
    <iron:description>Aboriginal college providing post-secondary
    programs</iron:description>
    <now:streetAddress>340 Assiniboine Ave</now:streetAddress>
    <now:postalCode>R3C 0Y1</now:postalCode>
    <now:phoneNumber>953-2800</now:phoneNumber>
    <now:serviceCategory rdf:resource="http://purl.org/ontology/now#Education_and_skills" />
    <now:serviceTargetUser rdf:resource="http://purl.org/ontology/now#People" />
    <now:serviceIntendedImpact>completion of post-secondary
    studies</now:serviceIntendedImpact>
    <now:typeOfFunding>multi-year</now:typeOfFunding>
    <now:affiliatedWith>Yellowquill College</now:affiliatedWith>
    <geo:locatedIn rdf:resource="http://npidev.structureddynamics.com/datasets/neighbourhoods/116" />
    <wgs84_pos:long>-97.1390879713711</wgs84_pos:long>
    <wgs84_pos:lat>49.8847406214701</wgs84_pos:lat>
    <wgs84_pos:alt>0.0</wgs84_pos:alt>
    <sco:namedEntity>true</sco:namedEntity>
    <wsf:crudAction>create</wsf:crudAction>
  </now:Aboriginal_services>
 
  <now:Aboriginal_services rdf:about="http://npidev.structureddynamics.com/datasets/Aboriginal_services/235">
    <iron:prefLabel>Native Studies Department</iron:prefLabel>
    <iron:description>courses towards a degree in Native
    Studies</iron:description>
    <now:streetAddress>183 Dafoe Rd</now:streetAddress>
    <now:postalCode>R3T 2N2</now:postalCode>
    <now:phoneNumber>474-6333</now:phoneNumber>
    <now:serviceCategory rdf:resource="http://purl.org/ontology/now#Education_and_skills" />
    <now:serviceTargetUser rdf:resource="http://purl.org/ontology/now#People" />
    <now:serviceIntendedImpact>completion of post-secondary
    studies</now:serviceIntendedImpact>
    <now:typeOfFunding>other</now:typeOfFunding>
    <now:affiliatedWith>University of Manitoba</now:affiliatedWith>
    <geo:locatedIn rdf:resource="http://npidev.structureddynamics.com/datasets/neighbourhoods/1670" />
    <wgs84_pos:long>-97.1318156237263</wgs84_pos:long>
    <wgs84_pos:lat>49.8093893254848</wgs84_pos:lat>
    <wgs84_pos:alt>0.0</wgs84_pos:alt>
    <sco:namedEntity>true</sco:namedEntity>
    <wsf:crudAction>create</wsf:crudAction>
  </now:Aboriginal_services>
</rdf:RDF>

It may be necessary to revise the Workbench template to resolve any errors found. Repeat the testing as above until a clean transformation pathway is validated. Make sure to save the final XML template for reuse in repeat transformations or as a starting point for other migrations.
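
Such quality checks can be partially automated. Here is a minimal Python sketch that verifies the test output is well-formed RDF+XML and that each record carries a URI and a preferred label (the required-property list and the file name are illustrative):

import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
IRON = "http://purl.org/ontology/iron#"

def check_output(path):
    # parsing alone confirms the output is well-formed XML
    root = ET.parse(path).getroot()
    problems = []
    for record in root:
        about = record.get("{%s}about" % RDF)
        if not about:
            problems.append("record missing rdf:about")
        elif record.find("{%s}prefLabel" % IRON) is None:
            problems.append(about + ": missing iron:prefLabel")
    return problems

print(check_output("test_output.xml") or "all records passed")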

Import the Converted Data

With these steps complete, it is now time to actually import the data. See the Datasets: Import Dataset Files document.

  1. See the Pentaho and Talend Open Studio open source ETL options, among others.