Datasets: Analyze and Structure Requirements


These steps in the dataset workflow process are geared toward analyzing and understanding the structure of the source data. A set of generic questions must be answered irrespective of the source of the data:

  • Should the data be classified or organized into one or more datasets? Guidance for this question comes from the Datasets: Identify document
  • For the dataset(s), what is the schema or structure of the data? The tools for analyzing this may differ by the type of source data (for example, a relational data table vs. a spreadsheet)
  • Does the dataset(s) contain geographical information? If so, there are different analysis and set-up paths depending on whether that information represents:
    • Single geographical points (as might be represented by a marker or thumbtack on a map)
    • Polylines (routes or roads or paths)
    • Polygons (bounded areas or regions)
  • Does the dataset include literal (string) values?
  • Does the dataset characterize things via lists or controlled vocabularies?
  • Does the dataset include references to other records?

Depending on the answers to these questions, set-up and conversion approaches will differ. Further, the tools for answering these questions may differ by the formalism of the data source.

Identify Datasets

The considerations for what makes for a good dataset are discussed in a separate workflow document. One important consideration to keep in mind, however, is what the eventual access rights to the source data may be (public vs. restricted vs. private data). Try not to conjoin data in a single dataset where access or CRUD rights may differ across groups.

Assuming the subject dataset has been properly scoped, we can now proceed to the other steps.

Example Use Case

Note: The remaining portions of this discussion emphasize conversion of relational data, with tools and work approaches that are indicative -- but not the only approaches possible -- for such conversions.

Our example use case is based on data in an Oracle relational database. This information is inspected using the Toad data inspection utility. The ETL (extract, transform, load) utility used in the conversion is FME.

Inspect Source Material

The purpose of your initial inspection is to understand the structure of your source data and the possible transformations necessary to convert that data into a form usable by OSF.

Relational Datastores

Generally speaking, the most common use case is the desire to migrate information from a single relational data table into an OSF dataset. In this case, we have used the Toad utility to open and inspect a table dealing with service providers to aborigines (native peoples) in a local community. Via Toad, we are able to see the table structure, issue SQL queries (if needed to filter subset results), and view other structural aspects of the data table:

Rdb datasets 1.png

In this next case we are looking at a data table regarding land use parcels, which we have further filtered down via an SQL query into those parcels that have a land use code assigned:

Rdb datasets 2.png

The point here is not to describe how to use the Toad tool in detail, nor to describe the specific aspects of the example tables. Rather, via tools like Toad, it is important to inspect the structure of your source information. We are specifically looking for field names (columns) or attributes in our data tables, as well as the data types (strings, integers, floats, etc.) for the data values within the individual records. We are also attempting to understand labels, identifiers, and record scope and structure through such inspections.
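
If the source table is also exported to a delimited text "view" (as described later in this document), the same structural check can be scripted. Here is a minimal Python sketch, assuming a hypothetical providers_export.csv export; the file name and the crude type inference are illustrative only, not part of the Toad workflow itself:

  import csv

  # Hypothetical export of the source table to a delimited text "view"
  SOURCE_VIEW = "providers_export.csv"

  def infer_type(value):
      """Very rough type inference: integer, float, or string."""
      for cast, label in ((int, "integer"), (float, "float")):
          try:
              cast(value)
              return label
          except ValueError:
              pass
      return "string"

  with open(SOURCE_VIEW, newline="", encoding="utf-8") as handle:
      reader = csv.DictReader(handle)
      first_row = next(reader)          # the first record suggests the data types
      # Report each field name (column) with the type suggested by that record
      for field, value in first_row.items():
          print(f"{field}\t{infer_type(value)}")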

Organize Field Names

At this point it is useful to grab all of the field names (columns) from your source relational table and put them into a spreadsheet. You can inspect these items, decide which ones you want to extract, and begin to map them to their new attribute names within the target ontology. As you will also see below, you can use such spreadsheets (amongst other tooling approaches) to help organize and write the scripts for some of the transformation steps you will likely need to apply.
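
One way to seed such a spreadsheet is to dump the source field names into a CSV with empty columns for the target attribute names and notes. The following is a minimal Python sketch under the same assumptions as above (the hypothetical providers_export.csv export and a field_mapping.csv output file):

  import csv

  SOURCE_VIEW = "providers_export.csv"   # hypothetical exported view
  MAPPING_FILE = "field_mapping.csv"     # hypothetical mapping spreadsheet

  with open(SOURCE_VIEW, newline="", encoding="utf-8") as src:
      source_fields = next(csv.reader(src))   # the first row holds the column names

  with open(MAPPING_FILE, "w", newline="", encoding="utf-8") as out:
      writer = csv.writer(out)
      writer.writerow(["source_field", "target_attribute", "notes"])
      for field in source_fields:
          # Leave the target attribute blank; it gets filled in against the ontology
          writer.writerow([field, "", ""])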

Transformation Requirements

How data is organized and described in your source relational data tables is likely not how you want it organized and described in your target OSF installation. Desired changes of this kind are known as "transformations". Examples of common transformations are provided in this section.

Major Transformation Areas

Transformations can either be descriptive or structural. Some of the key types of transformations possible are:

  • String substitutions - these might be case changes, changes from underscores to spaces, renaming things, and so forth. How things are labeled and described in the source tables is transformed largely through string substitutions (a small sketch follows this list)
  • Structural transformations - you may also need to make structural changes using such methods as named regexes, unit conversions, or spatial filters. In conversion tools, these are often known as "factories"
  • Geometric conversions - the way in which geographical information is encoded in your source tables may not conform to your target. This is a fairly convoluted area, with different techniques needed for points, polylines and polygons, and different transformations depending on source and target formats. For more discussion of this topic, see Geo-enabling_Datasets
  • LocatedIn - it is sometimes useful -- at time of conversion and transformation -- to relate spatial objects to a given bounded polygon, such as placing a building object within a particular neighborhood or community. Such "LocatedIn" transformations amount to a spatial join, with the point object placed within a bounding container (a point-in-polygon sketch also follows this list). These methods are parameter driven and require special processing routines. It may also be advisable to create a separate LocatedIn dataset specifically for handling these inclusions
  • Named Entities - certain records in your source tables may conform to what you would like to treat as a "named entity" within OSF (and in its indexing or tagging treatment by OSF Tagger (scones), for example). If this is likely to be the case, you will also want to create a transformation filter that flags specific records as named entities (or not)
  • Split / combined categories - a common transformation challenge has to do with aggregation mismatches: either some records need to be split, or others need to be combined, in order to meet the aggregation schema of the target datasets. These cases should be understood, with specialized filters and factory transformations necessary at time of conversion
  • URI Identifiers - the basis of identifiers in OSF datasets is the URI (such as http://example.com/datasets/dataset_1/attribute_a). Existing identifiers in the source tables need to be mapped and possibly transformed to the desired canonical URI format of the target datasets.
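
To make the string-substitution idea concrete, here is a minimal Python sketch of the kind of label clean-up involved; the specific rules (a rename map, underscores to spaces, title case) are illustrative assumptions, not a prescribed set:

  import re

  # Illustrative rename map; real entries come from the field-name spreadsheet
  RENAMES = {"abor_serv_prov": "Service Provider"}

  def clean_label(raw):
      """Apply simple string substitutions to a source label."""
      label = RENAMES.get(raw, raw)
      label = label.replace("_", " ")              # underscores to spaces
      label = re.sub(r"\s+", " ", label).strip()   # collapse stray whitespace
      return label.title()                         # case change

  print(clean_label("land_use  code"))   # -> "Land Use Code"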
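
For the LocatedIn case noted above, the essence of the spatial join is a point-in-polygon test. Here is a minimal sketch using the Shapely library (an assumption; FME and other ETL tools supply their own spatial factories), with invented coordinates for illustration:

  from shapely.geometry import Point, Polygon

  # Invented neighborhood boundary (the bounding polygon), in the dataset's coordinate system
  neighborhood = Polygon([(0, 0), (0, 10), (10, 10), (10, 0)])

  # Invented building record represented as a single geographical point
  building = Point(4.2, 7.5)

  # The spatial join: flag the record as LocatedIn the neighborhood if the polygon contains it
  if neighborhood.contains(building):
      print("building LocatedIn neighborhood")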

Controlled Vocabulary

Another transformation area that might arise in your inspection is the possible shift to the use of controlled vocabularies. That occurs when literal values for a given data field in the source data table are repeated or duplicated. This pattern may warrant using a controlled vocabulary ('dropdown list') in the target dataset. Here are some guidelines for this possibility:

  • If you suspect your source data might meet this condition, do a Select Distinct and inspect the results. If you find relatively few distinct results, or results with similar values but different names that can be combined, the subject data field is likely a good candidate for a controlled vocabulary (see the sketch after this list)
  • Create a special view
  • Develop some scripts to create and/or consolidate the controlled vocabulary names.
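
As a rough equivalent of the Select Distinct inspection, here is a minimal Python sketch run against an exported view; the parcels_export.csv file, the land_use_desc field, and the consolidation map are illustrative assumptions:

  import csv
  from collections import Counter

  SOURCE_VIEW = "parcels_export.csv"   # hypothetical exported view
  FIELD = "land_use_desc"              # hypothetical field being checked

  with open(SOURCE_VIEW, newline="", encoding="utf-8") as handle:
      counts = Counter(row[FIELD].strip() for row in csv.DictReader(handle))

  # Few distinct values (or near-duplicates) suggest a controlled vocabulary
  print(f"{len(counts)} distinct values")
  for value, count in counts.most_common():
      print(f"{count:6d}  {value}")

  # Consolidation map: similar values with different names combined into one term
  CONSOLIDATE = {"Single Family": "Single-Family Dwelling",
                 "single-family": "Single-Family Dwelling"}
  print(sorted({CONSOLIDATE.get(v, v) for v in counts}))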

Other Structured Information

You may also want to inspect the Classification of Semantic Heterogeneity document for other possible sources of structural differences.

Transformation Approaches and Scripts

As the next part of this workflow, Datasets: Prepare Internal (RDB) Datasets, describes, we use the FME tool to do the actual conversion and transformation steps. (Though there are certainly other possible tools [1].)

FME, or similar tools for that matter, tend to work from an exported "view" of the source data, generally in text form, upon which final transformations (or the "factories") operate. Many of the initial conversion steps can be applied to this "view" after export, but prior to actually managing the more complicated conversions within the ETL tool itself.

Such intermediate conversion steps have the advantage of being simple and easily understood. If done right, they are largely self-documenting and capture key knowledge of the domain. Since there is no real penalty for having many of these, it is useful to build up your own set of such scripting aids.

Useful transformation areas above for which such aids might be applicable include:

  • URI assignments
  • typo and case conversions, etc.
  • relatively straightforward regex conversions.

For example, here is a screenshot of assigning a namespace URI to source material:

Replace input.png

Here we are setting up some namespace replacements:

Replace namespace 1.png

And then making assignments to individual attribute (field) names:

Replace namespace 2.png

And then making sure that full URIs are assigned to each of the attributes and records:

Replace uri.png
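
Outside of FME, the essence of these namespace replacement and URI assignment steps can be sketched in a few lines of Python; the namespace value (borrowed from the example URI earlier in this document), the field names, and the to_uri helper are illustrative assumptions rather than FME's own mechanics:

  # Illustrative target namespace for the dataset, not a fixed OSF value
  NAMESPACE = "http://example.com/datasets/dataset_1/"

  def to_uri(field_name):
      """Mint a full attribute URI from a source field name."""
      local_name = field_name.strip().lower().replace(" ", "_")
      return NAMESPACE + local_name

  # Map each source field to its canonical URI
  source_fields = ["Land Use Code", "Parcel ID"]
  attribute_uris = {field: to_uri(field) for field in source_fields}
  # e.g. "Land Use Code" -> "http://example.com/datasets/dataset_1/land_use_code"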

Many such scripts are possible. Again, the advantage is that such scripts record your steps and maintain them for later use. Further, since these are highly patterned, it is possible to build still further scripts from those already developed. This also helps transfer the capabilities to others within your organization.

At time of processing, you can also set up a batch file to do some of the simple, early transformations:

  rem Remember the current directory, then pick up the shared variables one level up
  set myDir=%cd%
  cd ..
  call SetmyCommonVars
  cd %myDir%
  rem Run the common parameter script once for each dataset listed in DATASETS.txt
  for /F "tokens=1" %%1 in (%myFMEDrive%%myFMEBase%infinitemonkey\cmd\ALPHA\DATASETS.txt) do call .\ALPHA_Common_Param.cmd %%1
  rem Then run the LocatedIn processing for this batch
  call ..\..\..\locatedIn\cmd\ALPHA.cmd
  pause

Update the Ontology to Accommodate the Dataset

From a workflow standpoint, and prior to actual dataset import, it is best practice to ensure that the new dataset and its structure are properly modeled and linked into the domain ontology guiding your specific instance. (If you are not already familiar with them, you may want to see the other background material regarding ontologies on this wiki.)

You may find, for example, that to properly include your new data in your system you are missing a "bridging concept" between an existing concept (the "parent") already in the ontology and your new data, as well as some attributes (data) that describe that new concept.

Let's say, for example, that our existing ontology has the concept of housing, but not the concept of single-family dwellings or the specific data attributes captured by our 'SFD Housing Starts' data. The basic conceptual gap this represents appears as follows, with housing representing the "parent" concept and single-family dwellings the "child":

AddingAttributeSchematic.png

Upon identification of such a gap, it is necessary to update the existing ontology using these steps:

  • Define the new class or property; make sure to provide a prefLabel for the object, add as many altLabels as applicable and useful, and define the object with a textual description sufficient to bound and scope the new object (a small sketch follows these steps)
  • Define the relationships of this new object to other classes or properties, and
  • Periodically test your updated ontology for logical consistency using a reasoner.
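
As a small illustration of the first two steps, here is a hedged sketch using the rdflib Python library; the base namespace, the file names, and the Housing / SingleFamilyDwelling class names follow the example above but are otherwise assumptions about your domain ontology:

  from rdflib import Graph, Literal, Namespace
  from rdflib.namespace import OWL, RDF, RDFS, SKOS

  # Illustrative domain namespace; substitute your ontology's own base URI
  DOMAIN = Namespace("http://example.com/ontology/")

  g = Graph()
  g.parse("domain_ontology.owl", format="xml")   # hypothetical existing ontology file

  sfd = DOMAIN.SingleFamilyDwelling
  g.add((sfd, RDF.type, OWL.Class))
  g.add((sfd, RDFS.subClassOf, DOMAIN.Housing))  # relate the new "child" to its "parent"
  g.add((sfd, SKOS.prefLabel, Literal("single-family dwelling", lang="en")))
  g.add((sfd, SKOS.altLabel, Literal("SFD", lang="en")))
  g.add((sfd, SKOS.definition, Literal(
      "A detached residential building designed to house one household.", lang="en")))

  g.serialize("domain_ontology_updated.owl", format="xml")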

If the needed concepts or attributes are not already present, then you must update the ontology(ies). See further the Datasets: Update Ontology(ies) document.

Transform and Load the Dataset

Once the necessary placeholders are ready for the new dataset attributes in the ontology, it is time to formally transform the source data and load it into the OSF instance.

See next the Datasets: Prepare Internal (RDB) Datasets document.


  1. See the Pentaho and Talend Open Studio open source ETL options, among others.