CommON Case Study

From OSF Wiki
Jump to: navigation, search

Supplementary Documentation - 12 November 2009

Latest version
http://wiki.opensemanticframework.org/index.php/CommON_Case_Study
irON Logo
Last update
$Date: 2014/01/14 12:32:43 $
Revision
Revision: 0.4
Author
Michael Bergman - Structured Dynamics
Reviewer
Frédérick Giasson - Structured Dynamics
Copyright © 2009 by Structured Dynamics LLC.
Structured Dynamics LLC
Creative Commons License
irON: instance record and Object Notation by Structured Dynamics LLC is licensed under a Creative Commons Attribution-Share Alike 3.0 license. irON's parsers or converters are separately available under the Apache License, Version 2.0.


Abstract

This document presents a case study for using the commON serialization of irON (instance record and Object Notation) for the Sweet Tools dataset of about 800 semantic Web- and related tools. It describes the reasons and benefits of dataset authoring in a spreadsheet, and provides working examples and code to aid users in understanding and using the commON notation.


INTRODUCTION

There are perhaps 1 billion spreadsheet users worldwide [1], making spreadsheets undoubtedly the most prevalent data authoring environment in existence.

Spreadsheets often begin as simple notetaking environments. With the addition of new findings and more analysis, some of these worksheets may evolve to become full-blown datasets. Alternatively, some spreadsheets start from Day One as intended datasets or modeling environments. Whatever the case, clearly there is much accumulated information and data value "locked up" in existing spreadsheets.

The dataset that is the focus of this use case, Sweet Tools [2], began as an informal tracking spreadsheet. As it grew over time, however, it gained in structure and size. Eventually, it became a reference dataset, with which many other people desired to use and interact.

This need for sharing and collaboration was a major stimulus to development of the commON serialization of irON (instance record and Object Notation) [3]. This case study describes the reasons and benefits of dataset authoring in a spreadsheet, and provides working examples and code based on Sweet Tools to aid users in understanding and using the commON notation.

Genesis and Major Transitions of the Sweet Tools Dataset

About four years ago, shortly after I began my blog, AI3:::Adaptive Information, I began researching and reviewing semantic Web and -related tools. There had been compilations and listings in the past, notably Dave Beckett's Resource Description Framework (RDF) Resource Guide (last updated in September 2005). It takes time to keep such listings current and there was no single place to find useful and relevant tools, particularly open source ones.

By June 2006, I began tracking tool sets in earnest and provided my first listing of about 50 tools in spreadsheet form. The W3C’s ESW wiki began to consolidate a listing of RDF and OWL tools about that same time. In August 2006, that SemanticWebTools listing was posted, containing references to about 70 distinct tools, both open source and proprietary. That listing continues to this day and is an essential starting reference.

My motivation for collecting this information was research into a new semantic Web venture (now well underway at Structured Dynamics). With each published update on a periodic basis I was finding I needed more structure and organization to keep track of the growing list.

By late 2006/early 2007 the Exhibit faceted data browser had been published by MIT's Simile program and I was the first outside of the program to use a Google spreadsheet on a WordPress blog to drive the display, described here. The availability of facets supported the addition of still more structure to the listing and gave quick and useful means to cull through the growing dataset. By January 2007 I named the listing Sweet Tools and made it a permanent part of my AI3 blog.

Exhibit has been helpful not only for display purposes, but because it later was upgraded to include structured data output, including RDF, JSON and others. The growth of the listing and increased attribute characterization has made Sweet Tools a desired dataset for use by others and for collaboration. In May 2008 I began collaborating on the listing with The Semantic Web Company.

Throughout this evolution it has become clear that easier ways to co-author, update and maintain the listing are desirable, as well as simpler ways to export and characterize the datasets. Moreover, what is really desirable is to make the dataset interoperable with a coherent structure for including with other datasets.

Today, Sweet Tools has more than 800 tools in its listing with a rich attribute and typing structure. This case study describes the 16th major version of the dataset, and the first to take advantage of the commON serialization of the instance record and object notation (irON). The two other serializations of irON are irJSON (JSON) and irXML (XML), not further discussed herein.

Sweet Tool's Role in irON

Different dataset authors have different reasons for desiring different formats or serializations for their data. For Sweet Tools, the use of spreadsheets was natural given my own background and the various internal functionality provided by spreadsheets (discussed further below). In contemplating the development of irON, spreadsheets clearly deserved to be an equal citizen with other leading formats.

The format of comma-separated values, or CSV, has been in existence for decades. Microsoft made the format famous as a spreadsheet exchange format. CSV is thus a very useful format since spreadsheets can be used as authoring front-ends with functionality useful to the creation of datasets. CSV is less expressive and capable as a data format than the other irON serializations, yet still has an attribute-value pair orientation. And, via spreadsheets, datasets can be easily authored and inspected, while also providing a rich functional environment including sorting, formatting, data validation, calculations, macros, etc.

Most simple data can be developed and provided as text datasets based on attribute-value pairs (also known as key-value pairs and many other variants) [4]. In spreadsheets, a tabular view of similar "things" (instances) can be readily presented where the records of those instances represent the rows in the table or spreadsheet, the attributes or properties or "fields" describing those things are listed in the columns. The actual values are placed in the cells. Indeed, this basic framework is also what is used in relational data tables.

When exported, CSV only contains the cell data values from a spreadsheet. While this has the disadvantage of losing formatting, formulas and other niceties of spreadsheets in their native form, it also makes the data exchanged clean and relatively uniform. Thus, during data development and preparation the spreadsheet can be used in all of its native capabilities to provide data validation, sorting, formatting, cell referencing, calculations of (say) totals and subtotals, etc. This means that templates with useful prompts and controlled vocabularies and rapid editing and entry functions can be quickly developed. Then, when ready, the data can be exported in a clean manner using CSV for use by other tools and for data federation.

As a dataset very familiar to irON's editors, and directly relevant to the semantic Web, Sweet Tools provided a perfect prototype case study for helping to guide the development of irON, and specifically the commON serialization for irON. The Sweet Tools dataset is relatively large for a speciality source, has many different types and attributes, and is characterized by text, images, URLs and similar.

The premise was that if Sweet Tools could be specified and represented in commON sufficiently to be parsed and converted to interoperable RDF, then many similar instance-oriented datasets could likely be so as well. Thus, as we tried and refined notation and vocabulary, we tested applicability against the CSV representation of Sweet Tools in addition to other CSV, JSON and XML datasets.

The resulting commON is fully compliant with [5] and the parser and ingesters used by the OSF Web Service framework [6]. commON has been validated for Microsoft Excel CSV and Open Office CSV.


DATASET MAINTENANCE IN A SPREADSHEET

Sweet Tools is maintained as a spreadsheet with multiple, individual worksheets. Depending, edits or updates to this spreadsheet may be made by various versions of Microsoft Excel or Open Office scalc. (Note: all screen shots below are from Microsoft Excel Vista.)

Typically, new tools as encountered (or updates to existing tools) are kept and tracked in an email folder. On a periodic basis, these notes are consolidated, including from some standard supporting reference sources, and the individual tools are characterized and updated.

This activity includes inspecting various Web sites or pages for each tool, gleaning and editing descriptions and links, characterizing the tool, and creating a thumbnail image (see further this description of the thumbnail process).

This results in an active worksheet that contains the then-current listing of instance records, as this screen shot shows:

Sweet Tools Main Spreadsheet Screen

Figure 1. Main Sweet Tools Records Screen (click to expand)

The current instance record sheet, called records, has more than 800 rows by twenty or so columns (not all shown).

Formatting and On-sheet Management

The first usefulness of a spreadsheet comes from being able to format and organize the records.

For Sweet Tools, all new records are given a background highlight color. Embedded live HTML links are allowed, contents can be wrapped and styled within cells, and the column and row heads can be "frozen", useful when scrolling large workspaces. How some of this formatting looks in the standard Sweet Tools spreadsheet view is shown in Figure 2:

Flexible Formatting and Display Options

Figure 2. Formatting and Display Options (click to expand)

These functions are geared to aid navigation and viewing on screen. Of course, these options are infinitely variable and a matter of personal taste and work style.

Named Blocks and Sorting

Named blocks are a powerful feature of modern spreadsheets, useful for data manipulation, printing and internal referencing by formulas and the like. Typically, depending on how you have customized your own spreadsheet, you can find the listing of named blocks at the upper left of the main worksheet area:

Spreadsheet Named Blocks

Figure 3. Named Blocks

Sorting with named blocks is especially important as an aid to check consistency of terminology, records completeness, duplicates checks, missing value checks, and the like. By giving the entire instance record set a block name — in this case, SortBlock — it is easy to select the entire record set for manipulation. This naming function is especially helpful when the number of instances (rows) or attributes (columns) grows large, or when you have both display and calculation areas on the same active worksheet.

It is not in any way as easy to do these fast and comprehensive manipulations on other data serializations, such as XML or JSON.

Calculation- or formula-intensive spreadsheets also benefit from named blocks. Once named, it is no longer necessary to highlight all applicable cells in a calculation, which saves tremendous time. As one example among many, this feature is essential when comparing two large datasets to one another with the various lookup functions in spreadsheets. Text manipulations and replacements are another useful area.

These techniques are real time savers when datasets grow large and consistency of treatment and terminology is important.

Note the &&recordList header information on this screen shot, which is the keyword signal for the instance record module.

Multiple Sheets and Consolidated Access

As noted under the Single or Multiple Files section below, commON modules can be specified on a single worksheet or multiple worksheets saved as individual CSV files.

Because of its size and relative complexity, the Sweet Tools dataset is maintained on multiple sheets, as shown in Figure 4:

Managing Multiple Sheets

Figure 4. Manage Multiple Sheets

The first three tabs of dataset, records and linkage are specific commON modules, discussed individually below. The other tabs are for convenience or internal uses. For example, each version is tracked; controlled vocabularies are used for data entry validation; and abandoned tools are also tracked once removed.

Though, by definition, CSV files are limited to a single worksheet, multiple worksheets can be exported from the same spreadsheet (depending on which worksheet or tab is active when saving as a CSV). In any case, multi-worksheet environments help keep related data and notes consolidated and more easily managed on local hard drives.

Completeness and Counts

It is easy to miss entering information into specific cells when datasets grow large. One simple use I employ using the spreadsheet counta function is to sum completed cell entries by both column and row. A count discrepancy for a column tells me an attribute or type value is missing; a discrepancy by row may indicate an incomplete record.

Here is an example for Sweet Tools, indicating that a value is missing is the required &type field:

Count Checks and Internal Calculations

Figure 5. Count Checks and Internal Calculations (click to expand)

Of course, similar helps and uses can be found for many of the hundreds of embedded functions within a spreadsheet.

Unfortunately, an export with such additional information as these counts can not be recognized by the commON parser. When using such techniques you have a choice. You can either maintain a duplicate worksheet with the functions included and the values linked to the actual CSV export sheet, or you can simply delete the temporary calculations immediately prior to CSV export.

Controlled Vocabularies and Data Entry Validation

Quality datasets often hinge on consistency and uniform values and terminology. The data validation utilities within spreadsheets are especially helpful for this. Because they work in the background, you can apply data validation checks anywhere in your worksheets and only the actual values shown on the sheet get exported to CSV.

Data validation can be applied to Boolean, ranges and mins and maxes, and to controlled vocabulary lists. The latter is quite helpful in enforcing consistency, etc., for faceted browsing and the like when the exported dataset values are converted to RDF via the irON parsers and converters.

Figure 6 shows the application of a controlled vocabulary to the primary category assigned to a tool within Sweet Tools. A part of the 50-item pick list is shown for the current entry:

Controlled Vocabularies and Data Entry Validation

Figure 6. Data Entry Validation

Actually, most of the adjacent columns also employ controlled vocabulary lists for such items as the language the tool is written in, whether it is open source or available for free online, what its listing status is (Existing, New, Updated, etc.) and so forth. Note under the controlledVocab tab shown above in Figure 4 that a specific worksheet is devoted for such use in Sweet Tools.

Specialized Functions and Macros

Of course, this case study is not meant to be a tutorial on the general use of spreadsheets. The key point is that all functionality of spreadsheets may be employed in the development of commON datasets. Then, once employed, only the values embedded within the sheets are then exported as CSV.


MODIFYING FOR commON AND CSV EXPORT

One secret of the commON notation is to follow its limited — but specified — conventions and reserved vocabulary in the construction of your input spreadsheets. For full specifics on these requirements, please see SUB-PART 3: commON PROFILE of the irON specification.

In this section we provide some practical guidance on how to work with existing spreadsheets to conform to the commON notation, as well as perhaps a few other useful tips.

Working with Existing Spreadsheets

There are clearly conventions, notations and vocabulary that must be followed to use commON. But, because you can create multiple worksheets within a spreadsheet, it is not necessary to modifiy existing worksheets or tabs. Rather, if you are reluctant or can not change existing information, merely create parallel duplicate sheets of the source information. These duplicate sheets have as their sole purpose export to commON CSV. You can maintain your spreadsheet as is while staging for commON.

To do so, use the simple = formula to create cross-references between the existing source spreadsheet tab and the target commON CSV export tab. (You can also do this for complete, highlighted blocks from source to target sheet.) Then, by adding the few minor conventions of commON, you have now created a staged export tab without modifying your source information in the slightest.

Of course, this step may not be necessary if it is relatively easy for you to modify what already exists or to create a new spreadsheet with the few minor commON conventions in mind.

Encoding Considerations and Handling

In standard form and for Excel and Open Office, single quotes, double quotes and commas when entered into a spreadsheet cell are automatically 'escaped' when issued as CSV. In this usage, escaped merely means that when a parser encounters the escape character within a character string (the standard is the backslash '\'), it tells the system to accept the next character sequence as a legitimate character. Clearly, in CSV where the delimiter is the comma, escaping a comma in a description field, for example, is an essential requirement to maintain the integrity of the text snippet.

commON allows you to specify your own delimiters for lists (the standard is the pipe '|' character) and what the parser recognizes as the 'escape' character ('\' is the standard).

You probably should not change these values in standard use.

For some encodings, however, it may also be necessary to instruct your spreadsheet program about language sets and encodings. Consult your respective help system.

The standard commON parsers and converters are UTF-8 compatible. If your source content has unusual encodings, try to target UTF-8 as your canonical spreadsheet output.

Modules and File Formats

Recall from the irON specification that there are a small number of defined modules or processing sections within the notation. In commON, these modules are denoted by the double-ampersand character sequence ('&&'), and apply to lists of instance records (&&recordList), dataset specifications and associated metadata describing the dataset (&&dataset), and mappings of attributes and types to existing schema (&&linkage).

In commON, any or all of these can occur within a single CSV file or in multiple files. In any case, the start of one of these processing modules is signaled by the module keyword and &&keyword convention.

The RecordList Module

Figures 1, 2 and 3 above (and Figures 9 and 10 below) show examples that start with the &&recordList module, indicating one of more instance records will follow. Note that the first line after the &&recordList keyword is devoted to the listing of attributes and types for the instance records (designated by the &attribute convention in the columns for the first row after the &&recordList keyword is encountered).

As noted in the irON specification, the &&recordList format can also include the stacked style (not shown herein). See the irON specification for an example (though the next &&dataset diagram also shows the stacked style for the &href attribute).

The Dataset Module

The &&dataset module defines the dataset parameters and provides very flexible metadata attributes to describe the dataset [7]. Note the dataset specification is exactly equivalent in form to the instance record (&&recordList) format, and also allows the row or stacked styles (see these instance record examples).

The Dataset Module

Figure 7. The Dataset Module (click to expand)

The Linkage Module

The &&linkage module has a simple, but specific structure. Either attributes (presented as the &attributeList) or types (presented as the &typeList) are listed sequentially by row until the listing is exhausted.

By convention, the second column in the listing is the targeted &mapTo value. Absent a prior &prefixList value, the &mapTo value needs to be a full URL to the corresponding attribute or type in some external schema:

The Linkage Module

Figure 8. The Linkage Module

Notice in the case of Sweet Tools that most values are from the actual COSMO mini-ontology underlying the listing. These need to be listed as well, since absent the specifications in commON the system has NO knowledge of linkages and mappings.

The Schema (structure) Module

In its current state of development, commON does not support a spreadsheet-based means for specifying the schema structure (lightweight ontology) governing the datasets. Another irON serialization, irJSON, does. Either via this irJSON specification or via an offline ontology, a link reference is presently used by commON (and, therefore, Sweet Tools for this case study) to establish the governing structure of the input instance record datasets.

A spreadsheet-based schema structure for commON has been designed and tested in prototype form. commON should be enhanced with this capability in the near future.

Single or Multiple Files

As noted, these specifications can be presented via either single or multiple files. If multiple files, the &&dataset specification needs to provide a correct URL to the supplementary files containing the needed instance record or linkage modules.

Saving and File Names

If the modules are spread across more than one worksheet, then each worksheet must be saved as its own CSV file. In general, you should use the Save As option within your host spreadsheet application to save the currently active workshop as a CSV file. Depending on your design, this can require saving multiple CSV files, each invoked from its active worksheet.

In the case of Sweet Tools, as exhibited by its reference current spreadsheet, sweet_tools_20091110.xls, three individual CSV files get saved.

These files can be named whatever you would like. However, it is essential that the names be remembered for later referencing.

My own naming convention is to use a format of appname_date_modulename.csv. The appname in the case of Sweet Tools is generally swt. The modulename is generally the dataset, records, or linkage convention. I tend to use the date specification in the YYYYMMDD format.

Thus, in the case of the records listings for Sweet Tools, its filename could be something like: swt_20091110_records.csv.

Once saved, the current instance record CSV has this appearance when opened in spreadsheet view:

Spreadsheet View of the CSV File

Figure 9. Spreadsheet View of the CSV Record File (click to expand)

The other two CSV files to accompany these instance records thus become swt_20091110_dataset.csv and swt_20091110_linkage.csv. (Note: we also provide a single worksheet view of the dataset called swt_20091110_csv-merge.csv; see below.)

Alternatively, you could open this same file in a text editor. Here is how this exact same instance record view looks in an editor:

Editor View of the CSV Record File

Figure 10. Editor View of the CSV Record File (click to expand)

Note that the CSV format separates each column by the comma separator, with escapes shown for the &description field when it includes a comma-separated clause. Without word wrap, each record in this format occupies a single row (though, again, for the stacked style, multiple entries are allowed on individual rows so long as a new instance record &id is not encountered in the first column).

Importing into OSF Web Service

Once saved, these files are now ready to be imported into a OSF Web Service instance, which is where the CSV parsing and conversion to interoperable RDF occurs. For our example, we will use the example of the content management system of the Drupal-based OSF-Drupal system [8]. OSF-Drupal exposes the OSF Web Services via a user interface and a user permission and access system.

Current System

The current commON conversion system uses an offline PHP parser that produces a N3 output, which is then used for final conversion by use of OSF Web Service.

OSF Web Service-Mediated System

Within the next couple of weeks, the existing OSF Web Service Import tool will be modified to accept direct CSV initial ingest:

Importing into OSF-Drupal

Figure 11. Importing into OSF-Drupal (click to expand)

You need to point to the location of each applicable CSV file location, give it the dataset name you desire, set the content type of commON, and save the dataset on the network with a name of your choice .


USING THE DATASET

Now that the data is loaded, we are now ready to interact with the Sweet Tools structured dataset using OSF-Drupal (assuming you have a Drupal installation with the OSF-Drupal modules).

Introduction to the App

The screen capture below shows a couple of aspects of the system:

  • First, the left hand panel (according to how this specific Drupal install was themed) shows the various tools available to OSF-Drupal. These include (with links to their documentation) Search, Browse, View Record, Import, Export, Datasets, Create Record, Update Record, Delete Record and Settings [9];
  • The Browse tree in the main part of the screen shows the full mini-ontology that classifies Sweet Tools. Via simple inferencing, clicking on any parent link displays all children projects for that category as well (click to expand):
OSF-Drupal Browse Screen for Sweet Tools

Figure 12. The OSF-Drupal Application (Browse Tool view) (click to expand)

One of the absolutely cool things about this framework is that all tools, inferencing, user interfaces and data structure are a direct result of the ontology(ies) underlying the system (plus the irON instance ontology, as well). This means that switching datasets or adding datasets causes the entire system structure to now reflect those changes — without lifting a finger!!

Exporting in Alternative Formats

Of course, one of the real advantages of the irON and OSF Web Service designs is to enable different formats to be interchanged and to interoperate. Upon submission, the commON format and its datasets can then be exported in these alternate formats and serializations [8]:

  • commON
  • irJSON
  • irXML
  • N-Triples/CSV
  • N-Triples/TSV
  • RDF+N3
  • RDF+XML

As should be obvious, one of the real benefits of the irON notation -- in addition to easy dataset authoring -- is the ability to more-or-less treat RDF, CSV, XML and JSON as interoperable data formats.


DOWNLOADABLE FILES

The files that accompany this Sweet Tools case study are:

In addition, this entire package may be downloaded in zip form as sweet_tools_complete_20091110.zip (562 KB).


REFERENCES

  1. In 2003, Microsoft estimated its worldwide users of the Excel spreadsheet, which then had about a 90% market share globally, at 400 million. Others at that time estimated unauthorized use to perhaps double that amount. There has been significant growth since then, and online spreadsheets such as Google Docs and Zoho have also grown wildly. This surely puts spreadsheet users globally into the 1 billion range.
  2. See Michael K. Bergman's AI3:::Adaptive Information blog, Sweet Tools (Sem Web). In addition, the commON version of Sweet Tools is available for use and manipulation at the conStruct site.
  3. See Frédérick Giasson and Michael Bergman, eds., Instance Record and Object Notation (irON) Specification, Specification Document, version 0.82, 20 October 2009. See Instance Record and Object Notation (irON) Specification.
  4. An attribute-value system is a basic knowledge representation framework comprising a table with columns designating "attributes" (also known as properties, predicates, features, parameters, dimensions, characteristics or independent variables) and rows designating "objects" (also known as entities, instances, exemplars, elements or dependent variables). Each table cell therefore designates the value (also known as state) of a particular attribute of a particular object. This is the basic table presentation of a spreadsheet or relational data table. Attribute-values can also be presented as pairs in a form of an associative array, where the first item listed is the attribute, often followed by a separator such as the colon, and then the value. JSON and many simple data struct notations follow this format. This format may also be called attribute-value pairs, key-value pairs, name-value pairs, alists or others. In these cases the "object" is implied, or is introduced as the name of the array.
  5. The CSV mime type is defined in Common Format and MIME Type for Comma-Separated Values (CSV) Files [RFC 4180]. A useful overview of the CSV format is provided by The Comma Separated Value (CSV) File Format. Also, see that author's related CTX reference for a discussion of how schema and structure can be added to the basic CSV framework; see http://www.creativyst.com/Doc/Std/ctx/ctx.htm, especially the section on the comma-delimited version (http://www.creativyst.com/Doc/Std/ctx/ctx.htm#CTC).
  6. OSF Web Service is a platform-independent Web services framework for accessing and exposing structured RDF data, with generic tools driven by underlying data structures. Its central perspective is that of the dataset. Access and user rights are granted around these datasets, making the framework enterprise-ready and designed for collaboration. Since a OSF Web Service layer may be placed over virtually any existing datastore with Web access -- including large instance record stores in existing relational databases -- it is also a framework for Web-wide deployments and interoperability.
  7. As of the date of this case study, some of the processing steps in the commON pipeline are manual. For example, the parser creates an intermediate N3 file that is actually submitted to the OSF Web Service. These capabilities should be available as a direct import to a OSF Web Service instance.
  8. OSF-Drupal is a structured content system built on the Drupal content management framework. OSF-Drupal enables structured data and its controlling vocabularies (ontologies) to drive applications and user interfaces. It is based on RDF and SD's OSF Web Service platform-independent Web services framework. In addition to user access control and management and a general user interface, OSF-Drupal provides Drupal-level CRUD, data display templating, faceted browsing, full-text search, and import and export over structured data stores based on RDF.
  9. More Web services are being added to OSF Web Service on a fairly constant basis, and the existing ones have been through a number of upgrades.