User:Wanderson/Books/Lab Notebook

From OSF Wiki
Jump to navigation Jump to search

Lab Notebook

Experiments with OSF-Drupal

Main Page

Experiment 1: Transform a simple Excel spreadsheet to OSF-Drupal dataset import format.

Introductory

This is a record of the steps needed to transform a simple Excel spreadsheet to a format that can be imported into an OSF-Drupal system. My goal is to capture what I had to learn, including the mistakes I made and misunderstandings I had that were corrected as I made progress. The outcome of this is perhaps a Cookbook recipe that others might find useful.

There are several different sorts of issues that this experiment raises. Some are about data informatics and database design. Some are about writing down almost everything that I had to do to transform an ad-hoc and idiosyncratic spreadsheet into an OSF dataset.

The starting spreadsheet was found by a search of online data generated by the City of Austin, Texas, USA. This spreadsheet is described as a list of shrubs that are found in and around Austin, TX, and some summary information about their water needs, maintenance, etc. The table contains approximately 60 records. This experiment uses the first 10 records, five of which are shown here:

CoA-shrublist-r1-5.jpg

Adding commON attributes

I had two confusions. The first was figuring out the difference between a "dataset" and a "recordList". The documentation [1] is quite clear; viz., a dataset is a description of the entire set of records and a recordList is, well, a list of records. In Mike Bergman's words "dataset is the metadata about the entire dataset (author, when created, etc); recordList is the transmittal of the actual data". Mike Bergman and Fred Giasson were very helpful in clarifying this for me.

The second confusion was in understanding how to adapt the spreadsheet shown above and make it a recordList. My primary confusion here was wondering if the column labels are restricted to being one of the specified commON attributes. For some reason, I was confused by the example shown in the documentation.

Here is the spreadsheet (only showing the first three data rows) with the first two rows replaced by five rows of commON &&dataset and &&recordList attributes:

COA-shrub-dataset01.jpg


Fred Giasson and Mike Bergman urged adding an &id and a &type field. The &id will support a usable record URI specification. The &type field will add a minimum of data organization to the records in the dataset. Without a &type field each record will end up being of type Thing. Now the Excel spreadsheet looks like this:

OSF dataset-t01xls.jpg

Some data informatics issues

The spreadsheet just above reflects some data quality and database design issues. First, the first column is titled "Common Name" but it contains both common and Latin names for the plants. It might be a better idea to separate these names into separate columns. Mike Bergman pointed out that "for &height (and similar) we have a different issue: the actual attribute values are also being combined with their unit of measure (feet -- " ' " -- in this case). What the value should be is the actual number, with a separate specification of the attribute and its characteristics (*not* part of the record list transmittal) done in the schema definition. (Unfortunately, we do not really have a schema definition here, do we? At this juncture and with current systems, that must be handled in the actual ontology spec.)"

This is an interesting problem. I hope to have a workable solution for it. However, I also think that if the OSF framework is going to be useful then simple ways of importing a given spreadsheet are required, regardless of the quality of the data or schema. Another conclusion one can draw from this example is that importing legacy datasets into OSF may require substantial manual work, especially if the data and information schema need rework.

This experiment will not attempt to improve the quality of the input dataset.

Successful dataset import

I exported the spreadsheet shown above into the CSV file shown here:

Coas01 dataset 20100818-CSV.png

One difference between this file and the spreadsheet shown above: I shortened the &&dataset &id to "http://localhost/" since it is only a prefix for creating a URI record identifier; it's not really the location of a file on my localhost. (Note: even though I know better it is easy to forget the difference between a URI and and URL.)

The next step was to download the structwsf commON converter software into /usr/share/structwfs/converter/common. This software is not part of the default structwsf install and needs to be downloaded from a development branch with a command like this one:

~/tmp$ svn co http://structwsf.googlecode.com/svn/branches/dev/converter/common common-read-only

and then moving the files in the common-read-only directory to the directory (created if needed) /usr/share/structwsf/converter/common

Installing these files enables import of commON CSV files like the one shown above using the OSF-Drupal Dataset Import dialog:

Import coas dataset.png

The result of this import is shown here:

Coas01 dataset created.png

Note the message warning that some types and attributes used in this dataset are not defined in the system's ontological structure. Connecting the information in this dataset to an ontology is the next problem I encountered. But at least the dataset was successfully imported. The imported records of the dataset are viewed by selecting the "Browse imported dataset" button, which yields (for the first record):

Coas01 dataset record1.png

Two observations on this record are related to the system warning of undefined types and attributes. First, it is an "Unlabeled record". Second, it is classified as a "Thing". Avoiding being classified and a "Thing" was the reason that each record in the recordList has a &type field value of "Shrub". Clearly, more work is required.

Connecting attributes and types to ontologies

Once again the OSF documentation was my starting point. My system has the small set of ontologies loaded, so I took some time to look at their documentation online. My first thoughts were to find a way to link the attribute labeled "&common_name" to an existing "name" concept (sometimes called a "class") in one of the ontologies already loaded, and to link the &typeList value "Shrub" to another already loaded concept. The foaf ontology has a "name" element, so I used that for "common_name". Just to use something I decided to map the &typeList value "Shrub" to the foaf element "Person." So I updated the &attributeList and &typeList &&linkage sections of the commON CSV file to look like this:

&&linkage
&version,&linkedType
1.01,application/rdf+xml

&attributeList,&mapTo
common_name,http://xmlns.com/foaf/0.1/name

&typeList,&mapTo
Shrub,http://xmlns.com/foaf/0.1/Person

A subsequent Import of this file yielded the following record:

Common name isa person.png


Two observations. First, the record is no longer "unlabeled"; it is labeled with the contents of the &common_name field. And second, the record "is a Person". That was fairly straightforward. Next I need to find an ontology with a concept that can &mapTo the &typeList value "Shrub". I had an early release copy of the MUNI ontology and it contained a "Plants" class, so I put the muni.owl file into my "/data/ontologies/files/" directory and ran /usr/share/structwsf/ontology/load_ontologies.php with this command:

$ sudo php /usr/share/structwsf/ontology/load_ontologies.php

Once all the ontologies were loaded I changed changed the &typeList entry in the &&linkage section of the commON CSV file to be this:

...
&typeList,&mapTo
Shrub,http://purl.org/ontology/muni#Plants

Import of the file yielded this first record:

Common name isa plants.png

Two observations from this result. First, installing new ontologies is relatively easy; the command "php /usr/share/structwsf/ontology/load_ontologies.php" can be run as many times as needed. Second, now we have the shrub records named and now they are of type "Plants". This is an awkward phrasing, as the &type should be a singular noun. In the MUNI ontology "Plants" is a class with subclasses "Cacti", "Flowers", "Houseplants", "Succulents", and "Trees". Using the "Cacti" subclass as an example I added another subclass "Shrub". I updated the &typelist entry in the &&linkage section of the commON CSV file to be:

...
&typeList,&mapTo
Shrub,http://purl.org/ontology/muni#Shrub

Importing the file yielded this view of the first record:

Common name isa shrub.png

Now there is a dataset whose records display the &type value "Shrub".