Archive 1.x:Datasets Syncing Framework

The Open Semantic Framework (OSF) datasets syncing framework is a work process and a set of utilities that aid in migrating source data into an OSF instance and keeping it up to date. This document describes its components and how to configure and use the framework.

Introduction
The datasets syncing framework helps structWSF system administrators keep datasets that come from a variety of external sources in sync. The datasets syncing framework is composed of:


 * 1) A set of external datasets serialized in different formats
 * 2) A series of conversion tools
 * 3) A configuration file
 * 4) The core syncing tool
 * 5) A running structWSF instance to which the datasets are synced

On this documentation page, we will see how a structWSF system administrator can manage a series of external datasets and keep them in sync with a running structWSF instance.

The Use Case
The structWSF datasets syncing framework has been developed with one use case in mind: integrating, and keeping in sync, multiple heterogeneous datasets that come from different database management systems and serialization formats.

Let's take a look at this example:

An organization wants to integrate these different sources of data that come from different data management systems:


 * 1) An SQL database
 * 2) Geo-spatial information that comes from an SDI (spatial data infrastructure) instance
 * 3) Spreadsheets

This organization wants to integrate that information into structWSF because it is the best open source framework designed for heterogeneous data integration currently available on the market, and because it is what is used by their Web portal.

Some of these datasets are fairly static, but others are dynamic, changing daily if not more frequently. Also, new datasets will be added over time, and new formats may have to be handled as well.

Converters
Converters are used to convert the input dataset files written in any format into RDF. Then the generated RDF data is fed to structWSF by querying the CRUD: Create Web service endpoint. The converters that are currently available are:


 * 1) commON
 * 2) default (RDF/XML)
 * 3) kml

However, it is easy to develop new ones to convert any other kind of format into RDF. (See further the discussion on RDFizers.)
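
To give a sense of what that involves, here is a minimal sketch of a hypothetical converter written in PHP. The function name, the CSV input format, and the assumption that the RDF/XML is returned as a string are all illustrative only; the documented contract (see the converterFunctionName setting below) is simply that the conversion function receives the path of a file to convert and the parsed INI section of the dataset.

  <?php
  // Hypothetical converter sketch. Only the two-parameter calling convention
  // is documented by the framework; everything else here is an assumption.
  function csvConverter($filePath, $datasetIniSection)
  {
    $rdf = "<rdf:RDF xmlns:rdf=\"http://www.w3.org/1999/02/22-rdf-syntax-ns#\">\n";

    // Read the source file line by line and emit one RDF resource per record.
    foreach (file($filePath, FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES) as $line)
    {
      $fields = str_getcsv($line);

      // Mint the record URI from the dataset's baseURI setting
      // (the parsed INI section is assumed to be an associative array).
      $rdf .= "  <rdf:Description rdf:about=\"" .
              $datasetIniSection['baseURI'] . urlencode($fields[0]) . "\">\n";

      // ... emit one RDF property per mapped column here ...

      $rdf .= "  </rdf:Description>\n";
    }

    return $rdf . "</rdf:RDF>";
  }
  ?>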

Staging the Datasets for the Syncing Process
Prior to actual synchronization, the source datasets must be prepared.

Datasets Selection
The first step for the structWSF system administrator is to select all of the datasets to import and keep in sync with the structWSF instance. These datasets can come from any kind of data source and can be serialized in any format.

Generating Datasets
The system administrator next needs to generate the dataset files. Here are the steps that should be followed:


 * 1) Analyze the data of the source datasets; see further the Analyze Datasets guidance
 * 2) Analyze the system currently used to manage the source dataset
 * 3) Based on these two analyses, determine the best way to serialize the dataset information and to save it in a file. This determination is driven by three factors: (1) the size and complexity of the dataset; (2) what output formats are currently supported by the source system; and (3) what converters are currently available, both as part of the native system and as third-party extensions.
 * 4) Once the format and serialization decisions are made, it is then necessary to properly map the data from the source dataset into the desired output format. For example, if the objective is to generate RDF data directly from the source dataset into an RDF/XML file, then it will be necessary to map that information to an ontology to generate the proper RDF/XML resources file. This is typically the best approach to use with structWSF.

The result of performing these four steps will be a file serialized in a format supported by one of the converters of the datasets syncing framework.

The next series of steps deals with the updates that will occur with the source dataset:


 * 1) Analyze the update frequency of the dataset: will the dataset be updated over time, or is it static?
 * 2) If updates may occur, the system administrator will have to put a strategy in place to notify the datasets syncing framework that a particular record changed. The nature of these changes may be:
 * 3) The record is new in the source dataset, so it has to be created. Such records are "tagged" with a create statement
 * 4) The record has been modified in the source dataset, so it has to be updated. Such records are "tagged" with an update statement
 * 5) The record has been removed from the source dataset, so it has to be deleted. Such records are "tagged" with a delete statement
 * 6) If records have been added, modified or removed since the last sync, then the system administrator has to compute the deltas (that is, the differences between the last version of the dataset and the new one) and generate a new version of the dataset file where each changed record (the computed delta) gets properly tagged (see below).

With this datasets staging procedure in place, the system is now ready to create new datasets and to update existing ones.

Datasets Aggregation
The next step is to aggregate all the serialized dataset files into the same folder. This aggregation folder is used in the configuration file step described below.

Example
Let's use the following example to illustrate the steps described above.

There is a structWSF system administrator who has to integrate multiple datasets from different data sources, supported by a variety of systems. One of these datasets is a dataset of schools managed in an Oracle SDI instance, which is a specialized geospatial database. Additional information about these schools is hosted in a more traditional SQL database. Fortunately, the system administrator has access to FME. After performing the analysis described above, the administrator uses FME to extract information from the SDI and the SQL database and to create an RDF/XML file that will be fed to the datasets syncing framework.

Once the FME Workbench file that maps the data sources to the appropriate ontologies has been created, the process generates the RDF/XML file and saves the initial version of the dataset as:

/data/datasets/schools_2011_11_30.xml

Note that the system administrator added the date to the name of the generated file. This date is used as a version stamp. These versioned files are used by the datasets syncing framework to apply the latest changes to the structWSF datasets.

The initial schools RDF/XML file looks like:
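
As an illustrative sketch only, it could resemble the following. The schools ontology terms and the sync tagging property used here are placeholders invented for this illustration; the actual tagging statement is the one recognized by the datasets syncing framework:

  <?xml version="1.0" encoding="UTF-8"?>
  <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
           xmlns:schools="http://example.org/ontologies/schools#"
           xmlns:sync="http://example.org/ontologies/sync#">

    <!-- every record of the initial load is tagged for creation -->
    <rdf:Description rdf:about="http://example.org/datasets/schools/1">
      <rdf:type rdf:resource="http://example.org/ontologies/schools#School"/>
      <schools:phoneNumber>555-123-4567</schools:phoneNumber>
      <sync:status>create</sync:status>
    </rdf:Description>

    <rdf:Description rdf:about="http://example.org/datasets/schools/2">
      <rdf:type rdf:resource="http://example.org/ontologies/schools#School"/>
      <schools:phoneNumber>555-765-4321</schools:phoneNumber>
      <sync:status>create</sync:status>
    </rdf:Description>

  </rdf:RDF>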

Notice that all of the records are "tagged" with a create statement. This tells the datasets syncing framework to create each of these records. These statements are required; otherwise the records will be ignored.

However, a few days after the initial loading, some of the schools' information has changed: the phone number of one of the schools was changed, another school was closed, and yet another one was opened (what a week!). The FME process then creates a new version file for this dataset and adds the changes to it.

The new file that gets created looks like:

/data/datasets/schools_2011_12_05.xml

And the example content of that file might be:
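
Using the same placeholder vocabulary as in the previous sketch, the delta file could contain only the records that changed:

  <?xml version="1.0" encoding="UTF-8"?>
  <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
           xmlns:schools="http://example.org/ontologies/schools#"
           xmlns:sync="http://example.org/ontologies/sync#">

    <!-- the phone number of this school changed: tag it for update -->
    <rdf:Description rdf:about="http://example.org/datasets/schools/1">
      <schools:phoneNumber>555-999-8888</schools:phoneNumber>
      <sync:status>update</sync:status>
    </rdf:Description>

    <!-- this school closed: tag it for deletion -->
    <rdf:Description rdf:about="http://example.org/datasets/schools/2">
      <sync:status>delete</sync:status>
    </rdf:Description>

    <!-- this school opened: tag it for creation -->
    <rdf:Description rdf:about="http://example.org/datasets/schools/3">
      <rdf:type rdf:resource="http://example.org/ontologies/schools#School"/>
      <schools:phoneNumber>555-222-3333</schools:phoneNumber>
      <sync:status>create</sync:status>
    </rdf:Description>

  </rdf:RDF>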

Now, notice the statement attached to each record: the school whose phone number changed is tagged for update, the closed school is tagged for deletion, and the newly opened school is tagged for creation.

This new version file is read by the datasets syncing tool, which then acts according to the statements defined for each of these records.

This is how multiple heterogeneous datasets are kept in sync within a consolidated structWSF instance.

In this example, an RDF/XML file has been generated for the datasets syncing framework, but there is no obligation to do so. What needs to be done is to create a file in a format supported by one of the converters. Also, if a suitable converter doesn't exist, one can easily be developed and added to the framework.

Configuring the Datasets Syncing Framework
Once the external source datasets are properly staged, it is not hard to finish configuring the datasets syncing framework.

The first step is to download the framework, and to put it somewhere on your syncing server. Let's say that it is located here:

/data/sync/

This folder contains:

/data/sync/
/data/sync/converters/
/data/sync/converters/common/
/data/sync/converters/common/commonConverter.php
/data/sync/converters/common/CommonParser.php
/data/sync/converters/default/
/data/sync/converters/default/defaultConverter.php
/data/sync/converters/kml/
/data/sync/converters/kml/kmlConverter.php
/data/sync/sync.ini
/data/sync/sync.php

So, it contains a series of converters, a configuration file and the actual syncing program.

Configuration File
The configuration file is the sync.ini file. This is where the datasets syncing framework is configured: where the core settings are defined, and where each dataset is configured.

[config]
The [config] section of the sync.ini file is where the core settings of the framework are defined. Here is the list, and description, of the available settings:


 * 1) structwsfFolder - this is the path of the structWSF instance folder. This path has to end with a trailing slash
 * 2) indexesFolder - this is the path where the internal framework indexes are saved. These indexes are used to know what got modified and what didn't. This path has to end with a trailing slash
 * 3) ontologiesStructureFiles - this is the path where the ontologies structure files are located on the server. This path has to end with a trailing slash
 * 4) missingVocabulary - this is the path where the missing vocabulary attributes and types log files get saved. One file will be created per dataset. This path has to end with a trailing slash.
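
As an illustration, a [config] section could look like the following (the paths are placeholders for this example; only the setting names are part of the framework):

  [config]
  structwsfFolder          = "/usr/share/structwsf/"
  indexesFolder            = "/data/sync/indexes/"
  ontologiesStructureFiles = "/data/ontologies/structure/"
  missingVocabulary        = "/data/sync/missing/"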

[Some-Dataset-Name]
Every other section of the sync.ini file (whose name appears between the brackets) is considered a dataset configuration object.

For each dataset defined in this configuration file, you will have to create an empty dataset in the structWSF instance. The URI of this dataset will be the value of the datasetURI setting described below.

A dataset configuration object is composed of 7 required settings, with a further 5 optional settings:


 * 1) datasetURI [Required] - this is the URI of the dataset to update in the structWSF instance
 * 2) baseURI [Required] - this is the base URI of the records that get converted
 * 3) datasetLocalPath [Required] - this is the local path folder where the files of the dataset are archived
 * 4) converterPath [Required] - this is the path where all files of the converter are located
 * 5) converterScript [Required] - this is the name of the converter PHP script to run
 * 6) converterFunctionName [Required] - this is the name of the function to call that will convert the dataset files into RDF. It takes two parameters: the first one is the path of a file to convert, and the second one is the parsed INI section configured for this dataset
 * 7) targetStructWSF [Required] - this is the URL of the structWSF instance where the records have to be created. Note that the dataset has to exist on that structWSF instance before running the syncing script. Also note that the server that performs the sync has to have the proper rights to write information into that dataset on that structWSF instance
 * 8) baseOntologyURI [Optional] - this is used by the converter of the dataset to properly create the new properties and classes while converting the dataset. This is only used when the base URI of a record is missing
 * 9) sliceSize [Optional] - this defines the number of records to send to the structWSF endpoint at a time. Tweaking this parameter has an impact on the performance of the syncing process
 * 10) largeFileSize [Optional] - this determines the size above which a file is considered big by this system. Big files are handled differently (in chunks). Tests may be needed to check what is the right size for your use cases. The size is in bytes
 * 11) filteredFiles [Optional] - this is used to filter down to a file, or a set of files for that dataset. Each file name is separated by a semi-colon ";"
 * 12) filteredFilesRegex [Optional] - this has the same behavior as the "filteredFiles" parameter, but it matches the files to include in the dataset based on a regex pattern. This parameter has priority over "filteredFiles".
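
Tying this back to the schools example, a dataset configuration section could look like the following sketch. All of the values are placeholders for this illustration (including the converter function name, which has to match the function actually defined in the converter script); the setting names are the ones documented above:

  [Schools]
  datasetURI            = "http://example.org/wsf/datasets/schools/"
  baseURI               = "http://example.org/datasets/schools/"
  datasetLocalPath      = "/data/datasets/"
  converterPath         = "/data/sync/converters/default/"
  converterScript       = "defaultConverter.php"
  converterFunctionName = "defaultConverter"
  targetStructWSF       = "http://example.org/ws/"
  sliceSize             = "200"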

Running the Syncing Process
Once the datasets are properly staged and configured, running the syncing process is easy.

You can run the process manually from the terminal:
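
For example, assuming PHP is available on the command line and that sync.php takes no additional arguments:

  cd /data/sync/
  php sync.php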

Or you can configure a cron job to run the syncing process (for example) every day.
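
One common way is to open the current user's crontab for editing:

  crontab -e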

Then add this line to the crontab:
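
For example, assuming the framework is installed in /data/sync/ as above and that php is on the PATH:

  # run the syncing process every day at midnight
  0 0 * * * php /data/sync/sync.php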

In this manner, the syncing process will run daily at midnight and check if new datasets have been created or if any datasets got modified since the last run.