OSF Engines Layer

The Open Semantic Framework (OSF) stack is premised on the RDF data model. Using a common data model means that all Web services and actions against the data need to be programmed against only a single, "canonical" form. Simple converters translate external, native data formats to RDF at time of ingest; similar converters translate the internal RDF back into native forms for export (or for use by external applications). This "canonical" form leads to a simpler design at the core of the stack and a uniform basis on which tools or other work activities can be written, which in turn lowers development and maintenance costs and speeds implementation. The framework is then made operational via ontologies: domain ontologies that capture the domain or knowledge space, and internal ontologies that guide OSF (see separate Role of Ontologies). This design approach is known as ODapps, for ontology-driven applications.
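The converter idea can be illustrated with a minimal sketch. It is not OSF's actual converter code: the base URI, the function names and the flat-dictionary "native format" are all assumptions made for illustration. A native record is translated into subject-predicate-object triples at ingest and translated back for export.

```python
from typing import Dict, List, Tuple

Triple = Tuple[str, str, str]

# Hypothetical base URI, used only for this illustration.
BASE = "http://example.org/"


def record_to_triples(record_id: str, record: Dict[str, str]) -> List[Triple]:
    """Translate a flat native record into (subject, predicate, object) triples."""
    subject = BASE + "record/" + record_id
    # Sort the fields so the output order is deterministic.
    return [(subject, BASE + "property/" + key, value)
            for key, value in sorted(record.items())]


def triples_to_record(triples: List[Triple]) -> Dict[str, str]:
    """Translate triples about a single subject back into a flat native record."""
    prefix = BASE + "property/"
    return {predicate[len(prefix):]: obj
            for _, predicate, obj in triples
            if predicate.startswith(prefix)}


native = {"name": "Ada Lovelace", "born": "1815"}
canonical = record_to_triples("42", native)   # ingest: native -> canonical form
exported = triples_to_record(canonical)       # export: canonical form -> native
```

Because every tool works against the triples, only these two small functions need to know anything about the native format.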

The OSF engines are all open source and work to support this premise. The OSF engines layer governs the indexing and management of all OSF content. Documents are indexed by the Solr engine for full-text search, while information about their structural characteristics and metadata is stored in an RDF database, called a "triple store." The schema aspects of the information (the "ontologies") are separately managed and manipulated with their own application, the OWL API, which works with the W3C-standard OWL language. At ingest time, the system automatically routes and indexes the content into its appropriate stores. Another engine, GATE, is available for semi-automatic assistance in tagging input information and other natural language processing (NLP) tasks.
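The routing step at ingest can be pictured roughly as follows. This is a sketch only: the in-memory containers stand in for Solr, the triple store and the ontology store, and the item layout (the "kind", "uri", "metadata", "text" and "body" keys) is a hypothetical format, not an OSF interface.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Stores:
    """In-memory stand-ins for the three stores described above."""
    full_text: Dict[str, str] = field(default_factory=dict)            # Solr stand-in
    triples: List[Tuple[str, str, str]] = field(default_factory=list)  # triple store stand-in
    ontologies: Dict[str, str] = field(default_factory=dict)           # ontology store stand-in


def ingest(stores: Stores, item: dict) -> None:
    """Route one incoming item to the store(s) appropriate to its kind."""
    kind = item["kind"]
    if kind == "ontology":
        # Schema information is kept apart from instance data.
        stores.ontologies[item["uri"]] = item["body"]
    elif kind == "record":
        # Structure and metadata go to the triple store.
        for predicate, value in item["metadata"].items():
            stores.triples.append((item["uri"], predicate, value))
        # Any document text is also indexed for full-text search.
        if item.get("text"):
            stores.full_text[item["uri"]] = item["text"]
    else:
        raise ValueError("unknown item kind: " + repr(kind))


stores = Stores()
ingest(stores, {"kind": "record", "uri": "urn:doc:1",
                "metadata": {"title": "Report"}, "text": "Full body text"})
```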

The RDF triple store is provided by OpenLink's Virtuoso software. Virtuoso is a cross-platform ‘universal server’ for SQL, XML, and RDF data management that also includes a powerful virtual database engine, native hosting of existing applications, a Web services deployment platform, a Web application server, and bridges to numerous existing programming languages. We mostly use Virtuoso's RDF storage and management, SPARQL, and inferencing capabilities.

Many structured data systems lack well-performing full-text search. Also, structured data based on linked data RDF often substitutes Web identifiers for literal text values. This practice is good for linking and tracking purposes, but can excise much text, leading to incomplete result sets during standard text search. To address these issues, we: 1) changed standard RDF practice to also record literals in addition to URI identifiers; and 2) integrated our structured data store with the Solr text-search engine. Solr is an open source enterprise search server based on the Lucene Java search library, with faceted search, caching, and many more features.
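To see why recording literals alongside URIs matters for text search, consider the following sketch. It does not use Solr or reproduce OSF's indexing code; the function names, the labels lookup and the example URIs are assumptions for illustration. When a property value is a URI, the human-readable label of the referenced object is added to the search document so that a text query can still find the record.

```python
from typing import Dict, List, Tuple

Triple = Tuple[str, str, str]


def build_search_document(subject: str,
                          triples: List[Triple],
                          labels: Dict[str, str]) -> Dict[str, List[str]]:
    """Collect the searchable values for one subject.

    `labels` maps object URIs to their literal labels (a hypothetical lookup).
    """
    doc: Dict[str, List[str]] = {}
    for s, predicate, obj in triples:
        if s != subject:
            continue
        values = doc.setdefault(predicate, [])
        values.append(obj)                 # keep the URI for linking
        if obj in labels:
            values.append(labels[obj])     # also record the literal text
    return doc


def matches(doc: Dict[str, List[str]], term: str) -> bool:
    """Naive case-insensitive substring search over all values."""
    term = term.lower()
    return any(term in value.lower()
               for values in doc.values() for value in values)


triples = [("urn:person:1", "livesIn", "http://example.org/place/123")]
labels = {"http://example.org/place/123": "Paris"}

with_literals = build_search_document("urn:person:1", triples, labels)
uris_only = build_search_document("urn:person:1", triples, {})
```

With the label recorded, a search for "paris" finds the record; with the URI alone, it does not.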

The OWL API is a Java implementation for creating, manipulating and serializing OWL ontologies. This engine gives us a very flexible and powerful way to manage the ontology schema at the core of OSF and to conduct special retrieval and manipulation tasks based on the structure in that schema.

The General Architecture for Text Engineering (GATE) engine is a Java suite of tools used by a worldwide community of scientists, companies, teachers and students for all sorts of natural language processing tasks, including information extraction in many languages. GATE is widely acknowledged as one of the best tools for computational tasks involving human language or text analysis. The primary use of GATE in OSF is to drive the semi-automatic assignment of subject tags to documents.

The OSF engines layer also includes the PHP/Java Bridge, an XML-based network protocol to connect a native script engine (in our case, PHP) to a Java virtual machine. It is fast and efficient. The bridge gives us the capability to run Java-based engines efficiently within the stack. It connects to GATE and the OWL API within OSF, and provides a ready means for integrating still other Java-based capabilities and engines as customers may need.

For efficiency, Web service requests are cached using Memcached, an open source, high-performance, distributed memory object caching system. Memcached is an in-memory key-value store for small chunks of arbitrary data (strings, objects), which makes it well suited to these API calls.
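The key-value caching pattern can be sketched as below. A plain dictionary stands in for Memcached, and the class, method names and key scheme are hypothetical, not OSF's. The idea shown is that each request is reduced to a stable key, and the handler runs only when no cached result exists for that key.

```python
import hashlib
import json
from typing import Any, Callable, Dict


class ResponseCache:
    """Dictionary-backed stand-in for a Memcached-style key-value cache."""

    def __init__(self) -> None:
        self._store: Dict[str, Any] = {}
        self.hits = 0
        self.misses = 0

    @staticmethod
    def make_key(endpoint: str, params: Dict[str, Any]) -> str:
        # Serialize parameters with sorted keys so that equivalent requests
        # map to the same key, then hash to get a short, fixed-length key.
        canonical = json.dumps(params, sort_keys=True)
        raw = (endpoint + "?" + canonical).encode("utf-8")
        return hashlib.sha256(raw).hexdigest()

    def get_or_compute(self, endpoint: str, params: Dict[str, Any],
                       handler: Callable[[Dict[str, Any]], Any]) -> Any:
        key = self.make_key(endpoint, params)
        if key in self._store:
            self.hits += 1
            return self._store[key]
        self.misses += 1
        result = handler(params)      # run the (expensive) Web service call
        self._store[key] = result
        return result
```

A real deployment would also need expiry or invalidation when the underlying data changes, which this sketch omits.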

The fundamental unit of record aggregation upon which these engines act is the "dataset". A dataset is a named grouping of records, ideally composed of records with similar types and intended access rights (though technically a dataset is any named grouping of records). Datasets are one of the three major access dimensions to the OSF (the other two being users/groups and tools/endpoints, see next).
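One way to picture the three access dimensions is as a lookup over (group, dataset, endpoint) combinations. The sketch below is an assumption about how such a check could be expressed, not a description of OSF's actual access model; the group, dataset and endpoint names are invented.

```python
from typing import Iterable, Set, Tuple

# A permission grants one group the use of one endpoint on one dataset.
Permission = Tuple[str, str, str]   # (group, dataset, endpoint)


def is_allowed(permissions: Set[Permission],
               user_groups: Iterable[str],
               dataset: str,
               endpoint: str) -> bool:
    """Return True if any of the user's groups may use `endpoint` on `dataset`."""
    return any((group, dataset, endpoint) in permissions
               for group in user_groups)


# Hypothetical grants for illustration.
permissions = {
    ("editors", "urn:dataset:reports", "search"),
    ("editors", "urn:dataset:reports", "update"),
    ("public", "urn:dataset:reports", "search"),
}
```

Here a user in the "public" group could search the reports dataset but not update it.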

All data objects (variously called entities, kinds, types or classes), their relations (properties, fields, attributes) and their annotations (metadata) are given Web identifiers in the form of URIs. These are similar to Web site URLs, but designate objects and properties rather than Web sites. This means that every piece of data within the OSF has a unique identifier, accessible over HTTP.