OSF Revisioning Design

Introduction

A record revisioning system for OSF is a set of new mechanisms that enable users and developers to create revisions (versions) of individual OSF records. Every version of a record's description is saved in a separate revisions dataset. Users will be able to list all the revisions of a record, read older revisions, check the differences between two versions, and check future versions that are waiting to be published for a record. This document outlines the design of the OSF Revisioning system.

Required Behaviors

A record revisioning system should at least meet these requirements by being able to:

  • Get a complete list of all revisions for a record
  • Specify if we want to create a new revision, or not, for a record that is being updated
    • This would only be possible if the record to update is already published
  • Delete all revisions for a record
  • Delete a single revision for a record
  • Revert the record currently published in the dataset to a previous revision
  • Update the status of a specific revision of a record
  • 'Diff' two revisions of a record
  • Create unpublished revisions that wait to be moderated. These unpublished revisions are more recent than the current version of the record available in the dataset, and will eventually replace it after moderation.
  • Revision the reified statements of the records.

Revisioning Method

The revisioning method designed in this document saves the complete description of a record every time a new revision is created. This means that all the triples of the record, including its reified triples, are saved in each revision. An alternative method would be to save only the triples that changed (that is, the difference between two descriptions of the record), using an approach such as the ChangeSet method.

The advantages and disadvantages of the revisioning method chosen in this document are outlined below.

Advantages

  1. Less space consumed for smaller records with a lot of changes per revision
  2. Reverting to a previous revision is fast since the complete state of a record exists in its revision record; a single read query is required
  3. Comparing two non-consecutive revisions is faster than with the ChangeSet method, since comparing them takes the same time as comparing two consecutive revisions.
  4. Can easily revision reification statements.

Disadvantages

  1. More space consumed for big records with a small number of changes per revision
  2. Comparing two consecutive revisions needs to be done at runtime with the RDF Diff API, whereas the ChangeSet method stores that difference directly.
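
To make the space trade-off more concrete, here is a rough back-of-the-envelope sketch in Python. It assumes that a full revision costs one triple per triple of the record, and that a ChangeSet-style revision stores each changed value as one removal plus one addition, each of them reified as an rdf:Statement (four triples) and linked from the change set (one triple). The function names and the cost model are illustrative assumptions, not part of OSF, and the revision metadata triples are ignored on both sides.

```python
# Rough cost model (an assumption for illustration, not OSF code).
REIFIED_STATEMENT_TRIPLES = 4  # rdf:type, rdf:subject, rdf:predicate, rdf:object
LINK_TRIPLES = 1               # one triple linking the change set to the statement


def full_revision_cost(record_triples: int) -> int:
    """Triples stored when the complete record description is saved."""
    return record_triples


def changeset_revision_cost(changed_values: int) -> int:
    """Triples stored when only the changes are saved.

    Each changed value is modeled as one removal plus one addition.
    """
    per_statement = REIFIED_STATEMENT_TRIPLES + LINK_TRIPLES
    return changed_values * 2 * per_statement


# Small record, many changes: the full revision is cheaper.
small = (full_revision_cost(10), changeset_revision_cost(8))
# Big record, few changes: the change set is cheaper.
big = (full_revision_cost(1000), changeset_revision_cost(2))
```

Under these assumptions, the reification overhead of the change set is what makes full revisions competitive for small records that change a lot.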

Revisions Scenarios & Structures

In this section we outline the different revisioning scenarios that can happen and, for each of these scenarios, we show what the revision structure looks like.

Basic Revision

This is the most basic scenario of the revisioning system. We have three revisions for a single record. The last revision is the one that is published on the different portals.

Osf revisioning basic.png

Revisioning adding a new unpublished revision

This second revisioning scenario is one where the revision that is currently published on the portals is not the last revision of the record. This scenario means that there exists a more recent revision for this record that is not yet published on the portals. It is probably waiting for approval in a governance workflow.
Osf revisioning new unpublished revision.png

Revisioning reverting to a previous revision

This other revisioning scenario shows what happens when a user chooses to re-publish an older version of a record. This means that the Rev 3 revision still exists in the revisioning system, but that it is not published on the portals anymore. It is the Rev 2 revision that is now exposed on the portals.
Osf revisioning revert previous.png

Revisioning deleting an existing revision

This scenario shows the impact of deleting a revision in the middle of a sequence of revisions. If Rev 2 were the published revision, an error would be returned to the requester stating that another revision has to be published before that revision can be deleted.
Osf revisioning delete revision.png

Revisioning Graph

Every time a new dataset is created in OSF, a new "revisions" dataset is created at the same time. This dataset is where all the revisions of the records will be saved. The rules for creating the revisions datasets and the revision records are simple:

Here is an example of what these two datasets look like, and of the relations between the two:

Osf revisioning graphs.png
What this schema shows is that the record that is currently available in the dataset is the Rev 3 revision. As we saw above, the published record (the one that is available in the dataset) is not necessarily the last revision. If the Rev 4 revision is eventually published, then the current record in the dataset will be deleted and replaced by the Rev 4 record. The published pointer will then target the Rev 4 revision record.
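
The scenarios above and the published pointer can be sketched with a small in-memory model in Python. This is only an illustration under stated assumptions: the class, method, and exception names are hypothetical, triples are plain tuples of strings, and a real implementation would work against the two datasets in the triple store.

```python
from dataclasses import dataclass, field
from typing import Dict, FrozenSet, Optional, Tuple

Triple = Tuple[str, str, str]


class PublishedRevisionError(Exception):
    """Raised when the published revision is about to be deleted."""


@dataclass
class RecordHistory:
    # Revision id -> complete description; dicts keep creation order.
    revisions: Dict[str, FrozenSet[Triple]] = field(default_factory=dict)
    published: Optional[str] = None                  # the "published" pointer
    dataset_record: FrozenSet[Triple] = frozenset()  # copy in the main dataset

    def add_revision(self, rev_id: str, triples, publish: bool = True) -> None:
        self.revisions[rev_id] = frozenset(triples)
        if publish:
            self.publish(rev_id)

    def publish(self, rev_id: str) -> None:
        # Replace the record in the dataset and move the published pointer.
        self.dataset_record = self.revisions[rev_id]
        self.published = rev_id

    def delete(self, rev_id: str) -> None:
        if rev_id == self.published:
            raise PublishedRevisionError("publish another revision first")
        del self.revisions[rev_id]


history = RecordHistory()
for n in (1, 2, 3):
    history.add_revision(f"rev{n}", {("ex:record", "ex:version", str(n))})
# A newer revision that waits for moderation: rev3 stays published.
history.add_revision("rev4", {("ex:record", "ex:version", "4")}, publish=False)
history.publish("rev2")  # revert to a previous revision
history.delete("rev3")   # allowed, since rev3 is no longer published
```

Because each revision holds the complete description, publishing or reverting is a single copy of one revision into the dataset.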

Revisioning Vocabulary

These additions require adding new vocabulary to the WSF Ontology (Web Service Framework Ontology), the ontology for describing instances of the OSF Web services framework. This new revisioning vocabulary is:

  • Classes
    • wsf:Revision
    • wsf:RevisionStatus
  • Properties
    • wsf:revisionUri
      • This is the URI of the record to be revisioned
    • wsf:fromDataset
      • This is the URI of the dataset where this record is published
    • wsf:revisionTime
      • This is the Unix time stamp (which includes microseconds) when the revision got created
      • As shown in the revisions structure above, the revisions are ordered in a linear time series. This means that the sequence of revisions is determined by the time when they got created. The sequence can be re-created by ordering them by these time stamps
      • The value of this property is filled at creation time of the revision.
    • wsf:performer
      • Refers to the user that made the change
    • wsf:revisionStatus
      • Specifies the current status of the revision
  • Named Individuals
    • Of type wsf:RevisionStatus
      • wsf:published
        • Specifies that the revision is the one, within the revisions sequence, that is currently published in the dataset
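
As an illustration of how this vocabulary could be consumed, the following Python sketch models a revision record and rebuilds the revision sequence by ordering on the wsf:revisionTime value. The Revision class and the two helper functions are hypothetical names introduced for this example only; the namespace is the wsf one used in this document.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional

WSF = "http://purl.org/ontology/wsf#"


@dataclass(frozen=True)
class Revision:
    uri: str                      # URI of the revision record itself
    revision_uri: str             # wsf:revisionUri
    from_dataset: str             # wsf:fromDataset
    revision_time: float          # wsf:revisionTime
    performer: str                # wsf:performer
    status: Optional[str] = None  # wsf:revisionStatus


def revision_sequence(revisions: Iterable[Revision]) -> List[Revision]:
    """Re-create the linear sequence by ordering on the time stamps."""
    return sorted(revisions, key=lambda r: r.revision_time)


def published_revision(revisions: Iterable[Revision]) -> Optional[Revision]:
    """Return the revision whose status is wsf:published, if any."""
    return next((r for r in revisions if r.status == WSF + "published"), None)
```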

Revision Example

Here is an example of a published record, followed by one of the revisions that exist for that same record. It shows how the revisioning vocabulary is used to describe the revisions saved into the revisions graph.

@prefix owl: <http://www.w3.org/2002/07/owl#> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix iron: <http://purl.org/ontology/iron#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix wsf: <http://purl.org/ontology/wsf#> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix geo: <http://www.w3.org/2003/01/geo/wgs84_pos#> .
@prefix dcterms: <http://purl.org/dc/terms/> .

<http://localhost/datasets/HistImages/histImage_5> a foaf:Image ;
  foaf:thumbnail """http://localhost/files/images-datasets/thumbs/tb_iowa_city_depot.jpg""" ;
  foaf:page """http://en.wikipedia.org/wiki/Chicago,_Rock_Island_and_Pacific_Railroad_Passenger_Station""" ;
  foaf:img """http://localhost/files/images-datasets/iowa_city_depot.jpg""" ;
  foaf:topic <http://purl.org/ontology/muni#Historic_buildings> ;
  foaf:topic <http://purl.org/ontology/muni#Railway_stations> ;
  iron:prefLabel """Iowa City Old Depot""" ;
  geo:lat """41.65361""" ;
  geo:long """-91.53361""" .

As you can see below with the revision record:

  • All of the triples of the published record are part of the revision's record description
    • This enables us to analyze all the revision records using SPARQL queries
  • The URI of the revision record is different
  • All the additional triples required by the revisioning system are confined to the wsf ontology namespace
    • This means that if we want to recreate the initial state of the record that led to a particular revision, we can easily do this by:
      • Replacing the URI of the revision with the value of the wsf:revisionUri property
      • Removing all the revisioning properties and the wsf:Revision class assertion
<http://localhost/datasets/HistImages/revisions/c99a11a53a3748269e3f86d7ac38df11> a foaf:Image ;
  a wsf:Revision ;
  wsf:revisionUri <http://localhost/datasets/HistImages/histImage_5> ;
  wsf:fromDataset <http://localhost/datasets/HistImages/> ;
  wsf:revisionTime """1368196492""" ;
  wsf:performer <http://localhost/user/1> ;
  wsf:revisionStatus wsf:published ;  
  foaf:thumbnail """http://localhost/files/images-datasets/thumbs/tb_iowa_city_depot.jpg""" ;
  foaf:page """http://en.wikipedia.org/wiki/Chicago,_Rock_Island_and_Pacific_Railroad_Passenger_Station""" ;
  foaf:img """http://localhost/files/images-datasets/iowa_city_depot.jpg""" ;
  foaf:topic <http://purl.org/ontology/muni#Historic_buildings> ;
  foaf:topic <http://purl.org/ontology/muni#Railway_stations> ;
  iron:prefLabel """Iowa City Old Depot""" ;
  geo:lat """41.65361""" ;
  geo:long """-91.53361""" .
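
The two restoration steps listed above can be sketched in Python as follows. This is a minimal sketch that assumes a revision is given as a list of (subject, predicate, object) tuples of full URIs or literal strings, that every triple has the revision as its subject, and that the record itself uses no wsf properties; restore_record is a hypothetical helper, not an OSF function.

```python
from typing import List, Tuple

Triple = Tuple[str, str, str]

WSF = "http://purl.org/ontology/wsf#"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def restore_record(revision: List[Triple]) -> List[Triple]:
    """Rebuild the record description that led to a revision."""
    # The URI of the original record is the value of wsf:revisionUri.
    record_uri = next(o for _, p, o in revision if p == WSF + "revisionUri")
    restored = []
    for _, p, o in revision:
        if p.startswith(WSF):
            continue  # drop the revisioning properties
        if p == RDF_TYPE and o == WSF + "Revision":
            continue  # drop the wsf:Revision class assertion
        restored.append((record_uri, p, o))
    return restored


REV = "http://localhost/datasets/HistImages/revisions/c99a11a53a3748269e3f86d7ac38df11"
RECORD = "http://localhost/datasets/HistImages/histImage_5"

revision = [
    (REV, RDF_TYPE, "http://xmlns.com/foaf/0.1/Image"),
    (REV, RDF_TYPE, WSF + "Revision"),
    (REV, WSF + "revisionUri", RECORD),
    (REV, WSF + "revisionStatus", WSF + "published"),
    (REV, "http://purl.org/ontology/iron#prefLabel", "Iowa City Old Depot"),
]
record = restore_record(revision)
```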

Diff Algorithm

One of the requirements is to be able to diff two given revisions of the same record. This functionality will be exposed as a new Web service, as outlined below. This Web service will compare two revisions of the same record and will describe all the changes between the two revisions as a ChangeSet.

The basic RDF Diff algorithm that will be implemented is:

  • Get as input the two complete descriptions of the same record, taken from two different revisions. Refer to them as version-1 and version-2
  • Parse both version-1 and version-2 into two sets of triples
    • Iterate over the version-1 triples
      • For each version-1 triple, look for an identical triple in the version-2 set. If you don't find a match, reify that version-1 triple as an rdf:Statement, and add that rdf:Statement as a cs:removal to the ChangeSet.
    • Iterate over the version-2 triples
      • For each version-2 triple, look for an identical triple in the version-1 set. If you don't find a match, reify that version-2 triple as an rdf:Statement, and add that rdf:Statement as a cs:addition to the ChangeSet.

Then the new Web service endpoint will return that ChangeSet in its resultset.
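
The algorithm above can be sketched in a few lines of Python. This is a simplified illustration, not the Web service implementation: it assumes both versions are already parsed into (subject, predicate, object) tuples that use the same subject URI, it represents each reified statement as a plain dictionary, and the function names and the "removal" and "addition" keys are choices made for this example.

```python
from typing import Dict, Iterable, List, Tuple

Triple = Tuple[str, str, str]


def reify(triple: Triple) -> Dict[str, str]:
    """Describe a triple as an rdf:Statement."""
    s, p, o = triple
    return {
        "rdf:type": "rdf:Statement",
        "rdf:subject": s,
        "rdf:predicate": p,
        "rdf:object": o,
    }


def diff_revisions(
    version_1: Iterable[Triple], version_2: Iterable[Triple]
) -> Dict[str, List[Dict[str, str]]]:
    """Return the removals and additions that turn version-1 into version-2."""
    v1, v2 = set(version_1), set(version_2)
    return {
        # Triples of version-1 with no identical triple in version-2.
        "removal": [reify(t) for t in sorted(v1 - v2)],
        # Triples of version-2 with no identical triple in version-1.
        "addition": [reify(t) for t in sorted(v2 - v1)],
    }
```

Using set differences gives the same result as the two iterations described above, since each loop only keeps the triples that have no identical match in the other version.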

Revisioning and CRUD Web Service Endpoints Overview

This is a summary and overview of the different Revisioning and existing CRUD Web service endpoints, outlining the roles and goals of each endpoint in this new revisioning and publication environment:

  • CRUD: Create
    • It is used to create the first version of a record. The first version of a record is indexed in the core dataset, and not (initially) into the revisions dataset.
    • It is used to reload the Solr index with the description of the published version of the record(s)
      • mode = full – Index in both the triple store (Virtuoso) and search index
      • mode = triplestore – Index in the triple store (Virtuoso) only. This mode cannot be used if the record already exists.
      • mode = searchindex – Re-index the records in the search index (Solr) using the triples currently indexed into the triple store. This mode can only be used on published records. The payload of this query can be composed of records that only have a single type triple, since the other information won't be used by the endpoint to populate the search index.
        • Note about this mode: if a record gets unpublished but a revision still exists for that record, then a "WS-CRUD-CREATE-313" error will be returned by the endpoint. The reason for this behavior is that only published records can be reloaded into Solr using this mode. If unpublished records could be reloaded into the Solr index, they would be visible in the Search endpoint but not on any other endpoint, such as CRUD: Read, and inconsistencies between published and unpublished records would arise.
  • CRUD: Delete
    • It is used to delete a published record from the core dataset. The endpoint exposes two options: delete all the revisions of that record at the same time, or only delete the published record in the core dataset while keeping the revisions in the revisions graph. With the second option, the record can be restored by marking one of its revisions as published, which re-creates it in the core dataset.
  • CRUD: Read
    • It is used to read the published revision of a record in the core dataset
  • CRUD: Update
    • It is used to update the published version of a record
    • It is also used to create new (unpublished) revisions of a record. These revisions would be potential future published revisions of the record
  • Revision: Read
    • It is used to read (get all the triples of) a specific revision record
  • Revision: Delete
    • It is used to delete a specific revision record
  • Revision: Update
    • It is used to update the lifecycle stage of a revision
  • Revision: Lister
    • It is used to get the full listing of revisions for a given record
  • Revision: Diff
    • It is used to compare two revisions of the same record
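
The mode rules of the CRUD: Create endpoint described in this overview can be summarized with the following Python sketch. It is a simplification under stated assumptions: the function, its parameters, and the exception class are hypothetical, the only error identifier taken from this document is "WS-CRUD-CREATE-313", and the error raised for the triplestore mode carries no identifier because none is given here.

```python
from typing import Optional

MODES = ("full", "triplestore", "searchindex")


class CreateModeError(Exception):
    """Hypothetical error type carrying an optional endpoint error id."""

    def __init__(self, message: str, error_id: Optional[str] = None) -> None:
        super().__init__(message)
        self.error_id = error_id


def validate_create_request(
    mode: str, record_exists: bool, record_published: bool
) -> None:
    """Raise if the requested mode cannot be used for this record."""
    if mode not in MODES:
        raise ValueError(f"unknown mode: {mode}")
    if mode == "triplestore" and record_exists:
        # The triple store only mode cannot be used on an existing record.
        raise CreateModeError("record already exists")
    if mode == "searchindex" and not record_published:
        # Only published records can be reloaded into the search index.
        raise CreateModeError(
            "only published records can be re-indexed", "WS-CRUD-CREATE-313"
        )
```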