Data Validator Tool: Validating Dataset Content Based On Ontology Descriptions

Introduction
The Data Validator Tool (DVT) is a command line tool used to validate the data records indexed in OSF datasets against the descriptions in the loaded ontologies. Depending on how the ontologies are described, the DVT validates the content of the datasets and reports possible issues. The DVT is a post-indexation validation mechanism: it doesn't enforce any data validation at indexation time, but reports validation issues when it is run against OSF. Once validation errors are detected, different mechanisms have to be put in place to fix them.

This document explains how the DVT should be used, how the current data validation tests work, and how the reported errors should be interpreted. It also explains how the ontologies should be described, using the Protégé ontology editor, in order to take full advantage of the DVT validation tests.

Installation & Configuration
All the installation & configuration steps are directly available on the Data Validator Tool page.

Command Line
Using the DVT command line tool is pretty easy. Its command line options and parameters are:

 Usage: dvt [OPTIONS]

 Usage examples:

   Validate data: dvt -v

 Options:

   --output-xml="[PATH]"     Output the validation reports, in XML format, in the file specified by the path.
   --output-json="[PATH]"    Output the validation reports, in JSON format, in the file specified by the path.
   --allocated-memory="M"    Specifies the number of Mb of memory allocated to the DVT.
   -v                        Run all the data validation tests.
   -s                        Silent. Do not output anything to the shell.
   -f, --fix                 Tries to automatically fix a validation test that fails.
                             Note: not all checks support this option.
   -h, --help                Show this help section.

Let's take a deeper look at each of these parameters. Note that any parameter can be combined with any other parameter:
 * 1) -v: Starts the validation process. If you don't specify it, no validation will be performed by the DVT.
 * 2) -s: Nothing will be output to the shell terminal. This is usually used when an external tool performs automated validation using the DVT.
 * 3) -f, --fix: Asks each validation test to try to automatically fix the failures it finds. This option is not supported by all validation checks, so only the ones that support it will try to fix the validation issues.
 * 4) -h, --help: Outputs this help section to the shell terminal.
 * 5) --output-xml: Writes all the tests, warnings and errors into an XML file, as specified by the path value. Make sure that the user that runs the DVT has write permission on the specified path. This is normally used to log validation runs.
 * 6) --output-json: Writes all the tests, warnings and errors into a JSON file, as specified by the path value. Make sure that the user that runs the DVT has write permission on the specified path. This is normally used to log validation runs.
 * 7) --allocated-memory: The specified amount of memory will be used by the DVT to run the tests. Depending on the size of the datasets and the tests defined within the ontologies, more memory may be required for the DVT to work normally.

Automatic Validation Error Fixing
Some of the validation checks support the automatic error fixing command line option (-f, --fix). If a check supports that option, it will run an internal procedure to try to fix the validation error itself. Be careful to read the fixing section of each test to see how its validation errors get fixed.

When a validation error gets fixed, it means that the description of the record that failed the validation test gets modified such that the test doesn't fail again. All the automatic error fixing procedures use the OSF update web service endpoint, and specify that a revision be created for the updated record. This means that every record modified by one of the validation procedures gets revisioned, so all the fix changes can be rolled back using the revisioning web service endpoint.

Finally, all fixes are recorded into the log file if the --output-xml or --output-json command line options have been specified for the DVT command.

=Data Validation Tests=

Overview
The DVT includes a series of data validation tests that can be used to test the completeness and consistency of the instance records indexed in OSF. If a test fails for a given record, the error is reported, explained and logged depending on the DVT parameters that have been specified. These validation tests cover the most common data validation use cases. A test can be used in different ways to validate different things within a dataset; each of these ways is explained below within each test description.

In this section, each test is introduced, followed by a description of the way it works. If more technical background is required, a specific section calls it out. Another section explains the different ways you can define the ontologies and their impact on the test. Finally, an explanation of how the reported errors and warnings should be interpreted is provided.

Introduction
The URI existence test checks whether the URIs referenced by records exist within OSF or not. If a record references an undeclared record (because of a missing URI), an error is reported.

How it Works
This test gets the list of all the records that are referenced by other records but that are not currently defined in OSF. For each of these undefined records, an error is returned.

It checks the values of all the triples, with the exception of one excluded property: all the triples where that property is the predicate are ignored by this test.

Technical Explanation
In RDF, everything is a triple. A triple is a 3-tuple of the form &lt;subject, predicate, object&gt;. Every record is described by one or more of these triples. The subject is the record being described. The predicate is a property/attribute of that record. The object is the value of that property.

In RDF, the object can loosely be one of two things:
 * 1) a literal value
 * 2) a reference to another record

What the URI existence test does is get the complete list of all the objects that are references to other records. Once this list is compiled, the test validates that the referenced records are described in OSF, in the same or another dataset/ontology. This heuristic has been implemented as a SPARQL query that is used internally.
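As a rough illustration of the heuristic just described (the DVT itself runs it as a SPARQL query inside OSF; the names below are illustrative, not the DVT's API), triples can be modeled as plain tuples and the dangling references computed with two set operations:

```python
# Sketch of the URI-existence heuristic: find object URIs that are
# referenced by some triple but never described as a subject anywhere.
# Literal objects are plain strings; URI references are tagged URIRef.

class URIRef(str):
    """Marks a string as a URI reference rather than a literal value."""

def undefined_references(triples):
    defined = {s for s, p, o in triples}                       # every described record
    referenced = {o for s, p, o in triples if isinstance(o, URIRef)}
    return referenced - defined                                # referenced but undescribed

triples = [
    (URIRef("ex:bob"), "ex:name", "Bob"),
    (URIRef("ex:bob"), "ex:knows", URIRef("ex:alice")),        # ex:alice is never described
]

print(sorted(undefined_references(triples)))  # ['ex:alice']
```

Each URI left in the returned set would be reported as a URI-EXISTENCE-100 error against the record that references it.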

Automatic Validation Error Fixing
If the --fix parameter is specified for a DVT command, the URI existence test will try to fix all the validation errors that occurred. The fix applied is that any triple whose object is a URI that doesn't exist in any other dataset, or any other ontology, is deleted from the dataset.

However, the DVT uses the revisioning capabilities of OSF when it automatically fixes errors. This means that it will always be possible to revert changes performed by the DVT by using the revisioning web service endpoints.

Fixing Exceptions
There is one kind of triple that cannot be fixed by this check: if the predicate of a dangling value is the type property (rdf:type), the triple won't be fixed. It will be reported to the user interface and in the XML or JSON logs, but it won't be fixed.

The reason is simple: if we removed the type associated with a record, we would untype that record unnecessarily. Instead, the issue is reported so that the data maintainers can fix the type by hand, or create the class representing that type in one of the loaded ontologies.
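A minimal sketch of this fix-with-exception logic, again over triples as tuples (the function name and the tagging class are illustrative assumptions, not the DVT's internals):

```python
# Sketch of the automatic fix: triples whose object is a dangling URI
# reference are deleted, EXCEPT rdf:type triples, which are only reported
# because deleting them would untype the record.

class URIRef(str):
    """Marks a string as a URI reference rather than a literal."""

RDF_TYPE = "rdf:type"

def fix_dangling_references(triples):
    defined = {s for s, p, o in triples}
    kept, reported_only = [], []
    for s, p, o in triples:
        dangling = isinstance(o, URIRef) and o not in defined
        if dangling and p == RDF_TYPE:
            reported_only.append((s, p, o))   # report it, but keep the triple
            kept.append((s, p, o))
        elif not dangling:
            kept.append((s, p, o))            # healthy triple, untouched
        # dangling non-type triples are silently dropped: the "fix"
    return kept, reported_only

triples = [
    (URIRef("ex:bob"), "ex:knows", URIRef("ex:ghost")),       # deleted
    (URIRef("ex:bob"), RDF_TYPE, URIRef("ex:MissingClass")),  # reported only
    (URIRef("ex:bob"), "ex:name", "Bob"),                     # kept
]
fixed, reported = fix_dangling_references(triples)
```

In the real tool the deletion goes through the OSF update endpoint with revisioning enabled, so the dropped triples remain recoverable.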

Logging Error Fixes
All the fixes are logged into the XML or JSON log files if the --output-xml and/or --output-json options were specified in the DVT command. In this section we explain how to interpret the fixes reported in the log files for this check.

XML Logs Files
Here is the explanation for the meaning of each element of that file:

Here is an example of such a (partial) XML log file that includes the fixes reports:

JSON Logs Files
Here is the explanation for the meaning of each element of that file:

Here is an example of such a (partial) JSON log file that includes the fixes reports:

Errors
{| border="1" cellpadding="5" cellspacing="0"
! URI-EXISTENCE-100
|-
| Description
| This error is returned when a URI is used as an object reference but is not currently defined in any dataset accessible by the DVT. This means that an "undefined" URI has been referenced by another record within the datasets.
|-
| Fields
|
|}

Warnings
{| border="1" cellpadding="5" cellspacing="0"
! URI-EXISTENCE-50
|-
| Description
| This warning is returned when the test couldn't check whether the referenced URIs exist in the OSF instance. This means that the SPARQL query failed to execute.
|-
| Fields
| No additional fields
|}

{| border="1" cellpadding="5" cellspacing="0"
! URI-EXISTENCE-51
|-
| Description
| We couldn't get the list of affected records from the OSF instance.
|-
| Fields
| No additional fields
|}

{| border="1" cellpadding="5" cellspacing="0"
! URI-EXISTENCE-52
|-
| Description
| We couldn't read the description of an affected record from the OSF instance.
|-
| Fields
| No additional fields
|}

{| border="1" cellpadding="5" cellspacing="0"
! URI-EXISTENCE-53
|-
| Description
| We couldn't update the description of an affected record in the OSF instance.
|-
| Fields
| No additional fields
|}

Property Validation
Properties, the middle part of an RDF triple, may be one of three kinds: 1) a datatype property, for which the object is a value that conforms to a specific datatype; 2) an object property, for which the object is another instance denoted by a URI; or 3) an annotation property, for which the object is a literal (string) value. Both datatype and object properties may be further defined using the concepts of domain and range, as described below. Annotation properties do not have domains or ranges. This section describes how the DVT validates against ranges and domains.

Introduction
This test checks whether the datatypes defined for all the datatype properties in use have been respected and are valid. With this test, we make sure that all the expected value types have been respected when the data was indexed into OSF.

How it Works
The heuristic used by this check is as follows:
 * 1) Get the list of all the properties that have a non-URI value and that have a range defined for them in one of the loaded ontologies
 * 2) For each of these datatype properties, get the list of all its values. At this step, we have two pieces of information about each value: the actual textual value, and the datatype of that value as defined in the triple store.
 * 3) For each value, make sure that the datatype defined for that value in the triple store is the same as the one defined in the ontology
 * 4) If the value's datatype is the same as the one defined in the ontology, validate the actual value according to internal XSD and RDFS data validation procedures
 * 5) If the actual value is not valid according to these internal validation tests, return a DATATYPE-PROPERTIES-DATATYPE-101 error
 * 6) If the value's datatype is not the same as the one defined in the ontology, return a DATATYPE-PROPERTIES-DATATYPE-100 error

Notes regarding this heuristic:
 * 1) If no range is defined for a property, its range is considered unrestricted, which means that no specific datatype is defined for the value and that any value can be used as a value of this property.
 * 2) Even if a value is tagged with a given datatype in the triple store, it doesn't mean that the value is actually valid for that datatype, since the triple store won't validate the value against the datatype but only tags it as being of that type. This is why the test has to be performed.
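The heuristic above can be sketched in a few lines of Python. This is a simplified stand-in, not the DVT's implementation: `ranges` and the two validators are illustrative assumptions, and the real tool supports the full datatype list given later in this document.

```python
# Sketch of the datatype-range heuristic. `ranges` maps a datatype
# property to the range declared in the ontology; each value carries the
# datatype it was tagged with in the triple store.
import re

VALIDATORS = {  # minimal stand-ins for the DVT's internal XSD validation
    "xsd:integer": lambda v: re.fullmatch(r"[+-]?\d+", v) is not None,
    "xsd:boolean": lambda v: v in ("true", "false", "1", "0"),
}

def check_value(prop, value, store_datatype, ranges):
    declared = ranges.get(prop)               # None -> range is unrestricted
    if declared is None:
        return "OK"
    if store_datatype != declared:
        return "DATATYPE-PROPERTIES-DATATYPE-100"   # datatype mismatch
    if not VALIDATORS.get(declared, lambda v: True)(value):
        return "DATATYPE-PROPERTIES-DATATYPE-101"   # right tag, bad value
    return "OK"

ranges = {"ex:age": "xsd:integer"}
print(check_value("ex:age", "42", "xsd:integer", ranges))     # OK
print(check_value("ex:age", "42", "xsd:string", ranges))      # DATATYPE-PROPERTIES-DATATYPE-100
print(check_value("ex:age", "forty", "xsd:integer", ranges))  # DATATYPE-PROPERTIES-DATATYPE-101
```

Note how the second call triggers a 100 error (store tag disagrees with the ontology) while the third triggers a 101 error (tags agree, but the lexical value is invalid), mirroring steps 5 and 6 above.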

Technical Explanation
In RDF, everything is a triple. A triple is a 3-tuple of the form &lt;subject, predicate, object&gt;. Every record is described by one or more of these triples. The subject is the record being described. The predicate is a property/attribute of that record. The object is the value of that property.

OWL is a specification framework used to create the ontologies that define the semantics of the properties/predicates/attributes and the types/classes used to describe the instance records indexed in OSF datasets.

When we define a property in an ontology, each predicate may have at least two different characteristics:
 * 1) It may have a domain
 * 2) It may have a range

The domain of a property relates to the left side of a triple. What the domain does is specify where the property can be used: which type/kind of record it can describe. That is, the domain for a given property defines the valid subject types to which it applies. If a record's type is not in the domain of a property, then that property cannot be used to describe that type of record.

The range of a property relates to the right side of a triple. What the range does is specify the datatype of the value (object) of the property. That is, the range for a given property defines the valid object types to which it can apply. For example, if a property's range is xsd:dateTime, then all the instance records that use this property need to have a value of type xsd:dateTime.

Specifying within an Ontology
For this data validation test to work, the ontologies loaded in OSF have to be properly defined. If no datatype is defined for a property, the test considers its datatype unrestricted, which is equivalent to saying that any value can be entered for that property. Otherwise, any datatype specified in any loaded ontology has a direct impact on this test.

When you edit an ontology in Protégé, you have a series of tabs, one of which is called "Data Properties". This is the tab where all the datatype properties defined in the ontology appear. If you click on any of these datatype properties on the left side of the application, you will see the property's complete description on the right side of the application.

There is one section highlighted on the right side that is of interest for this test: the range section. This is where the range of a property is defined in Protégé. Three buttons related to the range interest us particularly:
 * The add button is used to add a new datatype range to the property
 * The edit button is used to edit the current datatype range assignation of the property
 * The remove button is used to remove a datatype range assignation of the property

To add a new datatype to a given property, click the add ("+") button. A list of available datatypes will then appear. From that list, choose the datatype you want to specify for this property and click the "OK" button. Once you have added/modified/removed a datatype assignation for a property, you have to reload the ontology in OSF for the modification to be taken into account by the DVT.

Supported Datatypes
This validation test performs additional internal data validation procedures to make sure that the value is valid according to the specified datatype. Here is a list of all the supported datatypes:
 * 1) xsd:anyURI
 * 2) xsd:base64Binary
 * 3) xsd:boolean
 * 4) xsd:byte
 * 5) xsd:dateTime
 * 6) xsd:dateTimeStamp
 * 7) xsd:decimal
 * 8) xsd:double
 * 9) xsd:float
 * 10) xsd:hexBinary
 * 11) xsd:int
 * 12) xsd:integer
 * 13) xsd:language
 * 14) xsd:long
 * 15) xsd:Name
 * 16) xsd:NCName
 * 17) xsd:negativeInteger
 * 18) xsd:NMTOKEN
 * 19) xsd:nonNegativeInteger
 * 20) xsd:nonPositiveInteger
 * 21) xsd:normalizedString
 * 22) xsd:positiveInteger
 * 23) xsd:short
 * 24) xsd:string
 * 25) xsd:token
 * 26) xsd:unsignedByte
 * 27) xsd:unsignedInt
 * 28) xsd:unsignedLong
 * 29) xsd:unsignedShort
 * 30) rdfs:Literal
 * 31) rdf:PlainLiteral
 * 32) rdf:XMLLiteral

Errors
{| border="1" cellpadding="5" cellspacing="0"
! DATATYPE-PROPERTIES-DATATYPE-100
|-
| Description
| This error is returned when the datatype specified in the triple store and the range specified in the ontology for that property are different.
|-
| Fields
|
|}

{| border="1" cellpadding="5" cellspacing="0"
! DATATYPE-PROPERTIES-DATATYPE-101
|-
| Description
| This error is returned when the datatype specified in the triple store and the range specified in the ontology for that property are the same, but the actual indexed value is invalid according to the internal datatype validation procedures.
|-
| Fields
|
|}

Warnings
{| border="1" cellpadding="5" cellspacing="0"
! DATATYPE-PROPERTIES-DATATYPE-50
|-
| Description
| This warning is returned when a datatype property is used but no range is defined for it in any loaded ontology. No immediate action is required when this warning is emitted, but it shows areas where the ontologies may be updated/improved.
|-
| Fields
|
|}

{| border="1" cellpadding="5" cellspacing="0"
! DATATYPE-PROPERTIES-DATATYPE-51
|-
| Description
| This warning is returned when we couldn't get the list of datatype properties from the OSF instance. The SPARQL query failed in some way.
|-
| Fields
| No additional fields
|}

{| border="1" cellpadding="5" cellspacing="0"
! DATATYPE-PROPERTIES-DATATYPE-52
|-
| Description
| This warning is returned when we couldn't get the list of values for a specific property.
|-
| Fields
| No additional fields
|}

Introduction
This test checks whether all the properties are used to describe the proper instance records currently indexed in OSF, as defined in the loaded ontologies. Not all properties can be used to describe every type of instance record, so this test makes sure that each property is used to describe the proper types of instance records.

How it Works
The heuristic used by this check is as follows:
 * 1) Get the list of all the properties that are used to describe any record within OSF
 * 2) For each property, get the list of all the distinct types of all the records that use this property
 * 3) For each type, make sure that the type belongs to the domain defined for this property in the loaded ontologies
 * 4) If the type of one of the records doesn't belong to the domain of the property as described in the ontologies, an OBJECT-DATATYPE-PROPERTIES-DOMAIN-100 error is returned

Notes regarding this heuristic:
 * 1) If no domain is defined for a property, then its domain is considered unrestricted, which means that any type of instance record can use this property
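The domain heuristic above can be sketched with a simple membership check. As before, this is an illustrative model (the `domains` map and function name are assumptions), not the DVT's SPARQL-based implementation:

```python
# Sketch of the domain heuristic: for every property used on a record,
# at least one of the record's types must appear in the property's
# declared domain. A property absent from `domains` is unrestricted.

def check_domains(record_types, record_props, domains):
    errors = []
    for prop in record_props:
        domain = domains.get(prop)        # None -> unrestricted domain
        if domain is None:
            continue
        if not (record_types & domain):   # no overlap: domain violation
            errors.append(("OBJECT-DATATYPE-PROPERTIES-DOMAIN-100", prop))
    return errors

domains = {"ex:birthDate": {"ex:Person"}}
print(check_domains({"ex:Person"}, ["ex:birthDate"], domains))   # []
print(check_domains({"ex:Company"}, ["ex:birthDate"], domains))  # [('OBJECT-DATATYPE-PROPERTIES-DOMAIN-100', 'ex:birthDate')]
```

A record typed ex:Company described with ex:birthDate fails because ex:Company is not in the property's domain, which is exactly the condition behind the OBJECT-DATATYPE-PROPERTIES-DOMAIN-100 error documented below.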

Technical Explanation
In RDF, everything is a triple. A triple is a 3-tuple of the form &lt;subject, predicate, object&gt;. Every record is described by one or more of these triples. The subject is the record being described. The predicate is a property/attribute of that record. The object is the value of that property.

OWL is a specification framework used to create the ontologies that define the semantics of the properties/predicates/attributes and the types/classes used to describe the instance records indexed in OSF datasets.

When we define a property in an ontology, each predicate may have at least two different characteristics:
 * 1) It may have a domain
 * 2) It may have a range

The domain of a property relates to the left side of a triple: it specifies where the property can be used, i.e. which type/kind of record it can describe. If a record's type is not in the domain of a property, then that property cannot be used to describe that type of record.

The range of a property relates to the right side of a triple: it specifies the datatype of the value (object) of the property. For example, if a property's range is xsd:dateTime, then all the instance records that use this property need to have a value of type xsd:dateTime.

Specifying within an Ontology
For this data validation test to work, the ontologies loaded in OSF have to be properly defined. If no domain is defined for a property, the test considers its domain unrestricted, which is equivalent to saying that the property can be used to describe any type of instance record. Otherwise, any domain specified in any loaded ontology has a direct impact on this test.

When you edit an ontology in Protégé, you have a series of tabs, one called "Object Properties" and another called "Data Properties". These are the tabs where all the object and datatype properties defined in the ontology appear. If you click on any of these properties on the left side of the application, you will see the property's complete description on the right side of the application.

Note that the following explanations are the same for the object and the datatype properties sections.

There is one section highlighted on the right side that is of interest for this test: the domain section. This is where the domain of a property is defined in Protégé. Three buttons related to the domain interest us particularly:
 * The add button ("+") is used to add a new domain to the property
 * The edit button is used to edit the current domain assignation of the property
 * The remove button is used to remove a domain assignation of the property

To add a new domain to a given property, click the "+" button. When clicked, a list of available domain types will appear. From that list, choose the type (class) you want to specify for this property and click the "OK" button. Once you have added/modified/removed a domain assignation for a property, you have to reload the ontology in OSF for the modification to be taken into account by the DVT.

Errors
{| border="1" cellpadding="5" cellspacing="0"
! OBJECT-DATATYPE-PROPERTIES-DOMAIN-100
|-
| Description
| This error is returned when the type of a record is not part of the domain of a property used to describe the record.
|-
| Fields
|
|}