Multilinguality Capabilities of OSF

Introduction
This document outlines the current multilingual capabilities of the CCR. It discusses how new languages can be supported by the CCR, describes the language capabilities of specific CCR web service endpoints, and finally explains how the content that is added to the CCR may be described with language markup.

UTF-8 Encoding
The information manipulated by the CCR endpoints is encoded in UTF-8. This means that all of the records that are created in the CCR should be encoded in UTF-8, and that all records returned by any web service endpoint are also encoded in UTF-8. All record information transmitted between the CCR endpoints and the Virtuoso and Solr servers is also transmitted using UTF-8 encoding.

Because of the universal use of UTF-8 encoding within the CCR, please take care to encode all language variants with UTF-8. Not following this rule will cause unexpected behaviour.
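As a minimal sketch of this rule, the following Python helper re-encodes content as UTF-8 before it is sent to an endpoint. The function name and the sample label are hypothetical and are not part of the CCR; the sketch only assumes that the source encoding of the content is known.

```python
def to_utf8(raw: bytes, source_encoding: str = "utf-8") -> bytes:
    """Return `raw` re-encoded as UTF-8.

    Decoding is strict, so a UnicodeDecodeError is raised when `raw`
    is not valid in `source_encoding`.
    """
    text = raw.decode(source_encoding)
    return text.encode("utf-8")


# A French label stored as Latin-1 is converted to UTF-8 before submission.
latin1_label = "Évaluation".encode("latin-1")
utf8_label = to_utf8(latin1_label, source_encoding="latin-1")
```

Calling to_utf8 with the default source encoding doubles as a validity check: content that is not already valid UTF-8 is rejected before it reaches the CCR.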

Specifying Language Tags
Language tags are used to indicate the language of text as specified below. These language tags use the equivalent of the xml:lang attribute for XML, the particular codes for which are defined in the IETF's BCP 47. (BCP stands for 'Best Current Practice', and is a persistent name for a series of RFCs whose numbers change as they are updated.) That document provides multiple tag options for various languages. Use the two-letter designator variant wherever possible (such as 'en' for English, 'fr' for French, 'de' for German, etc.). The full list of these codes may be found in the IANA registry.

For additional information about language codes and tags in semantic documents, see the W3C's Language tags in HTML and XML.

Multiple Language Capabilities of Endpoints
In this section, we cover all CCR web service endpoints that have multilingual capabilities. We will explain how they work, how they should be used, and what users should expect when using these endpoints.

If a structWSF instance is not configured for multilinguality, then the default language being used is 'en' (English). Under this default condition, all property strings (literal values) used in CCR records are understood to be English strings or literals.

CRUD: Create
The CRUD: Create web service endpoint is used to create new content in the CCR. All of the content that is created by this web service endpoint is serialized into RDF/XML or RDF/N3. Depending on how the RDF data is defined, three different behaviours may occur:


 * 1) When indexing content in Virtuoso:
   * 1) If no language is specified for a literal value, then:
     * 1) No specific language information is indexed in Virtuoso
   * 2) If a language is specified for a literal value, but this language is not configured on the CCR, then:
     * 1) The language tag defined in the input RDF document will be indexed in Virtuoso (even if not supported)
   * 3) If a language is specified for a literal value and this language is configured on the CCR, then:
     * 1) The language tag defined in the input RDF document will be indexed in Virtuoso
 * 2) When indexing content in Solr:
   * 1) If no language is specified for a literal value, then:
     * 1) The CRUD: Create endpoint stores this literal using the default language, which is 'en' (English)
   * 2) If a language is specified for a literal value, but this language is not configured on the CCR, then:
     * 1) The CRUD: Create endpoint stores this literal using the default language, which is 'en' (English)
   * 3) If a language is specified for a literal value and this language is configured on the CCR, then:
     * 1) The CRUD: Create endpoint properly stores and indexes this literal using the specified language.

The important point about these different behaviours is that everything indexed in Virtuoso is a faithful representation of the input RDF document. This means that even if a language is not supported, its language information will still be indexed in Virtuoso. That way, if the language becomes supported in the future, we will be able to use the DMT (Datasets Management Tool) to re-index in Solr the content that exists in Virtuoso, and so properly populate the updated Solr schema index.

Once the proper use case is detected, the endpoint takes care of properly indexing that information into the underlying data management systems, namely Virtuoso and Solr.

This means that what drives the multilingual capabilities of this CRUD: Create endpoint is the way the input RDF data is described. There are no input parameters related to multilinguality for this endpoint.
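The indexing rules above can be sketched as two small Python functions. This is an illustrative model only, not the structWSF implementation; the function names and the configured-language set are assumptions made for the example.

```python
from typing import Optional, Set

DEFAULT_LANGUAGE = "en"  # default language described in this document


def virtuoso_language(tag: Optional[str]) -> Optional[str]:
    """Virtuoso keeps the input as-is: the tag is stored whether or not it
    is configured, and no language is stored when none was given."""
    return tag


def solr_language(tag: Optional[str], configured: Set[str]) -> str:
    """Solr falls back to the default language when the tag is missing
    or is not configured on the instance."""
    if tag is not None and tag in configured:
        return tag
    return DEFAULT_LANGUAGE


# Hypothetical instance configured for English and French.
configured = {"en", "fr"}
cases = {tag: (virtuoso_language(tag), solr_language(tag, configured))
         for tag in (None, "de", "fr")}
```

The asymmetry is the point of the design: Virtuoso preserves what was submitted, so a later re-index can populate Solr correctly once a language becomes configured.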

Note: read the section "How to Describe RDF Data for Multilinguality" below to learn how to write the RDF document you transmit to the CRUD: Create endpoint to enable multilinguality.

Here is an example of such an RDF/XML document that uses two different languages for some of its literal descriptions:
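A minimal sketch of how such a document can be produced with Python's standard library. The record URI, the use of rdfs:label, and the label values are hypothetical and chosen only for illustration; the xml:lang attribute is the standard RDF/XML way to tag a literal's language.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
RDFS_NS = "http://www.w3.org/2000/01/rdf-schema#"
XML_NS = "http://www.w3.org/XML/1998/namespace"  # serialized as the xml: prefix

ET.register_namespace("rdf", RDF_NS)
ET.register_namespace("rdfs", RDFS_NS)


def build_record(uri: str, labels: dict) -> str:
    """Serialize one record to RDF/XML with one language-tagged label per
    entry in `labels` (language tag -> text)."""
    root = ET.Element(f"{{{RDF_NS}}}RDF")
    description = ET.SubElement(
        root, f"{{{RDF_NS}}}Description", {f"{{{RDF_NS}}}about": uri}
    )
    for lang, text in labels.items():
        label = ET.SubElement(description, f"{{{RDFS_NS}}}label")
        label.set(f"{{{XML_NS}}}lang", lang)  # emitted as xml:lang="..."
        label.text = text
    return ET.tostring(root, encoding="unicode")


# Hypothetical record with an English and a French label.
document = build_record(
    "http://example.org/records/1",
    {"en": "Cancer care", "fr": "Soins du cancer"},
)
print(document)
```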

CRUD: Update

The CRUD: Update web service endpoint is used to update existing content in the CCR. All of the content that is updated by this web service endpoint is serialized into RDF/XML or RDF/N3. Depending on how the RDF data being updated is defined, three different behaviours may occur. These are the same behaviours as those of the CRUD: Create web service endpoint described above:


 * 1) When indexing content in Virtuoso:
   * 1) If no language is specified for a literal value, then:
     * 1) No specific language information is indexed in Virtuoso
   * 2) If a language is specified for a literal value, but this language is not configured on the CCR, then:
     * 1) The language tag defined in the input RDF document is indexed in Virtuoso (even if not supported)
   * 3) If a language is specified for a literal value and this language is configured on the CCR, then:
     * 1) The language tag defined in the input RDF document is indexed in Virtuoso
 * 2) When indexing content in Solr:
   * 1) If no language is specified for a literal value, then:
     * 1) The CRUD: Update endpoint stores this literal using the default language, which is 'en' (English)
   * 2) If a language is specified for a literal value, but this language is not configured on the CCR, then:
     * 1) The CRUD: Update endpoint stores the literal using the default language, which is 'en' (English)
   * 3) If a language is specified for a literal value and this language is configured on the CCR, then:
     * 1) The CRUD: Update endpoint properly indexes this literal using the specified language.

The important point about these different behaviours is that everything indexed in Virtuoso is a faithful representation of the input RDF document. This means that even if a language is not supported, its language information will still be indexed in Virtuoso. That way, if the language becomes supported in the future, we will be able to use the DMT (Datasets Management Tool) to re-index in Solr the content that exists in Virtuoso, and so properly populate the updated Solr schema index.

Once the proper use case is detected, the endpoint takes care of properly updating that information in the underlying data management systems, namely Virtuoso and Solr.

This means that what drives the multilingual capabilities of this CRUD: Update endpoint is the way the input RDF data is described. There are no input parameters related to multilinguality for this endpoint.

Note: read the section "How to Describe RDF Data for Multilinguality" below to learn how to write the RDF document you transmit to the CRUD: Update endpoint to enable multilinguality.

CRUD: Read
The CRUD: Read web service endpoint is used to read content from the CCR. Its multilingual behaviour depends on how the RDF data currently indexed in the CCR has been described. If the lang parameter of the web service endpoint is omitted, then the default language is used, which is 'en' (English). Let's describe the different behaviours of the CRUD: Read web service endpoint depending on what is specified with the lang parameter, and on what is indexed in the CCR.

Here are the different behaviours that may occur with the CRUD: Read web service endpoint, depending on how it is being used (input parameters) and how the data is being described in the CCR:


 * 1) If the input parameter lang is not specified for the query, then:
   * 1) This means that the default 'en' language will be used by the endpoint
   * 2) The endpoint will then return:
     * 1) all of the triples where the value is a URI
     * 2) all of the triples where the literal values are defined to be using the language 'en' (English)
     * 3) all of the triples where the literal values have no defined languages (this means that the language string was not specified when the record got indexed using CRUD: Create)
 * 2) If the input parameter lang is specified with 'fr' (French) (for this example) for the query, then:
   * 1) The endpoint will then return:
     * 1) all of the triples where the value is a URI
     * 2) all of the triples where the literal values are defined to be using the language 'fr' (French)
     * 3) all of the triples where the literal values have no defined languages (this means that the language string was not specified when the record got indexed using CRUD: Create).
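The selection rules above can be modelled with a short Python sketch. It is a simplified illustration of the described behaviour, not the endpoint's code; the Triple type, the property names, and the sample values are hypothetical.

```python
from typing import List, NamedTuple, Optional


class Triple(NamedTuple):
    subject: str
    predicate: str
    value: str
    is_uri: bool = False        # True when the value is a URI, not a literal
    lang: Optional[str] = None  # language tag of a literal, if any


def filter_by_language(triples: List[Triple], lang: str = "en") -> List[Triple]:
    """Keep URI values, literals tagged with `lang`, and untagged literals."""
    return [t for t in triples if t.is_uri or t.lang is None or t.lang == lang]


# Hypothetical record with an English label, a French label, an untagged
# literal, and a URI value.
record = [
    Triple("ex:1", "ex:label", "Cancer care", lang="en"),
    Triple("ex:1", "ex:label", "Soins du cancer", lang="fr"),
    Triple("ex:1", "ex:identifier", "10695"),
    Triple("ex:1", "ex:seeAlso", "http://example.org/other", is_uri=True),
]

default_view = filter_by_language(record)       # lang omitted: 'en' is used
french_view = filter_by_language(record, "fr")  # lang=fr
```

In both views the untagged literal and the URI value are kept; only the language-tagged labels differ.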

Now, let's check a few examples of what will be returned depending on the CRUD: Read requests that are being sent, and what is indexed in the CCR:

In the above RDF/XML document, we can see that we have a property defined with an English value, and another one with a French value. Then we have a property whose value is not specified with any particular language string. Now let's see what the RDF/XML document returned by the CRUD: Read endpoint looks like, depending on these input parameters:


 * CRUD: Read query, input parameters:
   * uri: http://ccr.nhccn.com.au/datasets/global/documents/10695
   * dataset: http://ccr.nhccn.com.au/datasets/global/documents/
 * Returned resultset:

Since the lang parameter is unspecified, 'en' (English) is being used by the endpoint. That means that the English values are returned, along with the values that have no language specified for them. Now let's see what happens when we define the language parameter for the French language:


 * CRUD: Read query, input parameters:
   * uri: http://ccr.nhccn.com.au/datasets/global/documents/10695
   * dataset: http://ccr.nhccn.com.au/datasets/global/documents/
   * lang: fr
 * Returned resultset:

Since the lang parameter is set to 'fr' (French), the French strings are returned. Also, given the behaviour outlined at the beginning of this section, all the values without any language specified are returned as well.

Ontology: Read

The Ontology: Read web service endpoint is used to read ontology Classes, Properties and Named Individuals content from the defined ontologies on the CCR. Exactly the same behaviour as the CRUD: Read endpoint discussed above applies here as well.

Search
The Search web service endpoint is used to read content from the CCR. Its multilingual behaviour depends on how the RDF data currently indexed in the CCR has been described. If the lang parameter of the web service endpoint is omitted, then the default language is used, which is 'en' (English). Let's describe the different behaviours of the Search web service endpoint depending on what is specified with the lang parameter, and on what is indexed in the CCR.

Here are the different behaviours that may occur with the Search web service endpoint, depending on how it is being used (input parameters) and how the data is being described in the CCR:


 * 1) If the input parameter lang is not specified for the query, then:
   * 1) This means that the default 'en' language will be used by the endpoint
   * 2) The endpoint will then return:
     * 1) all of the triples where the value is a URI
     * 2) all of the triples where the literal values are defined to be using the language 'en' (English)
     * 3) all of the triples where the literal values have no defined languages (this means that the language string was not specified when the record got indexed using CRUD: Create)
 * 2) If the input parameter lang is specified with 'fr' (French) for the query, then:
   * 1) The endpoint will then return:
     * 1) all of the triples where the value is a URI
     * 2) all of the triples where the literal values are defined to be using the language 'fr' (French)

Now, let's check a few examples of what will be returned depending on the Search requests that are being sent, and what is indexed in the CCR:

In the above RDF/XML document, we can see that we have a property defined with an English value, and another one with a French value. Then we have a property whose value is not specified with any particular language string. Now let's see what RDF/XML document will be returned by the Search endpoint depending on the input parameters:


 * Search query, input parameters:
   * : http://ccr.nhccn.com.au/datasets/global/documents/10695
   * : http://ccr.nhccn.com.au/datasets/global/documents/
 * Returned resultset:

Since the lang parameter is unspecified, 'en' (English) is being used by the endpoint. This default means that the English values are returned, along with the values that have no language specified for them. Now let's see what happens when we define the language parameter for the French language:


 * Search query, input parameters:
   * : http://ccr.nhccn.com.au/datasets/global/documents/10695
   * : http://ccr.nhccn.com.au/datasets/global/documents/
   * lang: fr
 * Returned resultset:

Since the lang parameter is set to 'fr' (French), only the French strings are returned.

SPARQL

The SPARQL web service endpoint is used to read content from the CCR using SPARQL queries. The multilingual behaviour of the SPARQL web service endpoint depends on how the RDF data currently indexed in the CCR has been described. The way to get the language tag of a literal value is by properly creating your SPARQL queries. In this section, we will see how we can create SPARQL queries to get the description of the records with their language definition.

Let's take this input RDF that will be indexed into the CCR:

The only MIME types supported by the SPARQL endpoint that return the language definition of a literal value are:


 * 1) application/sparql-results+xml (SPARQL resultset in XML)
 * 2) application/sparql-results+json (SPARQL resultset in JSON)

Then, the kind of SPARQL query that we have to create to get the language definition of a literal value is:

What will be returned by such a query, if the  is being used, is:

As you can see, if no language tag is defined in the input RDF, then the literal binding is returned without any language attribute. However, if there is one, then the binding carries the language tag of the literal.
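As a minimal sketch, the following Python code reads the language of each literal from a resultset in the application/sparql-results+xml format, where the W3C format carries the language in an xml:lang attribute on the literal element. The sample resultset is hypothetical and is not actual CCR output.

```python
import xml.etree.ElementTree as ET

SPARQL_NS = "{http://www.w3.org/2005/sparql-results#}"
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Hypothetical resultset with an English, a French, and an untagged literal.
SAMPLE_RESULTSET = """<sparql xmlns="http://www.w3.org/2005/sparql-results#">
  <head><variable name="o"/></head>
  <results>
    <result><binding name="o"><literal xml:lang="en">Cancer care</literal></binding></result>
    <result><binding name="o"><literal xml:lang="fr">Soins du cancer</literal></binding></result>
    <result><binding name="o"><literal>10695</literal></binding></result>
  </results>
</sparql>"""


def literal_languages(resultset_xml: str):
    """Return (text, language tag or None) for every literal binding."""
    root = ET.fromstring(resultset_xml)
    return [(lit.text, lit.get(XML_LANG)) for lit in root.iter(SPARQL_NS + "literal")]


languages = literal_languages(SAMPLE_RESULTSET)
```

A literal without a language tag yields None, which mirrors the untagged binding described above.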

How to Configure the CCR for Multilinguality
This section explains how to configure the various pieces in a CCR instance in order to properly provide multilingual support. There are basically two pieces to configure: structWSF and Solr.

How to Configure the structWSF for Multilinguality
If you want to support more than one language with the structWSF web service endpoints, then you have to properly configure structWSF to support more than one distinct language.

Configuring structWSF to support a new language is straightforward. The only thing that is required is to edit the structWSF configuration file and add the new language strings to the configuration. A configuration file set up to support both English and French is defined like this:

You can easily add new languages just by adding them to that array.

You can add a new language at any time. The only thing you have to do is to change this configuration, then make sure you properly describe the RDF documents you are indexing using this new language string. The real impact is on the Solr configuration that we will see below. Other than that, languages can be added and removed at any time in that configuration file.

How to Configure Solr for Multilinguality
Supporting more languages in structWSF does have big impacts on the Solr schema used by a structWSF instance. The method used to support multiple languages for the same record is simple:


 * A new set of fields is created in the Solr schema for that new language we want to support

The method is simple, but the impact on the schema.xml file is big.

What this means is that for every schema field that has the language suffix "_en", you have to duplicate that field to support a new language. Let's take this Solr schema.xml file that currently only supports the English language:

Now, let's extend it to support a new language, namely the French language. What we have to do is to create a series of new fields with the "_fr" suffix like this:
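As a minimal sketch, this duplication can be scripted with Python's standard library. The schema fragment and its field names below are hypothetical and much smaller than a real structWSF schema; the sketch only assumes that language-specific fields are field or dynamicField elements whose name ends with the language suffix.

```python
import copy
import xml.etree.ElementTree as ET

# Hypothetical schema fragment, for illustration only.
SAMPLE_SCHEMA = """<schema name="example" version="1.5">
  <fields>
    <field name="uri" type="string" indexed="true" stored="true"/>
    <field name="prefLabel_en" type="text" indexed="true" stored="true"/>
    <dynamicField name="*_attr_en" type="text" indexed="true" stored="true"/>
  </fields>
</schema>"""


def add_language_fields(schema_xml: str, new_lang: str, base_lang: str = "en") -> str:
    """Clone every field whose name ends with _<base_lang> into a sibling
    field ending with _<new_lang>, leaving the originals in place."""
    root = ET.fromstring(schema_xml)
    suffix = f"_{base_lang}"
    # Snapshot the tree first so appended clones are not revisited.
    for parent in list(root.iter()):
        for field in list(parent):
            name = field.get("name", "")
            if field.tag in ("field", "dynamicField") and name.endswith(suffix):
                clone = copy.deepcopy(field)
                clone.set("name", name[: -len(suffix)] + f"_{new_lang}")
                parent.append(clone)
    return ET.tostring(root, encoding="unicode")


extended_schema = add_language_fields(SAMPLE_SCHEMA, "fr")
print(extended_schema)
```

The clones keep the type attribute of the original field. If the new language needs its own analyzer, the type of the cloned fields has to be adjusted by hand.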

Once this new schema is in place and used by the running Solr instance(s), this language will be handled and properly indexed into Solr every time CRUD: Create, CRUD: Update, Ontology: Create or Ontology: Update is used.

Once these two configurations are in place, you are ready to index new content that uses this newly configured language. All the web service endpoints outlined above will then work with that newly added language.

How to Describe RDF Data for Multilinguality
This section explains how the language for the literals contained in a record description can be serialized in RDF+XML or RDF+N3 formats.

If you properly configured the structWSF instance (and the Solr schema) to support the languages you will be using in the RDF files, then you won't have anything special to do other than specifying the language used for each literal value defined in a RDF document. Then, structWSF will properly index all the language related information in both Solr and Virtuoso, and all the endpoints outlined above will behave the way we outlined in this document.

However, what happens if you define language tags that are not supported by the structWSF instance in the RDF files you are indexing? Nothing untoward happens, but you will see the following behaviour:


 * 1) Virtuoso indexes all of the language related information for all the language strings (supported or not)
 * 2) Solr indexes the unsupported language as if it uses the default language
 * 3) If you try to use an unsupported language for the lang parameter of the web services, then an unsupported language error will be returned by the endpoint.

RDF+XML

In this section, we will see how we can define the language for specific literals. Let's take this initial RDF file serialized in XML:

Initially, no language related markup is used in this RDF+XML file for describing this record. However, language specific tagging can easily be added in this serialization using the xml:lang XML attribute. It can be done this way:

As you can see, we used the xml:lang attribute to specify the language that has been used to write each of these literal values. This new information will then be taken into account by structWSF to properly index this information into the different data management systems, namely Virtuoso and Solr.

RDF+N3
In this section, we see how we can define language for specific literals. Let's take this initial RDF file serialized in N3:

Initially, no language related markup is used in this RDF+N3 file for describing this record. However, language specific tagging can easily be added in this serialization using the special N3 markup. It can be done this way:

As you can see, we used the N3 language markup (an @ followed by the language tag, appended to the quoted literal) to specify the language that has been used to write each of these literal values. This new information will then be taken into account by structWSF to properly index this information into the different data management systems, namely Virtuoso and Solr.
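A minimal Python sketch of this markup follows. The helper name and the sample values are hypothetical; the output uses the standard N3 form of a quoted literal followed by @ and the language tag.

```python
from typing import Optional


def n3_literal(text: str, lang: Optional[str] = None) -> str:
    """Format `text` as an N3 string literal, with an optional language tag."""
    # Escape backslashes first, then quotes and newlines.
    escaped = text.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")
    literal = f'"{escaped}"'
    return f"{literal}@{lang}" if lang else literal


english = n3_literal("Cancer care", "en")
french = n3_literal("Soins du cancer", "fr")
untagged = n3_literal("10695")
```

Leaving out the language produces a plain literal, which the endpoints treat as having no defined language.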