Modifying Search Results via Terms Boosting

From OSF Wiki

The Search endpoint supports a series of parameters that let users boost the terms of their search query to modify the relevancy of the returned results. The main characteristic of OSF Web Service's Search endpoint is that the structure of the data can be used to help get more relevant results from a full text search query.

There are three main areas that can be influenced to change the scoring of returned search results:

  1. Record type boosting
  2. Record dataset provenance boosting
  3. Record attribute and/or attribute/value boosting

Boosting the weight of a type, a dataset or an attribute/value only affects the score of each result. It does not determine whether a result is returned. What determines whether a result is added to the returned results is the full text search query and the filters defined for that query.
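The distinction between filtering and boosting can be illustrated with a small toy model. This is only a sketch: the record list, the base scores and the multiplicative boost are all hypothetical, and this is not how OSF Web Service computes scores internally. It only shows that a filter decides which records are returned, while a boost reorders the records that already matched.

```php
<?php
// Toy model only: hypothetical records with a made-up base full text score.
$records = [
    ['uri' => 'doc:1',    'type' => 'Webpage',  'score' => 1.0],
    ['uri' => 'doc:2',    'type' => 'Document', 'score' => 1.5],
    ['uri' => 'person:1', 'type' => 'Person',   'score' => 2.0],
];

// A filter decides membership: only document-like records are kept.
$allowedTypes = ['Webpage', 'Document'];
$results = array_values(array_filter(
    $records,
    fn($r) => in_array($r['type'], $allowedTypes, true)
));

// A boost only changes the score of records that already matched.
// The multiplier is an assumption made for this illustration.
$typeBoost = ['Webpage' => 2.0];
foreach ($results as &$r) {
    $r['score'] *= $typeBoost[$r['type']] ?? 1.0;
}
unset($r);

// Sort by score, highest first.
usort($results, fn($a, $b) => $b['score'] <=> $a['score']);

foreach ($results as $r) {
    printf("%s %.1f\n", $r['uri'], $r['score']);
}
```

In this sketch the person record never appears, even though it has the highest base score, because it is excluded by the filter. The webpage moves ahead of the other document only because of its boost.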

Using the Data Structure To Improve Relevancy

In OSF Web Service, all the data is in RDF. This means that all the content is fully structured, and that a long series of attributes and values, each with its own semantics, has been used to describe the records indexed in the system.

All these characteristics of the data can be leveraged to influence the scoring of the results for a filtered full text search query.

In the following tutorial, we will consider that we have the following set of data accessible via the Search endpoint:

  1. A series of datasets with information about people and organizations
  2. A series of ontologies that define thousands of concepts specific to a domain (healthcare in this example)
  3. A series of datasets with document records (healthcare-related documents). Each of these records has been related to domain concepts using OSF Tagger (scones). There is a hierarchy of document types, and all the documents are related to people and organizations from the other datasets.

As you can see, the OSF Web Service instance behind that Search endpoint is rich in fully structured data.

One thing to note is that the context of a search query greatly influences how the boosting techniques are used. For example, a search query used to display a list of relevant articles on a page about pregnant women will be boosted quite differently from one used on a page about breast cancer.

Now, let's look at a few examples of how this structure can be used to help improve the relevancy of the returned results.

Goal: I want to get 10 webpage articles that talk about pregnancy and mothers
Boosting rules:
  • Filter documents to get the records of type document only
  • Boost documents that have the type webpage
  • Boost documents that are related to the subject mothers
  • Boost documents that are related to the subject pregnancy

Goal: I want the most relevant articles about knee pain, with an emphasis on the ones published by Cochrane Reviews
Boosting rules:
  • Filter documents to get the records of type document only
  • Boost documents that are published by Cochrane Reviews
  • Boost documents that are related to the subject knee
  • Boost documents that are related to the subject pain

Some Real Boosting Examples

In this article we will use the OSF Web Service PHP API to generate the search queries that will be sent to the Search endpoint.

I want to get 10 webpage articles that talk about pregnancy and mothers

OSF Web Service PHP API Code

  <?php

    // Base URL of the OSF Web Service network to query
    $network = "http://localhost/ws/";
    $search = new SearchQuery($network);

    // Filter on document records, then boost webpages and the two subjects
    $search->typeFilter("http://purl.org/ontology/bibo/Document")
           ->typeBoost("http://purl.org/ontology/bibo/Webpage", 200)
           ->attributeValueBoost("http://purl.org/dc/terms/subject", 100, "http://purl.org/ontology/doha#mothers", TRUE)
           ->attributeValueBoost("http://purl.org/dc/terms/subject", 300, "http://purl.org/ontology/doha#pregnancy", TRUE)
           ->excludeAggregates()
           ->items(3)               // only the top 3 results are requested here
           ->sort("score", "desc")  // highest score first
           ->mime("text/xml")
           ->send();

  ?>

Results

Here are the top 3 results for the query defined above. There were 22685 results in total, and these are the 3 most relevant according to this query:

<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE resultset PUBLIC "-//Structured Dynamics LLC//Search DTD 0.1//EN" "search/search.dtd">
  <resultset>
  <prefix entity="owl" uri="http://www.w3.org/2002/07/owl#" />
  <prefix entity="rdf" uri="http://www.w3.org/1999/02/22-rdf-syntax-ns#" />
  <prefix entity="rdfs" uri="http://www.w3.org/2000/01/rdf-schema#" />
  <prefix entity="iron" uri="http://purl.org/ontology/iron#" />
  <prefix entity="xsd" uri="http://www.w3.org/2001/XMLSchema#" />
  <prefix entity="wsf" uri="http://purl.org/ontology/wsf#" />
  <prefix entity="bibo" uri="http://purl.org/ontology/bibo/" />
  <prefix entity="dcterms" uri="http://purl.org/dc/terms/" />
  <prefix entity="nhccn" uri="http://purl.org/ontology/nhccn#" />
  <subject type="bibo:Document" uri="http://domain.com/datasets/global/documents/archive/451C8998-0418-C53A-5946CDA4206F6800">
    <predicate type="iron:prefLabel">
      <object type="rdfs:Literal">Think pregnancy - think immunisation</object>
    </predicate>
    <predicate type="iron:description">
      <object type="rdfs:Literal">It's an important for adults to keep up their immunisation as it is to ensure babies and children are immunised, particularly as whooping cough makes a comeback.</object>
    </predicate>
    <predicate type="dcterms:subject">
      <object uri="http://purl.org/ontology/doha#mothers" >
        <reify type="wsf:objectLabel" value="Mothers " />
      </object>
    </predicate>
    <predicate type="dcterms:subject">
      <object uri="http://purl.org/ontology/doha#pregnancy" >
        <reify type="wsf:objectLabel" value="Pregnancy " />
      </object>
    </predicate>
    <predicate type="dcterms:subject">
      <object uri="http://purl.org/ontology/doha#immunisation_programs" >
        <reify type="wsf:objectLabel" value="Immunisation Programs " />
      </object>
    </predicate>
    <predicate type="dcterms:subject">
      <object uri="http://purl.org/ontology/doha#australia" >
        <reify type="wsf:objectLabel" value="Australia " />
      </object>
    </predicate>
    <predicate type="dcterms:subject">
      <object uri="http://purl.org/ontology/doha#immunisation" >
        <reify type="wsf:objectLabel" value="Immunisation " />
      </object>
    </predicate>
    <predicate type="nhccn:partner">
      <object uri="http://domain.com/datasets/groups/consumershealthforum_author" >
        <reify type="wsf:objectLabel" value="consumershealthforum_author" />
      </object>
    </predicate>
    <predicate type="rdf:type">
      <object uri="http://purl.org/ontology/nhccn#Resource" />
    </predicate>
  </subject>
  <subject type="bibo:Document" uri="http://domain.com/datasets/global/documents/000481AD-006B-144F-8D6683978717FE97">
    <predicate type="iron:prefLabel">
      <object type="rdfs:Literal">Preparing your toddler for the new baby</object>
    </predicate>
    <predicate type="iron:description">
      <object type="rdfs:Literal">So you're pregnant and still breastfeeding your baby or toddler! You may be wondering if you can continue to breastfeed though your new pregnancy, and even beyond. This article looks at some issues and concerns you may have.</object>
    </predicate>
    <predicate type="dcterms:subject">
      <object uri="http://purl.org/ontology/doha#mothers" >
        <reify type="wsf:objectLabel" value="Mothers " />
      </object>
    </predicate>
    <predicate type="dcterms:subject">
      <object uri="http://purl.org/ontology/doha#pregnancy" >
        <reify type="wsf:objectLabel" value="Pregnancy " />
      </object>
    </predicate>
    <predicate type="dcterms:subject">
      <object uri="http://purl.org/ontology/doha#infants" >
        <reify type="wsf:objectLabel" value="Infants " />
      </object>
    </predicate>
    <predicate type="dcterms:subject">
      <object uri="http://purl.org/ontology/doha#breastfeeding" >
        <reify type="wsf:objectLabel" value="Breastfeeding " />
      </object>
    </predicate>
    <predicate type="dcterms:subject">
      <object uri="http://purl.org/ontology/doha#siblings" >
        <reify type="wsf:objectLabel" value="Siblings " />
      </object>
    </predicate>
    <predicate type="nhccn:partner">
      <object uri="http://domain.com/datasets/groups/austbreastfeedingassoc_author" >
        <reify type="wsf:objectLabel" value="austbreastfeedingassoc_author" />
      </object>
    </predicate>
    <predicate type="rdf:type">
      <object uri="http://purl.org/ontology/nhccn#Resource" />
    </predicate>
  </subject>
  <subject type="bibo:Document" uri="http://domain.com/datasets/global/documents/50A6F79F-C17D-23DA-79BF4225E3EE9954">
    <predicate type="iron:prefLabel">
      <object type="rdfs:Literal">Mum's having a baby</object>
    </predicate>
    <predicate type="iron:description">
      <object type="rdfs:Literal">Mum's don't always tell everyone straight away when they are pregnant, (expecting a baby), so it may be a while before you get to know.</object>
    </predicate>
    <predicate type="dcterms:subject">
      <object uri="http://purl.org/ontology/doha#mothers" >
        <reify type="wsf:objectLabel" value="Mothers " />
      </object>
    </predicate>
    <predicate type="dcterms:subject">
      <object uri="http://purl.org/ontology/doha#pregnancy" >
        <reify type="wsf:objectLabel" value="Pregnancy " />
      </object>
    </predicate>
    <predicate type="dcterms:subject">
      <object uri="http://purl.org/ontology/doha#emotions" >
        <reify type="wsf:objectLabel" value="Emotions " />
      </object>
    </predicate>
    <predicate type="dcterms:subject">
      <object uri="http://purl.org/ontology/doha#life_change_events" >
        <reify type="wsf:objectLabel" value="Life Change Events " />
      </object>
    </predicate>
    <predicate type="dcterms:subject">
      <object uri="http://purl.org/ontology/doha#siblings" >
        <reify type="wsf:objectLabel" value="Siblings " />
      </object>
    </predicate>
    <predicate type="nhccn:partner">
      <object uri="http://domain.com/datasets/groups/childyouthhealthsa_author" >
        <reify type="wsf:objectLabel" value="childyouthhealthsa_author" />
      </object>
    </predicate>
    <predicate type="rdf:type">
      <object uri="http://purl.org/ontology/nhccn#Resource" />
    </predicate>
  </subject>
</resultset>
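A resultset like the one above can be read with PHP's SimpleXML extension. The following is a minimal sketch, assuming the response body is available as a string. The embedded XML is a trimmed copy of the third record above, and the $records array layout is made up for this illustration. SimpleXML is not part of the OSF Web Service PHP API, which may provide its own helpers for working with resultsets.

```php
<?php
// Trimmed copy of the resultset structure shown above (one record, two subjects).
$xml = <<<XML
<resultset>
  <prefix entity="iron" uri="http://purl.org/ontology/iron#" />
  <prefix entity="dcterms" uri="http://purl.org/dc/terms/" />
  <subject type="bibo:Document" uri="http://domain.com/datasets/global/documents/50A6F79F-C17D-23DA-79BF4225E3EE9954">
    <predicate type="iron:prefLabel">
      <object type="rdfs:Literal">Mum's having a baby</object>
    </predicate>
    <predicate type="dcterms:subject">
      <object uri="http://purl.org/ontology/doha#mothers">
        <reify type="wsf:objectLabel" value="Mothers " />
      </object>
    </predicate>
    <predicate type="dcterms:subject">
      <object uri="http://purl.org/ontology/doha#pregnancy">
        <reify type="wsf:objectLabel" value="Pregnancy " />
      </object>
    </predicate>
  </subject>
</resultset>
XML;

$resultset = simplexml_load_string($xml);
$records = [];

// Each <subject> element is one record; its <predicate> children hold the values.
foreach ($resultset->subject as $subject) {
    $record = ['uri' => (string) $subject['uri'], 'label' => null, 'subjects' => []];
    foreach ($subject->predicate as $predicate) {
        $type = (string) $predicate['type'];
        if ($type === 'iron:prefLabel') {
            $record['label'] = (string) $predicate->object;
        } elseif ($type === 'dcterms:subject') {
            // The human-readable label of the subject is carried by the <reify> element.
            $record['subjects'][] = trim((string) $predicate->object->reify['value']);
        }
    }
    $records[] = $record;
}

print_r($records);
```

Listing the subject labels of each record this way makes it easy to check that the top results carry the boosted subjects.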


I want the most relevant articles about knee pain, with an emphasis on the ones published by Cochrane Reviews

OSF Web Service PHP API Code

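  <?php

    // Same setup as in the first example
    $network = "http://localhost/ws/";
    $search = new SearchQuery($network);

    // Filter on document records, then boost the publisher and the two subjects
    $search->typeFilter("http://purl.org/ontology/bibo/Document")
           ->attributeValueBoost("http://purl.org/ontology/nhccn#partner", 300, "http://ccr.nhccn.com.au/datasets/groups/cochranereviews_author", TRUE)
           ->attributeValueBoost("http://purl.org/dc/terms/subject", 200, "knee", TRUE)
           ->attributeValueBoost("http://purl.org/dc/terms/subject", 200, "pain", TRUE)
           ->excludeAggregates()
           ->items(3)               // only the top 3 results are requested here
           ->sort("score", "desc")  // highest score first
           ->mime("text/xml")
           ->send();

  ?>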


Results

Here are the top 3 results for the query defined above. There were 22685 results in total, and these are the 3 most relevant according to this query:

<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE resultset PUBLIC "-//Structured Dynamics LLC//Search DTD 0.1//EN" "search/search.dtd">
  <resultset>
  <prefix entity="owl" uri="http://www.w3.org/2002/07/owl#" />
  <prefix entity="rdf" uri="http://www.w3.org/1999/02/22-rdf-syntax-ns#" />
  <prefix entity="rdfs" uri="http://www.w3.org/2000/01/rdf-schema#" />
  <prefix entity="iron" uri="http://purl.org/ontology/iron#" />
  <prefix entity="xsd" uri="http://www.w3.org/2001/XMLSchema#" />
  <prefix entity="wsf" uri="http://purl.org/ontology/wsf#" />
  <prefix entity="bibo" uri="http://purl.org/ontology/bibo/" />
  <prefix entity="dcterms" uri="http://purl.org/dc/terms/" />
  <prefix entity="nhccn" uri="http://purl.org/ontology/nhccn#" />
  <subject type="bibo:Document" uri="http://domain.com/datasets/global/documents/000ED7FA-355B-101F-AC6283032BFA006D">
    <predicate type="iron:prefLabel">
      <object type="rdfs:Literal">Intensity of exercise for osteoarthritis</object>
    </predicate>
    <predicate type="iron:description">
      <object type="rdfs:Literal">Either high intensity or low intensity aerobic exercise improves functional status, pain, gait, and aerobic capacity in people with osteoarthritis of the knee Therapeutic exercise has been recommended as part of a treatment regime for people with osteoa...</object>
    </predicate>
    <predicate type="nhccn:url">
      <object type="rdfs:Literal">http://summaries.cochrane.org/CD004259/intensity-of-exercise-for-osteoarthritis</object>
    </predicate>
    <predicate type="dcterms:subject">
      <object uri="http://purl.org/ontology/doha#exercise_therapy" >
        <reify type="wsf:objectLabel" value="Exercise Therapy " />
      </object>
    </predicate>
    <predicate type="dcterms:subject">
      <object uri="http://purl.org/ontology/doha#joint_diseases" >
        <reify type="wsf:objectLabel" value="Joint Diseases " />
      </object>
    </predicate>
    <predicate type="dcterms:subject">
      <object uri="http://purl.org/ontology/doha#knee" >
        <reify type="wsf:objectLabel" value="Knee " />
      </object>
    </predicate>
    <predicate type="dcterms:subject">
      <object uri="http://purl.org/ontology/doha#osteoarthritis" >
        <reify type="wsf:objectLabel" value="Osteoarthritis " />
      </object>
    </predicate>
    <predicate type="dcterms:subject">
      <object uri="http://purl.org/ontology/doha#pain" >
        <reify type="wsf:objectLabel" value="Pain " />
      </object>
    </predicate>
    <predicate type="nhccn:partner">
      <object uri="http://domain.com/datasets/groups/cochranereviews_author" >
        <reify type="wsf:objectLabel" value="cochranereviews_author" />
      </object>
    </predicate>
    <predicate type="rdf:type">
      <object uri="http://purl.org/ontology/nhccn#Resource" />
    </predicate>
  </subject>
  <subject type="bibo:Document" uri="http://domain.com/datasets/global/documents/000A6910-564B-1E50-99E283032BFA006D">
    <predicate type="iron:prefLabel">
      <object type="rdfs:Literal">Therapeutic ultrasound for osteoarthritis</object>
    </predicate>
    <predicate type="iron:description">
      <object type="rdfs:Literal">This summary of a Cochrane review presents what we know from research about the effect of therapeutic ultrasound on knee or hip osteoarthritis. The previous version of this review concluded that therapeutic ultrasound had no benefit over fake therapeuti...</object>
    </predicate>
    <predicate type="nhccn:url">
      <object type="rdfs:Literal">http://summaries.cochrane.org/CD003132/therapeutic-ultrasound-for-osteoarthritis</object>
    </predicate>
    <predicate type="dcterms:subject">
      <object uri="http://purl.org/ontology/doha#knee" >
        <reify type="wsf:objectLabel" value="Knee " />
      </object>
    </predicate>
    <predicate type="dcterms:subject">
      <object uri="http://purl.org/ontology/doha#induced_hyperthermia" >
        <reify type="wsf:objectLabel" value="Induced Hyperthermia " />
      </object>
    </predicate>
    <predicate type="dcterms:subject">
      <object uri="http://purl.org/ontology/doha#inflammation" >
        <reify type="wsf:objectLabel" value="Inflammation " />
      </object>
    </predicate>
    <predicate type="dcterms:subject">
      <object uri="http://purl.org/ontology/doha#osteoarthritis" >
        <reify type="wsf:objectLabel" value="Osteoarthritis " />
      </object>
    </predicate>
    <predicate type="dcterms:subject">
      <object uri="http://purl.org/ontology/doha#pain" >
        <reify type="wsf:objectLabel" value="Pain " />
      </object>
    </predicate>
    <predicate type="nhccn:partner">
      <object uri="http://domain.com/datasets/groups/cochranereviews_author" >
        <reify type="wsf:objectLabel" value="cochranereviews_author" />
      </object>
    </predicate>
    <predicate type="rdf:type">
      <object uri="http://purl.org/ontology/nhccn#Resource" />
    </predicate>
  </subject>
  <subject type="bibo:Document" uri="http://domain.com/datasets/global/documents/4F47E12F-FC63-EE2D-3D34B5167894B185">
    <predicate type="iron:prefLabel">
      <object type="rdfs:Literal">Moulded foot insoles for adults with pain around the knee cap</object>
    </predicate>
    <predicate type="iron:description">
      <object type="rdfs:Literal">Pain around the knee cap is a common problem. The pain may be brought on or made worse by day to day or sporting/exercise activities. Pain around the knee cap can have many different causes, such as the way the knee cap glides over the bones or because ...</object>
    </predicate>
    <predicate type="nhccn:url">
      <object type="rdfs:Literal">http://summaries.cochrane.org/CD008402/moulded-foot-insoles-for-adults-with-pain-around-the-knee-cap</object>
    </predicate>
    <predicate type="dcterms:subject">
      <object uri="http://purl.org/ontology/doha#therapy" >
        <reify type="wsf:objectLabel" value="Therapy " />
      </object>
    </predicate>
    <predicate type="dcterms:subject">
      <object uri="http://purl.org/ontology/doha#joint_diseases" >
        <reify type="wsf:objectLabel" value="Joint Diseases " />
      </object>
    </predicate>
    <predicate type="dcterms:subject">
      <object uri="http://purl.org/ontology/doha#knee" >
        <reify type="wsf:objectLabel" value="Knee " />
      </object>
    </predicate>
    <predicate type="dcterms:subject">
      <object uri="http://purl.org/ontology/doha#pain" >
        <reify type="wsf:objectLabel" value="Pain " />
      </object>
    </predicate>
    <predicate type="dcterms:subject">
      <object uri="http://purl.org/ontology/doha#orthopaedic_equipment" >
        <reify type="wsf:objectLabel" value="Orthopaedic Equipment " />
      </object>
    </predicate>
    <predicate type="nhccn:partner">
      <object uri="http://domain.com/datasets/groups/cochranereviews_author" >
        <reify type="wsf:objectLabel" value="cochranereviews_author" />
      </object>
    </predicate>
    <predicate type="rdf:type">
      <object uri="http://purl.org/ontology/nhccn#Resource" />
    </predicate>
  </subject>
</resultset>

Sorting and Boosting

Boosting modifies the score of each result within a set of results. From there, if you want the results with the highest scores at the top, or at the bottom, of the returned list, you have to sort the resultset by score in descending or ascending order, respectively.