Assembling Named Entities

Named entities, along with ontology concepts, are a key means for tagging content and providing a structural basis for relating to other information. As such, named entities should be purposefully assembled and maintained as part of the structural information bases of an open semantic framework (OSF) instance or portal.

This document describes how named entities are identified, harvested and assembled, along with some basic tagging. For more general information, see the named entities category.

Overview of the Assembly Process

The basic idea in the named entity process is to collect relevant named entities for the OSF instance at hand (people, places, organizations, other notable things or instances), to give them types and to characterize them, and then to put them into named entity dictionaries for ongoing use. This basic workflow is as follows:

(Figure: Named entities.png, the basic workflow for assembling named entities)

Note that relevant named entities can come from internal or external sources. If external, the sources need to be harvested.

More specifics on each of these broad steps are provided below.

Collating Internal Entity Lists

The first primary source of named entity candidates is the organization's own internal information and records. There are three main sources for these.

Internal Database Records

Many named entities occur as individual records within standard relational databases. In these cases, the relevant tables can be identified, along with the row records that represent individual entities. Certain fields (columns) in these tables can also contribute to the standard named entity dictionary contents (see the last section).

Specific scripts or SQL queries will need to be created to extract and assemble this information. This task should be coordinated with IT or the local database administrator.
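
As a rough illustration, a minimal extraction script along these lines is sketched below. It assumes a SQLite database and invented table and column names (organizations, org_id, org_name, website); substitute the actual schema in coordination with the database administrator.

    # Sketch: pull named entity candidates from a relational table and write
    # them out with the suggested dictionary fields (see the last section).
    # The database file, table, and column names are hypothetical.
    import csv
    import sqlite3

    conn = sqlite3.connect("internal_records.db")   # hypothetical database file
    cur = conn.cursor()

    # One row per candidate entity: an id, a preferred label, and a reference URL.
    cur.execute("""
        SELECT org_id, org_name, website
        FROM organizations
        WHERE active = 1
    """)

    with open("organization_entities.csv", "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["id", "type", "prefLabel", "prefURL"])
        for org_id, org_name, website in cur.fetchall():
            writer.writerow([org_id, "Organization", org_name, website or ""])

    conn.close()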

Dataset Records

Manually created datasets are another source of structured data for an OSF instance (see, for example, the commON case study). Some of these datasets have records (rows) that are also useful named entity candidates. These datasets should be identified, and the appropriate attributes (columns) assembled, in order to contribute to the standard named entity dictionary contents (see the last section).
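
As a comparable illustration for dataset records, a small filter over a CSV export might look like the following sketch. The file name, column headings (name, category, homepage) and category values are hypothetical.

    # Sketch: keep dataset rows that look like useful named entity candidates.
    # The CSV file and its columns are invented for illustration.
    import csv

    candidates = []
    with open("facilities_dataset.csv", newline="") as f:
        for row in csv.DictReader(f):
            # Keep only rows that represent notable, nameable things.
            if row.get("category") in {"Park", "Arena", "Library"}:
                candidates.append({
                    "id": row["name"].lower().replace(" ", "-"),
                    "type": row["category"],
                    "prefLabel": row["name"],
                    "prefURL": row.get("homepage", ""),
                })

    print(f"{len(candidates)} candidate named entities collected")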

Other Listings

Other internal sources of named entity candidates include spreadsheets, listings on internal Web sites, written documents (especially appendices or specialized listings), etc. These should be assembled and scanned for possible inclusion in the named entity dictionaries.

Harvesting Entities

Sources of external named entities are primarily found on the Web. To obtain these candidates, it is first necessary to harvest relevant documents that might contain these entity listings.

Dedicated harvesters might be employed (such as Nutch or other open source tools). Generally, for OSF, we combine harvesting with entity extraction using the free Extractiv online service (there is also a similar harvest-only service, called 80legs, provided by the same company).

As a result, we will address the harvest steps at the same time as we discuss the Extractiv service (next). Please note, however, that similar steps may be applied directly using the 80Legs online service. This might be appropriate, for example, if you use a local entity tagger such as the Illinois tagger in lieu of a service like Extractiv.
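
For readers who prefer to see the mechanics, the following is a minimal sketch of the kind of depth-limited, internal-links-only crawl these services perform. The seed URL and depth are placeholders, and a real harvester (Nutch, 80legs, etc.) adds politeness delays, robots.txt handling, scale and error handling that this sketch omits.

    # Sketch: fetch pages from a seed list and follow same-host links to a
    # fixed depth. Seed URLs and depth below are placeholder assumptions.
    from collections import deque
    from html.parser import HTMLParser
    from urllib.parse import urljoin, urlparse
    from urllib.request import urlopen

    SEEDS = ["https://www.example.org/sitemap.html"]   # hypothetical seed list
    MAX_DEPTH = 2

    class LinkCollector(HTMLParser):
        """Collects the href values of anchor tags on a page."""
        def __init__(self):
            super().__init__()
            self.links = []
        def handle_starttag(self, tag, attrs):
            if tag == "a":
                for name, value in attrs:
                    if name == "href" and value:
                        self.links.append(value)

    def harvest(seeds, max_depth):
        seen, pages = set(seeds), {}
        queue = deque((url, 0) for url in seeds)
        while queue:
            url, depth = queue.popleft()
            try:
                html = urlopen(url, timeout=10).read().decode("utf-8", "replace")
            except Exception:
                continue                      # skip unreachable pages
            pages[url] = html
            if depth >= max_depth:
                continue
            collector = LinkCollector()
            collector.feed(html)
            for link in collector.links:
                absolute = urljoin(url, link)
                # Restrict the crawl to internal links only.
                if urlparse(absolute).netloc == urlparse(url).netloc and absolute not in seen:
                    seen.add(absolute)
                    queue.append((absolute, depth + 1))
        return pages

    if __name__ == "__main__":
        print(f"Harvested {len(harvest(SEEDS, MAX_DEPTH))} pages")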

Extracting Candidate Entities

Extracting candidate named entities involves using a "tagger" to find (and possibly type) entities within candidate text. The tagging process can use a lookup of known entities in available gazetteers (Wikipedia is a common source) or via various pattern matching algorithms that use word and character patterns. These patterns may be heuristic or based on some form of machine learning.

Some taggers, such as the ones generally recommended for OSF, use both lookup and pattern-matching techniques.
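
To make the two techniques concrete, here is a toy sketch that combines a tiny gazetteer lookup with a naive capitalization pattern. The gazetteer entries and the regular expression are illustrative assumptions only; the taggers discussed below use far richer resources and models.

    # Sketch: tag entities by (1) gazetteer lookup and (2) a naive pattern
    # that treats runs of capitalized words as candidates. Illustrative only.
    import re

    GAZETTEER = {                        # hypothetical lookup list
        "United Nations": "Organization",
        "Winnipeg": "Place",
    }

    CANDIDATE = re.compile(r"\b[A-Z][a-z]+(?:\s+[A-Z][a-z]+)*\b")

    def tag(text):
        entities = []
        for name, etype in GAZETTEER.items():
            if name in text:
                entities.append((name, etype))           # lookup match, typed
        for match in CANDIDATE.finditer(text):
            candidate = match.group()
            if candidate not in GAZETTEER:
                entities.append((candidate, "Unknown"))  # pattern match, untyped
        return entities

    print(tag("The United Nations office in Winnipeg hosted Jane Doe."))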

Using Extractiv

Extractiv is a relatively recent online service that combines harvesting from 80legs with natural language processing from Language Computer Corporation. Both free and subscription versions of the service are available; we describe the free version below.

To use Extractiv, a user must first have a registered login to the service. Though the free service has size and scope limitations, these are not unduly limiting for our purposes of simply building named entity dictionaries.

Once registered, you are able to log into the service, which presents the following screen. From here you can give your extraction job a name and identify the various entity types you want extracted from your candidate text:

(Screenshot: Extractiv1.png, the initial Extractiv job settings screen)

Your first entry gives the job a name. Naming is very helpful because these settings can be recalled for subsequent jobs, which is particularly useful for the entity types and the seed lists for harvests.

There are scores of entity types that the system can extract. You can pick multiple types -- as well as pre-assembled groupings -- up to a total number of allowable points (in the free system). Sentiment and relation extractions are also available, but these are not needed for named entity purposes.

Your next set of questions involves the relation settings (which we will ignore) and the harvesting steps (the most important):

(Screenshot: Extractiv2.png, the relation and harvesting settings)

The harvesting step is a crucial one. The harvest is based on one or more starting URLs (the "seed list") from which the Web crawl begins. Typically, site maps for internal Web sites are one useful starting point; any other Web site that has links throughout your content of interest would also be a good candidate.

For the most relevant references, it is probably best to restrict your crawl to internal links only. Depth is also a key parameter, with larger depth numbers adding substantially to the total possible number of links crawled. If, after testing a run or two, you see that link limits are being reached, pare back the number of URLs on your seed list or reduce the depth before running again. Though the system limits the amount and frequency of crawls, you can split your needs over multiple runs and spread your harvests out over multiple days.
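
To see why depth matters so much, a back-of-the-envelope estimate helps; the seed count and links-per-page figures below are assumptions chosen only to show the geometric growth.

    # Sketch: rough upper bound on pages reached as crawl depth increases.
    seeds = 10            # URLs on the seed list (assumed)
    links_per_page = 20   # average internal links followed per page (assumed)
    for depth in range(4):
        total = sum(seeds * links_per_page ** d for d in range(depth + 1))
        print(f"depth {depth}: up to ~{total} pages")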

Like the overall settings, the seed lists can also be named and saved. This feature is helpful when you are working out the dynamics of your harvests, or need to repeat them in the future.

As the next screen shows, you can also choose to retain only those pages that meet certain keyword requirements, again a helpful feature for targeting your harvests:

(Screenshot: Extractiv3.png, the keyword filtering settings)

Since we combine the harvest and extraction steps, we use the inline XML output format. We have found the RDF output to be unduly verbose (the company is reportedly looking at this matter). JSON is the format recommended by the company, and it is easier to work with if the full harvests are not needed.

Post-processing Script

To be completed.

Using the Illinois Tagger

Another tagger used for OSF is the Illinois tagger. It uses gazetteers extracted from Wikipedia, as well as its own pattern recognition methods. It is open source, written in Java.

To be completed.

Other Extraction Options

Here are some other leading options for named entity extraction (though they are not typically used in most OSF instances):

  • AcroMine: http://www.nactem.ac.uk/software/acromine/
  • AlchemyAPI from Orchestr8 provides an API-based application that uses statistical and natural language processing methods. It is applicable to webpages, text files and any input text in several languages
  • BooWa is a set expander for any language (formerly known as SEALS); developed by RC Wang of Carnegie Mellon
  • Google Sets for automatically creating sets of items from a few examples
  • Open Calais is a free, limited API Web service to automatically attach semantic metadata to content, based on either entities (people, places, organizations, etc.), facts (person ‘x’ works for company ‘y’), or events (person ‘z’ was appointed chairman of company ‘y’ on date ‘x’). The metadata results are stored centrally and returned to you as industry-standard RDF constructs accompanied by a Globally Unique Identifier (GUID)
  • SemanticHacker (from Textwise) is an API that does a number of different things, including categorization, search, etc. By using 'concept tags', the API can be leveraged to generate metadata or tags for content
  • TagFinder is a Web service that automatically extracts tags from a piece of text. The tags are chosen based on both statistical and linguistic analysis of the original text
  • Tagthe.net has a demo and an API for automatic tagging of web documents and texts. Tags can be single words only. The tool also recognizes named entities such as people names and locations
  • TermExtractor extracts terminology consensually referred to in a specific application domain. The software takes as input a corpus of domain documents, parses the documents, and extracts a list of “syntactically plausible” terms (e.g. compounds, adjective-nouns, etc.)
  • TermFinder uses Poisson statistics, Maximum Likelihood Estimation and Inverse Document Frequency, comparing the frequency of words in a given document against a generic corpus of 100 million words per language; available for English, French and Italian
  • TerMine is an online and batch term extractor that emphasizes part of speech (POS) and n-gram (phrase) extraction. It is a terminological management system with C-Value term extraction and AcroMine acronym recognition integrated
  • TextDigger offers fee-based and API services for concept and term extraction
  • Topia term extractor is a part-of-speech and frequency based term extraction tool implemented in Python. A term extraction demo based on this tool is also available
  • Wikify! is a system to automatically "wikify" a text by adding Wikipedia-like tags throughout the document. The system extracts keywords and then disambiguates and matches them to their corresponding Wikipedia definition
  • Yahoo! Placemaker is a freely available geoparsing Web service. It helps developers make their applications location-aware by identifying places in unstructured and atomic content – feeds, web pages, news, status updates – and returning geographic metadata for geographic indexing and markup

Creating Named Entity Dictionaries

A named entity dictionary is simply a special instance of a standard irON dataset; what makes it special is its attributes and metadata. Here are the suggested fields for a named entity dictionary; only id and type are required entries.

  • id (primitive: Id): Identifier string used to uniquely identify the named entity (record).
  • type (type: Object): Class (type) of the named entity (record) being described.
  • prefLabel (primitive: String): Human readable label used to refer to this named entity (instance). The prefLabel is the preferred label for the given named entity (instance) and is the preferred string used in the user interface. If not specified, the id is used as the label.
  • altLabel (primitive: String): Human readable strings that are synonyms, lingo, jargon, acronyms or other alternatives used to refer to the named entity (instance) and its preferred label. altLabel may also be used to help map or disambiguate named entities (instances).
  • note (primitive: String): Human readable note(s) related to the named entity. Information to include might be the information source, background explanatory material, or a description of the named entity (record). Notes could possibly be sub-categorized as note, changeNote, editorialNote, historyNote or scopeNote.
  • prefURL (primitive: Url): Systems can often benefit from a reference to a Web page with additional information about a named entity (record); prefURL is a URL reference to such a Web page. For whatever named entity (record) you are considering, think of the assignment to prefURL as representing the "best" human-viewable Web page available for it. For example, if a named entity (record) is of type person, its prefURL can be a personal Web page, a Web page with a CV or biography, etc.; if the record is of type organization, its prefURL can be its enterprise Web page. The use of the prefURL attribute is also recommended for user interface generation purposes. Like prefLabel, the notion of "preferred" and "alternative" Web pages (href) also applies.
  • href (primitive: Url): A single valid URI, or a list of valid URIs, that refers to or describes the current named entity (record). In the absence of a prefURL, the first href encountered may be substituted as the user interface URL. When multiple values are provided for href in combination with a prefURL attribute, they act as "see also" link references for the named entity (record).
  • ref (Id): A ref attribute refers to a local ID if the first character of the id is "@", and to a global ID (URI) if the first two characters of the id are "@@". If the ref attribute refers to the global ID of a named entity (record), that record can be local or remote (that is, the record referred to by the ref attribute is defined in another dataset).
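
As an illustration, a single record carrying these attributes might be assembled and serialized as in the sketch below. The values are invented, and the JSON layout is a generic serialization for illustration rather than normative irON syntax; consult the irON specification for the actual irXML, irJSON and commON formats.

    # Sketch: one named entity dictionary record with the suggested attributes.
    # Values are invented; the JSON shape is illustrative, not normative irON.
    import json

    record = {
        "id": "winnipeg",                                  # required
        "type": "Place",                                   # required
        "prefLabel": "Winnipeg",
        "altLabel": ["City of Winnipeg", "The Peg"],
        "note": "Capital and largest city of Manitoba, Canada.",
        "prefURL": "http://www.winnipeg.ca/",
        "href": ["http://en.wikipedia.org/wiki/Winnipeg"],
    }

    print(json.dumps({"recordList": [record]}, indent=2))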

For further information about this format, see the irON specification.