Installing GATE

From OSF Wiki

Introduction

This tutorial explains how to install, run, and use GATE 5.1. It explains how to tag text articles of all kinds with concepts that come from a pre-created OWL ontology and with named entities that are part of a pre-created named entities dictionary (called a "gazetteer" in GATE).

Installing GATE 5.1

Installing GATE is quite simple. Download it from the GATE website, then unarchive it in a suitable location on your local computer. Then run it by executing the Gate.exe executable file.

First-Time Configuration

GATE is composed of multiple plug-ins that are assembled into processing pipelines (which GATE calls applications). These pipelines modify or tag a corpus of texts, step by step. Some plug-ins are loaded into GATE by default; however, not all the ones we need for this tutorial are.

The first thing we have to do is to load these additional plug-ins; this is a step you only have to do once.

In the menu, click: File -> Manage CREOLE Plug-ins

This will open a window listing all the available plug-ins that GATE is aware of on your computer. Some of them are enabled by default, others not. For each of the plug-ins in the list below, make sure the "Load now" checkbox is checked. Once you are done, click the "OK" button.

Here is the list of all plug-ins that have to be loaded:

  • ANNIE
  • Annotation_Merging
  • Gazetteer_Ontology_Based
  • Jape_Compiler
  • Ontology
  • Ontology_Based_Gazetteer
  • Ontology_BDM_Computation
  • Ontology_OWLIM2
  • Ontology_Tools
  • Tools

Cleaning Up the Interface

The left sidebar is the main part of the GATE application. It is where all the tools you will use to create applications and pipelines are displayed. It is also where you will set up each of them, load texts and corpora of texts, etc.

However, for this tutorial, we want to load resources one by one, so we don't want anything loaded by default.

Remove each item that appears below these four sections:

  • Applications
  • Language Resources
  • Processing Resources
  • Datastores

To close an item below any of these sections, select it, right-click, and click on the "Close" contextual menu item.

Repeat this process until you only see this in the left sidebar:

  • GATE
    • Applications
    • Language Resources
    • Processing Resources
    • Datastores

Loading Core Processing Resources

The next thing we have to do is to load the "processing resources" that will be used to create the pipeline that will tag our corpus of texts given an ontology and a pre-defined dictionary of named entities.

To load a new processing resource, right-click on the "Processing Resources" item in the left sidebar, click "New", and select the resource you want to create. For this tutorial, you have to load, one by one, each of these processing resources:

  • ANNIE English Tokeniser
  • ANNIE POS Tagger
  • ANNIE Sentence Splitter
  • GATE Morphological Analyser
  • Document Reset PR

For each of them, leave the default values in the window that appears and click the "OK" button.

Loading Ontology

The next step is to create another Processing Resource: a gazetteer built from an OWL ontology. The first thing to do is to locate the PEG Ontology on your local computer. If you don't have it, go to the (example) PEG Ontology Framework wiki page, then copy and paste the content of that page into a file named "peg_ontology_framework_v1.owl" on your local computer.

Once you have the PEG Ontology OWL file on your computer, the next step is to load it into GATE as a Language Resource. Right-click on the "Language Resources" item in the left sidebar, click "New", and select the contextual menu item "OWLIM2 Ontology LR".

Once the window opens, configure these settings:

  • Name: PEG Ontology
  • loadImports: select "false"
  • rdfXmlURL: select the peg_ontology_framework_v1.owl file on your local computer

If everything goes fine, you should see a new "PEG Ontology" language resource appearing in the left sidebar.

If you double-click on the "PEG Ontology" item, you will be able to browse the ontology's architecture with a basic ontology visualization tool.
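If you prefer to inspect the OWL file outside of GATE, RDF/XML can be read with standard XML tooling. The sketch below lists every declared owl:Class in an ontology document; the sample ontology and its class IRIs are invented for illustration and do not reflect the actual contents of the PEG Ontology file:

```python
import xml.etree.ElementTree as ET

RDF = "{http://www.w3.org/1999/02/22-rdf-syntax-ns#}"
OWL = "{http://www.w3.org/2002/07/owl#}"

def list_owl_classes(rdf_xml: str) -> list:
    """Return the rdf:about IRI of every owl:Class in an RDF/XML document."""
    root = ET.fromstring(rdf_xml)
    return [cls.get(RDF + "about") for cls in root.iter(OWL + "Class")]

# A minimal stand-in ontology (the real peg_ontology_framework_v1.owl differs).
SAMPLE = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
                     xmlns:owl="http://www.w3.org/2002/07/owl#">
  <owl:Class rdf:about="http://example.org/peg#Project"/>
  <owl:Class rdf:about="http://example.org/peg#Neighbourhood"/>
</rdf:RDF>"""
```

Calling list_owl_classes(SAMPLE) returns the two class IRIs in document order.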

Once the ontology language resource has been created, we can now create an ontology gazetteer processing resource that uses this loaded ontology.

Right-click on "Processing Resources", then click on the "New -> Onto Root Gazetteer" contextual menu item.

Then, set it up in the following manner:

  • Name: PEG Ontology Gazetteer
  • morpher: GATE Morphological analyser_XYZ
  • ontology: PEG Ontology
  • pos tagger: ANNIE POS Tagger_XYZ
  • tokenizer: ANNIE English Tokeniser_XYZ

Finally click the OK button.

The last thing we have to create is called a "flexible gazetteer." This is the gazetteer that will go into our processing pipeline; it wraps the Onto Root Gazetteer we just created.

Right-click on: "Processing Resources -> New -> Flexible Gazetteer"

In the window that appears, configure it as follows:

  • Name: Flexible PEG Ontology Gazetteer
  • gazetteerInst: select "PEG Ontology Gazetteer"
  • inputFeatureNames: click on the right-most button. In the top box, write "Token.root", then click the "add" button, select the newly created item in the list, and then click the "OK" button.

Then click the "OK" button again to create the new flexible gazetteer.
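Conceptually, setting inputFeatureNames to "Token.root" makes the flexible gazetteer match gazetteer entries against each token's root (lemma) feature rather than its surface form, which is why the morphological analyser must run first. A rough, GATE-independent sketch of that idea (the lemma table and gazetteer entries are invented):

```python
# Toy lemma table standing in for the GATE morphological analyser (invented).
LEMMAS = {"cities": "city", "projects": "project", "running": "run"}

# Toy gazetteer keyed on root forms, standing in for the Onto Root Gazetteer.
GAZETTEER_ROOTS = {"city": "peg:City", "project": "peg:Project"}

def tag_by_root(tokens):
    """Look each token up by its lemma, so 'projects' matches entry 'project'."""
    annotations = []
    for tok in tokens:
        root = LEMMAS.get(tok.lower(), tok.lower())
        if root in GAZETTEER_ROOTS:
            annotations.append((tok, GAZETTEER_ROOTS[root]))
    return annotations
```

For example, tag_by_root(["Two", "projects"]) matches "projects" through its root "project", which a surface-form lookup would miss.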

Loading the Named Entities Dictionary

The idea behind OSF Tagger (scones) is to tag both concepts from an ontology as well as named entities. In this part, we configure GATE for named entities. Unlike concepts, named entities do not come from the ontology, but from a list of files where all the named entities used as the basis for annotation are listed.

The first thing you have to do is to download the named entities dictionary zip file here: File:Peg named entities dictionary.zip. Then you have to extract it on your local desktop in some folder (all files should be in the same folder).

Then we have to create a new kind of gazetteer that will annotate the text with these named entities. Right-click on "Processing Resources -> New -> Hash Gazetteer", then configure it as follows:

  • Name: PEG Named Entities Dictionary
  • listsURL: select the file "peg_gazetteer.def" which is part of the zip archive you unzipped on your local desktop computer.

Then click the "OK" button. A new "PEG Named Entities Dictionary" item should appear under "Processing Resources".
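For reference, a GATE gazetteer definition (.def) file is plain text: each line names a list (.lst) file followed by a major type and, optionally, a minor type, separated by colons, and each .lst file contains one entry per line. The lines below are invented for illustration; the actual contents of peg_gazetteer.def will differ:

```
organization.lst:Organization
person.lst:Person:full_name
```

An accompanying organization.lst would then simply list entries such as "City of Winnipeg", one per line.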

Creating a Corpus of Texts

The main goal of GATE is to take a text -- or a corpus of texts -- as input and to process it through a pipeline, so that the end result is a possibly modified, annotated version of the texts. The next step is to create the corpus of texts that we will then process with a processing pipeline that has yet to be created.

To create a corpus of texts, right-click on: "Language Resources -> New -> GATE Corpus". In the window that is displayed, set:

  • Name: PEG Corpus

Then click the "OK" button.

A "PEG Corpus" item should appear as a language resource in the left sidebar. The next step is to load texts into this corpus. Right-click on the "PEG Corpus" item and click on the "Populate" contextual menu item. Select the directory where your text documents are available on your local computer. Once you have selected the directory, click the "OK" button.

Then, all the texts that were available in that folder will get loaded and displayed as text language resources under the "Language Resources" item in the left-sidebar.
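What "populate" does is conceptually simple: every file in the chosen directory becomes one document in the corpus. A minimal, GATE-independent sketch of that behaviour (the directory layout is hypothetical):

```python
from pathlib import Path

def populate_corpus(directory: str) -> dict:
    """Read every file in `directory` into a {name: text} corpus, the way
    GATE's 'populate' action turns each file into a document resource."""
    corpus = {}
    for path in sorted(Path(directory).iterdir()):
        if path.is_file():
            corpus[path.name] = path.read_text(encoding="utf-8")
    return corpus
```

Pointing populate_corpus at a folder of .txt files yields one entry per file, keyed by file name.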

Creating a Processing Pipeline

The next step is to put all the processing resources we created above in a processing pipeline that will take all the texts of our corpus, and process them according to this pipeline. The end-result of the processing of this pipeline is that all the texts of the corpus will now be tagged with related concepts of our ontology, and named entities of our named entities dictionary.

To create a new corpus processing pipeline, right-click on "Applications -> New -> Corpus Pipeline". Use the name "PEG Corpus Pipeline" and click the "OK" button.

Double-click on the "PEG Corpus Pipeline" item that appeared under the "Applications" item in the left-sidebar.

This is where the complete processing pipeline gets created, and where that processing pipeline runs to modify and annotate the corpus of texts.

The next step is to put all the processing resources in the right order, to select our corpus of texts, and to run the application!

In the PEG Corpus Pipeline window, select items in the left panel and click the right-pointing arrow in the middle to add them to the pipeline. Make sure that all these items appear, in this order, in the right-hand list of the window:

  • Document Reset PR
  • ANNIE Sentence Splitter
  • ANNIE English Tokeniser
  • ANNIE POS Tagger
  • GATE Morphological Analyser
  • PEG Named Entities Dictionary
    • annotationSetName: "Tagged Named Entities"
  • Flexible PEG Ontology Gazetteer
    • outputAnnotationSetName: "Tagged Concepts"

Then select the "PEG Corpus" as the corpus to run the pipeline on.

Finally, click the "Run this Application" button!

If everything has been properly set up, you should see GATE processing each text of the PEG Corpus according to the processing pipeline we just put in place.
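The order of the resources matters: the reset PR runs first, then splitting, tokenising and tagging, then the gazetteers that depend on those earlier annotations. Setting GATE aside, a corpus pipeline can be sketched as an ordered list of functions applied to each document; the stage names mirror the list above, but the processing here is deliberately trivial and the sample text is invented:

```python
def run_pipeline(corpus, stages):
    """Apply each processing stage, in order, to every document in the corpus.
    Each document is a {'text': ..., 'annotations': [...]} dict."""
    for doc in corpus:
        for stage in stages:
            stage(doc)
    return corpus

# Trivialised stand-ins for the real processing resources.
def document_reset(doc):
    doc["annotations"] = []            # wipe annotations from earlier runs

def tokeniser(doc):
    doc["annotations"] += [("Token", w) for w in doc["text"].split()]

def named_entity_gazetteer(doc):
    # In GATE this would write to the "Tagged Named Entities" annotation set.
    if "Winnipeg" in doc["text"]:
        doc["annotations"].append(("Tagged Named Entities", "Winnipeg"))

docs = [{"text": "Winnipeg projects", "annotations": [("stale", "x")]}]
run_pipeline(docs, [document_reset, tokeniser, named_entity_gazetteer])
```

Running the stages in this order first clears the stale annotation, then produces the token and named-entity annotations; reordering them (e.g. reset last) would destroy the output, which is why the pipeline order above matters.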

Checking Tagged Concepts and Entities

The final step is to check how the processing pipeline acted on our corpus of texts. If everything went fine, the corpus should now carry several kinds of annotations:

  • Word tokens
  • Space tokens
  • Sentence annotations
  • Named entity annotations
  • Concept annotations

To check if this happened, double-click on any text available in your corpus, under the "Language Resources" item. Once you double-click on one of these text items, you should see the text appear in a text-visualization window.

Several button menu items are available at the top of this window. The two that interest us right now are "Annotation Sets" and "Annotations List". Click on each of them; a right panel and a bottom panel will appear. The right panel is where you can visualize annotated sequences of characters (words or lists of words). The bottom panel lists all the annotations you selected in the right panel (highlighted words).

In the right panel, a few items will appear that can be expanded by clicking the right-pointing arrows. The first one is the default annotation set; this is where all the sentence, token, etc. annotations are displayed. What interests us is which items got tagged by our concept and named entity taggers: these appear under the "Tagged Concepts" and "Tagged Named Entities" items. If you expand these items, a "Lookup" check-box appears. If you check either of these boxes, you will see all tagged concepts and named entities (if any) highlighted within the text. All of these annotations will also appear in the bottom panel.

By using this text annotation tool, you can see how the processing pipeline tagged concepts and named entities in your corpus of texts. If you click on a highlighted word in the window, you can edit or remove the annotation in case of an automatic tagging error. You can also delete and modify annotations from the bottom panel list.
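Under the hood, this viewer is essentially a filter over (annotation set, type, span) triples. A small GATE-independent sketch of that filtering, with invented annotations of the kind the pipeline produces:

```python
def in_set(annotations, set_name, ann_type=None):
    """Return the annotations belonging to one annotation set, optionally
    restricted to a single type (e.g. the 'Lookup' type gazetteers create)."""
    return [a for a in annotations
            if a["set"] == set_name and (ann_type is None or a["type"] == ann_type)]

# Invented annotations; "" stands for the default annotation set.
ANNS = [
    {"set": "", "type": "Token", "span": (0, 8)},
    {"set": "Tagged Concepts", "type": "Lookup", "span": (0, 8)},
    {"set": "Tagged Named Entities", "type": "Lookup", "span": (9, 17)},
]
```

Checking the "Lookup" box under "Tagged Concepts" in the GUI corresponds to in_set(ANNS, "Tagged Concepts", "Lookup"), which keeps only the second annotation.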

This is the tool to use to assess and clean up annotated texts.