Final Student Report for Research Presentations Format should include:

Mentored Research:

Transforming a Word Conflation Algorithm Into a Minimal Stem Algorithm

Mentor: Miroslav Martinovic

Student: Christina Grape and Emily Gibson

Introduction

Word conflation is a process used to improve on the performance of information retrieval (IR) systems. An information retrieval system usually contains a large database of documents. Each of the documents is analyzed and statistical information is stored on the words and phrases contained within them. Terms with common stems are regarded as having similar meanings.

The performance of an IR system can be improved if this process reduces similar terms to a single stem. This conflation can be completed using several steps of suffix and ending removals. This will reduce the number of terms that exist in the system making the size and complexity significantly smaller.

Background

Many algorithms exist today to perform word conflation. This project concentrated on two major ones, Porter and Paice. C. D. Paice's Algorithm, as cited in Oliver Mason's book, is a simple suffix-stripping algorithm. There are no condition tests of the original word. Rather there are a number of rules that are executed when there is matching suffix to be removed. The order of the rules plays an important role in reducing to the smallest stem. Paice solves this problem by executing the longest suffixes, followed by the removal of shorter suffixes. If a shorter suffix were executed first, the term would not conflate to the correct stem. Mason gives such a case: Given the word TESTIFIED, both suffices -IED and -ED match. The first suffix -IED account for more letters and thus conflates the word correctly to TESTIFY, while the second one -ED conflates it incorrectly to TESTIFI. (Mason, p.179).

The word conflation algorithm introduced by M.F. Porter in 1980 (http://www.tcnj.edu/~mmmartin/Porter.html), has long been recognized as a rather simple, computationally inexpensive and successful technique to bring together the words conveying the same or similar meaning and treat them as the same content contributors. The algorithm comprises of five simple modules each dedicated to handling certain kinds of word transformations. These modules are applied to a given word sequentially, producing their own simplified versions of the word (i.e. for the word RADICALLIZATIONS, the following words are produced by the five modules, respectively: RADICALLIZATION, RADICALLIZE, RADICALLIZE, RADICALL, and RADICAL - the final product). It has been observed though, that in some of the cases, the algorithm did not conflate related words into a same common stem word (i.e. DEEPENINGS conflated to DEEPEN, while DEEP stayed DEEP. Also, RELATEDNESS conflated to RELATED, while RELATED transformed into RELAT).

Design & Methodology

As noted above, the current algorithms have their shortcomings in conflating related words to single terms. As can be seen in a Porter evaluation of the words RELATEDNESS:

After processing RELATEDNESS, the resulting stem is REALTED. Now a similar word RELATED is evaluated using Porter's Algorithm:

Here we can see the resulting stem is RELAT. Since RELATEDNESS and RELATED are similar terms they should conflate to the same stem. This faults from the order in which Porter's rules are executed.

In order to address these deficiencies in the algorithm Porter's algorithm was re-evaluated. An equivalence relation was developed to relate two sets of stems. The five word products after each step of Porter's algorithm becomes a set. Then two sets are tested based on derived equivalence relations to see if they indeed should have the same stem as in the RELATED and RELATEDNESS example.

Equivalence Relation:

Word Relation à ₁

Definition:

Words u is said to Porter-derive 1 word v iff the Step 1 modification of Porter’s algorithm for word conflation transforms word u into word v. We denote this by u à ₁ v.

Word Relation à ₂

Definition:

Words u is said to Porter-derive 2 word v iff the Step 2 modification of Porter’s algorithm for word conflation transforms word u into word v. We denote this by u à ₂ v.

Word Relation à ₃

Definition:

Words u is said to Porter-derive 3 word v iff the Step 3 modification of Porter’s algorithm for word conflation transforms word u into word v. We denote this by u à ₃ v.

Word Relation à ₄

Definition:

Words u is said to Porter-derive 4 word v iff the Step 4 modification of Porter’s algorithm for word conflation transforms word u into word v. We denote this by u à ₄ v.

Word Relation à ₅

Definition:

Words u is said to Porter-derive 5 word v iff the Step 5 modification of Porter’s algorithm for word conflation transforms word u into word v. We denote this by u à ₅ v.

Porter Set

Definition:

Set A is a Porter set iff:

i. A = { a, a₁, a₂, a₃, a₄, a₅ }, and

ii. a à ₁a_{1 à 2} a₂à ₃a₃à ₄a_{4 à 5}a₅.

Porter Relation p

Definition :

Two Porter sets A and B are said to be in Porter relation p iff they have a common element :

A p B Û A Ç B ¹ q

Porter Relation p

Lemma 1 :

Porter relation p is a reflexive and symmetrical relation.

Proof :

(i) Reflexiveness is obvious because :

A p A Û A Ç A ¹ q (Porter sets are non-empty)

(ii) Symmetry is obvious because :

A p B Û A Ç B ¹ q Û B Ç A ¹ q Û B p A

Now, if we denote p _TC the relation that is a transitive closure of the Porter relation p , we have the following theorem depicting its properties :

Transitive Closure Relation p _TC

Theorem 1 :

Transitive closure relation p _TC is an equivalence relation.

Proof :

(i) Reflexiveness is obvious because :

Ap A è Ap _TCA because p Í p _TC.

(ii) Symmetry is relatively straightforward to show, as well :

Ap _TCB è $ C₁,…,C_k[(Ap C₁) Ù … Ù (C_kpB)] Ú Ap B .

(a) Ap _TCB è Ap B

èBp A

è Bp _TCA.

(b) Ap _TCB è $ C₁,…,C_k[(Ap C₁) Ù … Ù (C_kpB)]

è$ C₁,…,C_k[(C_1pA) Ù … Ù (Bp C_k)]

è$ C₁,…,C_k[(Bp C_k) Ù … Ù (C_1pA)]

èBp _TCA.

(ii) Transitivity :

Ap _TCB Ù Bp _TCC

è $ D₁=A,…,D_m=B [(Ap D₂) Ù … Ù (D_m-1pB)] Ù

$ E₁=B,…,E_n=C[(Bp E₂) Ù … Ù (E_n-1pC)]

è Ap _TCC.

Transitive Closure Relation p _TC

Lemma 2 :

Transitive closure relation p _TC partitions the collection of Porter sets into equivalence classes.

Best Representative Term

Definition :

The best representative term for an equivalence class of the transitive closure relation p _TC is the shortest word from the union of its Porter sets.

The Best Representative Term expresses the commonality among the related words (terms) from the union of the sets from its corresponding equivalence class. In addition to that, it does it in a most succinct manner. It clearly contributes to reducing the number of terms in an IR system, but also selects the (memory-wise) most efficient representative.

This word conflation module is intended to be as generic as possible. This means that the system should be built to accommodate any algorithm. The original intention was for any user familiar with linguistics and conflation algorithms should be able to insert the rules of any particular algorithm and then the system would build the other necessary items and classes. This proved to be too difficult. Regular expressions proved to be a simple enough standard to be used by linguistic professionals to build the rules. Java also has a regular expression class that makes manipulations of rules, terms and suffices simple. Here is a sample of some of the Porter rules as regular expressions:

(i) Porter Rule: "*o" means "one consonant, one vowel, one consonant where the last c is not w, x, or y" (cvc)
Regular Expression: [A-Z-[AEIOU]] [AEIOUY] [A-Z-[AEIOUWXY]]

(ii) Porter Rule: "*v*" means the stem contains a vowel
Regular Expression: .*([AEIOU] | [A-Z-[AEIOU]]Y).*

(iii) Porter Rule: "*d" means the stem ends in a double consonant (e.g. -TT, -SS)

Regular Expression: ([A-Z-[AEIOUY]])\\1\\2 (?)

Conclusion and Future Work

The theory and logic behind the system is complete. Work continues on developing and finishing the actual implementation of the system in Java. The original word conflation system is being modified to have abstract classes. The abstract classes will be created as generic modules for any algorithm. Other classes will extend the generic ones to contain the details of a particular algorithm. Future work will implement the generic system and minimal stemming process. This system will serve as a small module that will plug into an even larger system, QASTIIR. QATIIR is a question and answer system being implemented by another research group.

References

Mason, O. “Programming for Corpus Linguistics.” Edinburgh University Press, 2000.

Paice, C.D., 1996: "A method for the evaluation of stemming algorithms based on error counting," Journal of the American Society for Information Science, 47(8), 632-649.

Porter, M.F. An algorithm for suffix stripping. Program. 14 (3), July 1980: 130-137.

Project site: http://www.tcnj.edu/~grape2/CREW/CREW.html