Mentored Research:
Transforming a Word Conflation Algorithm Into a
Minimal Stem Algorithm
Mentor: Miroslav Martinovic
Student: Christina Grape and Emily Gibson
Introduction
Word
conflation is a process used to improve on the performance of information
retrieval (IR) systems. An information
retrieval system usually contains a large database of documents. Each of the documents is analyzed and
statistical information is stored on the words and phrases contained within
them. Terms with common stems are
regarded as having similar meanings.
The
performance of an IR system can be improved if this process reduces similar
terms to a single stem. This conflation
can be completed using several steps of suffix and ending removals. This will reduce the number of terms that
exist in the system making the size and complexity significantly smaller.
Background
Many algorithms exist today
to perform word conflation. This
project concentrated on two major ones, Porter and Paice. C. D. Paice's Algorithm, as cited in Oliver
Mason's book, is a simple suffix-stripping algorithm. There are no condition tests of the original word. Rather there are a number of rules that are
executed when there is matching suffix to be removed. The order of the rules plays an important role in reducing to the
smallest stem. Paice solves this
problem by executing the longest suffixes, followed by the removal of shorter
suffixes. If a shorter suffix were
executed first, the term would not conflate to the correct stem. Mason gives such a case: Given the word
TESTIFIED, both suffices -IED and -ED match.
The first suffix -IED
account for more letters and thus conflates the word correctly to TESTIFY,
while the second one -ED conflates it incorrectly to TESTIFI. (Mason, p.179).
The word conflation algorithm
introduced by M.F. Porter in 1980 (http://www.tcnj.edu/~mmmartin/Porter.html), has long been recognized
as a rather simple, computationally inexpensive and successful technique to
bring together the words conveying the same or similar meaning and treat them
as the same content contributors. The
algorithm comprises of five simple modules each dedicated to handling certain
kinds of word transformations. These
modules are applied to a given word sequentially, producing their own
simplified versions of the word (i.e. for the word RADICALLIZATIONS, the
following words are produced by the five modules, respectively:
RADICALLIZATION, RADICALLIZE, RADICALLIZE, RADICALL, and RADICAL - the final
product). It has been observed though,
that in some of the cases, the algorithm did not conflate related words into a
same common stem word (i.e. DEEPENINGS conflated to DEEPEN, while DEEP stayed
DEEP. Also, RELATEDNESS conflated to
RELATED, while RELATED transformed into RELAT).
Design & Methodology
As noted above, the current
algorithms have their shortcomings in conflating related words to single
terms. As can be seen in a Porter
evaluation of the words RELATEDNESS:
After
processing RELATEDNESS, the resulting stem is REALTED. Now a similar word RELATED is evaluated
using Porter's Algorithm:
Here we can see the
resulting stem is RELAT. Since
RELATEDNESS and RELATED are similar terms they should conflate to the same
stem. This faults from the order in which
Porter's rules are executed.
In order to address these
deficiencies in the algorithm Porter's algorithm was re-evaluated. An equivalence relation was developed to
relate two sets of stems. The five word
products after each step of Porter's algorithm becomes a set. Then two sets are tested based on derived
equivalence relations to see if they indeed should have the same stem as in the
RELATED and RELATEDNESS example.
Equivalence Relation:
Word
Relation à 1
Definition:
Words u is said to Porter-derive 1 word v
iff the Step 1 modification of Porter’s algorithm for word conflation
transforms word u into word v. We denote this by u à 1
v.
Word
Relation à 2
Definition:
Words u is said to Porter-derive 2 word v
iff the Step 2 modification of Porter’s algorithm for word conflation
transforms word u into word v. We denote this by u à 2
v.
Word
Relation à 3
Definition:
Words u is said to Porter-derive 3 word v
iff the Step 3 modification of Porter’s algorithm for word conflation
transforms word u into word v. We denote this by u à 3
v.
Word
Relation à 4
Definition:
Words u is said to Porter-derive 4 word v
iff the Step 4 modification of Porter’s algorithm for word conflation
transforms word u into word v. We denote this by u à 4
v.
Word
Relation à 5
Definition:
Words u is said to Porter-derive 5 word v
iff the Step 5 modification of Porter’s algorithm for word conflation
transforms word u into word v. We denote this by u à 5
v.
Porter
Set
Definition:
Set A is a Porter set iff:
i.
A
= { a, a1, a2, a3, a4, a5
}, and
ii.
a
à 1
a1 à 2 a2 à 3
a3 à 4 a4 à 5 a5.
Porter
Relation p
Definition :
Two Porter sets A and B are said to be in
Porter relation p iff they have a common element :
A p B Û A Ç B ¹ q
Porter
Relation p
Lemma 1 :
Porter relation p is a reflexive and symmetrical
relation.
Proof :
(i) Reflexiveness is obvious because :
A p A Û A Ç A ¹ q (Porter sets are non-empty)
(ii) Symmetry is obvious because :
A p B Û A Ç B ¹ q Û B Ç A ¹ q Û B p A
Now,
if we denote p TC the relation that is a transitive closure of the
Porter relation p , we have the following theorem depicting its properties :
Transitive
Closure Relation p TC
Theorem 1 :
Transitive closure relation p TC is an
equivalence relation.
Proof :
(i) Reflexiveness is obvious because :
Ap A è Ap TCA because p Í p TC
.
(ii) Symmetry is relatively straightforward to show,
as well :
Ap TCB è $ C1,…,Ck [(Ap
C1) Ù … Ù (Ckp B)] Ú Ap B .
(a) Ap TCB è Ap B
èBp A
è Bp TCA .
(b) Ap TCB è $ C1,…,Ck
[(Ap C1) Ù … Ù (Ckp B)]
è$ C1,…,Ck [(C1p A)
Ù … Ù (Bp Ck)]
è$ C1,…,Ck [(Bp Ck)
Ù … Ù (C1p A)]
èBp TCA .
(ii) Transitivity :
Ap TCB Ù Bp TCC
è $ D1=A,…,Dm=B [(Ap D2)
Ù … Ù (Dm-1p B)] Ù
$ E1=B,…,En=C [(Bp
E2) Ù … Ù (En-1p C)]
è Ap TCC .
Transitive
Closure Relation p TC
Lemma 2 :
Transitive closure relation p TC partitions
the collection of Porter sets into equivalence classes.
Best
Representative Term
Definition :
The best representative term for an equivalence
class of the transitive closure relation p TC is the shortest word
from the union of its Porter sets.
The
Best Representative Term expresses the commonality among the related words
(terms) from the union of the sets from its corresponding equivalence
class. In addition to that, it does it
in a most succinct manner. It clearly
contributes to reducing the number of terms in an IR system, but also selects
the (memory-wise) most efficient representative.
This
word conflation module is intended to be as generic as possible. This means that the system should be built
to accommodate any algorithm. The
original intention was for any user familiar with linguistics and conflation
algorithms should be able to insert the rules of any particular algorithm and
then the system would build the other necessary items and classes. This proved to be too difficult. Regular expressions proved to be a simple
enough standard to be used by linguistic professionals to build the rules. Java also has a regular expression class
that makes manipulations of rules, terms and suffices simple. Here is a sample of some of the Porter rules
as regular expressions:
(i)
Porter Rule: "*o" means "one consonant, one vowel, one consonant
where the last c is not w, x, or y" (cvc)
Regular
Expression: [A-Z-[AEIOU]]
[AEIOUY] [A-Z-[AEIOUWXY]]
(ii)
Porter Rule: "*v*" means the
stem contains a vowel
Regular
Expression: .*([AEIOU] | [A-Z-[AEIOU]]Y).*
(iii)
Porter Rule: "*d" means the
stem ends in a double consonant (e.g. -TT, -SS)
Regular Expression:
([A-Z-[AEIOUY]])\\1\\2 (?)
Conclusion and Future Work
The theory and logic behind the
system is complete. Work continues on
developing and finishing the actual implementation of the system in Java. The original word conflation system is being
modified to have abstract classes. The
abstract classes will be created as generic modules for any algorithm. Other classes will extend the generic ones
to contain the details of a particular algorithm. Future work will implement the generic system and minimal
stemming process. This system will
serve as a small module that will plug into an even larger system,
QASTIIR. QATIIR is a question and
answer system being implemented by another research group.
References
Mason,
O. “Programming for Corpus
Linguistics.” Edinburgh University Press, 2000.
Paice,
C.D., 1996: "A method for the evaluation of stemming algorithms based on
error counting," Journal
of the American Society for Information Science, 47(8), 632-649.
Porter,
M.F. An algorithm for suffix stripping.
Program. 14 (3), July 1980: 130-137.
Project site:
http://www.tcnj.edu/~grape2/CREW/CREW.html