"Science must begin with myths, and with the criticism of myths." Karl Popper
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Chapter 5 - Basic Natural Language Processing
%
% last change: 16.08.2004
% correction hamid: xx.xx.2004
% correction prof: xx.xx.2004
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\aphorism{Any noun can be verbed.}{Alan J. Perlis}%
\chapter{Natural Language Text Representation}%
\label{chapter:text_analysis}%
Whereas the previous chapter is concerned with the representation of
the structure, this chapter introduces the basic tasks necessary to
process the content of the leaf nodes of a document containing plain
natural language texts. The goal is to represent textual content in
a form that can be computed and compared efficiently. The result of
this process is a vector of terms -- actually of stems -- associated
with their frequencies of occurrence. The involved natural language
processing tasks are tokenization, tagging, stemming, and stopword
filtering. These tasks are described in detail.\footnote{Parts of
this chapter have already been published
in~\cite{hassler06tokenization}}.
%----------------------------------------------------
%----------------------------------------------------
%----------------------------------------------------
\section{Introduction}
% introduction
Proper retrieval results depend on appropriate content
representations for the documents searched. Therefore, methods from
the field of natural language processing are applied to compute
representations for plain texts. Several
works~\cite{grovlen95nlp,voorhees99nlp} are solely dedicated to
investigate the effects of natural language processing tasks on the
information retrieval performance. It is commonly agreed that this
task is a very complex one, including various processing steps like
tokenization, tagging, shallow parsing, stemming, compound
splitting, etc.~\cite{brants03nlp} The result of this processing
chain is an appropriate representation of the text. In order to
support retrieval tasks appropriately, the representation has to
facilitate
%
\begin{compactitem}
\item fast computation and comparison of representations,
\item minimal storage requirements, and
\item support of high recall and precision in searches.
\end{compactitem}
% framework
A major design goal of natural language processing frameworks is to
define (1) components that manage these tasks, their (2) correct
processing sequence, where the output of one component serves as
input for a subsequent one, and (3) appropriate data interchange
objects.
%
% data interchange via XML
Components are tightly coupled to these data objects and cannot be
used in isolation. Hence, an easy and approved possibility to
transfer data is to utilize structured text files like XML
documents. The usage of such a generic data interchange format
results in platform, application, and system independent components.
A component can simply be plugged into an existing system at the
moment it is required. Furthermore, XML output is easily adaptable
and can be processed efficiently.
% this work
In this work the basic components accomplish the tasks of
tokenization, tagging, term extraction, stemming, stopword
filtering, and term frequency computation. The processing sequence
is depicted in Figure~\ref{fig:nlp_process}.
%
\begin{figure}[ht]
\centering
\includegraphics[width=0.99\textwidth]{05_text_analysis/figures/nlp_process}\\
\caption{Natural language processing sequence}
\label{fig:nlp_process}
\end{figure}
% explanations
Tokenization is the first task of natural language text
processing~\cite{webster92tokenization,fox92lexical}. It segments
the text into meaningful processing units called tokens. The output
of the tokenizer serves as input for a tagger. During tagging
syntactic word categories (e.g., verb, noun, adjective) are assigned
to tokens as tags. Based on these tags terms that carry the meaning
of the text are extracted and converted into lower case. Therefore,
nouns and verbs are considered more important than determiners and
prepositions. Afterwards a stemmer reduces each extracted term to
its stem. By stemming, the number of distinct terms (size of the
vocabulary) is reduced drastically. Supplementary it enables a
matching of different forms of the same word. For instance, the
terms \texttt{book}, \texttt{books}, \texttt{booking}, and
\texttt{booked} are all reduced to the single stem \texttt{book}. In
order to get rid of additional unwanted material (i.e., the term
\texttt{computer} within computer science articles may not be very
helpful) a list of stemmed stopwords is used for filtering. Finally,
the frequencies of each stem are summed up to reflect its
importance.
% outlook
This chapter starts with presenting related work and natural
language related delimination problems, highlighting the
difficulties of boundary detection of both words and sentences. In
the sequel the foundations of the natural language processing steps
(tokenization, tagging, stemming, stopword filtering in
Figure~\ref{fig:nlp_process}) are provided. These steps are
necessary to compute content representations for information
retrieval and document mining tasks.
%----------------------------------------------------
%----------------------------------------------------
%----------------------------------------------------
\section{Related Work}
\subsection{Character Sets}
A well known language-related problem is the occurrence of special
characters (e.g., Slavic diacritics, umlauts and sharp \texttt{s} in
German, etc.). In order to work properly a well-designed and
language-independent core system for natural language text
processing has to support these character sets. Therefore, the
Unicode Consortium was founded to develop, extend, and promote the
usage of world-wide Unicode related standards to encode
characters~\cite{url_unicode}. Further, the Unicode Consortium
actively develops standards in the area of internationalization,
including the definition of the behavior and relationships between
Unicode characters. The aim of Unicode is to offer a
language-independent solution for text data interchange. Currently
it is regarded as the default standard that specifies the
representation of texts in modern software systems.
\subsection{Tokenization}
%\subsubsection{Fox}
One of the first authors dealing systematically with the problems of
tokenization was Fox~\cite{fox92lexical} in 1992. Besides standard
tokenization, his work addresses special cases like words containing
numbers, hyphens, special usage of punctuation marks including
slashes and underlines, and the case of letters. For this sake,
hand-written finite state automata are applied to resolve
tokenization ambiguities. Further, the work exemplifies how a
stopword list is constructed automatically to support the
tokenization machinery.
%\subsubsection{Webster and Kit}
Webster and Kit~\cite{webster92tokenization} provide an initial
definition of the token concept. They concentrate on Chinese texts,
where delimiters for words and sentences are not present. Two types
of ambiguities are distinguished, conjunctive ambiguity (two or more
neighboring words are adjacent, compound) and disjunctive ambiguity
(a word in between is adjacent with a previous word and with a
following word, double usage of the word in the middle). Two ways
for disambiguation are proposed: (1) application of knowledge-based
rules and (2) application of statistical-based methods borrowed from
corpus-linguistics.
%\subsubsection{Grefenstette and Tapanainen}
Grefenstette and Tapanainen~\cite{grefenstette94tokenization}
describe a regular expression based treatment of common tokenization
difficulties, including special number formats (e.g., dates,
fractions) and abbreviations. In order to identify sentence borders
they run experiments using rules together with the corpus itself as
a filter. First, all possible candidates for sentence ends (i.e.,
all strings ended by periods) are tested against a short list of
exceptions and a set of identification rules. Second, if no match is
found during the first step, the string without the ending period is
compared to all sentence internal words. In other words it is
checked whether the term occurs as non-abbreviation somewhere within
a sentence. In the case that both steps do not come up with a
result, the token is marked as the last word of the sentence.
%\subsubsection{Giguet}
Giguet~\cite{giguet96multilinguality} addresses the problem of
multilingual text tokenization for natural language diagnosis. The
aim is to assign language names to sentences. Because there is no
knowledge about the language in advance, a general and
language-independent tokenization method is applied. Therefore, he
proposes a rule-based tokenization approach, addressing common
tokenization problems like elision and the difficulty of merging
ordered sets of tokenization rules.
%\subsubsection{Palmer and Hearst}
Strongly related to tokenization is the problem of sentence boundary
disambiguation. Palmer and Hearst~\cite{palmer97adaptive} point out
several difficulties that arise if sentence end markers are not
identified correctly. They propose a machine-learning approach based
on neural networks to tackle this problem. Their implementation
system Satz utilizes part-of-speech tags of preceding and following
tokens around potential sentence end markers. These tag sequences
are used to estimate the probability of a true end of the sentence.
As the system relies on machine-learning, it operates in two phases:
In a training phase, classification patterns provided by an expert
are learned. The trained knowledge is then applied in a testing
phase to recognize sentence boundaries in unknown texts. Satz
executes fast in both, the training and testing phase, adapts to
other corpora and languages, and operates robust regarding the
input.
%\subsubsection{Guo}
Guo presents in~\cite{guo97critical,guo98one} a
one-tokenization-per-source approach that focuses on non-segmenting
languages like Chinese. In contrast to segmenting languages as
English or German, the text comes as a sequence of characters
without any delimiters (e.g., blanks, punctuation marks) between
words and sentences. Thus, the difficulty is the identification of
token borders, which is not an easy task. His proposed approach is
based on the assumption of a complete dictionary. Nevertheless, the
sentence \texttt{todayissunday} still can be interpreted either as
\texttt{to day is sun day} or \texttt{today is sunday}. To recover
from this problem, Guo suggests to use the document itself as a
filter to disambiguate these cases. The system is evaluated using
the Chinese PH newspaper corpus, achieving high accuracy values of
up to 99\%.
%\subsubsection{Yamashita and Matsumoto}
Yamashita and Matsumoto~\cite{yamashita00morphological} propose a
language independent morphological analysis consisting of three
steps: tokenization, dictionary lookup, and disambiguation. Their
approach is applicable for both segmenting and non-segmenting
languages. Segmenting languages are transformed to non-segmenting
ones by simply removing all delimiters. They introduce the notion of
a morpho-fragment, which is an intermediate representation between
characters (graphic words separated by blanks in segmenting
languages) and lexemes (logical units). Trie data structures are
used to implement efficient dictionary lookup including
morphological information. The final disambiguation step is realized
via a hidden markov model. The approach is evaluated using the
tagged Penn Treebank corpus (97\% and 95\% precision and recall), a
Japanese tagged corpus (97\% precision and recall), and the tagged
corpus released by the Chinese Knowledge Information Processing
Group (95\% and 91\% precision and recall).
%\subsubsection{Mikheev}
Using a minimum of pre-built resources,
Mikheev~\cite{mikheev02periods} tackles three different tokenization
problems: sentence boundary disambiguation, disambiguation of
capitalized words, and identification of abbreviations. With the
help of automatically retrieved word frequency lists and a small set
of heuristics, the corpus itself is used to dynamically infer
disambiguation clues.
%
In order to detect sentence borders unambiguous abbreviations (e.g.,
abbreviations followed by a lower case word) are looked up within
the document. Based on these occurrences and local contexts
ambiguous abbreviations are resolved.
%
The disambiguation of capitalized words, which is an important task
of text normalization, is closely related to the first problem. This
step partly covers the task of named entity recognition, where three
subtasks are distinguished: identification of unique identifiers
(organizations, persons, locations, proper names, product names,
acronyms), identification of temporal expressions (complete or
partial times and dates), and identification of numerical
expressions and quantities (monetary values, percentages). As
before, the clues found in unambiguous occurrences are used to
resolve ambiguous contexts. Sentence ends marked by either
exclamation marks, question marks, or periods are identified as
central processing units relying on the previous results,
abbreviations, and capitalized words. The system achieves high
performance exceeding 99\% accuracy on the three tasks using the
Brown Corpus.
\subsection{Tagging Systems}
\subsubsection{The Brill Tagger}
One of the earliest rule-based taggers that achieves similar results
compared to stochastic taggers was presented by
Brill~\cite{brill92simple,brill93phdthesis}. The approach, called
transformation-based error-driven learning, operates in two phases:
During a learning phase, tagging rules are automatically inferred
from a corpus by trying to increase the overall accuracy. In a
testing phase, each word is initially assigned its most likely tag
without considering any context. Therefore, the tagger is provided
with two clues: capitalized words tend to be nouns and most common
word endings are known (e.g., words with the suffix \texttt{-ous}
tend to be adjectives). Afterwards, tag assignments are iteratively
changed according to the learned rules. An evaluation using the
Brown corpus showed that the error rate drops to 3,5\%.
Some extensions of the Brill tagger are described
in~\cite{brill94some}. They comprise tagging using lexical
relationships, a new method to identify unknown words, and multiple
tags for each word. Initial tagging knowledge is reduced to a
minimal amount. The unsupervised learning algorithm is explained in
detail in~\cite{brill95unsupervised}. A good summary of Brill's work
is found in~\cite{brill95transformationbased}.
\subsubsection{Cuttings Hidden Markov Model Tagger}
%\subsubsection{Cutting, Kupiec, Pedersen, and Sibun
Cutting et al.~\cite{cutting92practical} present a part-of-speech
tagger that is robust, efficient, accurate, tunable, and reusable.
The tagger is based on a hidden markov model to resolve a small set
of ambiguity classes. A lexicon module provides information about
common word tags, word suffix identification, and default ambiguity
classes. Their experiments show that the tagger achieves an accuracy
of over 96\%.
\subsubsection{TreeTagger}
Schmid~\cite{schmid94probabilistic} presents a probabilistic
part-of-speech tagger called TreeTagger. The tagger operates on
second order markov models using binary decision trees to estimate
$n$-gram probabilities instead of a maximum likelihood estimation
(see Figure~\ref{fig:related_tree}). Tree pruning is used to
recursively remove leaf nodes below a weighed information gain.
%
\begin{figure}[ht]
\centering
\includegraphics[width=0.5\textwidth]{05_text_analysis/figures/related_tree}\\
\caption[A sample decision tree]{A sample decision tree~\cite{schmid94probabilistic}}
\label{fig:related_tree}
\end{figure}
%
Initially, tags are assigned through a lexicon that includes full
forms, suffixes, and default category entries. Afterwards, unigrams,
bigrams, trigrams, and quatrograms are applied to assign correct
tags. Evaluation results using the Penn Treebank corpus show good
results of up to 96,34\% accuracy.
% further extensions
Further developments of the TreeTagger are described
in~\cite{schmid95improvements}. In order to fit the domain of German
texts some additional processing steps are included: Smoothing to
circumvent the sparse data problem is done with equivalence classes,
a prefix lexicon is used in combination with suffixes, and
capitalized sentence initial words are considered. In an evaluation
using the large newspaper corpus Stuttgarter Zeitung the tagger
achieves up to 97,53\% accuracy.
\subsubsection{QTag}
Tufis and Mason~\cite{tufis98tagging} propose QTAG, a probabilistic
part-of-speech tagger. Tiered tagging, as they call it, allows the
usage of large tagsets that go beyond the computational power of
average computer systems. This is accomplished by using a hidden
tagset of smaller size that is post-processed (i.e., hidden tags are
resolved by the original ones) during a later phase. Dictionary
lookup is applied for initial tagging. Unidentified tags are
analyzed by two morphological guessers, one applying suffix matching
and another one which is linguistically motivated. An evaluation of
the Romanian CTAG corpus results in over 97\% accuracy.
\subsubsection{The TnT Tagger}
In 2000 Brants~\cite{brants00tnt} introduces the statistical
\abbrev{Trigrams'n'Tags}{TnT} tagger. The TnT tagger is based on
markov models, where the first step is processing maximum likelihood
probabilities for unigrams, bigrams, trigrams, and lexical units.
Smoothing, the linear interpolation of unigram, bigram, and trigram
probabilities resolves the sparse data problem by avoiding zero
probabilities. Unknown words are identified through suffix
identification and capitalization clues. Experiments using the NEGRA
and the Penn Treebank corpora showed that the TnT tagger outperforms
other approaches reaching an accuracy of over 99\%.
\subsubsection{The Stanford Tagger}
At Stanford University Toutanova and
Manning~\cite{toutanova00enriching} developed a maximum entropy
tagger. According to its context, each word is assigned all possible
tags through probabilities. The probability of a tag sequence is
computed as the product of the conditional probability of each of
the tags, resulting in a probability distribution. Out of this set
of all feasible assignment sequences, the probability distribution
that has the highest information gain (entropy) is selected. With
special attention put on the correct identification of unknown
words, the tagger achieves an overall accuracy of 96,86\%.
\subsection{Stopword Filtering}
%\subsubsection{Fox}
Fox~\cite{fox90stoplist} describes a method to systematically
extract a list of stopwords from a document collection. In his work
he relies on the Brown corpus that consists of over a million words
from common English literature. Based on their frequencies terms
occurring at least 300 times are extracted, resulting in an initial
list of 278 terms. In a first step 32 non-stopwords are manually
removed. 26 additional terms of lower frequency are manually added
in a second step. Finally, the list is further extended by words
that are considered as stopwords and which can be recognized with
the same finite state machine (almost) for free. Therefore, he
defines nine different adding criteria like words with different
prefixes ended by \texttt{-body} (e.g., \texttt{anybody},
\texttt{nobody}), \texttt{-ly} (e.g., \texttt{clearly},
\texttt{fairly}), or \texttt{-self} (e.g., \texttt{itself},
\texttt{herself}). He ends up with a list of 421 stopwords for
common English texts that \enquote{can serve as the basis for stop
lists for specialized data bases}~\cite{fox90stoplist}.
\subsection{Stemming Systems}
\subsubsection{The Lovin Stemmer}
Initial research in the area of word stemming is conducted by
Lovins~\cite{lovins68stemming} in 1968. The proposed stemming
algorithm is context sensitive, that is suffixes are only stripped
if certain qualifications are fulfilled (e.g., minimum length of a
remaining stem is two letters). Longest-match suffixes are removed
first, implying an order among the set of reduction rules. The
stemmer includes a list of 294 possible suffixes, a large list of
exceptions, and a set of context sensitive rules (constraints) to
avoid a wrong reduction of terms. The algorithm finalizes by
applying rules for recoding incorrectly stemmed words (e.g.,
\texttt{-ll} $\rightarrow$ \texttt{-l}, \texttt{-iev} $\rightarrow$
\texttt{-ief}).
\subsubsection{The Dawson stemmer}
Compared to Lovins, the Dawson stemmer~\cite{dawson74suffix} uses a
much larger list of about 1200 suffixes. These are implemented as
character trees to ensure fast computation. In contrast to Lovins'
stemmer no final recoding takes place. A partial matching routine
ensures that nearly identical stems are reduced to a single one.
This is done by sets of endings that, if removed, result in the same
stem. For instance, the set $\{mit,miss\}$ summarizes the terms
\texttt{admit} and \texttt{admiss}, where the shortest remaining
stem \texttt{admit} is selected to serve as the single stem for all
of the terms.
\subsubsection{The Porter Stemmer}
One of the best known stemmer is proposed by
Porter~\cite{porter80suffix}. The rules for stripping suffixes are
organized in five distinct phases. Each of the phases succeeds the
previous one, starting with phase 1 and ending with phase 5.
Equations~\ref{eq:porterstem_1}--\ref{eq:porterstem_5b} depict some
examples according to these phases. Phase 1
(Equation~\ref{eq:porterstem_1}) reduces plurals and past
participles. Subsequent phases further reduce the intermediate
stems: phase 2 (Equation~\ref{eq:porterstem_2}), phase 3
(Equation~\ref{eq:porterstem_3}), and phase 4
(Equation~\ref{eq:porterstem_4}). Finally, phase 5 tidies up some
remaining suffixes (Equations~\ref{eq:porterstem_5a} and
\ref{eq:porterstem_5b}).
%
\begin{eqnarray}
generalizations & \rightarrow & generalization \label{eq:porterstem_1}\\
generalization & \rightarrow & generalize \label{eq:porterstem_2}\\
generalize & \rightarrow & general \label{eq:porterstem_3}\\
general & \rightarrow & gener \label{eq:porterstem_4}\\
\smallskip
controll & \rightarrow & control \label{eq:porterstem_5a}\\
probate & \rightarrow & probat \label{eq:porterstem_5b}
\end{eqnarray}
%
In his experiments Porter shows that stemming applied for
information retrieval reduces the size of the vocabulary by about
one third.
\subsubsection{The Paice/Husk Stemmer}
The Paice/Husk stemmer~\cite{paice90stemmer} described by Paice is
an iteratively operating rule-based stemmer. Each rule either
deletes or substitutes word endings, making recoding rules obsolete.
The rules are grouped according to the ending letters of the suffix,
where their order within the groups is of importance. Some of the
rules are dedicated only to words where no changes have yet been
made. The rule format consists of five parts: an ending in reverse
order, an optional intact flag, a zero elimination flag, an optional
append string, and a continuation symbol defining the termination of
the stemming procedure. Stemming is done by first selecting the
corresponding group and applying the ordered rules until (1) a rule
terminates the stemming, (2) no more rules are applicable, or (3)
the group letter changes.
In his subsequent work Paice~\cite{paice94stemming} presents a way
to evaluate different stemmers. In an experiment he compares the
performance of three different stemmers proposed by Lovin, Porter,
and himself, using a self-generated test set of about 10000 words
from the CISI document collection. The findings are that both, the
Lovin' and the Porter stemmer tend to reduce words referring to the
same concept to different stems. In contrast, the Paice/Husk stemmer
tends to reduce words referring to different concepts to a single
stem. However, without an explicit definition of an application
domain neither of the stemmers clearly outperforms the others.
\subsubsection{Krovetz' Extension of the Porter Stemmer}
Krovetz~\cite{krovetz93morphology} experiments with an altered
version of the Porter stemmer. One of the major problems of Porter's
work is that words with different meanings often are reduced to the
same stem. To avoid this, Krovetz includes dictionary lookup that
precedes each of the five phases in Porters stemming algorithm. If a
match occurs, the stem included in the dictionary is taken instead
of applying any rules, and the stemming procedure is instantly
terminated. However, results only show slight benefits compared to
the original version.
Thus, Krovetz proposes two other stemming approaches. First, an
inflectional stemmer that only converts plurals to singular form,
changes past tense to present tense, and eliminates \texttt{-ing}
endings was developed. Each of the three tasks is preceded by a
dictionary lookup.
%
Second, a derivational stemmer was implemented that extends the
inflectional stemmer by the fifteenth most frequent endings. In
their experiments using four different document collections he shows
that the derivational stemmer slightly outperforms the others
(including the Porter stemmer).
\subsubsection{The Croft Stemmer}
Croft and Xu~\cite{croft95corpusspecific} present a corpus-specific
stemming strategy. They apply an aggressive stemmer (Porter stemmer)
that extensively reduces terms even with different meanings to the
same stem. In their experiments the West collection and the Wall
Street Journal collection are used. Based on the stems and a window
size of 100 words, equivalence classes are formed for each stem
using the unionfind algorithm (i.e., $<word,stem>$ pairs). Each word
pair (i.e., $<word,word>$) within each equivalence class is
evaluated according to Equation~\ref{eq:croft_stemming}.
%
\begin{equation}
\label{eq:croft_stemming}
em(a,b) = \frac{n_{ab}}{n_a + n_b}
\end{equation}
%
where $n_a$ (resp. $n_b$) is the number of occurrences of term $a$
(resp. term $b$) and $n_{ab}$ denotes the number of occurrences of
the terms $a$ and $b$ together in the same window.
%
According to a $em(a,b)$ threshold new equivalence classes are
formed, where the final `stem' is defined by the shortest word in
each equivalence class. The new corpus-specific $<term,stem>$ pairs
are stored in a lookup table. Their results show a clear improvement
of information retrieval applying the proposed stemming algorithm,
even outperforming the results achieved by applying the Porter
stemmer directly.
%----------------------------------------------------
%----------------------------------------------------
%----------------------------------------------------
%\section{Natural Language Delimination Problem}
\section{Natural Language Oddities}
\label{sec:tok_challenges}
% general stuff
A properly working language processing system has to be aware of the
different kinds of delimiters, printable characters, and special
characters. Clearly, the depth of analysis has a strong impact on
the system's performance regarding both, quality and time. An
important point is an unambiguous terminology and support during all
levels of language analysis. Last but not least domain specific
knowledge plays a central role and has to be considered properly in
advance.
%
% general question
This implies a proper definition of allowed characters within
processing strings. For instance, are units containing characters
like \texttt{-}, \texttt{@}, \texttt{:}, \texttt{\%}, \texttt{\$},
\texttt{.} defined as correct words or not? The decision strongly
relies on language, domain, and external knowledge which has to be
provided to the language analyzer in some way.
% upper/lower case
Different character sets and the treatment of special characters
always posed difficulties. Moreover, the discussion about upper- and
lowercase letters has still not come to a conclusion. Most
information retrieval systems based on keyword search are converting
characters into lowercase. From the retrieval point of view this
does not effect retrieval performance a lot~\cite{baeza-yates99ir}.
Anyway, capital letters also have their own semantics which may be
used to begin sentences, identify proper nouns, name institutions,
or mark other kinds of tokens such as operation system specific
commands. Obviously, a distinction between \texttt{Richard Brown}
and \texttt{brown ink} may be of interest. In general there is no
easy solution for the problem of accurate proper name detection
\cite[pp. 124]{manning02nlp}.
% delimiters
One of the major difficulties in natural language processing pose
the diverse possibilities to separate words and sentences. It
includes common delimiters like blanks, tabulators, line feeds, and
carriage returns. In addition to that, special characters may also
be used to end up or recombine tokens, including apostrophes,
hyphens, slashes, and punctuation marks. This section outlines
example usages of these delimiters and points out possible
treatments.
\subsection{Difficulties Concerning Sentence Delimiters}
% difficulties
Jackson and Moulinier pointed out that \enquote{Detecting sentence
boundaries accurately is not an easy task}~\cite[pp.
9]{jackson02nlp}. Usually we have no problem to detect the end of
sentences with default markings like the period (\texttt{.}),
question mark (\texttt{?}), and exclamation mark (\texttt{!}). Other
candidates for sentence termination markings are the colon
(\texttt{:}) and the semicolon (\texttt{;}). Especially in the
context of irregular usage of those markers difficulties may arise
(see Table~\ref{tab:tok_ex_delim}). Without considering the actual
context, an interpretation of punctuation marks at sentence borders
may often be ambiguous.
%
\begin{table}[ht]
\centering
\sffamily \small
\rowcolors{1}{tablerowcolorodd}{tablerowcoloreven}
\caption{Examples of (irregular) usage of sentence delimiters}
\label{tab:tok_ex_delim}
\begin{tabular}{lL{9cm}}
\rowcolor{tableheadcolor} \color{white} Category & \color{white} Example \tabularnewline%
%category example
sentence end & \texttt{This is the end...} \tabularnewline%
sentence ending number & \texttt{3 plus 4 is 7. It ended 1977.} \tabularnewline%
sentence ending abbreviation & \texttt{Including mice, rats, etc.} \tabularnewline%
punctuation marks in text & \texttt{In Java !yellow means not(yellow).} \tabularnewline%
cumulative punctuation marks & \texttt{Is that true?! Not this way!!!} \tabularnewline%
citation gap & \texttt{states that \enquote{one theory ... saves the world}}\tabularnewline%
colon and semi colon & \texttt{Some animals: cats and dogs.} \\
\texttt{Some cars: the BMW; the Ford; and the Audi;} \tabularnewline%
\end{tabular}
\end{table}
%
%
% table examples
This diversity of marker usage gives a first insight into the
complexity of natural language text processing. Each of the markers
pose specific (language-dependent) peculiarities in the domain of
text segmentation handling. In order to master the problem of
sentence border disambiguation, a rule-based and linguistically
motivated (language-dependent) multi-level approach is proposed. As
a consequence fixed and static identification strategies often turn
out to be not appropriate, suffering in (language (in-)dependent)
tokenization.
% sentence border detection (period)
Being the first step of natural language analysis, common
tokenization strategies assume that the function of the period is
easily identifiable. However, it takes over many different
functions. One of its default functions is to decode sentence
endings as the last character of a sentence ending word. But even
this task of sentence detection based on border identification still
seems not completely solved.
% functions of the period
Usually the period is defined as the classical sentence delimiter in
European languages. But on a closer look it delimits sentences (per
definition it \enquote{draws the boundaries of something}) only in
combination with a blank. In the domain of pattern conceptualization
it does not function in the same manner. Very often the period may
also be used to format numbers, fix dates and times, compose complex
formats (e.g., credit card numbers, postal codes), enumerate items
(\texttt{the 4.rank}), identify variable names (\texttt{object.id}),
abbreviate full words and phrases, and mark citation gaps (see
Table~\ref{tab:tok_ex_delim}).
% decomposition using periods
The period itself does not uniquely function as a delimiter of
sentences. Without a following blank (i.e., inside abbreviations or
digits) it has a certain decomposing functionality to support the
segmentation of concepts. In patterns like \texttt{12.35} and
\texttt{3.567.123} it helps to distinguish between higher and lower
measurement units. In abbreviation patterns like \texttt{e.g.} the
(first) period helps to identify the related full words. Periods
within URL addresses like \texttt{www.google.com} also decompose
different (semantic) concepts on a syntactical level. Motivated by
Guo~\cite{guo97critical,guo98one} and Mikheev
\cite{mikheev02periods} the document itself can be used to
disambiguate part of such tokens.
%
%
% period in abbrevs
In contrast, a period followed by a blank terminates the
conceptualization process of a semantic unit, which might be the
realization of an abbreviation or the end of the whole sentence. In
the pattern \texttt{e.g.} the second period has two functions: It
decomposes one word of an abbreviation and it terminates the
conceptualization of the whole abbreviated phrase \texttt{exempli
gratia} because of the following blank.
% sentence ending abbreviations
Difficulties arise if abbreviations and regular words are
overlapping, like \texttt{Wash.} (Washington) and \texttt{wash}. As
a consequence dictionary lookup of the word \texttt{wash} (after
punctuation mark splitting and case insensitivity) to identify
non-abbreviations fails. Additionally, abbreviations may
simultaneously denote sentence ends (like \texttt{etc.}).
Treating these abbreviations the same way as within sentences
results in missing sentence ending periods. Thus, sentence borders
are not correctly detected which may lead to problems is later
processing steps (e.g., sentence-based tagging).
%
% german example
In German, for example, one can define the period as a sentence end
if the following conditions are met:
\begin{compactenum}
%
\item The sentence-ending token contains the period as its last character.
\item The sentence-ending token without the period is at least two characters long (no German word consists of a single character).
\item The sentence-ending token without the period must not be a number.
\item The sentence-ending token is the last token of the text, or the token which follows is a number or a string starting with a capital letter.
%
\end{compactenum}
% summary
From this definition it is clear that finding sentence borders
requires much knowledge about a language. To sum up, the period has
the following coding functions:
%
\begin{compactitem}
%
\item It decodes the termination of a sentence and functions as a
delimiter between sentences together with spacings
(blank, tabulator) and new lines (carriage return, line
feed).
%
\item In the domain of word dependent conceptualization
it decodes the termination of semantic word concepts.
%
\item It decodes the reduction of full words and phrases for
the purpose of abbreviation.
%
\item In the domain of formatting, it separates units for a better
readability. After a number it implies an instantiation of an arithmetic
concept.
%
\end{compactitem}
\subsection{Abbreviation and Acronym Detection}
% abbrevs are important
Abbreviations mainly consist of letters separated by periods where
each period terminates a conceptualization. In contrast, acronyms
consist of an arbitrary number of letters without any separation.
They define a compound of related concepts that are reduced for the
sake of easier comprehension by hiding the internal syntactic
structure.
% lists are insufficient
Using list comparison, there may always be the problem of huge
amounts of listed items which are in most cases still incomplete.
Abbreviations may be written with or without periods, making it even
harder to detect sentence borders properly. They occur within
titles, month names, jobs, institutions, country codes, and common
words (see Table~\ref{tab:tok_ex_abbrevs}). Also, the list of used
acronyms (often used in emails or online chats) is growing fast, and
keeping track of all of them is quite impossible.
%
\begin{table}[ht]
\centering
\sffamily \small
\rowcolors{1}{tablerowcolorodd}{tablerowcoloreven}
\caption{Examples of abbreviations and acronyms}
\label{tab:tok_ex_abbrevs}
\begin{threeparttable}
\begin{tabular}{lL{10.5cm}}
\rowcolor{tableheadcolor} \color{white} Category & \color{white} Example \tabularnewline%
%category example
titles & \texttt{Dr., Prof., Dipl.-Ing., NN., John Doe, Jr.\tnote{*}} \tabularnewline%
month names & \texttt{Jan., Sept., Dec.} \tabularnewline%
jobs & \texttt{Off., Lt., Capt.} \tabularnewline%
institutions & \texttt{CIA, UNO, Microtest Inc.} \tabularnewline%
countries and locations & \texttt{AUT, CH, SLO, U.S., St.~Germain} \tabularnewline%
names & \texttt{M. Hassler, Ali G.} \tabularnewline%
measures and metrics & \texttt{yd., yrs., m.p.h., sec., lbs., kg, l, db, AE} \tabularnewline%
common & \texttt{Mr., etc., aka., i.e., versch., fig., Dept.} \tabularnewline%
acronyms & \texttt{AEG, IBM, ISYS, GIS} \tabularnewline%
texting abbreviations & \texttt{LOL, MfG, CU, ACK} \tabularnewline%
\end{tabular}
\begin{tablenotes}
\footnotesize
\item[*] MUC6 example: Guidlines that pertain only to person
Guidelines (\url{http://www.cs.nyu.edu/cs/faculty/grishman/NEtask20.book\_9.html},
08.09.2006)
\end{tablenotes}
\end{threeparttable}
\end{table}
% combined approach
A promising approach that may work is to use a dictionary containing
common abbreviations combined with a list of domain specific
abbreviations. Further, language and application dependent rules
identifying special formats likely to match abbreviations can be
used to improve the results. In combination with a full form lexicon
the number of misinterpreted sentence endings can be additionally
reduced.
\subsection{Numbers}
Generally numbers are not good index terms because of their
vagueness and ambiguity~\cite{baeza-yates99ir}. But they can also be
clearly of importance and should be at least recognized during text
processing. Numbers may be used within written texts with
punctuation marks like periods for large numbers or the commas for
decimal numbers. In addition their formatting is language dependent,
as the number \texttt{123,456.78} in English texts becomes
\texttt{123 456,78} in French texts. The slash (e.g., \texttt{1/2})
is used to express fractions of numbers. Sometimes blanks are used
for formatting, depending on the language and individual styles of
an author. Numbers are used in postal codes, date and time formats,
enumerations, abbreviations of their full forms, currencies,
temperatures, metric measurements, or simply as counter.
Table~\ref{tab:tok_ex_numbers} gives some examples of number usages
in texts.
%
\begin{table}[ht]
\centering
\sffamily \small
\rowcolors{1}{tablerowcolorodd}{tablerowcoloreven}
\caption{Examples of number occurrences in texts}
\label{tab:tok_ex_numbers}
\begin{tabular}{lL{10.5cm}}
\rowcolor{tableheadcolor} \color{white} Category & \color{white} Example \tabularnewline%
%category example
starting words & \texttt{3rd, 5th, 14t{\"a}gig, 100fach} \tabularnewline%
within words & \texttt{F-117A bomber, H2O} \tabularnewline%
ending words & \texttt{vitamin B12, ISBN:123-1433-234} \tabularnewline%
with special characters & \texttt{20\% VAT, volume \#4} \tabularnewline%
composite nouns & \texttt{20-dollar bills} \tabularnewline%
names & \texttt{OS/2, PS/2} \tabularnewline%
numbers and fractions & \texttt{12.345,60, 5 1/4 points, 1.5 times faster} \tabularnewline%
times & \texttt{1:30p.m., on 14.may, 510B.C.} \tabularnewline%
dates & \texttt{2005/June, 01-05-2005, 1875/12/1, 01.01.99, April 05} \tabularnewline%
periods & \texttt{1970-80, summer term 2004/05} \tabularnewline%
currencies & \texttt{\$100,000, US\$42,1 millions, } \tabularnewline%
measures and metrics & \texttt{12kg, 127lbs, 4sec, } \tabularnewline%
\end{tabular}
\end{table}
%
%
% treatment
These patterns can be treated by rules defining the format (covered
in the next subsections). An advanced lexical analysis procedure
might also perform date and number normalizations to uniform
formats~\cite[pp. 166]{baeza-yates99ir}.
\subsection{Special Formats} \label{tok_formats}
% phone numbers
Special formats like sixteen digits in a row separated by blanks or
hyphens may identify credit card numbers. Such patterns can be
covered by special rules. For example, phone numbers are written in
quite heterogenous forms as shown in Table~\ref{tab:tok_ex_phone}.
They may contain spaces, periods, hyphens, brackets, and slashes to
group digits in various forms \cite[pp. 130]{manning02nlp}.
%
\begin{table}[ht]
\centering
\sffamily \small
\rowcolors{1}{tablerowcolorodd}{tablerowcoloreven}
\caption{Examples of phone number formats}
\label{tab:tok_ex_phone}
\begin{threeparttable}
\begin{tabular}{ccc}
\rowcolor{tableheadcolor} \mmcc{3}{Phone numbers} \tabularnewline%
%column 1 column 2 column 3
\texttt{(43463)12345} & \texttt{0043-463-12345} & \texttt{+43 463 12345}\tnote{a} \tabularnewline%
\texttt{+43(0)463 12345}\tnote{a} & \texttt{+43/463 12345}\tnote{a} & \texttt{++43-463-12345} \tabularnewline%
\texttt{43-463-12345} & \texttt{+43/(0)463/12345} & \texttt{0463 2700 3531}\tnote{a} \tabularnewline%
\texttt{(44.171) 830 1007}\tnote{a} & \texttt{(202) 522-2230}\tnote{a} & \texttt{1-925-225-3000} \tabularnewline%
\texttt{212.995.5402}\tnote{a} & \texttt{+411/284 3797}\tnote{a} & \texttt{+49 69 136-2 98 05}\tnote{a} \tabularnewline%
\end{tabular}
\begin{tablenotes}
\footnotesize
\item[a] these format cannot be identified on single-token level
\end{tablenotes}
\end{threeparttable}
\end{table}
\hiddenabbrev{Uniform Resource Locator}{URL}
% URLs
Other formats like email addresses containing the \texttt{@}
character, web addresses tending to start with \texttt{www}, or URL
definitions including a protocol as \texttt{ftp://} have to be
considered in texts as well (see Table~\ref{tab:tok_ex_formats}).
%
\begin{table}[ht]
\centering
\sffamily \small
\rowcolors{1}{tablerowcolorodd}{tablerowcoloreven}
\caption{Examples of special formats}
\label{tab:tok_ex_formats}
\begin{tabular}{L{5cm}L{5cm}}
\rowcolor{tableheadcolor} \mmcc{2}{Special token formats} \tabularnewline%
%category example
\texttt{john@smith.com} & \texttt{http://www.fbi.org} \tabularnewline%
\texttt{Object::Name} & \texttt{x.id} \tabularnewline%
\texttt{ID:F175-23-ak} & \texttt{c:$\backslash$photo$\backslash$123.jpg} \tabularnewline%
\texttt{IP:192.168.1.51} & \texttt{/home/marcus/repository.txt} \tabularnewline%
\end{tabular}
\end{table}
\subsection{Apostrophes}
% border detection problem
Apostrophes pose a set of difficulties in finding token borders and
their treatment during tokenization. They are not only used for
enclosing direct speech, but also for encoding contractions and
concatenations (through elision), clitics, or indication of a
plurals possessive (see Table~\ref{tab:tok_ex_ap})~\cite[pp.
126]{manning02nlp}.
%
\begin{table}[ht]
\centering
\sffamily \small
\rowcolors{1}{tablerowcolorodd}{tablerowcoloreven}
\caption{Examples of apostrophe usage}
\label{tab:tok_ex_ap}
\begin{threeparttable}
\begin{tabular}{lL{11.5cm}}
\rowcolor{tableheadcolor} \color{white} Category & \color{white} Example \tabularnewline%
%category example
direct speech & \texttt{He said: 'It is really difficult!'} \\
\texttt{He says: "Give me that coke, Mr. Smith."} \\
\texttt{'What do you mean?'} \tabularnewline%
elision & \texttt{you're, it's, won't, l'addition, c'est la vie, presqu'le} \tabularnewline%
clitics & \texttt{the man's car, the cat's food} \tabularnewline%
plurals possessive & \texttt{the boys' hats, the cats' food} \tabularnewline%
quoted proper nouns & \texttt{'Water Rats', 'The Round Table'} \tabularnewline%
within names & \texttt{O'Reilley, c't magazine, McDonald's, Alzheimer's\tnote{a}}\tabularnewline%
\end{tabular}
\begin{tablenotes}
\footnotesize
\item[a] MUC6 example: Miscellaneous types of proper names from MUC6
Guidelines (\url{http://www.cs.nyu.edu/cs/faculty/grishman/NEtask20.book\_9.html},
08.09.2006)
\end{tablenotes}
\end{threeparttable}
\end{table}
% general treatment
Apostrophe treatment can be seen as a two level process. It
presupposes simple tokenization including separation of token
candidates (not including a delimiting character) and basic
punctuation mark separation in a first place. Afterwards,
linguistically motivated apostrophe treatment follows. The central
issue is whether strings containing apostrophes should be separated
or not. This issue cannot be resolved without incorporating general
decision guidance. In a first rough approximation the following
basic guidelines for apostrophe treatment are proposed:
%
\begin{compactitem}
%
\item Apostrophes at the start and end of tokens (strings without
delimiting characters) are considered as quotes and separated.
%
\item Apostrophes occurring within tokens are candidates
for further linguistically motivated splitting.
%
\item Apostrophes that cannot be treated in both ways (e.g., no
splitting is possible) are specifically marked and left
for further treatment.
%
\end{compactitem}
% splitting or not
There are several ways to decompose aggregated elision patterns like
\texttt{isn't} in which the \texttt{isn} is the result of a
phonological motivated contraction process. It is an open issue
whether those tokens should be (1) treated as a single token in form
of \texttt{isn't}, (2) split up into two tokens \texttt{isn} and
\texttt{t}, or (3) substituted by their full forms \texttt{is} and
\texttt{not}.
% strategies to handle
The first strategy (1), which is not favored, does not treat
apostrophes at all. The processing is left open to the next analysis
step. The second strategy (2) treats the apostrophe as a separator
and decomposes the involved elements without any linguistic
interpretation. The third strategy (3), which is based on
(language-dependent) logical and semantical analysis (decomposition
rules), is preferred for the purpose of this work. It combines the
first separation step with a second substitution step via
morphosyntactically motivated linguistic rules. For instance,
\texttt{n't} is interpreted as the negation operator \texttt{not}
and thus gets isolated and substituted by its full form. In the case
of \texttt{isn't}, this results in two separate words \texttt{is}
and \texttt{not}. One has to note that there are also exceptions to
this processing in English, as the tokens \texttt{won't} and
\texttt{can't} cannot be treated in this manner.
% list plus rules
To achieve optimal results the treatment of strings containing
apostrophe has to be done on two levels. First it has to be compared
to a (countable) list of exceptions (like \texttt{won't}) with their
full strings (like \texttt{will not}). Matching strings are
substituted by their listed full forms. If no match occurs general
strategies are applied (like strings ending with \texttt{n't} are
split up into the string before the \texttt{n't} and the
substitution \texttt{not}).
% task is complex
The complexity of the problem confirms the hypothesis that proper
apostrophe treatment has to consider the linguistic context. Token
isolation in this sense minimizes the error-proneness of subsequent
disambiguation processes operating on the remaining apostrophe
containing strings.
\subsection{Hyphenations}
% introduction
Hyphenation is another relevant issue in the context of multi-level
tokenization~\cite{grefenstette94tokenization}. Manning and
Sch{\"u}tze distinguish three different types of hyphenation
\cite[pp. 127]{manning02nlp}:
%
\begin{enumerate}
%
\item Hyphenation is used for text justification, where words at
line breaks are split up for the sake of alignment. Finding
such hyphens as the last character of a word at the end of a
line, getting rid of the hyphen, and reconcatenating it with the
first word of the next line is an easy task. But it leads to
haplology in the ambiguous case of words containing natural hyphens.
In electronic texts such hyphens are usually not present.
%
\item Some words contain natural hyphens that are part of the word itself. This
includes lexical hyphens inserted before or after small word
formatives (e.g., \texttt{re-concatenate}).
%
\item Hyphens are used to indicate the correct grouping of
words (e.g., \texttt{state-of-the-art}). Those tokens may be split
up into their components to allow
a syntactic analysis~\cite{baeza-yates99ir}. Because they are
very common in many corpora, the size of vocabulary
increases considerably if they are not split up (e.g.,
\texttt{information-based}).
%
\end{enumerate}
% examples
Hyphens can also be used as an autonomous dash to separate semantic
units of a sentence, for listing items, as a minus sign (in
formulas), for the sake of abbreviation, and much more (see
Table~\ref{tab:tok_ex_hyphen}).
%
\begin{table}[ht]
\centering
\sffamily \small
\rowcolors{1}{tablerowcolorodd}{tablerowcoloreven}
\caption{Examples of hyphenation usage}
\label{tab:tok_ex_hyphen}
\begin{tabular}{lL{11cm}}
\rowcolor{tableheadcolor} \color{white} Category & \color{white} Example \tabularnewline%
%category example
(a) line breaks & \texttt{at-tach, unpro-nounced, analo-gy} \tabularnewline%
(b) single words & \texttt{e-mail, co-author, A-1-plus, so-called, pro-Arab, B-52} \tabularnewline%
(c) grouping of words & \texttt{aluminium-export, text-based, 15-year-old, ring-around-the-rose, science-fiction, U.S.-Japan trade} \tabularnewline%
(d) dashes & \texttt{It works -- as I supposed -- fine.} \tabularnewline%
(e) itemizations & \texttt{~- easy}\\
\texttt{~- medium}\\
\texttt{~- hard} \tabularnewline%
(f) minus sign & \texttt{average of -25 degrees centigrade, obviously 3-2=1 is true} \tabularnewline%
(g) abbreviations & \texttt{the pre- and postconditions are} \tabularnewline%
(h) names & \texttt{Jean-Claude van Dame, F. Gregory Fitz-Gerald} \tabularnewline%
\end{tabular}
\end{table}
% irregular usage
A particular problem in this area is the inconsistent use of hyphens
\cite[pp. 128]{manning02nlp}. There exist many examples of different
forms and writing styles of the same terms like
\texttt{co-operate/cooperate}, \texttt{data-base/database}, or
\texttt{hyper-text/hypertext}. A careful editor surely would filter
out this inconsistencies, but in practice most of the texts will
contain such forms.
%
% 2-level treatment
To cope with these difficulties, the approach in this work treats
hyphenation on different levels of tokenization:
\begin{itemize}
%
\item On the first level (simple tokenization) hyphens decoding token- and
line-endings are marked as potential candidates for reconcatenation
on the second level of tokenization.
Hyphens surrounded by blanks (examples (d) and (e) in Table~\ref{tab:tok_ex_hyphen})
are also marked and treated as separate tokens on the first level.
All other hyphens
within strings are treated as valid characters and are not
processed any further.
%
\item On the second level tokens marked for reconcatenation
are substituted by their concatenated form.
For instance, the term \texttt{hyphenation} may be hyphenated as
the three token sequence \texttt{hyphen-} + \texttt{EndOfLine} +
\texttt{tion}.
%
Therefore, list comparisons (e.g., last token is not a valid term, concatenated
string is a valid term) and rule-based strategies (e.g., \texttt{[term]-} +
\texttt{[EndOfLine]} + \texttt{ness} $\rightarrow$ \texttt{[term]ness})
are applied. Tokens marked as autonomous hyphens are
further interpreted as sentence markers ((d) in Table~\ref{tab:tok_ex_hyphen}),
implicit logical items ((e) in Table~\ref{tab:tok_ex_hyphen}), or part of formulas
((f) in Table~\ref{tab:tok_ex_hyphen}).
%
\end{itemize}
\subsection{Slashes and Other Special Characters}
% introduction
In addition to apostrophes and hyphens, slashes, backslashes, and
other special characters are frequently used within texts. Like
before, these patterns may also be candidates for splitting -- with
the same drawbacks of misinterpretation. Some examples are depicted
in Table~\ref{tab:tok_ex_specialchars}.
%
\begin{table}[ht]
\centering
\sffamily \small
\rowcolors{1}{tablerowcolorodd}{tablerowcoloreven}
\caption{Examples of other special characters in tokens}
\label{tab:tok_ex_specialchars}
\begin{tabular}{lL{11cm}}
\rowcolor{tableheadcolor} \color{white} Category & \color{white} Example \tabularnewline%
%category example
names & \texttt{OS/2, PS/3000, Micro\$oft, AT\&T, Horvarth\&Sons} \tabularnewline%
abbreviations & \texttt{w/o} \tabularnewline%
alternatives & \texttt{this element/container, the pre/postcondition}
\end{tabular}
\end{table}
% transition
The next section introduces Extended Tokenization, a rule-based
approach that tackles these natural language oddities.
%----------------------------------------------------
%----------------------------------------------------
%----------------------------------------------------
\section{Extended Tokenization}
\label{sec:extended_tokenization}
% introduction
Researchers from the text mining, information retrieval, and
computational linguistics domains agree that tokenization is the
first step of any kind of natural language text
processing~\cite{webster92tokenization,fox92lexical,grefenstette94tokenization,guo97critical,guo98one,barcala02tokenization}.
Good surveys about tokenization techniques are provided by Frakes,
and Baeza-Yates~\cite{frakes92ir}, Baeza-Yates and Ribeiro-Neto
~\cite[pp. 165--167]{baeza-yates99ir}, and Manning and
Sch{\"u}tze~\cite[pp. 124--136]{manning02nlp}. But only very few
reflect tokenization as a task of multi-language text processing
with far-reaching
impact~\cite{giguet96multilinguality,yamashita00morphological}. This
involves language-related knowledge about linguistically motivated
and domain specific patterns on many levels of linguistic analysis
(i.e., sentence border disambiguation, composite noun
identification, abbreviation
handling)~\cite{jackson02nlp,say97informationbased,say98punctuation}.
% goal
The major goal of this early (pre-linguistic) task is to convert a
stream of characters into a stream of processing units called
tokens. Beyond the computational linguistics community this job is
taken for granted. Commonly it is seen as an already solved problem
comprising the identification of word borders and punctuation marks
separated by spaces and line breaks. Many textbooks even dispatch
tokenization as relatively uninteresting, but in reality
\enquote{tokenization is a non-trivial
problem}~\cite{grefenstette94tokenization}. From the linguistic
point of view it should manage language related word dependencies,
incorporate domain specific knowledge, and handle
morphosyntactically relevant linguistic phenomena.
% difficulties
At one stage of linguistic analysis, elements of a text have to be
assigned to certain syntactic classes. In order to find those
elements (strings) the text (one long string) has to be divided into
logical units (tokens) that can be assigned to those classes. A key
issue of tokenization is recognizing sentence boundaries since most
grammars consider them as semantic and logic units of
treatment~\cite{grefenstette94tokenization}. Especially if using
online material such as newsgroups and web pages for data
\enquote{\ldots one finds all sorts of oddities like C|net
\ldots}~\cite[pp. 125]{manning02nlp}. Furthermore, characters like
numbers, hyphens, punctuation marks or upper- and lowercase letters
make up a considerable part of a text and have to be taken into
account. In~\cite[pp. 10]{jackson02nlp} the authors outline that a
simple approach is insufficient.
% prerequisites
The main assumption is that text -- the content of fragments --
which is processed by the system has already been pre-filtered,
e.g., whitespaces are collapsed and tags for structural or layout
relevant markup (e.g., \texttt{<link>}-tags, \texttt{<emph>}-tags)
are removed. Furthermore it is presupposed that contents in form of
headers, separators, tables, figures, typesetting codes, and other
isolated objects are not included in the analysis~\cite[pp.
123]{manning02nlp}.
% tokenization as the first step, definition
After this preprocessing of raw textual content, tokenization is the
first step dealing with pure natural language texts. It separates
tokens and sentences, identifies proper nouns and special formats,
handles abbreviations, and performs basic (non-linguistic) text
transformations (thesaurus like substitutions or format
normalizations).
%
% great impact
All subsequent steps during natural language analysis are based on
the result of the tokenizer. This means that early errors occurring
during tokenizing lead to very poor overall performance. It is
important to be conscious that the theory and methods applied here
have a great impact and have to be considered carefully~\cite[pp.
167]{baeza-yates99ir}.
% extended tokenization
Therefore, rule-based Extended
Tokenization~\cite{hassler06tokenization} is proposed that includes
all sorts of linguistic knowledge (e.g., language, grammar rules,
dictionaries), domain knowledge, gazetteer knowledge, and expert
knowledge. Extended Tokenization in this sense does not only
separate strings into basic processing units, but also interprets
and groups isolated tokens to create higher level tokens. This is
done by exploiting so-called token types assigned through an
elaborated machinery of heuristic and linguistically motivated
rules, and -- if available -- minimized dictionary knowledge.
% figure
Figure~\ref{fig:tokenization_task} shows where Extended Tokenization
is located within the text preparation and processing framework.
First, raw texts are preprocessed and segmented into textual units.
This step comprises cleansing and filtering (e.g., whitespace
collapsing, stripping extraneous control
characters)~\cite{palmer00tokenization} and removal of all kinds of
structural or layout relevant markup. Then, Extended Tokenization
segments the plain text into appropriate processing units.
Tokenization does not include any kind of linguistic analysis like
morphological word analysis or syntactic sentence analysis. These
tasks are left open to be accomplished on higher levels during
natural language analysis. Subsequent tasks like tagging are applied
on the tokenized output and thus should be supported as far as
possible (e.g., format normalization, consistent terminology).
%
\begin{figure}[ht]
\centering
\includegraphics[width=0.6\textwidth]{05_text_analysis/figures/Tokenization_Task}\\
\caption{The task of text preparation and processing}
\label{fig:tokenization_task}
\end{figure}
% implementation
The implementation prototype, JavaTok (see
Section~\ref{sec:javatok}), provides a language independent core
tokenizer that is easily adaptable for language specific needs by
means of incremental rule set expansion. It operates on plain
natural language texts, relying on the language-independent
UTF-16~\cite{url_utf16} character set. JavaTok supports rule-based
typing of tokens using regular expression matching and methods for
context dependent constraint checking. The core features of the
implemented system are identification and disambiguation of all
kinds of linguistic markers, detection and expansion of
abbreviations, treatment of special writing formats, and typing of
tokens including single-tokens and multi-tokens.
% tokenization supports tagging
To improve the quality of textual representations,
linguistically-based tokenization is a necessary step that precedes
further text analysis. The complexity of subsequent processing tasks
(e.g., tagging, stemming, named entity recognition, chunking,
parsing, etc.) is reduced dramatically by decisions made early in
the above mentioned string-interpretation domains. By mapping token
types onto part-of-speech tags, the tagging process is speeded up.
There is no need to identify string patterns again, because the
tokenizer provides tagger relevant clues. Beyond that, the tagging
results are improved because the tokenizer generates better
interpretable input units (multi-tokens). This section focuses on
improving the quality of standard tagging through Extended
Tokenization. Thus, the overall quality of textual representation
throughout all levels of analysis is enhanced.
\hiddenabbrev{Message Understanding Conference}{MUC}
% MUC and TEI
The early identification of tokens partially covers the well known
`named entity recognition' task based on an empirically motivated
categorization of proper names as defined by MUC-6~\cite{url_muc6}
and MUC-7~\cite{url_muc7}. The \abbrev{Text Encoding
Initiative}{TEI} Guidelines~\cite{url_tei_guidelines,url_tei} for
Electronic Text Encoding and Interchange cover such identifiers
(plus abbreviations) and explain that they comprise \enquote{textual
features which it is often convenient to distinguish from their
surrounding text. Names, dates and numbers are likely to be of
particular importance to the scholar treating a text as source for a
database; distinguishing such items from the surrounding text is
however equally important to the scholar primarily interested in
lexis.}~\cite[Section 6.4]{url_tei_guidelines}
% outlook
The next section provides the definitions of two token concepts.
Section~\ref{sec:typing} introduces the notion of token types to
support token generalizations and high level concept formulation.
The procedure of assigning token types to tokens is described as a
rule-based approach. The architecture and functionality of the
implementation (JavaTok) is presented in Section~\ref{sec:javatok}.
Finally, some features of JavaTok with respect to the theoretical
considerations are outlined.
\subsection{Definitions of Token Concepts}
\label{sec:concepts}
% introduction
In this work two different kinds of tokens, namely single-tokens and
multi-tokens are distinguished.
\subsubsection{Single-Tokens}
% definition
The simplest form of a token is the \emph{single-token}. It is
defined as a character string not containing any non-printable or
delimiting characters (blank, tabulator, line feed, new line, etc.).
It corresponds to the traditional concept of a token. Examples of
single-tokens are words, numbers, internet addresses, and most
abbreviations. See Guo~\cite{guo97critical,guo98one} and
Mikheev~\cite{mikheev02periods} for more examples.
\subsubsection{Multi-Tokens}
% introduction
Written texts also contain more complex language constructs that do
not fit into the single-token concept. Such tokens may be specially
formatted using blanks (the standard delimiter for token boundaries)
that belong to semantically motivated groups of tokens. The blank is
an inherent part of the token chain fixed together through
interpretation (see Table~\ref{tab:tok_ex_mt}).
%
\begin{table}[ht]
\centering
\sffamily \small
\rowcolors{1}{tablerowcolorodd}{tablerowcoloreven}
\caption{Examples of multi-tokens}
\label{tab:tok_ex_mt}
\begin{tabular}{lL{11cm}}
\rowcolor{tableheadcolor} \color{white} Category & \color{white} Example \tabularnewline%
%category example
Composite nouns & \texttt{traffic jam, rush hour, Christmas tree} \tabularnewline%
Full names & \texttt{Albert Einstein, George William Bush} \tabularnewline%
Institutions & \texttt{University of Klagenfurt, Red Cross } \tabularnewline%
Locations & \texttt{Niagara Falls, New York, Suncoast Seabird Sanctuary} \tabularnewline%
Special formats & \texttt{+43 463 12345-67, ISBN 202-546-234-0}
\end{tabular}
\end{table}
% definition
Tokens that contain token delimiters for formatting (e.g., blanks,
tabs, new lines, line feeds) are defined as \emph{multi-tokens}.
Well known representatives are composite nouns (\texttt{summer
time}), special formats (\texttt{+43 463 2700-3531}), named entities
(names, locations, institutions), and idioms (formulas).
Traditionally they have been identified as a sequence of atomic
tokens glued together during a later processing phase - mainly using
dictionary lookup. In this approach these tokens are multi-tokens
through heuristic interpretation or, in other words, they are tokens
through rule-based typing.
%
The early treatment of multi-tokens as (semantic) concepts during
text processing improves the overall quality of information
retrieval and document mining tasks. The representation of a text
using multi-tokens leads to better intermediate results, hence
structurally and (semantically) grouped tokens are treated as atomic
units. If subsequent tasks do not support multi-tokens, a simple
reinterpretation into standard single-tokens is on hand.
%
Many of the cases mentioned in Table~\ref{tab:tok_ex_mt} can be
handled either by list lookup (compared to predefined strings) or by
matching against regular expressions (like phone numbers). Extremely
difficult to deal with are patterns including hyphenation like in
\texttt{New York-New Haven railroad} \cite[pp. 130]{manning02nlp}.
Here, in order to avoid a misinterpretation of \texttt{York-New} as
a separate hyphen-containing token, the recognition of named
entities has to precede rule-based resolving of hyphens.
\subsection{The Procedure of Token Typing}
\label{sec:typing}
% introduction
Rule-based typing of tokens has not been introduced in the
literature so far. In this work typing of tokens is defined as a
pre-linguistic classification process that assigns type identifiers
to both, single-tokens and multi-tokens. Hence this proposal is user
centered, the user himself is allowed to define his own token types
for appropriate needs, where token types are simply expressed
through strings. In a complex natural language processing framework
token typing supports central linguistic tasks like simple
part-of-speech tagging and/or basic semantic tagging. This approach
partially covers named entity recognition, the classification of
proper names into categories~\cite{mikheev02periods}. This
processing step is integrated in the Extended Tokenization task
because the identification of multi-tokens comprises more than just
named entities.
% domain dependency
The proposed framework for tokenization allows the definition of
multi-tokens by including delimiters (like blanks) within the token
recognition part. This is necessary for each variant of multi-level
tokenization. The definition of token types relies on the available
sources of knowledge and strongly depends on the motivation and
application of token interpretation. There are
%
\begin{compactitem}
\item domain knowledge: i.e., structure of an organization, knowledge about data warehouses;
\item gazetteer knowledge: i.e., country names, river names;
\item expert knowledge: i.e., medicine, astronomy;
\item pure linguistic knowledge: i.e., morphological and syntactical rules, subject of a sentence.
\end{compactitem}
% included features
The goal is to prohibit misleading separation during tokenization,
providing an optimized input for further linguistic processing. It
includes rule-based pattern recognition of single- and multi-tokens,
grouping of tokens, and token splitting. Subtasks can be formulated
as follows:
\begin{compactitem}
\item string replacements (substitutions of single- and multi-tokens)
\item abbreviation and acronym handling (recognition and substitution)
\item identification of proper nouns and names (part of named entity recognition)
\item application of rules on single- and multi-token level (for recognition, substitution, grouping, and splitting)
\end{compactitem}
% token types
A minimal set of internal token types is used to mark end of
sentences $T_{eos}$, punctuation marks $T_{pm}$, delimiters $T_d$,
and unknown tokens $T_u$ (darker boxes in
Figure~\ref{fig:javatok_basictokentypes}). Additional single-token
types occurring in lookup lists or rule bases are dynamically
incorporated and applied during the tokenization process (lighter
boxes in Figure~\ref{fig:javatok_basictokentypes}). In the figure,
the numbers in parentheses count the rules that are defined for this
token type (described in
Chapter~\ref{chapter:advanced_text_analysis}).
%
\begin{figure}[ht]
\centering
\includegraphics[width=0.6\textwidth]{05_text_analysis/figures/JavaTok_BasicTokenTypes}\\
\caption{Basic token types}
\label{fig:javatok_basictokentypes}
\end{figure}
%
In addition, higher level types for multi-tokens can be defined,
referring to predefined sequences of tokens. Using the same
terminology through different natural language steps (e.g., equally
named token types and part-of-speech tags), tokenization directly
supports subsequent processing tasks. Otherwise, a reinterpretation
or mapping step between token types and tags is needed, allowing to
generalize classes of token types to a certain part-of-speech tag.
% typing process
The typing process (see Algorithm~\ref{alg:token_typing}) comprises
two phases: The \emph{first phase} (1-3) identifies token and
sentence borders. It results in a sequence of single-tokens
associated with basic token types.
%
In a \emph{second phase} (4-6) contextually motivated
reinterpretation (retyping) of tokens is carried out. Therefore,
rules for changing token types, rules for merging multiple
single-tokens into a single multi-token, and rules for splitting a
single multi-token into multiple single-tokens are applied
recursively. The aim of the second phase is to improve the accuracy
of the tokenization results achieved in the first phase.
%
\begin{algorithm}[ht]
\caption{Typing of tokens}
\label{alg:token_typing}
\begin{algorithmic} [1]
\State identify single-tokens
\State type single-tokens according to the basic token types
\State split sentence end markers
\smallskip
\smallskip
\smallskip
\State reinterpret single-token types
\State merge and split tokens recursively
\State reinterpret all token types
\end{algorithmic}
\end{algorithm}
% step 1 and 2
The tokenization algorithm (Algorithm~\ref{alg:token_typing}) starts
with basic text segmentation, separating strings into single-tokens
(step 1) using standard delimiters (blanks, tabs, new lines, line
feeds). Initially, each token is assigned the unknown token type
$T_u$.
%
In step 2, rules assign basic token types (see
Figure~\ref{fig:javatok_basictokentypes}) to single-tokens,
utilizing a classification of tokens in distinct categories. This
step includes assignments of the \textbf{Internal} token types.
%
%
% basic token types
Some examples of other basic types and subtypes are
\begin{compactitem}
%
\item \textbf{Alpha $T_a$:} no letters capitalized $T_{a1}$, first letter capitalized
$T_{a2}$, all letters capitalized $T_{a3}$, mixed cases $T_{a4}$, etc.
%
\item \textbf{Numeric $T_n$:} plain numbers ($T_{n1}$), numbers containing periods or colons
$T_{n2}$, etc.
%
\item \textbf{Entity $T_e$:} internet addresses $T_{e1}$, credit card numbers $T_{e2}$, etc.
%
\end{compactitem}
%
% step 3 and example
The third step identifies and separates punctuation marks.
Therefore, remaining tokens of unknown type $T_u$ ending with a
punctuation mark are investigated as possible sentence ends. If the
complete token string does not match an entry in one of the
repositories (e.g., lists of abbreviations, acronyms, regular
expressions rules for single-token or multi-token typing), the last
character is split and a new token is created together with its
corresponding token type (see 1 in Figure~\ref{fig:tok_ex1}). To
assure the correctness of this splitting operation, a set of
context-sensitive rules is applied. A token ending with a period and
followed by a lower case token is not split, because the period does
not mark the end of a sentence (see 2 in Figure~\ref{fig:tok_ex1}).
%
\begin{figure}[ht]
\centering
\includegraphics[width=0.9\textwidth]{05_text_analysis/figures/JavaTok_Example1}\\
\caption{Example rule for punctuation mark splitting}
\label{fig:tok_ex1}
\end{figure}
% introduction to user defined rules
A set of advanced user-defined token typing rules is then used to
reinterpret, group, and split singe-tokens (steps 4--6 in
Algorithm~\ref{alg:token_typing}). The user is enabled to define
custom rules to support domain-specific needs. These rules may also
include references to the previously assigned single-token types.
% examples of user defined rules
Examples of advanced user-defined types are stopwords $U_1$,
abbreviations $U_2$, acronyms $U_3$, dates and times $U_4$, phone
numbers $U_5$, email addresses $U_6$, and a sequence of capitalized
single-tokens $U_7$ (in many cases extended keywords). These token
types are assigned by applying two strategies: First, tokens are
compared to a repository of reliable entries \texttt{(string, token
type)} created either by a human expert or a (semi-)automatic
machinery. If no match occurs, an ordered list of rules is applied
to process the tokens and token sequences. The rules include regular
expression matching of token strings (see 3 in
Figure~\ref{fig:tok_ex2}), matching of token types (see 4 in
Figure~\ref{fig:tok_ex2}), and combinations of both (see 5 in
Figure~\ref{fig:tok_ex2}).
\begin{figure}[ht]
\centering
\includegraphics[width=0.9\textwidth]{05_text_analysis/figures/JavaTok_Example2}\\
\caption{Tokenization rules}
\label{fig:tok_ex2}
\end{figure}
% syntax of rules
The examples in Figure~\ref{fig:tok_ex1} and
Figure~\ref{fig:tok_ex2} outline the syntax of the tokenization
rules. Each rule consists of a condition part $IN$ (input sequence
of typed tokens) and a consequence part $OUT$ (output sequence of
typed tokens). The numbered indices of tokens indicate relative
token positions. The rule-based approach is based on simple and pure
linguistic functional interpretation of basic token types and token
strings in a given context. Example types of rules may cover
morphological, syntactical, and general patterns like
%
\begin{compactitem}
\item sentence border disambiguation
%
\item multi-token identification
%
\item special character treatment (e.g., apostrophes, slashes, ampersand etc.)
%
\item suffix identification of well-known endings (e.g., \texttt{-ly}, \texttt{-ness}).
%
\item identification and reconcatenation of hyphenated words at line breaks
%
\item abbreviation treatment
\end{compactitem}
\subsection{JavaTok}
\label{sec:javatok}
% introduction
This section describes the architecture of JavaTok, a
freely configurable tokenizer developed in JAVA. To cope with language
dependent occurrence of special characters (country specific
characters like Slavic diacritics, French accents, umlauts and sharp
s in German, etc.), JavaTok enables a Unicode-conform
UTF-16~\cite{url_unicode,url_utf16} initialization and input/output
processing. For the purpose of convenient higher-level tokenization
the following features are necessary:
%
\begin{compactitem}
\item free configuration and adaptation (character definitions, tokenization strategies)
\item string replacements (abbreviation resolution, zero elimination, string and thesaurus-like substitution of multiple length)
\item user-defined token type definition
\item rule-based token typing (credit card numbers, phone numbers, dates, internet addresses, special IDs, \ldots)
\item pre-tagging functionality (based on token types)
\item compound noun and proper name identification
\item multi language support
\item process statistics and runtime performance measurements
\item multiple output formats
\end{compactitem}
% no uncertain transformations
As presented, Extended Tokenization is defined as a rule-based
process that includes basic linguistic knowledge. In case of
ambiguity JavaTok does not make assumptions or guesses. Uncertain
borders of tokens or sentences are not further interpreted or
touched. This is because misinterpretation of tokens at an early
stage leads to poor overall performance during language analysis.
JavaTok also does not carry out any kind of character conversion
automatically, hence other tools operating on the tokenizers' output
may loose important information (such as the case of letters during
tagging).
\subsubsection{Architecture}
Major aims were to support web-based processing, easy integration in
existing software systems, and good overall performance. Therefore,
the functionality of JavaTok can be accessed in three different
ways:
%
\begin{itemize}
\item JavaTok operates as \emph{stand-alone software} called from the
command line. All initializations are specified via XML configuration
files. JavaTok supports single file and batch
processing, providing TXT, XML, and HTML output formats.
\item The functionality of JavaTok can be included into other
JAVA-based projects by simply importing JavaTok as a single \emph{JAVA class}.
Initialization and computation is done via public methods. An OutputFormatter
is included supporting TXT, XML, and HTML.
\item It is possible to access the tokenizer via internet as a
\emph{servlet} running on a Tomcat webserver.
The online version is available at the Klagenfurt CLR
web portal~\cite{url_clr_portal}.
\end{itemize}
% workflow
Figure~\ref{fig:javatok_architecture} depicts the workflow of
JavaTok. A single configuration file contains the input file, output
file, parameter settings, and references to dictionaries and
knowledge bases containing typing rules for single-tokens and
multi-tokens. String replacements are carried out at the very
beginning, eliminating undesired markup or known noise in the input
texts. In a next step tokens are split according to the delimiter
definitions. Subsequent single-token typing is carried out,
assigning each token its basic token type (including $T_u$ for
unknown tokens). Punctuation mark splitting is only applied to
tokens of unknown type. First, punctuation marks at the beginning of
a token are separated and typed as $T_{pm}$. Then, if possible,
single sentence end marks are split from the end of the tokens and
typed as $T_{eos}$. Finally, remaining punctuation marks are
isolated from the end of the token and typed as $T_{pm}$. After the
splitting of punctuation marks multi-token types are identified.
Finally, optional abbreviation substitution in both directions
(expansion and abbreviation) is accomplished.
%
\begin{figure}[ht]
\centering
\includegraphics[width=0.95\textwidth]{05_text_analysis/figures/JavaTok_Architecture}\\
\caption{The architecture of JavaTok}
\label{fig:javatok_architecture}
\end{figure}
% priority to lookup, ordered rules
During all typing tasks priority is given to repository lookup. The
rules for typing address token strings, token types, or both using
regular expressions. The order of the tokenization rules is of great
importance. The rules are applied sequentially, thus rules firing
earlier prevent firing of later rules.
\subsubsection{Sample Tokenization Outputs}
% introduction
In the example given in Figure~\ref{fig:tokenizer_output} basic
token types $T_{a1}$, $T_{a2}$, $T_{a3}$, and $T_{pm}$ (see
Section~\ref{sec:typing}) are concatenated to tokens using a
separating slash. Sequence of tokens creating multi-tokens are
parenthesized.
% example preliminaries
User-defined token types are $ABBR$ (abbreviation) and $INST$
(institution). The mode describes whether single-token typing is
enabled ($S$), whether multi-token typing is enabled ($M$), and
whether known abbreviations are replaced ($R$). A known abbreviation
in the example is \texttt{aka.}, standing for \texttt{also known
as}. \texttt{Red Cross} is a known institution. \texttt{RK} is an
unknown abbreviation standing for `Red Cross'.
%
\begin{figure}[ht]
\centering
\includegraphics[width=1.0\textwidth]{05_text_analysis/figures/JavaTok_Outputs}\\
\caption{Sample JavaTok outputs}
\label{fig:tokenizer_output}
\end{figure}
\subsubsection{Tagging Optimization by JavaTok}
% introduction
Figure~\ref{fig:tagger_tokenizer_output} shows the effect of
Extended Tokenization on state-of-the-art tagging outputs. For
evaluation purposes three freely available taggers are used:
\emph{QTag}~\cite{url_tagger_qtag} developed at the University of
Birmingham, \emph{POStaggerME}~\cite{url_tagger_openNLP} available
as part of the OpenNLP package provided by SourceForge, and
\emph{Stanford POS Tagger}~\cite{url_tagger_stanford} implemented by
the Stanford Natural Language Processing Group.
%
\begin{figure}[ht]
\centering
\includegraphics[width=0.9\textwidth]{05_text_analysis/figures/JavaTok_Tagger_Outputs}\\
\caption{Tagging improvements through Extended Tokenization}
\label{fig:tagger_tokenizer_output}
\end{figure}
% description of the examples
The initial input sentence at the top of the figure comprises some
of the discussed delimitation problems that motivate token typing
and multi-token concepts. Cell (1) contains QTag outputs, cell (2)
POStaggerME outputs, and cell (3) Stanford POS Tagger outputs. Lines
1a), 2a) and 3a) refer to standard tagging outputs after rudimentary
tokenization, whereas lines 1b), 2b) and 3b) show the improved
tagger outputs including preceded Extended Tokenization done by
JavaTok. In the figure, underlines mark the tokens identified which
served as input for the tagger.
%
Tags assigned are located below the corresponding tokens, including
$NN$ (noun), $VB$ (base verb), $VBG$ (infinitive verb), $VBZ$
(present tense verb, $3^{rd}$ person singlular), $DOZ$ (does),
$XNOT$ (negative marker), $JJ$ (adjective), $IN$ (preposition), $RB$
(adverb), $CD$ (number), $SYM$ (symbol), $LS$ (single letter), $FW$
(foreign word), $"$ (quotation mark), $:$ (punctuation marks), and
$.$ (period).
%
Tags written in bold are changed through JavaTok-specific groupings
and/or splittings, thus resulting in optimized tagging inputs.
Tagging improvement is achieved in two respects:
%
\begin{compactitem}
\item Direct changes of tags through empirically motivated and more adequate input units
\item Indirect changes of tags through changes of linguistic contexts
\end{compactitem}
%
%
% sequel
Operating on the output of tokenization, tagging, the subsequent
step of natural language processing, is described in the next
section.
%----------------------------------------------------
%----------------------------------------------------
%----------------------------------------------------
\section{Tagging}
% introduction
Tagging is defined as the process of assigning syntactic categories
in the form of \abbrev{Part-Of-Speech}{POS} tags to words~\cite[pp.
341]{manning02nlp}. Compared to full syntactic parsing, the
advantages of tagging are its stability and good computational
performance. This is the reason for applying tagging before
higher-level processing tasks (e.g., named entity recognition, chunk
parsing)~\cite{brill00postag}.
In most cases multiple syntactic categories can be assigned to a
single word. Hence, the goal of tagging can be reformulated as:
determine the correct (i.e., the most likely) category of a word in
a given context. In other words, tagging is the disambiguation of
part-of-speech~\cite{abney96partspeech}. In this process the set of
all tags -- the tagset -- has a major impact on the performance of a
tagger. A balance has to be found between the granularity of word
categories and the accuracy of the tagger. In languages with rich
morphology (e.g., German) this is extremely difficult.
%
Two well known tagsets are those used in the Penn Treebank
project\footnote{http://www.cis.upenn.edu/$\sim$treebank
(19.03.2008)} (for English), and the \abbrev{Stuttgart-T{\"u}bingen
TagSet}{STTS}~\cite{schiller95stts} (for German).
% 2 types of taggers
Two groups of taggers~\cite[pp. 224]{carstensen04cl} can be
distinguished:
%
\begin{description}
%
\item[Stochastic taggers]
identify word categories using the probability of word category in a
given context of other words and categories. Most of these taggers
rely on hidden markov models (using the Viterbi algorithm~\cite[pp.
202--204]{allen95understanding}), decision trees, maximum entropy
approaches, or support vector machines. Examples are the
TreeTagger~\cite{schmid94probabilistic,schmid95improvements},
QTag~\cite{tufis98tagging}, and TnT Tagger~\cite{brants00tnt}.
%
\item[Rule-based taggers] apply rules learned from a corpus to
assign initial categories to the words. In a second step
transformations are iteratively applied to change those initial tags
to achieve better accuracy. In contrast to stochastic taggers the
knowledge is encoded in a human-readable form. Thus, their
performance can be increased by adding or adapting rules in the rule
base. A well known example is the Brill
tagger~\cite{brill92simple,brill94some}.
%
\end{description}
%
Good overviews about the approaches are given by
Abney~\cite{abney96partspeech}, and Glass and
Bangay~\cite{glass05evaluating}. Comparisons of tagging results are
provided
in~\cite{glass05evaluating,ratnaparkhi96maximum,halteren98improving}.
% ways to do an assignment
In order to achieve proper tag assignments three different
information sources are suggested~\cite{brill94some}:
%
\begin{enumerate}
%
\item \emph{Lexical lookup} is used to assign each term its possible
tags according to the entries in a dictionary (e.g., constructed
from a pre-tagged corpus). Thus, it is possible that a single word
is assigned more than one category per se (e.g., \texttt{run} can be
a verb or noun). Surprisingly, the accuracy of simply assigning each
word its most likely tag (due to statistically ranked word tags in
the dictionary) achieves up to 90\%~\cite[pp. 344]{manning02nlp}.
%
\item \emph{Morphologic analysis} is capable of deciding a
words' category. For instance, in English texts words ended by
\texttt{-ion} (nouns), \texttt{-able} (adjectives), or \texttt{-ly}
(adverbs) are easy to identify. However, there exist several
exceptions to these rules (just think about \texttt{July} or
\texttt{cable}). Weischedel et al.~\cite{weischedel93ambiguity} and
Samuelsson~\cite{samuelsson93morphological} summarize interesting
experiments dedicated to the prediction of unknown word categories.
%
\item \emph{Context rules} are applied to identify tags depending on their
contexts. In most cases only preceding tokens are taken into
account. The complexity of generating such rules increases
exponentially with their window size (number of tokens they
include). This is the main reason why most taggers consider
$n$-grams: unigrams (one word context), bigrams (two words context),
and trigrams (three words context). An example rule looks like
$DET~+~?~+~N \rightarrow ADJ$, which alters an unknown tag $?$
embraced by a determiner $DET$ and a noun $N$ to an adjective $ADJ$.
%
\end{enumerate}
%
%
% example
Table~\ref{tab:tagging_process_example} depicts the general
processing steps of assigning tags to words. The first step assigns
each word its most likely tag according to a dictionary. Thus, the
words \texttt{the} and \texttt{an} are determiners \texttt{DET},
\texttt{lady} is a noun \texttt{NN}, \texttt{is} is an auxiliary
verb \texttt{AUX}, and the period defines the end of a sentence
\texttt{\$}. In step 2 morphological rules are applied on
unidentified words not included in the dictionary. The suffixes
\texttt{-ing} for verbs \texttt{VB} and \texttt{-ive} for adjectives
\texttt{ADJ} are taken for granted. Finally, step 3 applies context
dependent transformations on tags. Therefore, two rules
$DET~+~?~+~NN \rightarrow ADJ$ and $DET~+~ADJ~+~? \rightarrow NN$
are applied.
%
\begin{table}[ht]
\centering
\sffamily \small
\rowcolors{1}{tablerowcolorodd}{tablerowcoloreven}
\caption{The process of assigning part-of-speech tags}
\label{tab:tagging_process_example}
\begin{tabular}{Bccccccccc}
% mode part.groups pap. collection tracks
Original: & The & nice & lady & is & buying & an & expensive & car & . \tabularnewline%
Step 1: & DET & ? & NN & AUX & ? & DET & ? & ? & \$ \tabularnewline%
Step 2: & DET & ? & NN & AUX & VB & DET & ADJ & ? & \$ \tabularnewline%
Step 3: & DET & ADJ & NN & AUX & VB & DET & ADJ & NN & \$ \tabularnewline%
\end{tabular}
\end{table}
% complexity
However, the assignment of tags to words in real-world texts is not
as simple as the example in Table~\ref{tab:tagging_process_example}
suggests. There are only very few situations where assignments are
either unambiguous or can be restricted to a single tag.
Furthermore, different languages require different efforts based on
their morphological richness and syntactical freedom in positioning
words and phrases. For instance, morphology and grammar in languages
such as German and French are more complex than they are in English.
% this work
In this work tagging is exploited for the purpose of term selection.
Therefore, only terms assigned to a certain subset of part-of-speech
tags (i.e., nouns and verbs) are selected to create the index. All
other terms (i.e., determiners, interjections, adjectives, adverbs)
are simply neglected. This selection process reduces the amount of
index terms and speeds up retrieval. However, inaccurate tag
assignments may still lead to unwanted terms in the index. This
unwanted material is further filtered using stopword lists described
in the next section.
%----------------------------------------------------
%----------------------------------------------------
%----------------------------------------------------
\section{Stopword Filtering}
\label{sec:stopwords}
% introduction to stopword filtering
Nearly all researchers working on natural language texts rely on a
properly selected set of words and other items (e.g., alphanumeric
combinations, numbers) to filter `meaningless' and unwanted
material. Such low level filtering lists are known as stopword
lists, stoplists, or negative dictionaries. But although most
retrieval systems use stoplists, there is astonishingly few
literature about how to systematically generate stopword lists
systematically~\cite{fox90stoplist,fox92lexical}.
% related work
Most authors think of stopwords as words of little intrinsic meaning
that occur too frequently to be useful when searching texts.
Typically, these words are short (only a few characters), implement
only a grammatical function, and don't add to the meaning of the
sentence~\cite[pp. 533--534]{manning02nlp}. Mainly these terms are
articles, conjunctions, auxiliary verbs, or prepositions~\cite[pp.
482,491]{carstensen04cl}. Some common examples of English stopwords
are listed in Table~\ref{tab:stopwords_manning}.
%
%
\begin{table}[ht]
\centering
\sffamily \small
\rowcolors{1}{tablerowcolorodd}{tablerowcoloreven}
\caption[Example for an English stopword list]{Example for an English stopword list (57 entries)~\cite[pp. 533]{manning02nlp}}
\label{tab:stopwords_manning}
\begin{tabular}{lllllllll}
\rowcolor{tableheadcolor} \mmcc{9}{Special token formats} \tabularnewline%
%category example
a & also & an & and & as & at & be & but & by \tabularnewline%
can & could & do & for & from & go & have & he & her \tabularnewline%
here & his & how & i & if & in & into & it & its \tabularnewline%
my & of & on & or & our & say & she & that & the \tabularnewline%
their & there & therefore & they & this & these & those & through & to \tabularnewline%
until & we & what & when & where & which & while & who & with \tabularnewline%
would & you & your & & & & & &
\end{tabular}
\end{table}
% reason for filtering: dimensionality reduction
Filtering stopwords from the set of index terms is done mainly for
the reason of dimensionality reduction. Terms regarded as useless
are filtered very early and do not need to be processed (e.g.,
stemmed) any further. According to Zipf's law~\cite[pp.
533--534]{manning02nlp}, this can reduce the indexing vocabulary
drastically. Thus, index construction, time, and storage costs are
minimized~\cite[pp. 141]{grossman04ir}. Baeza-Yates~\cite[pp.
167--168]{baeza-yates99ir} and Grossman~\cite[pp. 141]{grossman04ir}
assume that up to 40\% of all document terms can be regarded as
stopwords.
% also nouns and verbs
Chu~\cite[pp. 8--9]{chu05irr} extended this definition of stopwords,
including other syntactic categories such as nouns and verbs. Thus,
he added even infrequent, noninformative (e.g., \texttt{report},
\texttt{abstract}, \texttt{summary}), and domain and corpus specific
terms (e.g., \texttt{computer}, \texttt{information retrieval}),
where the latter two are also called trade words~\cite[pp.
47]{chu05irr}.
% no maintenance, poor quality
Although nearly all existing IR systems apply stopword filtering,
there is still no standardized list agreed upon. As a consequence,
these systems rely either on highly specialized hand-crafted lists,
or on unreviewed lists downloaded from the internet. The drawback in
both cases is that most of these lists are applied as given and
remain unchanged.
% many available lists
Many stopword lists can be found in the web for free. Unfortunately
very few attention is put on their linguistic quality, especially if
used in restricted domains and application areas. Hence stoplists
are used in nearly all information retrieval applications for early
feature reduction, the retrieval quality highly depends on the
content (and structure) of the lists. Surprisingly, the impact of
the linguistic quality of those lists on the retrieval performance
is often underestimated.
%
%
% problems if stopwords are enabled
Anyway, ignoring the effects of stopword filtering during the
indexing and searching process can lead to the following problems:
%
\begin{itemize}
\item Without stopword filtering and a query that requires all terms to match,
documents may be
excluded because one of the stopwords did not appear in the document. Even though the
stopword itself is probably irrelevant to what the user is searching for,
the document may have actually been a very good result. For
instance, a query \texttt{cat on a tree} does not match a document stating
\texttt{cat on the tree}.
\item If a search does not require all terms to match and stopwords are not filtered,
any document
containing at least one of the stopwords would be included in the relevant results.
Since stopwords are very common, nearly all documents in
the engine's index might be retrieved. As an example, the query
\texttt{cat on a tree} matches all documents containing the words
\texttt{a} or \texttt{the} (which can be found in nearly all documents).
\item On the other hand rudimentary filtering of stopwords prohibits precise
searches for query terms like \texttt{vitamin c} or \texttt{The Round Table}.
As for the first example, all texts stating the term \texttt{vitamin}
are returned, leading to a high number of irrelevant results.
Generally, words that have different meanings in different contexts,
one in which it is a stopword and an other one in which it is not, pose another problem:
Confusing the noun \texttt{can} with the auxiliary verb \texttt{can} may result in serious troubles.
Even more obvious, phrases consisting entirely of
stopwords are hard to handle, for an example consider \texttt{to be or not
to be}.
\end{itemize}
%
For those reasons some web engines abstain completely from stopword
filtering in order to maintain high recall and to avoid results that
users are unable to interpret~\cite[pp. 167--168]{baeza-yates99ir}.
% transition
%This work proposes a more elaborated stopword model that better
%supports information retrieval and document mining tasks. Therefore,
%the multi-layered stopword model is introduced in the next section.
\subsection{The Multi-Layered Stopword Model}
% personal statement, contribution
This work proposes that the stopword class is much more heterogenous
than suggested in the literature. It is composed of three different
layers, as depicted in Figure~\ref{fig:stoplist_levels}. Each layer
is composed of certain groups containing terms sharing syntactic
(expressed through word categories in the figure), semantic (related
to common concepts), or domain-specific (related to concepts in a
restricted domain) attributes.
%
\begin{figure}[ht]
\centering
\includegraphics[width=0.6\textwidth]{05_text_analysis/figures/stopword_layers}\\
\caption{Multi-layered stopword model I}
\label{fig:stoplist_levels}
\end{figure}
Linguistically argued, stopwords can be distinguished into three
layered subsets (see Figures~\ref{fig:stoplist_levels} and
\ref{fig:stoplist_levels_2}):
%
\begin{description}
%
\item[Functional stopwords] define terms which are regarded
as traditional stopwords. They belong to the class of function
words (or grammatical words) which have only little
lexical meaning. Normally these terms decode syntactic and morphosyntactic
features (e.g., values of parameters like case, numerus, genus, tempus, modus).
Generally, functional stopwords do not transport any semantics and are
not bound to specific domains. Typical representatives are determiners,
auxiliaries, prepositions, pronouns,
particles, conjunctions, logical operators, quantors, etc.
%
\item[Content-related stopwords] are mainly adjectives and
adverbs which decode very general semantic concepts.
These stopwords do not contribute to the semantics of a text and
are domain-independent.
For the same reason a set of verbs (e.g., \texttt{able}, \texttt{based}, \texttt{consider})
and nouns (e.g., \texttt{abstract}, \texttt{degree}, \texttt{goal}) also belong to this category.
%
\item[Domain-specific stopwords] are words which refer to domain-specific
entities and concepts. Normally these terms are part of the presupposed
knowledge of a domain insider. Preferred candidates for this group
are nouns and named entities (e.g., acronyms), verbs, adjectives, and adverbs.
For domain experts, these words are taken for granted and do not contain
semantically important information. For instance, terms like \texttt{computer} or
\texttt{data} might not be good candidates for a common stoplist. However, both terms
may be perfect stopwords when searching a huge archive about computer related topics.
%
\end{description}
\begin{figure}[ht]
\centering
\includegraphics[width=0.6\textwidth]{05_text_analysis/figures/stopword_layers_2}\\
\caption{Multi-layered stopword model II}
\label{fig:stoplist_levels_2}
\end{figure}
% three levels, process next
Conducting a three level stopword filtering would allow to focus on
certain information retrieval and document mining tasks while
improving the system's performance regarding runtime and index
space. In the context of this work stopword filtering is applied in
two ways:
%
\begin{enumerate}
%
\item Stopwords are filtered from the set of index terms, utilizing
stopwords in a traditional manner. This filtering step includes
stopwords of all three levels.
%
\item In document mining tasks, valuable resources aiding information
retrieval are extracted from a corpus automatically.
However, the amount of raw data extracted is inapplicable to be
applied directly and contains loads of irrelevant information.
%
Therefore, stopword filtering is applied to exclude unwanted patterns
(e.g., phrases started or ended by stopwords) from further analysis.
%
Chapter~\ref{chapter:advanced_text_analysis} provides
methods that generate appropriate dictionaries by applying stopword
filtering. These dictionaries include composite nouns, named entities,
formulaic speech, and full forms of acronyms.
%
\end{enumerate}
% transition
From this point of view stopwords have to fulfill both statistic and
linguistic criteria. In the sequel the process of creating and
incrementally extending the list of stopwords based on statistic and
heuristic methods is described.
\subsection{The Stopword Extraction Process}
\label{sec:stoplist_extraction}
% introduction
In order to generate a stoplist, this work relies on the corpus
described in Section~\ref{sdr:sec:corpus}. The result of this
process is a ranked list of terms that are promising candidates. It
serves as input for a linguistically-based categorization and for
heuristically motivated layering of useful stopwords. The process
relies on the tokenization output only, thus no part-of-speech
tagging or any other linguistic processing is required.
% process description
By going through the corpus in a single pass, the output of the
Extended Tokenizer is used to gather corpus based term statistics.
All single-tokens that occur in the texts as `proper' words --
defined as a sequence of small letters, optionally started by a
single capital letter -- are regarded as potential candidates. Hence
filter terms are considered as context free, no left or right
neighborhood is taken into account. The token selection process
itself is solely based on the token-types assigned by the Extended
Tokenizer. Other token-types like numbers, acronyms, or special
formats are ignored.
% remark on the granularity
For each term $i$ the term frequency $n_i$ (the number of
occurrences of term $i$ in the collection) and the document
frequency $m_i$ (the number of documents stating term $i$) are
computed. As this work utilizes structured documents, it is
important to note that both frequencies are computed in a certain
context $c$, resulting in $n_{i,c}$ and $m_{i,c}$. Thus, the
approach can be applied to generate term statistics based on
structural conditions of chapters (\texttt{/DOC/SEC}), on any kind
of sections (\texttt{//SEC}), or on paragraphs (\texttt{//FRA} of
type \texttt{text}). The experiments described in this chapter are
conducted using the \texttt{/DOC} context.
% tf and idf
Based on the $n_{i,c}$ and $m_{i,c}$ values, the inverse term
frequency $itf_{i,c}$ and the inverse document frequency $idf_{i,c}$
are computed for each term, applying the formulae
%
\begin{eqnarray}
itf_{i,c} & = & \log\frac{N_c}{n_{i,c}} \\
idf_{i,c} & = & \log\frac{M_c}{m_{i,c}}
\end{eqnarray}
%
where $N$ (resp. $M$) is the total number of terms (resp. elements,
i.e., documents, chapters, sections, paragraphs) in the whole
collection.
%
% explanation
The inverse term frequency $itf_{i,c}$ is used to express the
distribution of term $i$ throughout all terms. The higher the value,
the more discriminant is the term $i$ itself.
%
Accordingly, the inverse element frequency $idf_{i,c}$ is computed,
expressing the distribution of term $i$ through the collection. The
higher the value, the rarer the term occurs in the collection.
% parameter settings, experiment
In order to extract filter terms two parameters have to be
specified:
%
\begin{description}
%
\item[Ranking criteria] First, a ranking strategy according to
the `quality' of stopwords has to be specified. Statistically argued
stopwords can be defined as extraordinary frequent
words~\cite{fox90stoplist}, thus terms having a high $n_{i,c}$
(resp. low $itf_{i,c}$) value. Alternatively, stopwords can be
described as equally distributed throughout all texts, thus having a
high $m_{i,c}$ (resp. low $idf_{i,c}$) value. In order to express
the adequacy of a term $i$ to become a stopword, both values
$itf_{i,c}$ and $idf_{i,c}$ are combined in two different ranking
formulae $q_1$ and $q_2$:
%
\begin{eqnarray}
q_1(t_i) & = & itf_{i,c} \cdot idf_{i,c} \\
q_2(t_i) & = & \frac{itf_{i,c}} {idf_{i,c}}
\end{eqnarray}
%
%
\item[Cutoff point] Second, a certain cutoff point for term selection must
be chosen. From the literature~\cite[pp. 167]{baeza-yates99ir}, the
size of a general stopword list can be estimated to be up to 500
words. Using the Brown corpus ($\sim 1$ million words),
Fox~\cite{fox90stoplist} chose the cutoff at the minimum term
frequency of 300, which was based on an empirical decision. Used for
the experiments in the current work, the top 500 ranked terms are
classified as stopwords.
%
\end{description}
% result
The extracted stopwords are sorted according to the quality measures
$q_1$ and $q_2$ in decreasing order. Good stopwords are reflected by
high $q_1(t_i)$ and $q_2(t_i)$ values. This procedure results in the
list given in Figure~\ref{fig:stoplist_ranking_itf_div_idf} (first
31 words ranked by $q_2$ in descending order). Note that $itf$ and
$idf$ use different scalings: With respect to the INEX corpus, $itf$
lies between 6,5 and 14,0 because even the most frequent term
\texttt{the} does not occur in half of the cases (where $ift_{i,c}$
would be $1,0$). In contrast, $idf$ ranges from 0,01 to 27,4 because
many terms occur in more than half of the documents (where
$idf_{i,c} < 1,0$). This fact explains the high impact of $idf$ in
both, $q_1$ and $q_2$. For the data shown in
Figure~\ref{fig:stoplist_ranking_itf_div_idf}, $N_c = 178.677.541$
denotes the total number of terms, and $M_c = 16.819$ is the total
number of documents in the collection.
%
%\begin{figure}
% \centering
% \includegraphics[width=0.8\textwidth]{05_text_analysis/figures/stopword_ranking_tf}\\
% \caption{Top ranked INEX terms using $tf$}
% \label{fig:stoplist_ranking_tf}
%\end{figure}
%
\begin{figure}[ht]
\centering
\includegraphics[width=1.0\textwidth]{05_text_analysis/figures/stopword_ranking_itf_div_idf}\\
\caption{Top ranked INEX stopwords using $q_2$}
\label{fig:stoplist_ranking_itf_div_idf}
\end{figure}
% further interpretation
As the list in Figure~\ref{fig:stoplist_ranking_itf_div_idf} is case
sensitive, the 500 selected entries are reduced to 445 unique terms
(e.g., see \texttt{the} and \texttt{The} at ranks 6 and 8). These
445 words cover 22,55\% of all words in the INEX document
collection, where the minimal term frequency $n_{i,c}$ is $8271$ and
the minimal document frequency $M_{i,c}$ is $5342$.
%\FloatBarrier
% ranking comparisons, reference list
Figure~\ref{fig:stoplist_ranking_criteria} overviews the top 500
results applying different ranking strategies.
%
% combine the two approaches
Ranking strategies combined term frequency $tf$ and document
frequency $df$ (resp. $itf$ and $idf$), and their performance is
compared to each other.
%
The first row headed by $\#~SW$ counts the number of relevant
stopwords (out of the 500 extracted) compared to the final reference
list. Green colored cells explicitly mark the words contained in the
final list.
%
\begin{figure}[ht]
\centering
\includegraphics[width=0.8\textwidth]{05_text_analysis/figures/stopword_ranking_criteria}
\caption{Comparison of different ranking criteria}
\label{fig:stoplist_ranking_criteria}
\end{figure}
%
% comparisons, only tf and df
In their work Grossman and Frieder stated that \enquote{using term
frequency is a means of implementing a dynamic stop word
list}~\cite[pp. 197]{grossman04ir}. Indeed, at the first glimpse
term frequency $tf$ based ranking looks as a good indicator.
% tf ranking
But the results of the experiments show that this method generates
many unwanted stopwords already among the top 100 ranks. Already at
ranks 29, 31, and 34 frequent nouns such as \texttt{data},
\texttt{system}, and \texttt{time} occurred (see
Figure~\ref{fig:stoplist_ranking_criteria}). All in all, 193 of the
final stopwords are identified.
%
% df ranking
The (inverse) document frequency $df,~idf$ based ranking performed
clearly better, including unwanted results at lower ranks 49
(\texttt{time}), 64 (\texttt{example}) and 65 (\texttt{work}). The
500 terms extracted include 251 final stopwords.
%
% tf/df, df/tf
The worst performance with only 67 accepted stopwords is achieved
with fractions of raw term and document $\frac{tf}{df}$ and
$\frac{df}{tf}$.
%
% tf * df
Ranking based on $tf \cdot df$ leads to better results containing
219 final stopwords.
%
% idf and itf
Best results are achieved using a combination of $itf$ and $idf$.
The explanation for this behavior is that the high term frequency
$tf$ influences the ranking much more than the lower element
frequency $df$, which is smoothed by the logarithm.
%
Surprisingly, $q_1 = itf \cdot idf$ extracting 245 stopwords
performs slightly worse than $q_2 = \frac{itf}{idf}$ extracting 262
stopwords. However, there is no big difference in the results
compared to simple element frequency $df$ (resp. $idf$) ranking,
which achieved 251 of the final stopwords.
% summary
To summarize the results of the experiments, rankings including the
(inverse) element frequency obviously outperform the other
strategies, where the best results are tight-fitting, relying either
on the raw element frequency $df$, $q_1(t_i)$, or $q_2(t_i)$.
% comparison of the top 2 lists
The top two lists acquired by using the plain element frequency $df$
ranking and $q_2$ ranking only differ with respect to 20 terms. The
$df$ list exclusively contains the terms \texttt{state},
\texttt{test}, \texttt{code}, \texttt{communication},
\texttt{points}, \texttt{respectively}, \texttt{product},
\texttt{operation}, \texttt{parallel}, \texttt{task}, \texttt{line},
\texttt{memory}, \texttt{quality}, \texttt{you},
\texttt{parameters}, \texttt{source}, \texttt{power},
\texttt{Table}, \texttt{behavior}, and \texttt{elements}.
%
The $q_2$-ranked list exclusively contains \texttt{especially},
\texttt{advantage}, \texttt{providing}, \texttt{once},
\texttt{works}, \texttt{yet}, \texttt{details}, \texttt{little},
\texttt{involved}, \texttt{developing}, \texttt{likely},
\texttt{smaller}, \texttt{back}, \texttt{increasing},
\texttt{across}, \texttt{recent}, \texttt{depends}, and
\texttt{ability}.
%
% reason why q2 is better
From a linguistic point of view, the $q_2$-ranked list seems
slightly better because it includes more terms fitting in the
functional and content-related stopword layers. In contrast, the
other strategy favors nouns and verbs. However, differences occur
only at ranks lower than 337. Ranks of the first hundred entries are
nearly identical, differing only in up to five positions for each
stopword.
% now manual steps
According to the multi-layered stopword model extracted stopwords
are (sub)classified in the previously described layers.
\subsection{Identification of Functional Stopwords}
Starting with the $q_2$-ranked stopword candidates generated, a list
of functional stopwords is created. This list covers most of the
words included in traditional stopword lists. A classification
system is applied for a systematic generation of an extended
functional stopword list. The classification is based on the widely
accepted schema of linguistic word categories:
%
\begin{description}
%
\item[Determiners (DET)] are noun specifiers that express definiteness
or indefiniteness of a noun phrase (e.g., \texttt{a}, \texttt{the}).
In many languages, for instance German, it additionally encodes
morphosyntactic information like casus, genus, and numerus (e.g.,
\texttt{dieser}, \texttt{eine}, \texttt{das}).
%
\item[Auxiliary verbs (AUX)] are verbs that accompany the head verb
of the sentence, expressing grammatical distinctions like person,
number, tempus, aspect (e.g., \texttt{can}, \texttt{may},
\texttt{do}, \texttt{be}, \texttt{have}).
%
\item[Prepositions (PREP)] are non-inflected elements that govern
the case of their nominal complements (e.g., \texttt{in},
\texttt{on}, \texttt{upon}, \texttt{by}).
%
\item[Pronouns (PRON)] are pro-forms that substitute noun phrases. They
decode values of parameters like person, genus, numerus. Common
types are personal, possessive, and interrogative pronouns (e.g.,
\texttt{he}, \texttt{her}, \texttt{who}).
%
\item[Particles (PART)] are non-inflectional words that do not have any
governing function. They belong to very heterogenous subclasses like
verb particles (e.g., \texttt{[stand] up}, \texttt{[bring] down}),
model particles (e.g., \texttt{quite}, \texttt{very}), and
infinitive particles (e.g., \texttt{to}).
%
\item[Connectors (CONN)] are coordinating or subordinating conjunctions that
establish semantic relations between parts of sentences or whole
sentences (e.g., \texttt{further}, \texttt{hence}, \texttt{since}).
%
\item[Logical operators (LOG\_OP)] establish logical connections or interpretations
concerning elements in their scope (e.g., \texttt{and}, \texttt{or},
\texttt{not}).
%
\item[Quantifiers (Q)] express a definite or indefinite number of entities.
Normally, three groups can be distinguished: fuzzy (e.g.,
\texttt{some}, \texttt{many}), numeral (e.g., \texttt{two},
\texttt{million}), and cardinal (e.g., \texttt{second},
\texttt{fifth}) quantors.
%
\end{description}
% still some categories left
Further admissible candidates for functional stopword lists are
interjections (e.g., \texttt{oh}, \texttt{ah}, \texttt{hey},
\texttt{man}, \texttt{uhhh}), negatives (e.g., \texttt{no},
\texttt{not}), politeness markers (e.g., \texttt{please},
\texttt{thank you}), greetings (e.g., \texttt{hello},
\texttt{goodbye}), and the existential \texttt{there}. However,
these category labels are not used for categorization in this work.
%
% tagging similarities
One might note that the categories strongly correlate with
part-of-speech tags used by well known tagging systems.
% the final list
The list in Table~\ref{tab:stoplist_function_terms} contains 140
functional stopwords that cover 36,3\% (34.299.626) of all terms in
the whole INEX collection.
%
\begin{table}[ht]
\centering
\sffamily \footnotesize
\rowcolors{1}{tablerowcolorodd}{tablerowcoloreven}
\caption{List of functional stopwords}
\label{tab:stoplist_function_terms}
\begin{tabular}{lL{13cm}}
\rowcolor{tableheadcolor} \color{white} Category & \color{white} Terms \tabularnewline%
%category example
DET (10) & \texttt{a, an, here, some, that, the, there, these, this, those} \tabularnewline%
AUX (26) & \texttt{are, be, become, becomes, been, being, can, cannot, could, do, does, done, had, has, have, having, is, make, may, might, must, should, was, were, will, would} \tabularnewline%
PREP (29) & \texttt{about, above, across, after, against, along, among, amongst, around, aside, at, before, beforehand, behind, below, beside, between, beyond, by, down, for, from, in, into, near, of, on, out, outside, per, since, through, thru, to, toward, towards, under, until, unto, up, upon, via, with, within, without, off} \tabularnewline%
PRON (22) & \texttt{another, anybody, anyhow, anyone, anything, anyway, anywhere, during, elsewhere, everybody, everyone, everything, everywhere, he, her, hers, herself, him, himself, his, how, i, it, its, itself, me, my, myself, nobody, none, noone, nowhere, onto, our, ours, ourselves, she, somebody, somehow, someone, something, sometime, somewhat, somewhere, such, that, their, theirs, them, themselves, they, us, we, what, when, whence, where, which, whither, who, whoever, whom, whose, why, you, your, yours, yourself, yourselves, whomever} \tabularnewline%
PART (12) & \texttt{almost, as, down, even, just, no, out, over, quite, rather, so, to, too, up, very, yes, off} \tabularnewline%
CONN (20) & \texttt{after, although, and, because, before, but, further, furthermore, hence, howbeit, if, insofar, instead, like, neither, nor, not, or, since, than, then, thence, therefore, though, thus, unless, until, whenever, whereafter, whereas, whereby, wherein, whereupon, wherever, whether, while, nonetheless} \tabularnewline%
LOG\_OP (3) & \texttt{and, not, or} \tabularnewline%
Q (18) & \texttt{all, both, each, every, few, first, five, four, many, much, often, one, second, some, third, three, two, various}
\end{tabular}
\end{table}
%
Figure~\ref{fig:stoplist_function_words} shows the number of
documents terms according to the functional stopword categories.
%
\begin{figure}[ht]
\centering
\includegraphics[width=0.6\textwidth]{05_text_analysis/figures/stopword_diag_functional_stopwords}\\
\caption{Frequency distribution of linguistic categories in functional stopwords}
\label{fig:stoplist_function_words}
\end{figure}
% why classification
The subclassification of functional stopwords supports further
linguistic tasks such as composite noun identification and acronym
resolution (see next chapter). Further selection of certain subsets
of functional stopwords strongly depends on the focus of the
application. For instance, the task of comparing queries and
metadata information (e.g., section titles, table captions, authors)
may require only specific functional stopword categories such as
DET, PREP, CONN, and LOG\_OP.
\subsection{Identification of Content-Related Stopwords}
% introduction
In addition to functional stopwords many other terms can be regarded
as stopword candidates. For instance, commonly occurring nouns,
verbs, adverbs, or adjectives are also very frequent and equally
distributed throughout document collections.
% content-related stopwords
Terms that can be taken for granted throughout all domains (common
knowledge) are regarded as content-related stopwords.
Characteristically, these terms occur quite frequently and are
equally distributed in documents of any document collection. Due to
the generic usage, such terms do generally not add valuable
information to standard texts (e.g., \texttt{use},
\texttt{accordingly}, \texttt{certain}, \texttt{problem}).
For instance, consider three different texts: one about
\texttt{social and cultural effects}, another one about
\texttt{effects of protuberances on the climate}, and a third one
about \texttt{effects of a new indexing method}. All texts deal
about certain effects. However, only few users are interested in
information about general effects of any kind. Thus, the term
\texttt{effect} can be considered as a content-related stopword.
%
% classification
Content-related stopwords are classified according to the following
categories:
%
\begin{description}
%
\item[Nouns (N)]
are words which refer to entities or groups of entities (persons,
places, names, concepts, etc.). In many cases they bear the most
relevant information of sentences, populating the verb argument
structure. For the purpose of extending the stopword list those
nouns which carry hardly any information without considering their
context are picked. Thus, they refer to proper and common nouns that
can be regarded as common knowledge (e.g., \texttt{abstract},
\texttt{field}, \texttt{example}, \texttt{conclusion}).
%
\item[Verbs (V)]
typically encode events, actions, or processes. In a syntactical
sense they constitute predicates in clauses, establishing a verb
argument structure. In many theories they function as heads of
sentences~\cite{guenther99morphosyntax}. For the sake of this purpose verbs
that do not transport much content (e.g., \texttt{use},
\texttt{apply}, \texttt{seem}, \texttt{go}) are selected.
%
\item[Adjectives (ADJ)]
are words which normally modify nouns by specifying properties or
other qualities. In a syntactical sense they function as attributes
or parts of predicates (e.g., \texttt{low}, \texttt{high},
\texttt{normal}, \texttt{fast}).
%
\item[Adverbs (ADV)]
are words which modify verbs or adjectives. Examples of stopwords in
this category are \texttt{quickly}, \texttt{accordingly},
\texttt{efficiently}, etc.
%
\end{description}
% manually assignment
By manually assigning the remaining $q_2$-ranked terms (excluding
functional stopwords) to these categories, a list of 271
content-related stopwords is derived (see
Table~\ref{tab:stoplist_content_terms}). The same word with
different meanings is assigned to all possible categories (i.e.,
\texttt{run} [N,V], \texttt{study} [N,V], \texttt{potential}
[N,ADJ], \texttt{focus} [N,V]).
%
\begin{table}[ht]
\centering
\sffamily \footnotesize
\rowcolors{1}{tablerowcolorodd}{tablerowcoloreven}
\caption{List of content-related stopwords}
\label{tab:stoplist_content_terms}
\begin{tabular}{lL{13cm}}
\rowcolor{tableheadcolor} \color{white} Category & \color{white} Terms \tabularnewline%
%category example
ADV (35) & \texttt{along, already, also, always, any, back, currently, directly, due, easily, either, especially, even, far, finally, first, however, less, likely, more, most, much, now, often, once, only, over, rather, simply, still, together, usually, very, well, yet} \tabularnewline%
ADJ (61) & \texttt{additional, appropriate, available, basic, best, better, certain, common, complete, current, different, difficult, due, easy, enough, entire, full, general, good, high, higher, important, independent, initial, large, larger, last, later, least, little, long, low, lower, main, major, necessary, new, next, original, other, own, particular, possible, potential, previous, real, recent, right, same, several, significant, similar, simple, single, small, smaller, special, specific, standard, total, useful} \tabularnewline%
N (74) & \texttt{ability, abstract, addition, address, advantage, amount, area, areas, basis, case, cases, change, changes, cost, curricula, degree, details, difference, effect, end, example, fact, field, focus, form, future, goal, group, hand, help, individual, interest, interests, issue, issues, key, level, means, need, needs, order, others, paper, part, place, point, potential, problem, problems, project, range, result, results, section, set, sets, size, solution, space, step, study, support, terms, time, times, type, types, use, view, vitae, way, ways, work, years} \tabularnewline%
V (100) & \texttt{able, according, achieve, address, allow, allows, applied, apply, associated, assume, based, called, change, compared, consider, considered, consists, contains, corresponding, create, depends, describe, described, determine, effect, end, existing, find, focus, following, follows, form, found, get, give, given, gives, group, hand, help, improve, include, includes, including, increase, increasing, involved, issue, issues, know, known, let, like, limited, made, makes, making, need, needed, needs, obtain, obtained, order, part, place, point, present, presented, produce, proposed, provide, provided, provides, providing, received, reduce, related, require, required, requires, result, resulting, see, set, show, shown, shows, step, study, support, take, takes, type, use, used, uses, using, work, working, works} \tabularnewline%
\end{tabular}
\end{table}
% result
The final list covers 8,6\% (8.025.990) of all terms in the document
collection. The union of the content-related and the functional
stopwords covers 44,9\% of all terms in the whole collection. The
distribution of content-related stopword categories is given in
Figure~\ref{fig:stoplist_content_words}.
%
\begin{figure}[ht]
\centering
\includegraphics[width=0.6\textwidth]{05_text_analysis/figures/stopword_diag_content_stopwords}\\
\caption{Frequency distribution of linguistic categories in content-related stopwords}
\label{fig:stoplist_content_words}
\end{figure}
% possible extension
One way to extract content-related stopwords automatically is to use
a number of document collections from different domains. On these
corpora, the top ranked terms are extracted as described and
functional stopwords are removed. Consequently, the intersection of
the resulting lists (one list for each corpus) defines the
domain-independent and content-related stopwords.
\subsection{Identification of Domain-Specific Stopwords}
% introduction
These stopwords are defined as terms that can be taken for granted
in a given domain. In this sense they do not add information to
texts in a given specific domain. This hypothesis is based on the
assumption that domain experts normally presuppose a finer
granulated domain vocabulary decoding their knowledge.
%
% categories
Domain-specific stopwords are classified according to the same
categories as the content-related stopwords (e.g., in the domain of
computer science):
%
\begin{description}
\item[Nouns (N)] like \texttt{computer}, \texttt{memory}, \texttt{model}, etc.
\item[Verbs (V)] like \texttt{perform}, \texttt{calculate}, \texttt{run}, etc.
\item[Adjectives (ADJ)] like \texttt{efficient}, \texttt{supervised}, etc.
\item[Adverbs (ADV)] like \texttt{automatically}, \texttt{effectively}, etc.
\end{description}
% domain
Table~\ref{tab:stoplist_domain_terms} shows the manually extracted
domain-specific stopwords. This work relies on the INEX document
collection derived from computer science literature. Hence, its
application domain is computer science and information technology.
%
\begin{table}[ht]
\centering
\sffamily \footnotesize
\rowcolors{1}{tablerowcolorodd}{tablerowcoloreven}
\caption{List of domain-specific stopwords}
\label{tab:stoplist_domain_terms}
\begin{tabular}{lL{13cm}}
\rowcolor{tableheadcolor} \color{white} Category & \color{white} Terms \tabularnewline%
%category example
ADV (0) & \texttt{} \tabularnewline%
ADJ (6) & \texttt{complex, effective, efficient, local, national, technical} \tabularnewline%
N (58) & \texttt{access, algorithm, algorithms, analysis, application, applications, approach, approaches, architecture, complexity, computer, control, data, department, design, development, environment, features, function, functions, hardware, implementation, information, input, institute, knowledge, management, member, method, methods, model, models, network, number, operations, performance, process, professor, program, requirements, research, science, software, structure, system, systems, technology, trans, university, user, users, components, technique, techniques, tools, value, values, version} \tabularnewline%
V (27) & \texttt{computing, develop, developed, developing, distributed, engineering, generate, generated, implemented, perform, performed, processing, represent, represents, sets, define, defined, designed, run, supported} \tabularnewline%
\end{tabular}
\end{table}
% results
The final domain-specific stopword list comprises 91 terms covering
4,7\% (4.379.991) of all terms in the document collection. Merged
with the functional and content-related stopwords, the complete list
(445 unique stopwords) covers about 49,6\% of all terms in the
corpus. Figure~\ref{fig:stoplist_domain_words} gives an overview of
the distribution of the domain-specific stopword categories.
%
\begin{figure}[ht]
\centering
\includegraphics[width=0.6\textwidth]{05_text_analysis/figures/stopword_diag_domain_stopwords}
\caption{Frequency distribution of linguistic categories in domain-specific stopwords}
\label{fig:stoplist_domain_words}
\end{figure}
\subsection{Extending the Stopword List}
% extension
Providing additional knowledge about the language, the stopword list
created by the steps (1) to (3) can be further extended. As an
example additional terms that are (semi-)automatically identified
through morphological rules are included.
% adverbs and verbs, experiment
In further experiments conducted, adverbs (\texttt{-ly}) and verbs
(\texttt{-ing}, \texttt{-ed}) are selected from the INEX corpus
because of their simple identification via suffix matching.
Therefore, the $q_2$-ranked terms of the corpus are investigated.
The list included 344 terms ending with \texttt{-ly} (adverb
candidates), 839 terms ending with \texttt{-ing} (verb candidates),
and 795 terms ending with \texttt{-ed} (verb candidates).
%
%
% better solution
Each of the three lists is cut after 50 entries and the terms are
manually checked for their correct word category. Only five
non-adverbs (\texttt{only}, \texttt{apply}, \texttt{early},
\texttt{July}, \texttt{Only}), two non-present-tense verbs
(\texttt{during}, \texttt{During}), and two non-past-tense verbs
(\texttt{need}, \texttt{speed}) had to be removed. The remaining
terms are assigned to the set of functional $F$, content-related
$CR$, or domain-specific $CS$ stopwords according to their category.
The added stopwords are given in
Table~\ref{tab:stoplist_extension_terms}.
%
\begin{table}[ht]
\centering
\sffamily \footnotesize
\rowcolors{1}{tablerowcolorodd}{tablerowcoloreven}
\caption{List of additional stopwords}
\label{tab:stoplist_extension_terms}
\begin{tabular}{llL{11cm}}
\rowcolor{tableheadcolor} \color{white} Suffix & \color{white} Type & \color{white} Terms \tabularnewline%
%category example
ADV \texttt{-ly} & $CR$ (42) & \texttt{actually, approximately, clearly, closely, completely, currently, directly, easily, especially, exactly, explicitly, extremely, Finally, frequently, fully, generally, highly, immediately, increasingly, independently, likely, necessarily, particularly, possibly, previously, primarily, probably, quickly, recently, relatively, respectively, significantly, Similarly, simply, simultaneously, slightly, specifically, successfully, typically, Unfortunately, usually, widely} \tabularnewline%
ADV \texttt{-ly} & $DS$ (3) & \texttt{automatically, effectively, efficiently} \tabularnewline%
\smallskip
AUX \texttt{-ing}& $F$ (2) & \texttt{having, doing} \tabularnewline%
V \texttt{-ing} & $CR$ (32) & \texttt{according, adding, allowing, applying, beginning, being, building, changing, considering, containing, corresponding, creating, depending, existing, finding, following, including, increasing, interesting, leading, making, providing, reducing, remaining, resulting, starting, taking, underlying, understanding, using, Using, working} \tabularnewline%
V \texttt{-ing} & $DS$ (14) & \texttt{computing, Computing, developing, engineering, Engineering, modeling, operating, performing, processing, Processing, programming, representing, running, testing} \tabularnewline%
\smallskip
V \texttt{-ed} & $CR$ (36) & \texttt{achieved, added, applied, associated, based, Based, called, compared, considered, created, derived, described, detailed, determined, discussed, expected, fixed, improved, included, increased, introduced, involved, limited, needed, obtained, organized, presented, proposed, provided, published, received, reduced, related, required, selected, used} \tabularnewline%
V \texttt{-ed} & $DS$ (12) & \texttt{developed, defined, designed, distributed, implemented, supported, performed, generated, represented, extended, integrated, specified} \tabularnewline%
\end{tabular}
\end{table}
% result
The resulting stoplist contains 530 unique terms and covers 51,0\%
of all words in the document collection. The inclusion of the
additional stopwords increased the coverage of all (unique) terms by
1,4\%, but increased the size of the list by nearly 20\% (+85
terms). This clearly shows the effect that including more terms in
the stoplist at a certain size only increases the number of terms
matched in the corpus marginally. The final distribution of stopword
categories is shown in Figure~\ref{fig:stoplist_extended_words},
where a preceding $F$ stands for functional, $CR$ for
content-related, and $DS$ for domain-specific stopwords.
%
\begin{figure}
\centering
\includegraphics[width=0.9\textwidth]{05_text_analysis/figures/stopword_diag_final_stopwords}\\
\caption{Frequency distribution of all linguistic stopword categories}
\label{fig:stoplist_extended_words}
\end{figure}
\subsection{Coverage of the Generated Stopword List}
% the complete result
Compared to other stopword lists the generated list seems to be well
sized. As the stopwords in this work are generated from the INEX
computer science corpus, a number of common terms such as
\texttt{did}, \texttt{once}, or \texttt{upon} occurred quite
infrequent and are not included. This problem can be handled by
applying the same procedure of stopword generation to other corpora
of other domains. By merging the results (besides the
domain-specific stopwords), a more complete stopword list could be
achieved. A final intersection of all terms might be regarded as
real-world `common' stopwords.
% evaluation
An evaluation of the generated stopword list is performed by a
coverage test comparing several other stopword lists.
%
Therefore, nine freely available stoplists are compared to it. The
stopwords presented by Fox in~\cite{fox90stoplist} (Fox), three
standard stoplists (SL1, SL2, SL3, unfortunately without sources),
the assumed Google
stopwords\footnote{http://www.ranks.nl/tools/stopwords.html
(11.03.2008)} (Google), the stopwords presented by Manning
in~\cite{manning02nlp} (Manning), the BioPD
stopwords\footnote{http://www-fog.bio.unipd.it/waishelp/stoplist.html
(11.03.2008)} from the
freeWAIS-sf\footnote{http://www.is.informatik.uni-duisburg.de/projects/freeWAIS-sf/STOPWORDS
(11.03.2008)} (BioPD), the Glasgow stopwords from The Information
Retrieval
Group\footnote{http://www.dcs.gla.ac.uk/idom/ir\_resources/linguistic\_utils/stop\_words
(11.03.2008)} (Glasgow), and the CLEF stopwords from
Smart\footnote{http://www.unine.ch/info/clef/englishST.txt
(11.03.2008)} (CLEF Smart) were selected.
% really the final list
As said before, a more complete stopword list could be achieved by
merging the stopwords generated from other corpora. Because of
limited time, this step is not conducted in this work. Instead, the
nine stopword lists mentioned are analyzed in detail and valid
stopwords are extracted. According to their linguistic category,
these stopwords are assigned to either the functional $F$ or the
content-related $C$ stopword layer. This is explained by the fact
that all stoplists are meant to be `common sense'.
%
Table~\ref{tab:stoplist_additional_external_stopwords} summarizes
the additional stopwords added from the external stopword lists.
%
\begin{table}[ht]
\centering
\sffamily \footnotesize
\rowcolors{1}{tablerowcolorodd}{tablerowcoloreven}
\caption{Additional stopwords of external stopword lists}
\label{tab:stoplist_additional_external_stopwords}
\begin{tabular}{lL{13cm}}
\rowcolor{tableheadcolor} \color{white} Category & \color{white} Terms \tabularnewline%
%category example
F\_AUX (6) & \texttt{am, became, becoming, did, ought, shall} \tabularnewline%
F\_PREP (17) & \texttt{against, amongst, aside, beforehand, behind, below, beside, beyond, down, near, off, onto, outside, thru, toward, towards, unto, upon} \tabularnewline%
F\_PRON (49) & \texttt{anybody, anyhow, anyone, anything, anyway, anywhere, elsewhere, everybody, everyone, everything, everywhere, her, hers, herself, him, himself, i, me, mine, my, myself, nobody, none, noone, nowhere, ours, ourselves, she, somebody, somehow, someone, something, sometime, somewhat, somewhere, theirs, themselves, whence, whither, whoever, whom, whomever, why, you, your, yours, yourself, yourselves} \tabularnewline%
F\_PART (5) & \texttt{almost, amongst, down, off, quite, yes} \tabularnewline%
F\_CONN (17) & \texttt{furthermore, hence, howbeit, insofar, neither, nonetheless, nor, thence, though, unless, whenever, whereafter, whereas, whereby, wherein, whereupon, wherever} \tabularnewline%
F\_Q (25) & \texttt{billion, eight, eighty, eleven, fifteen, fifth, fifty, forty, million, nine, ninety, secondly, sixty, seven, seventy, six, ten, third, thirty, thousand, trillion, twelve, twenty, twice, zero} \tabularnewline%
\smallskip
C\_ADV (71) & \texttt{accordingly, actually, afterwards, again, apart, away, awfully, besides, certainly, clearly, consequently, definitely, differently, downwards, else, entirely, evenly, ever, formerly, hardly, hereafter, hereby, herein, hereupon, hither, hopefully, inasmuch, indeed, inward, lately, latterly, mainly, meanwhile, merely, moreover, mostly, namely, nearly, never, nevertheless, non, normally, nothing, nowhere, obviously, otherwise, overall, perhaps, preferably, presumably, really, reasonably, seriously, sometimes, soon, sure, thereafter, thereby, therein, thereof, thereto, thereupon, thorough, thoroughly, throughout, today, truly, unfortunately, unlikely, whatever} \tabularnewline%
C\_ADJ (41) & \texttt{alone, big, brief, clear, early, except, former, forth, great, greater, greatest, highest, immediate, inner, kind, largely, latest, latter, longer, longest, main, mean, near, newer, newest, novel, old, older, oldest, regardless, sensible, serious, smallest, sorry, suitable, thick, thin, whole, young, younger, youngest} \tabularnewline%
C\_N (2) & \texttt{ones, regards} \tabularnewline%
C\_V (52) & \texttt{began, came, come, contain, differ, gave, gets, getting, go, goes, going, gone, got, gotten, interested, keep, keeps, kept, knew, knows, less, lets, liked, liked, likes, maybe, needing, okay, preferred, presenting, presents, put, puts, regarding, seeing, seem, seemed, seeming, seems, self, specify, specifying, sub, taken, took, unlike, want, wanted, wanting, wants, went, worked} \tabularnewline%
\end{tabular}
\end{table}
%
The final generated stopword list contains 804 unique terms and is
attached in Section~\ref{app:sec:full_stoplist} in the Appendix.
% coverage
Figure~\ref{fig:stoplist_comparison_optimal_reduced} shows the
overlap of the generated list and other stopword lists. In this
figure stopwords are sorted in alphabetical order, where the number
at the top of each column $\#SW$ is the number of contained
stopwords, and the number in the column headers shows the actual
size of the list.
%
One has to note that most stopword lists comprise all single letters
and terms with apostrophes, hyphens, and other special characters.
In the experiments such terms have been excluded by utilizing the
token types assigned through Extended Tokenization.
%
In order to achieve more accurate comparison results, stopwords of
other lists that contain apostrophes (e.g., aren't, it's, they've)
are removed.
%
\begin{figure}[ht]
\centering
\includegraphics[width=1.0\textwidth]{05_text_analysis/figures/stopword_comparison_optimal_reduced}\\
\caption{Final stopword coverage}
\label{fig:stoplist_comparison_optimal_reduced}
\end{figure}
%
% final coverage
The stopword coverage of the final stoplist is above 77\% for all
lists: Fox (77,4\%), SL1 (77,0\%), SL2 (85,6\%), SL3 (80,5\%),
Google (83,3\%), Manning (98,2\%), BioPD (88,0\%), Glasgow (90,0\%),
and CLEF Smart (80,6\%).
% appendix remark and transition
After stopwords are filtered the remaining set of index terms is
undergoing a stemming process, described in the next subsection.
%----------------------------------------------------
%----------------------------------------------------
%----------------------------------------------------
\section{Stemming}
% intro
Stemming is one of the simplest and most successful methods of
natural language processing for information
retrieval~\cite{croft95corpusspecific}. Early work done by Lovins
goes back to 1968~\cite{lovins68stemming}, already pointing out the
benefits of applying stemming in information retrieval. Today,
stemming has become a widely accepted processing step in nearly all
retrieval systems.
% what is stemming
Stemming itself is the process of reducing a full word form to its
stem. Although quite similar to morphological processing, it is not
linguistically motivated but has somewhat different
goals~\cite{croft95corpusspecific}. It refers to a normalization of
terms by removing affixes (prefixes and suffixes)~\cite[pp.
140]{grossman04ir}, which is done mainly for two reasons:
%
\begin{itemize}
%
\item First, the vocabulary is kept as small as possible. For instance, the terms
\texttt{hunt}, \texttt{hunts}, and \texttt{hunted} are reduced to
the single form \texttt{hunt}.
%
\item Second, the recall of search results is increased by retrieving identical stems
of words having different spellings as complete words (e.g., another
tempus, singular versus plural, comparative forms). Thus, the term
\texttt{house} matches the terms \texttt{housing}, \texttt{housed},
\texttt{houses}, etc.
%
\end{itemize}
% problems
Besides these benefits, retrieval systems that apply stemming face
some drawbacks~\cite[pp. 132--134]{manning02nlp}. Generally,
stemmers like the Porter stemmer~\cite{porter80suffix} reduce word
forms by simply stripping their suffixes. This method works fine for
most English words. However, such processing also leads to unwanted
and even incorrect results. For instance, the rule deleting the word
ending \texttt{-ly} reducing adverbs to adjectives but also reduces
\texttt{July} to \texttt{Ju}. The reason for that is that the stem
is computed without any deeper linguistic knowledge and analysis. In
other languages with rich morphology (e.g., German) these stemmers
perform very poor. Krovetz and
Croft~\cite{krovetz89word,krovetz93morphology} tried to tackle this
problem using large dictionaries to ensure correct stems. However,
complete lists are not likely to be available because of the
maintenance costs. Another difficulty is that semantically different
words (e.g., \texttt{gallery} and \texttt{gall}) are both stemmed to
\texttt{gall}, leading to unwanted matches of terms in the query and
in the documents. This is the reason why some web search engines do
not apply stemming at all~\cite[pp. 168]{baeza-yates99ir}.
% process of stemming
Generally, stemmers reduce terms to stems by accessing a list of
suffixes that are stripped. The suffixes are sorted according to
their length, where longer suffix matches are stripped before
shorter ones. This ensures that the example rules in
Equation~\ref{eq:stem_1} and Equation~\ref{eq:stem_2} reduce the
term \texttt{stresses} to \texttt{stress} (and not to
\texttt{stresse}).
%
\begin{eqnarray}
sses & \rightarrow & ss \label{eq:stem_1}\\
s & \rightarrow & \emptyset \label{eq:stem_2}
\end{eqnarray}
% reason for using stemmers
Besides its advantages according to the vocabulary size, the reason
for applying stemming is its performance and deterministic behavior,
mostly implemented as finite state automata. Although it often
reduces full forms very poorly, it does not greatly impact the
retrieval result because both, the query and the content of
documents are stemmed: the same error during indexing is done for
the query, which still results in a match of both wrongly stemmed
terms.
%During retrieval, this ensures high recall but decreases precision.
%Hence in the structured document domain retrieval is an iterative
%process of constantly refining search results and browsing
%documents, this drop of precision during initial searches is
%overcome in later search cycles.
% transition
The stems serve as index terms for the textual representation. In a
final step the frequencies of the stems are summed up in order to
reflect the importance of each `concept'. During retrieval, these
frequencies are used to compute each stems' weight for comparison.
%----------------------------------------------------
%----------------------------------------------------
%----------------------------------------------------
\section{Summary}
This chapter introduced the relevant aspects of transforming natural
language text into a computable representation that is comparable to
others. Therefore, the involved natural language processing steps
are described and the processing chain is explained.
The steps include tokenization, tagging, stemming, and stopword
filtering. The final processing result is a term frequency vector
that reflects the stemmed terms and their frequency within the
initial text. The generated representation is stored in the database
accessible during later retrieval phases.
|