"Science must begin with myths, and with the criticism of myths." Karl Popper
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Chapter 5 - Basic Natural Language Processing
%
% last change: 16.08.2004
% correction hamid: xx.xx.2004
% correction prof: xx.xx.2004
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\aphorism{The voyage of discovery is not in seeking new landscapes but in having new eyes.}{Marcel Proust}%
\chapter{Generation of Natural Language Resources Supporting Information Retrieval}%
\label{chapter:advanced_text_analysis}%
Natural language resources consist of dictionaries and sets of
language rules that improve the quality of text analysis, and thus,
the quality of text representations.
%Examples include lists of abbreviations, tokenization rules, and
%multi-term dictionaries (e.g., composite nouns).
Incorporating them in retrieval systems increases both,
computational performance and retrieval quality.
%
This chapter presents statistical methods that extract natural
language resources from a large set of documents automatically.
Processing solely relies on the computationally light-weight tasks
of Extended Tokenization (described in
Section~\ref{sec:extended_tokenization}) and stopword filtering
(described in Section~\ref{sec:stopwords}).
%
Resources generated include a set of single-token typing rules and a
set of multi-token typing rules improving tokenization, a full form
dictionary, an abbreviation dictionary, multi-term dictionaries for
the identification of index phrases (composite nouns, named
entities, and formulaic speech), and acronym dictionaries for
detection and re-substitution of acronyms and full forms.
%----------------------------------------------------
%----------------------------------------------------
%----------------------------------------------------
\section{Introduction}
% NLP & IR
Extended Tokenization, as discussed in
Section~\ref{sec:extended_tokenization}, is applied in natural
language processing mainly for index and query computation.
Obviously, the quality of retrieval systems strongly relies on the
quality of its index representation and on the way the query is
analyzed and processed.
%
Incorporating the generated resources in text analysis frameworks
improves central linguistic tasks such as tagging, parsing, term
extraction and filtering, named entity recognition, etc. The better
the natural language analysis, the better the representation of
textual contents. As a consequence, improvements of natural language
text representation improves the quality of information retrieval
tasks.
% NLP resources
The performance of natural language processing tools depends on the
underlying resources like dictionaries and knowledge bases (e.g.,
morphological and syntactical rules).
%
Creating these resources is time and cost expensive, which often is
the reason for trusting external sources that are freely available.
Due to the high creation effort, these resources are generally
tailored to certain domains and applications.
%
Thus, different sources use different terminologies, data
structures, and implicit or ambiguous semantics to describe the same
concepts. Quality assessments of resources are carried out rarely.
Only in few cases external resources can be applied as given, and
often much efforts are needed to rework and adapt them.
% exploit tokenization
In general, tokenizers preprocess texts for further analysis tasks.
However, the output of a sophisticated tokenizer can also be applied
to generate lists of abbreviations, acronyms, named entities, full
word forms (dictionaries), and domain and application specific text
patterns in an unsupervised way.
%
Since tokenization is a light-weight process, large amounts of data
can be processed efficiently. Without much effort, experts are able
to filter well prepared lists and extract information in a
convenient way.
%
% Motivation & idea
This section relies on the output of JavaTok -- the Extended
Tokenization prototype. Neither stemming nor stopword filtering, as
described in the previous chapter, have been applied. The output of
the tokenizer is exploited to serve information extraction and text
mining tasks. Based on a large amount of example texts, domain- and
application-specific resources are created in a cheap, fast, and
simple way with minimal human intervention.
% Tokenizer, Extractor, Miner, and how they work
Figure~\ref{fig:tokenizer_extractor_miner} shows the three
components that generate linguistic resources from large amounts of
plain text. It is important to note that this process starts from
scratch, not relying on any of the representations computed earlier.
%
In a first step, JavaTok is applied to tokenize each text.
%
Its output is passed to a Pattern Extractor that identifies and
extracts predefined token sequences (patterns) of length $b$. For
each pattern, the extraction process includes a left (pre-window of
length $a$) and a right (post-window of length $c$) context of
tokens. Identified patterns ($b$ tokens) are sorted according to
their frequency in descending order. Corresponding full windows of
$a+b+c$ tokens (pre-window, pattern, post-window) are additionally
stored and sorted the same way.
%
In a final step a Rule Miner analyzes the extracted full windows of
each pattern. Based on statistics, rules are proposed that identify
the current pattern according to its context. Rules are sorted by
their length (number of tokens addressed) and ranked according to
the proportion of full window matches.
%
\begin{figure}[ht]
\centering
\includegraphics[width=1.0\textwidth]{06_advanced_nlp/figures/Information_Mining_Architecture}
\caption{Information extraction and text mining tasks}
\label{fig:tokenizer_extractor_miner}
\end{figure}
% resources
The machinery depicted in Figure~\ref{fig:tokenizer_extractor_miner}
is used to extract language resources that improve information
retrieval. Relying on corpus-based statistics (e.g., pattern
frequency, document frequency), it is demonstrated how resources can
be generated efficiently from large text corpora with minimal manual
effort.
%
The resources include:
%
\begin{description}
%
\item[Single-token rules]
Focusing on single-tokens of unknown type, rules that identify single-token
types are incrementally added. These basic token-types enable
subsequent extraction of more elaborated resources.
%
\item[Multi-token rules]
Identification of complex multi-token rules based on single-tokens
(strings and types) lead to better tokenization results.
Especially tagging is improved by glueing conceptual
related single-tokens to higher level multi-tokens.
%
\item[Rule mining]
In order to enhance retrieval of sophisticated textual patterns,
multi-token rules are identified automatically. Based on
a bootstrapping mechanisms, context rules are learnt from a few
known `seed' patterns. The new rules can be used to extract
further patterns that occur in the same context.
%
\item[Single-term dictionaries]
Single-term dictionaries refer to traditional term lists, including full
form lexicons, lists of abbreviations that improve the reliability of
sentence border disambiguation, domain-dependent acronyms, URIs, phone
numbers, and special formats (e.g., hyphenated forms such as
\texttt{rule-based}, \texttt{state-of-the-art}). Extraction of
specific word category dictionaries is done by focusing on
token types applying regular expression matching of token
strings (e.g., \texttt{-ly} ending adverbs, \texttt{-ion} and \texttt{-ings}
ending nouns, \texttt{-ing} and \texttt{-ed} ending verbs).
%
\item[Multi-term dictionaries]
Multi-term dictionaries consist of multi-tokens that serve as
additional multi-term or phrase indices. These indices improve retrieval tasks
by representing concepts where the order of words is significant
(e.g., composite nouns, named entities, expanded acronyms).
%
\end{description}
%
%
% outlook
The next section describes the general procedure of the extraction
tasks. Subsequently, the different approaches generating natural
language resources are presented.
%----------------------------------------------------
%----------------------------------------------------
%----------------------------------------------------
\section{Experimental Setting and Procedure}
% introduction, language independent
All experiments rely on English computer science texts of the INEX
2005 corpus described in Section~\ref{sdr:sec:corpus}. However, the
proposed extraction methods can be applied to other languages and
domains in the same manner. Achieved results are meant as
guidelines, sketching ways to smartly extract linguistic resources
on a surprisingly low processing level.
%
%
% general procedure, iterative
The resources are generated and extended in an iterative process
using JavaTok and a Pattern Extractor. Both tools are complemented
by a Rule Miner that identifies statistically motivated rules based
on the contexts of extracted patterns (see
Figure~\ref{fig:tokenizer_extractor_miner}).
% JavaTok and Pattern Extractor
In each iteration step the set of documents is tokenized using
JavaTok.
%
% Pattern Extractor
The output is passed to the Pattern Extractor which extracts token
patterns and corresponding contexts. Patterns are defined as an
arbitrary number of tokens that are specified by token strings and
token types. Positive (e.g., only tokens of type \texttt{ABBR}) and
negative (e.g., no token strings ending by \texttt{-ing}) pattern
definitions are supported. Additionally, a list of known exceptions
(negative token strings; e.g., \texttt{July} is not an adverb
although ended by \texttt{-ly}) and a list of known strings
(positive token strings; e.g., \texttt{wrt.} is a known
abbreviation) are used to exclude previously identified patterns. If
necessary, a left context (pre-window) and a right context
(post-window) of arbitrary length can be included. The Pattern
Extractor outputs the extracted patterns (and their contexts)
according to their frequency in descending order.
%
%
% explanation of the extraction process
By investigating the top $n$ outputs, patterns are efficiently
identified and processed. One can either
%
\begin{compactenum}
\item assign the extracted pattern to a list of positive token strings, or
\item assign the extracted pattern to a list of negative token strings, or
\item set up a rule that identifies the pattern as a certain token type, or
\item set up a rule that ignores the pattern in the next iteration step, or
\item leave the pattern untouched.
\end{compactenum}
%
%
% effects
Adding a pattern to the positive list (option 1) results in a
dictionary of highly confident patterns. This dictionary is
supported by positive rules that identify the pattern (option 3).
The effect of a new rule can instantly be checked by extracting all
tokens assigned to the new rules' token type. Similarly, options (2)
and (4) exclude unwanted patterns. Each option (1)--(4) prevents the
pattern from being extracted in subsequent iteration steps. Due to
the iterative process, both, the lists (positive and negative token
strings), and tokenization rules (positive and negative), grow
quickly. Thus, the number of patterns investigated in the next
iteration step is reduced efficiently.
%
%
% Rule Miner
In addition to the manual identification of positive rules the Rule
Miner can be applied to extract rule proposals. Statistical ranking
of these rules supports the rule identification process.
%The rest of this chapter outlines the generated results ...
%Sticking to this procedure ...
%In this work this procedure is exemplarily applied to create (1) a
%list of common abbreviations, (2) a set of basic token type rules
%for single-tokens, (3) a list of multi-tokens, and (4) a set of
%rules for multi-tokens.
%----------------------------------------------------
%----------------------------------------------------
%----------------------------------------------------
\section{Definition of Basic Token Types for Single-Tokens}
% introduction
Basic token types provide an abstraction of tokens that classify
token strings into a set of interpretable categories. During
Extended Tokenization these types are assigned to tokens in an
initial step. Although token types are not interpreted by the
tokenizer itself, subsequent processing, as described in this
chapter, is supported by focusing on certain subsets of tokens.
% process
Relying on the output of the Pattern Extractor, basic token types
are identified manually. Therefore, single-tokens of unknown type
($T_u$) are investigated. Types are assigned by regular expression
rules that are applied on token strings. According to the
single-tokens extracted, three main categories, alphabetic
($ALPHA$), numeric ($NUMERIC$), and special ($ENTITY$) tokens are
distinguished. Each category consists of further subcategories (see
Figure~\ref{fig:javatok_basictokentypes}). Within a short time of
several hours a set of 72 basic token typing rules is constructed
(see
Tables~\ref{tab:tok_type_alpha_common}--\ref{tab:tok_type_special_www}).
In the sequel the rules of each category are described.
\subsection{Alphabetic Tokens ($ALPHA$)}
Alphabetic tokens refer to `proper words' of a text. Generally,
search engines rely on these terms as index terms. Interestingly, a
wide variety of different types occur frequently throughout the INEX
documents. For the sake of improving postprocessing tasks, 49
specialized identification rules are identified and organized in
three subcategories.
%
%
% common
$COMMON$ alphabetic tokens refer to full form words (see
Table~\ref{tab:tok_type_alpha_common}).
%
% acronym
The alphabetic subcategory $ACRONYM$ marks acronyms of various forms
(see Table~\ref{tab:tok_type_alpha_acronym}).
%
% special
Specially formatted tokens still referring to words are typed as
$SPECIAL$ alphabetic tokens (see
Table~\ref{tab:tok_type_alpha_special}). The number in the table
headers counts the number of rules included.
%
%
\begin{table}[ht]
\centering
\sffamily \footnotesize
\rowcolors{1}{tablerowcolorodd}{tablerowcoloreven}
\caption{$COMMON$ alphabetic token types (26)}
\label{tab:tok_type_alpha_common}
\begin{tabular}{lll}
\rowcolor{tableheadcolor} \color{white} Subcategory & \color{white} Regular Expression & \color{white} Example \tabularnewline%
%subcategory reg.ex. examples
$LOWER$ & [a-z]+ & \texttt{this, house, it} \tabularnewline%
$UPPER$ & [A-Z][a-z]+ & \texttt{Vienna, Bill, Table} \tabularnewline%
\smallskip \smallskip
$UPPER\_NUMBER$ & [A-Z][a-z]+[0-9]+ & \texttt{Win32} \tabularnewline%
%
$ELECTRONICAL\_OBJECT\_1$ & [eE](-[A-Z]?|[A-Z])[a-z]+ & \texttt{e-Commerce, e-mail, eBay} \tabularnewline%
$ELECTRONICAL\_OBJECT\_2$ & i(-[A-Z]?|[A-Z])[a-z]+ & \texttt{iPod, i-Pod, i-pod} \tabularnewline%
\smallskip \smallskip
$MIXED$ & ([a-z]+[A-Z]+[a-zA-Z]*|[A-Z]+[a-z]+[A-Z]+[a-zA-Z]*) & \texttt{thiS, HoUsE, iT} \tabularnewline%
%
$HYPHEN\_1$ & [a-z]+-[a-z]+ & \texttt{red-hat} \tabularnewline%
$HYPHEN\_2$ & [a-z]+-[a-z]+(-[a-z]+)+ & \texttt{state-of-the-art} \tabularnewline%
$HYPHEN\_3$ & [a-z]+-[0-9]+ & \texttt{mid-1990} \tabularnewline%
$HYPHEN\_4$ & [A-Z]+-[a-z]+ & \texttt{NP-hard} \tabularnewline%
$HYPHEN\_5$ & [A-Z]+-[A-Z][a-z]+ & \texttt{NP-Complete} \tabularnewline%
$HYPHEN\_6$ & [A-Z]-[A-Z]?[a-z]+ & \texttt{B-Spline, A-test} \tabularnewline%
$HYPHEN\_7$ & [A-Z][a-z]+-[A-Z][a-z]+ & \texttt{Master-Mind, New-York}\tabularnewline%
$HYPHEN\_8$ & [A-Z][a-z]+-[a-z]+ & \texttt{Master-switch} \tabularnewline%
$HYPHEN\_9$ & [A-Z][a-z]+-[A-Z][a-z]+(-[A-Z][a-z]+)+ & \texttt{Master-Editor-Chief} \tabularnewline%
$HYPHEN\_10$ & [A-Z]?[a-z]+-[A-Z]?[a-z]+(-[A-Z]?[a-z]+)+ & \texttt{Editor-in-Advance} \tabularnewline%
$HYPHEN\_3\_PLURAL$ & [a-z]+-[0-9]+s & \texttt{mid-1990s} \tabularnewline%
$HYPHEN\_PREFIX$ & [A-Z]?[a-z]+- & \texttt{Pre-, post-} \tabularnewline%
\smallskip \smallskip
$HYPHEN\_POSTFIX$ & -[a-z]+ & \texttt{-ness} \tabularnewline%
%
$CONTRACTION\_IS\_HAS$ & ([Tt]hat's|[Tt]here's|[Hh]e's|[Ss]he's|[Ii]t's) & \texttt{It's, That's, he's} \tabularnewline%
$CONTRACTION\_HAD\_WOULD$ & ([A-Z]?[a-z]+|I)'d & \texttt{he'd, They'd, We'd} \tabularnewline%
\smallskip \smallskip
$CONTRACTION\_CLITICS$ & (I|[A-Z]?[a-z]+)'(ve|ll|re|m|t) & \texttt{I'm, They've, We're} \tabularnewline%
%
$APOSTROPHE\_POSS\_GEN\_1$ & [a-z]+'s & \texttt{egg's} \tabularnewline%
$APOSTROPHE\_POSS\_GEN\_2$ & [A-Z][a-z]+'s & \texttt{Million's, Dude's} \tabularnewline%
$APOSTROPHE\_POSS\_GEN\_3$ & [a-z]+-[0-9]+'s & \texttt{mid-1990's} \tabularnewline%
$APOSTROPHE\_POSS\_GEN\_4$ & ([a-z]+[A-Z]+[a-zA-Z]*|[A-Z]+[a-z]+[A-Z]+[a-zA-Z]*)'s & \texttt{tHaT's, McDonald's} \tabularnewline%
\end{tabular}
\end{table}
%
\begin{table}[ht]
\centering
\sffamily \footnotesize
\rowcolors{1}{tablerowcolorodd}{tablerowcoloreven}
\caption{$ACRONYM$ alphabetic token types (15)}
\label{tab:tok_type_alpha_acronym}
\begin{tabular}{llL{8,6cm}}
\rowcolor{tableheadcolor} \color{white} Subcategory & \color{white} Regular Expression & \color{white} Example \tabularnewline%
%subcategory reg.ex. examples
$SIMPLE$ & [A-Z][A-Z]+ & \texttt{USA, ACM, IEEE} \tabularnewline%
$NUMBER\_FIRST$ & [0-9]+[A-Z]+ & \texttt{3D, 123US} \tabularnewline%
$SLASH$ & [A-Z]+(/[A-Z]+)+ & \texttt{IEEE/ACM} \tabularnewline%
$AND$ & [A-Z]+\&[A-Z]+ & \texttt{AT\&T, ABC\&DEFG, A\&PC} \tabularnewline%
$SLASH\_NUMBER$ & [A-Z]+/[0-9]+ & \texttt{PS/2, RS/6000} \tabularnewline%
\smallskip \smallskip
$MIXED\_NUMBER$ & [A-Z]+[0-9]+[A-Z]* & \texttt{P2P, J2EE} \tabularnewline%
%
$PLURAL\_1$ & [A-Z]+s & \texttt{PCs, CPUs, XMLs} \tabularnewline%
$PLURAL\_2$ & [A-Z]+/[A-Z]+s & \texttt{I/Os, DA/Ps} \tabularnewline%
\smallskip \smallskip
$PLURAL\_3$ & [A-Z]+-[A-Z]+s & \texttt{ACM-IEEEs, SP-SSs} \tabularnewline%
%
\smallskip \smallskip
$NAME$ & [A-Z]$\backslash$.-?([A-Z]$\backslash$.)+ & \texttt{E.G.T., M.-T., A.B.-C.D.} \tabularnewline%
%
$HYPHEN\_1$ & [A-Z]+-[A-Z]+ & \texttt{ACM-IEEE, SP-SS} \tabularnewline%
$HYPHEN\_2$ & [A-Z]+-[A-Z]+(-[A-Z]+)+ & \texttt{IF-THEN-ELSE} \tabularnewline%
\smallskip \smallskip
$HYPHEN\_NUMBER$ & [A-Z]+-[0-9]+ & \texttt{F-117, B-52, US-123} \tabularnewline%
%
\smallskip \smallskip
$APOSTROPHE\_POSS\_GEN$ & [A-Z]+'s & \texttt{USA's, ACM's} \tabularnewline%
%
$MIXED (ALPHA\_MIXED)$ & [A-Z][A-Z]+[a-z]+ & \texttt{IPsec, ITtalks, KBytes} \tabularnewline%
\end{tabular}
\end{table}
%
\begin{table}[ht]
\centering
\sffamily \footnotesize
\rowcolors{1}{tablerowcolorodd}{tablerowcoloreven}
\caption{$SPECIAL$ alphabetic token types (8)}
\label{tab:tok_type_alpha_special}
\begin{tabular}{llL{7cm}}
\rowcolor{tableheadcolor} \color{white} Subcategory & \color{white} Regular Expression & \color{white} Example \tabularnewline%
%subcategory reg.ex. examples
$APOSTROPHE\_NAME$ & (O|Mc|De)'[A-Z][a-z]+ & \texttt{O'Conner, Mc'Donalds} \tabularnewline%
$APOSTROPHE\_NAME\_POSS\_GEN$& (O|Mc|De)'[A-Z][a-z]+'s & \texttt{O'Conner's, Mc'Donalds's} \tabularnewline%
\smallskip \smallskip
$APOSTROPHE\_GENERAL$ & [A-Z]?[a-z]*'[A-Z]?[a-z]+ & \texttt{Int'l} \tabularnewline%
%
$ALTERNATIVE\_1$ & [a-z]+(/[a-z]+)+ & \texttt{input/output} \tabularnewline%
$ALTERNATIVE\_2$ & [A-Z][a-z]+(/[A-Z][a-z]+)+ & \texttt{May/June} \tabularnewline%
\smallskip \smallskip
$ALTERNATIVE\_3$ & [A-Z]?[a-z]+$\backslash$.(/[A-Z]?[a-z]+\.)+ & \texttt{Jan./Feb.} \tabularnewline%
%
$SLASH\_NAME$ & [A-Z][a-z]+/[0-9]+ & \texttt{System/6000, Ansi/200} \tabularnewline%
$UNDERLINE\_COMPOUND$ & [A-Z]?[a-z]+(\_[A-Z]?[a-z]+)+ & \texttt{By\_the\_End} \tabularnewline%
\end{tabular}
\end{table}
\FloatBarrier
\subsection{Numeric Tokens ($NUMERIC$)}
Numeric tokens identify numbers and number-like strings which
generally are not included in the set of index terms. Three
different subtypes of numeric tokens are defined: plain numbers
($PLAIN$, see Table~\ref{tab:tok_type_num_plain}), formats that
contain numbers ($FORMAT$, see Table~\ref{tab:tok_type_num_format}),
and special number constructs ($SPECIAL$, see
Table~\ref{tab:tok_type_num_special}).
%
%
\begin{table}[ht]
\centering
\sffamily \footnotesize
\rowcolors{1}{tablerowcolorodd}{tablerowcoloreven}
\caption{$PLAIN$ numeric token types (6)}
\label{tab:tok_type_num_plain}
\begin{tabular}{llL{9cm}}
\rowcolor{tableheadcolor} \color{white} Subcategory & \color{white} Regular Expression & \color{white} Example \tabularnewline%
%subcategory reg.ex. examples
$SIMPLE$ & [0-9]+ & \texttt{12345} \tabularnewline%
$SIMPLE\_SIGNED$ & [$\backslash$+-][0-9]+ & \texttt{+512, -18} \tabularnewline%
$PERIOD$ & [0-9]+($\backslash$.[0-9]+)+ & \texttt{123.45} \tabularnewline%
$COMMA$ & [0-9]+(,[0-9]+)+ & \texttt{123,45} \tabularnewline%
$PLURAL$ & [0-9]+s & \texttt{1990s, 60s} \tabularnewline%
$APOSTROPHE\_POSS\_GEN$ & [0-9]+'s & \texttt{1990's} \tabularnewline%
\end{tabular}
\end{table}
%
\begin{table}[ht]
\centering
\sffamily \footnotesize
\rowcolors{1}{tablerowcolorodd}{tablerowcoloreven}
\caption{$FORMAT$ numeric token types (13)}
\label{tab:tok_type_num_format}
\begin{tabular}{lll}
\rowcolor{tableheadcolor} \color{white} Subcategory & \color{white} Regular Expression & \color{white} Example \tabularnewline%
%subcategory reg.ex. examples
$PERIOD\_CHAPTER $ & [0-9]+$\backslash$.[0-9]+($\backslash$.[0-9]+)+ & \texttt{3.2.1} \tabularnewline%
\smallskip \smallskip
$VERSION $ & [0-9]+([$\backslash$.-][0-9]+)+[[A-Za-z]+] & \texttt{3.2.1-5alpha, 0.3.2-1, 0-4-3} \tabularnewline%
%
$RANGE $ & [0-9]+--?[0-9]+ & \texttt{25-28, 11--17} \tabularnewline%
$TIME\_SCORE\_PROPORTION$ & [0-9]+(:[0-9]+)+ & \texttt{12:45, 10:09:99} \tabularnewline%
$SLASH\_NUMBER $ & [0-9]+/[0-9]+ & \texttt{3/4, 1/2, 99/100} \tabularnewline%
\smallskip \smallskip
$CARDINAL $ & [0-9]+(st|nd|rd|th) & \texttt{1st, 2nd, 15th} \tabularnewline%
%
$PERCENT $ & [0-9]+\% & \texttt{1\%, 99\%, 127\%} \tabularnewline%
$DOLLAR\_1 $ & [0-9]+\$ & \texttt{123\$} \tabularnewline%
\smallskip \smallskip
$DOLLAR\_2 $ & $\backslash$\$[0-9]+ & \texttt{\$123} \tabularnewline%
%
$DATE\_1 $ & (0?[1-9]|1[0-9]|2[0-9]|3[0-1])$\backslash$.(0?[1-9]|1[0-2])$\backslash$.[0-9]{4} & \texttt{12.12.1977, 1.1.1980} \tabularnewline%
$DATE\_2 $ & [0-9]{4}$\backslash$.(0?[1-9]|1[0-2])$\backslash$.(0?[1-9]|1[0-9]|2[0-9]|3[0-1]) & \texttt{1977.12.12, 1980.1.1} \tabularnewline%
$DATE\_3 $ & (0?[1-9]|1[0-9]|2[0-9]|3[0-1])/(0?[1-9]|1[0-2])/[0-9]{4} & \texttt{12/12/1977, 1/1/1980} \tabularnewline%
$DATE\_4 $ & [0-9]{4}/(0?[1-9]|1[0-2])/(0?[1-9]|1[0-9]|2[0-9]|3[0-1]) & \texttt{1977/12/12, 1980/1/1} \tabularnewline%
\end{tabular}
\end{table}
%
\begin{table}[ht]
\centering
\sffamily \footnotesize
\rowcolors{1}{tablerowcolorodd}{tablerowcoloreven}
\caption{$SPECIAL$ numeric token types (1)}
\label{tab:tok_type_num_special}
\begin{tabular}{llL{10.5cm}}
\rowcolor{tableheadcolor} \color{white} Subcategory & \color{white} Regular Expression & \color{white} Example \tabularnewline%
%subcategory reg.ex. examples
$MEASSURE$ & [0-9]+-[a-z]+ & \texttt{128-bit, 64-inch, 84-kg} \tabularnewline%
%
\end{tabular}
\end{table}
\FloatBarrier
\subsection{Entity Tokens ($ENTITY$)}
Special token types describe application and domain specific token
strings. Since the INEX corpus consists of computer science texts, a
subtype $WWW$ for web-related tokens is defined (see
Table~\ref{tab:tok_type_special_www}).
\begin{table}[ht]
\centering
\sffamily \footnotesize
\rowcolors{1}{tablerowcolorodd}{tablerowcoloreven}
\caption{$WWW$ Entity Token Types (3)}
\label{tab:tok_type_special_www}
\begin{tabular}{lll}
\rowcolor{tableheadcolor} \color{white} Subcategory & \color{white} Regular Expression & \color{white} Example \tabularnewline%
%subcategory reg.ex. examples
$EMAIL$ & [0-9a-zA-Z]([-$\backslash$.0-9a-zA-Z])*@([0-9a-zA-Z][-a-zA-Z]*[0-9a-zA-Z]$\backslash$.)+[a-zA-Z]{2,9} & \texttt{abc.def@xml.org} \tabularnewline%
$URL$ & (http|ftp|https)://([0-9a-zA-Z])+($\backslash$.([0-9a-zA-Z])+)+(/([0-9a-zA-Z])*)+ & \texttt{http://hekkas.com} \tabularnewline%
$FILE\_EXT$ & $\backslash$.[a-zA-Z][a-zA-Z][a-zA-Z] & \texttt{.txt, .GIF, .Pdf} \tabularnewline%
%
\end{tabular}
\end{table}
%----------------------------------------------------
%----------------------------------------------------
%----------------------------------------------------
\FloatBarrier
\section{Definition of Complex Token Types for Multi-Tokens}
% introduction
Sequences of token strings and basic token types can be further
exploited to define more elaborated multi-token types. Therefore,
rules that combine, split, and retype single-tokens to create
multi-tokens are applied. Promising candidates consisting of
multiple tokens include phone numbers (see
Table~\ref{tab:tok_ex_phone}), named entities, and special writing
formats (e.g., credit card numbers, product identifiers). Note that
some patterns (e.g., phone numbers like \texttt{+43 (0)463
2700-3511}) are identified context independently, whereas others
must include their textual surrounding (left and/or right contexts).
This section focuses on context independent pattern identification.
% example rule
An example of a multi-token rule is given in
Listing~\ref{listing:javatok_rule}.
%
\begin{lstlisting}[float,caption={[Example rule correcting sentence ends]Example rule correcting sentence ends},label=listing:javatok_rule]
<RULE>
<ID>R-EOS-001</ID>
<IGNORE_DELIMS>true</IGNORE_DELIMS>
<IN>
<TOKEN>
<STRING>.*</STRING>
<TYPE>^[(TT_PM)]</TYPE>
</TOKEN>
<TOKEN>
<STRING>\.</STRING>
<TYPE>TT_EOS</TYPE>
</TOKEN>
<TOKEN>
<STRING>.*</STRING>
<TYPE>ALPHA_COMMON_LOWER</TYPE>
</TOKEN>
</IN>
<OUT>
<TOKEN>
<STRING>$S0$.</STRING>
<TYPE>ABBREV_CAND</TYPE>
</TOKEN>
<TOKEN>
<STRING>$S2$</STRING>
<TYPE>$T2$</TYPE>
</TOKEN>
</OUT>
<DESCRIPTION>
Rule that re-concatenates a period to the previous token if it is followed
by a non-capitalized alphabetic token. This rule corrects wrong sentence ends
and marks abbreviation candidates, eventually handled by subsequent rules.
</DESCRIPTION>
</RULE>
\end{lstlisting}
%
% general rule description
Each rule consists of an ID (line 2), a flag that specifies whether
delimiter tokens have to be considered in the token input sequence
(line 3), an input sequence of tokens defined through string and
type specifications using regular expressions (lines 4--17), an
output sequence of (re)typed tokens eventually creating
multi-tokens, and an optional description. A string manipulation
language was developed to address strings and types of input tokens.
For instance, \texttt{$S0$} (resp. $T0$) refers to the token string
$S$ (resp. type $T$) of the first input token $0$. Additional
commands allow the extraction and recombination of substrings (e.g.,
\texttt{cuttail("aabcc",2)} results in the string \texttt{aab},
\texttt{cuthead("aabcc",3)} results in the string \texttt{cc}).
% extracted rules
Rule examples identifying multi-tokens in the INEX corpus:
%
\begin{description}
%
\item[R-1]
Rule that corrects wrong end of sentence tokens (see Listing~\ref{listing:javatok_rule});\\
e.g., \texttt{int + $END\_OF\_SENTENCE$ + companies} $\rightarrow$ \texttt{int.~companies}
%
\item[R-2]
Rule that re-concatenates hyphenated words at line breaks;\\
e.g., \texttt{auto- + $END\_OF\_LINE$ + matically} $\rightarrow$ \texttt{automatically}
%
\item[R-3]
Rule that splits clitics and expands it;\\
e.g., \texttt{isn't} $\rightarrow$ \texttt{is} + \texttt{not}
%
\item[R-4]
Rule that identifies international phone numbers;\\
e.g., \texttt{+43} + \texttt{(0)463} + \texttt{2700} + \texttt{-} + \texttt{3504} $\rightarrow$ \texttt{+43~(0)463~2700-3504}
%
\end{description}
% usages of rules
The rules described identify patterns directly. This means that a
pattern, optionally with its context(s), is specified by an expert.
This allows an identification of known patterns from a text. With
respect to information retrieval, the set of multi-terms identified
defines the term space for content representation. In X-DOSE, this
procedure identifies multi-terms that represent the multi-term
contents.
However, there is the possibility to identify unknown multi-terms
indirectly by specifying the context surrounding them. In these
rules the multi-term itself is replaced by a wildcard. Using the
Pattern Extractor to extract these context patterns, texts are mined
for additional multi-terms that occur in the same context. Thus, the
term space can be extended automatically based on context analysis
of few known patterns by a bootstrapping-like mechanism.
%
%The extracted rules can be used in two ways: First, rules are
%applied to identify complex patterns in texts for further
%processing. In the structured document retrieval prototype
%developed, multi-terms are identified and extracted to create
%additional indices.
%
%Second, rules identify a small set of high confident token patterns.
%Using the Pattern Extractor to extract these patterns and their left
%and right contexts, texts can be mined for additional patterns that
%occur in the same context. This procedure automatically obtains new
%patterns based on the context analysis of few known patterns
%(bootstrapping-like mechanism).
%
For instance, a small number of country names (e.g.,
\texttt{Austria}, \texttt{Chile}) generates a set of rules that
identify countries in certain contexts (e.g., \texttt{president of
[country]}). In a second step these rules are applied to identify
unknown country names in the same context.
%
The next section describes a statistical approach identifying rules
automatically.
%----------------------------------------------------
%----------------------------------------------------
%----------------------------------------------------
\section{Automatic Rule Extraction}
% introduction
The identification of complex domain-dependent and
application-specific tokenization rules is hard and time consuming
even for experts. Thus, approaches that solve this task
automatically are of importance.
%
% procedure
For this purpose, a Rule Miner is proposed that computes
statistically-motivated tokenization rules based on extracted
patterns and contexts.
%
Single-token patterns assumed to form multi-tokens (e.g., signed
numbers like \texttt{+43}) are extracted with left and right
contexts. The Rule Miner analyzes the contexts and generates rules
that identify the specified pattern. It is important to note that
the size of the context investigated has a major impact on the
computational performance, since token strings, token types, and
combinations of both are considered for rule generation. Strict
matching of strings and types is considered only. However, more
elaborated matching (e.g., same super token type $ALPHA$, regular
expressions that match token strings ended by \texttt{-ly}) seems
promising.
% experiments
Two experiments generating multi-token rules illustrate the usage of
the Rule Miner.
%
%
% EXPERIMENT 1
In a first experiment, plain number contexts with left and right
window sizes of one token are analyzed.
Table~\ref{tab:result_rules_numbers} shows the top ranked outputs
for left, right, and both contexts included.
%
% description of the table
$S$ resp. $T$ stands for a match of the token string resp. token
type. The last column contains the percentage of contexts that are
matched by the rule (out of 1.652.285 extracted patterns).
%
\begin{table}[ht]
\centering
\caption{Top 10 context rules for plain numbers}
\label{tab:result_rules_numbers}
%
\subfloat[Left context]{
\sffamily \footnotesize
\rowcolors{1}{tablerowcolorodd}{tablerowcoloreven}
\begin{tabular}{rlr}
\rowcolor{tableheadcolor} \mcc{ID} & \color{white} Left context & \mcc{\%} \tabularnewline%
%left pattern right %
1 & $T = TT\_PM$ & 37,0\% \tabularnewline%
2 & $T = ALPHA\_COMMON\_LOWER$ & 21,2\% \tabularnewline%
3 & $S = [$ & 16,1\% \tabularnewline%
4 & $T = ALPHA\_COMMON\_UPPER$ & 14,3\% \tabularnewline%
5 & $S = ,$ & 13,2\% \tabularnewline%
6 & $T = TT\_UNKNOWN$ & 10,2\% \tabularnewline%
7 & $T = ABBR$ & 9,9\% \tabularnewline%
8 & $S = Fig.$ & 5,4\% \tabularnewline%
9 & $S = ($ & 5,1\% \tabularnewline%
10 & $T = TT\_EOS$ & 3,6\%
\end{tabular}\label{tab:result_rules_numbers_left}}
%
\quad
%
\subfloat[Right context]{
\sffamily \footnotesize
\rowcolors{1}{tablerowcolorodd}{tablerowcoloreven}
\begin{tabular}{rlr}
\rowcolor{tableheadcolor} \mcc{ID} & \color{white} Right context & \mcc{\%} \tabularnewline%
%left pattern right %
1 & $T = TT\_PM$ & 53,5\% \tabularnewline%
2 & $S = ,$ & 24,7\% \tabularnewline%
3 & $T = ALPHA\_COMMON\_LOWER$ & 24,4\% \tabularnewline%
4 & $S = ]$ & 16,1\% \tabularnewline%
5 & $S = )$ & 9,4\% \tabularnewline%
6 & $T = TT\_EOS$ & 9,0\% \tabularnewline%
7 & $S = .$ & 8,3\% \tabularnewline%
8 & $T = TT\_UNKNOWN$ & 5,4\% \tabularnewline%
9 & $T = ALPHA\_COMMON\_UPPER$ & 4,2\% \tabularnewline%
10 & $S = and$ & 2,8\%
\end{tabular}\label{tab:result_rules_numbers_right}}
%
\\
%
\subfloat[Both contexts]{
\sffamily \footnotesize
\rowcolors{1}{tablerowcolorodd}{tablerowcoloreven}
\begin{tabular}{rllr}
\rowcolor{tableheadcolor} \mcc{ID} & \color{white} Left context & \color{white} Right context & \mcc{\%} \tabularnewline%
%ID left pattern right
1 & $T = TT\_PM$ & $T = TT\_PM$ & 31,7\% \tabularnewline%
2 & $S = [$ & $T = TT\_PM$ & 16,0\% \tabularnewline%
3 & $T = TT\_PM$ & $S = ]$ & 15,9\% \tabularnewline%
4 & $S = [$ & $S = ]$ & 15,8\% \tabularnewline%
5 & $T = ALPHA\_COMMON\_LOWER$ & $T = ALPHA\_COMMON\_LOWER$ & 10,9\% \tabularnewline%
6 & $T = TT\_PM$ & $S = ,$ & 10,1\% \tabularnewline%
7 & $S = ,$ & $T = TT\_PM$ & 10,1\% \tabularnewline%
8 & $S = ,$ & $S = ,$ & 8,9\% \tabularnewline%
9 & $T = ALPHA\_COMMON\_UPPER$ & $T = ALPHA\_COMMON\_LOWER$ & 5,6\% \tabularnewline%
10 & $T = ALPHA\_COMMON\_UPPER$ & $T = TT\_PM$ & 5,5\%
\end{tabular}\label{tab:result_rules_numbers_both}}
%
\end{table}
%\newpage
% discussion of the results in the table
As expected, the percentage of matching contexts drops quickly.
Especially larger contexts (left and right context) occur less
frequent than shorter ones (either left or right context). According
to a threshold (i.e., 10\%), only a small fraction of extracted
rules has to be investigated. Based on the top four results
presented in Table~\ref{tab:result_rules_numbers}c, the rule $[$ +
$NUM\_PLAIN\_SIMPLE$ + $]$ that matches citations (e.g.,
\texttt{[3]}, \texttt{[42]}) is identified. Note that the token type
$TT\_PM$ for punctuation marks includes the characters $[$ and $]$.
%
%
% final list
The final list of rules that treat plain numbers contains:
%
\begin{description}
%
\item[Cross references (Table~\ref{tab:result_rules_numbers}a)]
Numbers preceded by \texttt{Fig.}~(5,4\%), \texttt{Figure}~(2,5\%), \texttt{Section}~(1,6\%), \texttt{Table}~(1,4\%), \texttt{Step}~(0,5\%), \texttt{Theorem}~(0,4\%), \texttt{Lemma}~(0,3\%); \\
e.g., \texttt{Section~4}, \texttt{Fig.~2}, \texttt{Step~5}
%
\item[Month dates (Table~\ref{tab:result_rules_numbers}a)]
Numbers preceded by \texttt{June}~(0,6\%), \texttt{May}~(0,6\%), \texttt{July}~(0,4\%), \texttt{Aug.}~(0,4\%), \texttt{Oct.}~(0,4\%), \texttt{Sept.}~(0,4\%), \texttt{Dec.}~(0,4\%), \texttt{Jan.}~(0,4\%), \texttt{Apr.}~(0,4\%), \texttt{Mar.}~(0,4\%),~\texttt{Nov.}~(0,4\%), \texttt{Feb.}~(0,3\%); \\
e.g., \texttt{May~24}, \texttt{Aug.~3}
%
\item[Quantities (Table~\ref{tab:result_rules_numbers}b)]
Numbers followed by \texttt{percent}~(1,4\%), \texttt{years}~(0,3\%), \texttt{bits}~(0,2\%), \texttt{million}~(0,1\%), \texttt{times}~(0,1\%), \texttt{MHz}~(0,1\%), \texttt{Mbytes}~(0,1\%); \\
e.g., \texttt{43~percent}, \texttt{2~years}, \texttt{5~times}
%
\item[Citations (Table~\ref{tab:result_rules_numbers}c)]
Numbers embraced by square brackets (15,8\%); \\
e.g., \texttt{[2]}, \texttt{[41]}, \texttt{[111]}
%
\end{description}
% EXPERIMENT 2
A second experiment focuses on the identification of international
phone numbers. Preceding country code patterns (e.g., \texttt{++43},
\texttt{+39}) are exploited to identify possible candidates. A left
context of one token and a right context of five tokens are included
in the analysis. The top ten results are presented in
Table~\ref{tab:result_rules_phone}.
%
% description of the table
As before, $S$ resp. $T$ denotes token strings resp. token types,
and the last column is the percentage of contexts that match the
rule (out of 955 extracted patterns). For better readability the
token type $NUM$ stands for $NUMERIC\_PLAIN\_SIMPLE$ (plain
numbers).
%
\begin{table}[ht]
\centering
\caption{Top 10 context rules for international country codes}
\label{tab:result_rules_phone}
%
\subfloat[Left context]{
\sffamily \footnotesize
\rowcolors{1}{tablerowcolorodd}{tablerowcoloreven}
\begin{tabular}{rlr}
\rowcolor{tableheadcolor} \mcc{ID} & \color{white} Left context & \mcc{\%} \tabularnewline%
%left pattern right %
1 & $T = ALPHA\_COMMON\_LOWER$ & 80,6\% \tabularnewline%
2 & $S = fax$ & 37,5\% \tabularnewline%
3 & $S = phone$ & 25,7\% \tabularnewline%
4 & $T = TT\_PM$ & 15,1\% \tabularnewline%
5 & $S = voice$ & 12,4\% \tabularnewline%
6 & $S = :$ & 7,5\% \tabularnewline%
7 & $S = ($ & 6,3\% \tabularnewline%
8 & $T = TT\_EOS$ & 2,6\% \tabularnewline%
9 & $S = ;$ & 2,6\% \tabularnewline%
10 & $S = to$ & 1,4\%
\end{tabular}\label{tab:result_rules_phone_left}}
%
%\quad
%
\subfloat[Right context]{
\sffamily \footnotesize
\rowcolors{1}{tablerowcolorodd}{tablerowcoloreven}
\begin{tabular}{rlr}
\rowcolor{tableheadcolor} \mcc{ID} & \color{white} Right context & \mcc{\%} \tabularnewline%
%left pattern right %
1 & $T = NUM$ & 73,3\% \tabularnewline%
2 & $T = NUM, T = NUM$ & 64,4\% \tabularnewline%
3 & $T = NUM, T = NUM, T = NUM$ & 36,5\% \tabularnewline%
4 & $T = NUM, T = NUM, T = TT\_EOS$ & 23,1\% \tabularnewline%
5 & $T = NUM, T = NUM, S = ;$ & 21,5\% \tabularnewline%
6 & $T = NUM, T = NUM, T = NUM, T = TT\_EOS$ & 19,3\% \tabularnewline%
7 & $T = TT\_PM$ & 19,3\% \tabularnewline%
8 & $T = NUM, T = NUM, T = NUM, S = ;$ & 19,0\% \tabularnewline%
9 & $T = NUM, T = NUM, T = NUM, T = NUM$ & 13,5\% \tabularnewline%
10 & $T = TT\_PM, T = NUM$ & 11,8\%
\end{tabular}\label{tab:result_rules_phone_right}}
%
\\
%
\subfloat[Both contexts]{
\sffamily \footnotesize
\rowcolors{1}{tablerowcolorodd}{tablerowcoloreven}
\begin{tabular}{rllr}
\rowcolor{tableheadcolor} \mcc{ID} & \color{white} Left context & \color{white} Right context & \mcc{\%} \tabularnewline%
%ID left pattern right
1 & $T = ALPHA\_COMMON\_LOWER$ & $T = NUM$ & 66,2\% \tabularnewline%
2 & $T = ALPHA\_COMMON\_LOWER$ & $T = NUM, T = NUM$ & 57,9\% \tabularnewline%
3 & $T = ALPHA\_COMMON\_LOWER$ & $T = NUM, T = NUM, T = NUM$ & 32,0\% \tabularnewline%
4 & $S = fax$ & $T = NUM$ & 31,9\% \tabularnewline%
5 & $S = fax$ & $T = NUM, T = NUM$ & 27,9\% \tabularnewline%
6 & $S = phone$ & $T = NUM$ & 25,3\% \tabularnewline%
7 & $S = phone$ & $T = NUM, T = NUM$ & 23,9\% \tabularnewline%
8 & $T = ALPHA\_COMMON\_LOWER$ & $T = NUM, T = NUM, T = TT\_EOS$ & 22,2\% \tabularnewline%
9 & $T = ALPHA\_COMMON\_LOWER$ & $T = NUM, T = NUM, S = ;$ & 20,5\% \tabularnewline%
10 & $T = ALPHA\_COMMON\_LOWER$ & $T = NUM, T = NUM, T = NUM, T = TT\_EOS$ & 17,5\%
\end{tabular}\label{tab:result_rules_phone_both}}
%
\end{table}
% discussion of the results in the table
Table~\ref{tab:result_rules_phone}a shows that -- in combination --
75,8\% of all country code occurrences are preceded by the words
\texttt{phone}, \texttt{voice}, and \texttt{fax} (optionally
capitalized).
%
Over 70\% of tokens following country codes are numbers (see
Table~\ref{tab:result_rules_phone}b).
%
The experiment confirms the expectation that phone numbers are easy
to identify with high confidence. An investigation of the complete
Rule Miner output confirms that phone numbers are the only
meaningful multi-term pattern in the INEX corpus containing a
country code.
%
\begin{description}
%
\item[Phone numbers (Table~\ref{tab:result_rules_phone}c)]
International phone number preceded by \texttt{fax}~(37,5\%), \texttt{phone}~(25,7\%), \texttt{voice}~(12,4\%), \texttt{telephone/fax}~(0,1\%), \texttt{Fax}~(0,1\%), \texttt{Phone}~(0,1\%) and followed by one~(73,3\%), two~(64,4\%), three~(36,5\%), four~(13,5\%), or five~(4,8\%) plain numbers; \\
e.g., \texttt{phone~+43~1023~234~32345~12}
%
\end{description}
%----------------------------------------------------
%----------------------------------------------------
%----------------------------------------------------
\section{Generating Single-Term Dictionaries}
% introduction
Single-term dictionaries contain single-tokens that support text
analysis and text representation. Previously assigned token types
and additional specifications of token strings are exploited to
generate dictionaries from the corpus automatically. This section
demonstrates the procedure of creating a full form dictionary and an
abbreviation dictionary. The same approach can be applied to
generate other domain-specific dictionaries including
%
\begin{compactitem}
\item email and internet addresses
%
\item domain-relevant acronyms
%
\item hyphenated word forms (e.g., \texttt{state-of-the-art}, \texttt{F-117A})
%
\item words containing certain affixes (e.g., noun dictionaries ended by \texttt{-ion}, terms started by \texttt{re-})
%
\item domain-specific single-token concepts (e.g., chemical formulas like \texttt{H$_2$SO$_4$})
%
\item etc.
\end{compactitem}
\subsubsection{Creating a Full Form Dictionary}
\label{sec:full_form_dictionary}
% full forms
A full form dictionary refers to the set of all `regular' words that
occur in the document collection. The Pattern Extractor selects only
single-token strings that consist of alphabetic lowercase
characters, possibly started by a capital letter. These tokens are
identified by the token types $ALPHA\_COMMON\_LOWER$ and
$ALPHA\_COMMON\_UPPER$. Both, left and right contexts are omitted.
% result
The INEX collection consists of 178.677.541 single-tokens including
delimiter tokens. Following the above guidelines and ignoring
delimiters, a total number of 92.855.161 single-tokens is examined.
Extracted token strings are converted into lowercase with cumulated
term frequencies. The final list contains 171.580 unique full forms
(out of 205.106 upper and lower case tokens). In this work the full
form dictionary is applied for stopword extraction (described in
Section~\ref{sec:stopwords}, first four columns of
Table~\ref{fig:stoplist_ranking_itf_div_idf}) and abbreviation
detection (described in the next section).
\subsubsection{Creating an Abbreviation Dictionary}
% introduction
Similar to the generation of a full form dictionary, a dictionary of
common abbreviations is extracted from the corpus.
%
% tokenizer and pattern extractor settings
JavaTok skips punctuation mark splitting (see
Figure~\ref{fig:javatok_architecture}) and the Pattern Extractor
picks token strings ended by periods only. In order to check the
correctness of the extraction, a left and right context of three
words is included. Preliminary tests have shown that this window
size is large enough to decide whether a pattern is an abbreviation
or not. Extracted results are ranked according to their frequency in
descending order.
% difficulties
An investigation of the extracted patterns clearly shows that a
simple, rule-based abbreviation detection approach is insufficient.
Table~\ref{tab:incorrect_abbrevs} shows incorrect abbreviations
identified among the top 100-ranked results. The examples are sorted
according to their frequency in descending order.
%
\begin{table}[ht]
\centering
\sffamily \footnotesize
\rowcolors{1}{tablerowcolorodd}{tablerowcoloreven}
\caption{Examples of incorrect abbreviation patterns}
\label{tab:incorrect_abbrevs}
\begin{tabular}{L{15cm}}
%
\texttt{Urbana-Champaign., MHz., PCs., '96., '95., '97., '94., 3D.,
PEs., '93., '98., NP-complete,. e-commerce., '99., e-mail., ICs.,
FPGAs., trade-offs., fault-free., '92., flip-flops., QoS.,
time-consuming., NP-hard., '91., deadlock-free., CPUs.,
NP-Completeness., APIs., Jan.-Mar., Wisconsin-Madison., 1998;.,
LANs., run-time., URLs., cost-effective., real-time., trade-off.,
1,000., July-Sept., A.1., '90., Mar.-Apr., PDAs., Jan.-Feb., 1999;.,
don't., GHz., 2D., dB., MultiMedia., ?1., 's., A.2., IPv6.}
%
\tabularnewline%
%
\end{tabular}
\end{table}
% general process
The generation of an abbreviation dictionary is iterative and
demands human intervention for selection purpose only. The idea is
based on the extraction of candidate patterns defined by token
strings ended by a single period.
%
According to Grefenstette and
Tapanainen~\cite{grefenstette94tokenization}, a large number of
non-abbreviations is excluded by using the terms of the corpus
itself as a filter. Since words ended by periods possibly mark ends
of sentences, a comparison to all words occurring within sentences
-- the full form dictionary created in the previous section --
preserves an extraction.
% process
The process starts with two empty list of abbreviations (positives)
and exceptions (negatives), and an empty set of positive
identification and negative exclusion rules. By skimming the
top-ranked entries and their contexts, both lists and rule sets grow
quickly. Applying these lists and rules in subsequent iterations,
the number of correct abbreviations among the top-ranked results
decreases rapidly. The procedure stops if no meaningful strings or
rules are identified.
% results
In the experiments the top $n=100$ extracted entries of the INEX
corpus are investigated. After four iterations 322 corpus-specific
abbreviations (see Table~\ref{tab:abbreviation_examples}) are
generated efficiently. Abbreviations in the table are sorted
alphabetically.
%
The list of exceptions includes the terms listed in
Table~\ref{tab:incorrect_abbrevs}.
%
Positive rules comprise patterns that contain periods within the
string (e.g., \texttt{w.r.t.}, \texttt{i.e.}, \texttt{Ph.D.},
\texttt{Comp.Sci.}).
%
Interestingly, capital letters separated and ended by periods (e.g.,
\texttt{A.}, \texttt{E.R.}, \texttt{C.P.R.}) turned out to be
worthless abbreviation patterns because these decode the first
name(s) of persons. In order to avoid this pattern a negative
exclusion rule $([A-Z]\backslash.)+$ is added.
%
\begin{table}[ht]
\centering
\sffamily \footnotesize
\rowcolors{1}{tablerowcolorodd}{tablerowcoloreven}
\caption{Extracted abbreviations}
\label{tab:abbreviation_examples}
\begin{tabular}{L{15cm}}
%\rowcolor{tableheadcolor} \color{white} Category \tabularnewline%
%category example
\texttt{Acad., Adv., Akad., Amer., Appl., Applic., Apr., Apt., Assn.,
Assoc., Aug., Automat., Av., Ave., B.Eng., B.Sc., B.Tech., BTech.,
Biol., Biomed., Bld., Blvd., Bv., Calc., Calif., Capt., Cdr., Chem.,
Cir., Co., Col., Comm., Comp., Comp.Sci., Comput., Conf., Conn.,
Coop., Corp., Crit., Ct., Ctr., D.Sc., Dec., Depart., Dept., Depts.,
Devel., Dipl., Dipl.-Inform., Dipl.-Ing., Dipl.-Math., Dipl.Ing.,
Discr., Dist., Div., Dr-Ing., Dr., Dr.-Ing., Dr.Eng., Dr.rer.nat.,
Drs., Elec., Eng., Engr., Eq., Eqn., Equip., Feb., Fed., Fig.,
Figs., Fil., Fla., Ft., G.v., GmbH., Govt., Hist., Img., Inc., Ind.,
Inf., Info., Ing., Inst., Instrum., Int'l., Int., Intell., Intl.,
Jan., Jr., Jt., Jul., Jun., Knowl., Lehrst., Lt., Ltd., M.Eng.,
M.Math., M.Phil., M.Sc., MC-Sym., MSc., Mag., Maj., Mar., Md.,
Mgmt., Mgt., Mr., Mrs., Ms., Mt., Multiconf., Nat'l., Natl'l., Nos.,
Nov., Numer., Oct., Oper., Orthop., Ph.D., PhD., Phys., Pre-Proc.,
Proc., Prof., Profs., Rd., Re-Eng., Rev., S.p.A., Sch., Sci., Sect.,
Sekr., Sen., Sept., Soc., Sr., St., Stn., Stud., Succ., Symp.,
Syst., Technol., Tel., Theor., Theoret., Univ., Vol., Vt., W.l.o.g.,
Work-Conf., a.a., a.e.w., a.k.a., a.m., a.s., a.u., aff., alg.,
algm., algms., anal., appl., appls., approx., archit., archits.,
arith., assoc., avail., b.o.d., biol., boul., c.f., c.m., calc.,
ccts., cf., circ., cit., combinat., commun., conf., constr.,
contrib., correl., cross-develop., d.o.f., defn., determ., diags.,
dimens., distr., distrib., dyn., e.V., e.g., edn., eds., elec.,
engng., environ., eq., estim., etc., evol., expt., expts., fl.,
fns., g.c.i., geom., gov., heterog., high-perform., high-resoln.,
hist., hyperb., i.c., i.e., i.i.d., inc., indust., instrum.,
integrat., intell., intens., internat., intr., intro., jns., l.h.s.,
lang., langs., lb., m.c., maint., mech., mfg., mgt., mil.,
mission-crit., modif., mol., n-dimens., n.b.c., n.d.d., navig.,
nonlin., nos., o.c.s., objs., op., optim., p., p.d.f., p.m.,
particle-syst., perform., pers., phenom., phys., pol., poss., pp.,
preconf., probab., prod., prog., propag., pt., punt., qualitat.,
quantitat., r.h.s., r.m.s., r.p.m., r.v., recogn., reconstr., refl.,
reliab., rer., s.d., s.t., sc., scatt., sci., semicond., sens.,
seqs., simul., simuls., single-proc., sq., std., stds., symp.,
syst., systs., t.o., techn., technol., technols., techs., topol.,
transl., transm., vol., volc., vs., w.h.p., w.l.o.g., w.r.t.} \tabularnewline%
\end{tabular}
\end{table}
% consequences and proposed approach: lists + rules
The extracted abbreviations are integrated in JavaTok for enhancing
the sentence end disambiguation.
%
% proposed approach: lists + rules
To cope with the difficulty of incomplete lists, a combined approach
of dictionary lookup and rule-based identification is proposed. A
domain- and application-specific set of abbreviations is kept in a
dictionary.
%
If an abbreviation candidate is not found in the dictionary, the
following regular expression rules for identification are applied:
%
\begin{compactitem}
\item Abbreviations that are prefixes of full forms: e.g., \texttt{avail.}, \texttt{approx.}, \texttt{Technol.}
\item Abbreviations containing uppercase letters: e.g., \texttt{GmbH.}, \texttt{BTech.}
\item Abbreviations containing multiple periods: e.g., \texttt{w.r.t.}, \texttt{M.Math.}, \texttt{S.p.A.}
\item Abbreviations shorter than three letters: e.g., \texttt{pp.}, \texttt{cf.}, \texttt{Av.}
\end{compactitem}
%
If both identification methods are failing, the pattern is not
treated as an abbreviation.
%----------------------------------------------------
%----------------------------------------------------
%----------------------------------------------------
\FloatBarrier
\section{Generating Multi-Term Dictionaries via Concordances}
\label{sec:advanced_nlp:multi_term_dictionaries}
% definition
One hypothesis of this work is that the identification of
multi-terms is helpful not only for natural language processing but
improves information retrieval related tasks. Experiments that
confirm this hypothesis are conducted by Arazy and
Woo~\cite{arazy07enhancing}.
%
Multi-terms, defined as a continuous sequence of words (adjacent
tokens, $n$-grams), often belong together creating a semantic unit.
Multi-terms can be classified into several categories: composite
nouns (e.g., \texttt{information retrieval}), named entities with a
minimum of two tokens (e.g., \texttt{Ford Motor Company}), formulaic
speech (e.g., \texttt{catch the bus}), full forms of acronyms (e.g.,
\texttt{Central Intelligence Agency (CIA)}), etc.
% aim
This section shows how meaningful multi-terms can be extracted from
the corpus automatically. The goal of this section is to extract
meaningful multi-terms from the corpus automatically. The quality of
the extracted results is improved by applying the elaborated
stopword lists (described in Section~\ref{sec:stopwords}) for
filtering of unlikely patterns. Depending on the extraction task the
different stopword layers and linguistic stopword categories are
exploited.
% order is an issue
One important issue of multi-term detection is that their length is
significant. For instance, the multi-terms \texttt{information
retrieval system} and \texttt{information system} are not bound to
each other. In contrast, the multi-terms \texttt{world wide} and
\texttt{wide web} do not occur independently. They occur only in
combination with each other in \texttt{world wide web}. Thus, longer
multi-terms have to be identified before shorter ones.
% three approaches
In the sequel three approaches are described: A first approach,
aiming at composite nouns, extracts token sequences consisting of
small letter tokens only. Sequences containing at least one token
that is a functional stopword or that is shorter than three letters
are neglected.
%
A second approach identifies named entities. Token sequences
consisting of small letters and started by a single capital letter
are extracted. As before, sequences containing stopwords or tokens
smaller than three letters are excluded.
%
A third approach focuses on the extraction of token sequences that
are missed by the previous composite noun extraction and named
entity extraction tasks. The goal is the identification of formulaic
speech and other patterns that are still useful for information
retrieval.
\subsection{Pattern Extraction Suited for Composite Nouns}
% goal and process description
This section deals with the automatic generation of non-capitalized
token sequences that fall into the categories of composite nouns and
non-capitalized named entities longer than a single token. The
Pattern Extractor extracts sequences of two, three, four, five, six,
and seven adjacent tokens of type $ALPHA\_COMMON\_LOWER$.
Preliminary tests indicate that longer sequences are inappropriate
multi-terms because these do not constitute composite nouns but
arbitrary parts of sentences. During extraction, left and right
contexts are not considered. Extracted multi-terms are sorted
according to their frequency in decreasing order.
% stopword filtering
Multi-terms that include stopwords are ignored by the Pattern
Extractor. This is done for two reasons: First, the amount of
extracted patterns increases exponentially compared to the number of
documents analyzed. Thus, excluding multi-terms that contain
stopwords reduces the data load tremendously. The remaining
multi-terms become applicable for information retrieval.
%
Second, neglecting stopwords in token sequences leads to multi-terms
that are closer related to semantic concepts than to complete
phrases. Since users tend to search for concepts instead of phrases
including articles and pronouns\footnote{The daily updated google
trends website provides an insight of popular user queries:
http://google.com/trends (05.05.2008)}, this approach better fits
information retrieval related tasks.
% performance
One has to note that processing of longer multi-terms requires a
considerably larger amount of memory than processing of shorter
ones. Although the number of longer sequences decreases (longer
sequences are more often interrupted by stopwords or tokens of other
types), the number of unique multi-terms is much higher. However,
the procedure can be applied on non-overlapping subsets of documents
independently. A final result for the whole collection is obtained
by merging the results for the sub-collections.
\subsubsection{Experiment I - Stopword Filtering}
% EXPERIMENT 1: experiments with 3000 docs
An initial experiment evaluates the reduction factor of multi-terms
by applying stopword filtering. For illustration and comparison
purpose, this experiment is restricted to 3.000 INEX documents. A
higher number of documents exceeds the maximum of two gigabytes of
Java-addressable working memory.
%
For each multi-term length (two, three, four, five, six, and seven
tokens) several experimental runs are conducted. One run without
stopword filtering, one run for each linguistic stopword category,
one run for each stopword layer (functional, content-related, and
domain-specific), and one run for the whole set of stopwords.
% results
Figure~\ref{fig:multiterms_3000_stopwords_01} presents the results
of the experiment. The different lengths of the multi-terms are
color-coded. The type of stopword filtering (see
Section~\ref{sec:stopwords}) is indicated by the labels on the
$x$-axis. $NONE$ means no stopword filtering, $FS$ (resp. $CR$ and
$DS$) are the functional (resp. content-related and domain-specific)
stopword layers, $FS\_SUM$ (resp. $CR\_SUM$ and $DS\_SUM$) stands
for the union of all $FS$ (resp. $CR$ and $DS$) stopwords, and $ALL$
is the union of all stopwords.
%
\begin{figure}[ht]
\centering
\includegraphics[width=1.0\textwidth]{06_advanced_nlp/figures/multiterms_3000_stopwords_01}
\caption[Using stopword filtering on multi-terms]{Using stopword filtering on multi-terms (color code indicates length of the multi-terms)}
\label{fig:multiterms_3000_stopwords_01}
\end{figure}
%
For instance, the figure shows that the total number of 6.117.535
multi-terms of two tokens is reduced to 1.439.427 (23,5\%) by
functional stopwords, 4.580.504 (74,9\%) by content-related
stopwords, and 5.412.582 (88,5\%) by domain-specific stopwords. The
union of all stopwords reduces the number of multi-terms to 614.141
(10,0\%). Especially determiners $DET$, prepositions $PREP$, and
connectors $CONN$ turn out to be excellent linguistic filter
categories.
% results cumulative
In addition to the filtering effect of independent stopword
categories and stopword layers, the cumulative filtering effect of
stopwords is investigated in
Figure~\ref{fig:multiterms_3000_stopwords_cumulative}. This effect
is evaluated on the same but unique set of multi-terms extracted.
%
In the figure, the linguistic stopwords categories are successively
applied for filtering from left (no filtering) to right (complete
filtering using all stopwords).
%
\begin{figure}[ht]
\centering
\includegraphics[width=1.0\textwidth]{06_advanced_nlp/figures/multiterms_3000_stopwords_cumulative}
\caption{Cumulative filtering of linguistic stopword categories}
\label{fig:multiterms_3000_stopwords_cumulative}
\end{figure}
% number of multi-terms
Obviously, the number of shorter multi-terms is higher than the
number of larger multi-terms. Consider a text consisting of four
tokens: This token sequence contains at most three multi-terms of
length two, two multi-terms of length three, or one multi-term of
length four. Further constraints on the tokens (e.g., a certain
token type as $ALPHA\_COMMON\_LOWER$) emphasize this effect.
%
% unique constraint
A different effect occurs by applying the unique constraint. Because
there are less different combinations of a smaller number of tokens
than of a larger number of tokens, shorter multi-terms are
summarized to a much smaller set of unique multi-terms than longer
ones.
%
In the figure, these two effects become evident especially for the
unfiltered and unique multi-terms. Multi-terms of length two occur
least often (1.162.842 times), while multi-terms of length four
occur most often (3.596.448 times).
% filter effect
As expected, longer multi-terms are filtered to a higher degree than
shorter multi-terms. The overall reduction results in 340.951
(29,3\%) multi-terms of length two, 92.776 (3,1\%) multi-terms of
length three, 15.425 (0,4\%) multi-terms of length four, 2.535
(0,1\%) multi-terms of length five, 506 (0,02\%) multi-terms of
length six, and 155 (0,01\%) multi-terms of length seven.
\FloatBarrier
\subsubsection{Experiment II - Three Letter Constraint}
% EXPERIMENT 2: 3000 docs, minimum length of 3 characters
An investigation of the results obtained by the first experiment
shows that multi-terms frequently include terms consisting of one or
two letters. Examining the full form dictionary generated in
Section~\ref{sec:full_form_dictionary}, a minimum length constraint
of three characters per term is a good choice
%
A second experiment repeats the previous experiment with the same
settings, including the constraint of a minimal term length of three
letters. Again, the 3.000 INEX documents of the first experiment are
used, and all linguistic stopword categories and stopword layers are
applied for filtering.
%results
The results of the experiment are given in
Figure~\ref{fig:multiterms_3000_min3char_stopwords_01}. Compared to
the first experiment, the three letter constraint reduced the number
of unfiltered composite nouns to 3.827.298 (62,6\%) of length two,
2.389.252 (47,4\%) of length three, 1.500.294 (35,5\%) of length
four, 947.234 (27,4\%) of length five, 595.594 (21,1\%) of length
six, and 372.582 (16,2\%) of length seven. The figure shows that the
number of multi-terms of length two is reduced to 37,3\% by
functional stopwords, 71,4\% by content-related stopwords, and
85,6\% by domain-specific stopwords, Applying all stopword layers
combined, a final set of 606.169 (15,8\%) length-two multi-terms
remains.
%
Compared to the results of the first experiment, this drop of
reduction is explained by the pre-filtered one and two letter words,
which are mostly included in the functional stopwords (e.g.,
\texttt{a}, \texttt{in}, \texttt{on}, \texttt{by}).
%
\begin{figure}[h!]
\centering
\includegraphics[width=1.0\textwidth]{06_advanced_nlp/figures/multiterms_3000_min3char_stopwords_01}
\caption{Using stopword filtering on multi-terms (minimum of three letters per term)}
\label{fig:multiterms_3000_min3char_stopwords_01}
\end{figure}
%
Figure~\ref{fig:multiterms_3000_min3char_stopwords_cumulative}
depicts the cumulative filter effect on the unique multi-terms
extracted. In total, the three-letter filter reduces the number of
cumulative filtered multi-terms to 337.746 (33,9\%) of length two,
91.317 (5,3\%) of length three, 14.426 (1,0\%) of length four, 2.294
(0,3\%) of length five, 395 (0,1\%) of length six, and 78 (0,02\%)
of length seven. Compared to the first experiment, the three letter
constraint reduces the number of stopword filtered multi-terms by
0,9\% (two terms), 1,6\% (three terms), 3,1\% (four terms), 6,7\%
(five terms), 18,4\% (six terms), and 44,3\% (seven terms).
%
\begin{figure}[h!]
\centering
\includegraphics[width=1.0\textwidth]{06_advanced_nlp/figures/multiterms_3000_min3char_stopwords_cumulative}
\caption{Cumulative filtering of linguistic stopword categories (minimum of three letters per term)}
\label{fig:multiterms_3000_min3char_stopwords_cumulative}
\end{figure}
%\FloatBarrier
\subsubsection{Experiment III - Final List}
% EXPERIMENT 3: all docs, filtered by all SW_not_CR_DS
Based on the previous findings, a third experiment creates a
complete dictionary of composite noun suited patterns. All documents
in the INEX collection (16.819) are used. As before, different
lengthes of two, three, four, five, six, and seven tokens are
extracted. Each term must consist of at least three letters. No
stopwords except content-related and domain-specific nouns ($CR\_N$
and $DS\_N$) are allowed within extracted multi-terms. In the
context of extracting composite nouns the inclusion of these
linguistic stopword categories seems feasible.
% example result
Table~\ref{tab:composite_nouns} lists the top 5 composite noun
suited patterns for each length ordered by term frequency in
decreasing order. In the table $tf$ and $df$ refers to the term
frequency and document frequency of the multi-terms. The top 24
composite noun suited patterns for each length are presented in the
appendix in Section~\ref{app:sec:composite_nouns}.
%
\begin{table}[ht]
\centering
\sffamily \footnotesize
\rowcolors{1}{tablerowcolorodd}{tablerowcoloreven}
\caption{Top 5 patterns suited for composite nouns}
\label{tab:composite_nouns}
\begin{tabular}{L{10cm}rr}
\rowcolor{tableheadcolor} \color{white} Multi-term & \mcc{$tf$} & \mcc{$df$} \tabularnewline%
\rowcolor{tableheadcolor} \mmcc{3}{\color{white} Two terms} \tabularnewline%
%multi-term example
% length 2
execution time & 6.870 & 1.405 \tabularnewline%
electrical engineering & 6.333 & 3.316 \tabularnewline%
response time & 3.803 & 804 \tabularnewline%
source code & 3.768 & 1.403 \tabularnewline%
fault tolerance & 3.758 & 1.024 \tabularnewline%
% upper bound & 3369 & 1228 \tabularnewline%
% worst case & 3035 & 1305 \tabularnewline%
% experimental results & 2930 & 1517 \tabularnewline%
% test cases & 2839 & 511 \tabularnewline%
% wide range & 2657 & 1969 \tabularnewline%
% length 3
\rowcolor{tableheadcolor} \mmcc{3}{\color{white} Three terms} \tabularnewline%
% \rowcolor{tableheadcolor} \color{white} Multi-term & \color{white} $tf$ & \color{white} $df$ \tabularnewline%
digital signal processing & 385 & 301 \tabularnewline%
natural language processing & 303 & 186 \tabularnewline%
partial differential equations & 302 & 210 \tabularnewline%
cache hit ratio & 282 & 46 \tabularnewline%
directed acyclic graph & 251 & 198 \tabularnewline%
% average response time & 251 & 121 \tabularnewline%
% test pattern generation & 249 & 148 \tabularnewline%
% distributed shared memory & 249 & 150 \tabularnewline%
% computational fluid dynamics & 247 & 188 \tabularnewline%
% finite state machine & 243 & 129 \tabularnewline%
% length 4
\rowcolor{tableheadcolor} \mmcc{3}{\color{white} Four terms} \tabularnewline%
% \rowcolor{tableheadcolor} \color{white} Multi-term & \color{white} $tf$ & \color{white} $df$ \tabularnewline%
automatic test pattern generation & 79 & 68 \tabularnewline%
extended channel dependency graph & 62 & 8 \tabularnewline%
dynamic buffer allocation scheme & 55 & 1 \tabularnewline%
linear feedback shift register & 51 & 38 \tabularnewline%
parallel task completion time & 48 & 1 \tabularnewline%
% worst case response time & 40 & 7 \tabularnewline%
% call channel occupancy time & 35 & 1 \tabularnewline%
% worst case computation time & 31 & 9 \tabularnewline%
% linear feedback shift registers & 30 & 24 \tabularnewline%
% identically distributed random variables & 30 & 23 \tabularnewline%
% length 5
\rowcolor{tableheadcolor} \mmcc{3}{\color{white} Five terms} \tabularnewline%
% \rowcolor{tableheadcolor} \color{white} Multi-term & \color{white} $tf$ & \color{white} $df$ \tabularnewline%
handoff call channel occupancy time & 19 & 1 \tabularnewline%
submit queries concerning historical events & 10 & 10 \tabularnewline%
average message latency versus traffic & 10 & 4 \tabularnewline%
scholarly archival journals inform readers & 9 & 9 \tabularnewline%
procedurally generated partial product reduction & 9 & 1 \tabularnewline%
% minimal cost distribution tree problem & 9 & 1 \tabularnewline%
% generated partial product reduction tree & 9 & 1 \tabularnewline%
% vertex versus maximal clique incidence & 8 & 1 \tabularnewline%
% row shift invariant wavelet packet & 8 & 1 \tabularnewline%
% robust path delay fault coverage & 8 & 2 \tabularnewline%
% length 6
\rowcolor{tableheadcolor} \mmcc{3}{\color{white} Six terms} \tabularnewline%
% \rowcolor{tableheadcolor} \color{white} Multi-term & \color{white} $tf$ & \color{white} $df$ \tabularnewline%
procedurally generated partial product reduction tree & 9 & 1 \tabularnewline%
row shift invariant wavelet packet transform & 7 & 1 \tabularnewline%
adaptive row shift invariant wavelet packet & 7 & 1 \tabularnewline%
vertex versus maximal clique incidence matrix & 4 & 1 \tabularnewline%
vertex versus maximal clique incidence matrices & 4 & 1 \tabularnewline%
% produces dependency preserving nested database schemes & 4 & 1 \tabularnewline%
% nonbalanced identically distributed binary random variables & 4 & 1 \tabularnewline%
% beam addressed swept volume display unit & 4 & 1 \tabularnewline%
% systolic redundant residue arithmetic error correction & 3 & 2 \tabularnewline%
% redundant residue arithmetic error correction circuit & 3 & 2 \tabularnewline%
% length 7
\rowcolor{tableheadcolor} \mmcc{3}{\color{white} Seven terms} \tabularnewline%
% \rowcolor{tableheadcolor} \color{white} Multi-term & \color{white} $tf$ & \color{white} $df$ \tabularnewline%
adaptive row shift invariant wavelet packet transform & 7 & 1 \tabularnewline%
systolic redundant residue arithmetic error correction circuit & 3 & 2 \tabularnewline%
wird etwas knapp bei mir sagen wir & 2 & 1 \tabularnewline%
une courbe qui remplit toute une aire & 2 & 2 \tabularnewline%
temporal strata translate temporal query language statements & 2 & 1 \tabularnewline%
% recommended practices define ethical standards define educational & 2 & 2 \tabularnewline%
% practices define ethical standards define educational curricula & 2 & 2 \tabularnewline%
% law enforcement agencies continually analyze vast amounts & 2 & 2 \tabularnewline%
% knapp bei mir sagen wir lieber vierzehn & 2 & 1 \tabularnewline%
% etwas knapp bei mir sagen wir lieber & 2 & 1 \tabularnewline%
\end{tabular}
\end{table}
%
The results show that term and document frequencies of longer
multi-terms decrease quickly. Multi-terms of two, three, and four
tokens provide appropriate index terms. Longer sequences than four
terms are more related to parts of sentences or headlines than to
composite nouns.
% reducing the number of composite nouns
However, the number of unique composite nouns is huge compared to
single-terms. In order to become available as index terms, the
number of multi-terms must be downsized. Therefore, the average term
frequency is used as a threshold for multi-term selection. The
numbers of final multi-terms are given in
Table~\ref{tab:size_composite_nouns}.
%
\# denotes the number of unique multi-terms, {\O} $tf$ refers to the
average term frequency of all multi-terms, and \# $\ge$ {\O} $tf$ is
the number of unique multi-terms that occur more often than the
average.
%
Applying the average term frequency as selection criteria turns out
as an effective filter. Achieved reduction factors are 95,7\% (two
terms), 90,5\% (three terms), 94,6\% (four terms), 96,6\% (five
terms), 97,4\% (six terms), and 98,0\% (seven terms).
%
\begin{table}[ht]
\centering
\sffamily \footnotesize
\rowcolors{1}{tablerowcolorodd}{tablerowcoloreven}
\caption{Number of unique patterns suited for composite nouns}
\label{tab:size_composite_nouns}
\begin{tabular}{lrrr}
\rowcolor{tableheadcolor} \color{white} Multi-term length & \mcc{\#} & \mcc{{\O} $tf$} & \mcc{\# $\ge$ {\O} $tf$} \tabularnewline%
%multi-term example
two tokens & 5.479.613 & 3,38 & 233.319 \tabularnewline%
three tokens & 1.002.764 & 1,35 & 94.798 \tabularnewline%
four tokens & 151.649 & 1,11 & 8.238 \tabularnewline%
five tokens & 23.473 & 1,05 & 795 \tabularnewline%
six tokens & 4.155 & 1,04 & 106 \tabularnewline%
seven tokens & 919 & 1,03 & 18 \tabularnewline%
\end{tabular}
\end{table}
% consequence for javatok, multi-term index
The composite nouns are integrated in JavaTok, enabling an efficient
computation of multi-term representations. These additional
representations are exploited during indexing and retrieval to
improve retrieval performance.
% outlook, further work, open issues
The procedure identifying stopwords ($\frac{itf}{idf}$ ranking
described in Section~\ref{sec:stoplist_extraction}) can also be
applied on multi-terms. This results in the generation of
stop-phrases, which adds a fourth layer to the stopword model
depicted in Figure~\ref{fig:stoplist_levels}. However, the
structured document retrieval prototype developed does not apply
stop-phrase filtering.
\FloatBarrier
\subsection{Pattern Extraction Suited for Named Entities}
% description of the procedure
The same procedure obtaining composite nouns is applied to generate
a list of named entity suited patterns. Again, sequences of two to
seven tokens starting with capital letters (token type
$ALPHA\_COMMON\_UPPER$) are extracted. According to the composite
noun experiment, the minimum term length is three characters.
Multi-terms containing stopwords except content-related and
domain-specific nouns ($CR\_N$ and $DS\_N$) are not extracted. As
for the final composite noun experiment, the inclusion of nouns
within named entities is plausible.
% results
The top 5 named entity suited patterns of each length are presented
in Table~\ref{tab:named_entities}. $tf$ and $df$ refer to the term
frequency and document frequency. The top 24 named entities are
included in the appendix in Section~\ref{app:sec:named_entities}.
%
\begin{table}[ht]
\centering
\sffamily \footnotesize
\rowcolors{1}{tablerowcolorodd}{tablerowcoloreven}
\caption{Top 5 patterns suited for named entities}
\label{tab:named_entities}
\begin{tabular}{L{10cm}rr}
\rowcolor{tableheadcolor} \color{white} Multi-term & \mcc{$tf$} & \mcc{$df$} \tabularnewline%
%multi-term example
% length 2
\rowcolor{tableheadcolor} \mmcc{3}{\color{white} Two terms} \tabularnewline%
Pattern Recognition & 5.914 & 1.575 \tabularnewline%
Machine Intelligence & 5.670 & 1.568 \tabularnewline%
Artificial Intelligence & 4.802 & 1.819 \tabularnewline%
Distributed Computing & 3.885 & 1.749 \tabularnewline%
Parallel Processing & 3.057 & 1.227 \tabularnewline%
% Image Processing & 2797 & 1200 \tabularnewline%
% World Wide & 2734 & 1537 \tabularnewline%
% Wide Web & 2658 & 1508 \tabularnewline%
% Neural Networks & 2230 & 715 \tabularnewline%
% Signal Processing & 2039 & 1006 \tabularnewline%
% length 3
\rowcolor{tableheadcolor} \mmcc{3}{\color{white} Three terms} \tabularnewline%
% \rowcolor{tableheadcolor} \color{white} Multi-term & \color{white} $tf$ & \color{white} $df$ \tabularnewline%
World Wide Web & 2.650 & 1.505 \tabularnewline%
Pattern Recognition Letters & 444 & 287 \tabularnewline%
Unified Modeling Language & 394 & 274 \tabularnewline%
Internet Engineering Task & 328 & 284 \tabularnewline%
Ad Hoc Networks & 311 & 74 \tabularnewline%
% Object Request Broker & 300 & 207 \tabularnewline%
% North Carolina State & 298 & 209 \tabularnewline%
% Jet Propulsion Laboratory & 282 & 186 \tabularnewline%
% Wide Web Consortium & 274 & 232 \tabularnewline%
% Engineering Task Force & 274 & 238 \tabularnewline%
% length 4
\rowcolor{tableheadcolor} \mmcc{3}{\color{white} Four terms} \tabularnewline%
% \rowcolor{tableheadcolor} \color{white} Multi-term & \color{white} $tf$ & \color{white} $df$ \tabularnewline%
World Wide Web Consortium & 272 & 230 \tabularnewline%
Internet Engineering Task Force & 272 & 237 \tabularnewline%
Reader Interest Survey Indicate & 204 & 204 \tabularnewline%
Goddard Space Flight Center & 166 & 118 \tabularnewline%
Virtual Reality Modeling Language & 150 & 136 \tabularnewline%
% Mobile Ad Hoc Networks & 114 & 47 \tabularnewline%
% San Diego Supercomputer Center & 90 & 63 \tabularnewline%
% Ad Hoc Wireless Networks & 83 & 45 \tabularnewline%
% Web Services Description Language & 77 & 61 \tabularnewline%
% Field Programmable Gate Arrays & 73 & 50 \tabularnewline%
% length 5
\rowcolor{tableheadcolor} \mmcc{3}{\color{white} Five terms} \tabularnewline%
% \rowcolor{tableheadcolor} \color{white} Multi-term & \color{white} $tf$ & \color{white} $df$ \tabularnewline%
Sara Reese Hedberg Sara Reese & 23 & 23 \tabularnewline%
Reese Hedberg Sara Reese Hedberg & 23 & 23 \tabularnewline%
Linda Dailey Paulson Linda Dailey & 23 & 23 \tabularnewline%
Dailey Paulson Linda Dailey Paulson & 23 & 23 \tabularnewline%
Markov Regenerative Stochastic Petri Nets & 22 & 8 \tabularnewline%
% Iowa State College Statistical Laboratory & 22 & 1 \tabularnewline%
% Upsilon Pi Epsilon Student Award & 19 & 11 \tabularnewline%
% Unified Modeling Language Reference Manual & 19 & 19 \tabularnewline%
% Neural Networks Outstanding Paper Award & 19 & 19 \tabularnewline%
% Fault Tolerant Wormhole Routing Strategy & 19 & 19 \tabularnewline%
% length 6
\rowcolor{tableheadcolor} \mmcc{3}{\color{white} Six terms} \tabularnewline%
% \rowcolor{tableheadcolor} \color{white} Multi-term & \color{white} $tf$ & \color{white} $df$ \tabularnewline%
Sara Reese Hedberg Sara Reese Hedberg & 23 & 23 \tabularnewline%
Linda Dailey Paulson Linda Dailey Paulson & 23 & 23 \tabularnewline%
Mary Jean Harrold Mary Jean Harrold & 11 & 11 \tabularnewline%
Shari Lawrence Pfleeger Shari Lawrence Pfleeger & 9 & 9 \tabularnewline%
Khaled El Emam Khaled El Emam & 9 & 9 \tabularnewline%
% Alberto Del Bimbo Alberto Del Bimbo & 8 & 8 \tabularnewline%
% Mo Kim Cheng Albert Mo Kim & 7 & 7 \tabularnewline%
% Kim Cheng Albert Mo Kim Cheng & 7 & 7 \tabularnewline%
% Hee Yong Youn Hee Yong Youn & 7 & 7 \tabularnewline%
% Andy Hunt Dave Thomas Andy Hunt & 7 & 7 \tabularnewline%
% length 7
\rowcolor{tableheadcolor} \mmcc{3}{\color{white} Seven terms} \tabularnewline%
% \rowcolor{tableheadcolor} \color{white} Multi-term & \color{white} $tf$ & \color{white} $df$ \tabularnewline%
Mo Kim Cheng Albert Mo Kim Cheng & 7 & 7 \tabularnewline%
Albert Mo Kim Cheng Albert Mo Kim & 7 & 7 \tabularnewline%
Optimal Infinite Impulse Response Edge Detection Filters & 5 & 5 \tabularnewline%
Stochastic Petri Nets Representing Generalized Service Networks & 4 & 4 \tabularnewline%
Reversible Jump Markov Chain Monte Carlo Computation & 4 & 4 \tabularnewline%
% Oak Ridge Association Junior Faculty Enhancement Award & 3 & 3 \tabularnewline%
% International Test Conference Tutorials Washington Sheraton Hotel & 3 & 3 \tabularnewline%
% Whitaker Jane Wilhelms Yves Willems Peter Williams & 2 & 2 \tabularnewline%
% Victor De La Luz Victor De La & 2 & 2 \tabularnewline%
% Time Warp Synchronized Parallel Discrete Event Simulation & 2 & 2 \tabularnewline%
\end{tabular}
\end{table}
%
The results show that top-ranked named entity patterns and composite
noun patterns occur with similar frequencies. Extracted multi-terms
longer than four terms tend to denote list of names instead of named
entities.
%
In some cases, the mapping of the original INEX documents onto the
generic document format (see Section~\ref{sec:generic_xml_schema})
led to text recurrences. It occurs if two XML components are
combined in a single \texttt{FRA} element and explains the
repetition of terms in the named entities of length five, six, and
seven.
% reducing the number of named entities
The numbers of extracted named entities are given in
Table~\ref{tab:size_named_entities}. Although uniquely named entity
patterns are not as numerous as composite nouns, a reduction is
still necessary for indexing and retrieval. As in the previous
experiment, only multi-terms that are more frequent than the average
term frequency of the actual multi-term length ($\ge$ {\O} $tf$) are
selected.
%
\begin{table}[ht]
\centering
\sffamily \footnotesize
\rowcolors{1}{tablerowcolorodd}{tablerowcoloreven}
\caption{Number of unique patterns suited for named entities}
\label{tab:size_named_entities}
\begin{tabular}{lrrr}
\rowcolor{tableheadcolor} \color{white} Multi-term length & \mcc{\#} & \mcc{{\O} $tf$} & \mcc{\# $\ge$ {\O} $tf$} \tabularnewline%
%multi-term example
two tokens & 955.390 & 3,55 & 41.473 \tabularnewline%
three tokens & 232.245 & 1,98 & 15.276 \tabularnewline%
four tokens & 54.891 & 1,51 & 6.682 \tabularnewline%
five tokens & 9.873 & 1,25 & 872 \tabularnewline%
six tokens & 3.953 & 1,13 & 225 \tabularnewline%
seven tokens & 2.055 & 1,03 & 37 \tabularnewline%
\end{tabular}
\end{table}
\FloatBarrier
\subsection{Pattern Extraction Suited for Formulaic Speech}
% description and motivation
A third approach aims at the extraction of multi-terms that are
missed by the composite noun and named entity extraction tasks.
These multi-terms are summarized as formulaic speech patterns,
including not only common language patterns but other multi-terms
that were excluded from previous extraction for three reasons:
%
\begin{compactenum}
\item token sequences consist of mixed token types (e.g., \texttt{United States of America})
\item token sequences contain tokens of one or two characters (e.g., \texttt{et al})
\item token sequences contain stopwords (e.g., \texttt{state of the art})
\end{compactenum}
%
From the information retrieval point of view, this allows to include
stopwords in multi-term indices although stopword filtering is
applied on single-terms.
% procedure, first experiment without stopword filtering
Accounting for such patterns, a first experiment extracts token
sequences of two to seven tokens containing both token types,
$ALPHA\_COMMON\_LOWER$ and $ALPHA\_COMMON\_UPPER$. The length of
token strings is not accounted for and all strings are converted
into lower case. Stopword checking is disabled. Only multi-terms not
included in the composite noun or named entity dictionaries are
extracted. Note that this extraction tasks includes composite noun
or named entity patterns that are missed by the previous
experiments. This is the case if patterns are used only at the
beginning of sentences or as titles where the first term is
capitalized.
% results
The numbers of unique multi-terms (\#) are given in
Table~\ref{tab:size_identified_formulaic}. {\O} $tf$ is the average
term frequency and \# $\ge$ {\O} $tf$ the number of multi-terms with
a higher term frequency than the average. However, the high numbers
of unique multi-terms require further filtering to become useful for
retrieval.
%
\begin{table}[ht]
\centering
\sffamily \footnotesize
\rowcolors{1}{tablerowcolorodd}{tablerowcoloreven}
\caption{Number of unfiltered formulaic speech}
\label{tab:size_identified_formulaic}
\begin{tabular}{lrrr}
\rowcolor{tableheadcolor} \color{white} Multi-term length & \mcc{\#} & \mcc{{\O} $tf$} & \mcc{\# $\ge$ {\O} $tf$} \tabularnewline%
%multi-term example
two tokens & 48.623.745 & 18,20 & 2.671.814 \tabularnewline%
three tokens & 44.899.721 & 2,54 & 16.041.542 \tabularnewline%
four tokens & 38.341.039 & 1,33 & 26.118.414 \tabularnewline%
five tokens & 32.096.913 & 1,12 & 27.450.841 \tabularnewline%
six tokens & 26.687.256 & 1,06 & 24.745.505 \tabularnewline%
seven tokens & 22.088.444 & 1,04 & 21.071.165 \tabularnewline%
\end{tabular}
\end{table}
An investigation of the unfiltered multi-terms shows that mainly
functional stopwords occur frequently in the patterns (see
Table~\ref{tab:unfiltered_formulaic}). The multi-terms in the table
are sorted according to their term frequencies. As before,
multi-terms longer than five terms turn out to be inappropriate for
indexing.
%
\begin{table}[ht]
\centering
\sffamily \footnotesize
\rowcolors{1}{tablerowcolorodd}{tablerowcoloreven}
\caption{Top-ranked unfiltered formulaic speech}
\label{tab:unfiltered_formulaic}
\begin{tabular}{lL{8cm}}
\rowcolor{tableheadcolor} \color{white} Multi-term length & \mcc{\color{white} Multi-terms} \tabularnewline%
%multi-term example
two tokens & \texttt{of the}, \texttt{in the}, \texttt{to the}, \texttt{and the}, \texttt{on the}, \texttt{can be}, \texttt{for the}, \texttt{of a}, \texttt{is a}, \texttt{number of} \tabularnewline%
three tokens & \texttt{curricula vitae of}, \texttt{the university of}, \texttt{the number of}, \texttt{as well as}, \texttt{one of the}, \texttt{a set of} \tabularnewline%
four tokens & \texttt{vitae curricula vitae of}, \texttt{curricula vitae curricula vitae}, \texttt{at the university of}, \texttt{from the university of}, \texttt{his research interests include} \tabularnewline%
five tokens & \texttt{curricula vitae curricula vitae of}, \texttt{is a member of the}, \texttt{he is a member of}, \texttt{in computer science from the} \tabularnewline%
six tokens & \texttt{he is a member of the}, \texttt{computer science from the university of}, \texttt{in computer science from the university} \tabularnewline%
seven tokens & \texttt{in computer science from the university of}, \texttt{of computer science at the university of}, \texttt{in electrical engineering from the university of} \tabularnewline%
\end{tabular}
\end{table}
% experiment 2, stopword filtering
Based on the intermediate results of the initial experiment (see
Tables~\ref{tab:size_identified_formulaic} and
\ref{tab:unfiltered_formulaic}), a second experiment including
minimal stopword filtering is run. In order to reduce the number of
patterns extracted, multi-terms started or ended by functional
stopwords are excluded.
% results
Table~\ref{tab:formulaic_speech_examples} lists the top 5 extracted
patterns with their term frequency $tf$ and document frequency $df$.
The top 24 ranked formulaic speech suited patterns are given in
Section~\ref{app:sec:formulaic_speech} in the appendix.
%
\begin{table}[ht]
\centering
\sffamily \footnotesize
\rowcolors{1}{tablerowcolorodd}{tablerowcoloreven}
\caption{Top 5 patterns suited for formulaic speech}
\label{tab:formulaic_speech_examples}
\begin{tabular}{L{10cm}rr}
\rowcolor{tableheadcolor} \color{white} Multi-term & \mcc{$tf$} & \mcc{$df$} \tabularnewline%
%multi-term example
% length 2
\rowcolor{tableheadcolor} \mmcc{3}{\color{white} Two terms} \tabularnewline%
computer science & 29.743 & 8.233 \tabularnewline%
research interests & 14.241 & 6.737 \tabularnewline%
et al & 12.946 & 3.645 \tabularnewline%
software engineering & 11.901 & 2.927 \tabularnewline%
interests include & 11.382 & 5.923 \tabularnewline%
% vitae curricula & 11048 & 11048 \tabularnewline%
% computer vision & 10632 & 1851 \tabularnewline%
% other hand & 10266 & 5282 \tabularnewline%
% data set & 8604 & 1525 \tabularnewline%
% computer society & 8379 & 3859 \tabularnewline%
% software development & 8023 & 2546 \tabularnewline%
% distributed systems & 7985 & 2749 \tabularnewline%
% computer graphics & 7927 & 1880 \tabularnewline%
% data sets & 6943 & 1528 \tabularnewline%
% information systems & 6554 & 2577 \tabularnewline%
% operating system & 6451 & 2425 \tabularnewline%
% other words & 6054 & 3301 \tabularnewline%
% see figure & 6043 & 2201 \tabularnewline%
% total number & 5978 & 2601 \tabularnewline%
% pattern analysis & 5691 & 1546 \tabularnewline%
% same time & 5126 & 3376 \tabularnewline%
% operating systems & 5106 & 2434 \tabularnewline%
% computer engineering & 5020 & 2477 \tabularnewline%
% user interface & 4892 & 1963 \tabularnewline%
% large number & 4841 & 2988 \tabularnewline%
% length 3
\rowcolor{tableheadcolor} \mmcc{3}{\color{white} Three terms} \tabularnewline%
vitae curricula vitae & 11.048 & 11.048 \tabularnewline%
curricula vitae curricula & 11.048 & 11.048 \tabularnewline%
research interests include & 10.260 & 5.479 \tabularnewline%
parallel and distributed & 6.937 & 2.158 \tabularnewline%
analysis and machine & 5.425 & 1.487 \tabularnewline%
% degree in computer & 4603 & 2397 \tabularnewline%
% institute of technology & 4563 & 2775 \tabularnewline%
% shown in figure & 4430 & 2078 \tabularnewline%
% number of processors & 3559 & 793 \tabularnewline%
% described in section & 3545 & 1819 \tabularnewline%
% department of computer & 3513 & 2221 \tabularnewline%
% national science foundation & 3390 & 2464 \tabularnewline%
% university of california & 3315 & 1951 \tabularnewline%
% science and engineering & 2985 & 1829 \tabularnewline%
% shown in table & 2901 & 1570 \tabularnewline%
% professor of computer & 2663 & 2010 \tabularnewline%
% number of nodes & 2652 & 932 \tabularnewline%
% electrical and computer & 2486 & 1442 \tabularnewline%
% hardware and software & 2392 & 1506 \tabularnewline%
% discussed in section & 2301 & 1372 \tabularnewline%
% point of view & 2300 & 1534 \tabularnewline%
% organized as follows & 2152 & 2115 \tabularnewline%
% degree in electrical & 2112 & 1339 \tabularnewline%
% university of illinois & 2058 & 1151 \tabularnewline%
% paper is organized & 2053 & 2053 \tabularnewline%
% length 4
\rowcolor{tableheadcolor} \mmcc{3}{\color{white} Four terms} \tabularnewline%
curricula vitae curricula vitae & 11.048 & 11.048 \tabularnewline%
pattern analysis and machine & 5.404 & 1.480 \tabularnewline%
analysis and machine intelligence & 5.384 & 1.478 \tabularnewline%
science from the university & 3.796 & 2.575 \tabularnewline%
degree in computer science & 3.765 & 2.125 \tabularnewline%
% parallel and distributed systems & 3052 & 1344 \tabularnewline%
% department of computer science & 2938 & 1934 \tabularnewline%
% professor in the department & 2342 & 1796 \tabularnewline%
% engineering from the university & 2337 & 1742 \tabularnewline%
% electrical and computer engineering & 2334 & 1383 \tabularnewline%
% professor of computer science & 2287 & 1739 \tabularnewline%
% science at the university & 2173 & 1593 \tabularnewline%
% parallel and distributed computing & 2133 & 1146 \tabularnewline%
% vision and pattern recognition & 1947 & 745 \tabularnewline%
% computer vision and pattern & 1937 & 737 \tabularnewline%
% computer science and engineering & 1917 & 1225 \tabularnewline%
% degree in electrical engineering & 1814 & 1220 \tabularnewline%
% engineering at the university & 1389 & 1060 \tabularnewline%
% degrees in computer science & 1347 & 1091 \tabularnewline%
% presented in this paper & 1280 & 917 \tabularnewline%
% engineering and computer science & 1259 & 898 \tabularnewline%
% due to the fact & 1176 & 851 \tabularnewline%
% electrical engineering and computer & 1149 & 808 \tabularnewline%
% current research interests include & 1060 & 878 \tabularnewline%
% vitae of curricula vitae & 1024 & 564 \tabularnewline%
% length 5
\rowcolor{tableheadcolor} \mmcc{3}{\color{white} Five terms} \tabularnewline%
pattern analysis and machine intelligence & 5.378 & 1.476 \tabularnewline%
computer science from the university & 3.602 & 2.455 \tabularnewline%
paper is organized as follows & 2.005 & 2.005 \tabularnewline%
computer science at the university & 1.998 & 1.465 \tabularnewline%
computer vision and pattern recognition & 1.918 & 733 \tabularnewline%
electrical engineering and computer science & 1.055 & 756 \tabularnewline%
% curricula vitae of curricula vitae & 1.024 & 564 \tabularnewline%
% electrical engineering from the university & 961 & 798 \tabularnewline%
% authors would like to thank & 877 & 865 \tabularnewline%
% department of electrical and computer & 822 & 599 \tabularnewline%
% associate professor in the department & 771 & 691 \tabularnewline%
% assistant professor in the department & 740 & 654 \tabularnewline%
% member of the technical staff & 627 & 505 \tabularnewline%
% university of texas at austin & 624 & 425 \tabularnewline%
% computer engineering from the university & 599 & 522 \tabularnewline%
% vitae curricula vitae of curricula & 550 & 550 \tabularnewline%
% work was supported in part & 545 & 537 \tabularnewline%
% state university of new york & 540 & 382 \tabularnewline%
% computer engineering at the university & 537 & 429 \tabularnewline%
% associate professor of computer science & 527 & 480 \tabularnewline%
% lecture notes in computer science & 521 & 357 \tabularnewline%
% knowledge discovery and data mining & 500 & 256 \tabularnewline%
% engineering from the indian institute & 477 & 390 \tabularnewline%
% programming languages and operating systems & 470 & 269 \tabularnewline%
% university of california at berkeley & 467 & 361 \tabularnewline%
% length 6
\rowcolor{tableheadcolor} \mmcc{3}{\color{white} Six terms} \tabularnewline%
professor in the department of computer & 1.224 & 986 \tabularnewline%
department of electrical and computer engineering & 818 & 597 \tabularnewline%
vitae curricula vitae of curricula vitae & 550 & 550 \tabularnewline%
curricula vitae curricula vitae of curricula & 550 & 550 \tabularnewline%
science from the university of california & 542 & 459 \tabularnewline%
professor in the department of electrical & 525 & 457 \tabularnewline%
% annals of the history of computing & 469 & 218 \tabularnewline%
% department of computer science and engineering & 466 & 357 \tabularnewline%
% vitae of curricula vitae of curricula & 457 & 241 \tabularnewline%
% support for programming languages and operating & 453 & 259 \tabularnewline%
% rest of the paper is organized & 452 & 452 \tabularnewline%
% transactions on knowledge and data engineering & 358 & 257 \tabularnewline%
% computer science department at the university & 356 & 305 \tabularnewline%
% professor of electrical and computer engineering & 354 & 317 \tabularnewline%
% rest of this paper is organized & 350 & 350 \tabularnewline%
% professor in the computer science department & 347 & 309 \tabularnewline%
% science from the university of illinois & 330 & 286 \tabularnewline%
% research interests are in the areas & 328 & 294 \tabularnewline%
% transactions on parallel and distributed systems & 312 & 263 \tabularnewline%
% science at the university of california & 303 & 242 \tabularnewline%
% science and engineering at the university & 296 & 233 \tabularnewline%
% national institute of standards and technology & 294 & 253 \tabularnewline%
% transactions on pattern analysis and machine & 290 & 197 \tabularnewline%
% remainder of this paper is organized & 290 & 290 \tabularnewline%
% degree in computer science and engineering & 290 & 246 \tabularnewline%
% length 7
\rowcolor{tableheadcolor} \mmcc{3}{\color{white} Seven terms} \tabularnewline%
degree in computer science from the university & 1.139 & 857 \tabularnewline%
professor in the department of computer science & 1.001 & 833 \tabularnewline%
professor of computer science at the university & 624 & 548 \tabularnewline%
department of computer science at the university & 580 & 476 \tabularnewline%
curricula vitae curricula vitae of curricula vitae & 550 & 550 \tabularnewline%
% computer science from the university of california & 515 & 434 \tabularnewline%
% vitae of curricula vitae of curricula vitae & 457 & 241 \tabularnewline%
% curricula vitae of curricula vitae of curricula & 457 & 241 \tabularnewline%
% degrees in computer science from the university & 452 & 412 \tabularnewline%
% support for programming languages and operating systems & 448 & 255 \tabularnewline%
% architectural support for programming languages and operating & 441 & 258 \tabularnewline%
% electrical and computer engineering at the university & 426 & 346 \tabularnewline%
% associate professor in the department of computer & 417 & 391 \tabularnewline%
% engineering from the indian institute of technology & 413 & 353 \tabularnewline%
% assistant professor in the department of computer & 404 & 363 \tabularnewline%
% degree in electrical engineering from the university & 351 & 311 \tabularnewline%
% computer science from the university of illinois & 328 & 284 \tabularnewline%
% computer science at the university of california & 292 & 232 \tabularnewline%
% transactions on pattern analysis and machine intelligence & 288 & 196 \tabularnewline%
% computer science and engineering at the university & 280 & 218 \tabularnewline%
% engineering and computer science at the university & 246 & 197 \tabularnewline%
% electrical and computer engineering from the university & 242 & 218 \tabularnewline%
% department of electrical engineering and computer science & 242 & 166 \tabularnewline%
% circling the appropriate number on the reader & 208 & 208 \tabularnewline%
% computer science from the university of maryland & 196 & 182 \tabularnewline%
\end{tabular}
\end{table}
As explained for the named entity extraction, mapping difficulties
of the INEX documents onto the generic format lead to the repetition
of texts. In Table~\ref{tab:formulaic_speech_examples}, the
multi-term \texttt{curricula vitae curricula vitae} indicates that
\texttt{curricula vitae} occurs as an attribute in the \texttt{<sec
title="curricula vitae">} element and in the content of the element
itself. During the mapping, both contents were merged into a single
\texttt{FRA} element.
%
As a consequence, the algorithm correctly extracted the three
multi-terms \texttt{vitae curricula vitae}, \texttt{curricula vitae
curricula}, and \texttt{curricula vitae curricula vitae} all with
the same term frequency $tf$ and document frequency $df$.
%
Generally, this kind of `redundant mentioning' is not the norm in
the INEX documents. Correcting these errors manually would cost much
time, which stands in no relation to the benefits.
%\FloatBarrier
% final result numbers
The final numbers of multi-terms are given in
Table~\ref{tab:size_formulaic_speech}. In contrast to the
multi-terms extracted previously, only patterns occurring five times
more often than the average term frequency (\# $\ge$ 5 $\cdot$ {\O}
$tf$) are extracted. This threshold is chosen for two reasons: (1)
The high number of multi-terms would drop retrieval performance, and
(2) the result covers a lot of less frequent multi-terms that are
not relevant for information retrieval. Although the extracted
patters are definitely useful for further linguistic analysis, their
benefit for retrieval has to be evaluated.
%
\begin{table}[ht]
\centering
\sffamily \footnotesize
\rowcolors{1}{tablerowcolorodd}{tablerowcoloreven}
\caption{Number of unique patterns suited for formulaic speech}
\label{tab:size_formulaic_speech}
\begin{tabular}{lrrrr}
\rowcolor{tableheadcolor} \color{white} Multi-term length & \mcc{\#} & \mcc{{\O} $tf$} & \mcc{\# $\ge$ {\O} $tf$} & \mcc{\# $\ge$ 5 $\cdot$ {\O} $tf$} \tabularnewline%
%multi-term example
two tokens & 1.218.870 & 5,66 & 165.344 & 33.529 \tabularnewline%
three tokens & 6.063.211 & 1,77 & 1.195.166 & 102.381 \tabularnewline%
four tokens & 9.046.078 & 1,26 & 915.462 & 65.591 \tabularnewline%
five tokens & 8.516.547 & 1,10 & 451.576 & 24.180 \tabularnewline%
six tokens & 7.100.605 & 1,06 & 252.461 & 11.612 \tabularnewline%
seven tokens & 6.057.925 & 1,05 & 168.818 & 6.823 \tabularnewline%
\end{tabular}
\end{table}
%
% result interpretation
Extracted multi-terms longer than three terms only marginally
coincide with pure concepts. However, top-ranked patterns seem to be
helpful during searches. The extracted patterns further include:
%
\begin{compactitem}
%
\item Expected writing patterns, e.g., \texttt{et al}, \texttt{point of view}, \texttt{state of the art}, \texttt{paper is organized as follows}
%
\item Single letter combinations in titles written in spread out capital letters: e.g., \texttt{R~E~P~O~R~T} (realized as a multi-term consisting of six capitalized single-letter tokens)
%
\item Missing composite nouns in mixed case titles or acronyms, e.g., \texttt{Computer science}, \texttt{Research interests}, \texttt{National science foundation}
%
\end{compactitem}
\FloatBarrier
%----------------------------------------------------
%----------------------------------------------------
%----------------------------------------------------
\section{Acronym Extraction and Expansion}
\label{sec:advanced_nlp:acronyms}
% introduction
Especially in technical documents, acronyms are extensively used to
reduce the length of a text, make the text more readable, and assign
names to complex entities. Knowledge about the meaning of an acronym
is essential to understand the whole content. In other words,
misinterpretation of acronyms leads to confusion and obscurity.
%
% acronyms & context
Within texts new acronyms are often defined on demand. Therefore,
complete lists of acronyms (even for restricted domains) are not
available. Acronyms have to be identified dynamically and must be
interpreted according to their context (eventually only valid in the
current document).
% IR point of view
From the information retrieval point of view, acronyms provide
important information about the content of a document. This is the
reason for including acronyms in the index terms. However, queries
that address the full form of an acronym do not match texts that
solely contain the short form of the same acronym. In structured
document retrieval where small portions of texts (e.g., \texttt{FRA}
elements) are compared against a query, this leads to a drop of
recall. On the other hand, queries addressing the short form of an
acronym may retrieve in a high number of results containing the
acronym but with a different meaning. Thus, retrieval precision
decreases.
% goal
This section describes an automatic extraction process of acronyms
and their corresponding full forms. The retrieval engine uses the
extracted patterns to expand queries to include both, short and full
forms of acronyms mentioned. Because queries are generally too short
to apply a proper context analysis, all known full forms are added
for each acronym.
% extracted patterns
An acronym dictionary is created in two steps: First, acronym
patterns together with their contexts are extracted. Second, context
analysis is applied for filtering incorrect full forms.
%
Figure~\ref{fig:acronym_patterns} depicts the two patterns
investigated.
%
The first $Pattern~I$ extracts acronyms that are followed by their
embraced full forms. The opening and closing brackets are exploited
to identify the complete full form.
%
The second $Pattern~II$ identifies full forms that are followed by
their embraced acronyms (reverse lookup). In contrast to the first
pattern, the size of the full form (number of terms included) is not
defined explicitly. Because acronyms may contain words that do not
appear in the acronym string itself (e.g., \texttt{of},
\texttt{for}, \texttt{in}, \texttt{and}), a dynamic matching
approach is applied.
%
\begin{figure}[ht]
\centering
\includegraphics[width=0.8\textwidth]{06_advanced_nlp/figures/acronym_patterns}
\caption{Extracted acronym patterns}
\label{fig:acronym_patterns}
\end{figure}
% procedure
In the experiment, two to seven letter single-tokens typed as
$ALPHA\_ACRONYM\_SIMPLE$ are investigated. The size of the
pre-window (resp. post-window) is set to the number of characters in
the acronym plus five additional single-tokens. This allows up to
five context words being included which are not reflected in the
acronym itself. These words are allowed to be stopwords of the
functional stopword layer only.
% three strategies
Each extracted context (full form) is analyzed according to its
conformance with the acronym (short form). Three different
strategies are tested:
%
\begin{compactitem}
\item \textbf{S1:} The same number of tokens than letters in the acronym is extracted.
\item \textbf{S2:} The same number of tokens than letters in the acronym is extracted. Initial token-string letters must match the corresponding acronym letters.
\item \textbf{S3:} The same or a higher number of tokens than letters in the acronym is extracted. Initial token-string letters must match the corresponding acronym letters. However, functional stopwords that do not correspond with acronym letters are allowed.
\end{compactitem}
% results
Table~\ref{tab:size_acronyms} summarizes the number of extracted
acronyms according to each strategy.
%
% Pattern I == right
% Pattern II == left
\begin{table}[ht]
\centering
\sffamily \footnotesize
\rowcolors{1}{tablerowcolorodd}{tablerowcoloreven}
\caption{Number of unique extracted acronyms}
\label{tab:size_acronyms}
\begin{tabular}{crrrrrrrrr}
%multi-term example
\rowcolor{tableheadcolor} & \mmcc{3}{\color{white} Pattern I} & \mmcc{3}{\color{white} Pattern II} & \mmcc{3}{\color{white} Pattern I $\cup$ Pattern II}\tabularnewline%
\rowcolor{tableheadcolor} \color{white} Acronym length & \mcc{\color{white} S1} & \mcc{\color{white} S2} & \mcc{\color{white} S3} & \mcc{\color{white} S1} & \mcc{\color{white} S2} & \mcc{\color{white} S3} & \mcc{\color{white} S1} & \mcc{\color{white} S2} & \mcc{\color{white} S3} \tabularnewline%
%
two letters & 323 & 206 & 206 & 5.192 & 2.509 & 2.608 & 5.428 & 2.630 & 2.729 \tabularnewline%
three letters & 1.218 & 1.051 & 1.051 & 15.940 & 7.588 & 8.187 & 16.539 & 8.037 & 8.636 \tabularnewline%
four letters & 533 & 451 & 451 & 8.879 & 2.695 & 3.621 & 9.197 & 2.942 & 3.868 \tabularnewline%
five letters & 95 & 68 & 68 & 2.157 & 372 & 674 & 2.232 & 422 & 724 \tabularnewline%
six letters & 29 & 18 & 18 & 643 & 43 & 106 & 669 & 59 & 122 \tabularnewline%
seven letters & 8 & 4 & 4 & 245 & 2 & 9 & 252 & 6 & 13 \tabularnewline%
\end{tabular}
\end{table}
% discussion of the numbers
The results show that acronym patterns in the INEX documents are
commonly used. $Pattern~II$ (e.g., \texttt{...Unified Modeling
Language (UML)...}) occurs more often than $Pattern~I$ (e.g.,
\texttt{...DAG (Directed Acyclic Graph)...}). The majority of
acronyms consist of three and four letters, where three letter
acronyms are most frequent. But even longer acronyms of length six
and seven characters are meaningful search concepts.
%
Interestingly, the acronyms of $Pattern~I$ and $Pattern~II$ show
only marginal overlap (see the last three columns in
Table~\ref{tab:size_acronyms}).
%
Another observation is that acronyms defined by $Pattern~I$ never
included a stopword in their corresponding full form. This becomes
evident by investigating the strategies $S2$ and $S3$, which
resulted in the same acronyms (regardless their lengths).
% discussion of the strategies
Strategy $S1$ extracts all occurrences of a pattern. However, it
includes many incorrect full forms of acronyms for both acronym
patterns (see Table~\ref{tab:acronyms_incorrect}). The second
strategy $S2$ matching first letters minimizes these errors and
results in 41,1\% of all patterns. Unfortunately, $S2$ excludes a
lot of acronyms that include stopwords. Hence, the third strategy
$S3$ relaxes the first letter matching constraint of $S2$ by
allowing functional stopwords within the acronyms' full form. $S3$
results in 46,9\% of all extracted acronyms.
\begin{table}[ht]
\centering
\sffamily \footnotesize
\rowcolors{1}{tablerowcolorodd}{tablerowcoloreven}
\caption{Examples of incorrectly identified acronyms applying strategy $S1$}
\label{tab:acronyms_incorrect}
\begin{tabular}{llrr}
\rowcolor{tableheadcolor} \color{white} Acronym & \color{white} Full form & \color{white} $tf$ & \color{white} $df$ \tabularnewline%
%acronym full form tf df
% length 2
\rowcolor{tableheadcolor} \mmcc{4}{\color{white} Two tokens} \tabularnewline%
EM & the expectation-maximization & 42 & 1 \tabularnewline%
TR & v i & 20 & 1 \tabularnewline%
XP & extreme programming & 20 & 2 \tabularnewline%
HA & faulty fa & 12 & 1 \tabularnewline%
MP & t comm & 11 & 1 \tabularnewline%
% length 3
\rowcolor{tableheadcolor} \mmcc{4}{\color{white} Three tokens} \tabularnewline%
MIT & institute of technology & 77 & 2 \tabularnewline%
USC & of southern california & 53 & 9 \tabularnewline%
ETH & institute of technology & 45 & 5 \tabularnewline%
CNR & national research council & 41 & 5 \tabularnewline%
ISO & organization for standardization & 41 & 2 \tabularnewline%
% length 4
\rowcolor{tableheadcolor} \mmcc{4}{\color{white} Four tokens} \tabularnewline%
NIST & of standards and technology & 55 & 3 \tabularnewline%
CVPR & vision and pattern recognition & 35 & 9 \tabularnewline%
RTSS & ieee real-time systems symposium & 34 & 20 \tabularnewline%
PLDI & language design and implementation & 33 & 4 \tabularnewline%
NCSA & center for supercomputing applications & 33 & 3 \tabularnewline%
% length 5
\rowcolor{tableheadcolor} \mmcc{4}{\color{white} Five tokens} \tabularnewline%
KAIST & institute of science and technology & 56 & 9 \tabularnewline%
NSERC & engineering research council of canada & 34 & 11 \tabularnewline%
JETTA & testing : theory and applications & 21 & 7 \tabularnewline%
CIPIC & image processing and integrated computing & 20 & 6 \tabularnewline%
DISCA & the department of computer engineering & 17 & 5 \tabularnewline%
% length 6
\rowcolor{tableheadcolor} \mmcc{4}{\color{white} Six tokens} \tabularnewline%
ASPLOS & for programming languages and operating systems & 30 & 3 \tabularnewline%
SWEBOK & the software engineering body of knowledge & 14 & 1 \tabularnewline%
TAPADS & aspects of parallel and distributed systems & 11 & 5 \tabularnewline%
LARPBS & with a reconfigurable pipelined bus system & 11 & 1 \tabularnewline%
SIGMOD & interest group on management of data & 10 & 1 \tabularnewline%
% length 7
\rowcolor{tableheadcolor} \mmcc{4}{\color{white} Seven tokens} \tabularnewline%
EMMCVPR & methods in computer vision and pattern recognition & 7 & 3 \tabularnewline%
NOSSDAV & system support for digital audio and video & 7 & 2 \tabularnewline%
POSTECH & at pohang university of science and technology & 5 & 3 \tabularnewline%
IWFHRVI & sixth int'l workshop frontiers in handwriting recognition & 4 & 1 \tabularnewline%
SIGARCH & acm special interest group on computer architecture & 4 & 2 \tabularnewline%
\end{tabular}
\end{table}
\newpage
% result
Applying strategy $S3$, the results of both patterns are given in
Tables~\ref{tab:acronyms_I} and~\ref{tab:acronyms_II}.
%
\begin{table}[ht]
\centering
\sffamily \footnotesize
\rowcolors{1}{tablerowcolorodd}{tablerowcoloreven}
\caption{Top 5 extracted acronyms (Pattern I)}
\label{tab:acronyms_I}
\begin{tabular}{llrr}
\rowcolor{tableheadcolor} \color{white} Acronym & \color{white} Full form & \color{white} $tf$ & \color{white} $df$ \tabularnewline%
%acronym full form tf df
% length 2
\rowcolor{tableheadcolor} \mmcc{4}{\color{white} Two tokens} \tabularnewline%
IP & Internet Protocol & 71 & 1 \tabularnewline%
IP & intellectual property & 59 & 1 \tabularnewline%
VR & virtual reality & 58 & 2 \tabularnewline%
AI & Artificial Intelligence & 49 & 3 \tabularnewline%
ML & maximum likelihood & 45 & 1 \tabularnewline%
% PE & processing element & 43 & 3 \tabularnewline%
% CA & CELLULAR AUTOMATA & 43 & 2 \tabularnewline%
% SA & simulated annealing & 39 & 1 \tabularnewline%
% RF & radio frequency & 37 & 1 \tabularnewline%
% CT & computed tomography & 37 & 2 \tabularnewline%
% GA & genetic algorithm & 33 & 1 \tabularnewline%
% IR & information retrieval & 31 & 2 \tabularnewline%
% OS & operating system & 30 & 2 \tabularnewline%
% EM & expectation maximization & 26 & 1 \tabularnewline%
% IT & information technology & 24 & 1 \tabularnewline%
% NN & nearest neighbor & 22 & 6 \tabularnewline%
% DP & dynamic programming & 22 & 1 \tabularnewline%
% PC & program counter & 21 & 1 \tabularnewline%
% MR & magnetic resonance & 21 & 3 \tabularnewline%
% VE & virtual environment & 20 & 1 \tabularnewline%
% MD & molecular dynamics & 19 & 1 \tabularnewline%
% EU & European Union & 18 & 1 \tabularnewline%
% SC & sequential consistency & 18 & 1 \tabularnewline%
% AR & augmented reality & 17 & 1 \tabularnewline%
% NI & network interface & 17 & 2 \tabularnewline%
% length 3
\rowcolor{tableheadcolor} \mmcc{4}{\color{white} Three tokens} \tabularnewline%
NSF & National Science Foundation & 207 & 16 \tabularnewline%
ATM & asynchronous transfer mode & 129 & 1 \tabularnewline%
UML & Unified Modeling Language & 96 & 2 \tabularnewline%
PCA & principal component analysis & 90 & 1 \tabularnewline%
RDF & Resource Description Framework & 87 & 3 \tabularnewline%
% DAG & directed acyclic graph & 87 & 2 \tabularnewline%
% MAP & Maximum a Posteriori & 84 & 2 \tabularnewline%
% FFT & fast Fourier transform & 81 & 2 \tabularnewline%
% GPS & Global Positioning System & 77 & 2 \tabularnewline%
% API & application programming interface & 75 & 2 \tabularnewline%
% SVD & singular value decomposition & 75 & 1 \tabularnewline%
% MIT & Massachusetts Institute of Technology & 74 & 2 \tabularnewline%
% DCT & discrete cosine transform & 70 & 2 \tabularnewline%
% JVM & Java Virtual Machine & 64 & 2 \tabularnewline%
% GUI & graphical user interface & 63 & 1 \tabularnewline%
% RMI & Remote Method Invocation & 63 & 1 \tabularnewline%
% IDL & Interface Definition Language & 61 & 1 \tabularnewline%
% LRU & Least Recently Used & 59 & 2 \tabularnewline%
% MDL & Minimum Description Length & 57 & 2 \tabularnewline%
% MPI & Message Passing Interface & 55 & 3 \tabularnewline%
% USC & University of Southern California & 53 & 9 \tabularnewline%
% HMM & Hidden Markov Model & 53 & 1 \tabularnewline%
% MRI & magnetic resonance imaging & 52 & 6 \tabularnewline%
% EDF & earliest deadline first & 51 & 2 \tabularnewline%
% SSL & Secure Sockets Layer & 50 & 1 \tabularnewline%
% length 4
\rowcolor{tableheadcolor} \mmcc{4}{\color{white} Four tokens} \tabularnewline%
IETF & Internet Engineering Task Force & 85 & 3 \tabularnewline%
VRML & Virtual Reality Modeling Language & 80 & 2 \tabularnewline%
ISCA & Int'l Symp. Computer Architecture & 56 & 1 \tabularnewline%
NIST & National Institute of Standards and Technology & 54 & 3 \tabularnewline%
VLSI & Very Large Scale Integration & 50 & 2 \tabularnewline%
% SOAP & simple object access protocol & 49 & 9 \tabularnewline%
% VLIW & very long instruction word & 45 & 1 \tabularnewline%
% WSDL & Web Services Description Language & 40 & 6 \tabularnewline%
% VLDB & Very Large Data Bases & 39 & 2 \tabularnewline%
% PODS & Principles of Database Systems & 38 & 3 \tabularnewline%
% LFSR & linear feedback shift register & 36 & 1 \tabularnewline%
% PSTN & Public Switched Telephone Network & 36 & 2 \tabularnewline%
% ICDE & Int'l Conf. Data Eng. & 36 & 2 \tabularnewline%
% ARPA & Advanced Research Projects Agency & 36 & 2 \tabularnewline%
% CVPR & Computer Vision and Pattern Recognition & 35 & 9 \tabularnewline%
% SMIL & Synchronized Multimedia Integration Language & 35 & 1 \tabularnewline%
% PARC & Palo Alto Research Center & 34 & 2 \tabularnewline%
% PLDI & Programming Language Design and Implementation & 33 & 4 \tabularnewline%
% SGML & Standard Generalized Markup Language & 33 & 2 \tabularnewline%
% NCSA & National Center for Supercomputing Applications & 33 & 3 \tabularnewline%
% ANSI & American National Standards Institute & 31 & 2 \tabularnewline%
% ATPG & automatic test pattern generation & 31 & 1 \tabularnewline%
% MTTF & mean time to failure & 30 & 1 \tabularnewline%
% RAID & Redundant Arrays of Inexpensive Disks & 30 & 8 \tabularnewline%
% TTTC & Test Technology Technical Council & 27 & 2 \tabularnewline%
% length 5
\rowcolor{tableheadcolor} \mmcc{4}{\color{white} Five tokens} \tabularnewline%
DARPA & Defense Advanced Research Projects Agency & 88 & 6 \tabularnewline%
CORBA & Common Object Request Broker Architecture & 58 & 2 \tabularnewline%
KAIST & Korea Advanced Institute of Science and Technology & 52 & 9 \tabularnewline%
ICDCS & Int'l Conf. Distributed Computing Systems & 20 & 2 \tabularnewline%
CIPIC & Center for Image Processing and Integrated Computing & 20 & 6 \tabularnewline%
% TAPOS & Theory and Practice of Object Systems & 16 & 2 \tabularnewline%
% EPSRC & Engineering and Physical Sciences Research Council & 16 & 3 \tabularnewline%
% ISSTA & Int'l Symp. Software Testing and Analysis & 16 & 3 \tabularnewline%
% NSERC & Natural Sciences and Engineering Research Council & 15 & 1 \tabularnewline%
% RIACS & Research Institute for Advanced Computer Science & 12 & 3 \tabularnewline%
% ICASE & Institute for Computer Applications in Science and Engineering & 11 & 5 \tabularnewline%
% ICANN & Internet Corporation for Assigned Names and Numbers & 11 & 2 \tabularnewline%
% TOSEM & Transactions on Software Engineering and Methodology & 10 & 3 \tabularnewline%
% ICDCS & International Conference on Distributed Computing Systems & 10 & 3 \tabularnewline%
% ISCAS & International Symposium on Circuits and Systems & 10 & 1 \tabularnewline%
% IJCAI & Int'l Joint Conf. Artificial Intelligence & 10 & 1 \tabularnewline%
% NPACI & National Partnership for Advanced Computational Infrastructure & 9 & 3 \tabularnewline%
% HKUST & Hong Kong University of Science and Technology & 9 & 3 \tabularnewline%
% ENIAC & Electronic Numerical Integrator and Computer & 9 & 1 \tabularnewline%
% PPOPP & Principles and Practice of Parallel Programming & 9 & 4 \tabularnewline%
% DARPA & Defense Advanced Research Project Agency & 9 & 1 \tabularnewline%
% AFIPS & American Federation of Information Processing Societies & 9 & 1 \tabularnewline%
% IPDPS & Int'l Parallel and Distributed Processing Symp. & 8 & 1 \tabularnewline%
% NSERC & Natural Science and Engineering Research Council & 8 & 1 \tabularnewline%
% ICCAD & International Conference on Computer Aided Design & 8 & 3 \tabularnewline%
% length 6
\rowcolor{tableheadcolor} \mmcc{4}{\color{white} Six tokens} \tabularnewline%
ASPLOS & Architectural Support for Programming Languages and Operating Systems & 25 & 3 \tabularnewline%
TAPADS & Theoretical Aspects of Parallel and Distributed Systems & 11 & 5 \tabularnewline%
SIGMOD & Special Interest Group on Management of Data & 10 & 1 \tabularnewline%
ICLASS & Illinois Computer Laboratory for Aerospace Systems and Software & 9 & 8 \tabularnewline%
PECASE & Presidential Early Career Award for Scientists and Engineers & 9 & 3 \tabularnewline%
% UMIACS & University of Maryland Institute for Advanced Computer Studies & 6 & 2 \tabularnewline%
% LSSDSV & Large Scientific and Software Data Set Visualization & 6 & 4 \tabularnewline%
% DOCSIS & Data Over Cable Service Interface Specification & 5 & 1 \tabularnewline%
% ASPLOS & Architecture Support for Programming Languages and Operating Systems & 5 & 1 \tabularnewline%
% DCGMRP & Delay Constrained Group Multicast Routing Problem & 4 & 1 \tabularnewline%
% CESDIS & Center for Excellence in Space Data and Information Sciences & 4 & 1 \tabularnewline%
% NCCUSL & National Conference of Commissioners on Uniform State Laws & 4 & 1 \tabularnewline%
% TOMACS & Transactions on Modeling and Computer Systems & 3 & 3 \tabularnewline%
% FMOODS & Formal Methods for Open Object-Based Distributed Systems & 3 & 1 \tabularnewline%
% BARWAN & Bay Area Research Wireless Access Network & 3 & 1 \tabularnewline%
% AHPCRC & Army High Performance Computing Research Center & 3 & 1 \tabularnewline%
% ESPRIT & European Strategic Programme of Research in Information Technology & 3 & 3 \tabularnewline%
% PECASE & Presidential Early Career Awards for Scientists and Engineers & 2 & 2 \tabularnewline%
% DOCSIS & Data over Cable System Interface Specification & 2 & 1 \tabularnewline%
% NCSTRL & Networked Computer Science Technical Reference Library & 2 & 1 \tabularnewline%
% RAPWBN & reconfigurable array of processors with wider bus networks & 2 & 1 \tabularnewline%
% MICCAI & Medical Image Computing and Computer Assisted Intervention & 2 & 1 \tabularnewline%
% LARPBS & linear arrays with reconfigurable pipelined bus systems & 2 & 1 \tabularnewline%
% CESDIS & Center of Excellence in Space Data and Information Sciences & 2 & 1 \tabularnewline%
% TIPHON & Telecommunications and Internet Protocol Harmonization over Networks & 2 & 1 \tabularnewline%
% length 7
\rowcolor{tableheadcolor} \mmcc{4}{\color{white} Seven tokens} \tabularnewline%
EMMCVPR & Energy Minimization Methods in Computer Vision and Pattern Recognition & 7 & 3 \tabularnewline%
SHOSLIF & Self-Organizing Hierarchical Optimal Subspace Learning and Inference Framework & 2 & 1 \tabularnewline%
SIGCAPH & Special Interest Group for Computers and the Physically Handicapped & 1 & 1 \tabularnewline%
INSPASS & Immigration and Naturalization Service Passenger Accelerated Service System & 1 & 1 \tabularnewline%
EMERALD & Event Monitoring Enabling Responses to Anomalous Live Disturbances & 1 & 1 \tabularnewline%
% ESORICS & European Symposium on Research in Computer Security & 1 & 1 \tabularnewline%
% ICANNGA & Int'l Conf. Artificial Neural Networks and Genetic Algorithms & 1 & 1 \tabularnewline%
% IKIWISI & I'll Know It When I See It & 1 & 1 \tabularnewline%
% ICIMADE & International Conference on Intelligent Multimedia and Distance Education & 1 & 1 \tabularnewline%
\end{tabular}
\end{table}
\begin{table}[ht]
\centering
\sffamily \footnotesize
\rowcolors{1}{tablerowcolorodd}{tablerowcoloreven}
\caption{Top 5 extracted acronyms (Pattern II)}
\label{tab:acronyms_II}
\begin{tabular}{llrr}
\rowcolor{tableheadcolor} \color{white} Acronym & \color{white} Full form & \color{white} $tf$ & \color{white} $df$ \tabularnewline%
%acronym full form tf df
% length 2
\rowcolor{tableheadcolor} \mmcc{4}{\color{white} Two tokens} \tabularnewline%
IP & Internet protocol & 19 & 1 \tabularnewline%
EM & expectation maximization & 5 & 1 \tabularnewline%
NS & Network Simulator & 5 & 2 \tabularnewline%
CT & computed tomography & 5 & 1 \tabularnewline%
LP & longest path & 4 & 3 \tabularnewline%
% RF & radio frequency & 4 & 1 \tabularnewline%
% CP & critical path & 4 & 1 \tabularnewline%
% IP & intellectual property & 4 & 1 \tabularnewline%
% AM & Accumulative Monotonic & 3 & 1 \tabularnewline%
% FD & Flow Deviation & 3 & 1 \tabularnewline%
% AI & artificial intelligence & 3 & 1 \tabularnewline%
% MD & Mobility Directed & 3 & 1 \tabularnewline%
% VM & Virtual Machine & 2 & 1 \tabularnewline%
% IT & information technology & 2 & 1 \tabularnewline%
% NL & negative large & 2 & 1 \tabularnewline%
% CD & Count Distribution & 2 & 1 \tabularnewline%
% SW & shared write & 2 & 1 \tabularnewline%
% NT & narrower terms & 2 & 1 \tabularnewline%
% BC & backoff counter & 2 & 2 \tabularnewline%
% NS & negative small & 2 & 1 \tabularnewline%
% BT & broader terms & 2 & 1 \tabularnewline%
% FR & Federal Register & 2 & 2 \tabularnewline%
% VL & Very Low & 2 & 1 \tabularnewline%
% CT & computerized tomography & 2 & 1 \tabularnewline%
% CH & Cyclic Hashing & 2 & 1 \tabularnewline%
% length 3
\rowcolor{tableheadcolor} \mmcc{4}{\color{white} Three tokens} \tabularnewline%
PVM & Parallel Virtual Machine & 25 & 1 \tabularnewline%
ATM & asynchronous transfer mode & 25 & 1 \tabularnewline%
SQL & Structured Query Language & 23 & 1 \tabularnewline%
LRU & least recently used & 21 & 2 \tabularnewline%
CGI & common gateway interface & 18 & 1 \tabularnewline%
% GPS & Global Positioning System & 18 & 2 \tabularnewline%
% MPI & Message Passing Interface & 17 & 1 \tabularnewline%
% NOW & Networks of Workstations & 17 & 8 \tabularnewline%
% PCI & peripheral component interconnect & 16 & 1 \tabularnewline%
% DSL & digital subscriber line & 14 & 1 \tabularnewline%
% WAP & wireless application protocol & 13 & 1 \tabularnewline%
% EDI & Electronic Data Interchange & 12 & 1 \tabularnewline%
% ERP & enterprise resource planning & 12 & 1 \tabularnewline%
% SSL & Secure Sockets Layer & 12 & 1 \tabularnewline%
% DES & Data Encryption Standard & 12 & 1 \tabularnewline%
% API & application programming interface & 11 & 1 \tabularnewline%
% RDF & Resource Description Framework & 11 & 1 \tabularnewline%
% NFS & Network File System & 11 & 1 \tabularnewline%
% IDL & Interface Definition Language & 10 & 2 \tabularnewline%
% MAC & media access control & 10 & 1 \tabularnewline%
% LAN & local area network & 9 & 1 \tabularnewline%
% RPC & remote procedure call & 9 & 1 \tabularnewline%
% UDP & User Datagram Protocol & 9 & 1 \tabularnewline%
% UML & Unified Modeling Language & 9 & 1 \tabularnewline%
% TCP & Transmission Control Protocol & 9 & 1 \tabularnewline%
% length 4
\rowcolor{tableheadcolor} \mmcc{4}{\color{white} Four tokens} \tabularnewline%
VLIW & very long instruction word & 24 & 1 \tabularnewline%
SNMP & Simple Network Management Protocol & 15 & 1 \tabularnewline%
SGML & standard generalized markup language & 14 & 1 \tabularnewline%
MIPS & million instructions per second & 13 & 2 \tabularnewline%
TDMA & time division multiple access & 13 & 2 \tabularnewline%
% SOAP & Simple Object Access Protocol & 13 & 1 \tabularnewline%
% CDMA & code division multiple access & 13 & 2 \tabularnewline%
% SPMD & Single Program Multiple Data & 12 & 2 \tabularnewline%
% IETF & Internet Engineering Task Force & 12 & 1 \tabularnewline%
% VRML & Virtual Reality Modeling Language & 12 & 1 \tabularnewline%
% MIME & Multipurpose Internet Mail Extensions & 11 & 4 \tabularnewline%
% ISDN & integrated services digital network & 10 & 1 \tabularnewline%
% JPEG & Joint Photographic Experts Group & 10 & 1 \tabularnewline%
% DCOM & Distributed Component Object Model & 9 & 1 \tabularnewline%
% GPRS & General Packet Radio Service & 9 & 1 \tabularnewline%
% LDAP & Lightweight Directory Access Protocol & 9 & 1 \tabularnewline%
% FDDI & Fiber Distributed Data Interface & 8 & 1 \tabularnewline%
% DHCP & dynamic host configuration protocol & 7 & 1 \tabularnewline%
% ADSL & Asymmetric Digital Subscriber Line & 7 & 1 \tabularnewline%
% SMTP & simple mail transfer protocol & 7 & 1 \tabularnewline%
% PSTN & public switched telephone network & 6 & 1 \tabularnewline%
% EPIC & explicitly parallel instruction computing & 6 & 1 \tabularnewline%
% CAVE & Cave Automatic Virtual Environment & 6 & 1 \tabularnewline%
% POTS & plain old telephone service & 6 & 1 \tabularnewline%
% RISC & reduced instruction set computing & 6 & 1 \tabularnewline%
% length 5
\rowcolor{tableheadcolor} \mmcc{4}{\color{white} Five tokens} \tabularnewline%
CORBA & Common Object Request Broker Architecture & 23 & 1 \tabularnewline%
ENIAC & Electronic Numerical Integrator and Computer & 7 & 1 \tabularnewline%
EDSAC & Electronic Delay Storage Automatic Calculator & 4 & 1 \tabularnewline%
TSAPI & Telephony Services Application Programming Interface & 2 & 1 \tabularnewline%
PALKA & Parallel Automatic Linguistic Knowledge Acquisition & 2 & 1 \tabularnewline%
% ADEPT & Advanced Design Environment Prototype Tool & 2 & 1 \tabularnewline%
% EGDLM & Elastic Graph Dynamic Link Model & 2 & 1 \tabularnewline%
% ECDSA & elliptic curve digital signature algorithm & 2 & 1 \tabularnewline%
% LAPSE & Large Application Parallel Simulation Environment & 2 & 1 \tabularnewline%
% UCITA & Uniform Computer Information Transactions Act & 2 & 1 \tabularnewline%
% NYSIA & New York Software Industry Association & 1 & 1 \tabularnewline%
% APNNA & Asia Pacific Neural Network Assembly & 1 & 1 \tabularnewline%
% DPOSS & Digital Palomar Observatory Sky Survey & 1 & 1 \tabularnewline%
% JEIDA & Japan Electronic Industry Development Association & 1 & 1 \tabularnewline%
% PRISM & Portable Reusable Integrated Software Modules & 1 & 1 \tabularnewline%
% AWACS & Airborne Warning and Control Systems & 1 & 1 \tabularnewline%
% REYES & Renders Everything You Ever Saw & 1 & 1 \tabularnewline%
% DOSIS & Development of Statistical Information Systems & 1 & 1 \tabularnewline%
% AEGIS & Automatic Evolutional Grammar Interpretation System & 1 & 1 \tabularnewline%
% PRCGC & planar right constant generalized cylinder & 1 & 1 \tabularnewline%
% BLACS & Basic Linear Algebra Communication Subprograms & 1 & 1 \tabularnewline%
% MTTSF & Mean time To Serious Failure & 1 & 1 \tabularnewline%
% JEDEC & Joint Electron Device Engineering Council & 1 & 1 \tabularnewline%
% ADEPT & Advanced Design Environment Prototyping Tool & 1 & 1 \tabularnewline%
% CRAFT & Customer Restoration and Fault Testing & 1 & 1 \tabularnewline%
% length 6
\rowcolor{tableheadcolor} \mmcc{4}{\color{white} Six tokens} \tabularnewline%
YUPPIE & Yorktown Ultra Parallel Polymorphic Image Engine & 3 & 1 \tabularnewline%
PCMCIA & Personal Computer Memory Card International Association & 3 & 1 \tabularnewline%
EBCDIC & Extended Binary Coded Decimal Interchange Code & 3 & 1 \tabularnewline%
DOCSIS & Data Over Cable Service Interface Specification & 2 & 1 \tabularnewline%
WIDTIO & When In Doubt Throw It Out & 1 & 1 \tabularnewline%
% JSTARS & Joint Surveillance Target Attack Radar System & 1 & 1 \tabularnewline%
% LOCKSS & Lots of Copies Keeps Stuff Safe & 1 & 1 \tabularnewline%
% MLOCTF & Mean Lines of Code To Failure & 1 & 1 \tabularnewline%
% MFLIPS & million fuzzy logical inferences per second & 1 & 1 \tabularnewline%
% MANIAC & Mathematical And Numerical Integrator And Computer & 1 & 1 \tabularnewline%
% DOCSIS & Data Over Cable Service Interface Standard & 1 & 1 \tabularnewline%
% ANZIIS & Australia New Zealand Intelligent Information Systems & 1 & 1 \tabularnewline%
% MFLIPS & million fuzzy logical inference per second & 1 & 1 \tabularnewline%
% TELRIC & total estimated locally regulated incremental costs & 1 & 1 \tabularnewline%
% SUPARS & Syracuse University Psychological Abstract Retrieval Service & 1 & 1 \tabularnewline%
% PIVOTS & Private Information Viewing Offering Total Safety & 1 & 1 \tabularnewline%
% PCMCIA & Personal Computer Manufacturers Communications Interface Adapter & 1 & 1 \tabularnewline%
% DOCSIS & Data Over Cable System Interface Specification & 1 & 1 \tabularnewline%
% length 7
\rowcolor{tableheadcolor} \mmcc{4}{\color{white} Seven tokens} \tabularnewline%
WYSIWYG & what you see is what you get & 5 & 1 \tabularnewline%
WYSIWYR & what you see is what you record & 1 & 1 \tabularnewline%
YCAGWYS & you can always get what you see & 1 & 1 \tabularnewline%
WYSIWYC & What You See Is What You Compute & 1 & 1 \tabularnewline%
\end{tabular}
\end{table}
% conclusion
The final dictionary contains 2.729 two letter acronyms, 8.636 three
letter acronyms, 3.868 four letter acronyms, 724 five letter
acronyms, 122 six letter acronyms, and 13 seven letter acronyms.
Tables containing the top 24 extracted acronyms of each length are
attached in the appendix in Section~\ref{app:sec:acronyms}. During
retrieval the acronyms extracted provide an extra multi-term index
that is searched. In contrast to multi-terms of the previous
section, the smaller number allows indexing of the complete set of
acronyms. Note that this index also includes stopwords in special
contexts and considers the order of words.
% outlook
The results show clearly that both patterns investigated serve as an
appropriate means for extracting acronyms with high confidence.
Promising patterns left open for further investigation include
plurals (e.g., \texttt{POWs (Prisoners of War)}), special
capitalizations (e.g., \texttt{XML (eXtensibel Markup Language)}),
and multiple letter acronyms (e.g., \texttt{HyREX (Hyper-media
Retrieval Engine for XML)}).
\FloatBarrier
%----------------------------------------------------
%----------------------------------------------------
%----------------------------------------------------
\section{Summary}
This chapter described the generation of natural language resources
that are relevant for information retrieval. The experiments
conducted solely rely on the output of Extended Tokenization.
Generated resources include typing rules for single-tokens and
multi-tokens, single-term dictionaries (full forms, abbreviations),
and multi-term dictionaries (composite nouns, named entities,
formulaic speech, acronyms with corresponding full forms). These
resources are incorporated in the indexing and retrieval processes
to improve retrieval performance. Multi-term dictionaries and rules
improve the output of the tokenizer and thus, the quality of content
representations. The rule generation process is supported by a Rule
Miner that infers rule proposals based on extracted patterns. A
bootstrapping-like approach is applied to identify meaningful
patterns in similar contexts automatically.
|