Research @ Hekkas.Com

Linguistically Enhanced Information Retrieval of Structured Documents

X-DOSE

"Science must begin with myths, and with the criticism of myths." Karl Popper


%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Chapter 7 - Overview of the System
%
% last change: 16.08.2004
% correction hamid: xx.xx.2004
% correction prof: xx.xx.2004
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\hiddenabbrev{Hyper-media Retrieval Engine for XML}{HyREX}
\hiddenabbrev{HYpermedia System with Probabilistic Inference for the Retrieval of InformaTion}{HySpirit}
\hiddenabbrev{PArallel Document Retrieval Engine}{PADRE}
\hiddenabbrev{Commonwealth Scientific and Industrial Research Organisation}{CSIRO}

\aphorism{The mind is not a vessel to be filled, but a fire to be ignited.}{Plutarch}%
\chapter{Overview of the X-DOSE System}
\label{chapter:system}%

Based on the facets discussed in the previous chapters, this chapter
describes the system implemented that covers those aspects. X-DOSE,
the XML-Document Oriented Search Engine, is based on a client-server
architecture. A server processes incoming indexing and retrieval
requests. Query results are sent to the client which displays the
document components retrieved. Single results can be selected and
displayed in a document view which highlights relevant components in
their contexts. By browsing the document tree, users are able to
extend or refine initial queries by adding new constraints. A
classification machinery supports the user in creating groups of
related elements. Both, the architecture of the server and the
client are described in detail. Before explaining the system
proposed, existing retrieval systems operating on structured
documents are reviewed in Section~\ref{sec:system:related}.

 

 

 

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Related Work}
\label{sec:system:related}

 

\subsection{HyREX}

HyREX~\cite{fuhr02hyrex1,fuhr02hyrex2,goevert03hyrex,fuhr03interface},
the Hyper-media Retrieval Engine for XML, is a system developed by
G\"overt and Fuhr at the University of Dortmund.
%
It relies on the XIRQL~\cite{fuhr04xirql} query language, which
extends the XPath subset of XQuery by introducing features for term
weighing, relevance-oriented searching, data types and vague
predicates, and semantic relativism. During query calculation, XIRQL
queries are first translated into a path algebra, which is optimized
and processed by the database engine.
%
HyREX aims at retrieving the most specific component of an XML
document according to Chiaramella's model for multimedia information
retrieval~\cite{chiaramella96model}.
%
To achieve that, non-overlapping sets of elements are a priori
defined as index nodes (see Figure~\ref{fig:system:indexing_fuhr}).
All index nodes are retrievable units for queries.
%
% representation
From the representation point of view, indexing nodes are considered
as individual documents (atomic units) that are retrievable units
for queries. Term weights are computed applying the standard vector
space model. During retrieval the system matches the query with each
of these index nodes and ranks them according to their relevance.
%
\begin{figure}[h]
\centering
\includegraphics[width=0.99\textwidth]{09_system/figures/related_fuhr}
\caption{XML document tree and corresponding indexing objects~\cite{fuhr01xirql}}
\label{fig:system:indexing_fuhr}
\end{figure}

% leaf content
The content of an XML document is restricted to the leaf nodes of
the document tree. It is, therefore, necessary to infer
representations of inner index nodes based on the leaf nodes. This
is done bottom-up and recursively, such that the representation of
each intermediate node is based on the representations of its
descendant node, relying on \emph{event keys} and \emph{event
expressions}~\cite{fuhr97algebra}. Basically, the weight of terms of
an ascendent node is computed using the so-called
\emph{inclusion-exclusion} mechanism~\cite[pp.
20]{billingsley79probability}.
%
Assuming term independency, the merge (join) probability of a term
lying in two nodes is computed using a modified version of the
traditional formula:
%
\begin{equation}
P(asc \vee (f\cdot desc)) = P(asc) + P(desc)\cdot f - P(asc) \cdot P(desc)\cdot f \label{eq:system:fuhr_join}
\end{equation}
%
where $asc$ and $desc$ indicate a node and its descendant node
containing a term $t$, while $f\in[0,1]$ is an augmentation factor
used to moderate the weights propagated upwards to the ascendant
node. Different index nodes may be assigned different augmentation
factors. In their experiments, G\"overt and Fuhr found that
augmentation weights chosen from $[0,3; 0,7]$ achieved best results.
Systematic methods like Amanti's approach of divergence from
randomness~\cite{abolhassani04hyrex} can be used to identify proper
augmentation factors automatically. However, those augmentation
factors are pre-estimated using empirical data, which makes weight
computation dependent on the corpus. Thus, these values cannot be
claimed valid with other data.

% further advances
Further enhancements of HyREX include bilingual information
retrieval for English and German~\cite{goevert01hyrex}, result
presentation
issues~\cite{grossjohann02interface,grossjohann02visualization,fuhr03interface},
and performance improvements (e.g., index
compression)~\cite{fuhr02compression,fuhr02index}.
%
General information about HyREX is also available via the
Internet\footnote{http://www.is.informatik.uni-duisburg.de/projects/hyrex/
(15.08.2008)}.

 

 

\subsection{HySpirit}

HySpirit~\cite{roelleke97hyspirit,roelleke97hyspirit1,fuhr98hyspirit,kazai02scalable,roelleke01hyspirit},
the HYpermedia System with Probabilistic Inference for the Retrieval
of InformaTion, is a scalable hypermedia framework developed by
R{\"o}lleke et al. The system is based on a probabilistic relational
algebra. Based on a probabilistic relational algebra, the system
processes large scale retrieval tasks on distributed databases in
parallel.
%
In this context, the notion of a local representation of a document
refers to the representation of a given element within a
sub-collection (independent database), whereas a global
representation describes the collection as a whole (all databases).
Each element in a document tree is regarded as an atomic unit
(separate document). Thus, a level in the hierarchy corresponds a
collection of atomic units.

A kind of $tf \cdot idf$ formula is adapted to compute term weights
like in traditional information retrieval. Since the concept of a
document is not clear, term frequency $tf$ and inverse document
frequency $idf$ are interpreted on the basis of XML elements at a
certain level in the collection's hierarchy. To do that, an
aggregation function is used to combine the $idf$ values of
different sub-collection elements using the formula:
%
\begin{equation}
\label{f_idf} idf(term_j) = \log \frac{ \sum\limits_{i=1}^n {N_i}}
{\sum\limits_{i=1}^n {n_{i,term_j}}}
\end{equation}
%
where $N_i$ is the number of direct subdocuments (child elements) in
the collection, and $n_{i,term_j}$ is the number of documents in
that collection referring to $term_i$. During retrieval, the
mechanism of augmentation is applied to represent the content of
elements made up of the contents of their sub-contexts. Target
elements of queries, the elements retrieved, are processed by a
post-retrieval filtering task.

Database pre-selection based on a cost-function and a content-based
measure estimates the efficiency of the sub-collections and speeds
up the overall data access. Term selection (stopwords) as well as
context selection (stop-contexts) are used to improve indexing and
retrieval.
%
Based on database accessibility and access costs, two different
strategies combine the query and database dimension in distributed
environments (parallel query computation):
%
\begin{compactitem}
\item For each query, the system retrieves from the set of databases.
\item For each database, the system runs the set of queries.
\end{compactitem}

 

\subsection{JuruXML}

Juru~\cite{carmel01juru} is a full-text information retrieval system
developed by Carmel et al. at the IBM Research Lab in Haifa. Juru is
implemented in Java and operates on an inverted list index. Novel
pruning methods and compression techniques are applied, reducing the
size of the index significantly while maintaining high precision.

In 2002, Mass et al.~\cite{mass02juruxml} extended Juru by means of
XML retrieval features to JuruXML. Their approach is based on
querying XML documents via pieces of XML documents, so-called XML
fragments, rather than inventing a new query language. This
differentiates JuruXML from other systems. Query results returned
contain not only perfect matches but matches that are `close
enough', according to some relevance measure. Most of the logic is
put to the ranking mechanism, targeting intuitive querying and user
friendliness.

%XML fragments (queries) are portions of XML combined with free text.
Within XML fragments, elements, attributes, and contents can be
augmented by prefixes for mandatory, prohibited, and phrase terms.
Optionally, a list of target elements to be returned can be
included. For better relevance ranking and approximate matching, an
extension of the vector space model described in~\cite{carmel02vsm}
is proposed. In contrast to global term weights, the content of an
XML component is expressed by pairs of terms $t_i$ and contexts
$c_j$, where the context refers to the XPath of the component (e.g.,
\texttt{/article[1]/abstract[1]}).
%
A document is considered relevant if at least one path (and its
content) of the document matches at least one path (and its content)
of the XML fragment (query). The match of paths $cr$ (document and
query paths) is expressed by the longest common subsequence. The
similarity between a document $d$ and a query $q$ is given by
%
\begin{equation}\label{eq:system:juruxml}
sim(d,q) = \frac{ \sum_{(t_i,c_j) \in q} \sum_{(t_i,c_k) \in d} w_d(t_i,c_k) \cdot w_q(t_i,c_j) \cdot cr(c_j,c_k) }
{||d|| \cdot ||q||}
\end{equation}
%
$w_d(t,c)$ and $w_q(t,c)$ denote the weight of term $t$ in context
$c$ in the document $d$ and the query $q$, $cr(c_j,c_k)$ is the
similarity of the contexts $c_j$ and $c_k$, and $||d||$ (resp.
$||q||$) denotes the number of contexts in $d$ (resp. $q$).
%
%
To this stage, JuruXML takes term statistics only on the document
level. Thus, matching and ranking are performed for complete
documents only.

 

 

\subsection{XXL Search Engine}

Theobald and Weikum describe the XXL Search Engine
in~\cite{theobald01relevance,theobald02indexbased}. It is based on
the XXL (fleXible XML search Language) query language, combining
standard $tf \cdot idf$ weighing and ontology searching. The index
structure consists of three layers: an element path index $EPI$, an
element content index $ECI$, and an ontology index $OI$.

In a first step, XXL queries are decomposed and represented as a
graph. After that, the order of the subquery evaluation is chosen.
Then, the order of the internal path expressions for each subquery
is decided. Subqueries are evaluated making best usage of the three
indices. Finally, intermediate subquery results are composed into a
global result.

 

\subsection{K2 Search Engine from Verity}

Tong~\cite{tong02verity} presented an approach for XML retrieval
using K2, an enterprise-class document retrieval platform from
Verity, Inc.\footnote{http://www.verity.com (15.08.2008)} Although
the system does not have an explicit representation for the document
structure, `zone indexing' is applied to distinguish different
regions within a document.

Queries are formulated in the \abbrev{Verity Query Language}{VQL}, a
rich and expressive query language that supports standard keyword
operators, boolean operators, and weighing. Since the system does
not allow to return pointers to specific document elements, a path
reporting strategy was adopted that retrieves either the first or
the smallest XML element found. An evaluation at INEX 2002 showed
that further investigation and revision is needed. The performance
of the K2 search engine was clearly lower than that of other systems
addressing the document structure explicitly.

 

 

 

\subsection{Cheshire II}

Cheshire II~\cite{larson02cheshire} is an XML retrieval system
proposed by Larson, formerly developed as an online library catalog
system. It is based on a client-server architecture with an embedded
database engine. Features include multiple, scriptable clients, and
relevance feedback.

Each node in the XML tree is assigned an index data structure
(B-TREE), an extraction type (e.g., KEYWORD, EXACTKEY, DATE), and
its type of normalization (e.g., STEM, NONE).
%
Queries are represented applying stopword removal, stemming, and
word replacements exploiting a wordnet dictionary and thesaurus.
Logistic regression is used to match, rank, and retrieve document
components of any granularity.

The probability of relevance $R$ given a query $Q$ and a document
$D$ is expressed by $P(R|Q,D)$. Since Cheshire II supports
probabilistic and boolean searches, a kind of `Fusion Search' merges
the subsets retrieved from different searches to a global result,
defined as
%
\begin{equation}
P(R|Q,D) = P(R|Q_{bool},D) \cdot P(R|Q_{prob},D)
\end{equation}
%
where $P(R|Q_{bool},D)$ is the probability estimate from the
probabilistic portion of the search, and $P(R|Q_{prob},D)$ is the
estimate from the boolean operator.

 

 

\subsection{PADRE}

The PADRE (Parallel Document Retrieval Engine)
system~\cite{hawking94padre,hawking94searching} was developed by
Hawking at the Australian National University of Canberra in 1994.
It was the first text retrieval system that implemented indexing,
query processing, ranking, and retrieval of text documents on
parallel and distributed systems for terabyte-scale collections.
PADRE views document retrieval as an inherently parallel problem,
since a document collection can be divided into $N$ sub-collections
searched independently. The only exception to this are global term
statistics (i.e., inverse document frequency) and ranking, discussed
by Bailey and Hawking~\cite{bailey96parallel}.

Mechanisms for load balancing and automatic document distribution
improve response time considerably. Further performance improvements
are achieved by single-pass scanning of multiple alternate patterns,
fast indexing methods, and decreased collection load
times~\cite{hawking96padre,hawking96padremanual}. TREC-5 experiments
on automatic query generation, distance-based relevance scoring,
server selection, and result merging are described
in~\cite{hawking97anuacsys}.
%
A multi-user, time-sharing implementation developed in 1996 applied
natural language processing for automatic query generation to prose
specification texts~\cite{hawking96padre,hawking96parallel}.
%
Further developments led to WWW
capabilities~\cite{hawking96parallel} and retrieval of OCR-scanned
texts~\cite{hawking96document}.
%
The initial system already allowed restricted searches on marked
sections like \texttt{authorname} or
\texttt{title}~\cite{hawking95design,hawking96padre}.
%
Based on compressed inverted file indices, former versions of PADRE
like PADRE97~\cite{hawking97scalable,hawking97padre} and
PADRE99~\cite{hawking00search} outperformed many other
state-of-the-art retrieval systems.
%
%
Today, PADRE is the core of CSIRO's (Commonwealth Scientific \&
Industrial Research Organisation\footnote{http://www.csiro.au
(24.08.2004)}) Panoptic Enterprise Search
Engine\footnote{http://www.panopticsearch.com (24.08.2004)}.

 

 

% XML retrieval
Vercoustre et al.~\cite{vercoustre02padre} extended PADRE for XML
retrieval. In their work, they added additional database techniques
to the underlying text retrieval technology.
%
% indexing
The definition of meaningful retrieval components is implemented as
a pre-indexing task. Before indexing, XML documents are split into
retrievable units. Both, the complete documents and the split
document components are treated as atomic units.

% querying
During retrieval, queries are first translated into PADRE syntax.
Terms lying in specific XML components are mapped onto corresponding
PADRE fields, using a set of metadata classes. This avoids semantic
misinterpretation of same-named relational data fields. Dates and
texts are defined as basic types. Progressive weakening of query
constraints supports retrieval of XML components not fulfilling all
conditions.
%
% results
Results returned by PADRE are sent to an extractor unit. According
to the query constraints, the extractor re-ranks the PADRE results
and presents them to the user.

Currently, an implementation of the system is in productive
operation at the Australian National University in Canberra. It is
used for email and intranet retrieval, and demonstrates fast
indexing and query processing on an inexpensive hardware.
%
Additional information about PADRE is available via the
Internet\footnote{http://www.csiro.au
(15.08.2008)}$^,$\footnote{http://www.panopticsearch.com
(15.08.2008)}$^,$\footnote{http://deneb.soi.city.ac.uk/~andym/PADRE/tarpubs.html
(15.08.2008)}.

 

 

 

 

 

 

%\newpage
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Architecture of X-DOSE}

% general introduction
Taking the aspects discussed in
Chapters~\ref{chapter:documentrepresentation} to
\ref{chapter:xml_document_clustering} into account, the
\abbrev{XML-Document Oriented Search Engine}{X-DOSE} was developed.
The focus has been put on the natural language text representation,
and the retrieval improvements achieved by clustering of XML
components.
%
%
% architecture
X-DOSE is fully implemented in Java and consists of three modular
subsystems:
%
\begin{compactitem}
%
\item An external database server stores the documents and their
corresponding representations in a relational database.
%
\item The server processes index, query, and classification requests that
are transmitted via the \abbrev{Remote Method Invocation}{RMI}
protocol.
%
\item A client starts index requests, aids the query formulation, enables
classification of arbitrary components, and displays the results to
the user.
%
\end{compactitem}
%
An overview of this architecture is given in
Figure~\ref{fig:system:architecture_overview}. Triggered by an index
request, the server obtains the document from the internet and
stores it in the database. During querying, the information in the
database is used to retrieve relevant document components. The
results are transferred to the client, which displays them to the
user. Additionally, links referring to the original source document
are provided.
%
\begin{figure}[ht]
\centering
\includegraphics[width=1.0\textwidth]{09_system/figures/architecture_overview}
\caption{Architecture of X-DOSE}
\label{fig:system:architecture_overview}
\end{figure}

 

% outline
In the sequel, the client and its tasks, index requesting, query
formulation, result presentation, and class management are described
in Section~\ref{sec:system:architecture_client}. Details about the
architecture of the server and its corresponding tasks, indexing,
retrieval (including clustering), and classification are presented
in Section~\ref{sec:system:architecture_server}. Final remarks and
future extensions conclude the chapter.

 

 

 

 

%\newpage
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{The Client}
\label{sec:system:architecture_client}

The client provides the user with a graphical interface and
communicates with the server. An overview of the architecture is
given in Figure~\ref{fig:system:architecture_client}. This section
describes the main tasks of the client: indexing, querying,
displaying of results, and class management.
%
\begin{figure}[ht]
\centering
\includegraphics[width=1.0\textwidth]{09_system/figures/architecture_client}
\caption{Architecture of the client}
\label{fig:system:architecture_client}
\end{figure}

 

 

 

\subsection{Index Request}

Proper management of the documents indexed is essential to maintain
a consistent information pool. Figure~\ref{fig:system:screen_index}
shows the user interface of the system handling the indexing
objects. New documents of different sources (e.g., local files and
directories, HTTP, FTP, HIS, etc.) and of various types (e.g., TXT,
XML, HTML, PDF, etc.) can be added to the index. Existing documents
may be reindexed and obsolete documents removed.
%
\begin{figure}[ht]
\centering
\includegraphics[width=.81\textwidth]{09_system/figures/screen_index}
\caption{Index interface of the client}
\label{fig:system:screen_index}
\end{figure}

%\FloatBarrier

 

\subsection{Query Formulation}

The client supports three different types of queries: simple
\emph{keyword queries}, \emph{free text queries}, and complex
\emph{XOR queries}. According to the query type, the client offers
three different query interfaces.

% KWD
Keyword queries (see Figures~\ref{fig:system:screen_query_KWD}) are
considered as a set of search terms (google-like interface).
Optionally, document components searched (and retrieved) can be
stated to restrict the results to certain XML components.
%
\begin{figure}[ht]
\centering
\includegraphics[width=0.6\textwidth]{09_system/figures/screen_query_KWD}
\caption{Keyword query interface of the client}
\label{fig:system:screen_query_KWD}
\end{figure}
%
%
% free text
In contrast, free text queries (see
Figures~\ref{fig:system:screen_query_TXT}) are formulated as a
complete piece of text (e.g., a paragraph). The system retrieves
those components that are most similar to the content provided.
%
\begin{figure}[ht]
\centering
\includegraphics[width=0.6\textwidth]{09_system/figures/screen_query_TXT}
\caption{Free text query interface of the client}
\label{fig:system:screen_query_TXT}
\end{figure}
%
%
% XOR queries
Both of these query types are translated internally into XOR
queries. Thus, this section concentrates on the description of this
type of queries.
%
%
% XOR
XOR queries (see Figure~\ref{fig:system:screen_query_XOR}) pose the
most complex type of queries. These queries allow searches of
metadata, structure, and content information. Several parameters
described below provide the possibility to tune the query results
computed.
%
\begin{figure}[ht]
\centering
\includegraphics[width=0.8\textwidth]{09_system/figures/screen_query_XOR}
\caption{XOR query interface of the client}
\label{fig:system:screen_query_XOR}
\end{figure}

% XOR syntax
The XOR query language was already described in
Section~\ref{sec:sdr_rel_work_ql} and
Section~\ref{sec:xmlretrieval:query_language}. These queries are
composed of a sequence of subqueries, where each subquery result
serves as a filter for subsequent subqueries. Consider the XOR
example given in Listing~\ref{listing:system:xor}:

% example XOR query
\begin{lstlisting}[caption={[XOR query example]XOR query example},label=listing:system:xor]
/DOC[meta(.,documentMeta.title like '\%Computer\%')]
//SEC[(about(.,retrieval) AND about(./FRA,information))]
/FRA[about(.,processing data storage instructions logic)]
\end{lstlisting}
%
For better readability, each subquery is written in a separate line.
The first subquery retrieves documents with titles containing the
term \texttt{computer}. Within those documents, the second subquery
refines the search to sections at any level (e.g., subsection,
subsubsection, etc.) that include the term \texttt{retrieval} and
further contain a fragment that deals about \texttt{information}.
Out of these sections, the third subquery retrieves all fragments
that contain any of the terms \texttt{processing}, \texttt{data},
\texttt{storage}, \texttt{instructions}, or \texttt{logic}. This set
of final fragments is returned to the user as the result of the
query. Thus, the retrieval result is defined by the last subquery
component.

% search and retrieval paths

As indicated by the second subquery in the example, two different
aspects have to be distinguished: components searched (search path,
support elements) and components retrieved (retrieval path, target
elements). The search path specifies the components that are matched
against the current query. In contrast, the retrieval path specifies
the components that are returned to the user.
%
In most cases, these two paths are equal (e.g.,
\texttt{/FRA[about(.,processing data storage instructions logic)]}.
However, the second subquery \texttt{//SEC[about(.,retrieval) AND
about(./FRA,information)]} is an example where the retrieval path
\texttt{//SEC} is a subset of the search path \texttt{//SEC/FRA}.
Note that the search path is always the same or more specific than
the retrieval path.

 

 

% XOR query interface
In order to avoid long and confusing single-line queries, query
formulation is done by chaining XOR subqueries in the client
interface.
%
This concept is closely related to the natural way of questioning,
where a query is successively refined by introducing additional
constraints (subqueries).
%
Each subquery result operates as a strict filter, allowing only
elements of the same or smaller granularity to be retrieved. This
improves the performance without skipping relevant components.
Furthermore, the chains are used to reweigh elements regarding to a
user-defined generality parameter.

 

% predicates

The system processes an extended XOR query syntax that supports
different kinds of matching.
%
%
% about()
An \texttt{about(path,keywords)} predicate matches all keywords
against a component's content.
%
% query terms (keywords)
Table~\ref{tab:system:keyword_semantics} summarizes the modifiers to
express different semantics of keywords.
%
\texttt{+} (MUST) and \texttt{-} (MUST NOT) indicate whether a term
must or must not be present in a component.
%
% multi-terms
More complex is the treatment of quoted keywords. For instance,
\texttt{books} and \texttt{"books"} have to be treated differently.
While \texttt{books} is stemmed and matches the term \texttt{book}
in documents as well, \texttt{"books"} searches for contents that
explicitly contain the term as it is (a plural noun) in the full
text. No kind of linguistic transformation is applied on these
terms.
%
Quoted multi-terms are particulary difficult to process. Consider
\texttt{"red cars"}. The single-term \texttt{red} is an adjective.
Thus, it may not be included in the index. In another context (e.g.,
``\texttt{Red Cross}''), it is part of a proper noun and, therefore,
may exist in the index.
%
The approach proposed treats quoted keywords in two steps: First,
all terms are treated as unquoted and their similarity is computed
using the standard vector space model of single-term and multi-term
indices. Second, instead of searching the indices, a string matching
strategy is applied on the full text of the component to re-rank the
results.
%
Combinations of \texttt{+}/\texttt{-} and quoted expressions are
treated as if all single-terms within the quotes are separately
marked by \texttt{+}/\texttt{-} and an initial result set is
computed. This result is reduced to those components that exactly
contain the quoted expression (using string matching).
%
\begin{table}[ht]
\centering
\sffamily \footnotesize
\rowcolors{1}{tablerowcolorodd}{tablerowcoloreven}
\caption{Different semantics of keywords}
\label{tab:system:keyword_semantics}
\begin{tabular}{lL{9cm}}
%
\rowcolor{tableheadcolor} \mcc{Keywords} & \mcc{Semantics} \tabularnewline%
%
\smallskip
\texttt{information retrieval techniques} & All terms optional \tabularnewline%
%
\smallskip
\texttt{+information +retrieval -techniques}& \texttt{information} and \texttt{retrieval} must occur,\\
\texttt{techniques} must not occur \tabularnewline%
%
\smallskip
\texttt{"information retrieval" techniques} & \texttt{"information retrieval"} optionally occurs as multi-term,\\
all terms are optional single-terms \tabularnewline%
%
\smallskip
\texttt{+"information retrieval" techniques}& \texttt{"information retrieval"} must occur as multi-term,\\
\texttt{information} and \texttt{retrieval} must both occur as single-terms,\\
\texttt{techniques} is optional \tabularnewline%
%
\smallskip
\texttt{+"running"} & \texttt{"running"} must occur unstemmed in the text (string matching) \tabularnewline%
%
\end{tabular}
\end{table}

% meta()
Similar to the \texttt{about()} predicate for searching contents, an
extra \texttt{meta(path,condition)} search predicate is included for
metadata searches. It allows, for example, to efficiently deal with
queries like: ``return all documents written by Einstein'' using the
command \texttt{//DOC[meta(.,author like '\%Einstein\%')]}. The
result for this query is computed by a single SQL select statement
and, thus, is answered by the database engine without further
processing.

% query parameters
In addition to the XOR subquery chains, several query parameters can
be specified by the user (see
Figure~\ref{fig:system:screen_query_XOR}):

\begin{itemize}
%
\item\textbf{Maximum results ($maxRes$):} Defines the maximum
number of results returned. Its values range from $1$ to $MAXINT$.
%
\item\textbf{Minimum similarity ($minSim$):} Defines the minimum
similarity of results returned $\in [0,1]$,
truncating the list of results below a given similarity threshold.
%
\item\textbf{Content importance ($ci$):} Defines the influence of
the content similarity and the metadata similarity on the
calculation of the retrieval status value ($rsv$, element relevance).
This parameter ranges from $0,0$ (only $simMeta$) to $1,0$ (only $simCont$). The
final $rsv$ (ranking criteria) is computed as $rsv = simCont * ci +
simMeta * (1,0-ci)$.
%
\item\textbf{Generality factor ($gf$):} This parameter ($\in [0,1]$)
influences the retrieval granularity. The higher the parameter, the
more weight is put on the previous subquery relevance.
Let $q_{1..n}$ be a chain of
$n$ subqueries. Then, $rsv$ is recursively computed as $rsv_i =
rsv_{i-1} * gf + rsv_i * (1,0-gf)$.
%
\item\textbf{Result type ($rt$):} Defines which kind of results is
obtained: focused or unfocused. Unfocused retrieval returns
all relevant components, including multiple results of the same document that
are in a structural relationship (e.g., the section and its child
fragment) to each other.
%
In contrast, focused retrieval only returns the most
relevant component of a branch in the document tree. Overlapping elements
in the result set are discarded. Note that
focused retrieval may also retrieve multiple components of the same
document. This strategy reduces the number of elements returned drastically.
%
Figure~\ref{fig:system:result_type} shows the results
of the same document according to the result type. Green encircled components
are returned. The number in the nodes indicate their relevance ($rsv$).
%
\begin{figure}[h]
\centering
\subfloat[Unfocused retrieval]{\includegraphics[width=0.4\textwidth]{09_system/figures/system_unfocussed}\label{fig:system:unfocussed}}
\qquad%
\subfloat[Focused retrieval]{\includegraphics[width=0.4\textwidth]{09_system/figures/system_focussed}\label{fig:system:focussed}}
\caption[Result types]{Result types} \label{fig:system:result_type}
\end{figure}
%
\end{itemize}

 

 

\subsection{Result Display}

% result browser
After a query is computed by the server, the list of results is
returned to the client. The results are ranked according to their
retrieval status value in decreasing order. Besides metadata
similarity $metaSim$ and content similarity $contSim$, the retrieval
status value further combines the similarities of chained subqueries
(described in the previous section).
Figure~\ref{fig:system:screen_result_list} shows the interface for
browsing the results.
%
\begin{figure}[ht]
\centering
\includegraphics[width=1.0\textwidth]{09_system/figures/screen_result_list}
\caption{Result display of the client}
\label{fig:system:screen_result_list}
\end{figure}

% document browser
If the user selects a particular result (e.g., a relevant
subsection), the system displays the whole document in an
explorer-like view (see
Figure~\ref{fig:system:screen_result_document}). Relevant components
in the document are highlighted using different colors reflecting
the degree of similarity of the matched elements. The document's
structure is presented as an expandable tree, where the selected
element is expanded and focused.
%
A graphical representation at the bottom of
Figure~\ref{fig:system:screen_result_document} assists the user in
understanding the result's granularity and substructure, similar to
the TextBars presented in Section~\ref{sec:sdr:result_presentation}.
%
Besides outlining the document structure, it can be used
alternatively to navigate through the document. The actual component
selected is highlighted using a red border. Each component is filled
with up to two different colors that reflect its metadata (top half)
and content (bottom half) similarity. In the figure, two different
kinds of yellow are used.
%
Having similarity values available on the screen in both, the table
of contents and the graphical representation, the document can be
browsed efficiently.
%
\begin{figure}[ht]
\centering
\includegraphics[width=1.0\textwidth]{09_system/figures/screen_result_document}
\caption{Result document browser of the client}
\label{fig:system:screen_result_document}
\end{figure}

% result refinement
In most cases a final search result is achieved through iterative
refinement of the query, where the number of results is reduced
stepwise by adding new information to the query.
%
To enable such a feature, the user may include a list of preliminary
results in a query. If such a result is set, it acts as a strict
filter during query processing, avoiding a computation starting from
scratch.
%
The system offers the possibility for the user to save (partial)
results, $R$, in response to her/his query $Q$ at a given working
session. In future sessions, the user might refine $R$ via a refined
query $Q'$. For this purpose, the query $Q'$ along with the result
set $R$ can be introduced to the system. If so, $R$ is the
sub-collection of elements to be searched instead of the whole
collection, and consequently a refined result set $R'$ is returned.

% result selection in an external application
The information presented by the client solely comes from the
documents stored in the database. Documents found highly relevant to
the query can be opened in an external application (e.g., a web
browser or specific file browser). Therefore, the reference stored
in the \texttt{sourcepath} attribute in the \texttt{METADATA} block
of the \texttt{DOC} element is used to acquire the original source
directly.

 

 

\subsection{Class Manager}

% class manager
The class manager provides methods to create and delete classes of
document components of arbitrary granularity. The main management
interface is shown in Figure~\ref{fig:system:screen_class_manager}).
%
\begin{figure}[ht]
\centering
\includegraphics[width=0.8\textwidth]{09_system/figures/screen_class_manager}
\caption{Class manager of the client}
\label{fig:system:screen_class_manager}
\end{figure}
%
%
% class browser
Details about the components assigned to a single class are shown by
the class browser (see
Figure~\ref{fig:system:screen_class_browser}).
%
\begin{figure}[ht]
\centering
\includegraphics[width=0.8\textwidth]{09_system/figures/screen_class_browser}
\caption{Class browser of the client}
\label{fig:system:screen_class_browser}
\end{figure}

 

% idea, benefits
The main concern of the classification system is, besides the
compilation of related contents, the automatic identification of
document components that are similar to the components of that
class. A classification process matches all components known by the
system to the members of the class. If the average similarity
exceeds a user-defined threshold (e.g., $sim_{avg}>0,0$), the
component is added to the results, which is sorted according to the
similarity computed. The result of the classification process
(similar to the result of a retrieval process) is presented to the
user by the interface illustrated in
Figure~\ref{fig:system:screen_class_result}. By manually assigning
result items to a class, each client creates its own classification
system.
%
Note that the top two classification results in
Figure~\ref{fig:system:screen_class_result} are the same components
initially assigned to that class
(Figure~\ref{fig:system:screen_class_browser}).
%
\begin{figure}[t]
\centering
\includegraphics[width=0.8\textwidth]{09_system/figures/screen_class_result}
\caption{Classification result browser of the client}
\label{fig:system:screen_class_result}
\end{figure}

\FloatBarrier

 

 

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{The Server}
\label{sec:system:architecture_server}

The main components of the server include an RMI communication
server, a threadpool of indexing tasks, a threadpool of retrieval
tasks, a threadpool of classification tasks, and a direct data
request unit. Figure~\ref{fig:system:architecture_server} shows the
five components and their interconnections.
%
\begin{figure}[ht]
\centering
\includegraphics[width=1.0\textwidth]{09_system/figures/architecture_server}
\caption{Architecture of the server}
\label{fig:system:architecture_server}
\end{figure}

 

% architecture
The RMI server processes incoming requests and initiates a new
thread for each call. The goal of this concept is to achieve a high
degree of parallelism. The maximum number of parallel threads
depends on the performance of the hardware and is configured in a
global configuration file.
%
From the software architecture point of view, both index and
retrieval tasks use a pipelined pattern of processing units (see
Figure~\ref{fig:system:architecture_server}). For portability and
tuning purposes, threads and single processing units are
independently configurable via separate configuration files.

% indexing
During indexing, requested documents are downloaded by the
DataCollector. The complete information of the source document is
kept in a temporary repository of raw XML documents.
%
The DataMapper transforms the raw files into the generic document
format and places them in another temporary repository of mapped XML
documents. For each raw document format acquired by the
DataCollector, an XSLT is defined for the transformation which
depends on the original document source format.
%
Afterwards, a DataStorer analyzes the transformed XML documents.
Document structure, metadata, and content are stored in the
relational database.
%
Finally, a DataIndexer computes multiple representations of the XML
components that contain plain natural language text.

% retrieval
As soon as a query is sent to the system, the query content is
analyzed in the same way the DataIndexer processed the textual
contents previously. If possible, the query is expanded by
additional search terms that may improve retrieval results. Query
terms are weighed according to the XML components addressed.
Documents and document components in the database are matched
against the query, and relevant results are ranked in decreasing
order. The final result set is transmitted to the client. Users can
refine their search within the set of results retrieved, browse
relevant documents, and access original source documents by using
the references stored in the database during indexing.

% classification
In order to classify documents and document components, the client
provides an interface for adding and removing classes. Each of the
classes contains a number of elements (either documents, sections,
or fragments) assigned to that class. By comparing the components in
the database to the members of a class, a similarity-ranked result
set is computed. Based on the results, users may further assign
elements to the class currently being investigated. Applying this
process recursively, groups of closely related and user-specific
contents are grown continually.

 

% direct data exchange
A set of general database requests necessary for the client is
supported by a database abstraction layer. Such direct data requests
allow fast read and write access to the database via the RMI
communication server.

 

 

\subsection{Indexing}

This section overviews the indexing task including document
acquisition, document mapping, document storage, and computation of
the indices.

 

\subsubsection{DataCollector}

The task of the DataCollector is to access the documents requested
and to store them as lossless XML documents in the raw XML document
repository. In this repository, documents are only held temporary
during the download. Supporting scalability, the DataCollector can
be run in two modes, a single file mode and a collection mode.

The architecture of the DataCollector is given in
Figure~\ref{fig:system:datacollector}. The abstract DataCollector
performs five processing steps. First, a data connection between the
server and the source document is established, possibly including
login information or connection-relevant parameters. The document is
transferred as an XML tree into the main memory of the server.
Optionally, filtering of irrelevant information reduces the stored
data load (e.g., file owner, access rights). Data conversion is
carried out to get rid of tags (e.g., HTML markup, formatting
information) within textual contents. Finally, the complete XML
document is stored in the raw XML document repository.
%
\hiddenabbrev{Hyperwave Information Server}{HIS}
%
Currently, the system includes a DataCollector implementation for
plain text documents, XML documents, HTML documents, and Hyperwave
Information Server (HIS) documents. Further extensions may include
other formats such as PDF documents, MS-Word documents, or
PostScript documents.
%
\begin{figure}[ht]
\centering
\includegraphics[width=1.0\textwidth]{09_system/figures/architecture_datacollector}
\caption{The DataCollector}
\label{fig:system:datacollector}
\end{figure}

 

 

 

\subsubsection{DataMapper}

The DataMapper transforms the raw XML documents into the generic XML
document format described in Section~\ref{sec:generic_xml_schema}.
For each of the raw XML document formats (i.e., the source document
formats), a separate XSLT stylesheet is defined for the mapping
procedure. This simplifies the transformation process, requiring a
single coding language (XSL) interpreted by a common XSLT engine
only.

% this is new
Since the INEX 2005 document collection does not account for
explicit linkage within or among documents (at least it is not
evaluated), the linking functionality of the generic schema is
neither exploited in this section nor in the evaluation chapter.

% figure
Figure~\ref{fig:system:datamapper} explains the three main
components of the DataMapper. An XML parser processes the XML
documents and outputs them to the XSLT transformation. After
transformation, the XML document conforming to the generic document
format is stored in the mapped XML repository.
%
\begin{figure}[ht]
\centering
\includegraphics[width=1.0\textwidth]{09_system/figures/architecture_datamapper}
\caption{The DataMapper}
\label{fig:system:datamapper}
\end{figure}

 

 

 

\subsubsection{DataStorer}

After the documents are transformed into the generic document
format, the DataStorer stores the structure, metadata, and content
in the relational database. This step involves typing of metadata
elements and content elements. Figure~\ref{fig:system:datastorer}
overviews the components of the DataStorer.
%
\begin{figure}[ht]
\centering
\includegraphics[width=1.0\textwidth]{09_system/figures/architecture_datastorer}
\caption{The DataStorer}
\label{fig:system:datastorer}
\end{figure}

% database
A global view on the database is depicted in
Figure~\ref{fig:system:database_UML}.
%
The tables \texttt{files} and \texttt{file\_processings} keep track
of the documents indexed and their processing status. The central
\texttt{XMLstructure} table holds the structure of the XML
documents. XML paths and XML tags are kept in separate tables.
Substructure entries such as links and mathematical environments
within a fragment's content are stored in the
\texttt{XMLsubstructure} and \texttt{XMLsubstructureContent} tables.
Metadata (\texttt{Metadata} table) and content
(\texttt{XMLstructureContent} table) are associated with the
document structure, where both entries are optional. In the current
implementation metadata entries are considered as unstructured
(flat). Different representations of the contents are stored using
the vector space model representation framework (gray rectangles
comprising blue contents), which is explained in the next section.
%
\begin{figure}[ht]
\centering
\includegraphics[width=1.0\textwidth]{09_system/figures/database_UML}
\caption{Conceptual database schema}
\label{fig:system:database_UML}
\end{figure}

 

 

 

\subsubsection{DataIndexer}
\label{sec:system:indexing}

\hiddenabbrev{Mathematical Markup Language}{MathML}

The DataIndexer computes the content representations of the
fragments (\texttt{FRA} elements). Optionally, redundant
representations of specific inner nodes (e.g., sections at a certain
level) can be included to improve retrieval performance.
%
Since the focus of this work rests on text retrieval, the current
implementation focuses on multiple representations of plain texts.
However, representations of images (e.g., histograms), tables, or
other multimedia contents can be included.

 

% architecture
An overview of the building blocks of the DataIndexer is given in
Figure~\ref{fig:system:dataindexer}. Texts that are indexed are
loaded from the database. Natural language processing is applied on
the texts to achieve mutually comparable representations. This step
includes tokenization, tagging, term extraction, stemming, and term
frequency calculation. Besides single-terms (indices ST01 -- ST12),
the final term frequency vectors include a composite noun vector
(index MT01), a named entity vector (index MT02), a formulaic speech
vector (index MT03), and a vector of full forms of acronyms (index
MT04).
%
Taking advantage of the modularity aspect, different configurations
of the natural language processing components are instantiated and
selected during runtime. Thus, the system can be easily adapted to
process documents in other languages.
%
\begin{figure}[ht]
\centering
\includegraphics[width=1.0\textwidth]{09_system/figures/architecture_dataindexer}
\caption{The DataIndexer}
\label{fig:system:dataindexer}
\end{figure}

% vector space model representation framework
Each representation is stored in the database tables of the
corresponding vector space model representation framework (see the
gray rectangles with blue content in
Figure~\ref{fig:system:database_UML}). This framework allows to
maintain multiple representations of the same content.
%
It consists of four database tables for each representation model
(e.g., \texttt{ST01}). A table \texttt{index\_terms\_ST01} holds the
terms that are used to represent the content in the \texttt{ST01}
model. In \texttt{index\_ST01}, terms and term frequencies are
associated with structural entries of the documents. Thus, the table
contains the term frequency representations of the document
components. Derived from the representation table,
\texttt{index\_IEF\_ST01} stores the number of term occurrences in a
given XML path. This table is essential to compute the inverse
element frequency needed for term weighing. Further, it enables
dynamic term space generations as described in
Section~\ref{sec:representation}. Table updates during indexing,
re-indexing, and removal of documents are immediately done.
%
In addition to the term frequency representation table, a table
\texttt{index\_objects\_ST01} is provided to store binary
representations of document components that do not fit the term
frequency representation type. Such representations include semantic
concepts (no meaningful $ief$), figure representations (e.g.,
histograms), or formulas (e.g., like MathML), to name just some of
them.

 

% Maintained Indices
In order to compare the performance of different representations,
several indexing models are tested.
Table~\ref{tab:system:maintained_single-term_indices} lists the
supported single-term indices.
%
These indices are implemented as uncontrolled vocabulary, meaning
that unknown terms extracted of new documents are added to the term
space.
%
\begin{table}[ht]
\centering
\sffamily \footnotesize
\rowcolors{1}{tablerowcolorodd}{tablerowcoloreven}
\caption{Single-term indices maintained by the system}
\label{tab:system:maintained_single-term_indices}
\begin{tabular}{llllll}
%
\rowcolor{tableheadcolor} \mcc{Index} & \mcc{Tokenizer} & \mcc{Tagger} & \mcc{Extractor} & \mcc{Stemmer} & \mcc{Stopword Filtering} \tabularnewline%
%
ST01 & SimpleTokenizer & - & all & - & - \tabularnewline%
ST02 & OpenNLPTokenizer & - & all & - & - \tabularnewline%
ST03 & JavaTok & - & all & - & - \tabularnewline%
%
ST04 & OpenNLPTokenizer & QTag & nouns, verbs & PorterStemmer & Fox (the best) \tabularnewline%
ST05 & JavaTok & QTag & nouns, verbs & PorterStemmer & Fox (the best) \tabularnewline%
%
ST06 & JavaTok & QTag & nouns, verbs & PorterStemmer & FS, CR, DS \tabularnewline%
ST07 & JavaTok & QTag & nouns, verbs, adjectives, adverbs & PorterStemmer & FS, CR, DS \tabularnewline%
ST08 & JavaTok & QTag & nouns, verbs, adjectives, adverbs & - & FS, CR, DS \tabularnewline%
%
ST09 & JavaTok & token types & valid words & PorterStemmer & FS, CR, DS \tabularnewline%
ST10 & JavaTok & token types & valid words & PorterStemmer & FS, CR \tabularnewline%
ST11 & JavaTok & token types & valid words & PorterStemmer & FS \tabularnewline%
ST12 & JavaTok & token types & valid words & PorterStemmer & -
%
\end{tabular}
\end{table}

% multi-terms
In addition to the single-terms, multi-term indices are maintained
to improve matching of term sequences.
Table~\ref{tab:system:maintained_multi-term_indices} presents the
set of multi-term indices used. In contrast to the single-terms,
multi-terms define a controlled vocabulary. This is necessary
because the large number of unique term combinations is inapplicable
for run-time query computation. The four indices include the
multi-terms extracted in
Section~\ref{sec:advanced_nlp:multi_term_dictionaries} and
Section~\ref{sec:advanced_nlp:acronyms}.
%
\begin{table}[ht]
\centering
\sffamily \footnotesize
\rowcolors{1}{tablerowcolorodd}{tablerowcoloreven}
\caption{Multi-term indices maintained by the system}
\label{tab:system:maintained_multi-term_indices}
\begin{tabular}{lp{15cm}}
%
\rowcolor{tableheadcolor} \mcc{Index} & \mcc{Description} \tabularnewline%
%
MT01 & JavaTok-based composite nouns of arbitrary length ($\sim 340.000$ terms) \tabularnewline%
MT02 & JavaTok-based named entities of arbitrary length ($\sim 65.000$ terms) \tabularnewline%
MT03 & JavaTok-based formulaic speech of arbitrary length ($\sim 245.000$ terms) \tabularnewline%
MT04 & JavaTok-based full forms of acronyms of arbitrary length ($\sim 17.000$ terms)
%
\end{tabular}
\end{table}

 

% final remarks
Since indexing only stores term frequency vectors in the database,
weight computation is totally executed on-the-fly during the
retrieval process.

 

 

%\newpage
\subsection{Retrieval}

This section explains the retrieval process. In particular, it
describes how the query is parsed, analyzed, and expanded. Term
weighing takes place during retrieval only. The matching process is
guided by mechanisms improving performance in both regards,
computational complexity and quality of retrieval results. These
mechanisms include, besides clustering, a set of pre- and
post-filtering steps.

 

\subsubsection{Query Parser}

Queries introduced to the server by the client are formulated in an
extended XOR language syntax. Queries further come in chains of
subqueries. Each subquery are parsed by a JavaCC parser
implementation, which creates an \abbrev{Abstract Syntax Tree}{AST}.
%
Listing~\ref{listing:system:xor2}, which is replicated from
Listing~\ref{listing:system:xor} for better readability, provides a
XOR query example.
%
% example XOR query
\begin{lstlisting}[caption={[XOR query example]XOR query example},label=listing:system:xor2]
/DOC[meta(.,documentMeta.title like '\%Computer\%')]
//SEC[(about(.,retrieval) AND about(./FRA,information))]
/FRA[about(.,processing data storage instructions logic)]
\end{lstlisting}

 

% path resolving
In a first step the search paths and retrieval paths are resolved.
The paths of subsequent subqueries are obtained by concatenation.
For instance, the retrieval path of subquery 2 is
\texttt{/DOC//SEC}, and the two search paths are \texttt{/DOC//SEC}
and \texttt{/DOC//SEC/FRA}.

 

\subsubsection{NLP Analysis}

% query term extraction of about()
According to the index that is used for matching, query terms
occurring in the \texttt{about()}-predicates are extracted by the
DataIndexer (see Section~\ref{sec:system:indexing}) with the same
settings. This results in a term frequency vector of query terms.
%
Terms not occurring in the document term space are neglected. Terms
marked with \texttt{+}/\texttt{-} are separately kept for fast
pre-filtering of single-term representations during matching. Quoted
terms are remembered for post-filtering of matched results.
%
During retrieval, the similarity of a query representation and the
document representations stored serves to compute the relevance of
an XML component.

% query term extraction of meta()
Metadata information such as titles or author names are not indexed
using the vector space model. In order to avoid missing query terms
that are not included in the term space (neglected terms), all query
terms are compared unstemmed to metadata fields. For instance, the
subquery \texttt{//*[about(.,to be or not to be)]} matches all
elements with metadata fields that contain at least one of the terms
\texttt{to}, \texttt{be}, \texttt{or}, or \texttt{not} to a certain
degree.

 

 

\subsubsection{Query Expansion}

Queries often contain quite specific search terms. In case of
general answers expected, related terms can be added to query terms
automatically, increasing the number of results returned. Generally,
this step covers synonyms of wanted terms and antonyms of unwanted
terms. If the query terms added are well chosen, additional results
may also be relevant to the same query with high confidence.

%problem
However, adding terms to a query automatically may lead to confusion
of results retrieved not containing any of the initial query terms
(i.e., only terms expanded match the query). Since many terms can be
used synonymously, the set of query terms expanded increases
rapidly. Further, longer and thus more specific queries are expanded
by a larger set of terms than shorter queries. This may be
unsatisfying, if the user tried to restrict the search to specific
contents excluding somehow similar elements using a specific
vocabulary.

% system
Especially domain-specific acronyms pose the possibility of
meaningful query term extension. The system proposed uses the
acronyms and their full forms (extracted in
Section~\ref{sec:advanced_nlp:acronyms}) to add context-specific
acronyms and/or their full forms.

 

 

\subsubsection{Term Weighing}

Weighing of both, query terms and document terms, takes place during
retrieval only. The system supports two kinds of term spaces, a
static term space and dynamic term spaces (see
Section~\ref{sec:representation}). Both methods apply the same
weighing formula (see Equations~\ref{eq:vsm_w_static} and
\ref{eq:vsm_w_dynamic}).
%
The weight $w_{i,j}$ of a term $i$ in a component $j$ is given by
the term frequency $tf_{i,j}$ multiplied by the inverse element
frequency $ief_{i,c}$ of the current search path $c$ (i.e., term
space).

% procedure
During weighing, the granularity aspect is considered in a first
place. This means that, according to the searched XML components,
the term space and the corresponding inverse element frequency
vector are determined first.
%
In order to minimize the number of (dynamic) $ief$ calculations and
term space generations, queried elements are grouped according to
their expanded search paths. For instance, the search path
\texttt{/DOC//SEC/FRA} is resolved to \texttt{/DOC/SEC/FRA},
\texttt{/DOC/SEC/SEC/FRA}, \texttt{/DOC/SEC/SEC/SEC/FRA} etc. The
corresponding $ief_{i,c}$ values are calculated once per path and
kept in main memory.
%
Both, query terms and document terms are mapped onto this common
term space and weighed using the vector of inverse element
frequencies.
%
% length normalization
Note that using a normalized term weighing function such as the
vector space model, the length of a node's content is already taken
into account.

 

 

 

\subsubsection{Result Computation}

The process of computing the result set of a subquery is illustrated
in Figure~\ref{fig:system:architecture_resultcomputation}.
%
\begin{figure}[ht]
\centering
\includegraphics[width=1.0\textwidth]{09_system/figures/architecture_resultcomputation}
\caption{Result computation}
\label{fig:system:architecture_resultcomputation}
\end{figure}

 

% preselection
Each of the subqueries is processed sequentially, where the result
of one subquery serves as a filter for the result of subsequent
subqueries. A preselection mechanisms observes only components that
are feasible for further investigation. It is based on three steps:
%
\begin{compactenum}
%
\item Components must match the structural constraint (retrieval
path). This is accomplished by a single database query on the
\texttt{XMLstructure} table specifying the \texttt{pathID} attribute.
%
\item If a previous result is available, the component $comp$ itself or any
of the component's ancestors $anc$ must be included. This is equivalent to
a retrieval status value above zero ($rsv>0$). The containment is
computed by checking for components $preorder_{anc} \leq preorder_{comp}$
and $postorder_{anc} \geq postorder_{comp}$.
In cases where no previous result is set, all components are investigated.
%
\item Finally, \texttt{+}/\texttt{-} conditions on
query terms exclude components. This filtering step is applied
on single-term indices only, because these are based on uncontrolled
vocabulary. By checking a components' term vector for the existence
(\texttt{+}) or absence (\texttt{-}) of query terms,
the performance of retrieval is boosted.
%
\end{compactenum}

% preselection benefits
This concept of preselection ensures that only suitable components
are processed, which reduces the number of comparisons computed
tremendously.

 

% matching
Matching includes two types of information, metadata and content.
While metadata is queried by using the \texttt{meta(path,condition)}
predicate, content is searched by using
\texttt{about(path,keywords)}.
%
% meta
Metadata similarity, $simMeta$, is computed by addressing metadata
fields directly in an SQL-like manner. Consequently, the
\texttt{meta()} predicate retrieves components that strictly match
the condition. If components are retrieved, $simMeta$ is defined as
the maximum value $1,0$. Otherwise, $simMeta$ is set to zero. This
allows to compute metadata queries efficiently, because the database
retrieves the results directly without further processing.
%
% about
Alternatively, the similarity of content, $simContent$, is evaluated
by using the \texttt{about()} predicate. The matching procedure is
straightforward, applying the cosine similarity formula of the
vector space model (see Section~\ref{sec:representation}) on the
weighed term vectors of the query and the component. In case of
multiple indices being used, $simCont$ is defined as the average of
all computed values.

% about HC
Hierarchical clustering is applied to improve matching performance
of the \texttt{about()} predicates. Therefore, the document level
(e.g., XML path \texttt{/DOC}) is clustered according to the
approach described in Chapter~\ref{chapter:xml_document_clustering}.
Each document initially creates its own cluster. Recursively, the
two most similar clusters are merged. The process ends if a single
cluster remains.
%
Matching starts at the top cluster. If the content of the cluster is
somehow similar to the query (i.e., similarity above zero), the two
child clusters are investigated. Leaf clusters are not further
investigated but added to the list of investigated components. These
remaining components are then compared to the query in the usual
manner. For the classification, parameters are set to
$\alpha_{struct}=0,0$ and $\beta_{parent}=0,2$, which achieved the
best evaluation results (see
Sections~\ref{sec:clustering:experiment_1} and
\ref{sec:clustering:experiment_4}).
%
% matchAboutMeta
Besides comparing the contents only, the keywords of the
\texttt{about()} predicate are also compared to the metadata
information available. The overall metadata similarity $simMeta$ is
defined as the number of matching single-terms divided by the
smaller number of terms (either keywords or metadata terms).

 

 

% boolean operations
Subqueries may contain constraints combined with boolean operators.
For instance, the second subquery in
Listing~\ref{listing:system:xor}, \texttt{//SEC[about(.,retrieval)
AND about(./FRA,information)]}, combines sections about
\texttt{retrieval} with sections that include fragments containing
the term \texttt{information}. During processing, both of these
result sets are computed independently. The \texttt{AND} operator
then computes the intersection of both sets, where elements
occurring in both sets are assigned the minimum similarities. The
\texttt{OR} operator, instead, defines the union of both sets, where
elements occurring in both sets are assigned the maximum
similarities. The same procedure is applied to \texttt{OR}-connected
retrieval paths (e.g., \texttt{//(SEC|FRA)[about(.,information
retrieval)]}.

 

%reweighing
Having computed the similarity values $sim_{new}$ of the components,
reweighing combines these similarities with the similarities of the
previous result $sim_{prev}$. This is done by using the generality
factor, the $gf$ parameter, specified in the query. The overall
similarities are given by $sim = gf \cdot sim_{prev} + (1,0 - gf)
\cdot sim_{new}$. Reweighing is applied on the metadata similarity
$simMeta$, the content similarity $simCont$, and the retrieval
status value $rsv$.

 

% result description
The computed result consists of tuples of the form ($ID$,
$preorder$, $postorder$, $simMeta$, $simCont$, $rsv$). Document
$ID$, $preorder$, and $postorder$ come directly from the database.
$simMeta$ and $simCont$ are the calculated meta similarity and
content similarity. The $rsv$ value combines $simMeta$, $simCont$,
and the parent-child relationship according to the query parameters
$ci$ (content importance) and $gf$ (generality factor).
%
%
%
% sorting
The list of preliminary results is sorted according to the $rsv$ in
descending order. Thus, components ranked at the top of the list are
most relevant to the query.

 

% result filtering
Post-filtering of results is done in three steps:
%
First, if the result type is specified as focused, only the
component with the highest $rsv$ along the path to the root node of
it's document is added to the result as best entry point. Retrieval
results of the type unfocused are not filtered.
%
Second, result elements not meeting the minimum similarity criteria
$minSim$ are further discarded.
%
Third, if the number of results still exceeds the number of maximum
results $maxRes$, the list is truncated.
%
%
%
% result list is returned to the client
The final list of results is transmitted to the client for result
presentation.

 

 

 

\subsection{Classification}

The system supports classification of arbitrary document components
according to user-defined classes. While the client enables users to
define those classes and assign components to them, the server is
able to retrieve components that are similar to the components of a
class. The result of such a `classification task' is displayed
similarly to the retrieval results responding to a user query. This
kind of retrieval needs no additional parameters being set by the
user.

As mentioned, the key issue of classifying documents is the
similarity function that compares two structured document trees.
Therefore, the system relies on the $TED\_CM$ approach of combining
tree edit distance $TED\_SO$ (structure only) with content matrix
matching $CM\_any$ (ignores labels), described in
Chapter~\ref{chapter:xml_document_classification}. According to the
preliminary evaluation, the parameter combining both approaches,
$\alpha$, is set to $0,5$. $TED\_SO$ specific parameters are set to
$\alpha=0,5$, $\beta=2,0$, $C_{del}=1,0$, and $C_{ins}=1,0$.

% classification process
During classification, each component in the database is compared to
the set of components assigned to that class. The distances to all
components of the class are summed up and averaged over the size of
the class. In order to get comparable `relevance' values, the result
list of component distances is normalized on a [0,1] scale and
subtracted from 1,0 (similarity instead of distance). Elements not
relevant are discarded. The final list is sorted according to the
relevance in descending order and sent to the client.

 

 

 

\subsection{Direct Data Requests}

In a client-server architecture, the client often needs to access
and/or alter the database directly. These direct data requests help
reducing data loads and access times. Further, some of the
application logic can be transferred to the client instead of being
computed on the server.

The data abstraction layer of the system developed comprises methods
for accessing whole documents (structure, metadata, and content),
the set of available element paths, document processing information,
links to the original source of a document, and general statistics
on documents and the document collection.

 

 

 

\section{Future Extensions}

Changing environments, new application domains, and tuning are
inherent in software development and software engineering.
Consequently, system adaptations are inevitable over time. Besides
these changes, a number of future extensions would improve the value
of X-DOSE considerably:

\begin{itemize}

\item \textbf{Metadata:} The treatment of metadata has to be
refined. At the moment multiple occurrences of the same metadata,
e.g. two authors of a paper, are ignored. Also mechanisms for
structured metadata are to be developed.

\item \textbf{Representation:} The XSLT transformation applied
on the INEX corpus is based on a stylesheet consisting of over 2180
lines of code. Thus, the transformation surely faced mapping
difficulties with some documents. Those errors were corrected
manually. More consistent mapping mechanisms may improve the
retrieval results.

\item \textbf{Index enhancements:} The natural language core
processing depends on different internal (JavaTok tokenizer) and
external tools (tagger, stemmer). For the experiments, only a small
set of possible configurations was tested. Advanced corpus-based
analysis techniques may be used to generate resources that further
improve language analysis and representation.

\item \textbf{Content types:}
A valuable extension of X-DOSE may be additional indices on other
content types than plain text. Database index structures are already
provided by X-DOSE to store that data (e.g., binary data). However,
models for representing and matching formulae, figures, pictures,
and multimedia data still have to be integrated.

\item \textbf{Retrieval:} Performance improvements, especially of
the database, are necessary to reduce query response time.

\item \textbf{Classification and clustering:} The approaches proposed
illustrate the benefits of grouping similar components
automatically. However, other approaches with lower computational
complexity should be tested for information retrieval.

\item \textbf{Distributed information retrieval:} In order to cope
with large amounts of data, a distributed version of the system
seems to be necessary.
%
The systems' architecture is designed to operate independently.
However, a central core engine managing the distribution and merging
of multiple result sets needs to be designed and implemented.

\item \textbf{Incorporation of external resources:} Integrating external resources
or services like multilingual thesauri or wordnet ontologies may
improve the quality of representations. Including the output of
other search engines as intermediate results (e.g., Google) may
further extend the search space, decrease complexity, and improve
the overall retrieval performance.

\end{itemize}

 

%\subsection{Distributed and Parallel Architecture}
%
%\begin{enumerate}
% \item server run independently on single collections
% \item iterative document analysis, retrieval and merging of documents: fast user feedback
% \item an hyper-server is able to get results from multiple servers and aggregates their results
% \item hyper-server users can choose between servers he want to be queries
% \item documents are processed one by one
% \item if one retrieval unit fits to a minimum degree, whole document components are analyzed and result of that document get sent immediately to the requesting hyper-server
% \item hyper-server sorts results of single servers on-the-fly
% \item hyper-server merges results of single servers on-the-fly
% \item allows iterative analysis and fast user feedback
% \item result: serverID, docID (element), relevanceMeasure (block-graph), ...
% \item every communication conforms to http requests, automatic authentication of servers, MPI (message passing interface)?
% \item ...
%\end{enumerate}

 

 

 

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Chapter 8
%
% Last Change: 2005-11-08, to be reworked
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Summary}
\label{sec:system:summary}

This chapter presented X-DOSE, the XML-Document Retrieval Engine
developed. X-DOSE is based on a client-server architecture and
relies on a relational MySQL database storage. The systems' main
tasks, indexing, retrieval, and classification are described and
details about the interaction of the client and the server are
highlighted.
%
An easy-to-use graphical interface of the client enables the user to
access the server's functionality.
%
Index, retrieval, and classification requests are sent to the server
and processed in parallel. A multi-level preselection mechanism and
hierarchical clustering are applied to improve computational
performance during query processing.