"Science must begin with myths, and with the criticism of myths." Karl Popper
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Chapter 8 - Evaluation of the System
%
% last change: 16.08.2004
% correction hamid: xx.xx.2004
% correction prof: xx.xx.2004
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\aphorism{The outcome of any serious research can only be to make two questions grow where only one grew before.}{Thorstein Veblen}%
\chapter{Evaluation of X-DOSE}
\label{chapter:evaluation}%
This chapter is concerned with the evaluation of the system
described in the previous chapter. Based on an official data set
provided by INEX, several experiments were conducted that show the
effects and benefits of different search engine settings. Final
results are compared to a former X-DOSE version as well as to
similar systems of other INEX 2005 participants.
%----------------------------------------------------
%----------------------------------------------------
%----------------------------------------------------
\section{Experimental Settings}
This section presents the preliminaries and experimental settings
used for rating the performance of the system implemented. The
evaluation is based on the official data set provided by INEX in
2005\footnote{\url{http://inex.is.informatik.uni-duisburg.de/2005},
(10.10.2008)} (see Section~\ref{sec:sdr:evaluation}).
%
This data set includes a large document repository, a set of queries
or topics, and the corresponding query results handpicked by human
experts. A set of retrieval tasks and evaluation metrics used at
INEX 2005 enable a comparison of different systems operating on the
same data.
%----------------------------------------------------
%----------------------------------------------------
%----------------------------------------------------
\subsection{Document Repository}
% general dataset information
The latest version of the INEX document collection v1.9 (used in the
Ad Hoc Retrieval track and the Natural Language track) consists of
16.819 articles of the IEEE Computer Society's publications from the
field of computer science. The 764 MB collection is organized in 24
folders, each of them representing a particular journal (magazine or
transaction). Within these folders, the documents are further
subclassified according to their year of publication, ranging
from 1995 to 2004. All documents are structured according to a DTD
of about 700 lines of code, defining 178 different tags.
%
In total, the XML documents contain over 11 million document
components. On average, each document contains 687,46 elements and
reaches a nesting depth of 8,09.
% generic schema
In order to feed the collection to the system, the documents are
transformed into the generic XML format described in
Section~\ref{sec:generic_xml_schema}. The XSLT 2.0 stylesheet used
for that transformation consists of about 2.200 lines of code,
reflecting the structural richness of the documents. The generic
schema defines only 33 different tags, which is about 18,5\% of
the 178 tags defined by the original DTD.
%
% transformed document statistics
The average depth of the transformed documents is about 9,07, and
each document contains 737,53 elements on average. Since the
basic XML elements of the generic format (\texttt{DOC},
\texttt{SEC}, \texttt{FRA}) contain a synthetic level
(\texttt{METADATA}, \texttt{CONTENT}) that is not indexed, the
average nesting of the indexed structure is reduced to about one
half ($\sim 4,54$) and the average number of elements per document
decreases to about one third ($\sim 245,78$).
Table~\ref{eval:tab:corpus_statistics} summarizes the
statistics about the documents.
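%
As a quick plausibility check of these reductions, the ratios can be
recomputed from the averages stated above (rounded to two decimal
places):
%
\[
\frac{4{,}54}{9{,}07} \approx 0{,}50
\qquad \text{and} \qquad
\frac{245{,}78}{737{,}53} \approx 0{,}33 .
\]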
%
\begin{table}[ht]
\centering
\sffamily \footnotesize
\rowcolors{1}{tablerowcolorodd}{tablerowcoloreven}
\caption{Document repository statistics}
\label{eval:tab:corpus_statistics}
\begin{tabular}{Bcc}
%
\rowcolor{tableheadcolor} & \mcc{INEX 2005} & \mcc{Mapped INEX} \tabularnewline%
%
Number of documents & \multicolumn{2}{c}{16.819} \tabularnewline%
Storage requirement & 764 MB & 689 MB \tabularnewline%
Number of different tags & 178 & 33 \tabularnewline%
Number of components & $\sim$ 11.500.000 & $\sim$ 4.100.000 \tabularnewline%
Average components per document & 687,46 & 245,78 \tabularnewline%
Average nesting of a document & 8,09 & 4,54 \tabularnewline%
%
\end{tabular}
\end{table}
% document transformation
Using Java, SAX~\cite{url_sax}, and Saxon~\cite{url_saxon}, the
transformation process for the whole collection took about 45
minutes.
%
Special characters were retained and escaped by standardized escape
sequences defined by the W3C in the ISO8879 character
map\footnote{\url{http://www.w3.org/2003/entities/iso8879doc/overview.html}
(13.01.2009)}, resulting in UTF-8 conformant XML documents.
%
% structural corrections
Only 71 out of the 16.819 documents (0,42\%) needed manual
correction of the structure. In most cases, the corrections
flattened unnecessarily nested structures such as lists within
lists, tables within paragraphs, and paragraphs within paragraphs.
Only four corrections had to be carried out on the metadata,
removing paragraphs from title and keywords fields. All of these
corrections had to be conducted carefully because disturbing a
document's (sub)structure leads to a drop in retrieval
performance. For instance, removing a table from a paragraph (e.g.,
\texttt{/article/par/tab} becomes \texttt{/article/tab}) changes the
XPath (the position) of subsequent paragraphs (e.g.,
\texttt{/art[1]/par[3]} becomes \texttt{/art[1]/par[2]}).
% mapping difficulties
Due to the mapping of INEX documents onto the generic document
schema, some of the INEX queries cannot be evaluated exhaustively by
the system. This is because the generic format discards
layout-related information (e.g., \texttt{<b>}, \texttt{<emph>})
and synthetic elements (e.g., \texttt{<bdy>}, \texttt{<bm>}).
Other queries cannot be answered exactly: INEX topics
addressing \texttt{/article/bdy} or \texttt{/article} elements are
treated as equivalent to topics addressing the transformed
\texttt{/DOC} element.
\subsection{Topics}
% general topic types
INEX topics are of two types: Content-Only (CO) and
Content-And-Structure (CAS)~\cite{malik05overview}. The type of a
topic reflects the knowledge about the document structure in the
collection.
%
CO topics refer to queries of users who do not have insight into
the structure, or simply do not make use of it. Most users are of this
type.
%
The second type, CAS topics, includes structural knowledge of the
documents searched. This information is used as a means of
enhancing the precision of the results retrieved.
%
Based on the types, several subtypes focusing on extension and
interpretation aspects are defined in INEX 2005:
%
% subtypes
\begin{itemize}
%
\item \textbf{CO subtypes:} Investigating the usefulness of
structural hints in queries, CO topics are extended to
\abbrev{Content Only + Structure}{COS} topics. While pure CO topics
consist of content conditions only, COS topics formulate the same
query with structural constraints. This enables an evaluation of the
same topic with and without structural information across different
retrieval systems.
%
\item \textbf{CAS subtypes:} CAS topics contain conditions on the
content and the structure. Structural constraints include elements
that are searched (search path, support elements) and elements that
are retrieved (retrieval path, target elements). Each of these two
kinds of elements can be interpreted as either a strict (S) condition
(path must be matched exactly) or a vague (V) condition (path is
simply a hint). According to these interpretations, VVCAS, SVCAS,
VSCAS, and SSCAS topics are distinguished. The first letter refers to
the elements retrieved, and the second letter to the elements searched.
%
\end{itemize}
% topic results
All of the topics were created by the INEX 2005 participants, where
each party was asked to submit up to 6 candidate topics (3 CO/COS
and 3 CAS). The final set of 87 topics consisted of 40 CO (28 COS),
and 47 CAS topics. The complete list of INEX 2005 topics can be
found in the Appendix in Section~\ref{app:sec:inex_topics}.
% assessments
Result assessments of the topics were also carried out by the
participants. Each party was assigned two to three topics to
evaluate, and topics were assigned multiple times to cross-check
the assessments. Supported by the \abbrev{XML Retrieval Assessment
Interface}{X-RAI}, relevant topic results were highlighted
manually. After the assessment phase, each participant was granted
access to the complete set of topic assessments and to
the results submitted by other INEX participants.
% transformation
As the system operates on transformed INEX documents, the topics
were adapted to fit the generic document model. This step included
element path renaming (e.g., \texttt{article} elements became
\texttt{DOC} elements) and metadata inclusion (e.g.,
\texttt{//fm/au[about(.,Einstein)]} became
\texttt{/DOC[meta(AUTHOR,Einstein)]}, \texttt{.//yr<=2000} became
\texttt{/DOC[meta(YEAR,<=2000)]}).
%
% specific topic changes
%Topics 207 (\texttt{DOC and SAX}), 214 (\texttt{"adaptive learning"
%and "interactive learning" in education}), 215 (\texttt{Conference
%on Information and Knowledge Management CIKM}), 217
%(\texttt{user-centered design of web sites}), 231 (\texttt{markov
%chains in graph related algorithms}), 238 (\texttt{neural network
%algorithm for chess}), 240 (\texttt{Software quality control and
%measurement}), and 241 (\texttt{Single sign on + LDAP}) produced
%misleading results because of the terms \texttt{and}, \texttt{of},
%\texttt{in}, \texttt{for}, and \texttt{on}. Although these terms are
%not contained in the index (filtered by the stopword lists),
%metadata such as section titles or table captions often matched.
%Also, the special character \texttt{+} of topic 241 needs further
%query term processing (escaping) and was removed from the topic.
%
COS topic 212 was the only topic that included a structural count
constraint (\texttt{.//en > 3}). This condition on the structure had
to be removed because this type of constraint is not yet implemented
in X-DOSE.
\subsection{Retrieval Tasks}
In INEX 2005, the Ad Hoc retrieval track was concerned with the
evaluation of the topic results achieved by different retrieval
systems. A set of assumptions regarding the output of those systems
led to the definition of special retrieval tasks. According to the
query types, several retrieval strategies were distinguished in
INEX:
%For CAS tasks, all tasks were assumed to follow the Thorough strategy.
%
\begin{compactitem}
%
\item \textbf{CO.Thorough:} This task is considered the `basic'
retrieval task that returns all relevant components within the
collection. Overlap within the results is not a concern of this
task, which may lead to a large number of overlapping elements.
The main focus is on the ranking of the results.
%
\item \textbf{CO.Focused:} This strategy is supposed to return the
most exhaustive and most specific element along a single XPath within a
document. No overlapping elements are allowed in the result, targeting the
appropriate level of granularity. If parent and child elements are
equally relevant, the parent element is to be returned.
%
\item \textbf{CO.FetchBrowse:} The fetch and browse task combines
document retrieval and element retrieval strategies. It consists of
two phases: A first fetching phase ranks the documents according to
their relevance. In a second browsing phase, elements within a
document are compared to other elements within the same document.
According to these intra-document relevances, elements are ranked and
returned by the system.
%
%
\item \textbf{COS.Thorough:} Same task as the CO.Thorough strategy,
but considering constraints on the structure.
%
\item \textbf{COS.Focused:} Same task as the CO.Focused strategy,
but considering constraints on the structure.
%
\item \textbf{COS.FetchBrowse:} Same task as the CO.FetchBrowse strategy,
but considering constraints on the structure.
%
%
\item \textbf{VVCAS:} Following the thorough strategy, vague
matching of elements retrieved and vague matching of elements
searched is applied.
%
\item \textbf{SVCAS:} Following the thorough strategy, strict
matching of elements retrieved and vague matching of elements
searched is applied.
%
\item \textbf{VSCAS:} Following the thorough strategy, vague
matching of elements retrieved and strict matching of elements
searched is applied.
%
\item \textbf{SSCAS:} Following the thorough strategy, strict
matching of elements retrieved and strict matching of elements
searched is applied.
%
\end{compactitem}
% this work
Out of these tasks, X-DOSE is evaluated on the basis of CO.Thorough,
CO.Focused, COS.Thorough, and COS.Focused. FetchBrowse tasks were
not considered because X-DOSE does not implement that strategy.
%
Since X-DOSE processes structural conditions as strict filters to
improve computational retrieval performance, SSCAS is the
appropriate CAS strategy and the only one evaluated. Vague matching
of support and target elements was left open for further experiments.
\subsection{Evaluation Metrics}
In INEX 2005, two kinds of official metrics were introduced to
evaluate the performance of XML retrieval systems: a recall-oriented
measure at fixed ranks based on cumulated gain, and a
precision-oriented effort-precision/gain-recall measure. Both
measures are computed by
EvalJ\footnote{\url{http://evalj.sourceforge.net} (12.10.2008)}, an
open source evaluation software implemented for INEX. Since
performance values of other INEX 2005 systems are available for
these measures, X-DOSE is evaluated according to them as well.
\subsubsection{eXtended Cumulated Gain ($xCG$) Measures}
% CG introduction
Cumulated gain~\cite{jaervelin02cumulatedgain} measures reflect the
number of results expected among the number of results retrieved at
a fixed cutoff point. An extension of those metrics at INEX led to a
new set of \abbrev{eXtended Cumulated Gain}{xCG}
measures~\cite{kazai05evaluation}, making structured document
retrieval performance judgements more accurate. The $xCG$ measure is
defined as a vector of accumulated gain. The cumulated gain at a
given rank $i$ is computed as the sum of all relevance scores
$xG[j]$ up to that rank (Equation~\ref{eq:xcg_1}).
%
\begin{equation}\label{eq:xcg_1}
xCG[i] = \sum_{j=1}^{i} xG[j]
\end{equation}
%
% for each query
For each topic, the ideal gain vector $xI$ is derived from the
recall-base acquired in the topic assessment phase. The
corresponding accumulated ideal gain vector of the optimal results
is referred to as $xCI$.
%
The final normalized extended cumulated gain $nxCG$ score is given
by Equation~\ref{eq:xcg_2}. For any rank, $nxCG=1,0$ represents an
ideal result.
%
\begin{equation}\label{eq:xcg_2}
nxCG[i] = \frac{xCG[i]}{xCI[i]}
\end{equation}
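%
To illustrate both equations, consider a small hypothetical example
(the gain values are chosen for illustration only and are not taken
from the INEX assessments). Assume a run whose first four results
have the gains $xG = (0{,}5;~0;~1;~0)$, while the ideal gain vector is
$xI = (1;~0{,}5;~0;~0)$. Accumulating both vectors and dividing them
rank by rank gives
%
\begin{eqnarray*}
xCG  & = & (0{,}5;~0{,}5;~1{,}5;~1{,}5) \\
xCI  & = & (1;~1{,}5;~1{,}5;~1{,}5) \\
nxCG & \approx & (0{,}5;~0{,}33;~1;~1)
\end{eqnarray*}
%
In this sketch, the run matches the ideal cumulated gain only from
rank 3 onwards.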
% quantization function
The relevance of a component is computed based on its exhaustivity
($e \in \{0,1,2\}$)\footnote{Officially, INEX defined exhaustivity
values as $e \in \{?,0,1,2\}$, where $e = ?$ denotes elements judged
as `too small'. When the $nxCG$ measure is applied, these elements are
processed as $e = 0$~\cite[p.~17]{kazai05evaluation}.} and
specificity ($s \in [0,1]$) values assigned during the assessment.
According to these $(e,s)$ pairs, three quantization functions are
defined:
%
\begin{eqnarray}
quant_{strict}(e,s) & = &
\begin{cases}
1 & \text{if}~ e=2 ~\text{and}~ s=1\\
0 & \text{otherwise}
\end{cases}
\\
%
quant_{gen}(e,s) & = & e \cdot s\\
%
quant_{genLifted}(e,s) & = & (e+1) \cdot s
\end{eqnarray}
%
The last function, $quant_{genLifted}$, enables `too small' elements
to be considered as near-misses.
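%
As an illustration, the three functions can be evaluated for two
hypothetical assessments (the $(e,s)$ pairs below are assumed for
this example only): a highly exhaustive but only partly specific
component with $(e,s) = (2;~0{,}5)$, and a fully specific `too small'
component that is processed as $(e,s) = (0;~1)$.
%
\[
\begin{array}{lll}
quant_{strict}(2;~0{,}5) = 0, & quant_{gen}(2;~0{,}5) = 1, & quant_{genLifted}(2;~0{,}5) = 1{,}5 \\
quant_{strict}(0;~1) = 0, & quant_{gen}(0;~1) = 0, & quant_{genLifted}(0;~1) = 1
\end{array}
\]
%
In this sketch, only $quant_{genLifted}$ assigns a non-zero value to
the second component.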
% relevance computation of an element
These quantization functions are applied to compute the $xG[j]$
values used in Equation~\ref{eq:xcg_1}. According to the different
retrieval tasks, a set of relevance value functions, referred to as
$rv$, was defined.
%
% Thorough tasks
For the thorough retrieval tasks, the gain at rank $j$ is given by
%
\begin{equation}
xG[j] = rv(c_i) = quant((e,s)_i)
\end{equation}
%
where $c_i$ is the component at rank $j$, $quant$ is one of the
three quantization functions mentioned, and $(e,s)_i$ is the
assessed exhaustivity-specificity pair of $c_i$.
%
% Focused tasks
For the focused retrieval tasks, two aspects of structured document
retrieval results are considered: Near-misses (e.g., neighboring
paragraphs, container sections) and overlap (e.g., a paragraph and
its container section are both retrieved).
%
Near-misses reward retrieved non-ideal components that are
structurally related to ideal components. The set
of structurally related components consists of relevant components
(as per quantization function) that are not included in the ideal
result set.
%
Overlap is explicitly included in the relevance value function:
%
\begin{eqnarray}
xG[j] = rv(c_i) & = &
\begin{cases}
quant((e,s)_i) & \text{if}~ c_i ~\text{has not yet been seen}\\
(1-\alpha) \cdot quant((e,s)_i) & \text{if}~ c_i ~\text{has been fully seen}\\
\alpha \cdot \frac{\sum_{j=1}^{m} (rv(c_j) \cdot |c_j|)}{|c_i|}
+ (1-\alpha) \cdot quant((e,s)_i) & \text{if}~ c_i ~\text{has been partially seen~~~}
\end{cases}
\end{eqnarray}
%
Here, $m$ is the number of child components of $c_i$, $| \cdot |$ is
the length of an element in characters or words, and $\alpha \in [0,1]$
expresses the user's intolerance of redundant components in the
result. The higher the value of $\alpha$, the less interested the user
is in any overlapping result.
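%
A small worked example may clarify the third case (all numbers are
hypothetical and chosen for illustration only). Assume $\alpha =
0{,}5$ and a partially seen component $c_i$ of length $|c_i| = 100$
with $quant((e,s)_i) = 0{,}8$ and $m = 2$ child components, where
$rv(c_1) = 0{,}2$, $|c_1| = 40$, $rv(c_2) = 0{,}6$, and $|c_2| = 60$.
Then
%
\[
rv(c_i) = 0{,}5 \cdot \frac{0{,}2 \cdot 40 + 0{,}6 \cdot 60}{100}
        + (1 - 0{,}5) \cdot 0{,}8
        = 0{,}5 \cdot 0{,}44 + 0{,}4
        = 0{,}62 .
\]
%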
A normalization function $rv_{norm}$ ensures that the relevance
values collected by the components below an ideal node do not add up
to more than the relevance value of the ideal node itself.
%
\begin{equation}
xG[j] = rv_{norm}(c_i) = \min\Bigl( rv(c_i) ,~ rv(c_{ideal}) - \sum_{c_j \in S} rv(c_j) \Bigr)
\end{equation}
%
In the formula, $c_{ideal}$ is the ideal node that lies on the same
relevant path as $c_i$, and $S$ is the set of child nodes $c_j$ of
the ideal node that have already been seen.
%
Figure~\ref{eval:fig:metrics:nxcg} illustrates the behavior of the
normalization function. In the example, yellow nodes $c_i$ and $c_j$
are retrieved. Their relevance values are computed by the retrieval
system and given by $rv$. The ideal node to be retrieved is
$c_{ideal}$, which is not included in the result set. For
$c_{ideal}$, the human judgement of its relevance value is $rv=0{,}9$.
In the given scenario, the normalization function limits the
relevance value of node $c_i$ to
$rv_{norm}=\min(0{,}7;~0{,}9-(0{,}2+0{,}2+0{,}2)) = 0{,}3$, although the
relevance value computed for $c_i$ is $rv=0{,}7$.
%
\begin{figure}[ht]
\centering
\includegraphics[width=0.65\textwidth]{10_evaluation/figures/metrics_nxcg}
\caption{Normalization function $rv_{norm}$ of $nxCG$}
\label{eval:fig:metrics:nxcg}
\end{figure}
\subsubsection{Effort-Precision and Gain-Recall ($ep$-$gr$) Measures}
The $ep$-$gr$ measure~\cite{kazai05evaluation} reflects the effort
required of users to reach a given level of cumulated gain.
To this end, the given result ranking is compared to the ideal
ranking.
%
% effort precision
Formally, effort-precision $ep$ is given by
%
\begin{equation}
ep[r] = \frac{i_{ideal}}{i_{run}}
\end{equation}
%
where $i_{ideal}$ is the rank position at which the cumulated gain
of $r$ is reached by the cumulated ideal gain vector $xCI$, and
$i_{run}$ is the rank position at which the same cumulated gain is
reached by the system's cumulated gain vector $xCG$. A value of
$1,0$ indicates optimal performance.
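%
As a hypothetical illustration (values assumed for this example
only), let the cumulated ideal gain vector be $xCI =
(1;~1{,}5;~1{,}5;~1{,}5)$ and the cumulated gain vector of a run be
$xCG = (0{,}5;~0{,}5;~1{,}5;~1{,}5)$. A cumulated gain of $1{,}5$ is
reached at rank $i_{ideal} = 2$ by the ideal ranking and at rank
$i_{run} = 3$ by the run, so that
%
\[
ep = \frac{i_{ideal}}{i_{run}} = \frac{2}{3} \approx 0{,}67 .
\]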
% gain recall
Gain-recall, $gr$, is computed as the cumulated gain value divided
by the total cumulated gain achievable:
%
\begin{equation}
gr[i] = \frac{xCG[i]}{xCI[n]}
= \frac{ \sum_{j=1}^{i} xG[j] } { \sum_{j=1}^{n} xI[j] }
\end{equation}
%
where $n$ is the total number of documents in the recall base and
$xI[j]$ is the ideal gain vector.
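%
For instance, for a hypothetical run (values assumed for illustration
only) with the gain vector $xG = (0{,}5;~0;~1;~0)$ and an ideal gain
vector that sums to $xCI[n] = 1{,}5$, gain-recall at ranks 1 and 3 is
%
\[
gr[1] = \frac{0{,}5}{1{,}5} \approx 0{,}33
\qquad \text{and} \qquad
gr[3] = \frac{0{,}5 + 0 + 1}{1{,}5} = 1 .
\]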
%
%
% overall result
In $ep$-$gr$ graphs, effort-precision is plotted against gain-recall
(similar to traditional recall/precision graphs), providing a global
summary of a system's overall performance.
%----------------------------------------------------
%----------------------------------------------------
%----------------------------------------------------
\section{Results}
This section describes the experiments conducted and results
achieved. Nine different sets of experiments were performed to
evaluate several indexing strategies, X-DOSE parameter settings,
time measurements, performance improvements due to clustering, and
an overall comparison to similar systems.
%
The main focus of the experimental evaluation was put on the
following INEX retrieval tasks: CO.Thorough, CO.Focused,
COS.Thorough, COS.Focused, and SSCAS. For each retrieval task, the
complete set of CO topics (40), COS topics (28), and CAS topics (47)
was processed.
For better readability, experiments and discussions focus on the
strict $nxCG$ measure. The complete set of $nxCG$ evaluation
measures is included in the Appendix in
Section~\ref{app:sec:inex_eval}. Times for indexing and retrieval
are given relative to the maximum time needed within the
current set of tasks investigated. Since the experiments were run in
parallel on multiple virtual machines, measurements are meant
for relative comparison only. Relying on absolute times would be
misleading.
\subsection{Experiment I - Single-Term Index Performance}
The goal of the initial experiment was to identify the single-term
index that performs best. All twelve single-term indices (ST01 --
ST12, see Table~\ref{tab:system:maintained_single-term_indices})
were evaluated and compared to each other. This made it possible to
(1) measure the effect of each text analysis step independently, and to
(2) find an optimal overall configuration of analysis steps.
%
%
% configuration
For the tests, the complete set of CO topics was used. Query
parameters were fixed at $maxRes=1500$ (official threshold used at
INEX), $minSim=0,0$, $ci=0,2$, $gf=0,2$, and $rt=unfocused$
(CO.Thorough). Further, a static term space was presupposed in the
experiment.
%
For better readability,
Table~\ref{tab:system:maintained_single-term_indices} explaining the
single-term index configurations is replicated in
Table~\ref{tab:system:maintained_single-term_indices2}.
%
\begin{table}[ht]
\centering
\sffamily \footnotesize
\rowcolors{1}{tablerowcolorodd}{tablerowcoloreven}
\caption{Single-term indices maintained by the system}
\label{tab:system:maintained_single-term_indices2}
\begin{tabular}{llllll}
%
\rowcolor{tableheadcolor} \mcc{Index} & \mcc{Tokenizer} & \mcc{Tagger} & \mcc{Extractor} & \mcc{Stemmer} & \mcc{Stopword Filtering} \tabularnewline%
%
ST01 & SimpleTokenizer & - & all & - & - \tabularnewline%
ST02 & OpenNLPTokenizer & - & all & - & - \tabularnewline%
ST03 & JavaTok & - & all & - & - \tabularnewline%
%
ST04 & OpenNLPTokenizer & QTag & nouns, verbs & PorterStemmer & Fox (the best) \tabularnewline%
ST05 & JavaTok & QTag & nouns, verbs & PorterStemmer & Fox (the best) \tabularnewline%
%
ST06 & JavaTok & QTag & nouns, verbs & PorterStemmer & FS, CR, DS \tabularnewline%
ST07 & JavaTok & QTag & nouns, verbs, adjectives, adverbs & PorterStemmer & FS, CR, DS \tabularnewline%
ST08 & JavaTok & QTag & nouns, verbs, adjectives, adverbs & - & FS, CR, DS \tabularnewline%
%
ST09 & JavaTok & token types & valid words & PorterStemmer & FS, CR, DS \tabularnewline%
ST10 & JavaTok & token types & valid words & PorterStemmer & FS, CR \tabularnewline%
ST11 & JavaTok & token types & valid words & PorterStemmer & FS \tabularnewline%
ST12 & JavaTok & token types & valid words & PorterStemmer & -
%
\end{tabular}
\end{table}
%
The $nxCG$ results of each of the indices at ranks 10, 25, and 50
(official ranks used at INEX), are presented in
Table~\ref{eval:tab:exp1_nxcg}.
%
Note that in the $nxCG$ figures, these official ranks lie at very
small x-axis values (0,007, 0,017, and 0,033).
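%
These values are consistent with dividing each official rank by the
result threshold of 1.500 (this reading of the x-axis is an
assumption made here, inferred from the numbers given):
%
\[
\frac{10}{1500} \approx 0{,}007, \qquad
\frac{25}{1500} \approx 0{,}017, \qquad
\frac{50}{1500} \approx 0{,}033 .
\]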
%
\begin{table}[ht]
\centering
\sffamily \footnotesize
\rowcolors{1}{tablerowcolorodd}{tablerowcoloreven}
\caption{$nxCG$ of the CO.Thorough task (single-terms)}
\label{eval:tab:exp1_nxcg}
\begin{tabular}{lrlllllllll}
%
\rowcolor{tableheadcolor} & & \mmcc{3}{gen $nxCG$} & \mmcc{3}{strict $nxCG$} & \mmcc{3}{genLifted $nxCG$} \tabularnewline%
\rowcolor{tableheadcolor} \mcc{Index} & \mcc{Size} & \mcc{~~~~$[10]$}& \mcc{$[25]$}& \mcc{$[50]$} & \mcc{~~~~$[10]$}& \mcc{$[25]$}& \mcc{$[50]$} & \mcc{~~~~$[10]$}& \mcc{$[25]$}& \mcc{$[50]$} \tabularnewline%
%
ST01 & 1.272.103 & ~~~~0,1524 & 0,1649 & 0,1660 & ~~~~0,0115 & 0,0442 & 0,0514 & ~~~~0,1745 & 0,1841 & 0,1809 \tabularnewline%
ST02 & 741.688 & ~~~~0,1544 & 0,1760 & 0,1850 & ~~~~0,0231 & 0,0387 & 0,0760 & ~~~~0,1795 & 0,1998 & 0,2034 \tabularnewline%
ST03 & 711.832 & ~~~~0,1403 & 0,1660 & 0,1744 & ~~~~0,0115 & 0,0277 & 0,0474 & ~~~~0,1666 & 0,1889 & 0,1924 \tabularnewline%
%
ST04 & 479.644 & ~~~~0,1559 & 0,1575 & 0,1504 & ~~~~0,0197 & 0,0399 & 0,0502 & ~~~~0,1831 & 0,1783 & 0,1672 \tabularnewline%
ST05 & 460.992 & ~~~~0,1415 & 0,1490 & 0,1426 & ~~~~0,0158 & 0,0328 & 0,0417 & ~~~~0,1647 & 0,1693 & 0,1591 \tabularnewline%
%
ST06 & 461.205 & ~~~~0,1506 & 0,1519 & 0,1406 & ~~~~0,0197 & 0,0328 & 0,0487 & ~~~~0,1763 & 0,1719 & 0,1569 \tabularnewline%
ST07 & 516.742 & ~~~~0,1567 & 0,1570 & 0,1524 & ~~~~0,0197 & 0,0381 & 0,0454 & ~~~~0,1806 & 0,1760 & 0,1686 \tabularnewline%
ST08 & 555.777 & ~~~~0,1478 & 0,1575 & 0,1537 & ~~~~0,0115 & 0,0308 & 0,0373 & ~~~~0,1767 & 0,1815 & 0,1719 \tabularnewline%
%
ST09 & 302.204 & ~~~~0,1435 & 0,1334 & 0,1378 & ~~~~0,0235 & 0,0274 & 0,0447 & ~~~~0,1606 & 0,1509 & 0,1530 \tabularnewline%
ST10 & 302.248 & ~~~~0,1462 & 0,1396 & 0,1378 & ~~~~0,0235 & 0,0304 & 0,0482 & ~~~~0,1673 & 0,1591 & 0,1541 \tabularnewline%
ST11 & 302.559 & ~~~~0,1459 & 0,1418 & 0,1385 & ~~~~0,0235 & 0,0320 & 0,0431 & ~~~~0,1680 & 0,1608 & 0,1542 \tabularnewline%
ST12 & 302.755 & ~~~~0,1430 & 0,1422 & 0,1401 & ~~~~0,0197 & 0,0313 & 0,0416 & ~~~~0,1649 & 0,1615 & 0,1540 \tabularnewline%
%
\end{tabular}
\end{table}
%
In the following, each of the text analysis steps is investigated in
more detail.
\subsubsection{Tokenizer Performance}
% tokenizer findings
The performance of tokenization was investigated by comparing the
results achieved by the indices ST01, ST02, ST03, ST04, and ST05.
Figure~\ref{fig:exp1_1_tokenizer} shows the $nxCG$ curves and the
corresponding processing times.
%
The x-axis in Figure~\ref{fig:exp1_1_tokenizer_strict} denotes the
percentage of result components achieving the current cumulated
gain. For the INEX tasks, only the 1.500 top-ranked results are
considered. The y-axis is the normalized cumulated gain at a given
number of results.
%
\begin{figure}[ht]
\centering
\subfloat[strict $nxCG$ Performance]{\includegraphics[width=0.8\textwidth]{10_evaluation/figures/exp1_1_tokenizer_strict}\label{fig:exp1_1_tokenizer_strict}}
\\%
\subfloat[Indexing times]{\includegraphics[width=0.47\textwidth]{10_evaluation/figures/exp1_1_tokenizer_index}\label{fig:exp1_1_tokenizer_index}}
\quad%
\subfloat[Retrieval times]{\includegraphics[width=0.47\textwidth]{10_evaluation/figures/exp1_1_tokenizer_retrieval}\label{fig:exp1_1_tokenizer_retrieval}}
%
\caption{Tokenizer performance}
\label{fig:exp1_1_tokenizer}
\end{figure}
% description of the figure
% performance
Retrieval results achieved by the indices ST01 (SimpleTokenizer, no
tagger), ST02 (OpenNLPTokenizer, no tagger), and ST03 (JavaTok, no
tagger) differed only to a small degree. None of the three
tokenization approaches consistently outperformed the
others. Taking tagging into account, ST04 (OpenNLPTokenizer, QTag)
returned marginally better results than ST05 (JavaTok, QTag).
%
% term space
For the indices (ST01, ST02, ST03, ST04, and ST05), the number of
index terms differed considerably (1.272.103, 741.688, 711.832,
479.644, and 460.992). Compared to ST01, the term space was reduced by
41,7\% in ST02, 44,0\% in ST03, 62,3\% in ST04, and 63,8\% in ST05.
This reduction led to a considerable speed-up of indexing and
retrieval (Figures~\ref{fig:exp1_1_tokenizer_index} and
\ref{fig:exp1_1_tokenizer_retrieval}). For indexing time, 100\%
was equivalent to 75,8 hours; for retrieval time, 100\% was
equivalent to 141,8 hours. In subsequent experiments, the longest
indexing (resp. retrieval) time is likewise referred to as 100\%
indexing (resp. retrieval) time.
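%
The reduction percentages follow directly from the index sizes listed
in Table~\ref{eval:tab:exp1_nxcg}. As a sketch for ST02 and ST05
(rounded to one decimal place):
%
\[
1 - \frac{741.688}{1.272.103} \approx 0{,}417 = 41{,}7\%
\qquad \text{and} \qquad
1 - \frac{460.992}{1.272.103} \approx 0{,}638 = 63{,}8\% .
\]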
% comparing JavaTok and OpenNLPTokenizer
Interestingly, the OpenNLPTokenizer slightly outperformed JavaTok
in terms of retrieval quality when tagging was
involved, as the ST04 and ST05 curves show. There are three
reasons for this behavior: First, JavaTok does not
include the complete set of punctuation marks used in the documents.
This leads to tokens that include unknown markers at the beginning
or at the end of token strings. Such tokens are not typed as
`proper' words and are, consequently, not included in the index.
Second, OpenNLPTokenizer segments hyphenated words into multiple
separate tokens (e.g., \texttt{element-based} is split into
\texttt{element} and \texttt{based}). This behavior increases the
index size (compare ST02 and ST03), but makes it possible to retrieve
additional XML components due to more general index terms. However,
precision may be lost because strings such as `\texttt{element
based}' and `\texttt{element-based}' match completely. Finally, QTag
is not optimized for preprocessed JavaTok inputs. Multi-tokens and
token types are not supported by QTag.
%
Both the OpenNLPTokenizer and JavaTok operated at the same speed
during indexing and retrieval (ST02 and ST03). When tagging was
incorporated in the processing (ST04 and ST05), especially during
indexing, JavaTok-preprocessed inputs sped up QTag processing
considerably.
% consequences and discussion
Both tokenization shortcomings, the treatment of punctuation marks
and the splitting of hyphenated words, could be remedied in
JavaTok by extending the charset definition and the rule base. Due
to limited time, the indices were not re-computed with such a
JavaTok update.
%
In the experiments, JavaTok (ST03 and ST05) achieved $nxCG$
performance similar to that of the OpenNLPTokenizer (ST02 and ST04).
Additionally, JavaTok sped up indexing and retrieval. This
processing improvement, the capability of extending and tailoring
the functionality of JavaTok, the smaller indices, and the
speed-up of tagging favored JavaTok tokenization.
\subsubsection{Tagger Performance}
After tokenization, the influence of tagging on the retrieval
performance was measured. Figure~\ref{fig:exp1_2_tagger} shows the
effect of applying QTag to identify syntactically relevant
information that was used to represent the content.
%
\begin{figure}[ht]
\centering
\subfloat[strict $nxCG$ Performance]{\includegraphics[width=0.8\textwidth]{10_evaluation/figures/exp1_2_tagger_strict}\label{fig:exp1_2_tagger_strict}}
\\%
\subfloat[Indexing times]{\includegraphics[width=0.47\textwidth]{10_evaluation/figures/exp1_2_tagger_index}\label{fig:exp1_2_tagger_index}}
\quad%
\subfloat[Retrieval times]{\includegraphics[width=0.47\textwidth]{10_evaluation/figures/exp1_2_tagger_retrieval}\label{fig:exp1_2_tagger_retrieval}}
%
\caption{Tagger performance}
\label{fig:exp1_2_tagger}
\end{figure}
% description of the figure
A comparison of ST03 (no tagging, using all terms), ST07 (tagging,
using nouns, verbs, adjectives, and adverbs), and ST09 (no tagging,
using token types) showed that QTag did not improve the $nxCG$
retrieval performance. Both indices, ST03 and ST09, clearly
outperformed ST07.
%
% term space and times
The size of the term space did not influence the $nxCG$ performance
directly: ST03 (711.832 terms), ST07 (516.742 terms), and ST09
(302.204 terms). The fastest indexing times (100\% was equivalent to
20,1 hours) and retrieval times (100\% was equivalent to 70,0 hours)
were achieved for index ST09.
% comparison
Since tagging (re-)assigns syntactic tags based on word chains, any
kind of token might be tagged as a noun if it occurs in a certain
context. Thus, the quality of tag-based term extraction drops as soon
as word categories are not identified correctly. A brief investigation
of the ST07 index terms showed that over 37.000 terms did not even
start with a letter or a number. Thus, a large subset of the terms
indexed seemed to be inappropriate or irrelevant for searching. In
contrast, index terms extracted on the basis of JavaTok-assigned token
types (ST09) fitted the notion of `meaningful' words better.
%
% consequences and discussion
As a result of this experiment, the small index size, the better
index terms, and the faster processing clearly favored index ST09
without tagging.
\subsubsection{Extractor Performance}
The effect of including different (syntactic) categories of words in
the index is shown in Figure~\ref{fig:exp1_3_extractor}. ST06
selected nouns and verbs only, ST07 considered nouns, verbs,
adjectives, and adverbs, and ST09 extracted all `proper' words
identified via token types.
%
\begin{figure}[ht]
\centering
\subfloat[strict $nxCG$ Performance]{\includegraphics[width=0.8\textwidth]{10_evaluation/figures/exp1_3_extractor_strict}\label{fig:exp1_3_extractor_strict}}
\\%
\subfloat[Indexing times]{\includegraphics[width=0.47\textwidth]{10_evaluation/figures/exp1_3_extractor_index}\label{fig:exp1_3_extractor_index}}
\quad%
\subfloat[Retrieval times]{\includegraphics[width=0.47\textwidth]{10_evaluation/figures/exp1_3_extractor_retrieval}\label{fig:exp1_3_extractor_retrieval}}
%
\caption{Extractor performance}
\label{fig:exp1_3_extractor}
\end{figure}
% description of the figure
Taking into account only the first results returned (0--0,12\%),
performance was nearly identical. As in previous results,
performance was independent of the size of the index: ST06 (461.205
terms), ST07 (516.742 terms), and ST09 (302.204 terms). Again, ST09
achieved the fastest indexing time (100\% was equivalent to 34,6
hours) and retrieval time (100\% was equivalent to 48,0 hours).
%
As expected, index ST06 resulted in the worst performance. This is
because adjectives and adverbs containing important information were
neglected. Clearly better results were achieved by ST07, which
included all terms of ST06 plus adjectives and adverbs. Both indices
were outperformed by ST09.
%
Looking at the curves for ST06 and ST07, retrieval performance
increased linearly with the number of index terms extracted.
However, only correct terms relevant to user queries were helpful.
Thus, the considerably smaller index ST09 achieved the best results.
% consequences and discussion
This experiment confirmed that the selection of index terms based on
JavaTok token types is a promising procedure. $nxCG$ performance and
fast processing during indexing and retrieval support that argument.
\subsubsection{Stemmer Performance}
A performance comparison of index ST07 (with stemming) and index
ST08 (without stemming) is given in Figure~\ref{fig:exp1_4_stemmer}.
%
\begin{figure}[ht]
\centering
\subfloat[strict $nxCG$ Performance]{\includegraphics[width=0.8\textwidth]{10_evaluation/figures/exp1_4_stemmer_strict}\label{fig:exp1_4_stemmer_strict}}
\\%
\subfloat[Indexing times]{\includegraphics[width=0.47\textwidth]{10_evaluation/figures/exp1_4_stemmer_index}\label{fig:exp1_4_stemmer_index}}
\quad%
\subfloat[Retrieval times]{\includegraphics[width=0.47\textwidth]{10_evaluation/figures/exp1_4_stemmer_retrieval}\label{fig:exp1_4_stemmer_retrieval}}
%
\caption{Stemmer performance}
\label{fig:exp1_4_stemmer}
\end{figure}
% description of the figure
The performance of both indices was nearly the same. In the
experiment, stemming seemed to achieve slightly better retrieval
results. The term space of ST07 contained 516.742 terms, while the
term space of ST08 included 555.777 terms. Since stemming is a
lightweight process that runs very fast, there was nearly no
difference in the indexing times (100\% was equivalent to 19,4
hours) and retrieval times (100\% was equivalent to 48,0 hours).
% consequences and discussion
One may conclude that stemming is an appropriate procedure. It
improved information retrieval by reducing the size of the
vocabulary and by providing concept-like index terms. At the same
time, it ran very fast and incurred nearly no additional
computational costs. During retrieval, it sped up processing and
reduced query answer times.
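%
The vocabulary reduction can be quantified from the term space sizes
reported above (the percentage is derived here and was not part of
the measurements):
\[
  555.777 - 516.742 = 39.035 \quad\mbox{terms}, \qquad
  \frac{39.035}{555.777} \approx 7,0\%.
\]
Stemming thus removed roughly one in fourteen terms of the ST08
term space.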
\subsubsection{Stopword Filtering Performance}
Finally, the effect of filtering stopwords was investigated.
Figure~\ref{fig:exp1_5_stoplist} summarizes the results of the
experimental runs using the indices ST05 (QTag, Fox's stopwords),
ST06 (QTag, functional FS, content-related CR, and domain-specific
DS stopwords), ST09 (no tagger, FS, CR, DS), ST10 (no tagger, FS,
CR), ST11 (no tagger, FS), and ST12 (no tagger, no stoplist).
%
%
For each comparison (ST05 and ST06, ST09 to ST12), only the stopword
filtering step was varied.
%
\begin{figure}[ht]
\centering
\subfloat[strict $nxCG$ Performance]{\includegraphics[width=0.8\textwidth]{10_evaluation/figures/exp1_5_stoplist_strict}\label{fig:exp1_5_stoplist_strict}}
\\%
\subfloat[Indexing times]{\includegraphics[width=0.47\textwidth]{10_evaluation/figures/exp1_5_stoplist_index}\label{fig:exp1_5_stoplist_index}}
\quad%
\subfloat[Retrieval times]{\includegraphics[width=0.47\textwidth]{10_evaluation/figures/exp1_5_stoplist_retrieval}\label{fig:exp1_5_stoplist_retrieval}}
%
\caption{Stopword filtering performance}
\label{fig:exp1_5_stoplist}
\end{figure}
% description of the figure
First, one notes that both stopword lists, the stopwords proposed by
Fox (ST05) and the stopwords generated in this work (ST06), performed
nearly the same. Also, the index sizes of ST05 (460.992 terms) and
ST06 (461.205 terms) could be considered equal. Both ST05 and ST06
included tagging. As a consequence, indexing times (100\% was
equivalent to 34,6 hours) and retrieval times (100\% was equivalent
to 70,5 hours) were considerably higher than those of the other
indices.
Iterative exclusion of different stopword layers led to the indices
ST09, ST10, ST11, and ST12. As expected, the size of the index
increased slightly with each layer excluded (302.204, 302.248,
302.559, and 302.755 terms). These minimal changes had only a slight
impact on index computation and retrieval. However, the large number
of comparisons during retrieval favored an approach based on the
ST09 index.
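%
The growth of the term space from one index to the next follows
directly from the sizes listed above (the differences and the
percentage are derived here for illustration):
\[
  302.248 - 302.204 = 44, \qquad
  302.559 - 302.248 = 311, \qquad
  302.755 - 302.559 = 196.
\]
In total, ST12 contained $44 + 311 + 196 = 551$ terms more than
ST09, which corresponds to $\frac{551}{302.204} \approx 0,18\%$ of
the ST09 term space.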
% consequences and discussion
Summarizing the stopword filtering procedure, the experiment showed
that an appropriate selection of index terms in advance reduced the
effect of stopword filtering. Content-related and domain-specific
stopwords did not influence the retrieval performance to a high
degree.
% conclusion
According to these experiments, ST09 turned out to achieve best
retrieval results while reducing the computational complexity and
processing times. Hence, subsequent experiments were conducted using
ST09 as the best performing single-term index.
\FloatBarrier
\subsection{Experiment II - Multi-Term Index Performance}
Continuing the previous experiment, multi-term index performance
metrics of MT01, MT02, MT03, MT04, and their combination MT were
evaluated.
%
MT refers to the performance achieved by combining all four
multi-term indices using equal weights ($\frac{1}{4}=0,25$). None of
the initial query parameters was changed.
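%
Assuming that the combination is the plain arithmetic mean of the
per-index relevance scores, as the equal weights suggest, it can be
sketched as follows. The notation $\mathit{rel}_{X}(c)$ is
introduced here for illustration only and denotes the relevance of a
document component $c$ computed on index $X$:
\[
  \mathit{rel}_{\mathrm{MT}}(c) = \frac{1}{4}\,\Bigl(
    \mathit{rel}_{\mathrm{MT01}}(c) + \mathit{rel}_{\mathrm{MT02}}(c) +
    \mathit{rel}_{\mathrm{MT03}}(c) + \mathit{rel}_{\mathrm{MT04}}(c)
  \Bigr).
\]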
%
For better readability,
Table~\ref{tab:system:maintained_multi-term_indices} explaining the
multi-term indices is repeated in
Table~\ref{tab:system:maintained_multi-term_indices2}.
%
\begin{table}[ht]
\centering
\sffamily \footnotesize
\rowcolors{1}{tablerowcolorodd}{tablerowcoloreven}
\caption{Multi-term indices maintained by the system}
\label{tab:system:maintained_multi-term_indices2}
\begin{tabular}{lp{13cm}}
%
\rowcolor{tableheadcolor} \mcc{Index} & \mcc{Description} \tabularnewline%
%
MT01 & JavaTok-based composite nouns of arbitrary length ($\sim 340.000$ terms) \tabularnewline%
MT02 & JavaTok-based named entities of arbitrary length ($\sim 65.000$ terms) \tabularnewline%
MT03 & JavaTok-based formulaic speech of arbitrary length ($\sim 245.000$ terms) \tabularnewline%
MT04 & JavaTok-based full forms of acronyms of arbitrary length ($\sim 17.000$ terms)
%
\end{tabular}
\end{table}
%
Table~\ref{eval:tab:exp2_nxcg} summarizes the results of this
experiment.
%
\begin{table}[ht]
\centering
\sffamily \footnotesize
\rowcolors{1}{tablerowcolorodd}{tablerowcoloreven}
\caption{$nxCG$ of the CO.Thorough task (multi-terms)}
\label{eval:tab:exp2_nxcg}
\begin{tabular}{lrlllllllll}
%
\rowcolor{tableheadcolor} & & \mmcc{3}{gen $nxCG$} & \mmcc{3}{strict $nxCG$} & \mmcc{3}{genLifted $nxCG$} \tabularnewline%
\rowcolor{tableheadcolor} \mcc{Index} & \mcc{Size} & \mcc{~~~~$[10]$}& \mcc{$[25]$}& \mcc{$[50]$} & \mcc{~~~~$[10]$}& \mcc{$[25]$}& \mcc{$[50]$} & \mcc{~~~~$[10]$}& \mcc{$[25]$}& \mcc{$[50]$} \tabularnewline%
%
MT01 & 318.116 & ~~~~0,0653 & 0,0492 & 0,0606 & ~~~~0,0077 & 0,0031 & 0,0015 & ~~~~0,0765 & 0,0615 & 0,0702 \tabularnewline%
MT02 & 50.950 & ~~~~0,0816 & 0,0744 & 0,0803 & ~~~~0,0173 & 0,0158 & 0,0213 & ~~~~0,1031 & 0,0923 & 0,0961 \tabularnewline%
MT03 & 213.183 & ~~~~0,0541 & 0,0435 & 0,0388 & ~~~~0,0000 & 0,0000 & 0,0024 & ~~~~0,0561 & 0,0462 & 0,0397 \tabularnewline%
MT04 & 15.603 & ~~~~0,0661 & 0,0606 & 0,0536 & ~~~~0,0055 & 0,0070 & 0,0133 & ~~~~0,0716 & 0,0659 & 0,0563 \tabularnewline%
%
MT & 597.852 & ~~~~0,0603 & 0,0644 & 0,0751 & ~~~~0,0077 & 0,0046 & 0,0165 & ~~~~0,0822 & 0,0825 & 0,0887 \tabularnewline%
%
\end{tabular}
\end{table}
% table explanation
Since MT is a combination of all other multi-term indices, the
(theoretical) size of the term space was given by the sum of all
other term space sizes.
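%
Using the sizes listed in Table~\ref{eval:tab:exp2_nxcg}, this sum
can be verified directly:
\[
  318.116 + 50.950 + 213.183 + 15.603 = 597.852.
\]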
%
As expected, the performance of multi-term indices was much lower
than the performance achieved by single-term indices (see
Table~\ref{eval:tab:exp1_nxcg} and the $nxCG$ scales of
Figures~\ref{fig:exp1_1_tokenizer} to~\ref{fig:exp1_5_stoplist}).
%
This was because multi-terms are less frequent and do not allow
partial matching of query terms.
%
Figure~\ref{fig:exp2_multiterms} presents the $nxCG$ performance
achieved.
%
\begin{figure}[ht]
\centering
\subfloat[strict $nxCG$ Performance]{\includegraphics[width=0.8\textwidth]{10_evaluation/figures/exp2_multiterms_strict}\label{fig:exp2_multiterms_strict}}
\\%
\subfloat[Indexing times]{\includegraphics[width=0.47\textwidth]{10_evaluation/figures/exp2_multiterms_index}\label{fig:exp2_multiterms_index}}
\quad%
\subfloat[Retrieval times]{\includegraphics[width=0.47\textwidth]{10_evaluation/figures/exp2_multiterms_retrieval}\label{fig:exp2_multiterms_retrieval}}
%
\caption{Multi-term index performance}
\label{fig:exp2_multiterms}
\end{figure}
% times
Indexing and retrieval times correlated with the sizes of the
indices. For MT, the theoretical indexing time (100\% was equivalent
to 822,6 hours) and retrieval time (100\% was equivalent to 22,9
hours) are given by the cumulated times of all multi-term indices.
% discussion
The evaluation showed that some categories of multi-terms were more
useful to information retrieval than others. For instance, named
entities (MT02) and full forms of acronyms (MT04) turned out to be
better index terms than composite nouns (MT01) and formulaic speech
(MT03). Interestingly, the term spaces of MT02 and MT04 were much
smaller than the term spaces of MT01 and MT03.
%
Most astonishing was the fact that MT02 performed equally well or
even slightly better than the combination of all multi-terms
together. This suggests that in the INEX topics named entities are
the most important type of multi-terms. Looking at the gen and
genLifted graphs in the Appendix (see
Section~\ref{app:sec:inex_eval_exp2}), only the complete set of the
multi-term indices summed up to MT.
% result
The benefit of each index depends on the user queries. In the INEX
topics, named entities and acronyms were frequently used, while
composite nouns and formulaic speech were not. One reason is that
the topics were constructed by domain experts who searched for
specific pieces of information, and named entities and acronyms are
more likely to be used to express such special information needs.
However, this may not be true for common users querying more general
topics. Thus, this work relies on a combination of all multi-term
indices.
\subsection{Experiment III - Combined Single-Term and Multi-Term Index Performance}
In a third experiment, a combination of the best performing
single-term index ST09 and the multi-term index MT (consisting of
MT01, MT02, MT03, and MT04) was studied. This approach was called
TOP.
%
Again, none of the initial query parameters was changed. As before,
the overall relevance of a document component was averaged over the
relevances computed for each index separately using equal weights
($\frac{1}{5}=0,2$).
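%
Under the assumption that this averaging is a plain arithmetic mean
over the five participating indices, the TOP relevance of a document
component $c$ can be sketched as follows, where
$\mathit{rel}_{X}(c)$ is a notation introduced here for illustration
and denotes the relevance of $c$ computed on index $X$:
\[
  \mathit{rel}_{\mathrm{TOP}}(c) = \frac{1}{5}\,\Bigl(
    \mathit{rel}_{\mathrm{ST09}}(c) +
    \mathit{rel}_{\mathrm{MT01}}(c) + \mathit{rel}_{\mathrm{MT02}}(c) +
    \mathit{rel}_{\mathrm{MT03}}(c) + \mathit{rel}_{\mathrm{MT04}}(c)
  \Bigr).
\]
In this reading, the single-term index contributes one fifth of the
overall score and the four multi-term indices together contribute
four fifths.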
%
Table~\ref{eval:tab:exp3_nxcg} summarizes the results.
%
\begin{table}[ht]
\centering
\sffamily \footnotesize
\rowcolors{1}{tablerowcolorodd}{tablerowcoloreven}
\caption{$nxCG$ of the CO.Thorough task (single-terms and multi-terms)}
\label{eval:tab:exp3_nxcg}
\begin{tabular}{llllllllll}
%
\rowcolor{tableheadcolor} & \mmcc{3}{gen $nxCG$} & \mmcc{3}{strict $nxCG$} & \mmcc{3}{genLifted $nxCG$} \tabularnewline%
\rowcolor{tableheadcolor} \mcc{Index} & \mcc{$[10]$}& \mcc{$[25]$}& \mcc{$[50]$} & \mcc{~~~~$[10]$}& \mcc{$[25]$}& \mcc{$[50]$} & \mcc{~~~~$[10]$}& \mcc{$[25]$}& \mcc{$[50]$} \tabularnewline%
%
ST09 & 0,1435 & 0,1334 & 0,1378 & ~~~~0,0235 & 0,0274 & 0,0447 & ~~~~0,1606 & 0,1509 & 0,1530 \tabularnewline%
%
MT & 0,0603 & 0,0644 & 0,0751 & ~~~~0,0077 & 0,0046 & 0,0165 & ~~~~0,0822 & 0,0825 & 0,0887 \tabularnewline%
%
TOP & 0,1524 & 0,1415 & 0,1503 & ~~~~0,0231 & 0,0240 & 0,0424 & ~~~~0,1804 & 0,1660 & 0,1718 \tabularnewline%
%
\end{tabular}
\end{table}
The results in the table show that a combination of single-terms and
multi-terms achieved better results than each of the indices
separately. Only among the first 8\% of the results retrieved, the
strict $nxCG$ performance of the ST09 index was minimally higher than
the performance of the TOP index. This was not upheld for the gen
and genLifted $nxCG$ curves, where in both cases TOP achieved best
results.
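%
To illustrate the size of these differences, the relative changes of
TOP over ST09 at rank cut-off $[10]$ can be computed from
Table~\ref{eval:tab:exp3_nxcg} (the percentages are derived here and
rounded to one decimal place):
\[
  \frac{0,1524}{0,1435} - 1 \approx +6,2\% \;\mbox{(gen)}, \qquad
  \frac{0,0231}{0,0235} - 1 \approx -1,7\% \;\mbox{(strict)}, \qquad
  \frac{0,1804}{0,1606} - 1 \approx +12,3\% \;\mbox{(genLifted)}.
\]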
%
In Figure~\ref{fig:exp3_combination}, the performance curves of the
three indices are depicted.
%
\begin{figure}[ht]
\centering
\subfloat[strict $nxCG$ Performance]{\includegraphics[width=0.8\textwidth]{10_evaluation/figures/exp3_combination_strict}\label{fig:exp3_combination_strict}}
\\%
\subfloat[Indexing times]{\includegraphics[width=0.47\textwidth]{10_evaluation/figures/exp3_combination_index}\label{fig:exp3_combination_index}}
\quad%
\subfloat[Retrieval times]{\includegraphics[width=0.47\textwidth]{10_evaluation/figures/exp3_combination_retrieval}\label{fig:exp3_combination_retrieval}}
%
\caption{Combined single-term and multi-term index performance}
\label{fig:exp3_combination}
\end{figure}
% discussion
As in the previous experiment, indexing time (100\% was equivalent
to 843,3 hours) for the index TOP was cumulated from the indexing
times of ST09 and MT. Since each index is compared sequentially,
retrieval time (100\% was equivalent to 54,6 hours) using the TOP
index took, as expected, nearly as long as the retrieval of ST09 and
MT together.
%
The figure also shows that the contribution of the single-term index
was much higher than that of the multi-term index. However, using
additional multi-terms in combination with single-terms increased
retrieval performance. Subsequent experiments were, therefore,
conducted using the TOP index.
%\FloatBarrier
\subsection{Experiment IV - Content and Structure}
This experiment focused on the impact of structural constraints on
the quality of the retrieval results. Query parameters remained
unchanged at $maxRes=1500$, $minSim=0,0$, $ci=0,2$, and $gf=0,2$.
The INEX tasks CO.Thorough, CO.Focused, COS.Thorough, COS.Focused,
and SSCAS were evaluated and compared to each other. In all cases,
the TOP index was used for retrieval.
%
The results are given in Table~\ref{eval:tab:exp4_nxcg}.
%
\begin{table}[ht]
\centering
\sffamily \footnotesize
\rowcolors{1}{tablerowcolorodd}{tablerowcoloreven}
\caption{$nxCG$ of CO, COS, and SSCAS}
\label{eval:tab:exp4_nxcg}
\begin{tabular}{llllllllll}
%
\rowcolor{tableheadcolor} & \mmcc{3}{gen $nxCG$} & \mmcc{3}{strict $nxCG$} & \mmcc{3}{genLifted $nxCG$} \tabularnewline%
\rowcolor{tableheadcolor} \mcc{Index} & \mcc{$[10]$}& \mcc{$[25]$}& \mcc{$[50]$} & \mcc{~~~~$[10]$}& \mcc{$[25]$}& \mcc{$[50]$} & \mcc{~~~~$[10]$}& \mcc{$[25]$}& \mcc{$[50]$} \tabularnewline%
%
CO.Thorough & 0,1524 & 0,1415 & 0,1503 & ~~~~0,0231 & 0,0240 & 0,0424 & ~~~~0,1804 & 0,1660 & 0,1718 \tabularnewline%
%
CO.Focused & 0,1350 & 0,1285 & 0,1225 & ~~~~0,0192 & 0,0231 & 0,0234 & ~~~~0,1449 & 0,1380 & 0,1368 \tabularnewline%
%
COS.Thorough & 0,0977 & 0,0846 & 0,0852 & ~~~~0,0118 & 0,0118 & 0,0235 & ~~~~0,1046 & 0,0897 & 0,0878 \tabularnewline%
%
COS.Focused & 0,1037 & 0,0876 & 0,0957 & ~~~~0,0059 & 0,0071 & 0,0225 & ~~~~0,1084 & 0,0890 & 0,0957 \tabularnewline%
%
SSCAS & 0,1932 & 0,2857 & 0,3365 & ~~~~0,4000 & 0,3739 & 0,3389 & ~~~~0,2063 & 0,3011 & 0,3470 \tabularnewline%
%
\end{tabular}
\end{table}
The performance indicators of the different INEX tasks varied
remarkably. The CO topics obtained 20\% of the maximum cumulated
gain possible. The worst results were achieved for the COS topics
($<15\%$). The best results were computed for the complex SSCAS
topics, reaching up to 40\% of $nxCG$. For both tasks, CO and COS,
the thorough strategy turned out to perform better than the focused
strategy.
%
\begin{figure}[ht]
\centering
\subfloat[CO]{\includegraphics[width=0.47\textwidth]{10_evaluation/figures/exp4_CO_strict}\label{fig:exp4_CO_strict}}
% \quad%
\subfloat[COS]{\includegraphics[width=0.47\textwidth]{10_evaluation/figures/exp4_COS_strict}\label{fig:exp4_COS_strict}}
\\%
% \quad%
\subfloat[SSCAS]{\includegraphics[width=0.47\textwidth]{10_evaluation/figures/exp4_CAS_strict}\label{fig:exp4_CAS_strict}}
%
% \\%
\subfloat[Retrieval times]{\includegraphics[width=0.47\textwidth]{10_evaluation/figures/exp4_retrieval}\label{fig:exp4_retrieval}}
%
\caption{Strict $nxCG$ Performance}
\label{fig:exp4_structure_content}
\end{figure}
The figures show that focused retrieval was not as successful as
thorough retrieval. Especially for the CO tasks, the difference was
considerable. Taking the complete result sets of 1500 results per
topic into account, the CO task achieved 24\% of the total cumulated
gain possible.
%
The performance of the COS tasks showed a similar behavior. Thorough
retrieval outperformed focused retrieval. However, the difference
was not as large as for the CO tasks. In general, the performance of
the COS task was quite low.
%
The most complex topics, the SSCAS retrieval task, achieved markedly
better results. The complete SSCAS results reached about 75\% of the
total cumulated gain possible.
% times
As indicated by Figure~\ref{fig:exp4_retrieval}, result computation
for CO topics took longest. In the figure, the average retrieval
times per topic for each of the tasks are given. 100\% retrieval
time was equivalent to 1,3 hours. These processing times are
explained by the computational complexity of the on-the-fly weight
computation. In a production system, these infeasible retrieval
times can be avoided by storing pre-computed term weights
redundantly in the database.
%
Since X-DOSE implements structural constraints as filter criteria,
results for COS and SSCAS topics -- although more complex -- were
computed three times faster.
\subsection{Experiment V - Static Term Space versus Dynamic Term Spaces}
The differences between applying a single static term space and
multiple dynamic term spaces are described in
Section~\ref{sec:representation}. This experiment examined the
impact of that choice on the retrieval performance. As in
Experiment~IV, all INEX tasks were evaluated applying the same query
parameters.
%
The results are summarized in Table~\ref{eval:tab:exp5_nxcg}.
%
\begin{table}[ht]
\centering
\sffamily \footnotesize
\rowcolors{1}{tablerowcolorodd}{tablerowcoloreven}
\caption{Static term space versus dynamic term spaces}
\label{eval:tab:exp5_nxcg}
\begin{tabular}{lclllllllll}
%
\rowcolor{tableheadcolor} & & \mmcc{3}{gen $nxCG$} & \mmcc{3}{strict $nxCG$} & \mmcc{3}{genLifted $nxCG$} \tabularnewline%
\rowcolor{tableheadcolor} \mcc{Index} & \mcc{Type} & \mcc{~~~~$[10]$}& \mcc{$[25]$}& \mcc{$[50]$} & \mcc{~~~~$[10]$}& \mcc{$[25]$}& \mcc{$[50]$} & \mcc{~~~~$[10]$}& \mcc{$[25]$}& \mcc{$[50]$} \tabularnewline%
%
CO.Thorough & static & ~~~~0,1524 & 0,1415 & 0,1503 & ~~~~0,0231 & 0,0240 & 0,0424 & ~~~~0,1804 & 0,1660 & 0,1718 \tabularnewline%
CO.Thorough & dynamic & ~~~~0,1524 & 0,1415 & 0,1503 & ~~~~0,0231 & 0,0240 & 0,0424 & ~~~~0,1804 & 0,1660 & 0,1718 \tabularnewline%
%
CO.Focused & static & ~~~~0,1350 & 0,1285 & 0,1225 & ~~~~0,0192 & 0,0231 & 0,0234 & ~~~~0,1449 & 0,1380 & 0,1368 \tabularnewline%
CO.Focused & dynamic & ~~~~0,1350 & 0,1285 & 0,1225 & ~~~~0,0192 & 0,0231 & 0,0234 & ~~~~0,1449 & 0,1380 & 0,1368 \tabularnewline%
%
COS.Thorough & static & ~~~~0,0977 & 0,0846 & 0,0852 & ~~~~0,0118 & 0,0118 & 0,0235 & ~~~~0,1046 & 0,0897 & 0,0878 \tabularnewline%
COS.Thorough & dynamic & ~~~~0,0977 & 0,0846 & 0,0852 & ~~~~0,0118 & 0,0118 & 0,0235 & ~~~~0,1046 & 0,0897 & 0,0878 \tabularnewline%
%
COS.Focused & static & ~~~~0,1037 & 0,0876 & 0,0957 & ~~~~0,0059 & 0,0071 & 0,0225 & ~~~~0,1084 & 0,0890 & 0,0957 \tabularnewline%
COS.Focused & dynamic & ~~~~0,1037 & 0,0876 & 0,0957 & ~~~~0,0059 & 0,0071 & 0,0225 & ~~~~0,1084 & 0,0890 & 0,0957 \tabularnewline%
%
SSCAS & static & ~~~~0,1932 & 0,2857 & 0,3365 & ~~~~0,4000 & 0,3739 & 0,3389 & ~~~~0,2063 & 0,3011 & 0,3470 \tabularnewline%
SSCAS & dynamic & ~~~~0,1932 & 0,2857 & 0,3365 & ~~~~0,4000 & 0,3739 & 0,3389 & ~~~~0,2063 & 0,3011 & 0,3470 \tabularnewline%
%
\end{tabular}
\end{table}
Unexpectedly, dynamic term spaces had no impact at all on the
retrieval performance. Although term weights of the document
component representations differed (because of different $ief$
values), the ranking of the components retrieved remained the same.
The explanation for that behavior was the large number of leaf
components and, consequently, the amount of text within the leaf
nodes (\texttt{FRA} elements). Due to this fact, dynamic term spaces
of leaf components performed nearly equal to the complete static
term space. Since term spaces of components higher in the hierarchy
contain the term spaces of descendant components, this effect was
even reinforced for dynamic term spaces at intermediate levels.
%
\begin{figure}[ht]
\centering
\includegraphics[width=0.6\textwidth]{10_evaluation/figures/exp5_retrieval}
\caption{Retrieval times of static and dynamic term spaces}
\label{fig:exp5_retrieval}
\end{figure}
The retrieval times (100\% retrieval time was equivalent to 78,5
hours), as given in Figure~\ref{fig:exp5_retrieval}, indicated that
dynamic term space computation consumed up to 45\% more time.
Learning from that experience, subsequent experiments were conducted
using a single static term space.
\subsection{Experiment VI - The Effect of Content Importance $ci$}
In order to find an optimal parameter setting for the importance of
content relative to metadata, several $ci$ settings were tested.
Since the $ci$ factor combines the impact of metadata and content
relevance, this experiment was conducted on each of the CO, COS, and
CAS tasks. Previous query parameters remained unchanged
($maxRes=1500$, $minSim=0,0$, and $gf=0,2$).
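%
For orientation, the role of $ci$ can be sketched as a linear
interpolation between content and metadata relevance. This form is
an assumption made here for illustration; it is merely consistent
with $ci=0,0$ ignoring the content and $ci=1,0$ ignoring the
metadata. The symbols $\mathit{rel}_{\mathit{content}}$ and
$\mathit{rel}_{\mathit{meta}}$ are hypothetical names for the two
partial relevance scores of a document component $c$:
\[
  \mathit{rel}(c) = ci \cdot \mathit{rel}_{\mathit{content}}(c)
    + (1 - ci) \cdot \mathit{rel}_{\mathit{meta}}(c),
  \qquad 0 \leq ci \leq 1.
\]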
%
Experimental results are provided in Table~\ref{eval:tab:exp6_nxcg}.
%
\begin{table}[ht]
\centering
\sffamily \footnotesize
\rowcolors{1}{tablerowcolorodd}{tablerowcoloreven}
\caption{Impact of content importance $ci$}
\label{eval:tab:exp6_nxcg}
\begin{tabular}{lllllllllll}
%
\rowcolor{tableheadcolor} & & \mmcc{3}{gen $nxCG$} & \mmcc{3}{strict $nxCG$} & \mmcc{3}{genLifted $nxCG$} \tabularnewline%
\rowcolor{tableheadcolor} \mcc{Index} & \mcc{$ci$} & \mcc{~~~~$[10]$}& \mcc{$[25]$}& \mcc{$[50]$} & \mcc{~~~~$[10]$}& \mcc{$[25]$}& \mcc{$[50]$} & \mcc{~~~~$[10]$}& \mcc{$[25]$}& \mcc{$[50]$} \tabularnewline%
%
CO.Thorough & 0,0 & ~~~~0,0509 & 0,0303 & 0,0209 & ~~~~0,0000 & 0,0000 & 0,0000 & ~~~~0,0520 & 0,0302 & 0,0199 \tabularnewline%
CO.Thorough & 0,2 & ~~~~0,1524 & 0,1415 & 0,1503 & ~~~~0,0231 & 0,0240 & 0,0424 & ~~~~0,1804 & 0,1660 & 0,1718 \tabularnewline%
CO.Thorough & 0,5 & ~~~~0,1451 & 0,1401 & 0,1498 & ~~~~0,0231 & 0,0270 & 0,0440 & ~~~~0,1767 & 0,1656 & 0,1729 \tabularnewline%
CO.Thorough & 0,8 & ~~~~0,1522 & 0,1400 & 0,1541 & ~~~~0,0231 & 0,0270 & 0,0440 & ~~~~0,1845 & 0,1669 & 0,1777 \tabularnewline%
CO.Thorough & 1,0 & ~~~~0,1519 & 0,1395 & 0,1523 & ~~~~0,0231 & 0,0286 & 0,0440 & ~~~~0,1841 & 0,1665 & 0,1757 \tabularnewline%
%
CO.Focused & 0,0 & ~~~~0,0573 & 0,0357 & 0,0245 & ~~~~0,0000 & 0,0000 & 0,0000 & ~~~~0,0553 & 0,0329 & 0,0215 \tabularnewline%
CO.Focused & 0,2 & ~~~~0,1350 & 0,1285 & 0,1225 & ~~~~0,0192 & 0,0231 & 0,0234 & ~~~~0,1449 & 0,1380 & 0,1368 \tabularnewline%
CO.Focused & 0,5 & ~~~~0,1312 & 0,1267 & 0,1225 & ~~~~0,0192 & 0,0277 & 0,0264 & ~~~~0,1461 & 0,1402 & 0,1393 \tabularnewline%
CO.Focused & 0,8 & ~~~~0,1368 & 0,1268 & 0,1250 & ~~~~0,0192 & 0,0277 & 0,0264 & ~~~~0,1506 & 0,1399 & 0,1418 \tabularnewline%
CO.Focused & 1,0 & ~~~~0,1339 & 0,1250 & 0,1233 & ~~~~0,0231 & 0,0277 & 0,0264 & ~~~~0,1503 & 0,1373 & 0,1400 \tabularnewline%
%
COS.Thorough & 0,0 & ~~~~0,0218 & 0,0122 & 0,0150 & ~~~~0,0000 & 0,0000 & 0,0012 & ~~~~0,0238 & 0,0129 & 0,0148 \tabularnewline%
COS.Thorough & 0,2 & ~~~~0,0977 & 0,0846 & 0,0852 & ~~~~0,0118 & 0,0118 & 0,0235 & ~~~~0,1046 & 0,0897 & 0,0878 \tabularnewline%
COS.Thorough & 0,5 & ~~~~0,0986 & 0,0851 & 0,0865 & ~~~~0,0118 & 0,0118 & 0,0247 & ~~~~0,1066 & 0,0905 & 0,0894 \tabularnewline%
COS.Thorough & 0,8 & ~~~~0,1007 & 0,0836 & 0,0862 & ~~~~0,0118 & 0,0118 & 0,0247 & ~~~~0,1085 & 0,0890 & 0,0893 \tabularnewline%
COS.Thorough & 1,0 & ~~~~0,1007 & 0,0824 & 0,0864 & ~~~~0,0118 & 0,0118 & 0,0247 & ~~~~0,1085 & 0,0875 & 0,0895 \tabularnewline%
%
COS.Focused & 0,0 & ~~~~0,0257 & 0,0156 & 0,0190 & ~~~~0,0000 & 0,0000 & 0,0012 & ~~~~0,0261 & 0,0150 & 0,0169 \tabularnewline%
COS.Focused & 0,2 & ~~~~0,1037 & 0,0876 & 0,0957 & ~~~~0,0059 & 0,0071 & 0,0225 & ~~~~0,1084 & 0,0890 & 0,0957 \tabularnewline%
COS.Focused & 0,5 & ~~~~0,1056 & 0,0919 & 0,0989 & ~~~~0,0059 & 0,0118 & 0,0272 & ~~~~0,1117 & 0,0933 & 0,0986 \tabularnewline%
COS.Focused & 0,8 & ~~~~0,1078 & 0,0896 & 0,0990 & ~~~~0,0059 & 0,0118 & 0,0272 & ~~~~0,1136 & 0,0915 & 0,0986 \tabularnewline%
COS.Focused & 1,0 & ~~~~0,1051 & 0,0895 & 0,0992 & ~~~~0,0059 & 0,0118 & 0,0272 & ~~~~0,1101 & 0,0914 & 0,0989 \tabularnewline%
%
SSCAS & 0,0 & ~~~~0,0000 & 0,0439 & 0,0495 & ~~~~0,0000 & 0,0300 & 0,0550 & ~~~~0,0190 & 0,0639 & 0,0606 \tabularnewline%
SSCAS & 0,2 & ~~~~0,1932 & 0,2857 & 0,3365 & ~~~~0,4000 & 0,3739 & 0,3389 & ~~~~0,2063 & 0,3011 & 0,3470 \tabularnewline%
SSCAS & 0,5 & ~~~~0,3361 & 0,2857 & 0,3365 & ~~~~0,4000 & 0,3739 & 0,3388 & ~~~~0,3492 & 0,3011 & 0,3470 \tabularnewline%
SSCAS & 0,8 & ~~~~0,3363 & 0,3013 & 0,3371 & ~~~~0,4000 & 0,3839 & 0,3389 & ~~~~0,3543 & 0,3178 & 0,3479 \tabularnewline%
SSCAS & 1,0 & ~~~~0,2525 & 0,3013 & 0,3371 & ~~~~0,4000 & 0,3839 & 0,3389 & ~~~~0,2705 & 0,3178 & 0,3479 \tabularnewline%
%
\end{tabular}
\end{table}
The results in the table show that higher values of $ci$ generated
better results than lower $ci$ values. This indicates that the
content of a document component was more important than its
metadata information. The explanation for that is simple: The
majority of the components retrieved, the \texttt{FRA} elements, did
not contain metadata that was queried explicitly (except tables and
figures).
%
Figure~\ref{fig:exp6_ci} shows the impact of the content importance
factor $ci$ on the retrieval performance. Colors are used to
distinguish the different $ci$ values.
%
\begin{figure}[ht]
\centering
\subfloat[CO.Thorough]{\includegraphics[width=0.47\textwidth]{10_evaluation/figures/exp6_ci_CO_Thorough_strict}\label{fig:exp6_ci_CO_Thorough_strict}}
\subfloat[CO.Focused]{\includegraphics[width=0.47\textwidth]{10_evaluation/figures/exp6_ci_CO_Focused_strict}\label{fig:exp6_ci_CO_Focused_strict}}
\\%
\subfloat[COS.Thorough]{\includegraphics[width=0.47\textwidth]{10_evaluation/figures/exp6_ci_COS_Thorough_strict}\label{fig:exp6_ci_COS_Thorough_strict}}
\subfloat[COS.Focused]{\includegraphics[width=0.47\textwidth]{10_evaluation/figures/exp6_ci_COS_Focused_strict}\label{fig:exp6_ci_COS_Focused_strict}}
\\%
\subfloat[SSCAS]{\includegraphics[width=0.47\textwidth]{10_evaluation/figures/exp6_ci_SSCAS_strict}\label{fig:exp6_ci_SSCAS_strict}}
%
\caption{Strict $nxCG$ Performance of $ci$}
\label{fig:exp6_ci}
\end{figure}
For all tasks, low $ci$ values led to worse results.
%
Best performance was achieved for $ci=0,8$ and $ci=1,0$. Therefore, the
$ci$ parameter was fixed at $0,8$ for subsequent experiments. This
value was assumed to give best results, because much weight is put on the
similarity of contents while metadata similarity is not ignored
completely.
\subsection{Experiment VII - The Effect of the Generality Factor $gf$}
The generality factor $gf$ controls how strongly the relevances of
the ancestor components stated in the user query influence the
relevance of the component itself. $gf=0,0$ means that the relevance
of a component is independent of the relevances of its ancestors.
$gf=1,0$ means that a component's relevance is given by the
relevance of its ancestors only. The query parameters for the experiment
were fixed at $maxRes=1500$, $minSim=0,0$, and $ci=0,8$. Since CO
topics do not contain ancestor components (a single, unchained
subquery), these tasks (thorough and focused) were skipped. For the
COS and SSCAS tasks five different $gf$ values in the range between
0 and 1 were evaluated.
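%
The two boundary cases described above are consistent with a linear
interpolation, which is sketched here as an assumption rather than
as the exact formula implemented. The symbols
$\mathit{rel}_{\mathit{own}}$ and $\mathit{rel}_{\mathit{anc}}$ are
hypothetical names for the relevance of a component $c$ with respect
to its own subquery and for the relevance of its ancestor
components, respectively:
\[
  \mathit{rel}(c) = (1 - gf) \cdot \mathit{rel}_{\mathit{own}}(c)
    + gf \cdot \mathit{rel}_{\mathit{anc}}(c),
  \qquad 0 \leq gf \leq 1.
\]
For $gf=0,0$ only the first term remains, and for $gf=1,0$ only the
second.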
%
Table~\ref{eval:tab:exp7_nxcg} presents the results of this
experiment.
%
\begin{table}[ht]
\centering
\sffamily \footnotesize
\rowcolors{1}{tablerowcolorodd}{tablerowcoloreven}
\caption{Impact of the generality factor $gf$}
\label{eval:tab:exp7_nxcg}
\begin{tabular}{lllllllllll}
%
\rowcolor{tableheadcolor} & & \mmcc{3}{gen $nxCG$} & \mmcc{3}{strict $nxCG$} & \mmcc{3}{genLifted $nxCG$} \tabularnewline%
\rowcolor{tableheadcolor} \mcc{Index} & \mcc{$gf$} & \mcc{~~~~$[10]$}& \mcc{$[25]$}& \mcc{$[50]$} & \mcc{~~~~$[10]$}& \mcc{$[25]$}& \mcc{$[50]$} & \mcc{~~~~$[10]$}& \mcc{$[25]$}& \mcc{$[50]$} \tabularnewline%
%
COS.Thorough & 0,0 & ~~~~0,1007 & 0,0836 & 0,0872 & ~~~~0,0118 & 0,0118 & 0,0247 & ~~~~0,1085 & 0,0890 & 0,0903 \tabularnewline%
COS.Thorough & 0,2 & ~~~~0,1007 & 0,0836 & 0,0862 & ~~~~0,0118 & 0,0118 & 0,0247 & ~~~~0,1085 & 0,0890 & 0,0893 \tabularnewline%
COS.Thorough & 0,5 & ~~~~0,1034 & 0,0837 & 0,0862 & ~~~~0,0118 & 0,0118 & 0,0247 & ~~~~0,1120 & 0,0893 & 0,0893 \tabularnewline%
COS.Thorough & 0,8 & ~~~~0,1007 & 0,0832 & 0,0856 & ~~~~0,0118 & 0,0118 & 0,0247 & ~~~~0,1069 & 0,0877 & 0,0884 \tabularnewline%
COS.Thorough & 1,0 & ~~~~0,0789 & 0,0551 & 0,0617 & ~~~~0,0118 & 0,0071 & 0,0176 & ~~~~0,0849 & 0,0592 & 0,0652 \tabularnewline%
%
COS.Focused & 0,0 & ~~~~0,1078 & 0,0896 & 0,0990 & ~~~~0,0059 & 0,0118 & 0,0272 & ~~~~0,1136 & 0,0915 & 0,0986 \tabularnewline%
COS.Focused & 0,2 & ~~~~0,1078 & 0,0896 & 0,0990 & ~~~~0,0059 & 0,0118 & 0,0272 & ~~~~0,1136 & 0,0915 & 0,0986 \tabularnewline%
COS.Focused & 0,5 & ~~~~0,1104 & 0,0898 & 0,0979 & ~~~~0,0059 & 0,0118 & 0,0272 & ~~~~0,1171 & 0,0917 & 0,0976 \tabularnewline%
COS.Focused & 0,8 & ~~~~0,1091 & 0,0895 & 0,0973 & ~~~~0,0059 & 0,0118 & 0,0272 & ~~~~0,1144 & 0,0910 & 0,0969 \tabularnewline%
COS.Focused & 1,0 & ~~~~0,0854 & 0,0594 & 0,0725 & ~~~~0,0059 & 0,0071 & 0,0129 & ~~~~0,0902 & 0,0612 & 0,0711 \tabularnewline%
%
SSCAS & 0,0 & ~~~~0,3504 & 0,2967 & 0,3419 & ~~~~0,4250 & 0,3739 & 0,3389 & ~~~~0,3634 & 0,3112 & 0,3524 \tabularnewline%
SSCAS & 0,2 & ~~~~0,3363 & 0,3013 & 0,3371 & ~~~~0,4000 & 0,3839 & 0,3389 & ~~~~0,3543 & 0,3178 & 0,3479 \tabularnewline%
SSCAS & 0,5 & ~~~~0,3504 & 0,2912 & 0,3408 & ~~~~0,4250 & 0,3739 & 0,3539 & ~~~~0,3682 & 0,3110 & 0,3540 \tabularnewline%
SSCAS & 0,8 & ~~~~0,3646 & 0,3115 & 0,3484 & ~~~~0,4500 & 0,4039 & 0,3689 & ~~~~0,3825 & 0,3350 & 0,3647 \tabularnewline%
SSCAS & 1,0 & ~~~~0,2365 & 0,2657 & 0,2756 & ~~~~0,0750 & 0,1300 & 0,0950 & ~~~~0,2737 & 0,2926 & 0,2964 \tabularnewline%
%
\end{tabular}
\end{table}
As Table~\ref{eval:tab:exp7_nxcg} shows, the minimum $gf=0,0$ and
the maximum $gf=1,0$ values were not optimal. Best results were
achieved by $gf=0,5$ for COS topics and $gf=0,8$ for CAS topics.
% figure
Figure~\ref{fig:exp7_gf} plots the strict $nxCG$ performance of the
different $gf$ values. In the figure, colors encode the different
$gf$ values used.
%
\begin{figure}[ht]
\centering
\subfloat[COS.Thorough]{\includegraphics[width=0.47\textwidth]{10_evaluation/figures/exp7_gf_COS_Thorough_strict}\label{fig:exp7_gf_COS_Thorough_strict}}
\subfloat[COS.Focused]{\includegraphics[width=0.47\textwidth]{10_evaluation/figures/exp7_gf_COS_Focused_strict}\label{fig:exp7_gf_COS_Focused_strict}}
\\%
\subfloat[SSCAS]{\includegraphics[width=0.47\textwidth]{10_evaluation/figures/exp7_gf_SSCAS_strict}\label{fig:exp7_gf_SSCAS_strict}}
%
\caption{Strict $nxCG$ Performance of $gf$}
\label{fig:exp7_gf}
\end{figure}
%
The experiment showed that the relevance of ancestor components had
only a minor impact on a component's relevance. This was because many
of the topics addressed components directly without specifying
ancestor components. Such topics were unaffected by the $gf$
parameter. Out of the 28 COS and 47 CAS topics, only 12 COS and 37
CAS topics specified container elements explicitly. Of course, some
topics included ancestor relationships implicitly; for example,
\texttt{//SEC} addressed both \texttt{/DOC/SEC} and
\texttt{/DOC/SEC/SEC} components. On the other hand, container
components were mostly specified by their structure alone, without
any content restrictions (e.g., \texttt{/DOC/SEC}).
%
In the experiment, the maximum value of $gf=1,0$ led to poor
performance. The results for the other $gf$ values were close to each
other. Generally, lower values, which put more emphasis on the
component's own relevance, achieved better performance than higher
ones. The small differences indicated that a factor smaller than
$1,0$ did not influence the result much. Based on these results,
$gf=0,5$ was chosen for COS topics and $gf=0,8$ for CAS topics.
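%
The choice can be retraced from Table~\ref{eval:tab:exp7_nxcg}. The
following Python sketch is only an illustration: it assumes that gen
$nxCG[10]$ serves as the selection criterion, which is not stated
explicitly above, and the names \texttt{GF\_VALUES},
\texttt{GEN\_NXCG\_AT\_10}, and \texttt{best\_gf} are hypothetical.
The values are copied from the table, with decimal commas written as
points.
%
\begin{verbatim}
# gen nxCG[10] per task, one entry per gf value (copied from the table).
GF_VALUES = [0.0, 0.2, 0.5, 0.8, 1.0]
GEN_NXCG_AT_10 = {
    "COS.Thorough": [0.1007, 0.1007, 0.1034, 0.1007, 0.0789],
    "COS.Focused":  [0.1078, 0.1078, 0.1104, 0.1091, 0.0854],
    "SSCAS":        [0.3504, 0.3363, 0.3504, 0.3646, 0.2365],
}

def best_gf(scores):
    """Return the gf value with the highest score (first wins on ties)."""
    best_index = max(range(len(scores)), key=lambda i: scores[i])
    return GF_VALUES[best_index]

# Assumed criterion: maximise gen nxCG[10] separately for each task.
BEST = {task: best_gf(s) for task, s in GEN_NXCG_AT_10.items()}
\end{verbatim}
%
Under this assumption, the sketch yields $gf=0,5$ for both COS tasks
and $gf=0,8$ for SSCAS.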
\subsection{Experiment VIII - INEX 2005 Comparison}
The results achieved in this work were compared to other INEX 2005
participants using the same documents, topics, and evaluation
metrics.
%
At INEX 2005, a former version of the X-DOSE system participated
(see~\cite{hassler05searching}). In order to point out the progress
of the X-DOSE development over the last years, the former X-DOSE'05
performance is shown and compared to the current X-DOSE'09 version.
% description
For the comparison, the best performing X-DOSE'09 parameter settings
were selected for the CO tasks ($ci=0,8$; $gf=0,8$, although $gf$ has
no effect there), the COS tasks ($ci=0,8$, $gf=0,5$), and the SSCAS
task ($ci=0,8$, $gf=0,8$). Table~\ref{eval:tab:exp8_inex} summarizes
the results of X-DOSE'09 and X-DOSE'05.
%
\begin{table}[ht]
\centering
\sffamily \footnotesize
\rowcolors{1}{tablerowcolorodd}{tablerowcoloreven}
\caption{Progress of the X-DOSE development}
\label{eval:tab:exp8_inex}
\begin{tabular}{lllllllllll}
%
\rowcolor{tableheadcolor} & & \mmcc{3}{gen $nxCG$} & \mmcc{3}{strict $nxCG$} & \mmcc{3}{genLifted $nxCG$} \tabularnewline%
\rowcolor{tableheadcolor} \mcc{Task} & \mcc{System} & \mcc{~~~~$[10]$}& \mcc{$[25]$}& \mcc{$[50]$} & \mcc{~~~~$[10]$}& \mcc{$[25]$}& \mcc{$[50]$} & \mcc{~~~~$[10]$}& \mcc{$[25]$}& \mcc{$[50]$} \tabularnewline%
%
CO.Thorough & X-DOSE'05 & 0,1486 & 0,1229 & 0,1085 & ~~~~0,0115 & 0,0320 & 0,0407 & ~~~~0,1592 & 0,1320 & 0,1141 \tabularnewline%
CO.Thorough & X-DOSE'09 & 0,1522 & 0,1400 & 0,1541 & ~~~~0,0231 & 0,0270 & 0,0440 & ~~~~0,1845 & 0,1669 & 0,1777 \tabularnewline%
%
CO.Focused & X-DOSE'05 & 0,1247 & 0,0913 & 0,0819 & ~~~~0,0160 & 0,0112 & 0,0117 & ~~~~0,1283 & 0,0926 & 0,0785 \tabularnewline%
CO.Focused & X-DOSE'09 & 0,1368 & 0,1268 & 0,1250 & ~~~~0,0192 & 0,0277 & 0,0264 & ~~~~0,1506 & 0,1399 & 0,1418 \tabularnewline%
%
COS.Thorough & X-DOSE'05 & 0,1036 & 0,0889 & 0,0719 & ~~~~0,0000 & 0,0183 & 0,0312 & ~~~~0,1077 & 0,0907 & 0,0709 \tabularnewline%
COS.Thorough & X-DOSE'09 & 0,1034 & 0,0837 & 0,0862 & ~~~~0,0118 & 0,0118 & 0,0247 & ~~~~0,1120 & 0,0893 & 0,0893 \tabularnewline%
%
COS.Focused & X-DOSE'05 & 0,1216 & 0,0875 & 0,0827 & ~~~~0,0000 & 0,0212 & 0,0365 & ~~~~0,1206 & 0,0821 & 0,0732 \tabularnewline%
COS.Focused & X-DOSE'09 & 0,1104 & 0,0898 & 0,0979 & ~~~~0,0059 & 0,0118 & 0,0272 & ~~~~0,1171 & 0,0917 & 0,0976 \tabularnewline%
%
SSCAS & X-DOSE'05 & 0,1672 & 0,1494 & 0,1548 & ~~~~0,3500 & 0,3578 & 0,3828 & ~~~~0,1781 & 0,1641 & 0,1654 \tabularnewline%
SSCAS & X-DOSE'09 & 0,3646 & 0,3115 & 0,3484 & ~~~~0,4500 & 0,4039 & 0,3689 & ~~~~0,3825 & 0,3350 & 0,3647 \tabularnewline%
%
\end{tabular}
\end{table}
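As the table shows, X-DOSE'09 outperformed X-DOSE'05 on the CO tasks,
with strict $nxCG[25]$ for CO.Thorough as the only exception. On the
COS tasks, the results were mixed: X-DOSE'05 was slightly better at
several cut-offs, whereas X-DOSE'09 led at $[50]$ for the gen and
genLifted measures. For the SSCAS task, the improvements of X-DOSE'09
roughly doubled the gen and genLifted $nxCG$ values.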
%
Figure~\ref{fig:exp8_performance} plots the corresponding $nxCG$
curves of X-DOSE'05, X-DOSE'09, and all participating INEX'05
systems.
%
\begin{figure}[ht]
\centering
\subfloat[CO.Thorough]{\includegraphics[width=0.47\textwidth]{10_evaluation/figures/exp8_CO_Thorough_strict}\label{fig:exp8_CO_Thorough_strict}}
\subfloat[CO.Focused]{\includegraphics[width=0.47\textwidth]{10_evaluation/figures/exp8_CO_Focused_strict}\label{fig:exp8_CO_Focused_strict}}
\\%
\subfloat[COS.Thorough]{\includegraphics[width=0.47\textwidth]{10_evaluation/figures/exp8_COS_Thorough_strict}\label{fig:exp8_COS_Thorough_strict}}
\subfloat[COS.Focused]{\includegraphics[width=0.47\textwidth]{10_evaluation/figures/exp8_COS_Focused_strict}\label{fig:exp8_COS_Focused_strict}}
\\%
\subfloat[SSCAS]{\includegraphics[width=0.47\textwidth]{10_evaluation/figures/exp8_SSCAS_strict}\label{fig:exp8_SSCAS_strict}}
%
\caption{Strict $nxCG$ Performance at INEX 2005}
\label{fig:exp8_performance}
\end{figure}
%
% former versus newer version
All graphs show that for all retrieval tasks X-DOSE'09 performed
clearly better than X-DOSE'05. For the first 5\%-10\% of the
retrieval results, $nxCG$ performance lay within a narrow margin.
The more results were included, the more the cumulated gain measure
differed between the two systems. Taking the complete number of
retrieval results into account, X-DOSE'09 outperformed X-DOSE'05
clearly.
% comparison to other system
In Figure~\ref{fig:exp8_performance}, gray curves denote performance
profiles of other systems competing at INEX. Compared to other
systems, X-DOSE was outperformed on the CO and COS tasks. In
contrast, X-DOSE outperformed most of the other systems on the SSCAS
task.
Tables~\ref{eval:tab:exp8_inex_top_CO_Thorough}--\ref{eval:tab:exp8_inex_top_SSCAS}
show the performance measures of the top 10 ranked INEX'05 systems.
%
% explanation for the low performance of CO and COS
The reason for the low performance of the CO and COS retrieval tasks
is threefold:
%
\begin{itemize}
%
\item The mapping procedure of the INEX documents onto
the generic document format changed the document structure to some
degree (e.g., corrections of the structure). Some structural
components of the initial INEX schema did not occur in the new
format (e.g., layout tags and synthetic elements such as
\texttt{/article/bdy}), were generalized to more abstract
components (e.g., all different paragraph tags became
\texttt{FRA} elements of type \texttt{text}), or were reordered
during correction (e.g., \texttt{p[2]/tbl[1]} became \texttt{tbl[1]},
and subsequent \texttt{p[i]} elements were changed to \texttt{p[i-1]}).
%
\item INEX topics had to be adapted according to the generic
document structure. This transformation included new \texttt{meta()}
predicates and renaming of element paths. Thus, some of the
initial topics could not be translated exactly.
%
\item INEX assessments (optimal results) included elements that were
added automatically to the optimal recall base. For instance, if
\texttt{/article/bdy/sec} was a relevant result, both containers
\texttt{/article/bdy} and \texttt{/article} were also considered
relevant to some degree and were added. Many of these derived
elements were purely synthetic (e.g., \texttt{/article/bdy}).
%
During evaluation, all elements retrieved had to include the correct
path within the document. In X-DOSE, this path had to be
reconstructed after retrieval from the metadata information
(\texttt{sourcepath}) of the component. Obviously, this could only
be done for components that had in fact been mapped. Since a large
number of INEX elements were not mapped one to one, such a
reconstruction was often not possible. During evaluation, such
results were unfortunately judged as missing, although these
elements never existed in the mapped documents. Especially in the
case of CO and COS topics, this large number of synthetic elements
not retrieved by X-DOSE led to the drop in performance.
%
\end{itemize}
%
\begin{table}[ht]
\centering
\sffamily \footnotesize
\rowcolors{1}{tablerowcolorodd}{tablerowcoloreven}
\caption{Top-10 INEX 2005 systems (CO.Thorough)}
\label{eval:tab:exp8_inex_top_CO_Thorough}
\begin{tabular}{llllllllll}
%
\rowcolor{tableheadcolor} & \mmcc{3}{gen $nxCG$} & \mmcc{3}{strict $nxCG$} & \mmcc{3}{genLifted $nxCG$} \tabularnewline%
\rowcolor{tableheadcolor} \mcc{Rank} & \mcc{$[10]$}& \mcc{$[25]$}& \mcc{$[50]$} & \mcc{~~~~$[10]$}& \mcc{$[25]$}& \mcc{$[50]$} & \mcc{~~~~$[10]$}& \mcc{$[25]$}& \mcc{$[50]$} \tabularnewline%
%
1 & 0,3037 & 0,2771 & 0,1004 & ~~~~0,1189 & 0,1931 & 0,2546 & ~~~~0,3401 & 0,3167 & 0,2717 \tabularnewline%
2 & 0,2820 & 0,2654 & 0,0923 & ~~~~0,1174 & 0,1539 & 0,2529 & ~~~~0,3345 & 0,3032 & 0,2689 \tabularnewline%
3 & 0,2797 & 0,2634 & 0,0846 & ~~~~0,1079 & 0,1298 & 0,2480 & ~~~~0,3274 & 0,2855 & 0,2670 \tabularnewline%
4 & 0,2673 & 0,2573 & 0,0749 & ~~~~0,0974 & 0,1222 & 0,2448 & ~~~~0,2993 & 0,2814 & 0,2639 \tabularnewline%
5 & 0,2665 & 0,2570 & 0,0747 & ~~~~0,0921 & 0,1157 & 0,2377 & ~~~~0,2939 & 0,2782 & 0,2628 \tabularnewline%
6 & 0,2637 & 0,2552 & 0,0746 & ~~~~0,0910 & 0,1134 & 0,2350 & ~~~~0,2908 & 0,2777 & 0,2613 \tabularnewline%
7 & 0,2593 & 0,2539 & 0,0739 & ~~~~0,0855 & 0,1093 & 0,2343 & ~~~~0,2905 & 0,277 & 0,2593 \tabularnewline%
8 & 0,2574 & 0,2441 & 0,0615 & ~~~~0,0847 & 0,1092 & 0,2339 & ~~~~0,2816 & 0,2732 & 0,2468 \tabularnewline%
9 & 0,2561 & 0,2399 & 0,0598 & ~~~~0,0846 & 0,1049 & 0,2330 & ~~~~0,2806 & 0,2693 & 0,2423 \tabularnewline%
10 & 0,2552 & 0,2386 & 0,0538 & ~~~~0,0842 & 0,1035 & 0,2299 & ~~~~0,2756 & 0,2667 & 0,2411 \tabularnewline%
%
\end{tabular}
\end{table}
%
\begin{table}[ht]
\centering
\sffamily \footnotesize
\rowcolors{1}{tablerowcolorodd}{tablerowcoloreven}
\caption{Top-10 INEX 2005 systems (CO.Focused)}
\label{eval:tab:exp8_inex_top_CO_Focused}
\begin{tabular}{llllllllll}
%
\rowcolor{tableheadcolor} & \mmcc{3}{gen $nxCG$} & \mmcc{3}{strict $nxCG$} & \mmcc{3}{genLifted $nxCG$} \tabularnewline%
\rowcolor{tableheadcolor} \mcc{Rank} & \mcc{$[10]$}& \mcc{$[25]$}& \mcc{$[50]$} & \mcc{~~~~$[10]$}& \mcc{$[25]$}& \mcc{$[50]$} & \mcc{~~~~$[10]$}& \mcc{$[25]$}& \mcc{$[50]$} \tabularnewline%
%
1 & 0,2688 & 0,2325 & 0,2190 & ~~~~0,1401 & 0,1543 & 0,1902 & ~~~~0,3118 & 0,2505 & 0,2380 \tabularnewline%
2 & 0,2561 & 0,2178 & 0,2122 & ~~~~0,1363 & 0,1513 & 0,1730 & ~~~~0,2942 & 0,2449 & 0,2371 \tabularnewline%
3 & 0,2458 & 0,2139 & 0,2084 & ~~~~0,1324 & 0,1432 & 0,1627 & ~~~~0,2763 & 0,2415 & 0,2279 \tabularnewline%
4 & 0,2538 & 0,2152 & 0,2085 & ~~~~0,1324 & 0,1442 & 0,1730 & ~~~~0,2859 & 0,2437 & 0,2305 \tabularnewline%
5 & 0,2349 & 0,2134 & 0,2078 & ~~~~0,1266 & 0,1294 & 0,1549 & ~~~~0,2729 & 0,2347 & 0,2274 \tabularnewline%
6 & 0,2316 & 0,2130 & 0,2031 & ~~~~0,1209 & 0,1095 & 0,1317 & ~~~~0,2729 & 0,2319 & 0,2264 \tabularnewline%
7 & 0,2313 & 0,2110 & 0,1998 & ~~~~0,1074 & 0,1077 & 0,1261 & ~~~~0,2664 & 0,2293 & 0,2165 \tabularnewline%
8 & 0,2290 & 0,2073 & 0,1985 & ~~~~0,0960 & 0,0997 & 0,1240 & ~~~~0,2612 & 0,2234 & 0,2079 \tabularnewline%
9 & 0,2275 & 0,2034 & 0,1924 & ~~~~0,0959 & 0,0887 & 0,1209 & ~~~~0,2588 & 0,2230 & 0,2077 \tabularnewline%
10 & 0,2244 & 0,2025 & 0,1914 & ~~~~0,0901 & 0,0886 & 0,1176 & ~~~~0,2428 & 0,2099 & 0,2012 \tabularnewline%
%
\end{tabular}
\end{table}
%
\begin{table}[ht]
\centering
\sffamily \footnotesize
\rowcolors{1}{tablerowcolorodd}{tablerowcoloreven}
\caption{Top-10 INEX 2005 systems (COS.Thorough)}
\label{eval:tab:exp8_inex_top_COS_Thorough}
\begin{tabular}{llllllllll}
%
\rowcolor{tableheadcolor} & \mmcc{3}{gen $nxCG$} & \mmcc{3}{strict $nxCG$} & \mmcc{3}{genLifted $nxCG$} \tabularnewline%
\rowcolor{tableheadcolor} \mcc{Rank} & \mcc{$[10]$}& \mcc{$[25]$}& \mcc{$[50]$} & \mcc{~~~~$[10]$}& \mcc{$[25]$}& \mcc{$[50]$} & \mcc{~~~~$[10]$}& \mcc{$[25]$}& \mcc{$[50]$} \tabularnewline%
%
1 & 0,3153 & 0,2858 & 0,2665 & ~~~~0,0824 & 0,1050 & 0,1366 & ~~~~0,3375 & 0,3045 & 0,2792 \tabularnewline%
2 & 0,3111 & 0,2828 & 0,2607 & ~~~~0,0824 & 0,1050 & 0,1360 & ~~~~0,3373 & 0,3016 & 0,2775 \tabularnewline%
3 & 0,2766 & 0,2754 & 0,2583 & ~~~~0,0588 & 0,1027 & 0,1354 & ~~~~0,3078 & 0,3003 & 0,2747 \tabularnewline%
4 & 0,2747 & 0,2741 & 0,2542 & ~~~~0,0588 & 0,0956 & 0,1307 & ~~~~0,3044 & 0,2929 & 0,2655 \tabularnewline%
5 & 0,2690 & 0,2649 & 0,2490 & ~~~~0,0529 & 0,0912 & 0,0901 & ~~~~0,3021 & 0,2916 & 0,2626 \tabularnewline%
6 & 0,2666 & 0,2634 & 0,2462 & ~~~~0,0507 & 0,0771 & 0,0883 & ~~~~0,3017 & 0,2904 & 0,2601 \tabularnewline%
7 & 0,2659 & 0,2621 & 0,2367 & ~~~~0,0507 & 0,0736 & 0,0755 & ~~~~0,2939 & 0,2836 & 0,2493 \tabularnewline%
8 & 0,2650 & 0,2620 & 0,2361 & ~~~~0,0471 & 0,0736 & 0,0712 & ~~~~0,2893 & 0,2829 & 0,2491 \tabularnewline%
9 & 0,2625 & 0,2586 & 0,2347 & ~~~~0,0471 & 0,0660 & 0,0659 & ~~~~0,2881 & 0,2810 & 0,2416 \tabularnewline%
10 & 0,2607 & 0,2582 & 0,2343 & ~~~~0,0448 & 0,0576 & 0,0653 & ~~~~0,2795 & 0,2798 & 0,2364 \tabularnewline%
%
\end{tabular}
\end{table}
%
\begin{table}[ht]
\centering
\sffamily \footnotesize
\rowcolors{1}{tablerowcolorodd}{tablerowcoloreven}
\caption{Top-10 INEX 2005 systems (COS.Focused)}
\label{eval:tab:exp8_inex_top_COS_Focused}
\begin{tabular}{llllllllll}
%
\rowcolor{tableheadcolor} & \mmcc{3}{gen $nxCG$} & \mmcc{3}{strict $nxCG$} & \mmcc{3}{genLifted $nxCG$} \tabularnewline%
\rowcolor{tableheadcolor} \mcc{Rank} & \mcc{$[10]$}& \mcc{$[25]$}& \mcc{$[50]$} & \mcc{~~~~$[10]$}& \mcc{$[25]$}& \mcc{$[50]$} & \mcc{~~~~$[10]$}& \mcc{$[25]$}& \mcc{$[50]$} \tabularnewline%
%
1 & 0,2908 & 0,2520 & 0,2439 & ~~~~0,1588 & 0,1996 & 0,2510 & ~~~~0,3194 & 0,2579 & 0,2598 \tabularnewline%
2 & 0,2860 & 0,2372 & 0,2375 & ~~~~0,1261 & 0,1788 & 0,1809 & ~~~~0,3036 & 0,2541 & 0,2441 \tabularnewline%
3 & 0,2767 & 0,2370 & 0,2315 & ~~~~0,1206 & 0,1308 & 0,1576 & ~~~~0,3014 & 0,2399 & 0,2306 \tabularnewline%
4 & 0,2637 & 0,2327 & 0,2179 & ~~~~0,1125 & 0,1235 & 0,1574 & ~~~~0,2906 & 0,2397 & 0,2251 \tabularnewline%
5 & 0,2534 & 0,2248 & 0,2147 & ~~~~0,1063 & 0,1213 & 0,1264 & ~~~~0,2878 & 0,2307 & 0,2182 \tabularnewline%
6 & 0,2462 & 0,2137 & 0,2101 & ~~~~0,0971 & 0,0975 & 0,1178 & ~~~~0,2513 & 0,2200 & 0,2141 \tabularnewline%
7 & 0,2457 & 0,2068 & 0,1936 & ~~~~0,0672 & 0,0908 & 0,0856 & ~~~~0,2499 & 0,2177 & 0,2101 \tabularnewline%
8 & 0,2427 & 0,1942 & 0,1931 & ~~~~0,0613 & 0,0815 & 0,0830 & ~~~~0,2495 & 0,2089 & 0,2086 \tabularnewline%
9 & 0,2368 & 0,1939 & 0,1889 & ~~~~0,0496 & 0,0684 & 0,0780 & ~~~~0,2441 & 0,2046 & 0,1963 \tabularnewline%
10 & 0,2286 & 0,1891 & 0,1816 & ~~~~0,0466 & 0,0637 & 0,0697 & ~~~~0,2417 & 0,2018 & 0,1804 \tabularnewline%
%
\end{tabular}
\end{table}
%
\begin{table}[ht]
\centering
\sffamily \footnotesize
\rowcolors{1}{tablerowcolorodd}{tablerowcoloreven}
\caption{Top-10 INEX 2005 systems (SSCAS)}
\label{eval:tab:exp8_inex_top_SSCAS}
\begin{tabular}{llllllllll}
%
\rowcolor{tableheadcolor} & \mmcc{3}{gen $nxCG$} & \mmcc{3}{strict $nxCG$} & \mmcc{3}{genLifted $nxCG$} \tabularnewline%
\rowcolor{tableheadcolor} \mcc{Rank} & \mcc{$[10]$}& \mcc{$[25]$}& \mcc{$[50]$} & \mcc{~~~~$[10]$}& \mcc{$[25]$}& \mcc{$[50]$} & \mcc{~~~~$[10]$}& \mcc{$[25]$}& \mcc{$[50]$} \tabularnewline%
%
1 & 0,4730 & 0,4816 & 0,5192 & ~~~~0,4500 & 0,4278 & 0,4356 & ~~~~0,4810 & 0,4936 & 0,5233 \tabularnewline%
2 & 0,4699 & 0,4335 & 0,4211 & ~~~~0,4500 & 0,4278 & 0,4078 & ~~~~0,4780 & 0,4342 & 0,4299 \tabularnewline%
3 & 0,4031 & 0,3978 & 0,4194 & ~~~~0,3250 & 0,3956 & 0,4067 & ~~~~0,4053 & 0,4007 & 0,4299 \tabularnewline%
4 & 0,3643 & 0,3978 & 0,4194 & ~~~~0,3250 & 0,3956 & 0,3967 & ~~~~0,3719 & 0,3980 & 0,4288 \tabularnewline%
5 & 0,3288 & 0,3971 & 0,4138 & ~~~~0,2250 & 0,3839 & 0,3894 & ~~~~0,3243 & 0,3958 & 0,4177 \tabularnewline%
6 & 0,3246 & 0,3950 & 0,4094 & ~~~~0,2000 & 0,3839 & 0,3756 & ~~~~0,3215 & 0,3921 & 0,4157 \tabularnewline%
7 & 0,3147 & 0,3927 & 0,4076 & ~~~~0,1750 & 0,3200 & 0,3639 & ~~~~0,3163 & 0,3907 & 0,4140 \tabularnewline%
8 & 0,3071 & 0,3433 & 0,3959 & ~~~~0,1500 & 0,1817 & 0,3589 & ~~~~0,3157 & 0,3600 & 0,4111 \tabularnewline%
9 & 0,2995 & 0,3373 & 0,3951 & ~~~~0,1500 & 0,1856 & 0,3489 & ~~~~0,3140 & 0,3512 & 0,3961 \tabularnewline%
10 & 0,2862 & 0,3364 & 0,3845 & ~~~~0,1500 & 0,3100 & 0,3200 & ~~~~0,3140 & 0,3478 & 0,3891 \tabularnewline%
%
\end{tabular}
\end{table}
%
For better comparison, the best results achieved by X-DOSE were
ranked in Table~\ref{eval:tab:exp8_inex_x-dose}. The number in
parentheses in each cell indicates the rank of X-DOSE in relation to
the other participating systems. As in the figure and tables above,
the results illustrated that X-DOSE was less competitive in the case
of CO and COS tasks. In contrast, in the case of SSCAS it was ranked
among the top-10 systems for five of the nine measures, namely for
all strict $nxCG$ cut-offs and at $[10]$ for the gen and genLifted
measures.
%
\begin{table}[ht]
\centering
\sffamily \footnotesize
\rowcolors{1}{tablerowcolorodd}{tablerowcoloreven}
\caption{Best results of X-DOSE}
\label{eval:tab:exp8_inex_x-dose}
\begin{tabular}{llllllllll}
%
\rowcolor{tableheadcolor} & \mmcc{3}{gen $nxCG$} & \mmcc{3}{strict $nxCG$} & \mmcc{3}{genLifted $nxCG$} \tabularnewline%
\rowcolor{tableheadcolor} \mcc{Task} & \mcc{$[10]$}& \mcc{$[25]$}& \mcc{$[50]$} & \mcc{$[10]$}& \mcc{$[25]$}& \mcc{$[50]$} & \mcc{$[10]$}& \mcc{$[25]$}& \mcc{$[50]$} \tabularnewline%
%
CO.Thorough & (40) & (42) & (38) & (33) & (40) & (41) & (40) & (39) & (37) \tabularnewline%
CO.Focused & (33) & (30) & (27) & (34) & (34) & (36) & (31) & (30) & (25) \tabularnewline%
COS.Thorough & (31) & (32) & (31) & (20) & (21) & (21) & (30) & (32) & (31) \tabularnewline%
COS.Focused & (21) & (23) & (23) & (26) & (21) & (23) & (23) & (23) & (21) \tabularnewline%
SSCAS & (4) & (15) & (15) & (1) & (3) & (7) & (4) & (13) & (13) \tabularnewline%
%
% CO.Thorough & 0,1522 (40) & 0,1400 (42) & 0,1541 (38) & 0,0231 (33) & 0,0270 (40) & 0,0440 (41) & 0,1845 (40) & 0,1669 (39) & 0,1777 (37) \tabularnewline%
% CO.Focused & 0,1368 (33) & 0,1268 (30) & 0,1250 (27) & 0,0192 (34) & 0,0277 (34) & 0,0264 (36) & 0,1506 (31) & 0,1399 (30) & 0,1418 (25) \tabularnewline%
% COS.Thorough & 0,1034 (31) & 0,0837 (32) & 0,0862 (31) & 0,0118 (20) & 0,0118 (21) & 0,0247 (21) & 0,1120 (30) & 0,0893 (32) & 0,0893 (31) \tabularnewline%
% COS.Focused & 0,1104 (21) & 0,0898 (23) & 0,0979 (23) & 0,0059 (26) & 0,0118 (21) & 0,0272 (23) & 0,1171 (23) & 0,0917 (23) & 0,0976 (21) \tabularnewline%
% SSCAS & 0,3646 (4) & 0,3115 (15) & 0,3484 (15) & 0,4500 (1) & 0,4039 (3) & 0,3689 (7) & 0,3825 (4) & 0,3350 (13) & 0,3647 (13) \tabularnewline%
%
\end{tabular}
\end{table}
%
However, if the SSCAS runs are taken as the most accurate indicator,
X-DOSE seems to be a competitive structured document retrieval
system. Further evaluation tasks based on other corpora and metrics
would surely provide better insights into the real performance of
the system.
%
Unfortunately, there were no other corpora available that included
queries and assessments for evaluation purposes. Performance metrics
other than $nxCG$ and $ep/gr$ would not have been comparable to
systems of INEX 2005 participants.
\subsection{Experiment IX - Clustering Performance}
% experimental setting
Due to limited time, the evaluation of clustering performance was
restricted to the best performing single-term index ST09. In this
evaluation, clustering was used to improve answer times during
retrieval by applying a document preselection. Each document in the
database was transformed into an XML tree representation that
contained the ST09 single-term content vectors in the leaf nodes.
Thus, document trees were always rooted in the \texttt{/DOC}
elements.
% settings
The complete set of documents was clustered using hierarchical
clustering. The parameter settings were chosen according to the
experiments described in Section~\ref{sec:clustering:evaluation}
($\alpha_{struct}=0,8$, $\beta_{parent}=0,2$). In each step, the two
most similar document trees were merged into a supertree, which was
stored in the database. During retrieval, the cluster hierarchy was
kept in memory and served as a filter. For a set of query terms, the
cluster hierarchy was traversed from the root, and the content of
each cluster was compared to the query terms. If the similarity was
above zero, both child clusters were investigated recursively. The
search within the cluster hierarchy ended at the leaf nodes (single
documents) or at clusters that were completely dissimilar. As a
result, a list of documents that were similar to the query to at
least some degree was returned. Only components contained in one of
the documents on that list were matched.
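%
The preselection described above can be sketched in a few lines of
Python. This is a minimal illustration and not the X-DOSE
implementation: the class \texttt{Cluster}, the functions
\texttt{similarity} and \texttt{preselect}, and the toy hierarchy are
hypothetical, and the similarity is a simple stand-in (the summed
weights of the query terms in a cluster's content vector).
%
\begin{verbatim}
from dataclasses import dataclass
from typing import Dict, Optional

@dataclass
class Cluster:
    terms: Dict[str, float]          # content vector: term -> weight
    doc_id: Optional[str] = None     # set for leaves (single documents)
    left: Optional["Cluster"] = None
    right: Optional["Cluster"] = None

def similarity(terms, query):
    """Stand-in similarity: summed weight of the query terms."""
    return sum(terms.get(t, 0.0) for t in query)

def preselect(node, query):
    """Return the ids of all documents reachable via similar clusters."""
    if node is None or similarity(node.terms, query) <= 0.0:
        return []                    # completely dissimilar: prune
    if node.doc_id is not None:
        return [node.doc_id]         # leaf node: keep this document
    return preselect(node.left, query) + preselect(node.right, query)

# Toy hierarchy: (d1, d2) form one cluster, which is merged with d3.
d1 = Cluster({"xml": 0.7, "retrieval": 0.3}, doc_id="d1")
d2 = Cluster({"xml": 0.5, "index": 0.5}, doc_id="d2")
d3 = Cluster({"music": 1.0}, doc_id="d3")
c12 = Cluster({"xml": 0.6, "retrieval": 0.15, "index": 0.25},
              left=d1, right=d2)
root = Cluster({"xml": 0.4, "retrieval": 0.1, "index": 0.17,
                "music": 0.33}, left=c12, right=d3)
\end{verbatim}
%
In this toy hierarchy, the query term \texttt{retrieval} preselects
only document \texttt{d1}: the leaves \texttt{d2} and \texttt{d3} have
zero similarity and are pruned, so their components would never be
matched.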
% results
The clustering of the 16.819 documents took 50,6 hours. Since each
document tree (or supertree) was compared to every other one,
792.956.310.739 comparisons were needed. However, caching of
pairwise similarities reduced this number to 282.845.123 `unique'
comparisons that had to be calculated.
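%
Both numbers can be reproduced by a simple count, given here as a
plausibility check. It rests on an assumption that is not stated
explicitly above, namely that no comparison is carried out once only
two trees remain. Let $n=16.819$ be the number of documents and $k$
the number of trees left before a merge step. Without caching, all
pairs of the $k$ remaining trees are compared in every step:
\[
\sum_{k=3}^{n} \frac{k(k-1)}{2}
  = \frac{(n+1)\,n\,(n-1)}{6} - 1
  = 792.956.310.739
\]
With caching, the $n(n-1)/2$ initial pairs are computed once, and
after each merge only the new supertree is compared to the $k-2$
other trees:
\[
\frac{n(n-1)}{2} + \sum_{k=4}^{n} (k-2)
  = (n-1)^2 - 1
  = 282.845.123
\]
Under this counting model, caching saves a factor of roughly $n/6$,
i.e., about 2.800 for this collection.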
%
The experiments given in Figure~\ref{fig:exp9_retrieval} show that
retrieval time was reduced to a high degree by exploiting the
hierarchical clusters computed.
%
\begin{figure}[ht]
\centering
\includegraphics[width=0.6\textwidth]{10_evaluation/figures/exp9_retrieval}
\caption{Retrieval times of clustered document components}
\label{fig:exp9_retrieval}
\end{figure}
On average, retrieval time was reduced by more than one half for
all retrieval tasks. Since clustering was applied for preselection
only, neither the ranking nor the computed similarities changed.
Summarizing these results, hierarchical clustering turned out to be
an excellent way to reduce the retrieval time of X-DOSE.
%\newpage
\subsection{Experiments not Conducted}
Two planned experiments were not conducted: one on query expansion
of acronyms and one on classification. Query expansion would have
had no effect because queries always contained both the short and
the full form of an acronym, and an evaluation of classification
performance was not possible for lack of pre-classified data.
\subsubsection{Query Expansion}
One possibility to evaluate the effect of query expansion is to add
terms included in the multi-term index to the query automatically.
This kind of expansion would only affect queries that contain
acronyms in either variant, short form or full form. However, all
INEX queries that included acronyms either (1) already included both
the short and the long form of the acronym, or (2) contained
acronyms that were not included in the index. Thus, query expansion
would not have had any effect on the results retrieved. Experiments
on query expansion were left for further X-DOSE improvements and
evaluation tasks.
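%
Why such an expansion would have been without effect can be
illustrated by a small Python sketch. The acronym table
\texttt{ACRONYMS} and the function \texttt{expand} are hypothetical
and stand in for a lookup in the multi-term index; they are not part
of X-DOSE.
%
\begin{verbatim}
# Hypothetical acronym table (short form -> full form).
ACRONYMS = {
    "ir": "information retrieval",
    "xml": "extensible markup language",
}

def expand(query_terms):
    """Add the missing counterpart of each acronym in the query."""
    full_to_short = {full: short for short, full in ACRONYMS.items()}
    expanded = list(query_terms)
    for term in query_terms:
        other = ACRONYMS.get(term) or full_to_short.get(term)
        if other is not None and other not in expanded:
            expanded.append(other)
    return expanded
\end{verbatim}
%
A query that already holds both forms, such as \texttt{["ir",
"information retrieval"]}, is returned unchanged, and so is a query
whose acronym is missing from the table. These are the two cases
observed for the INEX queries.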
\subsubsection{Classification}
An evaluation of XML document classification was not conducted in
this work.
%
%Extra experiments evaluating the performance of XML document
%classification were not conducted in this work.
%
The reason for this is that no pre-classified test data was included
in the INEX 2005 document retrieval collection. Since classification
of similar document components into user-defined classes without any
comparison data is only a matter of subjective judgment, an
objective evaluation could not be conducted.
%\newpage
%\newpage
%----------------------------------------------------
%----------------------------------------------------
%----------------------------------------------------
%\section{Discussion}
%\subsection{Research Questions and Answers}
%
%
%% research questions
%Based on the preliminaries and the identified functionalities of a
%successful structured document retrieval system the research
%questions this work tries to answer can be reformulated as follows:
%%
%\begin{compactitem}
%%
%\item Is it possible to transform heterogeneous documents into a common generic schema without a critical loss of information? % prof: difficult to put it that way
% \begin{compactitem}
% \item How would such a schema look like and how can it be processed efficiently?
% \item How can metadata information be included in documents?
% \end{compactitem}
%%
%\item How can XML documents following this generic schema be stored efficiently?
%%
%\item To which extent is natural language processing able to improve structured document retrieval?
% \begin{compactitem}
% \item Which processing steps have a deep impact on retrieval performance?
% \item Which processing steps reduce the data load to a high degree?
% \item How is natural language processing influencing recall and precision?
% \item Are automatically generated multi-term indices improving the retrieval performance?
% \end{compactitem}
%%
%\item What is the contribution of classification and clustering in the context of structured document retrieval?
% \begin{compactitem}
% \item How are two XML trees compared to each other?
% \item How are class/cluster representatives created from a set of XML document trees?
% \item In which ways is classification/clustering supporting structured document retrieval (e.g., reduction of search space, retrieval time improvements)? % prof: hard to answer
% \end{compactitem}
%%
%\item To which extent does structured document retrieval help satisfying certain user information needs?
% \begin{compactitem}
% \item What are appropriate retrieval units and how are they represented?
% \item How `intuitive' can a structured query language be and what are the users expected to know in advance?
% \item Which are the best elements to be returned to the users? % prof: qualify this
% \item Does XML retrieval performance outperform the results of traditional retrieval?
% \item How scalable are structured document retrieval approaches?
% \end{compactitem}
%%
%\end{compactitem}
%----------------------------------------------------
%----------------------------------------------------
%----------------------------------------------------
\section{Summary}
This chapter described the evaluation of the X-DOSE system. For this
purpose, the document collection and queries of the INEX 2005
evaluation workshop were used. The experiments included a comparison
of twelve single-term indices, four multi-term indices, and
combinations of them. Five different INEX retrieval tasks were run
on X-DOSE applying numerous parameter settings.
The experiments clearly indicated that the size of the index is not
correlated directly with the performance achieved; the quality of
the index terms was essential. The four types of multi-term indices,
which were used in addition to the best performing single-term
index, improved performance considerably for all content-oriented
retrieval tasks. Because both are important to information
retrieval, indexing times and retrieval times were measured and
interpreted for each of the experiments. The best performance in
both regards, retrieval quality and processing speed, was achieved
by token-type based index term extraction and stemming. Tagging
turned out to be suboptimal because of wrongly identified word
categories and its computational complexity. Dynamic term space
generation had no impact on the retrieval results. A comparison of a
former X-DOSE version and the current version revealed the progress
of development and highlighted the improvements achieved.
%
Compared to other systems used at INEX 2005, X-DOSE was
not competitive in the Content-Only (CO and COS) tasks.
%
This was because the evaluation procedure used at INEX penalized
retrieval results computed on mapped documents and topics, as
proposed in this work.
%
In contrast, Content-And-Structure (CAS) queries were processed very
fast. Since CAS evaluation was based on strict structural matching
(SSCAS), the results reflected the performance of X-DOSE much
better. For this task, X-DOSE even outperformed most other systems.