Research @ Hekkas.Com

Linguistically Enhanced Information Retrieval of Structured Documents

Representation

"Science must begin with myths, and with the criticism of myths." Karl Popper

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Chapter 4 - Document Representation
%
% last change: 16.08.2004
% correction hamid: xx.xx.2004
% correction prof: xx.xx.2004
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\aphorism{Simplicity does not precede complexity, but follows it.}{Alan J. Perlis}%
\chapter{Document Format, Storage, and Representation}
\label{chapter:documentrepresentation}%

This chapter is concerned with the most simple XML document format
that could possibly work as a means to express structure, content,
and metadata of text documents. By transforming documents into this
generic schema the foundation for efficient storage and retrieval is
laid. Typing, associated with tailored search predicates, supports
sophisticated matching of information contained in different XML
components. In order to compare the content of document components,
the content of descendants is aggregated upwards contributing to the
content of the ancestors, allowing to apply traditional information
retrieval models. Therefore, two different modifications of the
vector space model, one using a static term space and another one
using dynamic term spaces, are presented. As a result documents of
any source and of any format are represented in the same way,
enabling fast and elaborate retrieval of documents and document
components.\footnote{Parts of this chapter have already been
published in~\cite{hassler05secure,hassler05searching}}

 

%----------------------------------------------------
%----------------------------------------------------
%----------------------------------------------------
\section{Introduction}

% introduction and motivation
As soon as speaking of structured documents the question of `what is
structured' and `how is the structure expressed' comes up. Usually,
the logical structure of a document is covered by terms like
`chapter', `section', `paragraph', or `figure', where the order of
structural elements is of importance (i.e., the introduction always
comes before the conclusion). Nesting of elements is used to
summarize related contents under a common topic.

% no guidelines
Unfortunately, there exists no standardized set of terms that
describes the structure of a document. Also, no guidelines or
limitations for proper structuring -- as the number of allowed child
elements or the maximum depth of nesting -- are available. On the
one hand this leaves much room to individually structure documents,
on the other hand it complicates structured document retrieval
tasks.

% author related stuff
Furthermore, the structure is tightly coupled to the intensions of
an author in organizing the text. Important contents are introduced
at the beginning and recur several times as the text evolves. Deeper
nested contents provide more details, whereas contents at higher
levels overview a topic.
%
Different authors tend to structure their texts differently, so
there is no consistency inherent in a set of documents written by
different authors.
%
Even single authors rarely stick to their way of structuring
different documents over a longer period of time.
%
This structural heterogeneity often leads to inconsistencies and
ambiguities, especially in large-scale document management
systems~\cite[pp. 201]{manning08ir}.

 

% format related stuff
Besides heterogenous document structures, a wide variety of document
formats are used, calling for diverse access methods. Both,
heterogeneity of document structures and document formats make it
hard for retrieval systems to perform well. Besides constraints on
the content, queries addressing structured documents may contain
additional restrictions on the structure or context (span of
components).
%
In the context of XML documents another issue arises: Documents on
the net mostly come without any schema specification making it hard
to guess which structural element contains which content.

 

 

 

% proposal of single schema, advantages
In order to take care of these difficulties this work proposes to
map incoming documents of arbitrary structure and format onto a
single generic XML document format.
%
The advantages of this procedure is that most of the inconsistencies
and ambiguities can be resolved at a very early stage of retrieval.
The consequences include:
%
\begin{compactitem}
\item Homogenous documents regarding their structure and format (XML) exist.
\item Common access methods to address the content of XML components are possible.
\item Adoption of data types and sophisticated search predicates for specific contents.
\item Definition of index objects based on a components' name or data type.
\item Support of structural vagueness and semantic relativism of components~\cite{fuhr01xirql} due to the mapping onto single distinct elements.
\item Application- and domain-specific definition of retrieval units becomes possible.
\end{compactitem}

% what comes in this section
In the sequel a brief overview of related work is given. All
subsequent sections and chapters present the achievements obtained
in this work. In Section~\ref{sec:generic_xml_schema}, a minimal XML
document schema that fits these requirements is defined. Efficient
storage of the mapped documents using a relational database is
described in Section~\ref{sec:storage}. Based on the stored
information, representation issues of arbitrary document components
are outlined in Section~\ref{sec:representation}. A short summary
concludes the chapter.

 

 

 

%----------------------------------------------------
%----------------------------------------------------
%----------------------------------------------------
\section{Related Work}

\subsection{Typing of Structural Entities}

G\"{o}vert~\cite{goevert01hyrex} describes several extensions of
search predicates that are added to the HyREX search engine. These
predicates are based on different data types organized in a type
hierarchy (see Figure~\ref{fig:typing_fuhr}).
%
\begin{figure}[ht]
\centering
\includegraphics[width=0.6\textwidth]{04_document_representation/figures/typing_fuhr}\\
\caption[Inheritance hierarchy on data types]{Inheritance hierarchy on data types (cf.~\cite{goevert01hyrex})}
\label{fig:typing_fuhr}
\end{figure}
%
Each data type supports a set of search predicates which are
inherited by its subtypes. The predicates implement type-based
matching methods for names that sound similar or are abbreviated
forms, geographical closeness of locations, interval restrictions of
time and date specifications, similarities of chemical structures,
and several grades of text processing.

 

 

\subsection{Storage}

 

\subsubsection{Florescu and Kossmann}

Florescu and Kossmann~\cite{florescu99performance} describe eight
different approaches how XML documents can be mapped onto relational
database tables. They evaluated their approaches using approximately
80 megabytes of XML documents. Their results show that best
performance is achieved by using separate attribute tables in
combination with inlining the corresponding element values. Another
finding is that more sophisticated approaches may hurt retrieval
performance more than they help.

 

 

\subsubsection{Schmidt, Kersten, Windhouwer, and Waas}

Schmidt et al.~\cite{schmidt01efficient} propose a data model based
on complete binary fragmentation of the document tree. In order to
store the XML documents they rely on the relational Monet database
engine.

The Monet transform $M_t$ decomposes an XML document tree into
binary parent-child, node-attribute, and node-rank relations of the
same type. $M_t$ of a document $d$ is given by the tuple
\begin{equation*}
M_t(d) = (r,R,A,T)
\end{equation*}
where $r$ is the root of the document, $R$ is the set of binary
relations between parent and child nodes, $A$ is the set of binary
relations between nodes and attributes, and $T$ is the set of binary
relations between nodes and their rank. Each of the $R$, $A$, and
$T$ relations is stored in separate database tables that are further
decomposed according to the parent and child elements involved. The
decomposition process of a document is linear in time and space.

For evaluation purpose three different document collections are
used: ACM Anthology (46,6 MB), Shakespeare's Plays (7,9 MB), and
Webster's Dictionary (56,1 MB). In their experiments they show that
the size of the Monet XML format (44,2 MB, 8,2 MB, 95,6 MB) strongly
depends on the original XML documents. Higher heterogeneity of the
document structure increases the amount of database tables needed,
peaking in 2587 tables for the Webster's Dictionary. Quantitative
assessments show that the approach scales linear in the size of the
XML document collection regarding both, storage and retrieval.

 

 

 

\subsubsection{Grust}

Grust~\cite{grust02accelerating} describes a relational index
structure for XML documents. It allows a complete reconstruction of
stored documents. Table~\ref{tab:xpath_axes} overviews the thirteen
axes supported by XPath. The main focus of Grust settles on
supporting four major axes exploited for retrieval, namely
descendant, ancestor, following, and preceding relations.
%
\begin{table}[ht]
\centering
\sffamily \small
\rowcolors{1}{tablerowcolorodd}{tablerowcoloreven}
\caption[Semantics of XPath axes]{Semantics of XPath axes~\cite{grust02accelerating}}
\label{tab:xpath_axes}
\begin{tabular}{L{3cm}L{7cm}}
\rowcolor{tableheadcolor} \color{white} Axis & \color{white} Result forest of a node $v$ \tabularnewline%
% axis result fores
\texttt{child} & direct element child nodes of $v$ \tabularnewline%
\texttt{descendant} & recursive closure of \texttt{child} \tabularnewline%
\texttt{descendant-or-self} & like \texttt{descendant}, plus $v$ \tabularnewline%
\texttt{parent} & direct parent node of $v$ \tabularnewline%
\texttt{ancestor} & recursive closure of \texttt{parent} \tabularnewline%
\texttt{ancestor-or-self} & like \texttt{ancestor}, plus $v$ \tabularnewline%
\texttt{following} & nodes following $v$ in document order \tabularnewline%
\texttt{preceding} & nodes preceding $v$ in document order \tabularnewline%
\texttt{following-sibling} & like \texttt{following}, same parent as $v$ \tabularnewline%
\texttt{preceding-sibling} & like \texttt{preceding}, same parent as $v$ \tabularnewline%
\texttt{attribute} & attribute nodes of $v$ \tabularnewline%
\texttt{self} & $v$ \tabularnewline%
\texttt{namespace} & namespace nodes of node $v$ \tabularnewline%
\end{tabular}
\end{table}

\newpage
Therefore, he suggests a five-dimensional descriptor
%
\begin{equation*}
desc(v) = \langle pre(v),post(v),par(v),att(v),tag(v) \rangle
\end{equation*}
%
to encode structural elements, where $pre$ (resp. $post$) is an ID
assigned through a preorder (resp. postorder) traversal of the
document tree, $par$ is the parent node, $att$ is a boolean value
distinguishing XML attributes ($att=true$) and XML elements
($att=false$), and $tag$ is the actual name of the XML attribute or
XML element. One has to note that both pre- and postorder are unique
within a document and can be used as primary keys. Both numbers are
initialized with 1, so the root node of the document starts with
$pre=1$. The parents preorder ID $par$ is stored for the sake of
ancestor and descendant detection (root nodes are defined as
$par=nil$). $par$, in combination with $pre$ and $post$, allows
efficient calculation of the following and preceding sibling. Both,
$att$ and $tag$ are stored for name testing.

Grust points out that insertion of documents into the database is
linear in time and size of the input source. By using an event-based
parsing framework for XML documents (e.g., SAX) only very limited
scratch space during storing is needed.

In order to store the content of the elements two, different
solutions are feasible: (1) Inlining the content as a child node of
its parent node (like an attribute); or (2) storing the content in a
separate database table using ($pre$,$cdata$) tuples where $pre$ is
a foreign key pointing at the corresponding elements preorder ID.
%
Experiments by Florescu and Kossmann~\cite{florescu99performance}
identified the second solution as superior, which is adopted in this
work.
%Previous work~\cite{florescu99performance} showed that the second
%method is superior.

 

\subsubsection{Hiemstra}

Hiemstra~\cite{hiemstra02database} presents the fundamental ideas
and starting points for building an XML retrieval system based on
relational database systems. In his work he defines a basic language
model for term weighing and introduces database tables that hold
document term statistics, global term statistics, and XML document
structures. For indexing purpose he applies methods described by
Grust~\cite{grust02accelerating}. However, he substitutes the node
identification based on preorder and postorder (assigned during two
traversals) by starting and ending counters (assigned during a
single traversal), keeping up fast ancestor-descendant detection.

 

 

 

 

 

 

%----------------------------------------------------
%----------------------------------------------------
%----------------------------------------------------
\section{The Most Simple XML Document Format}
\label{sec:generic_xml_schema}

% my work
From now on, all methods and results described are achieved by the
author as part of this work. One of the primary tasks of information
retrieval is the handling of incoming information. Since this work
relies on structured input documents, the usage of a generic XML
document format that satisfies the requirements of structured
document retrieval is proposed.

% introduction
An XML schema allows to define valid document structures
independently of the XML document instances implementing it. The
goal of this section is to identify the minimal requirements
necessary to represent a structured document appropriately for
information retrieval purposes.
%
%
% kinds of information
This first thing to be aware of is the different kinds of
information coming along with structured documents:
%
\begin{description}
%
\item[Structure] refers to the logical structure of a document
usually covered by the terms `chapter', `section', `subsection',
etc. It defines a hierarchy where the order among the elements is of
importance (i.e., the introduction always comes before the
conclusion).
%
\item[Content] is associated with structural elements and defines what
an element is about. Content can be regarded either as (1) flat
content, typically referring to paragraphs, tables, and figures, or
alternatively as (2) composite content that consists of recursively
combined descendant contents of smaller granularity.
%
\item[Metadata] provides additional information which is -- like content --
assigned to structural elements. It describes the content more
clearly without being part of it. Examples of metadata are the
document's author(s), section titles, text languages, and figure
dimensions.
%
\end{description}

In order to achieve accurate retrieval results, structure, content,
and metadata have to be treated differently during both phases,
indexing and retrieval.

 

 

 

\subsection{Structure}

% introduction
The hierarchical structure of documents is usually covered by terms
like chapters, sections, and subsections. Sticking to these
definitions in a general domain leads to severe inconsistencies. For
instance, scientific research articles do not provide chapters but
consist of sections, subsections, and subsubsections only. Another
problem occurs if these identifiers are abbreviated (sec, ssec) or
cross-language documents are included (capitulo\footnote{Spanish for
chapter}, Unterkapitel or Abschnitt\footnote{German for section},
etc.). As a consequence more abstract definitions of content
containers seem to be necessary.

% three basic elements, example tree
Therefore, this work introduces a generic document structure
(defined through XML schema) that consists of only three elements:
\texttt{DOC} (document), \texttt{SEC} (section), and \texttt{FRA}
(fragment). These three structural concepts manage to reflect any
tree-like logical structure inherent in structured documents.

% explanation of the figure
Figure~\ref{fig:tree_1_generic} illustrates the transformed document
of the introductory example given in the previous chapter (see
Figure~\ref{fig:xml}). Titles are included for better comprehension.
%
\begin{figure}[ht]
\centering
\includegraphics[width=0.99\textwidth]{04_document_representation/figures/tree_1_generic}\\
\caption{Example of a transformed XML document}
\label{fig:tree_1_generic}
\end{figure}
%
%
% DOC element
The \texttt{DOC} element defines the root of the document. This
enables all kinds of documents like books, articles, or news
messages to be mapped onto this element. As content \texttt{DOC}
elements may only comprise \texttt{SEC} and \texttt{FRA} elements.
%
% SEC element
\texttt{SEC} is the main structural element of a document. Sections
are defined recursively consisting of other \texttt{SEC} elements
(e.g., subsections) and/or \texttt{FRA} elements (first child
element of `Section 2' in the Figure). This allows unlimited nesting
of structural elements (e.g., chapter, section, subsection,
subsubsection, segment).
%
% FRA element
\texttt{FRA} elements are the only elements that carry flat content,
defining the leaf nodes of the document schema. Fragments can be
understood as basic building blocks acting as elementary containers
for either plain text (which is the standard) or bytecode (e.g.,
inlined binary information like figures and multimedia objects).

 

% semantics and implication of fragment and granularity
From the information retrieval point of view fragments denote the
atomic units of a structured document. Thus, they define the
smallest units for indexing and retrieval. Their granularity
strongly depends on the application area and domain requirements. In
this work fragments are defined to fit the size of meaningful
portions of texts that are retrievable. Some domains may require a
deeper structure with much smaller fragments comprising sentences or
even single terms, while others need fragments that fit even long
text passages. In any case the granularity of fragments has a major
impact on the number of elements in the transformed XML document and
thus on the performance during indexing and retrieval.

 

% links
Within documents links are supported as a means to express relations
among elements. There exist two types of links, internal and
external links. Internal links are links within the same document
including citations (pointing to entries in the bibliography
section), figure or table references within the text (pointing to
the actual figure/table), or the table of contents (pointing to the
beginning of sections/subsections). External links refer from one
document to another like reference entries in the bibliography
section (pointing to other documents), hyperlinks (pointing to URLs
or email addresses), and file locations (pointing to specific
files/directories).

% content and metadata block
Each of the three elements (\texttt{DOC}, \texttt{SEC},
\texttt{FRA}) is viewed as a tuple (\emph{metadata},
\emph{content}), where \emph{metadata} refers to descriptive
information of the node itself, while \emph{content} refers to the
content of the element (dotted rectangles in
Figure~\ref{fig:tree_2_meta_content}). Generally, the first type of
nodes requires database-supported matching during retrieval, while
the second type is subject to partial matching (e.g., VSM).
%
%
% figure of metadata and content block
\begin{figure}[ht]
\centering
\includegraphics[width=0.99\textwidth]{04_document_representation/figures/tree_2_meta_content}\\
\caption{Expanded example of a transformed XML document}
\label{fig:tree_2_meta_content}
\end{figure}
%
%
% implication of metadata and content blocks
The strict distinction between metadata and content enables
different methods to match queries and document components. Thus,
special retrieval focus can be put on each of these information
sources.

 

 

 

\subsection{Metadata}

% general info about the metadata block
The metadata block of a node provides element-based complementary
information. Examples of metadata information are a documents author
or a sections title. A full list of the metadata fields for each
structural element is given in Table~\ref{tab:metadata}.
%
\begin{table}[ht]
\centering
\sffamily \small
\rowcolors{1}{tablerowcolorodd}{tablerowcoloreven}
\caption{Supported metadata information}
\label{tab:metadata}
\begin{tabular}{L{3cm}L{7cm}}
\rowcolor{tableheadcolor} \color{white} Element & \color{white} Metadata \tabularnewline%
% element metadata
Document (\texttt{DOC}) & \texttt{sourcepath}, \texttt{doc\_type}, \texttt{title}, \texttt{author}, \texttt{subtitle}, \texttt{publisher}, \texttt{proceeding}, \texttt{year}, \texttt{keywords}, \texttt{isbn}, \texttt{price} \tabularnewline%
Section (\texttt{SEC}) & \texttt{sourcepath}, \texttt{sec\_type}, \texttt{title} \tabularnewline%
Fragment (\texttt{FRA}) & \texttt{sourcepath}, \texttt{fra\_type}, \texttt{title}, \texttt{language} \tabularnewline%
\end{tabular}
\end{table}

% sourcepath
Each of the elements (\texttt{DOC}, \texttt{SEC}, \texttt{FRA})
provides a \texttt{sourcepath} field that keeps a reference to the
source of the information (e.g., URL). By this means the connection
between a retrieved element of the transformed XML document and its
original source document is established. In cases where a reference
is not available (i.e., a section within an Adobe Acrobat PDF
document cannot be addressed directly) the field may remain empty.
%
% type
Another metadata field available to all structural elements is the
content type (\texttt{doc\_type}, \texttt{sec\_type},
\texttt{fra\_type}) which is explained in the next subsection.
%
% general fields are self explaining
The remaining metadata fields do not need further explanations. The
reason for the \texttt{title} in the \texttt{FRA} metadata (usually
paragraphs without titles) is that fragments may also refer to
figures, tables, or multimedia objects.

% extension
Of course the proposed metadata fields have to be chosen according
to the application area and domain. A natural extension of this
metadata model is to allow multiple entries with the same name and
sub-structured metadata. Although not exploited in this work,
complex metadata information can be easily integrated (see
Figure~\ref{fig:typing_structured_metadata}).
%
\begin{figure}[ht]
\centering
\includegraphics[width=0.5\textwidth]{04_document_representation/figures/typing_structured_metadata}\\
\caption{Structured metadata example}
\label{fig:typing_structured_metadata}
\end{figure}

 

 

\subsection{Content}

% general info
The content block of \texttt{DOC} and \texttt{SEC} elements only
supports \texttt{SEC}s and \texttt{FRA}s as valid child elements. In
contrast, the content of FRA elements is not further nested.
%
% no mixed content nodes
Such a definition completely avoids mixed content nodes containing
both, content and substructures. This in turn eases the
implementation of retrieval systems regarding the storage, indexing,
and retrieval functionalities.

% content of fragments
As already mentioned, the proposed structure defines a fragment as
the smallest structural unit which is in most cases a plain text
block. However, restricting the content to paragraphs, enumerations,
lists, tables, definitions, theorems, proofs, references,
algorithms, equations, formulae, and code listings might seem too
restrictive. It is important that a fragment could also refer to
figures, pictures, videos, sounds, and other multimedia material. At
this point the XML document format offers a possibility to include
inlined binary data in form of bytecode in the document. Although
such information has to be interpreted before being used it enables
a single document to store any kind of contents.

% possible sub-structure
Within a fragments textual content, semantic markup is useful to
support additional
%
\begin{compactitem}
\item layout information
\item mathematical environments
\item linkage
\item footnotes
\item etc.
\end{compactitem}
%
% EMPH
Layout information is not exploited for retrieval purpose. It is
reduced to a single \texttt{<emph>}-tag to express emphasis on a
term or text passage.
% MATH
Mathematical expressions and formulae within a text are marked by
\texttt{<math>}-tags.
%LINK
Both kinds of links, internal and external, are supported by a
\texttt{<link>}-tag that encloses the text to be linked. This tag
also specifies the type (\texttt{internal} or \texttt{external}) and
the \texttt{destination} (anchor within the document or address to
the external resource).
% FOOTNOTE
Footnotes are resolved as internal links to special footnote
sections inserted at the end of a document. A footnote is realized
as a section because it may contain more than a single paragraph.
% etc.
This concept can easily be extended to fit sub-structures of more
complex contents (e.g., table-like HTML markup), enabling more
appropriate analysis and matching mechanisms.
%
% common
Although these sub-structure elements are queryable the smallest
retrievable unit remains the whole fragment.

% content block not mandatory
However, the content block of elements is not mandatory. This allows
the inclusion of contents by exploiting their metadata information
(i.e., if content is not read- or analyzable) only. Further, it can
be used to outsource external contents by referring to it (e.g., a
picture or dynamically generated content somewhere on the net),
which in turn reduces the size of the XML documents.

 

\subsection{Typing of Structural Elements}

Proper retrieval results always involve a certain level of content
interpretation and structural knowledge. It plays a central role in
satisfying the users needs during retrieval. Therefore, semantic
typing of content and metadata is applied to enhance retrieval
performance. The proposed types in this section are developed to fit
the context of this work. They were never regarded as a complete set
and are intended to be extended according to the application domain.

 

\subsubsection{Typing of Metadata}

% general introduction
Being XML elements, metadata fields contain ordinary strings.
Although XML schemata allow for rudimentary type specification of
nodes (e.g., \texttt{xls:type='integer'}), generally the very same
content is valid in many different fields. For instance, both the
title and the ID of a document may be defined as simple strings.
Clearly these fields have to be treated differently when being
compared to a query. Thus, semantic information about the content of
metadata fields is needed to match them appropriately. In order to
achieve this, metadata types are exploited that are assigned to
single metadata fields.

% type hierarchy
To allow a semantic interpretation of the content of an XML element,
a type hierarchy is proposed by G\"overt~\cite{goevert01hyrex}.

A similar model is applied for typing of metadata fields. The type
hierarchy used is depicted in Figure~\ref{fig:typing_meta}. There,
types are derived from a common base element. The first level in the
hierarchy (gray-shaded rectangles) corresponds to database supported
data types. Thus, they can be used to assign types to the columns of
database tables. More specialized types in subsequent levels in the
hierarchy have one of the basic database types as supertype (e.g.,
\texttt{PersonName} is a String). Accordingly, the assigned types
also imply certain restrictions on the content and formatting (e.g.,
\texttt{DateTime} format, \texttt{URL} format) of metadata fields.
%
\begin{figure}[ht]
\centering
\includegraphics[width=0.7\textwidth]{04_document_representation/figures/typing_meta}\\
\caption{Hierarchical metadata types}
\label{fig:typing_meta}
\end{figure}

% predicates
Associated with the metadata types are custom-built implementations
of predicates for comparison. This allows special searching of
single metadata fields on different structural levels (e.g., section
titles, phone numbers, author names).

% names
For instance, the type \texttt{PersonName} identifies the content of
a metadata field such as the name of a person. Generally, names
include a family name and one or more forenames. In user queries,
multiple forenames or unknown forenames are often abbreviated by
their initials. Although family names and forenames cannot be
distinguished automatically, initial letters of names can be matched
during retrieval.
%
Consider documents written by an author named \texttt{Albert
Einstein} and a user query containing \texttt{A. Einstein}. In this
case typing allows a similarity calculation ($0 \le sim \le 1$)
between these two strings, resulting in a higher similarity value
(e.g., $~0,9$) than a comparison with \texttt{H. Einstein} (e.g.,
$~0,6$). This is also true for abbreviated family names like
\texttt{E. Albert}. However, more detailed metadata information
(i.e., a \texttt{PersonName} that consists of a
\texttt{PersonFamilyName} and \texttt{PersonForeName}s) is able to
handle even complex comparisons in a correct manner (see
Figure~\ref{fig:typing_structured_metadata}).
%
% keywords
\texttt{Keywords} separated by commas can be split and compared in a
boolean manner.
%
% others
\texttt{Title}s might be analyzed and compared in more detail using
stopword removal or (shallow) syntactic parsing. Also, high-level
similarities of \texttt{Phone} numbers based on their area code or
geographic closeness of \texttt{Location} types can be realized by
these predicates.

% normalization
Another important factor to find commonalities of queries and
metadata information is the normalization of content. By
transforming special types (e.g., times and dates, phone numbers,
prices) to a common format, similarity computation becomes more
exact. It is also simplified and speeded up.

 

\subsubsection{Typing of Content}

For indexing and retrieval purposes the content of \texttt{DOC},
\texttt{SEC}, and \texttt{FRA} is classified according to several
content types. These types can be addressed within queries and
enable to focus on particular subsets of elements. For instance,
only \texttt{book}s, sections of type \texttt{introduction}, or
\texttt{references} might be searched.
Figure~\ref{fig:typing_content} presents the content types supported
for each content container (gray-shaded rectangles). Again, the
proposed content types are not regarded as a final list and have to
be chosen according to both, the application and the domain.
%
\begin{figure}[ht]
\centering
\includegraphics[width=0.5\textwidth]{04_document_representation/figures/typing_content}\\
\caption{Content types}
\label{fig:typing_content}
\end{figure}

 

%----------------------------------------------------
%----------------------------------------------------
%----------------------------------------------------
\section{Storage}
\label{sec:storage}

% introduction
A central issue of retrieval systems is their performance. This
includes the way documents are stored and accessed. Especially in
the context of structured documents, efficiency during retrieval of
components of any granularity must be provided.
%
% three approaches
Generally, three different alternatives for storing XML documents
have been proposed in the
literature~\cite{khan01evaluation,grabs02xmltm}:
%
\begin{itemize}
%
\item Special purpose database systems (e.g., Rufus~\cite{shoens93rufus},
Lore~\cite{abiteboul97lorel,mchugh97lore}, Strudel~\cite{fernandez98strudel}) work best
as they are scalable and meet the storage requirements to handle
huge data loads. Unfortunately, these systems are tailored for special
domains and applications. Thus, they are not very flexible.
%
\item Object oriented database systems (e.g., O$_2$~\cite{bancilhon88o2}) and
native XML stores (e.g., Timber~\cite{jagadish02timber}, eXist~\cite{meier03exist})
are optimally suited to store complex nested data structures. However,
when querying large amounts of data, these systems are not able to compete with
retrieval performance of special purpose database systems or relational
database management systems.
%
\item Relational database management systems have been well
proven in the information retrieval domain. They provide maturity, stability, portability, and
scalability~\cite{shanmugasundaram99relational}. Mapping XML documents
appropriately to fit the relational paradigm seems to be a promising
solution that meets the storage requirements for structured documents~\cite{grossman97integrating}.
%
\end{itemize}

% relation database
In this thesis a relational database is used to store the XML
documents. The goal is to accelerate the access to various
structural neighbors of a node in the document tree (descendants,
ancestors, and siblings).
%
% not all nodes are stored, metadata is summarized
In order to process the transformed XML documents efficiently not
all nodes are stored in the database. Pure artificial elements used
for content distinction, the \texttt{metadata} and the
\texttt{content} block, are neglected. Further, the set of metadata
elements is treated differently as described in the sequel.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% structure %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

% pre- and postorder
This work departs from the idea of opening ($pre$) and closing
($post$) node identifiers introduced
in~\cite{grust02accelerating,hiemstra02database}. The access
efficiency comes from the fact that $pre$ and $post$ descriptors are
unique for a given document and, therefore, can be used conjointly
with the ID of the document as primary key in the mapped relational
schema. Both, $pre$ and $post$ are assigned straight-forward to the
nodes of a document in a single preorder traversal (root first, then
children from left to right) by keeping track of opening and closing
tags. Figure~\ref{fig:tree_3_pre_post} depicts an example with $pre$
(number to the left) and $post$ (number to the right) IDs assigned
to each node.
%
\begin{figure}[ht]
\centering
\includegraphics[width=0.90\textwidth]{04_document_representation/figures/tree_3_pre_post}\\
\caption{Transformed XML document with $pre$ and $post$ identifiers}
\label{fig:tree_3_pre_post}
\end{figure}

% pre- and postorder advantages
During retrieval, $pre$ and $post$ identifiers imply a number of
useful features:
%
\begin{compactitem}
%
\item
The root \texttt{DOC} always starts with $pre=1$.
%
\item
Child nodes of the same parent node are continuously numbered, where
$pre$ of a subsequent element equals $post+1$ of the preceding
element (e.g., the three uppermost \texttt{SEC} nodes:
2-5,6-23,24-27).
%
\item
The number of all descendant nodes is given by
$\frac{post-pre-1}{2}$.
%
\item
\texttt{FRA} elements define the leaf nodes of the tree and are
identified through the equation $post-pre=1$.
%
\item
Containment of two nodes (ancestor-descendant relation) is defined
by $pre$\_ancestor $<$ $pre$\_descendant and $post$\_ancestor $>$
$post$\_descendant. Thus, no recursive computation to detect
ancestor-descendant relations is needed.
%
\end{compactitem}

 

% structural entries
Table~\ref{tab:pre_post_db} shows the structural entries for the
document given in Figure~\ref{fig:tree_3_pre_post}. A structural
entry is described by the tuple ($doc$, $pre$, $post$, $parent$,
$tag$, $path$). $doc$ refers to the document, $pre$ and $post$ are
the numeric node identifiers, $parent$ is the $pre$ identifier of
the parent node in the same document, $tag$ is the XML tag naming
the component, and $path$ denotes the common path (see
Definition~\ref{def:common_path}) to the root of the document. The
root element ($pre=1$) has per definition no parent. $tag$ is
included for fast name lookup and access. For the sake of retrieval
performance each entry includes its common $path$. For queries
specifying the path of a document component, a great deal of
recursive path generations using the $parent$ and $tag$ attributes
is avoided.
%
\begin{definition}[Common path]
\label{def:common_path}
The common path of an XML component is
defined as the XPath to the root of the document (e.g.,
\texttt{/DOC[1]/SEC[3]/FRA[1]}) without positional information
(e.g., \texttt{/DOC/SEC/FRA}).
\end{definition}
%
\begin{table}[ht]
\centering
\sffamily \footnotesize
\rowcolors{1}{tablerowcolorodd}{tablerowcoloreven}
\caption{Relational entries for the structure}
\label{tab:pre_post_db}
\begin{tabular}{cccccL{4cm}}
\rowcolor{tableheadcolor} \color{white} doc & \color{white} pre & \color{white} post & \color{white} parent & \color{white} tag & \color{white} path \tabularnewline%
%doc pre post parID tag path
$d_1$ & 1 & 28 & - & \texttt{DOC} & \texttt{/DOC} \tabularnewline%
%
$d_1$ & 2 & 5 & 1 & \texttt{SEC} & \texttt{/DOC/SEC} \tabularnewline%
$d_1$ & 3 & 4 & 2 & \texttt{FRA} & \texttt{/DOC/SEC/FRA} \tabularnewline%
%
$d_1$ & 6 & 23 & 1 & \texttt{SEC} & \texttt{/DOC/SEC} \tabularnewline%
%
$d_1$ & 7 & 10 & 6 & \texttt{SEC} & \texttt{/DOC/SEC/SEC} \tabularnewline%
$d_1$ & 8 & 9 & 7 & \texttt{FRA} & \texttt{/DOC/SEC/SEC/FRA} \tabularnewline%
%
$d_1$ & 11 & 22 & 6 & \texttt{SEC} & \texttt{/DOC/SEC/SEC} \tabularnewline%
%
$d_1$ & 12 & 13 & 11 & \texttt{FRA} & \texttt{/DOC/SEC/SEC/FRA} \tabularnewline%
%
$d_1$ & 14 & 17 & 11 & \texttt{SEC} & \texttt{/DOC/SEC/SEC/SEC} \tabularnewline%
$d_1$ & 15 & 16 & 14 & \texttt{FRA} & \texttt{/DOC/SEC/SEC/SEC/FRA} \tabularnewline%
%
$d_1$ & 18 & 21 & 11 & \texttt{SEC} & \texttt{/DOC/SEC/SEC/SEC} \tabularnewline%
$d_1$ & 19 & 20 & 18 & \texttt{FRA} & \texttt{/DOC/SEC/SEC/SEC/FRA} \tabularnewline%
%
$d_1$ & 24 & 27 & 1 & \texttt{SEC} & \texttt{/DOC/SEC} \tabularnewline%
$d_1$ & 25 & 26 & 24 & \texttt{FRA} & \texttt{/DOC/SEC/FRA} \tabularnewline%
\end{tabular}
\end{table}

%% time complexity
%Inserting documents into the database is linear in time and size of
%the input source. By using an event-based parsing framework for XML
%documents like SAX~\cite{url_sax} only very limited temporary space
%requirements during storing is
%guaranteed~\cite{grust02accelerating}.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% content %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

% introduction
There are two possible ways to store the content of document
elements \cite{grust02accelerating}: (1) inline the content as a
child node of its parent (like an attribute); or (2) store the
content in a separate table with a foreign key pointing to the
correct element. The second method has been proven to be
superior~\cite{florescu99performance} and is adopted in this work.

% separate content table
The content of nodes (in particular of fragments) is kept in a
separate database table (see Table~\ref{tab:content_db}). Contents
of inner nodes (\texttt{SEC} and \texttt{DOC} elements) do not have
to be stored explicitly because they can be recovered dynamically
from the descendants' contents. Finding a balance between retrieval
performance and storage space, dynamically recovered inner node
contents can be additionally (or temporarily) preserved in the
content table. This allows to retrieve the content of an inner node
by a single database query instead of computing it recursively from
multiple descendant nodes.
%
\begin{table}[ht]
\centering
\sffamily \footnotesize
\rowcolors{1}{tablerowcolorodd}{tablerowcoloreven}
\caption{Relational entries for the content}
\label{tab:content_db}
\begin{tabular}{ccL{5.3cm}}
\rowcolor{tableheadcolor} \color{white} doc & \color{white} pre & \color{white} data \tabularnewline%
%doc pre data
$d_1$ & 3 & This work ... \tabularnewline%
$d_1$ & 8 & XML is one ... \tabularnewline%
$d_1$ & 12 & This section ... \tabularnewline%
$d_1$ & 15 & These days ... \tabularnewline%
$d_1$ & 19 & Based on ... \tabularnewline%
$d_1$ & 25 & Crucial for ... \tabularnewline%
\end{tabular}
\end{table}

% multiple contents
The concept of separating content and structure opens up the
possibility to maintain multiple contents for each structural
element. Hence content can be represented in various forms, this is
a major advantage for information retrieval. At least two
(independent) content tables are needed: one for saving the original
plain content and another one for saving its representation.

% process of indexing/storage
During indexing of a new XML document, plain contents restricted to
leaf nodes are stored in the database first. Afterwards, content
representations of the stored leafs are computed (and recalculated
on demand) without accessing the XML file again. Additionally,
contents and content representations of inner nodes can be
calculated and stored in the database. Therefore, not every content
in the database must have a corresponding content representation
(i.e., inner nodes).

% possible extension
Since this work focuses on the representation of natural language
texts, contents such as figures or formulae are unsuited for textual
representation. However, additional representations of these
contents (e.g., bitmaps) can be integrated easily by adding new
representation tables.

%After the plain content of an XML file is stored in the database,
%representations can be dynamically computed and recalculated on
%demand without accessing the original information directly.

%As a consequence not every original content may have a corresponding
%representation entry in every table. Some content entries (e.g.,
%figures) are even unsuitable to become included in any of the tables
%used for textual representation.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% metadata %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

In order to improve retrieval performance metadata handling is
completely shifted to the database. This is achieved by grouping all
metadata fields according to their element. Instead of having
multiple entries in the structure and content tables, a single tuple
($doc$, $pre$, $meta_1$, \ldots, $meta_n$) lumps together the set of
metadata fields in a single database entry. Metadata of elements
(\texttt{DOC}, \texttt{SEC}, \texttt{FRA}) is stored in separate but
similar tables as shown in Table~\ref{tab:metadata_db} for sections.
%
\begin{table}[ht]
\centering
\sffamily \footnotesize
\rowcolors{1}{tablerowcolorodd}{tablerowcoloreven}
\caption{Relational entries for metadata (section level)}
\label{tab:metadata_db}
\begin{tabular}{ccL{2cm}L{2cm}L{0.5cm}}
\rowcolor{tableheadcolor} \color{white} doc & \color{white} pre & \color{white} title & \color{white} sec\_type & \color{white} ... \tabularnewline%
%doc pre title sec_type
$d_1$ & 2 & Abstract & abstract & \tabularnewline%
$d_1$ & 6 & Introduction & introduction & \tabularnewline%
$d_1$ & 7 & Section 1 & & \tabularnewline%
$d_1$ & 11 & Section 2 & & \tabularnewline%
$d_1$ & 14 & SubSec 2.1 & & \tabularnewline%
$d_1$ & 18 & SubSec 2.2 & & \tabularnewline%
$d_1$ & 24 & Outlook & conclusion & \tabularnewline%
\end{tabular}
\end{table}

Although the same elements on different hierarchical levels may
implement different sets of metadata (i.e., sections in proceedings
may state an author but subsections do not), the approach taken in
this work assumes that all elements of the same type (\texttt{DOC},
\texttt{SEC}, \texttt{FRA}) have the same metadata fields. This may
lead to numerous $NULL$ values (unavailable metadata for some
elements) in the database. However, the whole set of metadata can be
accessed by a single select operation. This both simplifies and
speeds up querying of metadata considerably.

 

\newpage
%----------------------------------------------------
%----------------------------------------------------
%----------------------------------------------------
\section{Document Representation}
\label{sec:representation}

% introduction
The representation of structured documents relies on XML documents
that implement the generic schema introduced in
Section~\ref{sec:generic_xml_schema}. As a starting point, the three
structural components document (\texttt{DOC}), section
(\texttt{SEC}), and fragment (\texttt{FRA}) serve as index objects.

% representation of leafs and inner nodes, indexing
Since fragments are the leafs in the document tree, their content
has to be indexed first. Therefore, methods from the field of
natural language processing transform the textual content (e.g.,
paragraphs, lists) into term frequency vectors as described in the
next two chapters. In order to index inner nodes (sections and
documents) the representations of descendants are recursively merged
using a propagation of term statistics (see
Section~\ref{sec:sdr_representation}). This operation is equivalent
to process the concatenated contents of the descendant nodes.

% retrieval
The actual weighing of the term frequency vectors is carried out
during retrieval using the traditional vector space model
(Equations~\ref{eq:vsm_w}--\ref{eq:vsm_sim}). Based on the weighed
feature vectors the similarity between document components and user
queries is computed.
%
% idf difficulty
However, weighing includes the inverse document frequency
(Equation~\ref{eq:vsm_idf}) as a means to exploit corpus-based
statistics.
%
In the domain of structured document retrieval, the notion of
`document' has to be reconsidered: What elements are regarded as a
valid `document'? Is it the \texttt{DOC} component which comes
closest to the original definition? Perhaps the fragments containing
the content should be counted? Or, are all components (\texttt{DOC},
\texttt{SEC}, \texttt{FRA}) treated as separate `documents'?
Obviously, this choice has a major impact on weighing and ranking,
and thus on the retrieval result.
%
%
%
% two versions
In this work two approaches are explored.
%
\begin{itemize}
%
\item The first approach treats all components (\texttt{DOC},
\texttt{SEC}, \texttt{FRA}) as equal units or `documents'. Their
contents are represented within the same term space, which is
defined as the union of all terms stated in all components. Thus, it
implements a single static term space.
%
\item The second approach is more elaborated, utilizing different term spaces
for components on different hierarchical levels. In order to group
components according to their term space, the common path (see
Definition~\ref{def:common_path}) of an element is consulted. This
approach is referred to as applying dynamic term spaces.
%
\end{itemize}
%
Both approaches are described in the next two subsections. Since the
term `document' is no longer valid for structured documents, the
term inverse document frequency $idf$ is substituted by inverse
element frequency $ief$~\cite{grabs02vsm}. A final comparison of the
two approaches outlines their behaviors.

 

 

\subsection{Static Term Space}

% introduction
In a static term space all component representations are treated as
equal. Neither a distinction between document, section, and fragment
components is made, nor is the level in the document hierarchy
considered. All components are processed as a collection of
`documents' assumed to be equally important to a query.

% vsm adaptation
Taking this point of view, the vector space model needs only slight
changes (Equations~\ref{eq:vsm_tf_static}--\ref{eq:vsm_sim_static}),
where $N$ (resp. $n_i$) is the total number of document components
(resp. components stating term $i$).
%
\begin{eqnarray}
tf_{i,j} &=& \frac{freq_{i,j}}{\max freq_{l,j}} \label{eq:vsm_tf_static}\\
ief_i &=& \log \frac{N}{n_i} \label{eq:vsm_ief_static}\\
w_{i,j} &=& tf_{i,j} \cdot ief_i \label{eq:vsm_w_static}\\
%
sim(d_j,q) &=& \frac{\overrightarrow{d_j} \bullet \overrightarrow{q}} {|\overrightarrow{d_j}| \cdot |\overrightarrow{q}|}
= \frac{\sum^t_{i=1} w_{i,j} \cdot w_{i,q}} {\sqrt{\sum^t_{i=1} w^2_{i,j}} \cdot \sqrt{\sum^t_{j=1} w^2_{i,q}}} \label{eq:vsm_sim_static}
\end{eqnarray}

% length normalization
Because the cosine similarity implies a certain normalization in the
space of document components~\cite[pp. 28]{baeza-yates99ir}, the
arbitrary length of the contents is accounted for. Term weights
computed on the document level are unequal to term weights achieved
by applying the traditional formulae on the flat content of the same
document. The reason for that is that the inverse element frequency
is computed using a much higher number of document components than
the inverse document frequency using the number of documents.

% database
In order to support the calculation of $ief_i$ the database stores
($doc$,$term$,$n$) tuples of the documents. As the number of entries
in this table becomes quite large, a condensed table holds
($term$,$n$) tuples to improve performance during retrieval. If
documents or document components are added, altered, or removed both
data structures are updated accordingly.

 

\subsection{Dynamic Term Spaces}

% introduction
Another approach is to exploit structural information to define
multiple term spaces that better represent components on the same
level in the hierarchy. Two assumption are that these components (1)
are of similar granularity and length, and (2) use only a portion of
the complete term space. To achieve this, the common path (see
Definition~\ref{def:common_path}) is used to cluster the components
accordingly.

% context of a node
Based on the common path, the \emph{context of a node} is defined as
the set of all components having the same path (all chapters, all
sections, all fragments within subsections, etc.).
%
% vsm adaptation
To fit this requirement, the vector space model adaptation is given
by the Equations~\ref{eq:vsm_tf_dynamic}--\ref{eq:vsm_sim_dynamic}.
The inverse element frequency $ief_{i,c}$ is calculated dynamically
based on a node's context $c$, where $N_c$ (resp. $n_{i,c}$) is the
number of document components in context $c$ (resp. components in
the context $c$ stating term $i$).
%
\begin{eqnarray}
tf_{i,j} &=& \frac{freq_{i,j}}{\max freq_{l,j}} \label{eq:vsm_tf_dynamic}\\
ief_{i,c} &=& \log \frac{N_c}{n_{i,c}} \label{eq:vsm_ief_dynamic}\\
w_{i,j} &=& tf_{i,j} \cdot ief_{i,c} \label{eq:vsm_w_dynamic}\\
%
sim(d_j,q) &=& \frac{\overrightarrow{d_j} \bullet \overrightarrow{q}} {|\overrightarrow{d_j}| \cdot |\overrightarrow{q}|}
= \frac{\sum^t_{i=1} w_{i,j} \cdot w_{i,q}} {\sqrt{\sum^t_{i=1} w^2_{i,j}} \cdot \sqrt{\sum^t_{j=1} w^2_{i,q}}} \label{eq:vsm_sim_dynamic}
\end{eqnarray}

% different ief values
For the sake of term weighing, different $ief_{i,c}$ values of the
same term $i$ in different contexts $c$ are used. As a consequence,
this strategy weighs the same term with the same term frequency
differently depending on context $c$ (e.g., chapter versus
subsection). Clearly this approach puts more attention on the actual
context during retrieval. If the unit of retrieval is defined
explicitly, elements in this context are focused. Representations of
elements in other contexts do not influence the result. This allows
to perform focused retrieval on any level in the document tree.

% term space containment
Generally, term spaces of different contexts imply a containment
constraint: The term space of components of larger granularity
subsumes the term space of components of smaller granularity. This
becomes clear since the \texttt{/DOC} term space obviously is the
superset of the \texttt{/DOC/SEC} term space, which in turn is the
superset of the \texttt{/DOC/SEC/SEC} term space. However, fragments
on different levels in the hierarchy influence the term space of
their ancestors considerably.
%
% same as flat documents on document level
A nice feature of this approach is that in contrast to a single
static term space, the weighed feature vectors of \texttt{DOC}
components exactly match the representations using the vector space
model on flat documents.

% database
Dynamic $ief_{i,c}$ calculation is accomplished by the database
keeping track of ($doc$,$path$,$term$,$n_c$) tuples in a
\texttt{localIEF} table. Compared to a static term space, the amount
of database entries in this table is extraordinary high. To
circumvent performance losses, a compressed data structure of
($path$,$term$,$n_c$) tuples is pre-computed and stored in a
\texttt{globalIEF} table. Both tables \texttt{localIEF} and
\texttt{globalIEF} are used to compute dynamic as well as static
inverse element frequencies. More details are given in
Chapter~\ref{chapter:system} describing the system implemented.

 

\subsection{An Example: Static versus Dynamic Term Spaces}

% introduction of the example
The best way to illustrate the differences of a static term space
and multiple dynamic term spaces is by means of an example. Consider
the query and document given in Figure~\ref{fig:static_dynamic_1}.
The document tree contains three fragments and their associated term
weights. For demonstration purpose term weights in fragments are
assumed to be equal ($0,5$). Due to the propagation mechanism, term
weights at higher levels are decreased (by subtracting $0,1$ for
each level). Furthermore, the query contains a term \texttt{x} that
never occurs in any of the document components.
%
\begin{figure}[ht]
\centering
\includegraphics[width=0.45\textwidth]{04_document_representation/figures/static_dynamic_1}\\
\caption{Initial document and query}
\label{fig:static_dynamic_1}
\end{figure}

% general todo
In order to compute the similarity $sim(d_j,q)$ of a component $d_j$
and the query $q$, both representations have to be mapped onto a
common term space (static or dynamic). Terms $i$ not stated in a
component $j$ are weighed as zero ($w_{i,j}=0$). Terms in the query
that are not included in any of the components are neglected because
they do not lie in the term space. The resulting feature vectors of
$d_j$ and $q$ are of equal length and can be matched according to
the $sim(d_j,q)$ formula (Equations~\ref{eq:vsm_sim_static}
and~\ref{eq:vsm_sim_dynamic}).

% the example figure
Besides its impact on the term weights due to different inverse
element frequencies (which is masked in the example)
Figure~\ref{fig:static_dynamic} shows the effect of using either a
static or dynamic term spaces.
%
% example
\begin{figure}[ht]
\centering
\subfloat[Static Term Space]{\includegraphics[width=0.4\textwidth]{04_document_representation/figures/static_dynamic_2}\label{fig:static_dynamic_2}}
\qquad~\qquad%
\subfloat[Dynamic Term Spaces]{\includegraphics[width=0.4\textwidth]{04_document_representation/figures/static_dynamic_3}\label{fig:static_dynamic_3}}\\
\caption{Static versus dynamic term spaces}
\label{fig:static_dynamic}
\end{figure}

% static term space
In a static term space all components and the query are represented
using a vector size of six terms. On the one hand this supports
faster mapping (fixed length) and weighing (only a single $ief_i$).
On the other hand feature vectors become sparser because the largest
term space containing all terms of all components is used, resulting
in a drop of performance (longer feature vectors) and similarity of
smaller elements (see Figure~\ref{fig:static_dynamic_2}).

% dynamic term space
In contrast, using dynamic term spaces means that components and the
query are represented according to the context currently being
matched. In the figure, four term spaces are maintained:
\texttt{/DOC}, \texttt{/DOC/FRA}, \texttt{/DOC/SEC}, and
\texttt{/DOC/SEC/FRA} (the latter two are the same).

% discussion
The resulting similarities of both approaches are given in
Table~\ref{tab:static_dynamic}.
%
\begin{table}[ht]
\centering
\sffamily \small
\rowcolors{1}{tablerowcolorodd}{tablerowcoloreven}
\caption{Similarities using static and dynamic term spaces}
\label{tab:static_dynamic}
\begin{tabular}{L{2cm}cc}
\rowcolor{tableheadcolor} \color{white} Context & \color{white} static term space & \color{white} dynamic term spaces \tabularnewline%
%context static dynamic
\texttt{DOC} & 0,770 & 0,770 \tabularnewline%
\texttt{SEC} & 0,289 & 0,500 \tabularnewline%
\texttt{FRA 1} & 0,816 & 1,000 \tabularnewline%
\texttt{FRA 2} & 0,333 & 0,577 \tabularnewline%
\texttt{FRA 3} & 0,000 & 0,000% \tabularnewline%
\end{tabular}
\end{table}
%
The first thing to note is that the ranking of the components
according to their similarity stays the same. But -- as expected --
the similarities of smaller components (especially fragments) gain
much higher values using the dynamic term spaces. This improves
focused retrieval by ranking more specific components of smaller
granularity higher without effecting (decreasing) the similarity of
larger components. A performance evaluation of both approaches using
real world documents is carried out in
Chapter~\ref{chapter:evaluation}.

 

%----------------------------------------------------
%----------------------------------------------------
%----------------------------------------------------
\section{Summary}

This chapter introduced a generic document format for information
retrieval and its XML implementation applied throughout this work.
The generic format consists of only three components (document,
section, and fragment), which serve as both index and retrieval
objects. Documents from heterogenous sources are first mapped onto
this schema for further processing. In order to store the documents,
a relational database approach has been chosen. Finally,
representation issues are discussed. The focus is put on weighing
and ranking, providing two strategies based on the vector space
model: A single static term space approach and a more elaborated
approach reverting on multiple dynamic term spaces are presented.
%
An evaluation comparing these two approaches is conducted in
Chapter~\ref{chapter:evaluation}.

 

 

 

 

 

 

 

 

 

%