Research @ Hekkas.Com

Linguistically Enhanced Information Retrieval of Structured Documents

Structured Document Retrieval: An Overview

"Information is the currency of democracy." Thomas Jefferson

This chapter introduces structured document retrieval and provides the theoretical background of this work. The focus is put on preconditions and difficulties of applying information retrieval on structured documents.

 

Introduction

Information retrieval has become an important discipline in computer science. A vast amount of textual information and data is freely available to everyone, demanding for automatic methods that help to find and navigate through it. Additionally, information is no longer plain and unstructured. It has become a conglomerate of content, structure, and metadata. The main challenge now is to apply information retrieval techniques that fit this kind of information. The discipline that aims at resolving these difficulties is called structured document retrieval.

The main idea of structured document retrieval is to exploit the structure together with the content of documents to improve retrieval performance. This allows users to additionally retrieve parts of documents, so-called document components, that better fit their information needs instead of whole documents (e.g., an entire book). Thus, the aim of structured document retrieval is to return

  • document components of varying granularity (e.g., the entire document, a chapter, a subsection, a single paragraph/table/figure) that are
  • relevant to the user regarding their content, structure, and metadata.

Thus, focused retrieval -- as structured document retrieval is often called -- is concerned with finding the best entry points in structured documents~\cite{kazai02focussed}. These entry points are serving as starting points for further search and browse activities by the user.

The advantages of applying structured document retrieval are obvious: Large documents often mask relevant information because most of the entire document is irrelevant to a user query. Thus, parts of documents often provide better and more focused answers to queries regarding a certain topic. Moreover, precise queries can be formulated addressing not only the content, but also the structure and available metadata. Numerous studies have been conducted which highlight the improvement of retrieval effectiveness applying structured document retrieval~\cite{wilkinson94effective,kazai02focussed,kotsakis02structured}.

Text documents can be regarded as structured in several ways: The simplest (implicit) structure is according to the linear order of words, sentences, and paragraphs. Second, a structure is given by the links a document contains (e.g., hyperlinks, cross-references, citations). Also, temporal or spatial relationships within multimedia documents define a certain structure. Anyway, the most obvious way to define a documents' structure is to utilize the logical structure inherent in the documents' content (e.g., chapter, section, paragraph). Structural elements are defined as labeled nodes (optionally containing content) that are organized in a hierarchy. In structured document retrieval, this latter logical structure is exploited to define the units used for indexing, storing, and retrieving. According to the logical structure, textual information can be distinguished into three different categories~\cite{baeza-yates99ir,pal06survey}:

  • Unstructured information refers to a raw text, optionally containing markup for syntactic purpose only (e.g., formatting tags in HTML files). Every piece of information can be placed everywhere within the document. Placement in certain positions and/or markup tags do not indicate any semantic information about the content. Search results are whole contents (documents) that are relevant to a query to a certain degree.
  • Structured information refers to text which is formatted in a strict manner (e.g., database-like records in CSV files). Additional metadata explicitly defines the type, length, and other attributes of the content. Here, a piece of information is allowed only in a certain place associated with a certain semantic expressed in the metadata. Search results are data objects (e.g., data tuples) that are exact answers to queries addressing the content according to the metadata. Without knowledge about the underlying metadata and thus about the document structure, relevant information cannot be extracted.
  • Semi-structured information strikes a balance between these categories defined above (e.g., XML files). As with structured information, extra metadata describes the format of the content. However, it allows partial matching and missing elements. User-defined markup tags identify the meaning of the encapsulated content. Relationships between content containers are specified via nesting and references. Thus, the structure of a text is expressed without being too restrictive with content attributes (type or length restrictions, arbitrary nesting) and placing of different elements (e.g., sections, paragraphs, figures). Query results are parts of documents that match constraints on both, the content and the structure.

Retrieval models combining information on textual content and the logical structure are called structured text retrieval models~\cite[pp. 62]{baeza-yates99ir}. When speaking of structured text documents, in the remainder of this thesis we rather mean semi-structured documents. It is widely accepted by the scientific community that these terms are used synonymously. The reason for this is that in contrast to strictly formatted database systems textual documents are either considered structured or unstructured at all. In this work the term structured text document is used, although it more correctly refers to semi-structured text document.

While the logical structure provides documents with hierarchical levels of granularity, and hence more precision can be achieved by means of focused retrieval, it does, however, imply more requirements on the representation, storage, indexing, matching, retrieval, ranking, and graphical user interface. Because documents and queries are structured, standard retrieval models such as the vector space model (see Section~\ref{sec:vsm}) are not adequate anymore. Of course structured document retrieval involves the same tasks as traditional information retrieval. However, several aspects of information retrieval applied to flat documents have to be reconsidered and cannot be straight-forwardly applied on structured documents. Rather these methods have to be adapted to fit the structured document paradigm. In the past, structured document retrieval approaches focused on one of three issues~\cite{fuhr04xirql}:

  • The structural approach enriches traditional content-based approaches by including a structural constraint component. Documents are understood as an ordered set of independent nodes. This allows to restrict terms to certain document nodes by specifying additional constraints on the structure. However, these models are based on the boolean retrieval model and do not support weighing of index terms and ranking. A survey of these approaches is given in~\cite{navarro97proximal}.
  • The content-based approach represents a document as a sequence of plain text segments. A first attempt in this direction is passage retrieval~\cite{burkowski92passage,wilkinson94effective,kaszkiel97passage}, which is closely related to traditional content-based information retrieval. Only few researchers combined explicit structural information and content-based retrieval~\cite{fuhr04xirql}.
  • The tree-matching approach represents both documents and queries as ordered labeled trees. Thus, document retrieval is understood as an approximate tree-matching problem. Work about this approach can be found in~\cite{schlieder00result,schlieder02querying}.

Like in traditional information retrieval it is necessary to first identify and define the major tasks of structured document retrieval. These include the following aspects:

  1. Document format: Crucial for the performance of any retrieval system is its underlying document collection and the document format used. Besides processing and performance issues, it further has to support textual and binary (e.g., multimedia data) contents, fit the structural expressiveness requirements, and optionally include metadata at certain structural levels.
  2. Representation: Not all nodes in a document provide a meaningful portion of information because they might be too small or relevant only in a certain context. As a first step so-called indexing nodes are identified, defining the basic components that are able to be retrieved. Textual content of the hierarchically structured documents is generally restricted to the leave nodes. Hence, mechanisms to represent the content of inner indexing components have to be defined.
  3. Ranking: Related to the previous aspect, a scoring function to match document components and queries expressing their similarity and thus their relevance is needed. Based on the score the final results are ranked and returned to the user.
  4. Retrieval granularity: An important question is whether the retrieval units must be known ahead of time or are dynamically decided by the system itself. Based on the above score, the elements to be returned have to be decided.
  5. Query language: In order to formulate a users' information need a query language has to support complex constraints on the content, structure, and metadata.
  6. Result presentation: The way results are presented is a key issue and has to be considered early in the design phase. Once ranked, the results are displayed showing their context of appearance together with their relevance score. Browsing as a means to further explore documents containing multiple result components is inevitable.
  7. Evaluation metrics: Measuring the performance of a newly proposed retrieval system is essential to show its benefits. To be able to compare the results achieved against similar systems, a well-defined document collection, a set of user queries, valid results of the queries, and proper evaluation metrics are required.

In the sequel this chapter discusses these aspects.

 

XML Document Format

The XML document format~\cite{url_w3c_xml} has become an emerging standard for the representation of text documents. Simply being a text document, the content is marked up using self-defined tags which can be further enriched by metadata via attributes. The tags themselves must not overlap, providing a hierarchical structure of the document. From this point of view an XML document can be seen as a semi-structured text document that can be described as a tree of nodes. The root node corresponds to the first XML element. Each child node recursively starts a new branch of the tree. Generally, the content of the document is contained in the leaf nodes. However, XML allows nodes to contain both content and child nodes. These nodes are referred to as mixed content nodes.

XML documents may contain any kind of contents in their nodes, even bytecode. It further allows to combine metadata and content information in a single document. Based on their application area, XML documents can be separated into two different categories~\cite{fuhr01xirql}:

  • Data centric XML stores fully structured information in a database-like style. It is mainly used for data interchange. Two parties are enabled to exchange data based on their defined document scheme (DTD or XML schema) and an eXtensible Stylesheet Language Transformation (XSLT)~\cite{url_w3c_xslt} file. The XSLT file is used to translate documents instantiating one schema into documents instantiating a different one. Generally, the order among XML elements in such documents is not crucial.
  • Document centric XML, in contrast, deals with hierarchically structured full-text documents. As with data centric documents, a DTD or XML schema specifies valid document structures. But this kind of documents is not supposed to be transformed. Rather the schema is used to check whether a document is valid or not. In general, the order of XML elements plays a central role (e.g., the introduction always comes before the conclusion).

The difference between data and document centric documents is not a sharp one. Data centric documents may involve an order among their child nodes, whereas document centric documents may also contain unordered metadata like the author's name or section titles. But commonly it is clear whether XML documents are used in a data or document centric way. This work focuses on processing document centric XML documents, where the order of elements is of importance. However, special types of nodes dedicated to store metadata are interpreted in a data centric point of view.

Using XML for structuring documents yields several advantages. (1) XML allows to mirror domain and application specific document structures using a set of self-defined tags. (2) These tags can be further extended by metadata information in the form of self-defined attributes. (3) Moreover, XML supports internal and external linkage (XLink~\cite{url_w3c_xlink} and XPointer~\cite{url_w3c_xpointer}). (4) XML elements in the structure are addressed via their unique path (XPath~\cite{url_w3c_xpath}) starting at the root element of the document.

The definition of the structure is located in a DTD~\cite{url_w3c_dtd} or in a newer XML schema file~\cite{url_w3c_schema}, to which XML document instances (the actual XML documents) are referring. Based on the structural specifications, XML instances can be checked automatically whether they are valid or not according to the schema given. In contrast to DTD, XML schemata also support the definition of various data types and formats to restrict the content within XML document structures.

Listing 2.1 shows a simple XML document and its representation as an ordered tree (Figure 2.1), where the order among child nodes is of importance. In the first two lines of code the specification file for valid structures, here an XML Schema file named bookdef.xsd, is referenced.

<book xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:noNamespaceSchemaLocation="bookdef.xsd">
  <author>M. Hassler</author>
  <title>XML Retrieval</title>
  <abstract>This work ...</abstract>
  <chapter title="Introduction">
    <section title="Section 1">XML is one ...</section>
    <section title="Section 2">
      This section ...
      <subsection title="SubSec 2.1">These days ...</subsection>
      <subsection title="SubSec 2.2">Based on ...</subsection>
    </section>
  </chapter>
  ...
  <chapter title="Outlook">Crucial for ...</chapter>
</book>
Listing 2.1: XML example code
Figure 2.1: XML representation as an ordered tree

Gray shaded nodes in the XML tree refer to the plain content of the document. All other nodes contain only structural information. One might note that only a single node (Section 2) contains both, content and structure. Such a node is called a mixed XML node. Anyway, the content of XML structured documents is generally restricted to the leaf nodes. Additional information in form of metadata comes with the title attributes in the chapter, section, and subsection nodes. In the example the elements author and title are not implemented as attributes of the book element, which would also be a possible solution. The XPath expression to address the content of the first subsection SubSec 2.1 is given by /book[1]/chapter[1]/section[2]/subsection[1].

Generally, there are no standards or even guidelines available that describe how documents are properly structured. Even the tagsets used for markup vary strongly from one domain to another. This simple example already indicates the difficulties one has to face when confronted with XML documents from heterogenous sources or different structure definitions (e.g., mixed nodes, attribute versus element, etc.).

Exploiting the XML format for information retrieval purposes has several advantages~\cite{luk02survey}: (1) more precise search by providing additional information in the elements; (2) unified access to documents from heterogenous sources; (3) powerful search paradigm using structural, content, and metadata specification; (4) data and information interchange to share resources and to support cooperative search.

Within the structured document retrieval paradigm, the structure provided by XML documents is consulted to define the units for indexing and retrieval. That is the reason for using XML retrieval as synonym for structured document retrieval or focused retrieval. In this context the goal of XML retrieval can be reformulated as retrieving document components (XML elements) of arbitrary granularity that are relevant to a certain user query instead of whole documents only.

 

Related Work

This section outlines several aspects and work related to structured document retrieval. It covers historic developments and depicts currently probed approaches in this field.

 

Database and Information Retrieval Perspective

There are two main communities who have concentrated their efforts on the computation of semi-structured XML documents: the database community and the information retrieval community~\cite{luk00survey,pal06survey}.

The database-oriented approach tries to tackle the problem of querying structured information in a database like manner. Most of the information on the internet exists in form of HTML documents and/or is stored in relational databases. Both information sources can be represented using XML (e.g., see~\cite{url_rdb}) as an intermediate document format. This opens the door to apply database-like search techniques which are further extended to support proximity search and ranking.

Simply speaking, XML documents are plain text files. Thus, information retrieval approaches can be applied straightforward to search and retrieve information from these documents. As soon as XML turned out to be closely related to HTML, the information retrieval community focused their research on this area~\cite{pal06survey}. A first attempt was to simply ignore the additional markup, which led to lower retrieval performance~\cite{luk00survey}. More elaborated approaches include extra indices of the structure, support regular expression matching of XML paths and contents, and affiliate basic semantics with certain structural elements. However, these extensions were not as straightforward as assumed, leaving much space for further research.

 

Fragment and Passage Retrieval

Early work on returning parts of documents instead of full documents in response to user queries was done by John O'Connor~\cite{connor75passage,connor80passage}. The idea is to segment a text into so-called text passages and then to apply standard indexing and retrieval models. In this context a text passage is defined as a set of `consecutive words', a sequence of words in `reading order'. According to this approach, documents are considered as linear sequences of text fragments. Passages are mostly defined as non-overlapping and do not exactly match the underlying document structure. Later work by Mittendorf and Schäuble~\cite{mittendorf94document} focuses on query-independent text segmentation using Hidden Markov Models.

Another work, although based on passages but closer related to the notion of logical structure, was proposed by Burkowski in \cite{burkowski92passage}. In this approach a document is represented using multiple lists of non-overlapping passages, where each list corresponds to a certain level in the hierarchical structure expressed through tags (e.g., headlines, sections, paragraphs, authors). Queries are formulated as retrieval command strings, implementing a text algebra based on a defined containment model. The final result set is ranked applying a statistical ranking strategy based on document length, term frequency, and inverse document frequency.

Salton et al.~\cite{salton93approaches} probed several passage sizes which are evaluated using the SMART retrieval system. Similar experiments are conducted by Callan~\cite{callan94passagelevel} who uses paragraph-sized and window-sized passages in the INQUERY retrieval system. His evaluation results show that a combination of document- and passage-level outperforms the others.

Further experiments done by Wilkinson~\cite{wilkinson94effective} confirm that combined results of whole documents and parts of documents enhance retrieval. In \cite{wilkinson94comparison} he and his colleague found that a page size of about 1000 bytes define useful fragments.

Kaszkiel and Zobel~\cite{kaszkiel97passage} came to the same conclusion as Wilkinson, showing that a fixed length passage approach is efficient and robust. However, in subsequent experiments they found that passages of arbitrary length improve retrieval performance significantly.

Based on these findings, classical passage retrieval methods assume a window of fixed length which is slid stepwise over the whole text of a document~\cite{chiaramella01sdr}. In each position the distribution of words is analyzed. Passage boundaries are found if the distribution of words within two subsequent window positions changes significantly. Having identified the passages, it is straightforward to apply standard information retrieval strategies on the corpus of text passages instead of the documents. However, passages do not exactly reflect the logical structure of documents intended by their authors. Thus, retrieval performance strongly depends on the applied segmentation of a text into passages. Also, splitting texts into passages predefines all possible units for indexing and retrieving in a static way. This is one of the reasons why the logical structure of documents is believed to improve the efficiency of retrieval systems~\cite{chiaramella01sdr}.

Grossman and Frieder~\cite{grossman04ir} briefly review initial work in passage retrieval and discuss approaches as marker-based passages, dynamic passage partitioning, and merging passage-based similarity measures.

 

Retrieval Models

 

Language Models

Ogilvie and Callan~\cite{ogilvie03language} propose a tree-based generative language model approach for ranking document components. In their work they use probabilistic context free grammars to estimate the probability of parse trees for sentences within certain components. The probability of a specific parse tree, and thus its ranking, is computed as the product of the probabilities of all rules applied in creating this (sub)tree. For each leaf node in the document tree the language model is estimated directly from the text attached to the node. Therefore, three different ranking models, the Kullback-Leibler Divergence, the Generative Language Model, and the Maximum-Likelihood Estimate are evaluated. In order to represent inner nodes, linear interpolation is applied, where the parameters for combining language models of children to ancestor nodes are learned from a large data repository. These parameters conform to the augmentation factor described by Fuhr et al.~\cite{fuhr98dolores}. Smoothing, the re-estimation of the probabilities in a language model, improves the estimates by including knowledge based on sample data.

Kamps et al.~\cite{kamps04length} suggest to support language-model based ranking strategies by priors, smoothing functions, and cut-off values. While priors allow non-content features such as the length of a document component to be included in the scoring mechanism, smoothing is used to increase the relevance score of small components containing only a small subset of the query terms. Without smoothing, larger components containing more query terms are favored. In order to avoid too small components to be retrieved they propose cut-off values for the minimal length of retrieved elements. Their experiments show that extreme length normalization is an important issue in XML retrieval as a means to improve retrieval performance.

 

Bayesian Models

Piwowarski et al.~\cite{piwowarski03bayesian} use Bayesian networks for the sake of XML retrieval. Complex queries are decomposed and translated into elementary subqueries, where each subquery refers to a single component. Subqueries are implemented as Bayesian networks (directed acyclic graphs). In the graph, nodes represent XML components (tag names) that are interconnected via arcs representing relations between the components. The joint probability of a set of $i$ components $\{x_i\}$ is given by

\begin{equation*}
P(\{x_i\}) = \prod_i P(x_i | x_i~parents)
\end{equation*}

where $x_i~parents$ denotes the parents of node $x_i$ in the network. The content of document components is represented using external models. The relevance score $P(d|q)$ of a document $d$ with respect to query $q$ is given by

\begin{equation*}
P(d|q) = \sum\limits_{i=1..n} P(x_1,x_2,...,x_n | q)
\end{equation*}

In order to answer a user query, the evidence is added to the network as shown in Figure~\ref{fig:bayesian_network}.

\begin{figure}[h]
\centering
\includegraphics[width=0.5\textwidth]{03_xml_retrieval/figures/bayesian_network}
\caption[Structured document retrieval using Bayesian networks]{Structured document retrieval using Bayesian networks~\cite{piwowarski03bayesian}}
\label{fig:bayesian_network}
\end{figure}

For every target element to be retrieved, the set of elementary networks is connected to compute its global score. As an example, the relevance of $sec[1]$ in the figure is given by:

\begin{equation*}
P(sec[1]|q) = \sum\limits_{d} P(d|q)P(sec[1]|d,q)
\end{equation*}

Additional connection parameters are estimated from the corpus using the Estimation Maximation algorithm~\cite{dempster77maximum}.

 

The Vector Space Model

Introduced by Salton in 1968, the vector space model~\cite{salton68indexing,salton71smart,salton75vsm} provides a solid framework that allows to compute a degree of similarity between a document and a query. The model represents documents and queries as term vectors with associated term weights. Term weights~$w$ are calculated by combining term statistics (term frequency $tf$) and corpus-based statistics (inverse document frequency $idf$) using the Equations~\ref{eq:vsm_tf}--\ref{eq:vsm_w}~\cite[pp. 27--30]{baeza-yates99ir}.

\begin{eqnarray}
tf_{i,j} &=& \frac{freq_{i,j}}{\max freq_{l,j}} \label{eq:vsm_tf}\\
idf_i &=& \log \frac{N}{n_i} \label{eq:vsm_idf}\\
w_{i,j} &=& tf_{i,j} \cdot idf_i \label{eq:vsm_w}
\end{eqnarray}

The actual weight $w_{i,j}$ of a term $i$ in a document $j$ is computed as the normalized term frequency $tf_{i,j}$ times the inverse document frequency $idf_i$. $freq_{i,j}$ refers to the raw term frequency of term $i$ in document $j$, and $\max freq_{l,j}$ is the maximum raw frequency of any term $l$ in document $j$. $idf_i$ can be seen as a terms' discriminative power among the documents. In the formulae $N$ is the total number of documents in the system, whereas $n_i$ stands for the number of documents containing term $i$. The effect of the inverse document frequency is to downweigh terms that occur frequently throughout the whole collection, and the other way round, to increase the importance of terms that are rare.

Queries are represented and weighed in the same way. The similarity of a document $d_j$ and a query $q$ is defined as the cosine measure between the two weighed feature vectors (Equation~\ref{eq:vsm_sim}). $t$ denoted the total number of terms.

\begin{equation}\label{eq:vsm_sim}
sim(d_j,q)
= \frac{\overrightarrow{d_j} \bullet \overrightarrow{q}} {|\overrightarrow{d_j}| \cdot |\overrightarrow{q}|}
= \frac{\sum^t_{i=1} w_{i,j} \cdot w_{i,q}} {\sqrt{\sum^t_{i=1} w^2_{i,j}} \cdot \sqrt{\sum^t_{j=1} w^2_{i,q}}}
\end{equation}

 

Vector Space Model based Approaches

Grabs and Schek~\cite{grabs02vsm,grabs03powerdb} propose a model based on vector spaces generated on-the-fly for flexible XML retrieval. The main idea is that content on different levels in a document tree is considered to be of different importance with respect to a query. Therefore, index nodes are identified that are indexed and retrieved. More distant nodes are treated as less important than nodes closer to the index nodes. Term weighing is done using the traditional $tf \cdot idf$ formula, where the notion of $idf$ is extended to inverse element frequency. For their experiments at INEX 2002 Grabs and Schek distinguish single-category retrieval, multi-category retrieval, and nested retrieval (which they call flexible retrieval), reflecting the complexity of the queries (single-category $\le$ multi-category $\le$ nested).

Single-category retrieval refers to retrieval of index nodes addressed through a single path expression (e.g., paragraphs in main sections /ARTICLE/SEC/P).

Multi-category retrieval extends retrieval to disjunct sets of index nodes (e.g., paragraphs and figures in main sections /ARTICLE/SEC/P or /ARTICLE/SEC/FIG).

The aim of nested retrieval is to provide efficient and consistent retrieval over arbitrary combinations and nestings of XML components. It enables queries that are restricted to whole subtrees (e.g., everything within main sections /ARTICLE/SEC/*). Subtrees may comprise besides multiple categories additional non-index nodes at higher levels and complex containment relations. Augmentation is applied to downgrade term weights according to Fuhr et al.

Indices and statistics are preprocessed and stored for leaf elements only. Representations of inner nodes are computed on-the-fly when needed.

Another approach based on the vector space model is proposed by Mass and Mandelbrod in 2004~\cite{mass04ranking}. In order to overcome the difficulties occurring in nested components, they create separate indices for components on the same level (e.g., all documents, all sections, all paragraphs). For each of these indices, formulae from the vector space model are used for weighing and for similarity computation. During query processing, all indices are searched, resulting in multiple result sets of varying granularity. As a final step, a document pivot factor is applied to combine and re-rank the retrieved result sets. In their experiments Mass and Mandelbrod show a 30\%--50\% improvement of mean average precision compared to previous retrieval results achieved in~\cite{mass03relevant}.

Liu et al.~\cite{liu04indexing} also propose a $tf \cdot idf$ model for structured document retrieval. XML documents are represented using Ctrees, which is a compact tree representation for the whole document collection. Queries are expressed in the NEXI language (see Section~\ref{sec:sdr_rel_work_ql}) used at the INEX workshop in 2003 and represented as query trees. The content of XML elements is represented as weighed feature vectors applying a combination of term frequency, inverse element frequency, label weights, and weights for query term modifiers. During retrieval, the query is decomposed into a set of retrieval components which are grouped according to their structural specificities. XML components matching a subquery are evaluated independently and relevance scores are assigned. The overall relevance of an XML component is obtained by merging all subquery relevance results to a single score. Finally, a ranked list of XML elements is returned to the user.

 

Retrieval Units

Hatano et al. \cite{hatano03unit} addresses the problem of determining proper retrieval units of XML documents. By defining XML elements as retrievable (meaningful) or not retrievable (meaningless), the number of targeted portions of XML elements can be reduced. Kazai et al. refers to meaningless contexts as stop contexts \cite{kazai02scalable} (e.g., XML elements carrying only layout information). This distinction leads to improvements during indexing, matching, and retrieval.

The remaining (meaningful) portions of XML documents are called coherent partial documents. They are identified automatically based only on the topology of a document tree. Thus, no explicit definition of the structure in form of a DTD or XML schema file is needed. A context node is defined to be an ancestor node which does not have sibling nodes of the same type (element name). To avoid too small contexts of text nodes, a minimum distance of two nodes (grandparent) is demanded.

 

Query Languages

XIRQL~\cite{fuhr00xirql,fuhr01xirql,fuhr03interface,fuhr04xirql} is an XML query language based on XQL~\cite{url_w3c_xql} which is extended by information retrieval concepts for weighing, data types and vague predicates, relevance-oriented search, and semantic relativism. These functionalities are implemented via special XIRQL operators. Boolean operators enable the user to combine subqueries to more complex ones. The final XIRQL query is transformed into an underlying path algebra defined in~\cite{fuhr01xirql,fuhr04xirql}, which is then optimized and executed.

Theobald and Weikum~\cite{theobald02indexbased} propose XXL (fleXible XML search Language) for querying structured documents. In XXL a query is internally represented as a directed labeled data graph, including a set of wildcard placeholders for path constraints and single element names. Besides the standard set of operators on strings and other simple data types, a special tilde operator $\sim$ is used to enable semantic similarity matching based on ontologies. A graphical interface named Visual XXL GUI helps formulating complex user queries.

Proposed by O'Keefe and Trotman~\cite{okeefe03query}, NEXI (Narrowed Extended Xpath I)~\cite{trotman04nexi1} is developed as a query language used at INEX since 2004. Since XPath seems to be too complex for end-users to formulate their information needs~\cite{pal06survey}, NEXI implements only a subset of XPath to address certain XML elements (hence its name narrowed). Because XPath does not support information retrieval like matching, NEXI is extended by an about() predicate that enables similarity-based content matching.

For example, the NEXI query /doc[about(., computer)]//sec[about(./par, apple)] searches all documents <doc> that contain the term computer and returns sections <sec> with paragraphs <par> that contain the term apple. The semantics of NEXI queries is not defined; instead, the retrieval engine has to deduce it, which is why NEXI differs from database-oriented query languages. Introducing NEXI at INEX significantly improved retrieval results: The error rate of experts formulating complex queries dropped from 63\% (in 2003) to 12\% (in 2004)~\cite{trotman04nexi2}.

Geva et al.~\cite{geva06xor} introduced XOR, the XML Oriented Retrieval language, which is intended to replace NEXI as query language in INEX. XOR is based on the previous NEXI syntax and therefore backwards compatible. Extra functionalities are added to support advanced information retrieval tasks. These extensions include a negation operator for content matching (-about(...)), logical operators for query combination (AND, OR, NOT, ANDNOT, SUPPORT), path matching extensions (strict, vague), term extensions (e.g., part-of-speech tags, case of letters), logical operator qualifiers (strict, vague), and additional predicates (e.g., LinkTo(), LinkFrom(), Contains(), lt(), eq(), gt()). Based on these definitions, an open-source parser for the XOR language was developed. The parser is able to translate XOR queries into reverse polish notation supporting the implementation of back-end processors. However, XOR is not supposed to be used as an end-user query language. It is meant to be an underlying language model supported by natural language query interfaces and query generators. Reported results of four different baseline systems using XOR instead of NEXI show significant retrieval improvements.

 

Performance

Fuhr and Gövert~\cite{fuhr02index,fuhr02compression} address an issue regarding the performance of structured document retrieval systems. They suggest index compression to scale down storage requirements for indexing XML documents. In their work, inverted files with compressed paths are used to speed up database access. Further, a new data structure called XS tree is proposed, which stores structural information of a document in a very compact form.

 

Index Objects

The first step of searching structured documents is to define the way documents are indexed. This index provides the basis for applying weighing and ranking formulae, which allow similarity-based matching of XML components to queries. Generally, there are two possible options: the development of a completely new weighing scheme; or the adaptation or generalization of existing ones (e.g., language models, Bayesian models, vector space model).

According to Fuhr and Großjohann~\cite{fuhr01xirql}, the long experience with existing weighing and ranking strategies in traditional information retrieval favors the second option. Thus, the atomic units or index objects of a document must be identified. According to Fuhr and Großjohann~\cite{fuhr01xirql} this results in two benefits: first, traditional information retrieval models can be applied for processing these elements; and second, only index objects are returned to the user.

Starting with the assumption that textual contents are restricted mainly to the leaf nodes of the document tree, these leafs are obvious candidates for atomic units. However, their granularity might be to fine grained as retrieval result~\cite{fuhr01xirql}. Just assume that a query searching for M. Hassler returns only the component <author>M. Hassler</author>, not including any additional information about the document he wrote. Without any context, this turns the result completely useless. Hence, Fuhr and Großjohann propose to rely on a hand-crafted definition of disjunct index nodes (also called index objects or contexts). These index nodes are identified either by (1) analyzing the underlying DTD or XML schema, or alternatively (2) according to the type of content (e.g., chapter elements).

The root of a document itself is defined as the uppermost index node. The contents of all remaining non-index nodes are indexed at their nearest ancestor index node.

Figure~\ref{fig:indexing_hassler} shows an example document and its corresponding index nodes (dashed boxes). In order to get the complete content of an arbitrary index node, the contents of all descendant index nodes are propagated upwards (arrows) and are combined with the content of the current index node itself.

\begin{figure}[h]
\centering
\includegraphics[width=0.99\textwidth]{03_xml_retrieval/figures/indexing_hassler}\\
\caption{Example XML document tree with disjoint index objects}
\label{fig:indexing_hassler}
\end{figure}

An automatic identification of appropriate index (and retrieval) nodes is proposed by Hatano and his colleagues~\cite{hatano03unit}. The idea is to select only nodes that have sibling nodes at the same level in the hierarchy. These nodes are characterized by the same path starting at the root node. Figure~\ref{fig:indexing_hatano} depicts the identified indexing paths /book/chapter, /book/chapter/section, and /book/chapter/section/subsection of the previous example.

\begin{figure}[h]
\centering
\includegraphics[width=0.99\textwidth]{03_xml_retrieval/figures/indexing_hatano}\\
\caption{Automatic identification of disjoint indexing units}
\label{fig:indexing_hatano}
\end{figure}

Again, the root node is defined as prime index node. In order to avoid unidentified index nodes in document instances where only a single element is implemented (e.g., only a single chapter within a book), the identification can be shifted to an analysis of the DTD or XML schema instead. However, in cases where a definition of the structure is missing this option might not be available. Thus, multiple chapters in documents are indexed whereas a single chapter is not. In the above example one can see that applying this approach misses at least the <abstract> component as an index object. If other nodes similar to <abstract> occur frequently within the document domain, an automatic approach seems not to be the best solution.

The reason for indexing only a selection of nodes results in two main advantages:

  • The number of nodes which have to be indexed is reduced. Especially in the context of structured document retrieval this increases performance during the indexing, matching, and retrieval processes.
  • Meaningful portions of information are defined independently of user queries. This ensures that the system does not return XML components that are too small to be interpreted without contexts.

 

Content Representation and Weighing

In order to represent structured documents for retrieval purposes, many approaches have been proposed~\cite{fuhr01xirql,kazai02focussed,carmel01juru,kotsakis02structured}. Good surveys are given by Luk et al.~\cite{luk02survey}, Pal~\cite{pal06survey}, and Pinel-Sauvagnat and Boughanem~\cite{pinel07survey}.

The representation is always guided by the efficiency of processing queries and retrieving relevant parts of documents. While many structured retrieval systems rely on persistent XML databases, the representation varies from one system to another. Furthermore, mechanisms for indexing and ranking document components during retrieval are different. Approaches proposed by the information retrieval community are similar to those investigated by the database community to some degree. However, while the latter class of approaches aims at dealing with the boolean model, the former is concerned with more advanced models that allow to handle both structure and content of documents. From this point of view it is reasonable to combine both paradigms, that is fast database processing and sophisticated similarity computation to efficiently and accurately compute retrieval results.

Having identified the index objects described in the previous section, methods borrowed from flat document retrieval can be applied to represent and weigh the content of these nodes. Traditionally, content-based retrieval systems rely either on the boolean model or on the vector space model~\cite{salton68indexing,salton71smart} to represent the content as a bag of words. Extensions of these models have been proposed, e.g., the fuzzy boolean model, knowledge-aware models, or the extended vector space model~\cite{baeza-yates99ir}.

However, all of these indexing models ignore the logical organization of texts expressed by the documents' structure. In order to apply existing models to structured document retrieval the content of components is represented and stored using flat indexing. Beyond that mechanisms dealing with the hierarchical relationships among document components have to be applied.

For illustration purpose this section relies on the vector space model (described in Subsection~\ref{sec:vsm}) to represent the content as a weighed feature term vector. The actual weight of a term is calculated by combining term statistics and corpus-based statistics (Equations~\ref{eq:vsm_w} -- \ref{eq:vsm_idf}). The challenge that arises at this point is to aggregate multiple disjoint feature vectors of descendant index objects to a single feature vector of their parent index object. For this purpose two different solutions are feasible~\cite{abolhassani04hyrex}:

  • Propagation of term statistics To compute the weight of terms for a given node, raw term statistics (e.g., term frequency, document length) of its descendants are accumulated (i.e., summed up). Weighing formulae are applied afterwards to yield the final term weights (on-the-fly weighing~\cite{grabs02vsm}). In other words, term weights are calculated independently of the weights associated with the node's descendants. Thus, no recombination of term weights (using empirical parameters) is needed and existing formulae can be applied without being altered.
  • Propagation of term weights In contrast to the propagation of term statistics, term weights of a node are computed by aggregating the weights of its direct descendants without explicitly referring to their term statistics. Therefore, mechanisms for combining multiple term weights have to be defined.

    A concept proposed by Fuhr and his colleagues to combine the representations of lower components with higher ones, first described in~\cite{fuhr98dolores}, is called augmentation. The idea is to augment the content of a component by the contents of its children. Figure~\ref{fig:augmentation} explains the concept of augmentation in a brief example.

    \begin{figure}[h]
    \centering
    \subfloat[Initial term weigths]{\includegraphics[width=0.31\textwidth]{03_xml_retrieval/figures/augmentation_1}\label{fig:augmentation_1}}
    \quad%
    \subfloat[Aggregated term weights]{\includegraphics[width=0.31\textwidth]{03_xml_retrieval/figures/augmentation_2}\label{fig:augmentation_2}}
    \quad%
    \subfloat[Aggregated term weights with augmentation factor of $0.6$]{\includegraphics[width=0.31\textwidth]{03_xml_retrieval/figures/augmentation_3}\label{fig:augmentation_3}}
    \caption[Propagation of term weights]{Propagation of term weights (cf.~\cite{fuhr04xirql})}
    \label{fig:augmentation}
    \end{figure}

    The left tree represents an XML document consisting of three components (Figure~\ref{fig:augmentation_1}). The root element contains the term XQL and consists of two children elements (mixed content node). The first child states the term example and the the second child mentions the terms XQL and syntax. The numbers in front of the terms are the independently computed terms weights of each component. The root node of the tree in the center contains already the propagated term weight for XQL (Figure~\ref{fig:augmentation_2}). Here, the weight of child components adds to the ancestors weight by applying the inclusion-exclusion formula presented in~\cite[pp. 20]{billingsley79probability}:

    \begin{equation*}
    P(e) = P(C_1 \vee ... \vee C_n)
    = \sum_{i=1}^n (-1)^{i-1} \left ( \sum_{i \le j_1 \le ... \le j_i \le n} P(C_{j_1} \wedge ... \wedge C_{j_i}) \right )
    \end{equation*}

    However, this example shows already the dilemma: the term weight of the ancestor always exceeds the weight of the child. The solution proposed by Fuhr et al. is to introduce so-called augmentation weights ($\in [0;1]$) for downgrading propagated weights. In the right tree (Figure~\ref{fig:augmentation_3}) an augmentation factor of $0,6$ is applied. While applying this factor still increases the weight of the ancestors term, it still keeps the descendants weight higher. By allowing different augmentation weights in different index nodes, domain and collection dependent tuning of term weight propagation is possible.

    Since term weights already include corpus-based statistics such as the inverse document frequency, representations of root nodes applying augmentation diverge from flat document representations on the same content using the same weighing formulae.

In this work the first approach, namely the propagation of term statistics, is followed. This decision is mainly taken for one reason: The content representation of the root node exactly matches the traditional flat information retrieval approach. Due to this correspondence it is possible to test and compare the performance of structured document retrieval to traditional information retrieval engines on the document level. Whichever propagation is used, one has to be aware of its most important implication: Due to the aggregation of information from descendant nodes to ancestor nodes, a relevant child component implies relevance of all its ancestor components -- at least to a certain degree. The calculation of XML component representations carried out in this work can be summarized as a four-step process:

  1. Disjoint index objects are identified
  2. Contents of index objects are represented and stored in form of term statistics (local representation)
  3. Starting at the leaf nodes, term statistics are propagated upwards the document hierarchy (global representation)
  4. The final representations are computed by weighing the propagated term statistics using corpus-based or path-based statistics (weighed feature vector)

 

Querying and Ranking

Representing index objects as described in the previous section is a basic requirement of querying and subsequent ranking of document components. The concept of searching within structured documents is based on the comparison of each components' representation with a query that is represented in the same way. In contrast to flat documents, each query is compared multiple times (once for each component) to a single structured document. Hence, scalability depends on efficient algorithms to manage these vast amount of comparisons.

Sticking with the vector space model (see Subsection~\ref{sec:vsm}), the similarity of a document component and a query representation is defined as the cosine measure between the two weighed feature vectors (Equation~\ref{eq:vsm_sim}). However, computing the similarity of single XML component contents stored in the database independently is not sufficient. It is important to include the contents of descendant nodes as well. To illustrate this issue consider the example in Figure~\ref{fig:querying_tree_1}. The same document with its aggregated representations indicated is given in Figure~\ref{fig:querying_tree_2}. For illustration purpose the term weights are passed upwards without being changed. Note that this refers to maximal weights at the ancestors, hence propagation is meant only to reduce propagated weights.

\begin{figure}[h]
\centering
\subfloat[Initial tree]{\includegraphics[width=0.4\textwidth]{03_xml_retrieval/figures/querying_tree_1}\label{fig:querying_tree_1}}
\qquad%
\subfloat[Aggregated tree]{\includegraphics[width=0.4\textwidth]{03_xml_retrieval/figures/querying_tree_2}\label{fig:querying_tree_2}}\\
%
\subfloat[Cosine similarities]{\includegraphics[width=0.6\textwidth]{03_xml_retrieval/figures/querying_tree_3}\label{fig:querying_tree_3}}\\
\caption{Query matching in structured documents}
\label{fig:querying_tree}
\end{figure}

If someone wants to know something about pets, this document (Figure~\ref{fig:querying_tree_2}) is queried for instance with the weighed terms 0,90 cats and 0,80 dogs. Figure~\ref{fig:querying_tree_3} gives the similarities (cf. Equation~\ref{eq:vsm_sim}) of all document components and this query, showing an interesting result: Both sections are relevant to the query reaching a similarity of 74,7\% and 66,4\%. Chapter 2 is less relevant than both sections (65,5\% similarity). From a focused retrieval point of view this comes quite unexpected, because the second chapter is the smallest component that fully covers the query. Thus, it is expected to be the optimal (top-ranked) result to this query. The whole book itself is relevant to a degree of 44,9\%, which again meets a users' expectation.

From this example it seems to be necessary to include further mechanisms that reflect structural relationships more accuratly. In structured document retrieval every index object is a potential query result item. Hence relevant descendant components imply that ancestor components are also relevant to the same query (even to a less degree), overlap of query results is unavoidable. This ability puts additional requirements on the retrieval mechanism, which has not only to return relevant components but also components of the correct granularity~\cite{kamps03retrieve}. Neither should the components be too small nor should they be too large. From this point of view, the highest rank should be assigned to the second chapter in Figure~\ref{fig:querying_tree}. One possible solution is to redesign the propagation strategy. However, this may lead to weighing inconsistencies at the representation level. Another approach combines the computed similarity values of ancestors and descendants directly. The final score, often called Retrieval Status Value (RSV), has to take account of these issues.

As stated above the retrieved units in structured document retrieval are not whole documents but document components. This leads to difficulties in query matching, because a query is no longer matched against one (large) document feature vector. Now it is compared to a lot of (respectively small) leaf representations, as well as to many inner nodes with accumulated -- though not necessarily small -- feature vectors. Using a term space that consists of all document terms, smaller portions of text consequently lead to sparser representations. This sparseness of feature vectors has to be considered, because single term matching results in high relevance values. Generally, natural language processing techniques have not significantly improved the performance of flat document retrieval~\cite[pp. 227]{cole97survey}. However, in the domain of structured documents dealing with small portions of text retrieval quality is believed to benefit from deeper linguistic analysis.

 

Query Language

In the past several languages for querying XML documents have been proposed (XPath~\cite{url_w3c_xpath}, XQL~\cite{url_w3c_xql}, XQuery~\cite{url_w3c_xquery}, XML-QL~\cite{url_w3c_xmlql}). An overview of the features of five of these languages (LOREL, XML-QL, XML-GL, XSL, and XQL) is provided by Bonifati and Ceri in~\cite{bonifati00query}. Two other surveys are given by Deutsch et al.~\cite{deutsch99querying} and Luk et al.~\cite{luk02survey}.

However, these languages are still lacking some important features. Missing concepts range from ignoring data types and absence of term weighing functionality to unsupported similarity matching~\cite{fuhr00xirql}. Also, the complexity of these languages is too high and the syntax is too complicated for users to express their information needs~\cite{woodley06nlp}. Anyway, these languages provide basic building blocks and good starting points for the development of more appropriate information retrieval query languages. The requirements put on a query language can be summarized as~\cite{fuhr03interface,fuhr04xirql}:

  • Weighing Document and query term weights enhance retrieval performance.
  • Relevance-oriented search If no specific result elements are specified, the system should return the most specific document components.
  • Data types and vague predicates Data types refer to semantic categories that explain the content of an XML component more clearly (e.g., PersonName, Title, PlainText). Data types allow for better search results in case these are addressed in a correct manner. Special search predicates defined on certain data types provide sophisticated matching functionality.
  • Structural vagueness If a user searches for a certain value without caring about the underlying document structure, the system should be able to generalize structures to fit the users information need. This includes attribute versus element distinction, generalization of ancestor-descendant relations, as well as similarity of element and attribute names. Furthermore, data types provide another kind of generalization regarding elements and attributes.

Based on these requirements more appropriate higher-level query languages like XIRQL, XXL, NEXI, or XOR (see Section~\ref{sec:sdr_rel_work_ql}), to name just some of them, have been developed. Intended to bridge the gap between pure structure-oriented and content-based retrieval, all of these languages have their own syntax and semantic, but in general, their basic functionalities remain the same. However, only NEXI and XOR are supposed to be interpreted by a retrieval engine only. Listing 2.2 presents an example query in XOR.

//book[about(.,XML) ANDNOT about(.,XSLT)]//*//subsection[about(.,These days)]
AND
//book[eq(.//author,M. Hassler) OR about(.//*,Retrieval)]
Listing 2.2: XOR query example

In the listing, the XOR query is composed of two main subqueries (line 1 and 3). The first subquery retrieves subsection elements containing These days which occur in books that treat XML but not XSLT. The second subquery retrieves book elements where the author is exactly M. Hassler or where any element is about Retrieval. Finally, the two subqueries are combined via an AND condition. The order of AND-combined subqueries is implicitly defined: The subquery addressing components of larger granularity (line 2 returns book elements) functions as filter for the subquery addressing smaller components (line 1 returns subsections). As a result, a set of subsection elements within book elements is returned to the user.

This example shows that formulating such queries is definitely not an easy task and even error-prone for expert users. Hence, this language is not intended to be applied directly by users. It is supposed to be supported by query generators offering an (interactive) user interface to formulate queries that are both, syntactically and semantically correct.

Novel approaches for querying XML documents are presented by Woodley et al.~\cite{woodley06nlp}. In their work the goal is to transform natural language queries into the NEXI query language automatically. Although different, the approaches described comprise four common query processing steps: (1) detection of structure and content constraints; (2) mapping of structural constraints onto corresponding XML tags; (3) deriving of NEXI-conform content requirements; and (4) NEXI query formulation. Three different approaches were tested in the natural language query track at the INEX workshop in 2005~\cite{tannier06translating}. An approach proposed by Hassler is based on template matching of words and part-of-speech tags. Therefore, patterns are used to extract two kinds of rules handling the structure (s_rule) and the content (t_rule) (see Figure~\ref{fig:nlq_hassler}).

\begin{figure}[h]
\centering
\includegraphics[width=0.7\textwidth]{03_xml_retrieval/figures/nlq_output_hassler}\\
\caption[Query analysis with template matching]{Query analysis with template matching~\cite{tannier06translating}}
\label{fig:nlq_hassler}
\end{figure}

Woodley and Geva suggest a shallow syntactic analysis before applying similar template matching. Tannier includes deep syntactic analysis complemented by semantic rules concerning the structure and content. Their experiments showed promising results, even outperforming the performance of a baseline system.

 

Result Presentation and Browsing

In contrast to traditional retrieval results (i.e., a ranked list of documents), the returned items in structured document retrieval (i.e., a ranked list of document components) are no longer independent of each other. The result set may include:

  • elements of varying granularity (e.g., whole documents, sections, single paragraphs or figures)
  • multiple disjunct elements of the same document
  • overlapping elements that are in structural relationships (ancestor - descendant of the same document)

Of course, this puts special requirements on the visualization of these results. A structured document retrieval system should foremost retrieve the most specific component of a document (its best entry point) answering a given query~\cite{chiaramella96model}. Starting from this element, other relevant elements in the document are explored by browsing.

In~\cite{grossjohann02interface,grossjohann02visualization}, Gro{\ss}johann et al. propose a user interface for both, query formulation and result presentation. The query formulation interface called HyGate~\cite{fuhr02hyrex1,fuhr04xirql} aims at constructing queries in the XIRQL query language in a query-by-example way. The result presentation interface displays the retrieved document components to the user, reflecting relationships among elements stemming from the same document. The resulting components are depicted in a 2D graphic called TreeMap, where each element is displayed as a rectangle. Child components are drawn as nested within their ancestors rectangle (see Figure~\ref{fig:xml_result_treemap}). Brightness (white means not relevant, dark means highly relevant) is used to reflect a components achieved score. The concept of TreeMaps is further advanced to Partial TreeMaps (see Figure~\ref{fig:xml_result_treemap_2}), where elements that are not included in the result set are removed for better readability.

\begin{figure}[h]
\centering
\includegraphics[width=0.5\textwidth]{03_xml_retrieval/figures/xml_result_treemap}
\caption[XML tree and its TreeMap]{XML tree and its TreeMap~\cite{grossjohann02interface}}
\label{fig:xml_result_treemap}
\end{figure}
%
\begin{figure}[h]
\centering
\includegraphics[width=0.7\textwidth]{03_xml_retrieval/figures/xml_result_treemaps_2}
\caption[Result presentation with Partial TreeMaps]{Result presentation with Partial TreeMaps~\cite{fuhr03interface}}
\label{fig:xml_result_treemap_2}
\end{figure}

This approach is extended by Fuhr et al.~\cite{fuhr03interface} to cover two additional aspects: (a) structural relationships among result components (from the same document), and (b) size of the result components. TextBars (see Figure~\ref{fig:xml_result_textbar}), as these graphical representations are called, visualize a document as a bar that is segmented according to its underlying structure. Above the bar the title of the document is given. The length of the bar reflects the size of the document (number of words). According to the size of an XML component, red separators indicate the borders of different components (index nodes). Segments are colored using different shade factors reflecting their relevance to the query, where white means not relevant at all and black means totally relevant.
%
\begin{figure}[h]
\centering
\includegraphics[width=0.7\textwidth]{03_xml_retrieval/figures/xml_result_textbars_2}\\
\caption[Result presentation with TextBars]{Result presentation with TextBars~\cite{fuhr03interface}}
\label{fig:xml_result_textbar}
\end{figure}

 

Retrieval Evaluation

A key issue in information retrieval, and thus in structured document retrieval, is the evaluation of the retrieval performance. A solid evaluation of any IR-related system requires (1) a predefined document test collection (corpus), (2) a set of tasks a system has to perform (queries, topics), and (3) precisely defined evaluation metrics reflecting the retrieval quality. These preconditions enable system developers to determine the performance of a system and create the basis for an impartial comparison of heterogenous systems dedicated to the same tasks.

Currently much research is done in the field of structured document retrieval evaluation. An initiative that is devoted solely to this topic is the INitiative for the Evaluation of XML retrieval (INEX)~\cite{INEX:02,INEX:03,INEX:04,INEX:05,INEX:06,INEX:07}, organized by the European DELOS Network of Excellence for Digital Libraries.

One of the main tasks of INEX is to provide a framework for querying and retrieving XML document components not only via content but also via structural constraints. It includes

  • a large collection of real-world XML documents,
  • a set of user queries called topics,
  • relevance-assessed results of experts for each topic, and
  • evaluation measures.

Every year the INEX workshop held in Dagstuhl Castle in Germany brings together experts from around the world, which are discussing aspects of retrieval methodology and models, evaluation measures and strategies, query languages, and future research directions. Each participating research group evaluates their system and reports the achieved results at the workshop. The workshop is structured into several tracks. The last workshop held in December 2007 embraced the following tracks~\cite{INEX:07}:

  • Ad Hoc Retrieval Standard INEX track evaluating the retrieval performance of structured document retrieval systems.
  • Book Search Investigates book-specific relevance ranking, user interfaces and behavior, special issues such as book indexes, and linkage to external sources such as metadata catalogue information.
  • Document Mining Tests machine learning methods for structured documents and evaluates their performance, focusing on XML classification and XML clustering.
  • Entity Ranking Compares and evaluates techniques for returning ranked XML components.
  • Heterogenous Collections Treats interoperability issues of documents from different sources (different structure, different tags, different coding).
  • Link-The-Wiki Analyzes the approaches to automatic link discovery.
  • Multimedia track Focuses on the retrieval of multimedia components.

INEX was first organized in 2002 and involved around 50 participating groups from around the world. In the beginning INEX classified the approaches into three categories~\cite{goevert02inex}: IR model-oriented, DB-oriented, and XML-specific. One year later the categorization was changed into model-oriented and system-oriented approaches~\cite{fuhr03inex}. Table~\ref{tab:inex} briefly summarized the historical development of INEX reflecting its increasing approval and importance throughout the scientific community.

\begin{table}
\centering
\sffamily \small
\rowcolors{1}{tablerowcolorodd}{tablerowcoloreven}
\caption{Development of INEX}
\label{tab:inex}
\begin{tabular}{BccL{1.9cm}L{7cm}}
\rowcolor{tableheadcolor} & \color{white} Groups & \color{white} Papers & \color{white} Collection & \color{white} Tracks \tabularnewline%
% year part.groups pap. collection tracks
INEX 2002 & 49 & 23 & 12107 docs \\ 494 MB \\ 60 topics & Ad Hoc Retrieval \tabularnewline%
INEX 2003 & 46 & 28 & 12107 docs \\ 494 MB \\ 60 topics & Ad Hoc Retrieval,
Interactive,
Heterogeneous Collection,
Relevance Feedback,
Natural Language\tabularnewline%
INEX 2004 & 59 & 34 & 12107 docs \\ 494 MB \\ 60 topics & Ad Hoc Retrieval,
Interactive,
Heterogenous Collection,
Relevance Feedback,
Natural Language\tabularnewline%
INEX 2005 & 64 & 59 & 16819 docs \\ 764 MB \\ 87 topics & Ad Hoc Retrieval,
Interactive,
Heteterogenous Collection,
Relevance Feedback,
Natural Language,
Document mining,
Multimedia\tabularnewline%

INEX 2006 & 85 & 57 & 659388 docs \\ 4,6 GB \\ 130 topics & Ad Hoc Retrieval,
Interactive,
Heterogeneous Collection,
Relevance Feedback,
Natural Language,
Document Mining,
Multimedia,
Use case studies,
XML Entity Ranking \tabularnewline%

INEX 2007 & 105 & 60 & 659388 docs \\ 4,6 GB \\ 130 topics & Ad Hoc Retrieval,
Heterogeneous Collection,
Document Mining,
Multimedia,
Entity Ranking,
Link the Wiki,
Book Search\tabularnewline

\end{tabular}
\end{table}

 

Corpus

In the year 2002 a test collection containing 12.107 documents was created by INEX. The documents stemmed from 12 magazines and 6 transactions of the IEEE Computer Society covering a period of 1995--2002, with a total size of 494 megabytes of raw XML files containing no pictures or multimedia content. 169 different XML tags were used for markup. For two years the collection stayed nearly the same, only with some work spent on the correction of inconsistent syntax and structure of the XML files. On average each document contained 1.532 XML elements with an average nested depth of 6,9. In 2005, the collection was extended by 4712 new articles (16.819 documents) from the period 2002--2004, reaching a total size of 764 megabytes. Also, the number of unique tags increased to 192.

Preliminary results of this work were evaluated and presented at the INEX workshop in 2005. In order to be comparable and to show improvements achieved, this work sticks to the INEX 2005 document collection throughout all experiments.

In 2006 INEX adopted the Wikipedia document collection which is freely available on the internet comprising 659.388 XML documents. The documents, in contrast to the previous collection, make heavy use of internal and external linkage. In total the collection size is about 4,6 gigabytes of plain XML files, where each document on average consists of 161,35 XML elements in a nested depth of 6,72. The documents are structured according to the Wikipedia template, making use of about 5.000 different tags.

 

Topics

Each year the INEX topics (queries) used for the currently organized workshop are created by the participating groups themselves. Each group hands in a set of potential topic candidates from which the most appropriate are selected for evaluation. An example INEX 2005 topic is given in Listing 2.3. It consists of an initial topic statement, a title, a short description, a long description explaining the idea and the expected result, and its corresponding NEXI query.

<inex_topic topic_id="231" query_type="CO+S" ct_no="98">
  <InitialTopicStatement>
    I'm interested in the applications of markov chains in graph theory.
  </InitialTopicStatement>
  <title>
    markov chains in graph related algorithms
  </title>
  <description>
    Retrieve information about the use of markov chains in graph theory and in
    graphs-related algorithms.
  </description>
  <narrative>
    I have just finished my Msc. in mathematics, in the field of stochastic
    processes. My research was in a subject related to Markov chains. My aim is
    to find possible implementations of my knowledge in current research. I'm mainly
    interested in applications in graph theory, that is, algorithms related to
    graphs that use the theory of markov chains. I'm interested in at least a short
    specification of the nature of implementation (e.g. what is the exact theory
    used, and to which purpose), hence the relevant elements should be sections,
    paragraphs or even abstracts of documents, but in any case, should be part of
    the content of the document (as opposed to, say, vt, or bib).
  </narrative>
  <castitle>
    //(sec|p|abs)[about(.,+"markov chains" application "graph theory")]
  </castitle>
</inex_topic>
Listing 2.3: INEX topic example (2005)

INEX distinguishes two types of topics:

  • Content-Only (CO) queries that contain only constraints on the content. No structural information where to find the information nor any hint which elements should be returned is provided. The system has to search the collection and decides on its own which elements to return and how the ranking is done.
  • Content-And-Structure (CAS) queries allow to specify constraints on the content, the structure, or in most cases both. These queries require profound knowledge about the underlying document structure and are expressed in the NEXI language (<castitle> element in Listing 2.3).

Whereas the ranking for the CO queries is only a matter of content, CAS queries put much more effort on the ranking mechanism in order to combine multiple relevance values regarding the content and structure. For evaluation purpose CAS queries are interpreted in two ways: strict (SCAS), where only the specified target components are allowed to be returned; and vague (VCAS), where specified target components are treated more like preferences than strict conditions.

The optimal human answers are assessed manually by the participating groups. Each group evaluates two to three topics which takes about two weeks per topic. With the use of an online XML browser and markup tool, the whole collection is skimmed through for a topic and relevant answer elements are marked up and classified with respect to their exhaustivity and specificity. Results of participants assessing the same topic are cross checked to minimize inaccuracies. After finalizing the topics assigned, INEX grants access to all available topic assessments.

 

Metrics

In order to measure the performance of information retrieval systems one needs to define formal metrics. In traditional information retrieval, the standard metrics recall and precision are used~\cite{baeza-yates99ir}. In contrast, result sets in structured document retrieval consist of document components of varying granularity which may contain overlapping elements. Thus, these metrics are not appropriate to reflect the retrieval quality in all its facets~\cite{trotman04nexi1}.

INEX proposes a set of metrics for the evaluation of such systems to deal with that difficulty. Hence the retrieval task in INEX is defined as returning XML components that are most specific and exhaustive~\cite{goevert02inex}, these two aspects have to be defined more clearly~\cite{fuhr03inex}. Exhaustivity expresses the extent to which an element covers the topic, and specificity means how focused an element is on the topic. Both dimensions are rated using a four-degree scale: ($0$) not, ($1$) marginally, ($2$) fairly, and ($3$) highly. Out of the 16 possible combinations, 10 meaningful $ES$ tuple (Exhaustivity-Specificity) are taken~\cite{pal06survey}.

\begin{equation*}
ES = \{(0,0),(1,1),(1,2),(1,3),(2,1),(2,2),(2,3),(3,1),(3,2),(3,3)\} %\notag
\end{equation*}

In order to apply different evaluation metrics, the two relevance dimensions are mapped onto a single relevance scale by a quantization function $f_{quant}(e,s): ES \rightarrow [0,1]$. INEX 2002 used two variants of quantization functions~\cite{goevert02inex}, a strict one (Equation~\ref{eq:f_quant_strict}) and a generalized one (Equation~\ref{eq:f_quant_gen}). Later on, additional variants of $f_{gen}$ are introduced with more focus put on either of the two dimensions.

\begin{eqnarray}
f_{strict}(e,s) &=& \left\{
\begin{array}{l}
1, if (e,s) = (3,3)\\
%
0, otherwise
\end{array}
\right.
\label{eq:f_quant_strict}\\
%
f_{gen}(e,s) &=& \left\{
\begin{array}{l}
1, if (e,s) = (3,3)\\
%
0,75, if (e,s) \in \{(2,3),(3,2)\}\\
%
0,50, if (e,s) \in \{(1,3),(2,2),(2,1)\}\\
%
0,25, if (e,s) \in \{(1,1),(1,2)\}\\
%
0, if (e,s) = (0,0)
\end{array}
\right.
\label{eq:f_quant_gen}
\end{eqnarray}

Additional metrics proposed by and used at INEX (precall, XCG, nXCG, T2I, and PRUM) are described in detail in \cite{kazai05measure,pal06survey}.

 

Summary

Online document repositories and digital libraries increasingly rely on structured documents. Hence, structured document retrieval methods become more and more important. This chapter introduced the challenges of searching within structured documents and provided answers how upcoming difficulties can be handled. A brief overview of ongoing research in the area is given. The main tasks of information retrieval, namely indexing and retrieval, were discussed in detail. As the presentation of retrieval results poses several difficulties one does not face in traditional information retrieval, two graphical approaches, treemaps and TileBars, were presented. Most important to prove the success of a new structured document retrieval approach is an extensive evaluation and comparison to similar systems. To do this, this work reverts to INEX, which is dedicated to the evaluation of XML retrieval systems. To assure comparability INEX provides a document collection, topics (queries and their assessments), and evaluation metrics.