Structured Document Retrieval: An Overview
"Information is the currency of democracy." Thomas Jefferson This chapter introduces structured document retrieval and provides the theoretical background of this work. The focus is put on preconditions and difficulties of applying information retrieval on structured documents.
IntroductionInformation retrieval has become an important discipline in computer science. A vast amount of textual information and data is freely available to everyone, demanding for automatic methods that help to find and navigate through it. Additionally, information is no longer plain and unstructured. It has become a conglomerate of content, structure, and metadata. The main challenge now is to apply information retrieval techniques that fit this kind of information. The discipline that aims at resolving these difficulties is called structured document retrieval. The main idea of structured document retrieval is to exploit the structure together with the content of documents to improve retrieval performance. This allows users to additionally retrieve parts of documents, so-called document components, that better fit their information needs instead of whole documents (e.g., an entire book). Thus, the aim of structured document retrieval is to return
Thus, focused retrieval -- as structured document retrieval is often called -- is concerned with finding the best entry points in structured documents~\cite{kazai02focussed}. These entry points are serving as starting points for further search and browse activities by the user. The advantages of applying structured document retrieval are obvious: Large documents often mask relevant information because most of the entire document is irrelevant to a user query. Thus, parts of documents often provide better and more focused answers to queries regarding a certain topic. Moreover, precise queries can be formulated addressing not only the content, but also the structure and available metadata. Numerous studies have been conducted which highlight the improvement of retrieval effectiveness applying structured document retrieval~\cite{wilkinson94effective,kazai02focussed,kotsakis02structured}. Text documents can be regarded as structured in several ways: The simplest (implicit) structure is according to the linear order of words, sentences, and paragraphs. Second, a structure is given by the links a document contains (e.g., hyperlinks, cross-references, citations). Also, temporal or spatial relationships within multimedia documents define a certain structure. Anyway, the most obvious way to define a documents' structure is to utilize the logical structure inherent in the documents' content (e.g., chapter, section, paragraph). Structural elements are defined as labeled nodes (optionally containing content) that are organized in a hierarchy. In structured document retrieval, this latter logical structure is exploited to define the units used for indexing, storing, and retrieving. According to the logical structure, textual information can be distinguished into three different categories~\cite{baeza-yates99ir,pal06survey}:
Retrieval models combining information on textual content and the logical structure are called structured text retrieval models~\cite[pp. 62]{baeza-yates99ir}. When speaking of structured text documents, in the remainder of this thesis we rather mean semi-structured documents. It is widely accepted by the scientific community that these terms are used synonymously. The reason for this is that in contrast to strictly formatted database systems textual documents are either considered structured or unstructured at all. In this work the term structured text document is used, although it more correctly refers to semi-structured text document. While the logical structure provides documents with hierarchical levels of granularity, and hence more precision can be achieved by means of focused retrieval, it does, however, imply more requirements on the representation, storage, indexing, matching, retrieval, ranking, and graphical user interface. Because documents and queries are structured, standard retrieval models such as the vector space model (see Section~\ref{sec:vsm}) are not adequate anymore. Of course structured document retrieval involves the same tasks as traditional information retrieval. However, several aspects of information retrieval applied to flat documents have to be reconsidered and cannot be straight-forwardly applied on structured documents. Rather these methods have to be adapted to fit the structured document paradigm. In the past, structured document retrieval approaches focused on one of three issues~\cite{fuhr04xirql}:
Like in traditional information retrieval it is necessary to first identify and define the major tasks of structured document retrieval. These include the following aspects:
In the sequel this chapter discusses these aspects.
XML Document FormatThe XML document format~\cite{url_w3c_xml} has become an emerging standard for the representation of text documents. Simply being a text document, the content is marked up using self-defined tags which can be further enriched by metadata via attributes. The tags themselves must not overlap, providing a hierarchical structure of the document. From this point of view an XML document can be seen as a semi-structured text document that can be described as a tree of nodes. The root node corresponds to the first XML element. Each child node recursively starts a new branch of the tree. Generally, the content of the document is contained in the leaf nodes. However, XML allows nodes to contain both content and child nodes. These nodes are referred to as mixed content nodes. XML documents may contain any kind of contents in their nodes, even bytecode. It further allows to combine metadata and content information in a single document. Based on their application area, XML documents can be separated into two different categories~\cite{fuhr01xirql}:
The difference between data and document centric documents is not a sharp one. Data centric documents may involve an order among their child nodes, whereas document centric documents may also contain unordered metadata like the author's name or section titles. But commonly it is clear whether XML documents are used in a data or document centric way. This work focuses on processing document centric XML documents, where the order of elements is of importance. However, special types of nodes dedicated to store metadata are interpreted in a data centric point of view. Using XML for structuring documents yields several advantages. (1) XML allows to mirror domain and application specific document structures using a set of self-defined tags. (2) These tags can be further extended by metadata information in the form of self-defined attributes. (3) Moreover, XML supports internal and external linkage (XLink~\cite{url_w3c_xlink} and XPointer~\cite{url_w3c_xpointer}). (4) XML elements in the structure are addressed via their unique path (XPath~\cite{url_w3c_xpath}) starting at the root element of the document. The definition of the structure is located in a DTD~\cite{url_w3c_dtd} or in a newer XML schema file~\cite{url_w3c_schema}, to which XML document instances (the actual XML documents) are referring. Based on the structural specifications, XML instances can be checked automatically whether they are valid or not according to the schema given. In contrast to DTD, XML schemata also support the definition of various data types and formats to restrict the content within XML document structures. Listing 2.1 shows a simple XML document and its representation as an ordered tree (Figure 2.1), where the order among child nodes is of importance. In the first two lines of code the specification file for valid structures, here an XML Schema file named bookdef.xsd, is referenced.
Gray shaded nodes in the XML tree refer to the plain content of the document. All other nodes contain only structural information. One might note that only a single node (Section 2) contains both, content and structure. Such a node is called a mixed XML node. Anyway, the content of XML structured documents is generally restricted to the leaf nodes. Additional information in form of metadata comes with the title attributes in the chapter, section, and subsection nodes. In the example the elements author and title are not implemented as attributes of the book element, which would also be a possible solution. The XPath expression to address the content of the first subsection SubSec 2.1 is given by /book[1]/chapter[1]/section[2]/subsection[1]. Generally, there are no standards or even guidelines available that describe how documents are properly structured. Even the tagsets used for markup vary strongly from one domain to another. This simple example already indicates the difficulties one has to face when confronted with XML documents from heterogenous sources or different structure definitions (e.g., mixed nodes, attribute versus element, etc.). Exploiting the XML format for information retrieval purposes has several advantages~\cite{luk02survey}: (1) more precise search by providing additional information in the elements; (2) unified access to documents from heterogenous sources; (3) powerful search paradigm using structural, content, and metadata specification; (4) data and information interchange to share resources and to support cooperative search. Within the structured document retrieval paradigm, the structure provided by XML documents is consulted to define the units for indexing and retrieval. That is the reason for using XML retrieval as synonym for structured document retrieval or focused retrieval. In this context the goal of XML retrieval can be reformulated as retrieving document components (XML elements) of arbitrary granularity that are relevant to a certain user query instead of whole documents only.
Related WorkThis section outlines several aspects and work related to structured document retrieval. It covers historic developments and depicts currently probed approaches in this field.
Database and Information Retrieval PerspectiveThere are two main communities who have concentrated their efforts on the computation of semi-structured XML documents: the database community and the information retrieval community~\cite{luk00survey,pal06survey}. The database-oriented approach tries to tackle the problem of querying structured information in a database like manner. Most of the information on the internet exists in form of HTML documents and/or is stored in relational databases. Both information sources can be represented using XML (e.g., see~\cite{url_rdb}) as an intermediate document format. This opens the door to apply database-like search techniques which are further extended to support proximity search and ranking. Simply speaking, XML documents are plain text files. Thus, information retrieval approaches can be applied straightforward to search and retrieve information from these documents. As soon as XML turned out to be closely related to HTML, the information retrieval community focused their research on this area~\cite{pal06survey}. A first attempt was to simply ignore the additional markup, which led to lower retrieval performance~\cite{luk00survey}. More elaborated approaches include extra indices of the structure, support regular expression matching of XML paths and contents, and affiliate basic semantics with certain structural elements. However, these extensions were not as straightforward as assumed, leaving much space for further research.
Fragment and Passage RetrievalEarly work on returning parts of documents instead of full documents in response to user queries was done by John O'Connor~\cite{connor75passage,connor80passage}. The idea is to segment a text into so-called text passages and then to apply standard indexing and retrieval models. In this context a text passage is defined as a set of `consecutive words', a sequence of words in `reading order'. According to this approach, documents are considered as linear sequences of text fragments. Passages are mostly defined as non-overlapping and do not exactly match the underlying document structure. Later work by Mittendorf and Schäuble~\cite{mittendorf94document} focuses on query-independent text segmentation using Hidden Markov Models. Another work, although based on passages but closer related to the notion of logical structure, was proposed by Burkowski in \cite{burkowski92passage}. In this approach a document is represented using multiple lists of non-overlapping passages, where each list corresponds to a certain level in the hierarchical structure expressed through tags (e.g., headlines, sections, paragraphs, authors). Queries are formulated as retrieval command strings, implementing a text algebra based on a defined containment model. The final result set is ranked applying a statistical ranking strategy based on document length, term frequency, and inverse document frequency. Salton et al.~\cite{salton93approaches} probed several passage sizes which are evaluated using the SMART retrieval system. Similar experiments are conducted by Callan~\cite{callan94passagelevel} who uses paragraph-sized and window-sized passages in the INQUERY retrieval system. His evaluation results show that a combination of document- and passage-level outperforms the others. Further experiments done by Wilkinson~\cite{wilkinson94effective} confirm that combined results of whole documents and parts of documents enhance retrieval. In \cite{wilkinson94comparison} he and his colleague found that a page size of about 1000 bytes define useful fragments. Kaszkiel and Zobel~\cite{kaszkiel97passage} came to the same conclusion as Wilkinson, showing that a fixed length passage approach is efficient and robust. However, in subsequent experiments they found that passages of arbitrary length improve retrieval performance significantly. Based on these findings, classical passage retrieval methods assume a window of fixed length which is slid stepwise over the whole text of a document~\cite{chiaramella01sdr}. In each position the distribution of words is analyzed. Passage boundaries are found if the distribution of words within two subsequent window positions changes significantly. Having identified the passages, it is straightforward to apply standard information retrieval strategies on the corpus of text passages instead of the documents. However, passages do not exactly reflect the logical structure of documents intended by their authors. Thus, retrieval performance strongly depends on the applied segmentation of a text into passages. Also, splitting texts into passages predefines all possible units for indexing and retrieving in a static way. This is one of the reasons why the logical structure of documents is believed to improve the efficiency of retrieval systems~\cite{chiaramella01sdr}. Grossman and Frieder~\cite{grossman04ir} briefly review initial work in passage retrieval and discuss approaches as marker-based passages, dynamic passage partitioning, and merging passage-based similarity measures.
Retrieval Models
Language ModelsOgilvie and Callan~\cite{ogilvie03language} propose a tree-based generative language model approach for ranking document components. In their work they use probabilistic context free grammars to estimate the probability of parse trees for sentences within certain components. The probability of a specific parse tree, and thus its ranking, is computed as the product of the probabilities of all rules applied in creating this (sub)tree. For each leaf node in the document tree the language model is estimated directly from the text attached to the node. Therefore, three different ranking models, the Kullback-Leibler Divergence, the Generative Language Model, and the Maximum-Likelihood Estimate are evaluated. In order to represent inner nodes, linear interpolation is applied, where the parameters for combining language models of children to ancestor nodes are learned from a large data repository. These parameters conform to the augmentation factor described by Fuhr et al.~\cite{fuhr98dolores}. Smoothing, the re-estimation of the probabilities in a language model, improves the estimates by including knowledge based on sample data. Kamps et al.~\cite{kamps04length} suggest to support language-model based ranking strategies by priors, smoothing functions, and cut-off values. While priors allow non-content features such as the length of a document component to be included in the scoring mechanism, smoothing is used to increase the relevance score of small components containing only a small subset of the query terms. Without smoothing, larger components containing more query terms are favored. In order to avoid too small components to be retrieved they propose cut-off values for the minimal length of retrieved elements. Their experiments show that extreme length normalization is an important issue in XML retrieval as a means to improve retrieval performance.
Bayesian Models Piwowarski et al.~\cite{piwowarski03bayesian} use Bayesian networks
for the sake of XML retrieval. Complex queries are decomposed and
translated into elementary subqueries, where each subquery refers to
a single component.
Subqueries are implemented as Bayesian networks (directed acyclic
graphs). In the graph, nodes represent XML components (tag names)
that are interconnected via arcs representing relations between the
components. The joint probability of a set of $i$ components
$\{x_i\}$ is given by
The Vector Space Model Introduced by Salton in 1968, the vector space
model~\cite{salton68indexing,salton71smart,salton75vsm} provides a
solid framework that allows to compute a degree of similarity
between a document and a query. The model represents documents and
queries as term vectors with associated term weights. Term
weights~$w$ are calculated by combining term statistics (term
frequency $tf$) and corpus-based statistics (inverse document
frequency $idf$) using the
Equations~\ref{eq:vsm_tf}--\ref{eq:vsm_w}~\cite[pp.
27--30]{baeza-yates99ir}. Queries are represented and weighed in the same way. The similarity
of a document $d_j$ and a query $q$ is defined as the cosine measure
between the two weighed feature vectors (Equation~\ref{eq:vsm_sim}).
$t$ denoted the total number of terms.
Vector Space Model based Approaches Grabs and Schek~\cite{grabs02vsm,grabs03powerdb} propose a model
based on vector spaces generated on-the-fly for flexible XML
retrieval. The main idea is that content on different levels in a
document tree is considered to be of different importance with
respect to a query. Therefore, index nodes are identified that are
indexed and retrieved. More distant nodes are treated as less
important than nodes closer to the index nodes. Term weighing is
done using the traditional $tf \cdot idf$ formula, where the notion
of $idf$ is extended to inverse element frequency. For their
experiments at INEX 2002 Grabs and Schek distinguish single-category
retrieval, multi-category retrieval, and nested retrieval (which
they call flexible retrieval), reflecting the complexity of the
queries (single-category $\le$ multi-category $\le$ nested). Another approach based on the vector space model is proposed by Mass and Mandelbrod in 2004~\cite{mass04ranking}. In order to overcome the difficulties occurring in nested components, they create separate indices for components on the same level (e.g., all documents, all sections, all paragraphs). For each of these indices, formulae from the vector space model are used for weighing and for similarity computation. During query processing, all indices are searched, resulting in multiple result sets of varying granularity. As a final step, a document pivot factor is applied to combine and re-rank the retrieved result sets. In their experiments Mass and Mandelbrod show a 30\%--50\% improvement of mean average precision compared to previous retrieval results achieved in~\cite{mass03relevant}. Liu et al.~\cite{liu04indexing} also propose a $tf \cdot idf$ model for structured document retrieval. XML documents are represented using Ctrees, which is a compact tree representation for the whole document collection. Queries are expressed in the NEXI language (see Section~\ref{sec:sdr_rel_work_ql}) used at the INEX workshop in 2003 and represented as query trees. The content of XML elements is represented as weighed feature vectors applying a combination of term frequency, inverse element frequency, label weights, and weights for query term modifiers. During retrieval, the query is decomposed into a set of retrieval components which are grouped according to their structural specificities. XML components matching a subquery are evaluated independently and relevance scores are assigned. The overall relevance of an XML component is obtained by merging all subquery relevance results to a single score. Finally, a ranked list of XML elements is returned to the user.
Retrieval UnitsHatano et al. \cite{hatano03unit} addresses the problem of
determining proper retrieval units of XML documents. By defining XML
elements as retrievable (meaningful) or not retrievable
(meaningless), the number of targeted portions of XML elements can
be reduced. Kazai et al. refers to meaningless contexts as stop
contexts \cite{kazai02scalable} (e.g., XML elements carrying only
layout information). This distinction leads to improvements during
indexing, matching, and retrieval.
Query LanguagesXIRQL~\cite{fuhr00xirql,fuhr01xirql,fuhr03interface,fuhr04xirql} is an XML query language based on XQL~\cite{url_w3c_xql} which is extended by information retrieval concepts for weighing, data types and vague predicates, relevance-oriented search, and semantic relativism. These functionalities are implemented via special XIRQL operators. Boolean operators enable the user to combine subqueries to more complex ones. The final XIRQL query is transformed into an underlying path algebra defined in~\cite{fuhr01xirql,fuhr04xirql}, which is then optimized and executed. Theobald and Weikum~\cite{theobald02indexbased} propose XXL (fleXible XML search Language) for querying structured documents. In XXL a query is internally represented as a directed labeled data graph, including a set of wildcard placeholders for path constraints and single element names. Besides the standard set of operators on strings and other simple data types, a special tilde operator $\sim$ is used to enable semantic similarity matching based on ontologies. A graphical interface named Visual XXL GUI helps formulating complex user queries. Proposed by O'Keefe and Trotman~\cite{okeefe03query}, NEXI (Narrowed Extended Xpath I)~\cite{trotman04nexi1} is developed as a query language used at INEX since 2004. Since XPath seems to be too complex for end-users to formulate their information needs~\cite{pal06survey}, NEXI implements only a subset of XPath to address certain XML elements (hence its name narrowed). Because XPath does not support information retrieval like matching, NEXI is extended by an about() predicate that enables similarity-based content matching. For example, the NEXI query /doc[about(., computer)]//sec[about(./par, apple)] searches all documents <doc> that contain the term computer and returns sections <sec> with paragraphs <par> that contain the term apple. The semantics of NEXI queries is not defined; instead, the retrieval engine has to deduce it, which is why NEXI differs from database-oriented query languages. Introducing NEXI at INEX significantly improved retrieval results: The error rate of experts formulating complex queries dropped from 63\% (in 2003) to 12\% (in 2004)~\cite{trotman04nexi2}. Geva et al.~\cite{geva06xor} introduced XOR, the XML Oriented Retrieval language, which is intended to replace NEXI as query language in INEX. XOR is based on the previous NEXI syntax and therefore backwards compatible. Extra functionalities are added to support advanced information retrieval tasks. These extensions include a negation operator for content matching (-about(...)), logical operators for query combination (AND, OR, NOT, ANDNOT, SUPPORT), path matching extensions (strict, vague), term extensions (e.g., part-of-speech tags, case of letters), logical operator qualifiers (strict, vague), and additional predicates (e.g., LinkTo(), LinkFrom(), Contains(), lt(), eq(), gt()). Based on these definitions, an open-source parser for the XOR language was developed. The parser is able to translate XOR queries into reverse polish notation supporting the implementation of back-end processors. However, XOR is not supposed to be used as an end-user query language. It is meant to be an underlying language model supported by natural language query interfaces and query generators. Reported results of four different baseline systems using XOR instead of NEXI show significant retrieval improvements.
PerformanceFuhr and Gövert~\cite{fuhr02index,fuhr02compression} address an issue regarding the performance of structured document retrieval systems. They suggest index compression to scale down storage requirements for indexing XML documents. In their work, inverted files with compressed paths are used to speed up database access. Further, a new data structure called XS tree is proposed, which stores structural information of a document in a very compact form.
Index ObjectsThe first step of searching structured documents is to define the way documents are indexed. This index provides the basis for applying weighing and ranking formulae, which allow similarity-based matching of XML components to queries. Generally, there are two possible options: the development of a completely new weighing scheme; or the adaptation or generalization of existing ones (e.g., language models, Bayesian models, vector space model). According to Fuhr and Großjohann~\cite{fuhr01xirql}, the long experience with existing weighing and ranking strategies in traditional information retrieval favors the second option. Thus, the atomic units or index objects of a document must be identified. According to Fuhr and Großjohann~\cite{fuhr01xirql} this results in two benefits: first, traditional information retrieval models can be applied for processing these elements; and second, only index objects are returned to the user. Starting with the assumption that textual contents are restricted mainly to the leaf nodes of the document tree, these leafs are obvious candidates for atomic units. However, their granularity might be to fine grained as retrieval result~\cite{fuhr01xirql}. Just assume that a query searching for M. Hassler returns only the component <author>M. Hassler</author>, not including any additional information about the document he wrote. Without any context, this turns the result completely useless. Hence, Fuhr and Großjohann propose to rely on a hand-crafted definition of disjunct index nodes (also called index objects or contexts). These index nodes are identified either by (1) analyzing the underlying DTD or XML schema, or alternatively (2) according to the type of content (e.g., chapter elements). The root of a document itself is defined as the uppermost index node. The contents of all remaining non-index nodes are indexed at their nearest ancestor index node. Figure~\ref{fig:indexing_hassler} shows an example document and its corresponding index nodes (dashed boxes). In order to get the complete content of an arbitrary index node, the contents of all descendant index nodes are propagated upwards (arrows) and are combined with the content of the current index node itself. \begin{figure}[h] An automatic identification of appropriate index (and retrieval) nodes is proposed by Hatano and his colleagues~\cite{hatano03unit}. The idea is to select only nodes that have sibling nodes at the same level in the hierarchy. These nodes are characterized by the same path starting at the root node. Figure~\ref{fig:indexing_hatano} depicts the identified indexing paths /book/chapter, /book/chapter/section, and /book/chapter/section/subsection of the previous example. \begin{figure}[h] Again, the root node is defined as prime index node. In order to avoid unidentified index nodes in document instances where only a single element is implemented (e.g., only a single chapter within a book), the identification can be shifted to an analysis of the DTD or XML schema instead. However, in cases where a definition of the structure is missing this option might not be available. Thus, multiple chapters in documents are indexed whereas a single chapter is not. In the above example one can see that applying this approach misses at least the <abstract> component as an index object. If other nodes similar to <abstract> occur frequently within the document domain, an automatic approach seems not to be the best solution. The reason for indexing only a selection of nodes results in two main advantages:
Content Representation and WeighingIn order to represent structured documents for retrieval purposes, many approaches have been proposed~\cite{fuhr01xirql,kazai02focussed,carmel01juru,kotsakis02structured}. Good surveys are given by Luk et al.~\cite{luk02survey}, Pal~\cite{pal06survey}, and Pinel-Sauvagnat and Boughanem~\cite{pinel07survey}. The representation is always guided by the efficiency of processing queries and retrieving relevant parts of documents. While many structured retrieval systems rely on persistent XML databases, the representation varies from one system to another. Furthermore, mechanisms for indexing and ranking document components during retrieval are different. Approaches proposed by the information retrieval community are similar to those investigated by the database community to some degree. However, while the latter class of approaches aims at dealing with the boolean model, the former is concerned with more advanced models that allow to handle both structure and content of documents. From this point of view it is reasonable to combine both paradigms, that is fast database processing and sophisticated similarity computation to efficiently and accurately compute retrieval results. Having identified the index objects described in the previous section, methods borrowed from flat document retrieval can be applied to represent and weigh the content of these nodes. Traditionally, content-based retrieval systems rely either on the boolean model or on the vector space model~\cite{salton68indexing,salton71smart} to represent the content as a bag of words. Extensions of these models have been proposed, e.g., the fuzzy boolean model, knowledge-aware models, or the extended vector space model~\cite{baeza-yates99ir}. However, all of these indexing models ignore the logical organization of texts expressed by the documents' structure. In order to apply existing models to structured document retrieval the content of components is represented and stored using flat indexing. Beyond that mechanisms dealing with the hierarchical relationships among document components have to be applied. For illustration purpose this section relies on the vector space model (described in Subsection~\ref{sec:vsm}) to represent the content as a weighed feature term vector. The actual weight of a term is calculated by combining term statistics and corpus-based statistics (Equations~\ref{eq:vsm_w} -- \ref{eq:vsm_idf}). The challenge that arises at this point is to aggregate multiple disjoint feature vectors of descendant index objects to a single feature vector of their parent index object. For this purpose two different solutions are feasible~\cite{abolhassani04hyrex}:
In this work the first approach, namely the propagation of term statistics, is followed. This decision is mainly taken for one reason: The content representation of the root node exactly matches the traditional flat information retrieval approach. Due to this correspondence it is possible to test and compare the performance of structured document retrieval to traditional information retrieval engines on the document level. Whichever propagation is used, one has to be aware of its most important implication: Due to the aggregation of information from descendant nodes to ancestor nodes, a relevant child component implies relevance of all its ancestor components -- at least to a certain degree. The calculation of XML component representations carried out in this work can be summarized as a four-step process:
Querying and RankingRepresenting index objects as described in the previous section is a basic requirement of querying and subsequent ranking of document components. The concept of searching within structured documents is based on the comparison of each components' representation with a query that is represented in the same way. In contrast to flat documents, each query is compared multiple times (once for each component) to a single structured document. Hence, scalability depends on efficient algorithms to manage these vast amount of comparisons. Sticking with the vector space model (see Subsection~\ref{sec:vsm}), the similarity of a document component and a query representation is defined as the cosine measure between the two weighed feature vectors (Equation~\ref{eq:vsm_sim}). However, computing the similarity of single XML component contents stored in the database independently is not sufficient. It is important to include the contents of descendant nodes as well. To illustrate this issue consider the example in Figure~\ref{fig:querying_tree_1}. The same document with its aggregated representations indicated is given in Figure~\ref{fig:querying_tree_2}. For illustration purpose the term weights are passed upwards without being changed. Note that this refers to maximal weights at the ancestors, hence propagation is meant only to reduce propagated weights. \begin{figure}[h] If someone wants to know something about pets, this document (Figure~\ref{fig:querying_tree_2}) is queried for instance with the weighed terms 0,90 cats and 0,80 dogs. Figure~\ref{fig:querying_tree_3} gives the similarities (cf. Equation~\ref{eq:vsm_sim}) of all document components and this query, showing an interesting result: Both sections are relevant to the query reaching a similarity of 74,7\% and 66,4\%. Chapter 2 is less relevant than both sections (65,5\% similarity). From a focused retrieval point of view this comes quite unexpected, because the second chapter is the smallest component that fully covers the query. Thus, it is expected to be the optimal (top-ranked) result to this query. The whole book itself is relevant to a degree of 44,9\%, which again meets a users' expectation. From this example it seems to be necessary to include further mechanisms that reflect structural relationships more accuratly. In structured document retrieval every index object is a potential query result item. Hence relevant descendant components imply that ancestor components are also relevant to the same query (even to a less degree), overlap of query results is unavoidable. This ability puts additional requirements on the retrieval mechanism, which has not only to return relevant components but also components of the correct granularity~\cite{kamps03retrieve}. Neither should the components be too small nor should they be too large. From this point of view, the highest rank should be assigned to the second chapter in Figure~\ref{fig:querying_tree}. One possible solution is to redesign the propagation strategy. However, this may lead to weighing inconsistencies at the representation level. Another approach combines the computed similarity values of ancestors and descendants directly. The final score, often called Retrieval Status Value (RSV), has to take account of these issues. As stated above the retrieved units in structured document retrieval are not whole documents but document components. This leads to difficulties in query matching, because a query is no longer matched against one (large) document feature vector. Now it is compared to a lot of (respectively small) leaf representations, as well as to many inner nodes with accumulated -- though not necessarily small -- feature vectors. Using a term space that consists of all document terms, smaller portions of text consequently lead to sparser representations. This sparseness of feature vectors has to be considered, because single term matching results in high relevance values. Generally, natural language processing techniques have not significantly improved the performance of flat document retrieval~\cite[pp. 227]{cole97survey}. However, in the domain of structured documents dealing with small portions of text retrieval quality is believed to benefit from deeper linguistic analysis.
Query LanguageIn the past several languages for querying XML documents have been proposed (XPath~\cite{url_w3c_xpath}, XQL~\cite{url_w3c_xql}, XQuery~\cite{url_w3c_xquery}, XML-QL~\cite{url_w3c_xmlql}). An overview of the features of five of these languages (LOREL, XML-QL, XML-GL, XSL, and XQL) is provided by Bonifati and Ceri in~\cite{bonifati00query}. Two other surveys are given by Deutsch et al.~\cite{deutsch99querying} and Luk et al.~\cite{luk02survey}. However, these languages are still lacking some important features. Missing concepts range from ignoring data types and absence of term weighing functionality to unsupported similarity matching~\cite{fuhr00xirql}. Also, the complexity of these languages is too high and the syntax is too complicated for users to express their information needs~\cite{woodley06nlp}. Anyway, these languages provide basic building blocks and good starting points for the development of more appropriate information retrieval query languages. The requirements put on a query language can be summarized as~\cite{fuhr03interface,fuhr04xirql}:
Based on these requirements more appropriate higher-level query languages like XIRQL, XXL, NEXI, or XOR (see Section~\ref{sec:sdr_rel_work_ql}), to name just some of them, have been developed. Intended to bridge the gap between pure structure-oriented and content-based retrieval, all of these languages have their own syntax and semantic, but in general, their basic functionalities remain the same. However, only NEXI and XOR are supposed to be interpreted by a retrieval engine only. Listing 2.2 presents an example query in XOR.
In the listing, the XOR query is composed of two main subqueries (line 1 and 3). The first subquery retrieves subsection elements containing These days which occur in books that treat XML but not XSLT. The second subquery retrieves book elements where the author is exactly M. Hassler or where any element is about Retrieval. Finally, the two subqueries are combined via an AND condition. The order of AND-combined subqueries is implicitly defined: The subquery addressing components of larger granularity (line 2 returns book elements) functions as filter for the subquery addressing smaller components (line 1 returns subsections). As a result, a set of subsection elements within book elements is returned to the user. This example shows that formulating such queries is definitely not an easy task and even error-prone for expert users. Hence, this language is not intended to be applied directly by users. It is supposed to be supported by query generators offering an (interactive) user interface to formulate queries that are both, syntactically and semantically correct. Novel approaches for querying XML documents are presented by Woodley et al.~\cite{woodley06nlp}. In their work the goal is to transform natural language queries into the NEXI query language automatically. Although different, the approaches described comprise four common query processing steps: (1) detection of structure and content constraints; (2) mapping of structural constraints onto corresponding XML tags; (3) deriving of NEXI-conform content requirements; and (4) NEXI query formulation. Three different approaches were tested in the natural language query track at the INEX workshop in 2005~\cite{tannier06translating}. An approach proposed by Hassler is based on template matching of words and part-of-speech tags. Therefore, patterns are used to extract two kinds of rules handling the structure (s_rule) and the content (t_rule) (see Figure~\ref{fig:nlq_hassler}). \begin{figure}[h] Woodley and Geva suggest a shallow syntactic analysis before applying similar template matching. Tannier includes deep syntactic analysis complemented by semantic rules concerning the structure and content. Their experiments showed promising results, even outperforming the performance of a baseline system.
Result Presentation and BrowsingIn contrast to traditional retrieval results (i.e., a ranked list of documents), the returned items in structured document retrieval (i.e., a ranked list of document components) are no longer independent of each other. The result set may include:
Of course, this puts special requirements on the visualization of these results. A structured document retrieval system should foremost retrieve the most specific component of a document (its best entry point) answering a given query~\cite{chiaramella96model}. Starting from this element, other relevant elements in the document are explored by browsing. In~\cite{grossjohann02interface,grossjohann02visualization},
Gro{\ss}johann et al. propose a user interface for both, query
formulation and result presentation. The query formulation interface
called HyGate~\cite{fuhr02hyrex1,fuhr04xirql} aims at constructing
queries in the XIRQL query language in a query-by-example way. The
result presentation interface displays the retrieved document
components to the user, reflecting relationships among elements
stemming from the same document. The resulting components are
depicted in a 2D graphic called TreeMap, where each element is
displayed as a rectangle. Child components are drawn as nested
within their ancestors rectangle (see
Figure~\ref{fig:xml_result_treemap}). Brightness (white means not
relevant, dark means highly relevant) is used to reflect a
components achieved score. The concept of TreeMaps is further
advanced to Partial TreeMaps (see
Figure~\ref{fig:xml_result_treemap_2}), where elements that are not
included in the result set are removed for better readability. This approach is extended by Fuhr et al.~\cite{fuhr03interface} to
cover two additional aspects: (a) structural relationships among
result components (from the same document), and (b) size of the
result components. TextBars (see Figure~\ref{fig:xml_result_textbar}), as these
graphical representations are called, visualize a document as a bar
that is segmented according to its underlying structure. Above the
bar the title of the document is given. The length of the bar
reflects the size of the document (number of words). According to
the size of an XML component, red separators indicate the borders of
different components (index nodes). Segments are colored using
different shade factors reflecting their relevance to the query,
where white means not relevant at all and black means totally
relevant.
Retrieval EvaluationA key issue in information retrieval, and thus in structured document retrieval, is the evaluation of the retrieval performance. A solid evaluation of any IR-related system requires (1) a predefined document test collection (corpus), (2) a set of tasks a system has to perform (queries, topics), and (3) precisely defined evaluation metrics reflecting the retrieval quality. These preconditions enable system developers to determine the performance of a system and create the basis for an impartial comparison of heterogenous systems dedicated to the same tasks. Currently much research is done in the field of structured document retrieval evaluation. An initiative that is devoted solely to this topic is the INitiative for the Evaluation of XML retrieval (INEX)~\cite{INEX:02,INEX:03,INEX:04,INEX:05,INEX:06,INEX:07}, organized by the European DELOS Network of Excellence for Digital Libraries. One of the main tasks of INEX is to provide a framework for querying and retrieving XML document components not only via content but also via structural constraints. It includes
Every year the INEX workshop held in Dagstuhl Castle in Germany brings together experts from around the world, which are discussing aspects of retrieval methodology and models, evaluation measures and strategies, query languages, and future research directions. Each participating research group evaluates their system and reports the achieved results at the workshop. The workshop is structured into several tracks. The last workshop held in December 2007 embraced the following tracks~\cite{INEX:07}:
INEX was first organized in 2002 and involved around 50 participating groups from around the world. In the beginning INEX classified the approaches into three categories~\cite{goevert02inex}: IR model-oriented, DB-oriented, and XML-specific. One year later the categorization was changed into model-oriented and system-oriented approaches~\cite{fuhr03inex}. Table~\ref{tab:inex} briefly summarized the historical development of INEX reflecting its increasing approval and importance throughout the scientific community. \begin{table} INEX 2006 & 85 & 57 & 659388 docs \\ 4,6 GB \\ 130 topics & Ad Hoc Retrieval, INEX 2007 & 105 & 60 & 659388 docs \\ 4,6 GB \\ 130 topics & Ad Hoc Retrieval, \end{tabular}
CorpusIn the year 2002 a test collection containing 12.107 documents was created by INEX. The documents stemmed from 12 magazines and 6 transactions of the IEEE Computer Society covering a period of 1995--2002, with a total size of 494 megabytes of raw XML files containing no pictures or multimedia content. 169 different XML tags were used for markup. For two years the collection stayed nearly the same, only with some work spent on the correction of inconsistent syntax and structure of the XML files. On average each document contained 1.532 XML elements with an average nested depth of 6,9. In 2005, the collection was extended by 4712 new articles (16.819 documents) from the period 2002--2004, reaching a total size of 764 megabytes. Also, the number of unique tags increased to 192. Preliminary results of this work were evaluated and presented at the INEX workshop in 2005. In order to be comparable and to show improvements achieved, this work sticks to the INEX 2005 document collection throughout all experiments. In 2006 INEX adopted the Wikipedia document collection which is freely available on the internet comprising 659.388 XML documents. The documents, in contrast to the previous collection, make heavy use of internal and external linkage. In total the collection size is about 4,6 gigabytes of plain XML files, where each document on average consists of 161,35 XML elements in a nested depth of 6,72. The documents are structured according to the Wikipedia template, making use of about 5.000 different tags.
TopicsEach year the INEX topics (queries) used for the currently organized workshop are created by the participating groups themselves. Each group hands in a set of potential topic candidates from which the most appropriate are selected for evaluation. An example INEX 2005 topic is given in Listing 2.3. It consists of an initial topic statement, a title, a short description, a long description explaining the idea and the expected result, and its corresponding NEXI query.
INEX distinguishes two types of topics:
Whereas the ranking for the CO queries is only a matter of content, CAS queries put much more effort on the ranking mechanism in order to combine multiple relevance values regarding the content and structure. For evaluation purpose CAS queries are interpreted in two ways: strict (SCAS), where only the specified target components are allowed to be returned; and vague (VCAS), where specified target components are treated more like preferences than strict conditions. The optimal human answers are assessed manually by the participating groups. Each group evaluates two to three topics which takes about two weeks per topic. With the use of an online XML browser and markup tool, the whole collection is skimmed through for a topic and relevant answer elements are marked up and classified with respect to their exhaustivity and specificity. Results of participants assessing the same topic are cross checked to minimize inaccuracies. After finalizing the topics assigned, INEX grants access to all available topic assessments.
MetricsIn order to measure the performance of information retrieval systems one needs to define formal metrics. In traditional information retrieval, the standard metrics recall and precision are used~\cite{baeza-yates99ir}. In contrast, result sets in structured document retrieval consist of document components of varying granularity which may contain overlapping elements. Thus, these metrics are not appropriate to reflect the retrieval quality in all its facets~\cite{trotman04nexi1}. INEX proposes a set of metrics for the evaluation of such systems to deal with that difficulty. Hence the retrieval task in INEX is defined as returning XML components that are most specific and exhaustive~\cite{goevert02inex}, these two aspects have to be defined more clearly~\cite{fuhr03inex}. Exhaustivity expresses the extent to which an element covers the topic, and specificity means how focused an element is on the topic. Both dimensions are rated using a four-degree scale: ($0$) not, ($1$) marginally, ($2$) fairly, and ($3$) highly. Out of the 16 possible combinations, 10 meaningful $ES$ tuple (Exhaustivity-Specificity) are taken~\cite{pal06survey}. \begin{equation*} In order to apply different evaluation metrics, the two relevance dimensions are mapped onto a single relevance scale by a quantization function $f_{quant}(e,s): ES \rightarrow [0,1]$. INEX 2002 used two variants of quantization functions~\cite{goevert02inex}, a strict one (Equation~\ref{eq:f_quant_strict}) and a generalized one (Equation~\ref{eq:f_quant_gen}). Later on, additional variants of $f_{gen}$ are introduced with more focus put on either of the two dimensions. \begin{eqnarray} Additional metrics proposed by and used at INEX (precall, XCG, nXCG, T2I, and PRUM) are described in detail in \cite{kazai05measure,pal06survey}.
SummaryOnline document repositories and digital libraries increasingly rely on structured documents. Hence, structured document retrieval methods become more and more important. This chapter introduced the challenges of searching within structured documents and provided answers how upcoming difficulties can be handled. A brief overview of ongoing research in the area is given. The main tasks of information retrieval, namely indexing and retrieval, were discussed in detail. As the presentation of retrieval results poses several difficulties one does not face in traditional information retrieval, two graphical approaches, treemaps and TileBars, were presented. Most important to prove the success of a new structured document retrieval approach is an extensive evaluation and comparison to similar systems. To do this, this work reverts to INEX, which is dedicated to the evaluation of XML retrieval systems. To assure comparability INEX provides a document collection, topics (queries and their assessments), and evaluation metrics. |



