Introduction
| "Science must begin with myths, and with the criticism of myths." Karl Popper Over the past several years the amount of available online documents grew rapidly. Repositories like Wikipedia or the IEEE digital library provide a huge source of information which is constantly increasing and continually changing. To manage these data loads, automatic mechanisms for searching and organizing are indispensable, demanding sophisticated methods to represent, store, organize, and retrieve information from these sources. Therefore, methods from the fields of Information Retrieval (IR)~\cite{vanrijsbergen79information,baeza-yates99ir} and document mining~\cite{hand01datamining,larose04datamining} have been adapted to meet these requirements. Traditional information retrieval is concerned with unstructured documents. From a structural point of view these systems operate on a `flat' document structure, neglecting any kind of internal document organization (e.g., chapters, sections) and content distinction (e.g., texts, figures, tables). The whole content of a document is treated as a single piece of information which is represented by a single object (i.e., weighed feature vector). As a result, the retrieval process triggered by a user query ends up with a ranked list of documents corresponding to the query terms they contain (see Figure 1.1).
However, often a user is more interested in a specific piece of information that is only part of a document. For instance, information needs may range from a set of sections of a book to a single figure with a certain caption. Retrieving the whole book at once does not satisfy the users interest because the coverage of the full document is usually too broad, leading to further focused search and browse within the document. The danger here is to miss documents containing satisfying information, because the overall similarity of (rather long) documents might be too low given a specific request for information. The following scenario sketches this problem: Assume there is a collection of 1.000.000 documents available. Within this set there is a short newspaper article (~ 100 words) dealing about the recently discovered black hole that is nearest to earth, entitled `Nearest Black Hole found!'. Another document is written by Stephen Hawking about phenomena in the universe called `Black Holes and Baby Universes and Other Essays'. The length of this document is about 50.000 words, where chapter eleven (~ 4.000 words) describes black holes in detail, containing figures and tables that explain the anatomy of black holes. In order to search the collection, a search engine (e.g., google) analyzes all documents and offers a keyword-based query interface. A user who is interested in the general nature and functioning of black holes might enter the search terms `black hole physics functioning'. As a result the system returns a list of documents. Within this result the newspaper article is probably ranked as one of the top ten matches, whereas Stephen Hawking's book is located at the very end of the list. Why would that be the case? Because the estimated one hundred words of the newspaper contains twice the terms `black hole', on average 2/100. In contrast, the terms `black hole' are mentioned two hundred times in the chapter of the book, on average 200/50.000 = 2/500. As a consequence, the book is ranked\footnote{For the sake of illustration relative term frequencies are used instead of applying any standard weighing formula.} much lower than the newspaper article although the query terms occur more frequently in the book. Would chapter eleven be treated separately as a unit of about 4.000 terms, the average occurrence of `black hole' in the chapter 200/4.000 = 2/40 exceeds that of the newspaper article by far. Thus, the book chapter itself would be ranked higher than the newspaper article. Structured Document Retrieval (SDR) suggests a solution to this problem by including structural information inherent in documents to improve the results of search engines. The goal is to retrieve not only complete documents, but also parts of documents that are relevant to the given user query. These parts are explicitly defined by the logical structure of the document (e.g., document, chapter, section, paragraph, figure). Thus, the content of relevant parts irrespectively of their granularity is returned. In this context, the idea is to adapt and apply mechanisms from traditional information retrieval along with extra source of knowledge mainly of structural nature. Such additional structural information assists the retrieval process in three ways:
In many cases structural information is already included in documents and can be (semi-) automatically extracted from these sources: For instance, HTML~\cite{url_w3c_html} webpages contain tags like <h1>, <h2>, <p>, <img>. LaTeX~\cite{url_latex} sources contain markup such as section, subsection, figure, table. MS-Word~\cite{url_word} and PDF~\cite{url_pdf} documents support heading1, paragraph, embedded image formatting. However, these formats mix up structure and layout information. To cope with that problem, the eXtensible Markup Language (XML)~\cite{url_w3c_xml} came into existence. XML offers an appropriate format to represent hierarchically organized knowledge within semi-structured text documents. XML schemata~\cite{url_w3c_schema} and the former Document Type Definition (DTD)~\cite{url_w3c_dtd} provide a stable framework for defining a valid structure of XML documents. This allows to establish a standard to uniformly access information emanating from different sources. Since most structured document retrieval systems operate on XML documents, structured document retrieval and XML retrieval are used synonymously. Whereas classical information retrieval treats documents as atomic units, XML suggests a tree-like view. The root node represents the entry point to the document. The content of the document is (mainly) located at the leaf nodes of the tree. Inner nodes are understood as intermediate nodes in between the root and the leaf nodes. For example, consider a <document> node (root node) with several <section> nodes (inner nodes), where each section consists of multiple <paragraph> nodes (leaf nodes). From this point of view an XML document can be seen as a structured set of XML components or XML elements, where each component is accessible via a unique path starting at the root node. Besides the information retrieval discipline, document mining becomes more and more an issue in processing large information repositories. Automatic grouping of similar documents or parts of documents is able to enhance the performance in both regards, computational complexity and retrieval quality. Therefore, document classification and document clustering mechanisms need to be adapted to fit the paradigm of hierarchically structured documents. This chapter sketches the main challenges of structured document retrieval and discusses various aspects pertaining to XML retrieval. In line of these challenges, issues of representation, storage, retrieval, and organization of structured documents are covered. A brief overview of the structure of this work concludes the chapter.
Problem Identification and MotivationsIn addition to traditional information retrieval requirements, SDR systems face several new challenges inherent in structured documents only. Due to the internet, a real flood of quite diverse document formats (e.g., HTML, XML, PDF) became widely accepted. Also, the ways of structuring documents themselves have been evolving independently. As a consequence, XML documents of different sources are highly heterogenous due to their multiform structure. Till now, no standards and not even guidelines are available that conveniently describe document structures. Often, documents come without any schema specification, so possible (sub)structures can therefore only be intuitively guessed. In order to process XML documents efficiently without being too restrictive, retrieval systems have to consider these structural specificities. Having solved the problem of heterogeneity, the kind of storage has to be decided. In this context not only the content but also the structure and metadata of XML documents have to be considered in indexing and searching. The element hierarchy has to be retained, supporting fast access to document parts as well as to sets of parts (i.e., document subtrees). Several proposals for storing structured documents have been made, ranging from relational databases to native object-oriented XML databases. However, for the sake of structured document retrieval all of these proposals have to be adjusted to find a balance between functionality and performance~\cite{khan01evaluation}. Representation of the content and structure of documents strongly relies on the storage mechanisms. Since XML implements a hierarchical structure, the content is mostly (but not necessarily) restricted to the leaf nodes. The content of an inner node is recursively derived from the contents of its descendant nodes. For the indexing procedure, the level of granularity is defined implicitly by the document structure itself. During retrieval, the granularity of the result elements stands for a challenge. For instance, if a section and a paragraph within this section are both relevant to a query, which of them should be returned to the user -- the section, the paragraph, or both? Yet, a user searching for information about an author may not be satisfied by returned components referring to the authors' name only. Thus, the target components may differ from the components returned. Clearly, ranking based on relevance of individual components has to take the inclusion relationship and possibly the explicit user's preference into account. One strategy is to return the so-called Best Entry Points (BEP)~\cite{kazai02focussed} serving as starting points for further document browsing activities. Related to the granularity aspect is the context of XML components. A paragraph might be relevant to a query because of its preceding and succeeding paragraphs only. The same paragraph in another context may be completely irrelevant. Thus, context, even being not included in the answer directly, has to be accounted for. Retrieval systems often operate in highly dynamic environments where documents are constantly added, removed, and altered during runtime. Therefore, the indexing process has to be fast and the calculated indices have to be kept stable. Corpus-based weighing techniques such as the inverse document frequency of the vector space model (described in Section~\ref{sec:vsm}) need adaptations to avoid constant re-computations of millions of representations each time a document changes. In order to retrieve XML components that are relevant to a query, proper representations of both, the query content and the components' contents, are essential. Much work on Natural Language Processing (NLP) has already been carried out in the field of traditional information retrieval to enhance the quality of textual representations. However, in the context of flat document retrieval NLP methods have shown only minor effects on the retrieval quality but have had a high impact on the (computational) complexity. With structured documents, the portions of textual content are generally much smaller (e.g., paragraphs). This fragmentation offers new possibilities to apply NLP techniques that improve, extend, and refine indexing and matching procedures of short text passages. Statistically motivated patterns (e.g., multi-terms such as composite nouns or named entities) can be effectively used to make content representations more meaningful and matching more precise. Text mining provides methods that extract valuable patterns from large amounts of text automatically. A major benefit of exploiting structured documents is that users are able to express very specific information needs. For that purpose, SDR systems use query languages that support formulations of complex queries containing combinations of structure and content constraints. Initially, existing query languages were extended to handle structural information. Then, new query languages dedicated to SDR were developed. Very often, users must express their information need explicitly. In addition to the query language, users need to know the underlying structure of the target documents. This may be easy for domain experts, but not for general users. Another difficulty is that most of these languages are quite complex, making it hard to apply them directly~\cite{tannier06translating}. Newer approaches rely on graphical interfaces that assist users to formulate their queries. The query is generated automatically by the interface avoiding incorrect syntax and semantic errors. Once a query is formulated, the system computes the set of relevant XML components. The retrieval process consists of matching, ranking, and filtering of document components. While in traditional information retrieval a query is compared to a document only once, structured document retrieval has to match the query with each component of a document. These multiple similarity computations per document require efficient mechanisms that reduce the number of comparisons and ensure fast system response. One solution is that structural constraints are applied as a filter for XML components. Generally, sequencing filters does reduce the data load considerably. Processing of retrieval tasks can be supported and improved by document mining techniques in two ways: First, document classification enables users to create groups of related XML components, which allows browsing and searching such groups. Basic requirement to do so is the definition of a non-trivial similarity function that compares two document trees. Second, document clustering identifies groups of similar documents called clusters without human intervention. Comparing a query to these clusters instead of single documents also reduces the number of similarity computations considerably. Of particular importance in structured document retrieval is the ranking of the results. It has to reflect multiple matches of the structure, content, and metadata of single components as well as (i.e., structural match, content match) structural relationships among the results. Based on the tree-like representation, an adequate presentation of the results is needed to prevent users from getting lost. Because multiple components even of single documents may be retrieved, browsing functionality of documents containing relevant information is indispensable. Graphical interfaces assist users with highlighting and focusing relevant components while explaining the underlying structure and context. All of the aspects mentioned above are strongly related to each other. For instance, the representation of documents and their storage are interdependent. Queries are formulated on the basis of the underlying document structure, which in turn defines the result presentation during browsing. Thus, structured document retrieval systems have to consider each of these aspects to perform well. Having identified the crucial points of structured document retrieval, the next section provides an overview of the approach taken in this work.
Overview of the ApproachThe focus of this thesis is on processing XML structured documents for information retrieval, where the main issues include document representation, storage, content representation, indexing, retrieval, classification, and clustering of XML components based on their content and structure. As mentioned earlier, the formulation of complex queries requires the competence of the users in the documents domain, structure, available metadata, and query language. Hence, the main target user group of SDR systems is considered to be domain experts (e.g., academic research, jurisprudence area). But even without deeper knowledge (e.g., queries without any structural constraints), the retrieval results are more useful than complete documents for two reasons. First, parts of documents usually provide more precise answers to specific questions than whole documents. Second, parts of documents are already focused by the search engine so there is no need for further search and browse activities within documents. Bearing in mind the challenges presented in the previous section, a system for structured document retrieval has been developed. This system, X-DOSE, is based on a scalable client-server architecture which allows for distributed computing. Initially, new documents are mapped onto a generic document format and stored in a relational database. A graphical interface assists users in formulating their queries using XOR, an extended version of the XML-Oriented-Retrieval language~\cite{geva06xor}. Given a query, the system returns a result set of ranked XML components to the client. Users are able to browse documents, where relevant components are focused and highlighted applying relevance-based colors. A graphical representation helps users to focus on the results retrieved without losing track of the documents' structure. Adding additional constraints identified during inspection of results enables iterative query refinement that yields more focused results. In the sequel, brief descriptions of the basic system features are presented:
The system is evaluated using real world data provided by the INitiative for the Evaluation of XML Retrieval (INEX)~\cite{INEX:02}. The next section describes the logical organization of the work.
Thesis OutlineThe thesis comprises two main parts. The first part (Chapters 2 to 7) provides the theoretical background of structured document retrieval and introduces concepts and methodologies that are used in the development of a SDR system. The second part (Chapters 8 and 9) describes the system and its evaluation using real world data. Detailed information about indexing, retrieval, classification, clustering, and evaluation procedures are given. Each chapter comprises a related work section, reviewing previous research that has been conducted in this field. In particular, the chapters are presented as follows: Chapter 2 presents the basic concepts and methodologies structured document retrieval systems rely on. Key issues of successful systems are described and brought into context. The chapter tries to answer questions like: What are the proper units to be indexed and which units are to be retrieved? How is the content within a document tree represented? Where are the bottlenecks and possible solutions? Chapter 3 covers document format, storage, and representation issues. Focusing on document-centric XML documents, a generic document schema is proposed which supports highly sophisticated user queries addressing the content, structure, and metadata. In addition to that, the underlying relational MySQL database is explained. An adequate representation for XML documents is described, relying on various term space considerations. The goal of any retrieval system is to return relevant information. In written documents this information is expressed in the form of natural language text. Chapter 4 introduces fundamental natural language processing methods that transform these texts into a computable form. Analysis steps are described and brought in relation to each other, where one steps' output serves as input for a subsequent step. Tokenization, tagging, stemming, and stopword filtering are explained and extended to achieve optimal outputs. Finally, the autonomous tasks are assembled in a sequence that transforms natural language texts into weighed feature term vectors. Based on the previous chapter, Chapter 5 deals with more complex natural language processing tasks. It is concerned with high-level pattern mining to extract multi-terms (e.g., named entities) and to identify and resolve acronyms. The process chain described previously is extended to include these tasks. Due to this process, multi-term indices are created that improve retrieval performance. Classification as a means of document organization is introduced in Chapter 6. The main issue is how two document trees are compared to each other. Based on this similarity, k-Nearest-Neighborhood classification is used to assign unseen documents to pre-existing categories. Initial experiments using an extra dataset for XML document classification yielded good results. Chapter 7 is concerned with document clustering. In contrast to classification, clustering aims at finding commonalities among documents without user intervention. Thereby, similar documents are grouped into clusters. The difficulty is to define an appropriate representation of a cluster that accounts for all features of the documents assigned to it. In this work, an approach based on supertrees is proposed. Pruning of uncommon branches is applied to increase the similarity between a cluster representative and single documents. Several similarity measures are formulated and compared to each other. In order to validate the theoretical assumptions in this chapter, a set of experiments was run on a special dataset dedicated to the evaluation of XML document clustering. The preliminary results were promising. Further investigations on the improvement of retrieval tasks due to clustering are carried out in the evaluation chapter. In Chapter 8, the architecture of X-DOSE, the XML-Document Oriented Search Engine, is described in detail. X-DOSE is based on a client-server architecture using a MySQL database. While the client provides a user-friendly graphical interface, the server implements the processing procedures for indexing, classification, clustering, and retrieval. The whole system is written in Java, where the modular design supports replacements and extensions of single system components in a simple and neat manner. Chapter 9 provides an extensive evaluation of X-DOSE. An official set of about 17.000 IEEE computer science articles provided by INEX~\cite{INEX:05} serves as basis to measure the performance of the approach. Besides experiments that measure the accuracy due to natural language processing, comparisons to similar systems using the same dataset are carried out. Finally, Chapter 10 summarizes the theoretical and experimental results achieved. Contributions of the work in the field of structured document retrieval are highlighted and open issues are identified. An outlook of future research directions concludes the thesis. |



