Abstract
| This research sheds light on structured document retrieval and its
challenges. In this paradigm, documents are no longer perceived as
'flat' content containers. Rather, their content is structured into
a hierarchy of various levels of granularity. By considering parts
of documents instead of their entirety, a user's query can be
answered more precisely and with better focus. This implies that
traditional retrieval methods have to be enhanced by exploiting the
structure as an additional source of information. However, the
structural dimension of documents demands tailored models that
incorporate structure in the representation and retrieval process. The aim of this thesis is to develop a retrieval system capable of satisfying complex user queries that contain constraints on both content and structure. Documents are first transformed into a generic XML document format that optimally supports retrieval tasks. It consists of three structural elements: the document, its (sub-)sections, and its fragments (the smallest retrievable units). Each element includes a metadata part and a content part. A cascade of natural language processing steps analyzes the textual content and extracts relevant index terms. The analysis involves extended tokenization, which supports token types and multi-tokens, and the filtering of functional, content-related, and domain-specific stopwords. The single-term index is supplemented by multi-term indices of composite nouns, named entities, formulaic speech, and full forms of acronyms. These patterns, together with additional processing rules relevant to information retrieval (as applied during tokenization), are extracted from the documents automatically. Because a single document is split into numerous parts according to its structure, mechanisms for organizing these parts, such as classification and clustering, are needed. As a user-centered approach, classification automatically assigns the parts to pre-defined classes that may be created, populated, and navigated on demand. Clustering partitions document components that are not pre-classified into groups using similarity measures (e.g., edit distance). It is used to speed up retrieval by organizing clusters into a hierarchy. During search, clusters (and their descendant clusters) that consist of documents irrelevant to the query are simply ignored. To validate the concepts developed in the theoretical part of this thesis, a prototype called X-DOSE (XML-Document Oriented Search Engine) has been implemented.
Based on a client-server architecture, the system handles indexing, retrieval, classification, and clustering tasks. X-DOSE has been evaluated using the XML documents of the INEX collection. The empirical results indicate the appropriateness and importance of the various processing stages proposed in this dissertation. Moreover, other aspects have been investigated, such as the pros and cons of the natural language processing steps involved, the enhancement of query languages for structured documents, and hybrid similarity measures. Keywords: Information Retrieval, Structured Document Retrieval, Search Engines, XML, Natural Language Processing, Text Mining, XML Classification, XML Clustering, Document Mining, Retrieval Evaluation, X-DOSE |
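
The pruning of irrelevant clusters during search, as summarized in the abstract, can be illustrated with a minimal sketch. It is not the X-DOSE implementation: the `Cluster` class, its fields, the `search` function, and the term-overlap relevance test are all hypothetical and chosen only for illustration. The sketch also assumes that every cluster's term set covers the terms of its descendants, so that skipping an irrelevant cluster cannot hide a relevant fragment below it.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Cluster:
    """Hypothetical cluster node (an assumption, not the X-DOSE data model)."""
    terms: set[str]                                         # terms describing the cluster
    fragments: list[str] = field(default_factory=list)      # ids of member fragments
    children: list[Cluster] = field(default_factory=list)   # descendant clusters


def search(cluster: Cluster, query_terms: set[str]) -> list[str]:
    """Return fragment ids from clusters that share a term with the query."""
    # Prune: an irrelevant cluster is skipped together with all its descendants.
    if not (cluster.terms & query_terms):
        return []
    hits = list(cluster.fragments)
    for child in cluster.children:
        hits.extend(search(child, query_terms))
    return hits


# Small illustrative hierarchy (fabricated toy data, not taken from INEX).
leaf_a = Cluster({"xml", "retrieval"}, ["frag-1", "frag-2"])
leaf_b = Cluster({"clustering"}, ["frag-3"])
leaf_c = Cluster({"cooking"}, ["frag-4"])
inner = Cluster({"xml", "retrieval", "clustering"}, [], [leaf_a, leaf_b])
root = Cluster({"xml", "retrieval", "clustering", "cooking"}, [], [inner, leaf_c])

result = search(root, {"xml"})
```

In this toy hierarchy, a query for `xml` visits `root`, `inner`, and `leaf_a` only; `leaf_b` and `leaf_c` are rejected at their own term sets, and any clusters below them would never be inspected. A real system would likely replace the simple term overlap with a similarity threshold.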


