Research @ Hekkas.Com

Linguistically Enhanced Information Retrieval of Structured Documents

Conclusion

"The best way to get a good idea is to get a lot of ideas." L. Pauling

This chapter summarizes the research done in this dissertation. In particular, it highlights the scientific contributions to the field of structured document retrieval and lessons learned. Moreover it sheds light on some perspectives and future research directions.

 

Concerns

This thesis introduces structured document retrieval as a further development of traditional information retrieval. Documents are no longer understood as flat content containers. Instead, they look like a hierarchical structure of elements (called document components). In other terms, the contents of a document are structured into various levels of granularity. Retrieval systems that take advantage of that additional information source have to be adjusted to fit the challenges coming along with this new paradigm.

XML is used as an appropriate means to represent structured documents. In order to generalize structural elements, documents are mapped onto a generic XML document format and are stored in a relational database. Clues about the structure, including available metadata, are used in combination with the content to satisfy a users' information need more accurately. This benefits retrieval in several ways: (1) Parts of the documents usually provide more focused answers to queries than complete documents; (2) The search engine focuses on relevant information avoiding further search and browsing activities by users; (3) Expert users are able to formulate highly specific queries.

Natural Language Processing (NLP) transforms the textual content of document components into term frequency vector representations. It includes an analysis of complex linguistic patterns that are extracted autonomously. This results in multiple representations of a text that are used conjointly and support each other. These representations of a text allow the adoption of modified versions of the traditional vector space model to match and rank components according to a user query.

During retrieval, the same NLP procedures are applied on the query terms to obtain a term frequency vector. Weighing of document and query terms is done on-the-fly. Content similarity, metadata similarity, and structural similarity are combined to compute the relevance of a document component to the user's query given a set of user-defined query parameters.

Classification of similar document components into user-defined categories is proposed as a means of enhancing and facilitating the organization and navigation in large document repositories. Because of the hierarchical structure, components are represented as ordered trees and compared using a distance measure. Providing the system with pre-assigned components and categories, classification is learned and unseen components are assigned automatically.

Computational performance is a critical issue of retrieval systems and strongly depends on the underlying data structures and matching mechanisms. In this work, hierarchical clustering is proposed to lower retrieval response time by creating clusters of closely related document components. During retrieval (result computation), these clusters are used to filter documents that do not correspond to any of the query constraints before weighing and matching are applied.

Owing to the challenges described, X-DOSE (XML-Document Oriented Search Engine) is developed. Based on a client-server architecture, the system processes various requests (indexing, retrieval, classification, and clustering) in parallel. Results are transferred back to the client, which displays them to the user. The client allows the user to browse and refine the query too.

The performance of X-DOSE is evaluated using an official collection of real-world XML documents provided by INEX. Well defined retrieval tasks and evaluation measures enable a comparison of its performance against other systems using the same data and retrieval tasks. The results demonstrate that linguistically-motivated text representations can improve retrieval quality efficiently. The evaluation also shows that, especially for complex retrieval tasks, X-DOSE is competitive to other systems and even outperforms them.

 

Contributions

The contributions achieved by this research are of two types, theoretical issues and tool aspects. Theoretical issues cover concepts inherent in structured document retrieval, natural language processing, text mining, and document mining (classification and clustering). Tool aspects include the implementation of a structured document retrieval system, X-DOSE, incorporating those concepts. It comprises natural language processing and text mining components. Further, linguistic resources relevant to information retrieval are generated. X-DOSE has been evaluated extensively with experiments focusing on retrieval performance. In addition to that, experimental results of classification and clustering improving retrieval complement the evaluation.

 

Theoretical Contributions

  • Background An extensive overview of structured document retrieval is presented. It includes a description of XML, the standard format for representing the structure within text-based documents, historical development of structured document retrieval, retrieval models, representation, indexing, storage of documents, matching and ranking, query languages, result representation, and retrieval evaluation. Along with each of these topics, the challenges are pinpointed and existing solutions are provided.
  • Storage The documents are stored in a relational database. In order to ensure fast access to the structure, content, and metadata information, an existing approach is extended to fit these requirements.
  • Dynamic term spaces Since structured documents consist not only of a single (kind of) content, dynamic term spaces are defined. According to different structural levels and varying content types, dynamic term spaces enable structure-related indexing and retrieval of document components of arbitrary granularity.
  • XOR query language In cooperation with colleagues of other universities, the XML Oriented Retrieval (XOR) language is defined. XOR supports query constraints on the structure and on the content, and distinguishes components searched and components retrieved. Further, boolean operators that combine subqueries, qualifiers for the structure and the query terms, and term operators such as quotes and negation are provided.
  • Extended tokenization Representation issues of natural language texts have been addressed in Chapter 4. In this context, extended tokenization is introduced as initial processing step that includes token typing and multi-term concepts on different levels. Based on language-dependent dictionaries and tokenization rules, texts are preprocessed for subsequent analysis steps without any interpretation of their content.
  • Multi-layered stopword model In this work, stopword filtering is applied to reduce the amount of index terms. A multi-layered stopword model of functional, content-related, and domain-specific stopwords is motivated and implemented. Formulae for ranking and extracting stopwords automatically are defined and tested.
  • Text mining Chapter 5 is concerned with corpus-based text mining tasks based on the output of extended tokenization. A process that generates tokenization rules and dictionaries of single-tokens and multi-tokens is defined and illustrated.
  • XML classification Aspects relevant to classification of XML components into predefined categories have been discussed in Chapter 6. In this context, a similarity function that compares two document trees is defined. Two measures that combine structural similarity and content similarity are proposed. A first approach based on tree edit distance focuses on the structural match of two document trees. In contrast, a second approach based on content matrix distance puts more emphasis on the content match. The evaluation shows that best performance is achieved when both measures are combined.
  • XML clustering Clustering of structured documents is covered in Chapter 7. A novel representation of cluster representatives based on supertrees is introduced. A merge operation that inserts a document tree into a supertree is defined. Two similarity functions, one that compares a document tree to a supertree, and another one that compares two supertrees to each other, are provided. Both similarity functions support ordered and unordered tree comparisons.
  • Result ranking Ranking, as proposed in this work, combines the similarities computed separately for the content, structure, and metadata information. User-defined query parameters allow to control the impact of each of these factors on the overall relevance of a result component. Therefore, a set of ranking formulae is defined.

 

Tool Aspects

  • Generic XML format A generic XML document format is developed together with a mapping mechanism that transforms incoming documents into that format. The motivation of exploiting a single format for retrieval is to guarantee (1) valid document structures, (2) uniform access to standardized components, and (3) efficient processing.
  • XOR parser In order to process queries formulated in the XOR query language correctly, a parser based on JavaCC is implemented. It parses a valid XOR query and transforms it into an Abstract Syntax Tree (AST).
  • JavaTok Referring to the objectives of extended tokenization, JavaTok, a tokenizer that covers these issues, is implemented. JavaTok is available as stand-alone software, Java class framework, and online version. Texts are processed in a pipelined architecture of eights steps. Each step is optional and can be parameterized via a configuration file. Token type assignment and multi-token creation are done by dictionary lookup and rule matching. UTF-16 support and multiple output formats such as HTML and XML facilitate the integration of JavaTok into other systems.
  • Stopword list In several experiments, a set of 804 categorized stopwords (functional, content-related, and domain-specific) is extracted from the experimental corpus used. In focused experiments, performance evaluations regarding the retrieval quality of each stopword layer are carried out.
  • Generation of natural language resources From the document collection, various natural language resources are generated to improve natural language text representation. 72 basic token types relevant to information retrieval are identified. Using that token types, 322 corpus-specific abbreviations, over 650.000 multi-terms (composite nouns, named entities, and formulaic speech), and about 17.000 acronyms with their full forms are extracted into separate dictionaries. In addition to that, a rule miner is implemented. For a given input pattern (a sequence of input tokens), it extracts statistically-motivated identification rules based on the left and right token context of the pattern. Using that rule miner, a set of complex multi-token rules was identified (e.g., phone numbers, citations, acronyms).
  • Experiments on XML classification Several experiments on five different data sets provided by INEX are conducted to evaluate the performance of XML classification. A comparison with other classification systems shows that the approach proposed is competitive compared to others.
  • Experiments on XML clustering Clustering performance is evaluated on six different data sets provided by INEX. In five experiments, the performance of $k$-Means and agglomerative hierarchical clustering is compared using several parameter settings.
  • X-DOSE prototype X-DOSE (XML-Document Oriented Search Engine) is implemented to measure the effectiveness of the retrieval approach proposed in this work. X-DOSE is based on a client-server architecture. It supports three types of queries, depending on the expertise of the user. Internally, queries are represented in an extended XOR syntax. Results are displayed by the client, providing browsing functionality of single documents. A class manager enables the classification of XML components retrieved into user-defined categories. XML clustering is integrated to speed up retrieval internally.
  • Result representation A novel graphical presentation of result components based on treeMaps and textBars is implemented. It enables navigation within the structure of a result document/component by supporting users with a relevance-based color schema that reflects content and metadata similarities.
  • Evaluation of X-DOSE An extensive evaluation of the X-DOSE retrieval performance focusing on computational complexity and retrieval quality is conducted. Relying on an official corpus of about 17.000 computer science documents, nine separate experiments focus on the quality of text representations, the interaction of structure, metadata, and content information, optimal parameter settings, and response time reduction. An evaluation metric based on cumulated gain ($nxCG$) is used to express the performance of each retrieval task. Performance comparisons of similar systems using the same data show that X-DOSE is competitive and even outperforms other systems.

 

Lessons Learned

Research in the field of structured document retrieval is up-to-date and just at its beginning. Only with fundamental knowledge about the internal processing and functioning of these systems, adequate solutions to the difficulties described become feasible. Especially, computational performance is an issue that researchers and developers must be aware of. Although hardware becomes more powerful, faster, and cheaper, it is surely not the answer to manage the complexity of processing structured documents. Hence, it might be interesting for other researchers in this field to draw some conclusions from the theoretical work and the experiments.

  • Text representation Experimental results demonstrate that even in structured documents content information is more important than structural or metadata information. Thus, improvements of text representations improve the retrieval quality of structured documents. Especially extended tokenization supporting token typing and multi-terms, as proposed in this work, turns out to be an effective technique to pre-process textual content efficiently. Subsequent processing steps such as tagging strongly rely on the correctness of the tokenization output. Filtering of functional, content-related, and domain-specific stopwords further improves the quality of index terms while reducing the size of the vocabulary.
  • Metadata matching Experiments show that metadata matching and content matching has to be performed differently. For instance, skipping stopword filtering for matching metadata leads to retrieval of a huge number of unwanted components (e.g., documents, sections, tables, figures). This is because functional stopwords (e.g., `a', `the', `and') are included frequently in metadata fields such as titles and captions. Retrieval based on metadata constraints has also to be reconsidered. A two-step procedure turned out to perform best: Metadata requests are transformed into SQL statements and processed by the database efficiently. For the results retrieved, similarity of metadata terms and query terms is computed using an overlap-based function on the two term sets.
  • Performance Indexing, retrieval, classification, and clustering of structured documents is more complex than it is in traditional information retrieval. Since performance is crucial for retrieval systems, processing must be efficient to ensure acceptable response times even for large amounts of data. This is especially true for structured document retrieval.
  • Dynamic aspects The possibility to keep term weights dynamic (on-the-fly computation during retrieval) becomes expensive in terms of processing performance. Although dynamic aspects are relevant and should be maintained, pre-computed and static representations stored might be necessary to speed up processing.
  • Data structures Related to the performance issue, appropriate data structures are the foundation of efficient processing. Especially tree data structures, as created by hierarchical clustering for document pre-selection, speeded up retrieval performance considerably. This concept might be transferred to improve the access to other data such as hierarchical text representations where common terms of two representations are raised one level higher in a representation tree.

 

Future Research Directions

There are a variety of exciting future directions in which the present work can be continued. Some of these directions were already presented at the end of Chapter 6 and Chapter 8. Other aspects worthy to be investigated further include:

  • Text mining Identification of textual patterns that go beyond multi-tokens such as named entities or citations is definitely a promising step to improve text analysis and, consequently, information retrieval. Besides the generation of static dictionaries, mechanisms that derive rules to dynamically identify these patterns in texts are an interesting issue. Applying these resources, for instance, in taggers, tagging statistics and tagging rules are reduced and simplified while the overall performance is increased regarding both, tagging quality and processing speed.
  • Shallow parsing The output of extended tokenization might be used to identify constituents of sentences such as noun phrases or verb phrases. For instance, the sequence $FS_i$ + $FS_j$ + $ALPHA\_COMMON\_LOWER$ (two functional stopwords followed by a lower case term) might be glued together as a single noun phrase (e.g., in the house, on the tree, of the road). Applying this mechanism recursively using different rule repositories, shallow parsing functionality based on token types is accomplished. Because of JavaTok's processing speed, this kind of parsing requires only little computational time. Hence, it is applicable to process even large amounts of data.
  • Additional experiments Most evaluations conducted in this research are based on the INEX document collection containing documents of a single domain. Further experiments using other corpora and queries are necessary to gain higher external validity. This includes that dictionaries and rules generated might also be enriched by patterns of other domains.
  • Document structure The generic XML format proposed is an initial step towards a general model for structuring (text) documents. From the retrieval perspective, the three main elements for structuring a document, (DOC, SEC, and FRA), seem to be sufficient. However, metadata fields for each component were chosen according to the information available in the source documents. These data fields have to be reconsidered for the general case. The generic format further supports sub-structured content within fragments (FRA elements). However, there are no standards available that ensure a correct syntax. According to the type of a fragment (e.g, table, list, figure, formula), schemata that define these substructures are needed. To realize this, there exist several alternatives such as the usage of HTML-like table syntax and MathML markup for formulae.
  • Structural similarity In the current approach, constraints on the structure are implemented as strict filters. This speeds up retrieval, but prohibits a vague matching of structural constraints (meant as structural hints). Assuming complex nested document structures, vague matching might be beneficial to both, query formulation and retrieval quality.

In sum, the goal of this thesis has been successfully achieved. In line with this research, a fully functioning structured document retrieval system called X-DOSE has been implemented. Of course, further developments may concentrate on the extension, optimization, and tuning of X-DOSE. Beyond that, the broadness and application area of structured document retrieval leaves much open space for future researchers.