"The best way to get a good idea is to get a lot of ideas." L. Pauling
This chapter summarizes the research done in this dissertation. In
particular, it highlights the scientific contributions to the field
of structured document retrieval and lessons learned. Moreover it
sheds light on some perspectives and future research directions.
Concerns
This thesis introduces structured document retrieval as a further
development of traditional information retrieval. Documents are no
longer understood as flat content containers. Instead, they look
like a hierarchical structure of elements (called document
components). In other terms, the contents of a document are
structured into various levels of granularity. Retrieval systems
that take advantage of that additional information source have to be
adjusted to fit the challenges coming along with this new paradigm.
XML is used as an appropriate means to represent structured
documents. In order to generalize structural elements, documents are
mapped onto a generic XML document format and are stored in a
relational database. Clues about the structure, including available metadata, are used in
combination with the content to satisfy a users' information need
more accurately. This benefits retrieval in several ways: (1) Parts
of the documents usually provide more focused answers to queries
than complete documents; (2) The search engine focuses on relevant
information avoiding further search and browsing activities by
users; (3) Expert users are able to formulate highly specific
queries.
Natural Language Processing (NLP) transforms the textual content of
document components into term frequency vector representations. It
includes an analysis of complex linguistic patterns that are
extracted autonomously. This results in multiple representations of
a text that are used conjointly and support each other. These
representations of a text allow the adoption of modified versions of
the traditional vector space model to match and rank components
according to a user query.
During retrieval, the same NLP procedures are applied on the query
terms to obtain a term frequency vector. Weighing of document and
query terms is done on-the-fly. Content similarity, metadata
similarity, and structural similarity are combined to compute the
relevance of a document component to the user's query given a set of
user-defined query parameters.
Classification of similar document components into user-defined
categories is proposed as a means of enhancing and facilitating the
organization and navigation in large document repositories. Because
of the hierarchical structure, components are represented as ordered
trees and compared using a distance measure.
Providing the system with pre-assigned components and categories,
classification is learned and unseen components are assigned
automatically.
Computational performance is a critical issue of retrieval systems
and strongly depends on the underlying data structures and matching
mechanisms.
In this work, hierarchical clustering is proposed to lower retrieval
response time by creating clusters of closely related document
components.
During retrieval (result computation), these clusters are used to
filter documents that do not correspond to any of the query
constraints before weighing and matching are applied.
Owing to the challenges described, X-DOSE (XML-Document Oriented
Search Engine) is developed. Based on a client-server architecture,
the system processes various requests (indexing, retrieval,
classification, and clustering) in parallel. Results are transferred
back to the client, which displays them to the user. The client
allows the user to browse and refine the query too.
The performance of X-DOSE is evaluated using an official collection
of real-world XML documents provided by INEX. Well defined retrieval
tasks and evaluation measures enable a comparison of its performance
against other systems using the same data and retrieval tasks. The results demonstrate that linguistically-motivated text
representations can improve retrieval quality efficiently.
The evaluation also shows that, especially for complex retrieval
tasks, X-DOSE is competitive to other systems and even outperforms
them.
Contributions
The contributions achieved by this research are of two types,
theoretical issues and tool aspects. Theoretical issues cover
concepts inherent in structured document retrieval, natural language
processing, text mining, and document mining (classification and
clustering). Tool aspects include the implementation of a structured
document retrieval system, X-DOSE, incorporating those concepts. It
comprises natural language processing and text mining components.
Further, linguistic resources relevant to information retrieval are
generated.
X-DOSE has been evaluated extensively with experiments focusing on
retrieval performance. In addition to that, experimental results of
classification and clustering improving retrieval complement the
evaluation.
Theoretical Contributions
- Background An extensive overview of structured document retrieval is
presented. It includes a description of XML, the standard format for
representing the structure within text-based documents, historical development
of structured document retrieval,
retrieval models, representation, indexing, storage
of documents, matching and ranking, query languages, result
representation, and retrieval evaluation. Along with each of these
topics, the challenges are pinpointed and existing solutions are provided.
- Storage The documents are stored in a relational database. In order to
ensure fast access to the structure, content, and metadata
information, an existing approach is extended to fit these
requirements.
- Dynamic term spaces Since structured documents consist not only of a single (kind of)
content, dynamic term spaces are defined.
According to different structural levels and varying content types,
dynamic term spaces enable structure-related indexing and retrieval
of document components of arbitrary granularity.
- XOR query language In cooperation with colleagues of other universities, the XML Oriented Retrieval
(XOR) language is defined. XOR supports query constraints
on the structure and on the content, and distinguishes components searched and
components retrieved. Further, boolean operators that combine
subqueries, qualifiers for the structure
and the query terms, and term operators such
as quotes and negation are provided.
- Extended tokenization Representation issues of natural language texts have been addressed
in Chapter 4. In this context, extended
tokenization is introduced as initial processing step that includes
token typing and multi-term concepts on different levels.
Based on language-dependent dictionaries and tokenization rules,
texts are preprocessed for subsequent analysis steps without any
interpretation of their content.
- Multi-layered stopword model In this work, stopword filtering is applied to reduce the
amount of index terms. A multi-layered stopword model of functional,
content-related, and domain-specific stopwords is motivated and
implemented. Formulae for ranking and extracting stopwords
automatically are defined and tested.
- Text mining Chapter 5 is concerned with
corpus-based text mining tasks based on the output of extended
tokenization. A process that generates tokenization rules and
dictionaries of single-tokens and multi-tokens is defined and illustrated.
- XML classification Aspects relevant to classification of XML components into predefined
categories have been discussed in
Chapter 6. In this context,
a similarity function that compares two document trees is defined.
Two measures that combine structural similarity and content similarity
are proposed. A first approach based on tree edit distance focuses on
the structural match of two document trees. In contrast, a second approach based
on content matrix distance puts more emphasis on the content match.
The evaluation shows that best performance is achieved when
both measures are combined.
- XML clustering Clustering of structured documents is covered in
Chapter 7. A novel
representation of cluster representatives based on supertrees is
introduced. A merge operation that inserts a document tree into a
supertree is defined. Two similarity functions, one that compares a
document tree to a supertree, and another one that compares two
supertrees to each other, are provided. Both similarity functions support
ordered and unordered tree comparisons.
- Result ranking Ranking, as proposed in this work, combines the similarities
computed separately for the content, structure, and metadata
information. User-defined query parameters allow to control the
impact of each of these factors on the overall relevance of a result
component. Therefore, a set of ranking formulae is defined.
Tool Aspects
- Generic XML format A generic XML document format is developed together with a mapping
mechanism that transforms incoming documents into that format.
The motivation of exploiting a single format for retrieval is
to guarantee (1) valid document structures, (2) uniform access to
standardized components, and (3) efficient processing.
- XOR parser In order to process queries formulated in the XOR query language correctly,
a parser based on JavaCC is implemented. It parses a valid XOR
query and transforms it into an Abstract Syntax Tree (AST).
- JavaTok Referring to the objectives of extended tokenization, JavaTok, a
tokenizer that covers these issues, is implemented. JavaTok is
available as stand-alone software, Java class framework, and online
version. Texts are processed in a pipelined architecture of eights
steps. Each step is optional and can be parameterized via a
configuration file. Token type assignment and multi-token creation
are done by dictionary lookup and rule matching. UTF-16 support and
multiple output formats such as HTML and XML facilitate the integration of
JavaTok into other systems.
- Stopword list In several experiments, a set of 804 categorized stopwords
(functional, content-related, and domain-specific) is extracted
from the experimental corpus used.
In focused experiments, performance evaluations regarding the retrieval
quality of each stopword layer are carried out.
- Generation of natural language resources From the document collection, various natural language resources
are
generated to improve natural language text representation.
72 basic token types relevant to information retrieval
are identified. Using that token types, 322 corpus-specific abbreviations,
over 650.000 multi-terms
(composite nouns, named entities, and formulaic speech), and about
17.000 acronyms with their full forms are extracted into separate dictionaries. In addition to that, a rule miner is implemented. For a given input pattern
(a sequence of input tokens), it extracts statistically-motivated
identification rules based on the left and right token context of the pattern. Using that rule miner, a set of complex multi-token rules
was identified (e.g., phone numbers, citations, acronyms).
- Experiments on XML classification Several experiments on five different data sets provided by INEX are conducted to
evaluate the performance of XML classification. A
comparison with other classification systems shows that the approach
proposed is competitive compared to others.
- Experiments on XML clustering Clustering performance is evaluated on six different data sets provided by INEX. In five
experiments, the performance of $k$-Means and agglomerative hierarchical clustering
is compared using several parameter settings.
- X-DOSE prototype X-DOSE (XML-Document Oriented Search Engine) is implemented to
measure the effectiveness of the retrieval approach proposed in this work. X-DOSE is
based on a client-server architecture. It supports three types of
queries, depending on the expertise of the user. Internally, queries
are represented in an extended XOR syntax. Results are displayed by
the client, providing browsing functionality of single documents.
A class manager enables the classification of XML components retrieved into
user-defined categories. XML clustering is integrated
to speed up retrieval internally.
- Result representation A novel graphical presentation of result components based on
treeMaps and textBars is implemented. It enables navigation within
the structure of a result document/component by supporting
users with a relevance-based color schema that reflects content
and metadata similarities.
- Evaluation of X-DOSE An extensive evaluation of the X-DOSE retrieval performance focusing on computational
complexity and retrieval quality is conducted. Relying on an
official corpus of about 17.000 computer science documents, nine
separate experiments focus on the quality of text representations,
the interaction of structure, metadata, and content information,
optimal parameter settings, and response time reduction. An
evaluation metric based on cumulated gain ($nxCG$) is used to
express the performance of each retrieval task. Performance comparisons of
similar systems using the same data show that X-DOSE is
competitive and even outperforms other systems.
Lessons Learned
Research in the field of structured document retrieval is up-to-date
and just at its beginning. Only with fundamental knowledge about the
internal processing and functioning of these systems, adequate
solutions to the difficulties described become feasible.
Especially, computational performance is an issue that researchers
and developers must be aware of. Although hardware becomes more
powerful, faster, and cheaper, it is surely not the answer to manage
the complexity of processing structured documents. Hence, it might be interesting for other researchers in this field
to draw some conclusions from the theoretical work and the
experiments.
- Text representation Experimental results demonstrate that even in structured documents content information
is more important than structural or metadata information. Thus,
improvements of text representations improve the retrieval quality
of structured documents.
Especially extended tokenization supporting token typing and
multi-terms, as proposed in this work, turns out to be an effective
technique to pre-process textual content efficiently.
Subsequent processing steps such as tagging strongly
rely on the correctness of the tokenization output.
Filtering of functional, content-related, and domain-specific
stopwords further improves the quality of index terms while reducing
the size of the vocabulary.
- Metadata matching Experiments show that metadata matching and content matching has
to be performed differently.
For instance, skipping stopword filtering for matching metadata
leads to retrieval of a huge number of unwanted components (e.g., documents,
sections, tables, figures).
This is because functional stopwords (e.g., `a', `the', `and')
are included frequently in metadata fields such as titles and captions.
Retrieval based on metadata constraints has also to be reconsidered.
A two-step procedure turned out to perform best: Metadata requests
are transformed into SQL statements and processed by the database
efficiently. For the results retrieved, similarity of metadata terms and
query terms is computed using an overlap-based function on the two term
sets.
- Performance Indexing, retrieval, classification, and clustering of structured
documents is more complex than it is in traditional information
retrieval. Since performance is crucial for retrieval systems,
processing must be efficient to ensure acceptable response times
even for large amounts of data. This is especially true for
structured document retrieval.
- Dynamic aspects The possibility to keep term weights dynamic (on-the-fly computation
during retrieval) becomes expensive in terms of processing performance.
Although dynamic aspects are relevant and should be maintained,
pre-computed and static representations stored might be necessary to
speed up processing.
- Data structures
Related to the performance issue, appropriate data structures are
the foundation of efficient processing. Especially tree data
structures, as created by hierarchical clustering for document
pre-selection, speeded up retrieval performance considerably.
This concept might be transferred to improve the access to other data
such as hierarchical text representations where common terms of two representations
are raised one level higher in a representation tree.
Future Research Directions
There are a variety of exciting future directions in which the
present work can be continued. Some of these directions were already
presented at the end of
Chapter 6 and
Chapter 8. Other aspects worthy to be investigated
further include:
- Text mining
Identification of textual patterns that go beyond multi-tokens such
as named entities or citations is definitely a
promising step to improve text analysis and, consequently, information retrieval.
Besides the generation of static dictionaries,
mechanisms that derive rules to dynamically identify these patterns in texts
are an interesting issue. Applying these resources, for instance, in taggers, tagging statistics
and tagging rules are reduced and simplified while the overall
performance is increased regarding both, tagging
quality and processing speed.
- Shallow parsing The output of extended tokenization might be used to identify
constituents of sentences such as noun phrases or verb phrases. For
instance, the sequence $FS_i$ + $FS_j$ + $ALPHA\_COMMON\_LOWER$ (two functional
stopwords followed by a lower case term) might be glued together as a
single noun phrase (e.g., in the house,
on the tree, of the road). Applying this mechanism
recursively using different rule repositories, shallow
parsing functionality based on token types is accomplished. Because
of JavaTok's processing speed, this kind of parsing requires only
little computational time. Hence, it is applicable to process even
large amounts of data.
- Additional experiments Most evaluations conducted in this research are based on the INEX
document collection containing documents of a single domain. Further
experiments using other corpora and queries are necessary to gain
higher external validity. This includes that dictionaries and
rules generated might also be enriched
by patterns of other domains.
- Document structure The generic XML format proposed is an initial step towards a general
model for structuring (text) documents.
From the retrieval perspective, the three main elements for
structuring a document,
(DOC, SEC, and FRA), seem to
be sufficient. However, metadata fields for each component were
chosen according to the information available in the source
documents. These data fields have to be reconsidered for the general case.
The generic format further supports sub-structured
content within fragments (FRA elements). However, there are no standards available that
ensure a correct syntax. According to the type of a fragment (e.g,
table, list, figure, formula), schemata that define these
substructures are needed. To realize this, there exist several alternatives
such as the usage of HTML-like table syntax and MathML markup for formulae.
- Structural similarity In the current approach, constraints on the structure are
implemented as strict filters. This speeds up retrieval, but prohibits
a vague matching of structural constraints
(meant as structural hints).
Assuming complex nested document structures, vague matching might be
beneficial to both, query formulation and retrieval quality.
In sum, the goal of this thesis has been successfully achieved. In
line with this research, a fully functioning structured document
retrieval system called X-DOSE has been implemented. Of course,
further developments may concentrate on the extension, optimization,
and tuning of X-DOSE.
Beyond that, the broadness and application area of structured
document retrieval leaves much open space for future researchers.
|