Research @ Hekkas.Com

Linguistically Enhanced Information Retrieval of Structured Documents

XML Classification

"Science must begin with myths, and with the criticism of myths." Karl Popper


%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Chapter 6 - Classification of XML Documents
%
% last change: 16.08.2004
% correction hamid: xx.xx.2004
% correction prof: xx.xx.2004
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\aphorism{Controlling complexity is the essence of computer programming.}{B. Kernighan}%
\chapter{Classification of XML Documents}%
\label{chapter:xml_document_classification}%

\def\argmax{\mathop{\rm arg\,max}}

With increasing amounts of available XML documents, the task of
automatic knowledge discovery from the web becomes highly
significant. As an appropriate machinery, classification allows to
categorize documents to facilitate that task. In this chapter a
classification approach for structured documents is introduced. It
is based on the $k$-nearest neighborhood algorithm that relies on an
edit distance measure. The originality of the work lies in combining
both the content and the structure of XML documents to compute the
edit distance. The approach is empirically evaluated using
real-world XML collections provided by INEX.\footnote{Parts of this
chapter have already been published
in~\cite{bouchachia07xmlclassification} and \cite{hassler07xml}.}

 

%----------------------------------------------------
%----------------------------------------------------
%----------------------------------------------------
\section{Introduction}

% general introduction
The eXtensible Markup Language has recently emerged as a standard
for developing many web applications dealing with document storage
and retrieval, e.g., digital libraries. XML was mainly developed to
achieve an enriched representation of documents and more retrieval
flexibility when searching information. Systems concerned with such
web applications are mainly augmented traditional information
retrieval systems. In contrast to traditional retrieval systems that
deal with flat documents, XML retrieval systems take the logical
structure of documents into account. In addition to the raw text
(pure content), this structure is considered as a further valuable
information source for document representation. It serves to refine
the search process and to improve the quality of the retrieval
results. Indeed, each document is represented as a tree of XML
nodes, where each node is associated with a label and a content (XML
component). The goal of retrieval systems is, therefore, to retrieve
only relevant components instead of the whole documents in response
to the user queries. As a matter of consequence, the retrieval
precision gets better.

% why classification
Form the structured document point of view, it is of high importance
to exploit the structure of documents in order to devise a
classification machinery. This latter is a very significant
mechanism in the context of XML retrieval for two reasons: (1) a
user query can be satisfied by means of different possible answers;
closely associated documents tend to be relevant to the same
requests; thus, returning all documents of classes with relevant
documents is feasible, and (2) retrieval systems and web mining
tools are generally operational in the same environment.

% what info is used
Classifying (and clustering) XML documents can be basically done in
three ways: (1) using exclusively the textual contents of documents
as usually done in traditional text categorization (and clustering)
systems, (2) using exclusively the structure of the XML documents,
and (3) using both the contents and the structure in a hybrid
manner. In this work, the latter approach is followed. The aim is to
discover structural and content patterns (characteristics) shared by
XML documents of the same class. These patterns are mainly expressed
in terms of their tags (node labels), contents, and
inter-relationship.

% kNN
To achieve this goal, the $k$-nearest neighborhood algorithm is
applied for classification. The algorithm, which strongly relies on
a distance measure, will be explained later. Since an XML document
is represented in the form of a tree, it is straightforward to adopt
an \emph{edit distance} to measure the distance between document
trees (dissimilarity measure). Basically, edit distance algorithms
compute the minimum cost to transform one document tree, the
\emph{source tree}, into another document tree, the
\emph{destination tree}.
%
Generally, existing algorithms focus on the structure only. This
chapter shows how the content can be embedded into the edit
distance. To the best of the author's knowledge, no previous work
has used edit distance taking the content and structure of the XML
tree into account.

% edit distance
The idea to measure the similarity between document-centric XML
documents using an edit distance measure is based on the way
documents are written in real world. This allows to take `natural'
document authoring techniques (see
Figure~\ref{fig:document_authoring}) into account, applying
application-specific weights. Looking at real world documents, the
structure is also quite homogenous and could simply be reduced to a
minimal set of different nodes (i.e., title, author, section,
paragraph, table, figure, etc.). This homogenous set of XML
components, which is highly recursive (i.e., chapter, section,
subsection), leads to quite similar document structures of
completely different content. Thus, an appropriate similarity
measure plays a key role in this domain.
%
\begin{figure}[ht]
\centering
%
\subfloat[inserting]{\includegraphics[width=0.3\textwidth]{07_classification/figures/Document_Authoring_insert}\label{fig:document_authoring_insert}}
\qquad%
\subfloat[deleting]{\includegraphics[width=0.3\textwidth]{07_classification/figures/Document_Authoring_delete}\label{fig:document_authoring_delete}}
\qquad%
\subfloat[altering]{\includegraphics[width=0.3\textwidth]{07_classification/figures/Document_Authoring_alter}\label{fig:document_authoring_alter}}
\\%
\vspace{0.5cm}%
\subfloat[copying]{\includegraphics[width=0.3\textwidth]{07_classification/figures/Document_Authoring_copy}\label{fig:document_authoring_copy}}
\qquad%
\qquad%
\qquad%
\subfloat[moving]{\includegraphics[width=0.3\textwidth]{07_classification/figures/Document_Authoring_move}\label{fig:document_authoring_move}}
\\%
\vspace{0.5cm}%
\subfloat[merging]{\includegraphics[width=0.3\textwidth]{07_classification/figures/Document_Authoring_merge}\label{fig:document_authoring_merge}}
\qquad%
\qquad%
\qquad%
\subfloat[splitting]{\includegraphics[width=0.3\textwidth]{07_classification/figures/Document_Authoring_split}\label{fig:document_authoring_split}}
\\%
\vspace{0.5cm}%
\subfloat[raising]{\includegraphics[width=0.3\textwidth]{07_classification/figures/Document_Authoring_raise}\label{fig:document_authoring_raise}}
\qquad%
\qquad%
\qquad%
\subfloat[lowering]{\includegraphics[width=0.3\textwidth]{07_classification/figures/Document_Authoring_lower}\label{fig:document_authoring_lower}}
%
\caption{Document authoring operations}
\label{fig:document_authoring}
\end{figure}

% outline
The rest of this chapter is organized as follows. Some of the
research work dedicated to XML document classification is briefly
summarized in Section~\ref{sec:related}. In
Section~\ref{sec:ted_distance}, the edit distance algorithm is
discussed with respect to content and structure. An alternative
approach, based solely on the content of documents, is introduced in
Section~\ref{sec:cm_distance}. Section~\ref{sec:knn} briefly
explains the $k$-NN algorithm. Section~\ref{sec:evaluation}
discusses the experimental evaluation of the approaches. Finally,
Section~\ref{sec:classification:summary} concludes this chapter.

 

 

%----------------------------------------------------
%----------------------------------------------------
%----------------------------------------------------
\section{Related Work}
\label{sec:related}

Although classification has been widely discussed in the framework
of traditional information retrieval (with flat documents), it has
not yet gained much attention in the field of structured documents.
In the sequel, some of the research work dedicated to XML
document classification is briefly summarized.

% Denoyer
In~\cite{denoyer04bayesian}, a classification approach is proposed
that aims at using the structure and the contents to classify XML
documents. It relies on a generative Bayesian classifier. Here, an
XML document is represented in the conventional tree-like form of a
directed acyclic graph, where each node of the graph represents a
XML component and each edge represents a parental relationship
between the nodes. The generative model assumes that there are two
types of belief, structural and textual, which will be combined to
get one single evidence about a document assignment to classes:
%
\[P(doc|\theta) = P(str|\theta) \times P(cont|str,\theta)\]
%
where $\theta$ is a set of the model parameters, $doc$, $str$, and
$cont$ designate document, structure, and content. The quantities
$P(str|\theta)$ and $P(cont|str,\theta)$ are computed via a belief
network as follows:
%
\[P(str|\theta)=\prod_{i=1}^{\#~of~tags} P(node_i|ancestor(node_i))\]
%
\[P(cont|str,\theta)=\prod_{i=1}^{\#~of~terms}P(term_i|node_j,\theta)\]

During the training phase, the parameters of the generative model
are learned using the Expectation-Maximization method. During the
classification phase, the classifier is used to assign unseen XML
documents to the class with the highest probability
$P(doc|class_k)$.

 

% Bratko
In~\cite{bratko04exploiting}, a similar work to that described in
\cite{denoyer04bayesian} is developed. It consists of exploiting
structural information in semi-structured documents using a
generative Bayesian Classifier. The method, however, is less
general, since the assumption is that all documents have the same
structure across all classes. Documents are broken down into
components, each of which contains either structured data or
non-structured textual data. Let $d_j$, $c_k$, $s_l$, $t_i$,
designate respectively a document $j$, a class $k$, a component $l$,
and a term $i$, the total probability of a document is the product
of the individual component probabilities:
%
\[P(d_j|c_k)=\prod_{s_l\in d_j}\prod_{t_i\in s_l}P_{s_l}(t_i|c_k)^{f^j(t_{is})}\]
%
where $f^j(t_{is})$ indicates the word frequency of term $i$ in the
$s^{th}$ component of the document $j$. The first product is over
all structural components $s_l$ that are present in the document
$d_j$. The probability $P_{s_l}$ is obtained for each document
component. From this, it is clear that each document is represented
by a set of feature vectors, one for each component, so that word
frequency $f^j(t_{is})$ is maintained on a per-component basis.

 

% zaki
In~\cite{zaki03xrules} a classifier, called XRules, is developed.
This classifier is structure-oriented and aims at discovering a set
of structural rules that define the individual classes. Basically,
these rules that reflect regular structural patterns of each class
are learned during the training phase. During the classification
phase, given an unlabeled XML document, the set of rules pertaining
to that document are used to compute the membership evidence to
classes.

 

% graph matching
Hence (document) trees are a special form of graphs, approaches from
the graph matching and comparison domain can also be
applied~\cite{wang95graph,zhang95editing,messmer98subgraph,bunke00graphs,myers99graph,shoubridge99networks,cordella99evaluation,hlaoui02graph,hlaoui03graph,gori05graph,huang06subgraph}.
Nearly all of these approaches rely on additional heuristics to
reduce the graph isomorphism problem, which is considered to be
$\mathcal{NP}$-complete~\cite{lubiw81graph}, to a simpler problem
definition. These heuristics, for the sake of comparing
document-centric XML documents, are not on hand. Also, the
additional freedom in the structure provided by a graph
representation seems not to reflect the natural way of document
authoring. Because of this and their runtime performance (high
complexity) these approaches are not investigated in this work.

 

% tree edit distance
Most of the work in the area of XML documents classification uses
\emph{tree edit distance} to measure the distance between XML
document trees. Basically, edit distance algorithms compute the
minimum cost to transform one document tree into another. Each
transformation (e.g., insert, delete, alter) is associated with a
certain cost. In addition to the edit distance measure, other
approaches apply other types of distance measures. These rely on XML
peculiarities such as tags, parent-child relationships, root-leaf
XPaths, arity of nodes,
etc.~\cite{zhang03similarity,candillier05classification,flesca05similarity,park05search}.

% examples, edit script
Many algorithms have been proposed to compute the edit distance
between two trees
\cite{chawathe96change,chawathe97meaningful,chawathe99comparing,cobena02detecting,tai79tree,zhang89editing,wang94system,shasha97approximate}.
Most of these algorithms use dynamic programming techniques as
initially described in~\cite{levenshtein66binary}. The aim is to
find the cheapest sequence of transformations (i.e., the cumulated
costs), called \emph{delta script} or \emph{edit
script}~\cite{chawathe96change,wang03xdiff,wang03xdiff_master}, to
transform a tree $T_1$ into a tree $T_2$.
%
\begin{eqnarray}
dist(T_1,T_2)&=& \min_j\{s_j\} \\
s_j &=& \sum_i^{\#~transformations}{t_{i}^{(j)}}
\end{eqnarray}
%
where $s_j$ is a script, and $t_i^{(j)}$ is the cost of a
transformation $i$ in the edit script $j$.

% allowed operations
The main difference of the various edit distance measures is the set
of allowed edit operations and their associated costs. Early
work~\cite{selkow77tree} considered \emph{insert} and \emph{delete}
of leaf nodes, and node \emph{relabeling} of all nodes. Extensions
were then proposed to allow insertion and deletion of nodes anywhere
in the
tree~\cite{tai79tree,zhang89editing,wang94system,shasha97approximate}.

% ordered trees
In general the problem of finding the edit distance between two
trees is
$\mathcal{NP}$-hard~\cite{mani01normalforms,tai79tree,barnard95treetotree}.
Works in \cite{zhang89editing} and \cite{chawathe96change} instead
try to find an efficient algorithm for the reduced problem of
ordered/binary trees, in which a left-to-right order among siblings
is significant. For XML documents arising in database applications
(data-centric applications), authors believe that the unordered
model is more important~\cite{wang03xdiff}. However, in the context
of XML documents originating from natural language texts
(document-centric applications) one can argue for ordered trees.

% chawate
Chawathe et al.~\cite{chawathe96change} first applied the same edit
operations and restrictions to detect changes in structured
documents. In subsequent works he extended the approach to cover
also move operations as basic edit
operations~\cite{chawathe97meaningful}. Later he defined
operations for copying and gluing of
subtrees~\cite{chawathe99comparing}. In order to accelerate edit
distance calculations, Guha et al.~\cite{guha02approximate} relax
the problem to computationally inexpensive upper and lower bounds.
Nierman and Jagadish~\cite{nierman02structuralsimilarity}
implemented their algorithm based on Chawathe's work.

% zhang and shasha
Zhang and Shasha~\cite{zhang89editing} provide a fast algorithm to
calculate the edit distance between ordered labeled trees. The
minimum costs for mapping all descendants of a node is computed in
advance, using the notion of \emph{keyroots}. Keyroots of a tree are
defined as the set of all first-level children having left siblings
plus the root node itself. Computing the keyroots of a tree in
advance applies the concepts of \emph{tree distance} and
\emph{forest distance}. The tree distance is calculated as the
distance between two (sub-)trees without considering the context of
ancestors and/or siblings. The forest distance, instead, uses the tree
distance and takes ancestor and sibling relations into account. To
get the minimum transformation cost for two trees, the minimum cost
mapping from all keyroots amongst the children and the cost of the
leftmost child (forest distance of its rightmost child) is needed.
The algorithm then proceeds from the leaf nodes up to the root node
in a postorder traversal.

% other XML similarities, overview
Further methods dealing with similarity of XML documents can be
found in the literature related to structural mapping, e.g., change
detection~\cite{cobena02detecting,wang03xdiff_master}. A good
overview of existing algorithms for change detection together with
their properties is given by Peters~\cite{peters05change} and by
Dalamagas et al.~\cite{dalamagas06clustering}.

\begin{table}[ht]
\centering
\sffamily \small
\rowcolors{1}{tablerowcolorodd}{tablerowcoloreven}
\caption{Overview of change detection algorithms and properties~\cite{peters05change}}
\label{tab:change_detection_algorithms}
\begin{tabular}{ccccccc}
%
\rowcolor{tableheadcolor} \mcc{Algorithm} & \mcc{Complexity} & \mcc{Memory} & \mcc{Operations} & \mcc{Tree type} \tabularnewline%
%
LaDiff & $O(ne+e^2)$ & linear & basic, move & ordered \tabularnewline%
MH-Diff & $O(n^2 \log{n})$ & - & basic, move, copy & unordered \tabularnewline%
XMLTreeDiff & $O(n^2)$ & quadratic & basic & ordered \tabularnewline%
MMDiff & $O(n^2)$ & quadratic & basic & ordered \tabularnewline%
XMDiff & $O(n^2)$ & linear & basic & ordered \tabularnewline%
IBM's XML Diff \& Merge & $-$ & - & basic & - \tabularnewline%
3DM's matching algorithm & $O(n)$ & - & basic, move & ordered \tabularnewline%
XyDiff & $O(n \log{n})$ & linear & basic, move & ordered \tabularnewline%
VM Tools & $-$ & - & - & unordered \tabularnewline%
DiffXML & $O(ne+e^2)$ & linear & basic, move & ordered \tabularnewline%
KF-Diff+ & $O(n)$ & - & basic & both \tabularnewline%
XML Diff and Patch & $-$ & - & - & both \tabularnewline%
X-Diff & $O(n^2)$ & quadratic & basic & unordered \tabularnewline%
DeltaXML & $-$ & linear & basic & both \tabularnewline%
TreePatch & $O(ne+e^2)$ & linear & basic, move & - \tabularnewline%
BioDIFF & $O(n^2)$ & quadratic & basic & unordered
%
\end{tabular}
\end{table}

 

%----------------------------------------------------
%----------------------------------------------------
%----------------------------------------------------
\section{Tree Matching via Edit Distance}
\label{sec:ted_distance}

In order to apply any kind of (semi-)automatic classification
mechanism on semi-structured documents, one has to define a metric
that expresses the similarity of two documents first. Based on this
similarity a certain distance between two documents can be defined
by means of a `dissimilarity'.

Several metrics which are quite different in terms of how this
similarity is computed can be found in the literature. Mainly they
can be categorized into two categories:

\begin{itemize}
%
\item \textbf{Heuristic approaches:} Mainly statistically motivated,
these methods match documents by comparing common tags, parent-child
relations, root-leaf XPaths, arity of nodes,
etc.~\cite{zhang03similarity,candillier05classification,zaki03xrules}.
%
\item \textbf{Tree-based approaches:} These approaches measure the similarity
of two documents represented as labeled trees. Depending on their
application, three different methods can be
distinguished~\cite{bille05treeeditdistance}:
\begin{itemize}

\item \emph{Tree edit distance} is defined as the minimum cost of
transforming one tree into the other (see \cite{chawathe96change,zhang89editing,shasha97approximate}).
Each transformation (e.g., insert, delete, alter) is associated with
a certain cost. The algorithm has to find the cheapest sequence
of transformations known as delta script. Extensions of these approaches
also operate on (rooted) graphs.
%
Figure~\ref{fig:tree_edit_distance} shows the transformation from
the source tree to the destination tree. In (a) the nodes $C$ and
$E$ are deleted from the source tree, generating an intermediate
tree (b). Afterwards, the nodes $H$ and $I$ are inserted
into the intermediate tree. Finally, altering the nodes $B$ to $B'$ and
$D$ to $D'$ results in the destination tree (c).
%
\begin{figure}[ht]
\centering
\subfloat[]{\includegraphics[width=0.2\textwidth]{07_classification/figures/TreeEditDist_edit_a}\label{fig:tree_edit_distance_a}}
\qquad%
\subfloat[]{\includegraphics[width=0.2\textwidth]{07_classification/figures/TreeEditDist_edit_b}\label{fig:tree_edit_distance_b}}
\qquad%
\subfloat[]{\includegraphics[width=0.2\textwidth]{07_classification/figures/TreeEditDist_edit_c}\label{fig:tree_edit_distance_c}}
\caption{Tree Edit Distance}
\label{fig:tree_edit_distance}
\end{figure}

\item \emph{Tree alignment} is defined as the cumulated cost for
matching all nodes of a source tree $T_1$ to all nodes of an isomorphic
destination tree $T_2$~\cite{jiang95alignment} (disregarding node labels).
The cost function operates on pairs of nodes.
First, `empty' nodes are inserted in the source and destination trees
so that the trees become isomorph. Then, all pairs of nodes $(n_1,n_2)$,
$(n_1,null)$, and $(null,n_2)$ are compared and their costs are
summed. Figure~\ref{fig:tree_alignment} illustrates the
source tree (Figure~\ref{fig:tree_alignment_a}), the destination tree
(Figure~\ref{fig:tree_alignment_b}), and the tree with the aligned
node pairs (Figure~\ref{fig:tree_alignment_c}).
%
\begin{figure}[ht]
\centering
\subfloat[]{\includegraphics[width=0.2\textwidth]{07_classification/figures/TreeEditDist_align_a}\label{fig:tree_alignment_a}}
\qquad%
\subfloat[]{\includegraphics[width=0.2\textwidth]{07_classification/figures/TreeEditDist_align_b}\label{fig:tree_alignment_b}}
\qquad%
\subfloat[]{\includegraphics[width=0.27\textwidth]{07_classification/figures/TreeEditDist_align_c}\label{fig:tree_alignment_c}}
\caption{Tree Alignment}
\label{fig:tree_alignment}
\end{figure}

\item \emph{Tree inclusion} is defined to determine if a source
tree $T_1$ is included in a destination tree
$T_2$~\cite{knuth68programming,kilpelaeinen95inclusion}. Therefore,
nodes from the destination tree are deleted until both trees
correspond or the number of nodes in $T_1$ becomes larger than the
number of nodes in $T_2$ thus demonstrating that $T_1$ is no longer
included in $T_2$.
%
Figure~\ref{fig:tree_inclusion} sketches the process to check
whether a source tree (Figure~\ref{fig:tree_inclusion_a}) is
contained in a destination tree (Figure~\ref{fig:tree_inclusion_b})
by successively deleting nodes (Figure~\ref{fig:tree_inclusion_c}).
%
\begin{figure}[ht]
\centering
\subfloat[]{\includegraphics[width=0.2\textwidth]{07_classification/figures/TreeEditDist_subtree_a}\label{fig:tree_inclusion_a}}
\qquad%
\subfloat[]{\includegraphics[width=0.2\textwidth]{07_classification/figures/TreeEditDist_subtree_b}\label{fig:tree_inclusion_b}}
\qquad%
\subfloat[]{\includegraphics[width=0.2\textwidth]{07_classification/figures/TreeEditDist_subtree_c}\label{fig:tree_inclusion_c}}
\caption{Tree inclusion}
\label{fig:tree_inclusion}
\end{figure}

\end{itemize}

\end{itemize}

The focus of this work lies on the tree-based approach of metrics,
and here especially on the tree edit distance. The algorithm
proposed by Nierman and
Jagadish~\cite{nierman02structuralsimilarity}, which allows to
compute the edit distance between two trees using only the labels of
the nodes, will be modified. Whilst the former algorithm is
structure-oriented, this research aims at inducing a classifier that
takes the structure and the content into account. The dynamical
aspect of the algorithm is kept, but it has been improved with
respect to two dimensions: (1) the algorithm will rely on simplified
edit operations; (2) the algorithm will take the content match into
account when computing the edit distance. Concretely, a hybrid
distance measure needs to be formulated that combines
structure-based and content-based distances. Each of these are
described in the following sections.

 

\subsection{Structure Matching}
\label{sec:ted_distance_structure}

To formulate the distance, some definitions of the basic concepts
are introduced.

\begin{definition}[Tree node]
A node $n$ of a tree $T$ is associated with a label $\lambda(n)$, a
content $\gamma(n)$, a parent node $p \in T$, and a set of child
nodes $ch(n)$. $\gamma(n)$ is empty ($null$), if a node does not
have a content. The parent node $p$ of the root node of $T$ is
$null$. The child nodes of $n$ are uniquely identified as
$n^1,n^2,\ldots,n^k$, where $k$ is the degree of $n$, denoted as
$deg(n)$.
\end{definition}

\begin{definition}[Ordered Tree]
An ordered tree $T$ is defined as a rooted tree, where a
left-to-right order among the child nodes is significant.
$root(T)$ denotes the root node of $T$.
%
The children of $T$ are the subtrees $T^i$ rooted in nodes that are
child nodes of $root(T)$ ($root(T)^i = root(T^i)$ and
$T^1,T^2,\ldots,T^k$, $k = deg(root(T)$).
%
Further, $|T|$ represents the total number of nodes in tree $T$.
\end{definition}

 

Based on the aforementioned definitions, the allowed basic edit
operations can be described as follows:

\begin{definition}[Insert operation]
Given a tree $T$ and a node $p \in T$, a node $n$ can be inserted
into $T$ as the $i^{th}$ child of the node $p$ using the operation
ins(T,n,p,i). The cost function of this operation is
$C_{ins}(T,n,p)$.
\end{definition}

\begin{definition}[Delete operation]
Given a tree $T$ and a node $n \in T$ as the $i^{th}$ child of node
$p \in T$, $n$ can be removed using the operation del(T,n,p,i). The
cost function of deletion is $C_{del}(T,n,p)$.
\end{definition}

\begin{definition}[Alter operation]
Given two nodes $n_1$ and $n_2$ whose labels are $\lambda(n_1)$ and
$\lambda(n_2)$. The label of $n_1$ can be replaced by the label of
$n_2$ using the operation $alt(n_1,n_2)(\equiv~\lambda(n_1)~\leftarrow~\lambda(n_2))$.
The cost function of this operation is
$C_{alt}(n_1,n_2)$ defined as:
%
\begin{gather}
C_{alt}(n_1,n_2) =
\begin{cases}
0 & \text{if}~ \lambda(n_1) = \lambda(n_2)\\ \label{eq:relabx}
\beta & \text{otherwise}
\end{cases}
\end{gather}
\end{definition}

Note that one can parameterize $\beta$ so that altering a node lying
at level $l$ in the document tree costs $\beta(l)$. This seems
reasonable for taking the depth of the tree and semantic closeness
between tags into account. Renaming section marking labels such as
\texttt{sec}, \texttt{ss1}, \texttt{ss2} or paragraph labels such as
\texttt{p1}, \texttt{p2}, \texttt{p3} could result in cheaper costs
because of their affiliation to the same (semantic) group (related
labels).

% explanation of the figure
Figure~\ref{fig:ted_basic_operations} illustrates the basic
operations that transform the source tree $T_1$ into the destination
tree $T_2$. In order to save space, no intermediate transformation
steps are included. Deleted nodes colored in red are marked in $T_1$
only. Inserted nodes colored in green are marked in $T_2$ only.
Altered nodes colored in blue are marked in both trees $T_1$ and
$T_2$. Additionally, altered nodes are primed for better
discrimination. All other nodes are left untouched.
%
An edit script (see Definition~\ref{def:classification:edit_script})
keeps track of all operations that transform $T_1$ into $T_2$.
%
\begin{figure}[ht]
\centering
\includegraphics[width=0.6\textwidth]{07_classification/figures/TreeEditDist_Basic_Operations}
\caption{Basic tree edit operations and edit script}
\label{fig:ted_basic_operations}
\end{figure}

 

Furthermore, if $C_{ins}(T,n,p)=C_{del}(T,n,p)$, the edit distance
measure becomes symmetric. One might think of unequal costs
for insert and delete operations though. Thus, the similarity of two
document trees $dist(T_1,T_2)$ and $dist(T_2,T_1)$ differs.
Assigning a higher cost to the insert than to the delete operation
may support a possible plagiarism identification. For instance, two
distances $dist(T_1,T_2)=3,4$ and $dist(T_2,T_1)=1,2$ might indicate
that $T_1$ and $T_2$ are very similar (low distance), where it is
easier to generate $T_1$ by deleting contents from $T_2$ than vice
versa.

These operations are applied on nodes that can be either leaf nodes
or inner nodes, where an inner node is the root of a subtree. The
cost of applying an operation on an inner node is recursively
computed by summing up the cost of applying the respective structure
modifications to its descendants. Hence, the following definitions
of the cumulative costs (see
Figure~\ref{fig:ted_recursive_operations}, nodes $B$ and $E$):

\begin{definition}[Recursive delete]
The cumulative cost of deleting a subtree $Sub$ from the tree $T$ at
node $p$ is $C_{delCum}(T,Sub,p)$.
\end{definition}

\begin{definition}[Recursive insert]
The cumulative cost of inserting a subtree $Sub$ into the tree $T$
at node $p$ is $C_{insCum}(T,Sub,p)$
\end{definition}

Both costs $C_{delCum}$ and $C_{insCum}$ are bottom-up computed as
follows:
%
\begin{equation}
\left\{
\begin{array}{l}
C_{delCum}(T,Sub,p) = C_{del}(T,root(Sub),p) + \sum_i C_{delCum}(T,Sub^i,root(Sub))\\\\
%
C_{insCum}(T,Sub,p) = C_{ins}(T,root(Sub),p) + \sum_i C_{insCum}(T,Sub^i,root(Sub)) \label{eq:C_del_cum}
%
\end{array}
\right.
\end{equation}

In other words, at each node of the subtree $Sub$ of the source
(resp. destination) tree $T_1$ (resp. $T_2$), the cost of the
recursive delete (resp. insert) operation is calculated by summing
the cost for deleting (resp. inserting) the single node $root(Sub)$
with the cumulative cost of deleting (resp. inserting) each of its
children $Sub^j$. Obviously, the deletion of a subtree is done
bottom-up, while the insertion of a subtree is done top-down.

\begin{figure}[ht]
\centering
\includegraphics[width=0.7\textwidth]{07_classification/figures/TreeEditDist_Recursive_Operations}
\caption{Recursive tree edit operations and edit script}
\label{fig:ted_recursive_operations}
\end{figure}

%%%%%%%%%%
% description of the algorithm
%%%%%%%%%%
Having introduced some variables required, the edit distance between
two trees $T_1$ and $T_2$ is computed using
Algorithm~\ref{alg:tree_edit_dist}. Based on dynamic programming,
this algorithm constructs a $deg(root(T_1)) \times deg(root(T_2))$
matrix of distance values between the nodes of the two trees. First,
the algorithm compares the root nodes of $T_1$ and $T_2$. This
corresponds to an alter operation (line~\ref{alg:ted:alt}) because
root nodes can neither be inserted nor deleted. Then, as seen in
lines~\ref{alg:ted:graft} and~\ref{alg:ted:prune}, the algorithm
computes the distance values of inserting or deleting all nodes
given the roots of two trees. These values serve to trigger the
dynamic computation of cumulative costs, where each child of $T_1$
is compared to each child of $T_2$ recursively. Indeed, a cell
$distMat[i][j]$ ($i>0$ and $j>0$) is assigned a cost that is
computed using the content of its neighboring three cells:
%
\begin{itemize}
\item the content of the upper left neighbor, $distMat[i-1][j-1]$, is added to the distance
between the subtrees rooted at nodes $n_i$ and $n_j$ (i.e., $T_1^i$
and $T_2^j$) (line~\ref{alg:ted:match}). This case corresponds to a match between
node $n_i$ and node $n_j$.
%
\item the content of the left neighbor, $distMat[i][j-1]$, is added to
the cost of inserting a subtree $T_2^j$ to the source tree (line~\ref{alg:ted:ins}).
This case corresponds to an insertion of a subtree rooted at
node $n_j$.
%
\item the content of the upper neighbor, $distMat[i-1][j]$, is added to
the cost of removing a subtree $T_1^i$ from the source tree (line~\ref{alg:ted:del}).
This case corresponds to a removal of an obsolete subtree
rooted at node $n_i$.
\end{itemize}
%
The minimum cost of these three alternatives is retained and stored
in $distMat[i][j]$.

\begin{algorithm}[ht]
\caption{Tree Edit Distance algorithm}
\label{alg:tree_edit_dist}
\begin{algorithmic} [1]
\Procedure{TreeDist}{$T_1,T_2$}
\State int M = deg(root($T_1$))
\State int N = deg(root($T_2$))
\State int[][] distMat = new int[0..M][0..N]
%\State
%\newline
\State distMat[0][0] = $C_{alt}(root(T_1),root(T_2))$ \label{alg:ted:alt}
\For{$j=1$ to $N$}
\State distMat[0][j] = distMat[0][j-1] + $C_{insCum}(T_2,T_2^j,root(T_2))$ \label{alg:ted:graft}
\EndFor
\For{$i=1$ to $M$}
\State distMat[i][0] = distMat[i-1][0] + $C_{delCum}(T_1,T_1^i,root(T_1))$ \label{alg:ted:prune}
\EndFor
%\State
%\newline
\For{$i=1$ to $M$}
\For{$j=1$ to $N$}
\State distMat[i][j] = min\{
\State ~~~distMat[i-1][j-1] + $dist(T_1^i,T_2^j)$, \label{alg:ted:match}
\State ~~~distMat[i][j-1] + $C_{insCum}(T_2,T_2^j,root(T_2))$, \label{alg:ted:ins}
\State ~~~distMat[i-1][j] + $C_{delCum}(T_1,T_1^i,root(T_1))$ \label{alg:ted:del}
\State \}
\EndFor
\EndFor
%\State
\State return distMat[M][N]
\EndProcedure
\end{algorithmic}
\end{algorithm}

The original algorithm proposed by Nierman and Jagadish outputs only
the edit distance between two trees. There is no way to reconstruct
the optimal sequence of edit operations that led to the obtained
final edit distance. To overcome that, all possible edit scripts
that correspond to the minimal distance between the two trees are
memorized. Therefore, an edit script (see
Figure~\ref{fig:ted_basic_operations},
Figure~\ref{fig:ted_recursive_operations}, and
Figure~\ref{fig:ted_content_operations}) is defined as follows:

% edit script
\begin{definition}[Edit Script]\label{def:classification:edit_script}
An edit script $\delta$ is an ordered sequence of edit operations
that transform a source tree $T_1$ into a destination tree $T_2$.
\end{definition}

% minimal edit script
In general, there exist more than a single edit script that
correctly transform $T_1$ into $T_2$. For instance, if the cost of
an alter operation is equal to the cost of a delete and an insert
operation combined, both edit scripts achieve the same
transformation cost. Further, edit scripts that initially add an
arbitrary number of nodes and delete the same nodes at the end also
carry out correct transformations. Therefore, a minimal edit script
is defined as follows:
%
\begin{definition}[Minimal Edit Script]
Let $\delta^1,\delta^2,\ldots,\delta^m$ be a set of correct edit
scripts transforming $T_1$ into $T_2$. A minimal edit script is then
defined as:
%
\begin{equation}
\begin{split}
Min_{k=1..m}\{\delta^k\}=\delta^i \Longleftrightarrow \forall \delta^j \mid i\neq j,& \\
dist(T_1,T_2)^{\delta^i} \le dist(T_1,T_2)^{\delta^j}
\end{split}
\end{equation}
%
where $dist(T_1,T_2)^{\delta^j}$ indicates the distance between
$T_1$ and $T_2$ obtained after applying the script $\delta^j$. Note
that there might be more than one minimal edit script for a pair of
trees.
\end{definition}

The above algorithm is further extended to calculate the minimal
edit scripts for a (minimal) edit distance. First suggestions for
altering Nierman's and Jagadish's algorithm in this manner come from
Barnard et al.~\cite{barnard95treetotree}. In addition to the
$distMat$ storing the cumulated edit distances only, a separate
entry in the matrix $scriptMat$ keeps track of the possible
minimal edit scripts. After the algorithm finishes,
$distMat[deg(T_1)][deg(T_2)]$ contains the minimal edit distance and
$scriptMat[deg(T_1)][deg(T_2)]$ the set of minimal edit scrips for
the edit distance.

 

\subsection{Content Matching}
\label{sec:ted_distance_content}

The second type of distance required is the content-based distance.
To define it, this work still relies on Algorithm~\ref{alg:tree_edit_dist}.
However, the previously used cost functions have to be redefined in
order to support the content match. Therefore, the notion of
integer-valued costs is replaced by float values, allowing to use
common content similarity measures.

\begin{figure}
\centering
\includegraphics[width=0.7\textwidth]{07_classification/figures/TreeEditDist_Content_Operations}
\caption{Content-based tree edit operations and edit script}
\label{fig:ted_content_operations}
\end{figure}

 

Let $sim(\gamma(n_1),\gamma(n_2))$ be an existing similarity
function that compares the contents of two nodes $n_1$ and $n_2$.
This similarity can be computed by any information matching
function. Assume that this similarity measure is normalized so that
it takes values in the unit interval [0,1] (where $sim=0$ means no
match, $sim=1$ indicates full match, and $sim \in ]0,1[$ means
partial match). Further, the following two definitions are needed:
%
\begin{equation}
sim(null,null) = 1 \label{eq:nn}
\end{equation}
%
\begin{equation}
sim(null,\gamma(n_2)) = sim (\gamma(n_1),null) = 0\label{eq:nc}
\end{equation}
%
Equation~\ref{eq:nn} stipulates that the content-based similarity of
two nodes with empty content is total (i.e., complete match), while
Equation~\ref{eq:nc} stipulates that content-based similarity
between a
node with a content and another node with empty content is 0.\\
%
The cost functions for \textbf{inserting} and \textbf{deleting nodes
with their contents} are defined as:
%
\begin{equation}
C_{insCon}(n) = 1-sim(null,\gamma(n))=1-0=1 \label{eq:cost_ins}
\end{equation}
%
\begin{equation}
C_{delCon}(n) = 1-sim(\gamma(n),null)= 1-0=1 \label{eq:cost_del}
\end{equation}

The cost of altering the content, $C_{altCon}$
(Equation~\ref{eq:cost_alt}) is the sum of the cost of changing the
label $C_{alt}(n_1,n_2)$ (Equation~\ref{eq:relabx}) and the cost of
altering the content, which is expressed as:
%
%\[\rho * (1-sim(\gamma(n_1),\gamma(n_2)))\]
\begin{equation}\label{eq:cost_alt}
C_{altCon}(n_1,n_2) = C_{alt}(n_1,n_2) + \rho * (1-sim(\gamma(n_1),\gamma(n_2)))
\end{equation}
%
with $\rho$ being a cost factor that scales up the dissimilarity of
contents (since $sim(\gamma(n_1),\gamma(n_2))\in[0,1]$,
$C_{alt}(n_1,n_2)$ can be larger than 1).
%
%The cost $C_{altCon}$ is given by:
%
%\begin{equation}\label{eq:cost_alt}
%C_{altCon}(n_1,n_2) = C_{alt}(n_1,n_2) + \rho *
%(1-sim(\gamma(n_1),\gamma(n_2)))
%\end{equation}

 

 

Depending on a similarity threshold $\alpha$, an alter operation
becomes cheaper than a delete operation and an insert operation
combined. This can be reflected by a user-specified parameter,
$\alpha$ ($0\leq \alpha<1$). Precisely, if
$sim(\gamma(n_1),\gamma(n_2))>\alpha$ then
$C_{altCon}<(C_{delCon}+C_{insCon})$ (the alter is cheaper) and if
$sim(\gamma(n_1),\gamma(n_2))<\alpha$ then
$C_{altCon}>(C_{delCon}+C_{insCon})$ (alter is more expensive);
otherwise, alter has the same cost as that of both delete and insert
combined. From this, the cost factor $\rho$ is determined as
follows:
%
\begin{equation}
\rho * (1-\alpha)= C_{delCon}(n_1) + C_{insCon}(n_2)
\end{equation}
%
leading to:
%
\begin{equation}
\rho = \frac{C_{delCon}(n_1) + C_{insCon}(n_2)}{(1-\alpha)}\label{eq:rho}
\end{equation}

Now, to take the content similarity into account
$C_{alt}(root(T_1),root(T_2))$ in Algorithm~\ref{alg:tree_edit_dist}
(line~\ref{alg:ted:alt}) is substituted for
$C_{altCon}(root(T_1),root(T_2))$. Figure~\ref{fig:ted_algorithm}
summarizes the behavior of Algorithm~\ref{alg:tree_edit_dist}.
$distMat[0][0]$ is initialized with the cost of altering $A_1$ into
$A_2$. Row 0 is then completed by cumulating the inserting costs.
Also, column 0 is completed by cumulating the pruning costs. The
remaining cells are filled by the minimal cost after adding (1) the inserting costs to the
neighboring left cell, (2) the deleting cost to the neighboring
upper cell, or (3) the altering cost to the upper left cell. The final
edit distance for transforming the source into the destination tree
can be found in the bottom right corner of the distance matrix. The
minimal edit script is computed in a separate $scriptMat$ matrix.
%
\begin{figure}[ht]
\centering
\includegraphics[width=0.9\textwidth]{07_classification/figures/TreeEditDist_Algorithm}
\caption{Tree edit distance algorithm matrix}
\label{fig:ted_algorithm}
\end{figure}

One might note that the edit distance itself cannot be interpreted
without a context, i.e., comparing two source documents with a
single destination document, where one source document is more
similar to the destination document than the other. Anyway, this
algorithm provides a relative distance measure among XML documents
based on their structure and content, which can be applied easily
for classification and clustering.

\newpage

In order to allow such an interpretation, a normalization of the
edit distances for two trees can be defined as
%
\begin{equation}
dist_{normalized}(T_1,T_2) = \frac{dist(T_1,T_2)} {maxDist(T_1,T_2)}
\end{equation}
%
where $maxDist(T_1,T_2) = |T_1| \cdot C_{delCon} + |T_2| \cdot
C_{insCon}$ defines the maximum edit distance, calculated as
deleting every node of $T_1$ and inserting every node of $T_2$. This
definition allows to compare edit distances by measuring the amount
of changes needed, incorporating the size of the compared documents.

 

 

 

%----------------------------------------------------
%----------------------------------------------------
%----------------------------------------------------
\section{Tree Matching via Content Matrix}
\label{sec:cm_distance}

So far, the focus has been on explicit consideration of the
structure to compare documents. In the sequel a more
content-oriented matching procedure is suggested that uses the
structure of documents only implicitly. More expressively, the idea
is to measure the similarity between documents using only nodes with
contents. Indeed, every node in the source tree is compared to all
nodes in the destination tree relying on a content similarity matrix
(denoted \emph{Content Matrix}), where only nodes containing content
are taken into account.

A cell $contSim[i][j]$ refers to the similarity degree between the
contents of the node $n_i$ in the source tree and the node $n_j$ in
the destination tree, i.e.,
$contSim[i][j]=sim(\gamma(n_i),\gamma(n_j))$. One might define a
two-step process to perform the comparison: If the labels of the
nodes to be compared are the same, then the contents of these nodes
are compared. However, the first step of this comparison can be
optional at the discretion of the user via setting a flag variable
$\zeta$. If $\zeta$ is set to $true$, the similarity of nodes with
non-equal labels is set to $0$.

 

Once the content similarity matrix, $contSim$, is filled, the
distance between the corresponding documents can be computed using
an algorithm that traverses that matrix in a single pass. Basically,
three edit operations: insert, delete, and alter are applied. During
the computation, these operations are labeled either $safe$
(certain) or $unsafe$ (uncertain) as shown in
Algorithm~\ref{alg:content_matrix_dist} and explained below.
Basically the algorithm proceeds in two steps, (1) computing the
similarities and (2) marking the nodes according to the edit
operations (item numbers correspond to lines in
Algorithm~\ref{alg:content_matrix_dist}):
\begin{enumerate}
\item[15:] If $simMat[i][j]=1$, then nodes $n_{1,i}$ and $n_{2,j}$
are both marked as $safe\_match$ with no additional cost.

\item[16:] A source node, $n_{1,i}$ having no matching destination nodes
($simMat[i][j]=0, \forall j=1\ldots|dest|$) is marked as
$safe\_delete$. The cumulative cost is increased by the weighed cost of $safe\_delete$.

\item[17:] A destination node, $n_{2,j}$ with no corresponding source nodes
($simMat[i][j]=0, \forall i=1\ldots|source|$) is marked as
$safe\_insert$. The cumulative cost is increased by the weighed cost of $safe\_insert$.

\item[18:] An unmarked source node $n_{1,i}$ is marked as $unsafe\_match$ along
with one unmarked destination node $n_{2,j}$, if $n_{2,j}$ is the first node (minimal index $j$)
fulfilling the condition $simMat[i][j]\ge \alpha$, where $\alpha$ is a user-specified
similarity threshold.

\item[19:] Any remaining unmarked source node, $n_{1,i}$, ($simMat[i][j]<\alpha, \forall j=1\ldots|dest|$)
are marked as $unsafe\_delete$.

\item[20:] Any remaining unmarked destination node, $n_{2,j}$, ($simMat[i][j]<\alpha, \forall i=1\ldots|source|$)
are marked as $unsafe\_insert$.

\end{enumerate}

 

\begin{algorithm}
\caption{Content Matrix Distance algorithm} \label{alg:content_matrix_dist}
\begin{algorithmic} [1]
\Procedure{ContentDist}{$T_1,T_2,\zeta$}
\State int M = $|T_1|$ /*only content nodes*/
\State int N = $|T_2|$ /*only content nodes*/
\State float[][] $simMat$ = new float$[0..M][0..N]$
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\For{$i=1$ to $M$}
\For{$j=1$ to $N$}
\If{$\zeta = true$ \textbf{and} $\lambda(n_1) \ne \lambda(n_2)$}
\State $simMat[i][j] = 0$
\Else
\State $simMat[i][j]=sim(\gamma(n_{1,i}),\gamma(n_{2,j}))$
\EndIf
\EndFor
\EndFor
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\State
\State check for $safe\_match$
\State check for $safe\_delete$
\State check for $safe\_insert$
\State check for $unsafe\_match$
\State check for $unsafe\_delete$
\State check for $unsafe\_insert$
\State
\State return $\sum weighed\_costs$ (based on marks)
\EndProcedure
\end{algorithmic}
\end{algorithm}

%%%%%%%%%%

As in the tree edit distance approach, each of the edit operations
is associated with a certain cost. In addition, the labels $[safe]$
and $[unsafe]$ can be weighed. Since the distance is inversely
proportional to the similarity, the transformation costs are
calculated using the same formula as applied in the tree edit distance
approach taking $dist(n_{1,i},n_{2,j}) = 1,0-contSim[i][j]$ into
account. The final distance between the documents is the sum of all
of the (weighed) operation's costs. Clearly, the computation of the
edit script is done in linear time and space (since, there is no
recursive computation).

Figure~\ref{fig:content_matrix} shows how the algorithm works. All
nodes containing content of the source tree~($B$, $C$, $E$, $F$, and
$G$ in Figure~\ref{fig:content_matrix_a}) are compared to all nodes
containing content of the destination tree~($B$, $E$, $C'$, $G$, and
$C''$ in Figure~\ref{fig:content_matrix_b}). The corresponding
similarity matrix $simMat$ is shown in
Figure~\ref{fig:content_matrix_c}. Additional data structures
accelerate the identification of $[safe]$ matches ($row Match$),
inserts ($col Empty$), and deletes ($row Empty$). Also, $row Sim$
keeps track of the number of nodes exceeding the similarity
threshold $\alpha$. Figure~\ref{fig:content_matrix_d} presents the
final costs for all nodes, which are $4,1$ in sum. A special case is
the multiple matches of source node $C$ in the destination tree with
$C'$ and $C''$. It is resolved by matching $C$ with the node
corresponding highest, $C''$. As a single node is allowed to match
only another single node, node $C'$ automatically becomes marked as
$[unsafe]$ insert.
%
\begin{figure}
\centering
\subfloat[Source tree]{\includegraphics[width=0.4\textwidth]{07_classification/figures/ContentMatrix_source}\label{fig:content_matrix_a}}
\qquad%
\subfloat[Destination tree]{\includegraphics[width=0.4\textwidth]{07_classification/figures/ContentMatrix_destination}\label{fig:content_matrix_b}}
\\%
\subfloat[Similarity matrix ($\alpha=0,5$)]{\includegraphics[width=0.4\textwidth]{07_classification/figures/ContentMatrix_matrix}\label{fig:content_matrix_c}}
\qquad%
\subfloat[Cumulated costs]{\includegraphics[width=0.3\textwidth]{07_classification/figures/ContentMatrix_costs}\label{fig:content_matrix_d}}
\caption{Content matrix match}
\label{fig:content_matrix}
\end{figure}

 

Table~\ref{tab:comparison_ted_cm} summarizes the described
approaches, tree edit distance (TED) and content matrix (CM), and
their supported document editing operations.

\begin{table}[ht]
\centering
\sffamily \small
\rowcolors{1}{tablerowcolorodd}{tablerowcoloreven}
\caption{Comparison of tree edit distance and content matrix}
\label{tab:comparison_ted_cm}
\begin{tabular}{lcc}
%
\rowcolor{tableheadcolor} \mcc{Edit operations} & \mcc{TED} & \mcc{CM} \tabularnewline%
%
insert & + & + \tabularnewline%
delete & + & + \tabularnewline%
alter & + & + \tabularnewline%
move & edit script & + \tabularnewline%
copy & edit script & + \tabularnewline%
deepening & multiple matches & multiple matches \tabularnewline%
flattening & multiple matches & multiple matches
%
\end{tabular}
\end{table}

 

 

 

%----------------------------------------------------
%----------------------------------------------------
%----------------------------------------------------
\section{Overview of $k$-NN}\label{sec:knn}

The $k$-NN algorithm~\cite{duda92pattern,sable01bins} serves for
classifying documents into pre-existing classes. It is based on the
assumption that the classification of a sample is most similar to
the classification of other samples that are nearby in the space. In
Figure~\ref{fig:knn} the unknown data point is assigned to the red
class, because 5 out of $k=7$ nearest elements are assigned the red
label.

Compared to other learning methods such as probabilistic
classifiers, $k$-NN does not rely on prior probabilities. Further,
it is known for its effectiveness~\cite{yang99reexamination}. The
main computation task is to sort the training samples in order to
find the $k$-nearest neighbors of a given query. It is then
straightforward to apply the $k$-NN algorithm for classifying XML
documents.

\begin{figure}[ht]
\centering
\includegraphics[width=0.4\textwidth]{07_classification/figures/kNN}
\caption{$k$-Nearest Neighborhood classification}
\label{fig:knn}
\end{figure}

 

Its application consists of collecting the set of $N$ training
documents $X^{T}=\{(x_1, y_1), (x_2,y_2), ..., (x_N,y_N)\}$, where
$x_i$ are the XML documents and $y_i$ are the class labels. This set
is used as reference samples for the $k$-NN algorithm. To assign a
label to an unlabeled document (query), $doc$, the algorithm finds
the $k$ documents in the reference samples (labeled documents) that
are the closest to it. The label shared by the majority of these $k$
nearest neighbors is assigned to the query.

However, to apply $k$-NN an appropriate value of $k$ needs to be
chosen, since the performance of the algorithm depends on this
value.

Note that $k$-NN is a lazy learning algorithm because no model needs
to be built a priori. Moreover, despite its efficiency problems,
$k$-NN is known to be one of the most effective classification
methods. Steps of the classification procedure via $k$-NN are
summarized in Algorithm~\ref{alg:knn}. Once the unlabeled XML
document set is assigned class labels, the classification accuracy
can be computed. It measures how often the algorithm's labeling
decision meets the actual labels of the documents.

\begin{algorithm}[t]
\caption{Classification via the $k$-NN algorithm}
\label{alg:knn}
\begin{algorithmic}[1]
\State $X^{L}$ \ldots labeled documents, $\mathcal{C}$ \ldots number of classes such that
$\bigcup_{j}X_j=X^L, ~j=1...\mathcal{C}$
\State $X^{U}$ \ldots unlabeled documents
%
\smallskip
%
\Procedure{$k$-NN}{$k$}
\For {i = 1 to $|X^{U}|$} \Comment{$x_i$ \ldots unlabeled sample}
\For {j = 1 to $|X^{L}|$} \Comment{$y_j$ \ldots labeled sample}
\State $dist[j] = sim(x_i,y_j)$
\EndFor
\State Sort $dist[j]$ ascending
\State Select first $k$ documents ($\in X^{L}$ with smallest distance)
\State Assign $x_i$ the wining label $c_l$ that occurs most often among the $k$ documents
\State If more than one label wins, randomly select one of them
\EndFor
\EndProcedure
\end{algorithmic}
\end{algorithm}

 

 

%----------------------------------------------------
%----------------------------------------------------
%----------------------------------------------------
\section{Evaluation}
\label{sec:evaluation}

In this section, several aspects to quantify the classification
quality of XML documents based on their content and structure are
studied: How do different $k$ values influence the classification?
What is the impact of training size on the classification
performance? How does content and structure matching perform
compared to structure-only matching? All of these issues were
investigated using some real-world XML collections based on the
movie database (MovieDB)~\cite{denoyer05xmlmining}) which were
proposed in INEX 2005~\cite{INEX:05}. Documents of these collections
are assigned to 11 classes. Basically, the XML collections are of
two types:

\begin{itemize}

\item Structure-only (\emph{SO}) collections containing only the structure
of the XML documents.
These include four collections: \emph{m-db-s-0, m-db-s-1, m-db-s-2,
m-db-s-3}, where the last three collections are noisy versions of the
first one. The amount of noise and class overlap increases from the
first to the last collection. The collections come in the form of
two sets, a training and a testing set. The training documents of the \emph{m-db-s}
data sets are organized in 11 categories, where each category corresponds
to a movie genre. Table~\ref{tab:corpora_m-db-s} summarizes the
number of training and testing documents of each of the collections.
%
\begin{table}[ht]
\centering
\sffamily \small
\rowcolors{1}{tablerowcolorodd}{tablerowcoloreven}
\caption{Structure-Only corpora}
\label{tab:corpora_m-db-s}
\begin{tabular}{Bcccccc}
%
\rowcolor{tableheadcolor} Corpus & \mcc{Train/Test} & \mcc{Train/Test} & \mcc{Train/Test} & \mcc{Train/Test} & \mcc{Train/Test} \tabularnewline%
\rowcolor{tableheadcolor} & 10\% & 30\% & 50\% & 70\% & 100\% \tabularnewline%
%
\emph{m-db-s-0} & 488/485 & 1.453/1.448 & 2.415/2.409 & 3.383/3.376 & 4.824/4.816 \tabularnewline%
\emph{m-db-s-1} & 487/486 & 1.449/1.447 & 2.410/2.408 & 3.378/3.374 & 4.818/4.814 \tabularnewline%
\emph{m-db-s-2} & 486/485 & 1.450/1.447 & 2.412/2.408 & 3.379/3.372 & 4.820/4.809 \tabularnewline%
\emph{m-db-s-3} & 488/485 & 1.453/1.445 & 2.414/2.404 & 3.380/3.370 & 4.821/4.809
%
\end{tabular}
\end{table}

\item Content-and-structure (\emph{CAS}) collections consisting of the entire documents.
This set contains one collection, \emph{m-db-cs-1}, that consists of
4.825 labeled documents assigned to 11 different categories. This set is split into
2.415 training and 2.410 testing documents. Table~\ref{tab:corpora_m-db-cs} summarizes the
number of training and testing documents of the collection.
%
\begin{table}[ht]
\centering
\sffamily \small
\rowcolors{1}{tablerowcolorodd}{tablerowcoloreven}
\caption{Content-and-Structure corpora}
\label{tab:corpora_m-db-cs}
\begin{tabular}{Bccccccc}
%
\rowcolor{tableheadcolor} Corpus & \mcc{Train/Test} & mcc{Train/Test} & \mcc{Train/Test} & \mcc{Train/Test} & \mcc{Train/Test} & \mcc{Train/Test} \tabularnewline%
\rowcolor{tableheadcolor} & 10\% & 20\% & 30\% & 50\% & 70\% & 100\% \tabularnewline%
%
\emph{m-db-cs-1} & 247/246 & 488/486 & 730/728 & 1.210/1.208 & 1.696/1.692 & 2.415/2.410 \tabularnewline%
%
\end{tabular}
\end{table}
%
\end{itemize}

To answer the questions formulated earlier, three sets of
experiments are run. The first two deal with the structure-only
setting, while the last one is concerned with the
content-and-structure setting. A comparison of the results against
some available results from other authors is highlighted at the end
of this section.

In all experiments, evaluation relies on the well-known
classification
accuracy~\cite{sebastiani02machine,aas99categorization,yang99evaluation},
that is how often the classifier's decision meets the expert's
assignment. Formally it is defined as:
%
\begin{equation}
\text{Accuracy}=\frac{\text{\# correctly classified testing docs}}{\text{\# testing docs}}\label{eq:class_accuracy}
\end{equation}

\newpage

In addition to the classification accuracy, recall and precision
values~\cite{sebastiani02machine} are computed for comparison.
%
Recall, $R_i$, and precision, $P_i$, for a class $i$ are given as
follows:
%
\begin{eqnarray}
R_i & = & \frac{\text{\# correctly classified testing docs in class }i}{\text{\# testing docs of class }i} \\
P_i & = & \frac{\text{\# correctly classified testing docs in class }i}{\text{\# testing docs assigned to class }i}
\end{eqnarray}
%
The overall recall and precision of a classification can be computed
using two different methods: Micro averaging ($R_{micro}$ and
$P_{micro}$) sums over all individual decisions, thus gives equal
weight to each assignment. Macro averaging ($R_{macro}$ and
$P_{macro}$), instead, sums over all classes, thus gives equal
weight to each class. The formulae are given as follows:
%
\begin{eqnarray}
R_{micro} & = & \frac{\sum_{i=1}^{k} R_i \cdot n_i} {n} \\
P_{micro} & = & \frac{\sum_{i=1}^{k} P_i \cdot n_i} {n} \\
%
R_{macro} & = & \frac{\sum_{i=1}^{k} R_i} {k} \\
P_{macro} & = & \frac{\sum_{i=1}^{k} P_i} {k}
\end{eqnarray}
%
$n_i$ is the number of documents assigned to class $i$, $n$ is the
total number of documents, and $k$ is the number of classes.

 

\subsection{Experiment I - How Does $k$ Affect the Accuracy?}
\label{sec:experiment_k_effect}

As explained in Section~\ref{sec:knn}, the size of the neighborhood
($k$) is a key parameter in the $k$-NN algorithm. Therefore, one
aspect to look at is to check the effect of $k$ on the accuracy. For
this purpose, the values 3, 5, 7, 9, 15, and 21 are experimented.
Note that, only the \emph{SO} collections are used and only a
proportion (10\%) of them is selected randomly and uniformly
distributed over the 11 classes to show the effect of $k$.

Using Algorithm~\ref{alg:tree_edit_dist} to run $k$-NN, and setting
the required parameters to $\alpha=0,5$ (Equation~\ref{eq:rho}), $C_{del}=1,0$
(Equation~\ref{eq:C_del_cum}), $C_{ins}=1,0$
(Equation~\ref{eq:C_del_cum}), and $\beta=2,0$
(Equation~\ref{eq:relabx}, $\beta \ge
C_{ins}+C_{del}$ to avoid permanent node relabeling) respectively,
the results displayed in Table~\ref{tab:results_m-db-s_k} and
Figure~\ref{fig:results_m-db-s_k} are obtained.
%
\begin{table}[ht]
\centering
\sffamily \small
\rowcolors{1}{tablerowcolorodd}{tablerowcoloreven}
\caption{Effect of $k$ on the accuracy (Equation~\ref{eq:class_accuracy})}
\label{tab:results_m-db-s_k}
\begin{tabular}{Bcccccccc}
%
\rowcolor{tableheadcolor} Corpus & \mcc{$k=3$} & \mcc{$k=5$} & \mcc{$k=7$} & \mcc{$k=9$} & \mcc{$k=15$} & \mcc{$k=21$} & \mcc{max. difference} \tabularnewline%
%
m-db-s-0 & 0,922 & 0,920 & 0,918 & 0,903 & 0,892 & 0,889 & 3,3\%\\
m-db-s-1 & 0,903 & 0,892 & 0,891 & 0,897 & 0,885 & 0,866 & 3,7\%\\
m-db-s-2 & 0,874 & 0,868 & 0,858 & 0,849 & 0,802 & 0,790 & 8,4\%\\
m-db-s-3 & 0,862 & 0,868 & 0,860 & 0,852 & 0,825 & 0,814 & 4,8\%\\
%
\end{tabular}
\end{table}
%
\begin{figure}[ht]
\centering
\includegraphics[width=0.7\textwidth]{07_classification/figures/results_m-db-s_k}
\caption{Effect of $k$ on the accuracy}
\label{fig:results_m-db-s_k}
\end{figure}

The outcome of the experiment is a 3-fold conclusion: \textbf{(i)}
as $k$ increases, the accuracy of the algorithm monotonically
decreases independently of the collection used (except for $k=9$ in
\emph{m-db-s-1} and for $k=5$ in \emph{m-db-s-3}), \textbf{(ii)} the
noise introduced in the \emph{m-db-s-1/2/3} has negatively impacted
the accuracy (as the amount of noise in relation to \emph{m-db-s-0}
increases, the accuracy decreases), and \textbf{(iii)} the maximum
drop in the accuracy due to variation of $k$ is quite low, lying between 3,3\% and 8,4\%
only (see last column of Table~\ref{tab:results_m-db-s_k}).
Therefore, the different values of $k$ are retained in the remaining
experiments since these results do not allow to convincingly
consider a particular $k$-value better than the others.

More interesting, the classifier provides very high accuracy
results, but this remains relative to the amount of documents used
in this experiment.

 

 

\subsection{Experiment II - How Does the Training Data Affect the Accuracy?}

$k$-NN uses a set of training samples as a basis to label unseen
documents. Hence, it is clear that the size of the training set is
crucial for the accuracy of the algorithm. To observe the effect of
increasing the size of the training data set, the \emph{m-db-s-0}
collection is split into five ratios (10\%, 30\%, 50\%, 70\%,
100\%). These ratios are randomly and uniformly selected among the
whole training data so that every chunk contains all labels.

 

Using the same setting as described in
Section~\ref{sec:experiment_k_effect}, the results shown in
Table~\ref{tab:results_m-db-s_train} and
Figure~\ref{fig:results_m-db-s_train} are obtained. Unexpectedly,
the size of the training set had no large impact on the accuracy of
the classifier (see the last column of the table). There is an
extremely weak trend that a larger training set improves
performance. The reason might lie in the inter-document similarity,
meaning that the classes are highly homogeneous. Furthermore, the
accuracy remains in the same range of values (regardless of the $k$
value) without noticeable fluctuations when increasing the size of
the training data.
%
\begin{table}[ht]
\centering
\sffamily \small
\rowcolors{1}{tablerowcolorodd}{tablerowcoloreven}
\caption{Effect of the training data on the accuracy}
\label{tab:results_m-db-s_train}
\begin{tabular}{Bcccccccc}
%
\rowcolor{tableheadcolor} Corpus & \mcc{$k=3$} & \mcc{$k=5$} & \mcc{$k=7$} & \mcc{$k=9$} & \mcc{$k=15$} & \mcc{$k=21$} & \mcc{max. difference} \tabularnewline%
%
~10\% & 0,922 & 0,920 & 0,918 & 0,903 & 0,892 & 0,889 & 3,3\% \tabularnewline%
~30\% & 0,928 & 0,925 & 0,934 & 0,932 & 0,930 & 0,930 & 0,9\% \tabularnewline%
~50\% & 0,924 & 0,929 & 0,928 & 0,929 & 0,929 & 0,924 & 0,5\% \tabularnewline%
~70\% & 0,934 & 0,933 & 0,932 & 0,930 & 0,930 & 0,932 & 0,4\% \tabularnewline%
100\% & 0,934 & 0,934 & 0,932 & 0,932 & 0,929 & 0,931 & 0,5\%
%
\end{tabular}
\end{table}
%
\begin{figure}[ht]
\centering
\includegraphics[width=0.7\textwidth]{07_classification/figures/results_m-db-s_train}
\caption{Effect of the training data on the accuracy}
\label{fig:results_m-db-s_train}
\end{figure}

%~\\~\\~\\~\\

 

\newpage
\subsection{Experiment III - How Does the CAS Setting Affect the Accuracy?}

To check the effectiveness of the approach taking both the content
and the structure of XML documents into account, the \emph{CAS}
collection (\emph{m-db-cs-1}) described earlier is used and five
methods are applied. These methods include:
\\

\begin{threeparttable}
\centering
\sffamily \small
\rowcolors{1}{tablerowcolorodd}{tablerowcoloreven}
\begin{tabular}{lp{13cm}}
%
\rowcolor{tableheadcolor} \mcc{Method} & \mcc{Description} \tabularnewline%
%
BM & A Boolean model~\cite{baeza-yates99ir} is applied as in traditional information retrieval, where documents are represented as a bag of words and where structure is neglected. \tabularnewline%
%
TED\_SO\tnote{*} & Tree edit distance based on structure-only matching (see Section~\ref{sec:ted_distance_structure}). Parameters are set to $\alpha=0,5$, $\beta=2,0$, $C_{del}=1,0$, and $C_{ins}=1,0$. \tabularnewline%
%
TED\_CAS & Tree edit distance based on content-and-structure matching (see Section~\ref{sec:ted_distance_content}). Parameters are set to $\alpha=0,5$, $\beta=2,0$, $C_{del}=1,0$, and $C_{ins}=1,0$. \tabularnewline%
%
CM\_match & Content matrix matching where node labels are considered during comparison (see Section~\ref{sec:cm_distance}). \tabularnewline%
%
CM\_any & Content matrix matching where node labels are ignored during comparison (see Section~\ref{sec:cm_distance}). \tabularnewline%
%
TED\_CM & A combination of tree edit distance based on structure-only matching (TED\_SO) and content matrix matching where node labels are ignored (CM\_any).
Although CM\_match outperforms CM\_any, it was not combined with TED\_CM.
This was because both methods, CM\_match and TED\_SO, are structure-oriented, which would have given unfair weight to structural similarity.
The final distance of two document trees is computed as $\alpha \cdot TED\_SO + (1-\alpha)\cdot CM\_any$, where $\alpha$=$0,5$
%
\end{tabular}

\begin{tablenotes}
\footnotesize
\item[*] Note that the TED\_SO method does not use the document contents, but it is included here
for the purpose of comparison.
\end{tablenotes}
\end{threeparttable}
\\

 

These methods are run on 20\% of \emph{m-db-cs-1} to
obtain the results displayed in Table~\ref{tab:results_m-db-cs}
and Figure~\ref{fig:results_m-db-cs}.

\begin{table}[ht]
\centering
\sffamily \small
\rowcolors{1}{tablerowcolorodd}{tablerowcoloreven}
\caption{Effect of CAS on the accuracy}
\label{tab:results_m-db-cs}
\begin{tabular}{Bccccccc}
%
\rowcolor{tableheadcolor} Approach & \mcc{$k=3$} & \mcc{$k=5$} & \mcc{$k=7$} & \mcc{$k=9$} & \mcc{$k=15$} & \mcc{$k=21$} \tabularnewline%
%
BooleanModel & 0,327 & 0,352 & 0,352 & 0,331 & 0,335 & 0,313 \tabularnewline%
TED\_SO & 0,916 & 0,907 & 0,895 & 0,895 & 0,856 & 0,860 \tabularnewline%
TED\_CAS & 0,652 & 0,634 & 0,640 & 0,673 & 0,584 & 0,558 \tabularnewline%
CM\_match & 0,352 & 0,360 & 0,305 & 0,309 & 0,296 & 0,272 \tabularnewline%
CM\_any & 0,130 & 0,163 & 0,175 & 0,160 & 0,134 & 0,140 \tabularnewline%
\textbf{TED\_SO+CM\_any} & 0,909 & 0,907 & 0,897 & 0,893 & 0,862 & 0,860
%
\end{tabular}
\end{table}
%
\begin{figure}[ht]
\centering
\includegraphics[width=0.7\textwidth]{07_classification/figures/results_m-db-cs}
\caption{Effect of CAS on the accuracy}
\label{fig:results_m-db-cs}
\end{figure}

Although \emph{m-db-cs-1} and \emph{m-db-s-0} are different form
each other, the accuracy of TED\_SO on both collections (when using
the structure only) is nearly the same and exceeds $90\%$ accuracy.
More surprising is the fact that the BM, CM\_match, and CM\_any
methods perform worse. Furthermore, comparing the tree-edit methods
TED\_CAS and TED\_SO, it is worth concluding that the content
deteriorates the accuracy. Indeed the difference in the accuracy
values of the two methods are significant: 26\% ($k=3$), 27\%
($k=5$), 25\% ($k=7$), 22\% ($k=9$), 23\% ($k=15$), and 31\%
($k=21$). This is also true when ignoring the structure entirely, as
with the BM method or when using the content matrix described in
Algorithm~\ref{alg:content_matrix_dist} and relying on CM\_match and
CM\_any. The accuracy deterioration in this case is much worse.
%
More consistent with the expectations, a combination of TED\_SO and
CM\_any, which results in TED\_CM, obtains best classification
results regarding the content and the structure of document trees.
It considers the content while assuring high classification accuracy
of the structure-oriented method.

It is clear from these experiments that the XML documents
classification is more influenced by the document's structure than
by its content. However, this could be true only for particular
collections where the content is relatively poor as is the case of
the MovieDB.

 

 

 

\subsection{Comparison}

Because a range of methods is provided, it is relevant to check how
these methods compare to the state-of-art methods that have been
applied on the same collection. To do that, two references
appearing in the INEX 2005 workshop are considered. The first is by
Hagenbuchner et al.~\cite{hagenbuchner05clustering} who applied
contextual self-organizing maps for structured data (CSOM-SD) to
classify XML documents, and the second is by Candillier et al. in
~\cite{candillier05classification} who applied inductive decision
trees (IDT).

To compare these methods against those proposed in this work, the
same evaluation metrics, accuracy, recall, and precision (at the
macro and micro levels)~\cite{sebastiani02machine} as shown in
Table~\ref{tab:results_comparison} have to be used. Note that just
the best performance rates achieved by each method are considered.

\begin{table}[ht]
\centering
\sffamily \small
\rowcolors{1}{tablerowcolorodd}{tablerowcoloreven}
\caption{Classification Comparison for \emph{m-db-s-0}}
\label{tab:results_comparison}
\begin{threeparttable}

\begin{tabular}{Bccccc}
%
\rowcolor{tableheadcolor} \mcc{Approach} & \mcc{Accuracy} & \mcc{Micro Recall} & \mcc{Macro Recall} & \mcc{Micro Precision} & \mcc{Macro Precision} \tabularnewline%
%
CSOM-SD & 0,873 & -\tnote{*} & -\tnote{*} & -\tnote{*} & -\tnote{*} \tabularnewline%
IDT & -\tnote{*} & 0,968 & 0,960 & -\tnote{*} & -\tnote{*} \tabularnewline%
TED\_CM & 0,934 & 0,934 & 0,934 & 0,937 & 0,911
%
\end{tabular}

\begin{tablenotes}
\footnotesize
\item[*] '-': means value not available
\end{tablenotes}

\end{threeparttable}
\end{table}

The results illustrate that TED\_CM largely outperforms CSOM-SD in
terms of accuracy by a rate difference of 6\%. However, when
considering micro and macro-recall, the IDT approach performs better
than TED\_CM. Unfortunately, recall without precision is not much
telling. The precision values achieved by TED\_CM are very
encouraging especially when taking recall values into account.

 

%----------------------------------------------------
%----------------------------------------------------
%----------------------------------------------------
\section{Future Extensions}

The algorithms for computing the edit distance of two document trees
lack an intuitive applicability of standard document editing.
Altering a document is generally performed by applying more complex
transformations than the basic operations mentioned
above~\cite{barnard95treetotree}. Operations like inserting or
deleting whole subtrees (e.g., chapters), moving of structures, and
changing of contents have to be considered. Therefore, altering,
merging, splitting, raising, and lowering of contents or parts of
contents is of importance. Some of these operations can be modeled
using combinations of the basic operations, where the main focus
lies on the content of a node (i.e., merging of nodes contents).

Incorporating such additional operations in the edit distance
algorithm would increase the computational complexity by far,
leading to unacceptable runtime performance. According to
Barnard~\cite{barnard95treetotree} these kinds of transformation
operations should be handled as a post-processing task of the
minimal edit scripts. Here, the inherent order of the edit
operations and proper node matching strategies can be exploited,
which allows a substitution of basic operations by more elaborated
ones:
%
\begin{itemize}
%
\item $deleteTree(T_i,n,p)$: If all descendant nodes of a node are deleted,
the delete operations can be replaced by a single $deleteTree$
operation, where $p$ is the parent of the deleted subtree rooted in
node $n$ in the document tree $T_i$.

\item $insertTree(T_i,n,p,j)$: If all descendant nodes of a node are inserted,
the insert transformations can be replaced by a single $insertTree$
operation, where $p$ is a parent node where node $n$ is inserted as
the $j$-th child in the document tree $T_i$.

\item $moveTree(T_i,n_{old},p_{old},n_{new},p_{new},j)$: Utilizes a combination
of $deleteTree(T_i,n_{old},p_{old})$ and $insertTree(T_i,n_{new},p_{new},j)$.

\item $moveNode(T_i,n,p,j)$: Moves the node $n$ to the $j$-th child of node
$p$ in the document $T-i$. More elaborated move operations may
include $raiseChild$ and $lowerChild$.

\item $moveContent(T_i,n_1,n_2)$: Deletes the content of node $n_2$ and
appends it to the content of node $n_1$ in the document tree $T_i$.

\end{itemize}

% consequence
These additional operations reflect document authoring tasks better
than the basic edit operations for inserting, deleting, and altering
of single nodes. Different changes could be weighted differently,
which allows to focus on operations at a higher level. At the same
time, handling of these operations in a post processing step keeps
up its processing performance.

 

 

 

%----------------------------------------------------
%----------------------------------------------------
%----------------------------------------------------
\section{Summary}\label{sec:classification:summary}

This chapter discusses an XML classification approach based on the
$k$-nearest neighborhood algorithm which relies on edit distance
measures. The originality of the approach lies in the edit distance
similarity which considers both, the content and structure of XML
trees. Initial results indicate that the approach proposed is very
promising on both tasks, structure-only and content-and-structure
matching. One of the main findings is that the structure bears more
weight than the content does. However, a combination of two methods
among the proposed ones allows to tune the weight of both the
content and the structure. Further empirical work using other
document collections is certainly needed. Especially in the context
of structured document retrieval, the benefits of automatic
classification of documents or parts of documents have to be
investigated.

 

 

%