On Tree Pattern Matching by Pushdown Automata

Tree pattern matching is one of the fundamental problems with many applications, and is often declared to be analogous to the problem of string pattern matching [3, 5, 17]. String pattern matching is the problem of finding all occurrences of string patterns and their positions in a given text. A model of computation for string pattern matching is the finite automaton [8]. One of the basic approaches to string pattern matching constructs finite automata for the string patterns, which means that the patterns are preprocessed. Given a text of size n, such finite automata typically perform the search phase in time linear in n (see [8, 9, 19] for a survey). Tree pattern matching is the problem of finding all occurrences of tree patterns, and their positions, in a subject tree. Although many tree pattern matching methods have been described [4, 5, 6, 10, 11, 13, 14, 16, 17, 18, 20, 21], most of them fail to provide a search phase that is linear in the size of the subject tree, or have huge memory requirements. This paper presents the first attempt to perform tree pattern matching by a unified and systematic approach using pushdown automata. We present the first, basic, non-deterministic model of a pushdown automaton performing tree pattern matching. The goal of this research is to provide a method for determinising the non-deterministic model of the proposed pushdown automaton, which will make pattern matching in time linear in the size of the subject tree possible.


Introduction
Tree pattern matching is one of the fundamental problems with many applications, and is often declared to be analogous to the problem of string pattern matching [3, 5, 17]. String pattern matching is the problem of finding all occurrences of string patterns and their positions in a given text. A model of computation for string pattern matching is the finite automaton [8]. One of the basic approaches to string pattern matching constructs finite automata for the string patterns, which means that the patterns are preprocessed. Given a text of size n, such finite automata typically perform the search phase in time linear in n (see [8, 9, 19] for a survey).
Tree pattern matching is the problem of finding all occurrences of tree patterns, and their positions, in a subject tree. Although many tree pattern matching methods have been described [4, 5, 6, 10, 11, 13, 14, 16, 17, 18, 20, 21], most of them fail to provide a search phase that is linear in the size of the subject tree, or have huge memory requirements.
This paper presents the first attempt to perform tree pattern matching by a unified and systematic approach using pushdown automata. We present the first, basic, non-deterministic model of a pushdown automaton performing tree pattern matching. The goal of this research is to provide a method for determinising the non-deterministic model of the proposed pushdown automaton, which will make pattern matching in time linear in the size of the subject tree possible.
We denote the set of natural numbers by $\mathbb{N}$. A ranked alphabet is a finite, non-empty set of symbols, each of which has a unique, non-negative arity (or rank). Given a ranked alphabet $A$, the arity of a symbol $a \in A$ is denoted $Arity(a)$. The set of symbols of arity $p$ is denoted by $A_p$. Elements of arity $0, 1, 2, \ldots, p$ are respectively called nullary (constants), unary, binary, …, $p$-ary symbols. We assume that $A$ contains at least one constant. In the examples we use numbers at the end of identifiers for a short declaration of symbols with arity. For instance, a2 is a short declaration of a binary symbol a.
Based on the concepts of graph theory (see [1]), a labeled, ordered, ranked tree over a ranked alphabet $A$ can be defined as follows: An ordered directed graph $G$ is a pair $(N, R)$, where $N$ is a set of nodes and $R$ is a set of linearly ordered lists of edges, such that each element of $R$ is of the form $((f, g_1), (f, g_2), \ldots, (f, g_n))$, where $f, g_1, g_2, \ldots, g_n \in N$, $n \ge 0$. This element would indicate that, for node $f$, there are $n$ edges leaving $f$, the first entering node $g_1$, the second entering node $g_2$, and so forth.
A sequence of nodes $(f_0, f_1, \ldots, f_n)$, $n \ge 1$, is a path of length $n$ from node $f_0$ to node $f_n$ if there is an edge which leaves node $f_{i-1}$ and enters node $f_i$ for $1 \le i \le n$. A cycle is a path $(f_0, f_1, \ldots, f_n)$, where $f_0 = f_n$. An ordered Directed Acyclic Graph (DAG) is an ordered directed graph that has no cycle.

Labeling of an ordered graph
In the examples we use $a_f$ for a short declaration of node $f$ labeled by symbol $a$.
A labeled, ordered, ranked tree $t$ over a ranked alphabet $A$ is an ordered DAG $t = (N, R)$ with a special node $r \in N$ called the root, such that (1) $r$ has in-degree 0, (2) all other nodes of $t$ have in-degree 1, (3) there is just one path from the root $r$ to every $f \in N$, where $f \ne r$ […]
We note that in many papers on the theory of tree languages, such as [5, 7, 15, 17], labeled ordered ranked trees are defined with the use of ordered ranked ground terms. Ground terms can be regarded as labeled, ordered, ranked trees in prefix notation. Therefore, the notions ground term, tree and tree in prefix notation are used interchangeably in these papers.
The prefix notation of tree $t_1$ is $pref(t_1) = a2\ a2\ a0\ a1\ a0\ a1\ a0$. Trees can also be represented graphically and tree $t_1$ is illustrated in Fig. 1. The height of a tree $t$, denoted by $Height(t)$, is defined as the maximal length of a path from the root of $t$ to a leaf of $t$.
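The relationship between a tree, its prefix notation and its height can be sketched in a few lines of Python. This is only an illustrative sketch: the names `Node`, `arity`, `pref`, `height` and `parse_prefix` are hypothetical helpers, and symbols are assumed to follow the convention of this paper, with a single trailing digit giving the arity.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """Node of a labeled, ordered, ranked tree; label 'a2' means symbol a of arity 2."""
    label: str
    children: List["Node"] = field(default_factory=list)


def arity(label: str) -> int:
    # Assumption: the last character of the identifier is the arity digit.
    return int(label[-1])


def pref(t: Node) -> List[str]:
    """Prefix notation: list the root, then the prefix notation of each child in order."""
    out = [t.label]
    for child in t.children:
        out.extend(pref(child))
    return out


def height(t: Node) -> int:
    """Maximal length (number of edges) of a path from the root to a leaf."""
    return 0 if not t.children else 1 + max(height(c) for c in t.children)


def parse_prefix(symbols: List[str]) -> Node:
    """Rebuild the tree from its prefix notation; arities make the parse unique."""
    it = iter(symbols)

    def build() -> Node:
        label = next(it)
        return Node(label, [build() for _ in range(arity(label))])

    root = build()
    if next(it, None) is not None:
        raise ValueError("trailing symbols after a complete tree")
    return root


# Tree t1 from the text, given by its prefix notation.
t1 = parse_prefix("a2 a2 a0 a1 a0 a1 a0".split())
print(pref(t1), height(t1))
```

Because every symbol carries its arity, the prefix notation determines the tree uniquely, which is why the two representations can be used interchangeably.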
To define a tree pattern, we use a special nullary symbol $S$, where $S \notin A$, which serves as a placeholder for any subtree. A tree pattern is defined as a labeled, ordered, ranked tree over the ranked alphabet $A \cup \{S\}$. By analogy, a tree pattern in prefix notation is defined as a labeled, ordered, ranked tree over the ranked alphabet $A \cup \{S\}$ in prefix notation.
A pattern $p$ with $k \ge 0$ occurrences of the symbol $S$ matches a tree $t$ at node $n$ if there exist subtrees $t_1, t_2, \ldots, t_k$ (not necessarily the same) of the tree $t$, such that the tree $p'$, obtained from $p$ by substituting the subtree $t_i$ for the $i$-th occurrence of $S$ in $p$, is equal to the subtree of $t$ rooted at $n$. The prefix notation of tree pattern $p_1$ is […]. The tree pattern $p_1$ is illustrated in Fig. 2 and has two occurrences in tree $t_1$, matching at nodes $a2_1$ and […]
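The matching condition can be checked directly on prefix notations with a naive matcher, sketched below. It is not the pushdown-automaton method of this paper, only a quadratic-time baseline that makes the definition concrete. The function names are hypothetical, and the pattern `a2 S a1 S` is an assumed example chosen for illustration, not necessarily the pattern $p_1$ of Fig. 2.

```python
from typing import List


def arity(label: str) -> int:
    # Assumption: the last character of a symbol is its arity digit (a2 -> 2).
    return int(label[-1])


def subtree_end(text: List[str], i: int) -> int:
    """Index just past the subtree whose root is text[i] (text is in prefix notation)."""
    pending = 1  # number of subtrees still to be read
    while pending:
        pending += arity(text[i]) - 1
        i += 1
    return i


def matches_at(pattern: List[str], text: List[str], i: int) -> bool:
    """True if the pattern ('S' = placeholder for any subtree) matches at text[i]."""
    for p in pattern:
        if i >= len(text):
            return False
        if p == "S":
            i = subtree_end(text, i)   # S absorbs one whole subtree
        elif p == text[i]:
            i += 1                     # equal labels imply equal arities
        else:
            return False
    return True


def occurrences(pattern: List[str], text: List[str]) -> List[int]:
    """1-based positions (in prefix notation) of the nodes at which the pattern matches."""
    return [i + 1 for i in range(len(text)) if matches_at(pattern, text, i)]


t1 = "a2 a2 a0 a1 a0 a1 a0".split()
print(occurrences("a2 S a1 S".split(), t1))  # [1, 2]
```

Each candidate node is tested separately, so the running time grows with the product of the sizes of the subject tree and the pattern; avoiding this repeated work is the motivation for an automaton-based search phase.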

Alphabet, language, context-free grammar, pushdown automaton
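Let an alphabet be a finite, nonempty set of symbols. A language over an alphabet $A$ is a set of strings over $A$.
[…] and $\alpha, \beta, \gamma \in (N \cup T)^*$. The symbols $\Rightarrow^+$ and $\Rightarrow^*$ are used for the transitive, and the transitive and reflexive closure of $\Rightarrow$, respectively. […], where $x \in T^*$. The relation $A$ […]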

Hidden-left recursion is a
The language generated by a CFG $G$, denoted by $L(G)$, is the set of strings $L(G) = \{w : S \Rightarrow^* w,\ w \in T^*\}$. A derivation tree is a labeled, ordered tree representing a syntactic structure of a string $w$, generated by the grammar $G$.

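An (extended) non-deterministic pushdown automaton (non-deterministic PDA) is a seven-tuple $M = (Q, A, \Gamma, \delta, q_0, Z_0, F)$, where $Q$ is a finite set of states, $A$ is an input alphabet, $\Gamma$ is a pushdown store alphabet, $\delta$ is a mapping from $Q \times (A \cup \{\varepsilon\}) \times \Gamma^*$ into a set of finite subsets of $Q \times \Gamma^*$, $q_0 \in Q$ is an initial state, $Z_0 \in \Gamma$ is the initial content of the pushdown store, and $F \subseteq Q$ is the set of final (accepting) states. The triplet $(q, w, x) \in Q \times A^* \times \Gamma^*$ denotes the configuration of a pushdown automaton. In this paper, the top of the pushdown store $x$ is always on the left-hand side. The initial configuration of a pushdown automaton is a triplet $(q_0, w, Z_0)$ for the input string $w \in A^*$. The relation $(q, aw, \alpha\beta) \vdash_M (p, w, \gamma\beta)$ is a transition of a pushdown automaton $M$ if $(p, \gamma) \in \delta(q, a, \alpha)$. A pushdown automaton $M$ is deterministic (deterministic PDA) if the following conditions hold:
(1) $|\delta(q, a, \gamma)| \le 1$ for all $q \in Q$, $a \in A \cup \{\varepsilon\}$, $\gamma \in \Gamma^*$.
(2) If $\delta(q, a, \alpha) \ne \emptyset$, $\delta(q, a, \beta) \ne \emptyset$ and $\alpha \ne \beta$, then $\alpha$ is not a prefix of $\beta$ and $\beta$ is not a prefix of $\alpha$.
(3) If $\delta(q, a, \alpha) \ne \emptyset$, $\delta(q, \varepsilon, \beta) \ne \emptyset$, then $\alpha$ is not a prefix of $\beta$ and $\beta$ is not a prefix of $\alpha$.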

The class of languages accepted by non-deterministic PDAs is exactly the class of context-free languages. Languages accepted by deterministic PDAs are called deterministic context-free languages. There exist context-free languages which are not deterministic, that is, for which no deterministic PDA can be constructed.
A pushdown automaton is input-driven if each of its pushdown operations is determined only by the input symbol.

LR(0) parsing
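Given a string $w$, an LR(0) parser for a CFG $G = (N, T, P, S)$ reads the string $w$ from left to right without any backtracking and is implemented by a deterministic PDA. A string $\gamma$ is a viable prefix of $G$ if $\gamma$ is a prefix of $\alpha\beta$, where $S \Rightarrow_{rm}^{*} \alpha A x \Rightarrow_{rm} \alpha\beta x$ is a rightmost derivation in $G$; the string $\beta$ is called the handle. We use the term complete viable prefix to refer to $\alpha\beta$ in its entirety.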
During parsing, each content of the pushdown store corresponds to a viable prefix.
The standard LR(0) parser performs two kinds of transitions: (a) When the contents of the pushdown store correspond to a viable prefix containing an incomplete handle, the parser performs a shift, which reads one symbol $a$ and pushes a symbol corresponding to $a$ onto the pushdown store. (b) When the contents of the pushdown store correspond to a viable prefix ending with the handle $\beta$, the parser performs a reduction by rule $A \to \beta$. The reduction pops $|\beta|$ symbols from the top of the pushdown store and pushes a symbol corresponding to $A$ onto the pushdown store.
A CFG $G$ is LR(0) if the two conditions for $G$: (1) $S \Rightarrow_{rm}^{*} \alpha A w \Rightarrow_{rm} \alpha\beta w$, (2) $S \Rightarrow_{rm}^{*} \gamma B x \Rightarrow_{rm} \alpha\beta y$ imply that $\alpha A y = \gamma B x$, that is $\alpha = \gamma$, $A = B$, and $x = y$.
If the CFG G is not an LR(0) grammar, then the PDA constructed as an LR(0) parser contains conflicts, which means the next transition to be performed cannot be determined according to the contents of the pushdown store only.
For CFGs without hidden-left and right recursions, the number of consecutive reductions between the shifts of two adjacent symbols cannot be greater than a constant, and therefore the LR(0) parser for such a grammar can be optimized by precomputing all its reductions. The resulting optimized LR(0) parser then reads one symbol on each of its transitions [2].
A language $L$ accepted by a pushdown automaton $M$ is defined in two distinct ways:
(1) Accepting by final state: $L(M) = \{x : (q_0, x, Z_0) \vdash_M^{*} (q, \varepsilon, \gamma) \land x \in A^* \land \gamma \in \Gamma^* \land q \in F\}$.
(2) Accepting by empty pushdown store: $L_\varepsilon(M) = \{x : (q_0, x, Z_0) \vdash_M^{*} (q, \varepsilon, \varepsilon) \land x \in A^* \land q \in Q\}$.

A Deterministic Pushdown Automaton accepting trees in prefix notation
The prefix notation of a tree can be generated by a grammar $G = (N, T, P, S)$, having rules $P$ of the following form: […] Since the grammar is LR(0), and thus belongs to the subclass of context-free grammars known as deterministic context-free grammars, the generated language belongs to the class of deterministic context-free languages and can be recognised by a deterministic pushdown automaton.
In this section we present the deterministic pushdown automaton accepting trees in prefix notation by empty pushdown store. The transitions of the automaton are of the form $\delta(q_0, x, S) = (q_0, S^{Arity(x)})$, for each $x \in A$. Keep in mind that $S^0 = \varepsilon$. The automaton, which is input-driven, is depicted in Fig. 3. This particular automaton will be the basic building block for our non-deterministic automaton, which will serve as a pattern matcher.
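The behaviour of this one-state automaton can be sketched as a short simulation. The sketch assumes transitions of the form $\delta(q_0, x, S) = (q_0, S^{Arity(x)})$ and acceptance by empty pushdown store; the function name `accepts_prefix_notation` and the single-digit arity convention are illustrative assumptions.

```python
from typing import List


def arity(symbol: str) -> int:
    # Assumption: the last character of a symbol is its arity digit (a2 -> 2).
    return int(symbol[-1])


def accepts_prefix_notation(symbols: List[str]) -> bool:
    """Simulate delta(q0, x, S) = (q0, S^Arity(x)); accept by empty pushdown store."""
    store = ["S"]                      # initial pushdown store content
    for x in symbols:
        if not store:
            return False               # store already empty: no transition is defined
        store.pop()                    # remove the symbol S from the top
        store.extend(["S"] * arity(x)) # push one S for every expected child subtree
    return not store                   # accepted iff the store is empty at end of input


print(accepts_prefix_notation("a2 a2 a0 a1 a0 a1 a0".split()))  # True
print(accepts_prefix_notation("a2 a0".split()))                 # False
```

Since the pushdown store only ever holds copies of the single symbol $S$, it could equally be represented by a counter; the list is kept here to stay close to the automaton.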

Theorem 1.
The automaton presented in Section 3 accepts valid trees in prefix notation by empty pushdown store.
Proof. There are three possible types of input to be given to the automaton: (1) A valid input, which represents the prefix notation of a tree. (2) An invalid input, which has a proper prefix that is the prefix notation of a tree. (3) An invalid input which is a proper prefix of the prefix notation of some (unknown) tree.
To show that the first type of (valid) input is accepted by our automaton by empty pushdown store, we use strong induction.
Let $P(n)$ be a predicate defined over all non-negative integers $n$. Predicate $P(i)$ is true if trees of height $i$ are accepted by the presented deterministic PDA. We define the base case and the inductive step in the following manner:
(1) Base case: $P(0)$ is true.
(2) Inductive case: if $P(i)$ is true for all $0 \le i \le n-1$, then $P(n)$ is true.
Since the initial pushdown store symbol is the symbol $S$, statement (1) is true, as the only trees of height 0 are trees with only one node $x$ having arity 0. The transition to be taken is $\delta(q_0, x, S) = (q_0, \varepsilon)$, removing the initial symbol $S$ from the pushdown store.
Each tree of height $n$ can be represented as a root $x$ of arity $k$, each of whose children is the root of a subtree of height at most $n-1$. According to the inductive assumption (2), each of the subtrees is accepted by the automaton by empty pushdown store, that is, reading it removes one symbol $S$ from the pushdown store. Reading the root replaces the initial symbol $S$ by $S^k$, which adds $k-1$ symbols $S$ to the pushdown store; the $k$ subtrees then remove $k$ symbols $S$ from the pushdown store, leaving it empty. As a result, we have proven that our automaton accepts valid input (prefix notations of trees) by empty pushdown store.
In the second case (invalid input, in which there exists such a prefix that represents a valid prefix notation of a tree), the pushdown store will be emptied at the moment the prefix which represents a valid prefix notation is read.
In the third case, which is apparent from the first case, the pushdown store will not be emptied.
We have proved that the automaton accepts only valid input (prefix notations of trees) by empty pushdown store. □
Corollary 1. Processing an arbitrary tree with the automaton introduced in Section 3 results in one symbol being removed from the top of the pushdown store.
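As a worked illustration of the proof and of Corollary 1, the sequence of configurations for the input $pref(t_1) = a2\ a2\ a0\ a1\ a0\ a1\ a0$ can be written out as follows, assuming transitions of the form $\delta(q_0, x, S) = (q_0, S^{Arity(x)})$ and the top of the pushdown store on the left:

```latex
\begin{align*}
(q_0,\ a2\,a2\,a0\,a1\,a0\,a1\,a0,\ S)
  &\vdash (q_0,\ a2\,a0\,a1\,a0\,a1\,a0,\ SS) \\
  &\vdash (q_0,\ a0\,a1\,a0\,a1\,a0,\ SSS) \\
  &\vdash (q_0,\ a1\,a0\,a1\,a0,\ SS) \\
  &\vdash (q_0,\ a0\,a1\,a0,\ SS) \\
  &\vdash (q_0,\ a1\,a0,\ S) \\
  &\vdash (q_0,\ a0,\ S) \\
  &\vdash (q_0,\ \varepsilon,\ \varepsilon)
\end{align*}
```

The subtree $a2\ a0\ a1\ a0$ is read between the second and the fifth configuration, during which the store goes from $SS$ to $S$: exactly one symbol is removed, as Corollary 1 states.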

Searching Non-Deterministic Pushdown Automaton
In this section we present the Searching Non-Deterministic Pushdown Automaton (SNPDA), performing tree pattern matching.
The structure of an SNPDA accepting all occurrences of the tree pattern for a given tree in prefix notation is described by Algorithm 1. We note that the SNPDA accepts the matched patterns by final state.
The SNPDA is loosely based on the searching non-deterministic finite automaton, which is used for pattern matching in strings, as described in [19]. It is constructed by extending the deterministic pushdown automaton presented in Section 3 with states and transitions corresponding to the given tree pattern.
We start by constructing the deterministic pushdown automaton $M = (\{q_0\}, A, \Gamma, \delta, q_0, S, F)$, where $\delta$ is defined as $\delta(q_0, x, S) = \{(q_0, S^{Arity(x)})\}$ for each $x \in A$, $F = \emptyset$ and $\Gamma = \{S\}$. The prefix notation of the pattern is read from left to right; for each node $x$ (except the special nullary symbol $S$) at position $i$ (the position of the first node is 1), the following steps are carried out:
1. Create a new state $q_i$.
2. In case $i \ne 1$ do step 3, otherwise do step 4.
3. […]
4. $\delta(q_0, x, S) = \delta(q_0, x, S) \cup \{(q_1, S^{Arity(x)})\}$.
In case the nullary symbol $S$ is found at position $i$, the following steps are carried out:
1. Create a new state $q_i$.
2. Define a symbol $\#_j$, where $\#_j \notin \Gamma$.
[…]
The last created state (state $q_n$) is set as final (that is, $F = \{q_n\}$). Examples of PDAs constructed by Algorithm 1 for various patterns are shown in Fig. 4-6.
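The construction can be sketched in Python together with a straightforward simulation that tracks all reachable configurations. This is a hedged sketch, not the authors' implementation: the transition $\delta(q_{i-1}, x, S) = \{(q_i, S^{Arity(x)})\}$ for pattern symbols at positions $i > 1$ and the two $\varepsilon$-transitions created for each placeholder $S$ are assumptions inferred from the proof sketch of Theorem 2; the names `build_snpda`, `closure` and `find_occurrences`, as well as the pattern `a2 S a1 S`, are hypothetical. The sketch only handles patterns whose first symbol is not $S$.

```python
from typing import Dict, List, Set, Tuple

EPS = ""  # stands for an epsilon (no input) transition
Config = Tuple[int, Tuple[str, ...]]  # (state index, pushdown store with top at index 0)


def arity(x: str) -> int:
    return int(x[-1])  # assumption: trailing digit is the arity (a2 -> 2)


def build_snpda(pattern: List[str], alphabet: List[str]):
    """Return (delta, index of the final state) for a pattern in prefix notation."""
    assert pattern and pattern[0] != "S", "sketch covers patterns starting with a symbol of A"
    delta: Dict[Tuple[int, str, str], Set[Config]] = {}

    def add(q, a, top, p, push):
        delta.setdefault((q, a, top), set()).add((p, tuple(push)))

    for x in alphabet:                          # base automaton: loop in q0
        add(0, x, "S", 0, ["S"] * arity(x))
    j = 0
    for i, x in enumerate(pattern, start=1):
        if x != "S":                            # i == 1 is step 4; i > 1 is assumed
            add(i - 1, x, "S", i, ["S"] * arity(x))
        else:
            j += 1
            marker = f"#{j}"
            add(i - 1, EPS, "S", 0, ["S", marker])  # assumed: start a simulated store
            add(0, EPS, marker, i, [])              # assumed: subtree read, resume at q_i
    return delta, len(pattern)


def closure(configs: Set[Config], delta) -> Set[Config]:
    """Add every configuration reachable by epsilon-transitions."""
    seen, todo = set(configs), list(configs)
    while todo:
        q, store = todo.pop()
        if not store:
            continue
        for p, push in delta.get((q, EPS, store[0]), ()):
            c = (p, push + store[1:])
            if c not in seen:
                seen.add(c)
                todo.append(c)
    return seen


def find_occurrences(pattern: List[str], text: List[str], alphabet: List[str]) -> List[int]:
    """1-based positions in the subject tree at which an occurrence of the pattern ends."""
    delta, final = build_snpda(pattern, alphabet)
    configs = closure({(0, ("S",))}, delta)
    ends = []
    for k, x in enumerate(text, start=1):
        step = set()
        for q, store in configs:
            if store:
                for p, push in delta.get((q, x, store[0]), ()):
                    step.add((p, push + store[1:]))
        configs = closure(step, delta)
        if any(q == final for q, _ in configs):
            ends.append(k)
    return ends


t1 = "a2 a2 a0 a1 a0 a1 a0".split()
print(find_occurrences("a2 S a1 S".split(), t1, ["a0", "a1", "a2"]))  # [5, 7]
```

The simulation keeps a set of configurations, so its cost per input symbol is not bounded by a constant; this is the overhead that determinisation of the automaton is meant to remove.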

Theorem 2.
The SNPDA constructed by Algorithm 1 finds all occurrences of the tree pattern in a subject tree by final state.
Proof. We provide a sketch of the proof. A tree pattern, for which the SNPDA $M = (Q, A, \Gamma, \delta, q_0, S, F)$ is constructed, has either the form $p = x_1 x_2 \ldots x_n$ (form 1), where $x_1 \in A$ and $x_i \in A \cup \{S\}$ for $i > 1$, or the form $p = S$ (form 2). The automaton is non-deterministic at state $q_0$, due to the transitions $\delta(q_0, x_1, S) = \{(q_0, S^{Arity(x_1)}), (q_1, S^{Arity(x_1)})\}$ in the case of form 1. Because of this property, the SNPDA can follow more than one path at each input symbol. It can either cycle through state $q_0$ or move to state $q_1$ and on to $q_n$, in case the input symbols match the pattern.
At the point of a nullary symbol $S$ in the pattern, an $\varepsilon$-transition leading to state $q_0$ is taken, replacing the top of the pushdown store (which is a symbol $S$) with $S\#_j$, where $j$ is distinct for each $S$ in the tree pattern. Using this method, we simulate a new pushdown store on the top of the current pushdown store. Symbol $\#_j$ denotes the end of the new, simulated pushdown store. From Corollary 1, we know that reading a tree by cycling through state $q_0$ removes one symbol $S$ from the pushdown store. As a result, the top of the pushdown store will be $\#_j$, which indicates that a tree (required by the respective symbol $S$ in the tree pattern) has been processed. The $\#_j$-transition can now be taken to resume pattern matching at the point after the respective symbol $S$ in the pattern. While reading a tree by cycling at state $q_0$, a new occurrence of the pattern can be detected, since the automaton is non-deterministic. □
Note that the SNPDA in Fig. 6 is input-driven and thus it can be determinised in the same way as finite automata. The deterministic version is illustrated in Fig. 7.

Conclusion and future work
In this paper we have presented an innovative method of tree pattern matching by pushdown automata. We have introduced a non-deterministic model of the searching pushdown automaton, which correctly accepts all occurrences of a pattern in a given tree presented in its prefix notation.
Our goal is to perform determinisation of this automaton, which will lead to searching for patterns in time linear in the size of the subject tree, as in the case of string pattern matching by deterministic finite automata. Work on the determinisation of this automaton has already begun and the first results were presented at the London Stringology Days conference held at King's College, London [12].
The symbol $A^*$ denotes the set of all strings over $A$, including the empty string, which is denoted by $\varepsilon$. The set $A^+$ is defined as $A^+ = A^* \setminus \{\varepsilon\}$. Similarly, for a string $x \in A^*$, the symbol $x^m$, $m \ge 0$, denotes the $m$-fold concatenation of $x$ with $x^0 = \varepsilon$. The set $x^*$ is defined as $x^* = \{x^m : m \ge 0\}$ […]
A context-free grammar (CFG) is a 4-tuple $G = (N, T, P, S)$, where $N$ and $T$ are finite, disjoint sets of nonterminal and terminal symbols, respectively. $P$ is a finite set of rules $A \to \alpha$, where $A \in N$, $\alpha \in (N \cup T)^*$. $S \in N$ is the start symbol. A CFG $G = (N, T, P, S)$ is said to be in Reversed Greibach Normal Form if each rule from $P$ is of the form $A \to \alpha a$, where $a \in T$ and $\alpha \in N^*$.

Fig. 1: Tree $t_1$ from Example 1 and its prefix notation
[…] of a pushdown automaton $M$, if $(p, \gamma) \in \delta(q, a, \alpha)$. The $k$-th power, transitive closure, and transitive and reflexive closure of the relation $\vdash_M$ are denoted $\vdash_M^k$, $\vdash_M^+$, $\vdash_M^*$, respectively. A pushdown automaton $M$ is deterministic (deterministic PDA), if it holds: […]
[…] trees in prefix notation by empty pushdown store. The transitions of the automaton are of the form […]

[…] in the case of form 1, or due to the transition $\delta$ conflicting with all other transitions in the case of form 2.

Fig. 4: Non-deterministic searching pushdown automaton for tree pattern $p = a2\ a1\ S\ a0$
[…] Let this application of Step be to node $a_f$. If $a_f$ is a leaf, list $a_f$ and halt. If $a_f$ is not a leaf, let its direct descendants be […]