A Bypass-Ring Scheme for a Fault Tolerant Multicast

We present a fault tolerant scheme for recovery from single or multiple node failures in multi-directional multicast trees. The scheme is based on cyclic structures providing alternative paths to eliminate faulty nodes and reroute the traffic. Our scheme is independent of message source and direction in the tree, provides a basis for on-the-fly repair and can be used as a platform for various strategies for reconnecting tree partitions. It only requires an underlying infrastructure to provide a reliable routing service. Although it is described in the context of a message multicast, the scheme can be used universally in all systems using tree-based overlay networks for communication among components.


Introduction
Recently, many distributed applications, particularly peer-to-peer internet-based systems, have become increasingly popular. As these systems need to transport data to a set of participating nodes, an efficient and reliable multicast is critical for their success.
Since multicast message routing is usually realized by overlay structures connecting members of often sparsely distributed multicast groups, it is necessary to deal with failures not only at the level of routers of the underlying infrastructure but also at the overlay level to keep the routing structure in an operational state.
The most efficient topology for message propagation in a multicast group is a tree, which is scalable and can be easily reconfigured according to the current network state to keep message dissemination cost minimal. However, tree topologies may have a reliability problem, since without any enhancement a single point of failure causes tree partitioning and thus prevents the messages being delivered to all receivers.
Generally, there are two approaches for tree recovery. The optimistic approach does not care about partitioning before it occurs, and only when partitioning is detected does the system attempt to rejoin the partitions into a connected graph. This on-demand fault tolerance usually forces affected members to abandon the existing connections and rejoin the tree ([1], [8]). However, the long recovery latency could be undesirable for many applications.
The alternative is the pessimistic approach where backup routes are set beforehand, and they are activated when a failure is detected [5], [6].
We present a scheme for failure repair in tree-topology structures based on virtual cyclic backup paths used to bypass faulty nodes and repair the tree without traffic interruption. Since our protocol ensures that all partitions are reconnected and that no cycle is formed during recovery, no matter how and where faults are detected, it can be used as a platform for various strategies of partition reconnection that influence the resulting tree topology. Thus, the tree optimization preferences of other application levels can be involved in the repair process. The scheme can be used generally in applications using overlay tree-topology networks for message propagation. In this paper, we aim at message multicast, since multicasting is one of the applications where trees can be easily and efficiently used, and fault tolerance is the critical issue at the same time. Other applications may include unicasting, routing, object location, replica management, etc.
The rest of the paper is structured as follows. In section 2 the related work is summarized. Section 3 presents the model and the notations used further in the text. In section 4 we introduce the fault tolerant scheme, describe the principles of the scheme, single and multiple failure repair mechanisms and deal with practical issues. Section 5 briefly describes three different repair methods for tree partition reconnection. Section 6 discusses the properties of our protocol and section 7 contains the conclusion and sets some future directions.

Related work
Several fault tolerant schemes for multicast trees have been reported based on preplanned failure restoration. Some of them ([5], [12], [11]) use schemes similar to path restoration or link restoration, originally proposed for unicast fault tolerance in self-healing ATM networks ([7], [9]), where pre-computed backup virtual paths either protect an entire individual end-to-end virtual path or are used to reroute the traffic originally carried by a failed link. In the Dual-Tree Scheme [5], a secondary tree providing alternative delivery paths is built and it is activated when node or link failure is detected in the primary tree. Unfortunately, this scheme as well as the schemes proposed in [7], [12], [11] assume a single failure model in which there is only a single link or node that fails at a time.
An approach dealing with multiple failures is the Efficient Fault Tolerant Multicast Routing Protocol [6], where bypass paths connect nodes and their grandparents in the tree and are used to reconnect partitions when the parent node of these nodes fails. This solution is not efficient in multi-source networks, where the parent-child relation often alters.
A different approach to ensure multicast fault tolerance is used in the Bayeux architecture [14]. The messages are routed in a tree determined by a common prefix of addresses with small salt values, enabling delivery even in the case that a particular intermediate address is not available. The Bayeux architecture requires Tapestry [13] as its underlying infrastructure.
The underlying network that we consider in this paper is a network providing a routing service (e.g., IP network, Tapestry [13], Pastry [10], or just a set of virtual inter-process connections with routing capability) modeled as a graph $SN = (V, E)$, where $V$ is a finite set of vertices representing nodes and $E$ is a finite set of edges representing links between nodes in the network. In order to connect members of a given multicast group $MG$ and allow for efficient message traffic, an overlay multicast tree connecting all MG-members is built and it is used as a source-independent structure for message propagation to the group members. The multicast tree is modeled as a graph $MT = (MG, CE)$, where $CE$ is a set of core tree edges, i.e., virtual links (built on top of $SN$) connecting nodes in $MG$. We expect that the multicast group (and thus also the multicast tree) dynamically adapts to the current network state and user requirements.
We assume that nodes in MT may fail and that their faulty state can be detected by neighboring nodes. Note that we do not have to deal with link failures, since the multicast tree is an overlay structure. Thus, message delivery across a virtual link connecting two MT-neighbors $n_i$ and $n_j$ depends on routing in the underlying network fabric, and we expect it to deliver the message with the best effort if there is any path from node $n_i$ to node $n_j$ in $SN$. Further in the text, we consider each node $n \in V$ to be assigned with an SN-unique identifier $ID_n$.

Fault tolerant scheme
Message dissemination using a multicast tree is vulnerable to even a single node failure. To prevent network partitioning, we propose a protocol based on deploying bypass rings of radius $r$ that are used to bypass either a single faulty node or a cluster of faulty nodes, to reconnect the tree avoiding cycles and thus to enable the communication between remaining group members to continue.

Basic definitions
Besides the SN-unique ID, each member of multicast tree MT may be assigned with an MT-specific hierarchical identifier HID.
Identifier $HID_n^c$ of node $n$ related to node $c$ in the MT tree is a concatenation of IDs of nodes on the only path in MT from node $c$ to $n$. Functions $pref(i, HID_n^c)$ and $suff(i, HID_n^c)$ denote a prefix and suffix of length $i$ of identifier $HID_n^c$, and $gcp(HID_{n_i}^c, HID_{n_j}^c)$ denotes the greatest common prefix of the hierarchical identifiers of nodes $n_i$ and $n_j$. For example, in Fig. 1, $HID_{2F}^{BA}$ is BA.18.2F, $pref(1, HID_{2F}^{BA})$ is BA, and the $gcp$ of $HID_{2F}^{BA}$ and the identifier of a second node whose path from BA also passes through 18 is BA.18.
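The identifier functions can be illustrated with a minimal Python sketch. It assumes that a hierarchical identifier is stored as a dot-separated string, as in the BA.18.2F example; the function bodies, the helper `_parts` and the node ID 3C used in the usage note are illustrative assumptions, not part of the original scheme.

```python
from typing import List


def _parts(hid: str) -> List[str]:
    """Split a dot-separated hierarchical identifier into node IDs (assumed encoding)."""
    return hid.split(".") if hid else []


def pref(i: int, hid: str) -> str:
    """Prefix of length i: the first i node IDs on the path from c to n."""
    return ".".join(_parts(hid)[:i])


def suff(i: int, hid: str) -> str:
    """Suffix of length i: the last i node IDs on the path from c to n."""
    parts = _parts(hid)
    return ".".join(parts[max(0, len(parts) - i):])


def gcp(*hids: str) -> str:
    """Greatest common prefix of several hierarchical identifiers."""
    common = []
    for column in zip(*(_parts(h) for h in hids)):
        if len(set(column)) != 1:  # the paths diverge at this position
            break
        common.append(column[0])
    return ".".join(common)
```

With these definitions, `pref(1, "BA.18.2F")` yields `"BA"`, and `gcp("BA.18.2F", "BA.18.3C")` yields `"BA.18"` for a hypothetical sibling node 3C.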
A directed bypass edge $be_i = (n_i, n_{i+1})$ (for all $i = 1, \ldots, t$) is a virtual link from node $n_i$ to node $n_{i+1}$, where $n_i$ is the initial node and $n_{i+1}$ the terminal node of $be_i$. For convenience, the direction is further indicated by left (l) / right (r) symbols, so that the left ring-neighbor of node $n_i$ is the initial node of the bypass edge terminated at $n_i$ and the right ring-neighbor is the terminal node of the bypass edge initiated at $n_i$.
The bypass ring $BR_c(r)$ of radius $r \ge 1$ centered at node $c \in MG$ is a circuit consisting of an ordered sequence of nodes $n_1, n_2, \ldots, n_{t+1} \in MG$, $n_1 = n_{t+1}$, such that:
(2.1) For all $n_i$, $i = 1, \ldots, t$, distance $d(n_i, c) = r$, or $d(n_i, c) \le r$ if $n_i$ is a leaf of MT, and
(2.2) Each sequence of all nodes $n_i, n_{i+1}, \ldots, n_{i+j}$ having $gcp(HID_{n_i}^c, \ldots, HID_{n_{i+j}}^c)$ equal to $pref(r, HID_{n_i}^c)$ is ordered equally to the ordering of these nodes in the bypass ring of radius 1 centered at their common MT-neighbor on the path to $c$.
This ordering property ensures that all bypass edges constructed between any two nodes are equally oriented, which is important for the repair process.
A complete bypass ring is a ring consisting of all nodes having property (1.1) or (2.1); a reduced bypass ring comprises only a subset of these nodes. An example of complete $BR_c(1)$ and $BR_c(2)$ centered at node $c$ is shown in Fig. 1. Integration of all bypass rings and an MT graph creates an extended multicast tree $EMT = (MG, CE \cup BE)$, where $CE$ is the set of core tree edges of the original MT and $BE$ is the set of all bypass edges. MT is then a spanning tree of EMT. Further in this paper, we will consider only complete bypass rings and assume that if $BR_c(r)$ is part of EMT, then all $BR_c(q)$, $1 \le q \le r$, are contained in EMT, too. $r_{max}$ denotes the maximum radius of bypass rings deployed.

Practical bypass ring construction algorithm
A node $c$ that is to create its bypass ring of radius 1 first sorts all its $t$ MT-neighbors $n_i$ by their IDs to get the sorted list $n_1, \ldots, n_t$ fulfilling property (1.2) and sends to each node $n_i$ a CREATE_BR($pref(2, HID_{n_i}^c)$, $ID_{n_l}$, $ID_{n_r}$) message, where $l = (i - 2) \bmod t + 1$ and $r = i \bmod t + 1$. Upon receiving a CREATE_BR message, each node $n_i$ saves the ID of the sender together with its own ID as $suff(2, HID_{n_i}^c)$, $pref(1, pref(2, HID_{n_i}^c))$ as the ID of center node $c$, and $ID_{n_l}$ and $ID_{n_r}$ as left and right ring-neighbors into its BR table.
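The index arithmetic for the left and right ring-neighbors can be checked with a short Python sketch. It only computes the contents of the CREATE_BR messages a center node would send; the function name, the dictionary layout and the center ID C0 are assumptions made for illustration, and IDs are assumed to be equal-length uppercase hexadecimal strings so that string sorting matches numeric order.

```python
from typing import Dict, List


def create_br1_messages(center_id: str, neighbor_ids: List[str]) -> Dict[str, dict]:
    """Return, per MT-neighbor, the fields of the CREATE_BR message for BR(1)."""
    ring = sorted(neighbor_ids)          # sorted list n_1, ..., n_t
    t = len(ring)
    messages: Dict[str, dict] = {}
    for i in range(1, t + 1):            # 1-based index, as in the text
        l = (i - 2) % t + 1              # index of the left ring-neighbor
        r = i % t + 1                    # index of the right ring-neighbor
        node = ring[i - 1]
        messages[node] = {
            "hid_prefix": f"{center_id}.{node}",  # pref(2, HID) of the receiver
            "left": ring[l - 1],
            "right": ring[r - 1],
        }
    return messages
```

For a hypothetical center C0 with MT-neighbors 8B, 18 and A0, node 18 receives A0 as its left and 8B as its right ring-neighbor, so the ring closes from the largest ID back to the smallest.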
To build $BR(r+1)$, each member $n$ of $BR(r)$ sorts all its $t$ MT-neighbors by their ID to get the list $n_1, \ldots, n_j, \ldots, n_t$, where $ID_{n_j} = pref(1, suff(2, HID_n^c))$, and creates the sequence $n_{j+1}, \ldots, n_t, n_1, \ldots, n_{j-1}$ to fulfill property (2.2). This sequence is then handled like the sorted list in the case of $BR(1)$ construction. To get $ID_{n_l}$ for $n_{j+1}$ and $ID_{n_r}$ for $n_{j-1}$, node $n$ has to communicate with its left and right $BR(r)$-neighbor, respectively.
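The rotation of the sorted neighbor list around the neighbor that lies on the path to the center can be sketched as follows. The centralized `next_ring` helper is an assumption made only to show the resulting order: in the scheme each member computes its own sequence and exchanges boundary IDs with its ring-neighbors, and a leaf is assumed to stay on the ring in accordance with property (2.1).

```python
from typing import Dict, List


def child_sequence(neighbor_ids: List[str], parent_id: str) -> List[str]:
    """Sequence n_{j+1}, ..., n_t, n_1, ..., n_{j-1}, where n_j is the neighbor toward c."""
    ordered = sorted(neighbor_ids)
    j = ordered.index(parent_id)
    return ordered[j + 1:] + ordered[:j]


def next_ring(ring_r: List[str],
              neighbors: Dict[str, List[str]],
              parent: Dict[str, str]) -> List[str]:
    """Concatenate the per-member sequences in ring order to obtain BR(r+1)."""
    ring: List[str] = []
    for n in ring_r:
        seq = child_sequence(neighbors[n], parent[n])
        ring.extend(seq if seq else [n])  # a leaf has no further neighbors and stays
    return ring
```

For example, a member with MT-neighbors 10, 20, 30, 40 whose neighbor toward the center is 20 contributes the sequence 30, 40, 10.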
Construction of rings (namely $BR(1)$) is similar to construction of a sorted circular bidirectional linked list. The important factor here is to prevent incomplete rings in cases when the center node fails during the process. Each node sends a HALLO_BR message to its ring-neighbors upon receiving CREATE_BR from the center node. If it does not receive the same message from both its ring-neighbors within a certain period of time, it assumes that the ring has not been constructed completely and sends a DELETE_BR message around to delete the corresponding entries in the BR tables of the members of the incomplete ring.
Updating the bypass ring when a member is either added or deleted is again similar to the corresponding operations with a sorted circular bidirectional linked list. The center node sends CHANGE_BRR and CHANGE_BRL messages, causing the target node to change the ID of its right and left ring-neighbor, retaining the orientation in the ring. To deal with center node failure that can lead to bypass ring damage, nodes receiving a CHANGE_BRL or CHANGE_BRR message have to agree with each other to perform the change atomically.
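The linked-list analogy can be made concrete with a small Python sketch that keeps the ring as a table of left and right ring-neighbors and returns the CHANGE_BRR and CHANGE_BRL messages a center node would have to send. The data layout and function names are assumptions for illustration; the atomic agreement between the two receivers is not modeled.

```python
from typing import Dict, List, Tuple

Ring = Dict[str, Dict[str, str]]          # node ID -> {"left": ID, "right": ID}
Message = Tuple[str, str, str]            # (message type, target node, new neighbor ID)


def add_member(ring: Ring, new_id: str) -> List[Message]:
    """Insert new_id into the sorted circular ring and list the update messages."""
    if not ring:
        ring[new_id] = {"left": new_id, "right": new_id}
        return []
    ids = sorted(ring)
    # Right neighbor: smallest ID greater than new_id, wrapping to the smallest ID.
    right = next((x for x in ids if x > new_id), ids[0])
    left = ring[right]["left"]
    ring[new_id] = {"left": left, "right": right}
    ring[left]["right"] = new_id          # effect of CHANGE_BRR at the left neighbor
    ring[right]["left"] = new_id          # effect of CHANGE_BRL at the right neighbor
    return [("CHANGE_BRR", left, new_id), ("CHANGE_BRL", right, new_id)]


def remove_member(ring: Ring, old_id: str) -> List[Message]:
    """Remove old_id and reconnect its former ring-neighbors."""
    entry = ring.pop(old_id)
    if not ring:
        return []
    left, right = entry["left"], entry["right"]
    ring[left]["right"] = right
    ring[right]["left"] = left
    return [("CHANGE_BRR", left, right), ("CHANGE_BRL", right, left)]
```

Both operations touch only the two ring-neighbors of the affected member, which is why the orientation of the rest of the ring is retained.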

Single failure repair using BR(1)
The aim of the repair process is to create new core tree edges using a bypass ring to connect tree partitions caused by a node failure and restore the multicast tree MT to the connected and consistent state. At the same time, it has to be assured that the repair process operates properly no matter where and at how many nodes the repair of a failed node is simultaneously initiated.
Let node $c$ be the faulty node in $EMT = (MG, CE \cup BE)$ and $BR_c(1)$ its bypass ring consisting of nodes $n_1, \ldots, n_t$. Define function $R(n_i)$ for all $n_i$, $i = 1, \ldots, t$, as follows: $R(n_i)$ records the node $n_f$ by which $n_i$ has been first notified about the failure of node $c$; $R(n_i)$ is undefined as long as $n_i$ has not been notified.
The basic idea behind the repair process is that each neighbor $n_f = n_{i_0}$ of faulty node $c$ detecting the failure consecutively iterates along the ring $BR_c(1)$ in the direction of the bypass edges through the nodes $n_{i_1}, n_{i_2}, \ldots$ ($n_{i_1}$ is the right ring-neighbor of $n_{i_0}$, ...) until it reaches a node $n_{i_p}$ that has already been notified about failure (i.e., $R(n_{i_p})$ is defined). At each hop $n_{i_q}$, $1 \le q \le p$, it is determined if $n_{i_{q-1}} \to n_{i_q}$, and $n_{i_q}$ is notified about failure of node $c$ such that $R(n_{i_q}) = n_f$. Termination of this process is guaranteed, since the ring consists of a finite number of nodes (each node has a finite set of neighbors in MT).
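The iteration and its termination can be illustrated with a minimal Python sketch. It models only the notification function $R$ on a ring given as a list in bypass-edge order; the relation $\to$ and the construction of new core tree edges are deliberately left out. Treating each failure-detecting node as already notified by itself, and processing the initiators one after another, are assumptions of this sketch.

```python
from typing import Dict, List


def notify_ring(ring: List[str], initiators: List[str]) -> Dict[str, str]:
    """Walk right from every initiator until an already notified node is reached."""
    # Assumption: a node that detects the failure counts as notified by itself.
    R: Dict[str, str] = {f: f for f in initiators}
    t = len(ring)
    for f in initiators:
        pos = ring.index(f)
        for step in range(1, t + 1):      # at most t hops: the ring is finite
            node = ring[(pos + step) % t]
            if node in R:                 # R(node) is defined: stop iterating
                break
            R[node] = f                   # notify node on behalf of initiator f
    return R
```

On a hypothetical ring 18, 8B, A0, CC with initiators 18 and A0, node 8B is notified by 18 and node CC by A0, and each walk stops at the next initiator.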
Several methods can be deployed for selection of node $n_{i_s}$. They are discussed in section 5.
After all nodes $n_1, \ldots, n_t \in BR_c(1)$ have been notified, the relation $\to$ between the incident nodes of all bypass edges $be_i$ is known. With the definition of the bypass ring and the $\to$ relation, it can be proven that relation $\to$ on the bypass ring has the following properties, including: (6.2) There is only one bypass edge $be_i = (n_u, n_v) \in BR_c(1)$ such that $n_u \to n_v$ is not true.
Together with properties (5.1)-(5.3) of newly constructed core tree edges, we can get the following: (7.1) The repaired MT connects all partitions induced by a node failure. (7.2) The repaired MT is a tree graph.
Moreover, the described technique is independent of the repair-initiating node and also prevents collisions in cases when multiple nodes initiate repair simultaneously (which can easily happen in a distributed system).

Practical single failure repair algorithm
A practical repair algorithm may incorporate several performance improvements:
• The iteration along the ring from the failure-detecting node $n_f$ is performed in both directions (further referred to as a two-way algorithm).
• The iteration is not performed directly by $n_f$; it is rather delegated to the ring members.
• The new core tree edges are constructed on the fly, immediately after node $n_{i_q}$ is notified and relation $n_{i_{q-1}} \to n_{i_q}$ determined.
The algorithm works as follows: Each node $n_f$ that during message routing through the multicast tree realizes that one of its neighbors, say, node $c$, is down, excludes node $c$ from its bypass ring $BR_{n_f}(1)$ and sends REPAIR($ID_c$, $ID_{n_f}$) messages to both its ring-neighbors on the ring $BR_c(1)$.
procedure OnRecv_REPAIR(ID of failed node $ID_c$, ID of initiating node $ID_{n_f}$)
(performed at $n_{i_q}$, REPAIR received from ring-neighbor $n_{i_{q-1}}$)
 1. remove node $c$ from $BR(1)$ centered at $n_{i_q}$
 2. if $R(n_{i_q})$ is undefined then
 3.   ...
      add node $n_{i_s}$ to $BR(1)$ centered at $n_{i_q}$
      ...
      send REJECT to node $n_{i_{q-1}}$ and goto step 18
    end
11. ...
An example of the distributed construction of relation $\to$ on $BR_c(1)$ is shown in Fig. 2. In the first step, node 18 finds that node $c$ is not available, so it uses $BR_c(1)$ to eliminate tree partition and sends REPAIR(c, 18) messages to its ring-neighbors (step 2) to determine the $\to$ relation. In step 4, another node (8B) similarly sends a REPAIR(c, 8B) message to its ring-neighbors. Since the first REPAIR message received by node A0 originated at 18 (after step 3) and the second one (from node 8B) was received from its left neighbor, node A0 rejects the message (there is no relation 8B $\to$ A0) and thus prevents a cycle forming.
Each node accepting a REPAIR message establishes a new core tree edge according to the chosen repair method.

Multiple failure repair
To deal with failure of a cluster of nodes in an MT tree, bypass rings $BR(r)$, $r > 1$, have to be deployed. Essentially, the repair process is similar to the single failure repair described in section 4.3, as it is also based on relation $\to$ determination between each two neighbors on some repair route. The difference is, of course, in the repair message routing and in relation $\to$ determination, since the nodes on the route may not be sorted in general.
Let $FC = \{c_i;\ i = 1, 2, \ldots, s;\ c_i \in MG\}$ be the faulty cluster (i.e., set of faulty nodes), where $s = |FC|$ is the cluster size. The repair message is routed as follows. The repair process always begins at a member of $BR(1)$, since the failure is detected by the MT-neighbor of some faulty node. Each node $n_f$ (detecting failure of its neighbor $c_i$) iterates through nodes along $BR_{c_i}(1)$ in the direction of the bypass edges in the same way as in single failure repair, except that it can reach another faulty node $n_{i_{q+1}} = c_{i+1} \in BR_{c_i}(1)$. In this case, $ID_{c_{i+1}}$ is appended to the list of faulty nodes and the iteration continues from node $n_{i_q}$ along the ring $BR_{c_{i+1}}(r)$, where $r = d(n_{i_q}, c_{i+1})$, provided that a bypass ring with radius $r$ centered at $c_{i+1}$ is constructed ($r \le r_{max}$). This 'switch' to another ring is done whenever the next regular node in the repair route is faulty.
From each member of a $BR_c(r)$, $r > 1$, the repair message is routed along a core tree edge to the MT-neighbor node $m \in BR_c(r-1)$ (if it is not faulty) in order to keep the iteration path as close as possible to the failed cluster of nodes.
This process is performed until iteration reaches a node $n_{i_p}$ that has already been notified about failure of at least one node $c_i$ from the list of faulty nodes (i.e., $R(n_{i_p})$ is defined). The number of steps of this iteration is finite, since all $BR(r)$ have finite length, the number of faulty nodes also has to be finite and all rings have a uniform orientation.
Let $P_i$ be iteration paths initiated at nodes $n_i$, $i = 1, \ldots, h$, where $h$ is the number of nodes initiating multiple failure repair. Since every path $P_i$ is terminated at node $n_{i+1}$ initiating path $P_{i+1}$, the union $P_1 \cup P_2 \cup \ldots \cup P_h$ forms a continuous path surrounding the faulty cluster, where the terminating node of $P_h$ is at the same time the initiating node of $P_1$. This cycle is called a bypass cycle BC. A bypass cycle has similar properties to a bypass ring, and in the case of single failure repair of node $c$, the bypass cycle is formed solely by $BR_c(1)$.
To enable the multiple failure repair algorithm to work properly with bypass cycles, paragraph (4.1) in the definition of relation $n_i \to n_j$ introduced in section 4.3 has to be modified: ... $(n_i, n_j)$ is a bypass edge of $BR_c(r)$, or $(n_i, n_j) \in MT$ and $n_i \in BR_c(r)$ and ... To achieve comparability between nodes on BC, the definition of function $R$ also has to be modified: (9.1) $R(n_i) = HID_{n_f}^{c_m}$, where $ID_{c_m} = \min\{ID_{c_i};\ c_i \in FC\}$, if node $n_i$ has been first notified about node $c$ failure by node $n_f$. Node $c_m$, $HID_{n_f}^{c_m}$ and also $HID_{n_i}^{c_m}$ are determined during repair message routing. To get the correct node $c_m$ considering all nodes $c_i \in FC$ (not only the information known locally to the message routed along a particular path $P_i$), a technique similar to the Chang-Roberts leader election algorithm [2] is used. After these modifications, the relation $\to$ between each two neighboring nodes on a bypass cycle has similar properties to the relation $\to$ on $BR(1)$. In particular, relation $\to$ on the bypass cycle is acyclic and there are only two neighbors $n_u$, $n_v$ on the cycle such that neither $n_u \to n_v$ nor $n_v \to n_u$ is true. With these properties, it can be shown that all partitions induced by simultaneous failure of multiple adjacent nodes in MT are reconnected into one tree.
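How nodes on a bypass cycle can agree on the minimal faulty ID from purely local knowledge can be sketched in the spirit of the Chang-Roberts algorithm: candidates travel in one direction around the cycle and are forwarded only when they improve on what the receiver already knows. The synchronous round structure, the function name and the input format are simplifying assumptions; this is not the exact procedure used by the scheme.

```python
from typing import List, Optional, Set, Tuple


def agree_on_min_id(local_known: List[Set[str]]) -> Tuple[List[Optional[str]], int]:
    """local_known[k] holds the faulty IDs known at the k-th node of the cycle.

    Returns the value each node ends up with and the number of messages sent.
    """
    n = len(local_known)
    best: List[Optional[str]] = [min(k) if k else None for k in local_known]
    pending: List[Optional[str]] = list(best)   # candidate each node sends next
    messages = 0
    while any(p is not None for p in pending):
        nxt: List[Optional[str]] = [None] * n
        for i, cand in enumerate(pending):
            if cand is None:
                continue
            messages += 1
            j = (i + 1) % n                     # right neighbor on the cycle
            if best[j] is None or cand < best[j]:
                best[j] = cand                  # improvement: adopt and forward
                nxt[j] = cand
            # otherwise the candidate is swallowed, as in Chang-Roberts
        pending = nxt
    return best, messages
```

Every forwarded candidate strictly lowers the value stored at its receiver, so the exchange stops after finitely many messages with all nodes holding the same minimal ID.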

Repair methods
Bypass rings used to repair the multicast tree network MT together with the $\to$ relation create a platform that can be used as a basis for failure repair strategies responsible for constructing new core tree edges as they select node $n_{i_s}$ (see Alg. 1), influencing the topology of the repaired MT. For convenience, the set of new core tree edges constructed during the repair process is further denoted as SE. Based on requirements for optimization of the multicast tree, one of the following approaches to SE creation can be chosen:
• LRM method. All member nodes of $BR(1)$ (or the bypass cycle in the case of multiple failure) together with SE form a linear tree (Fig. 3 b), e)). Node $n_{i_s}$ is substituted by $n_{i_{q-1}}$. This approach is preferable if optimization priority is not to increase the degree of nodes in MT (i.e., the number of incident edges).
• HRM method. This method is a combination of both LRM and TRM methods. Specifically, it is similar to TRM except that the path from $R(n_i)$ to $n_i$ may contain one or more intermediate nodes $n_j$ with $R(n_j) = R(n_i)$. The branches of the newly constructed core tree edges rooted at $R(n_i)$ may thus be longer than one hop (Fig. 3 d)).
The appropriate strategy for SE construction can be chosen autonomously by all repair-initiating nodes based on their current local state.

Discussion
Bypass rings $BR(1)$ are sufficient for the repair of a single node failure. Generally, a $BR(r)$ scheme is sufficient for the repair of a cluster of faulty nodes with a diameter less than or equal to $2(r - 1)$. However, because of the repair message routing mechanism 'switching' between bypass rings centered at different faulty nodes, bypass rings $BR(r)$ can repair an even larger cluster of faulty nodes under convenient circumstances.
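The sufficiency condition can be checked mechanically for a given tree and set of faulty nodes. The following Python sketch assumes that the cluster diameter is the largest hop distance in MT between two faulty nodes and that the tree is given as an adjacency dictionary; both the representation and the function names are illustrative assumptions.

```python
from collections import deque
from itertools import combinations
from typing import Dict, List, Set


def tree_distance(tree: Dict[str, List[str]], a: str, b: str) -> int:
    """Hop distance between two nodes of the tree, found by breadth-first search."""
    dist = {a: 0}
    queue = deque([a])
    while queue:
        u = queue.popleft()
        if u == b:
            return dist[u]
        for v in tree[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    raise ValueError("nodes are not connected")


def cluster_diameter(tree: Dict[str, List[str]], faulty: Set[str]) -> int:
    """Largest distance between two faulty nodes (0 for a single faulty node)."""
    pairs = combinations(sorted(faulty), 2)
    return max((tree_distance(tree, a, b) for a, b in pairs), default=0)


def repair_guaranteed(tree: Dict[str, List[str]], faulty: Set[str], r_max: int) -> bool:
    """True if the cluster diameter does not exceed 2 * (r_max - 1)."""
    return cluster_diameter(tree, faulty) <= 2 * (r_max - 1)
```

A single faulty node has diameter 0 and is covered by $r_{max} = 1$, while two adjacent faulty nodes have diameter 1 and need $r_{max} \ge 2$ for the guarantee to apply; larger clusters may still be repaired through ring switching, which this check does not model.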
For any repair to be successful, there has to be a path not containing faulty nodes in the underlying routing infrastructure between each two neighboring nodes on a bypass cycle.
The proposed bypass ring scheme is tolerant to failures of nodes on $BR(1)$ (or the bypass cycle) even if the corresponding repair is still in progress. This property is ensured by the fact that core tree edges can be constructed immediately after a node $n$ is notified about failure, and thus it is possible to update $BR_c(r)$ appropriately.
Construction of a complete bypass ring $BR(r)$ (together with all $BR(q)$, $q = 1, \ldots, r - 1$) takes $O(r)$ steps and needs $O(b^r)$ messages, where $b$ is the average branching factor in MT. Failure repair can be done in the worst case in $O(s^2 b)$ steps. The time to perform a repair is inversely proportional to the number of nodes simultaneously initiating the repair, and it is further reduced if the repair message is routed in both directions in $BR(1)$ or a bypass cycle (two-way algorithm).
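To see where a message count that grows exponentially in $r$ can come from, consider a worked count under a simplifying assumption that is not made in the text: every node within distance $r$ of the center $c$ has exactly $b > 2$ MT-neighbors, and one CREATE_BR message is sent per ring member.

```latex
\begin{align*}
|BR_c(q)| &= b\,(b-1)^{q-1}, \qquad q = 1, \ldots, r,\\
\sum_{q=1}^{r} |BR_c(q)| &= b \sum_{q=1}^{r} (b-1)^{q-1}
  = \frac{b\left((b-1)^{r} - 1\right)}{b-2} = O(b^{r}).
\end{align*}
```

For $b = 3$ and $r = 2$ this gives $3 + 6 = 9$ ring entries in total, which matches the closed form $3 \cdot (2^2 - 1) / 1 = 9$.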
Reduced bypass rings may lower the memory overhead and construction cost to $O(b)$ at the expense of the fault-tolerance level. Note that even complete $BR(2)$ rings provide a substantial level of fault-tolerance for many applications.
Additionally, the higher the average branching factor is, the higher is the probability that clusters with a diameter greater than $2(r_{max} - 1)$ will be successfully repaired.

Conclusion
We have proposed a fault-tolerant scheme for tree-topology communication networks based on bypass rings of optional radii that are used to repair the tree when a node or cluster of nodes fails. For a single failure model, a practical repair algorithm was presented.
As this scheme guarantees that all tree partitions induced by faulty nodes are reconnected and that the repaired network does not contain cycles, it can be used as a generic platform for various strategies for reconnecting tree partitions. Three strategies for creating new core edges were briefly discussed.
Multicasting is not the only application where our scheme can be successfully deployed. We argue that our scheme can be used generally by all applications using tree-topology overlay networks to connect or communicate between their components, since the scheme is independent of message source and traffic direction, and the repair can be done in real-time without a significant delay penalty. The only requirement is an underlying infrastructure providing a routing service.
Our future work in this area will include simulation of standard tree traffic patterns, evaluation of the performance of the scheme under various workloads, and a comparison of the effectiveness of various strategies for partition reconnection in terms of external tree optimization requirements. We would also like to refine repair algorithms for reduced bypass rings and measure the impact of reduction on the fault-tolerance level provided.

Acknowledgement
This work was carried out as a part of the Gaston project at the Czech Technical University in Prague. Gaston [3], [4] is a peer-to-peer large-scale file system designed to provide a fault-tolerant and highly available file service.

Fig. 1: Example of complete bypass rings $BR_c(1)$ and $BR_c(2)$

Fig. 2: Example of a two-way single failure repair process

Fig. 3: Methods of multicast tree repair using a two-way single failure repair algorithm. a) Original multicast tree; b), c), d) LRM, TRM and HRM methods of repair initiated at a single node; e), f) LRM and TRM methods, repair initiated simultaneously at two nodes. Legend: bypass ring $BR(1)$, new tree edges SE, original tree edges, repair message routing.

Each node $n \in MG$ holds a BR routing table containing information about all bypass rings that the node is a member of. Each entry in the table consists of the ID of ring center node $c$, the IDs of the left and right ring-neighbors, $suff(2, HID_n^c)$, $pref(2, HID_n^c)$ and the radius of the ring.

• TRM method. All nodes $n_i$ on BC are connected to $R(n_i)$ and also to $R(n_j)$ if there is a node $n_j$ such that $n_i \to n_j$ and $R(n_j) \ne R(n_i)$ (Fig. 3 c), f)). Node $n_{i_s}$ is substituted by $R(n_{i_q})$ in Alg. 1. This method is preferable if optimization priority is to retain the mutual distance of nodes $n_i$ and thus not to increase the MT diameter.

Symbols
BC          Bypass cycle
BE          Set of bypass edges
$be_i$      Bypass edge $i$
$BR_c(r)$   Bypass ring of radius $r$ centered at node $c$
CE          Set of core tree edges
$ce_i$      Core tree edge $i$
E           Set of all edges in underlying network SN
EMT         Extended multicast tree
FC          Cluster of faulty nodes
gcp         Greatest common prefix function
$HID_n^c$   Hierarchical identifier of node $n$ related to node $c$
$ID_n$      Identifier of node $n$
$r_{max}$   The greatest bypass ring radius deployed
s           Size of faulty cluster
SE          Set of new core tree edges constructed during repair process
SN          Underlying network
suff        Suffix function
V           Set of all nodes in underlying network SN