Technical report detail

Solving Atomic Multicast when Groups Crash

by Nicolas Schiper, Fernando Pedone

In this paper, we study the atomic multicast problem, a fundamental abstraction for building fault-tolerant systems. In the atomic multicast problem, the system is divided into non-empty and disjoint groups of processes. Multicast messages may be addressed to any subset of groups, each message possibly being multicast to a different subset. Several papers previously studied this problem either in local area networks or wide area networks. However, none of them studied atomic multicast when groups may entirely crash. We first present an algorithm that solves atomic multicast in a genuine way, i.e., to deliver a message m, only addressees of m are involved in the protocol. This algorithm tolerates an arbitrary number of failures and uses a failure detector, denoted as P, that provides perfect information about process failures, i.e., it never stops trusting a process to be alive before it crashes and eventually stops trusting crashed processes. We show that among failure detectors of practical interest, i.e., those that do not predict the future, P is necessary to solve genuine atomic multicast if we do not bound the number of processes that may fail. Failure detector P is thus the weakest realistic failure detector for solving genuine atomic multicast when an arbitrary number of processes may crash. As implementing P seems hard, especially in wide area networks, we also solve atomic multicast with a non-genuine algorithm. This algorithm has a lower process resiliency than the former (group crashes are still tolerated) but provides several advantages compared to the former. It requires perfect failure detection inside each group instead of on a system-wide level. Moreover, it is fast, messages addressed to multiple groups may be delivered within two inter-group message delays only. We also present a modification of this algorithm that tolerates unreliable failure detection at the cost of a weaker liveness guarantee.

Technical report 2008/02, July 2008

BibTex entry

@techreport{08solving, author = {Nicolas Schiper and Fernando Pedone}, title = {Solving Atomic Multicast when Groups Crash}, institution = {University of Lugano}, number = {2008/02}, year = 2008, month = jul }