dFault: Fault Localization in Large-Scale Peer-to-Peer Systems
Abstract
Distributed hash tables (DHTs) have been adopted as
a building block for large-scale distributed systems. As a result of this
success, their robust operation matters even more now that
mission-critical applications are beginning to be layered on them. Even though
DHTs can detect and heal around unresponsive hosts and disconnected
links, several hidden faults and performance bottlenecks go undetected,
resulting in unanswered queries and delayed responses. In this paper, we
propose dFault, a system that helps large-scale DHTs localize such
faults. Given a log of failed queries, called symptoms, and some
available information about the hosts in the DHT, dFault identifies the
potential root causes (hosts and overlay links) that, with high
likelihood, contributed to those symptoms. Its design is based on
the recently proposed dependency graph modeling and inference approach
for fault localization. We describe the design of dFault and show,
using a real prototype deployed over PlanetLab, that it accurately
localizes the root causes of faults with a modest amount of
information collected from individual nodes.
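
To make the idea of inferring root causes from symptoms concrete, the following is a minimal sketch of one possible inference step over a dependency graph. It is not dFault's actual algorithm: the function name `localize`, the string encoding of hosts and links, the rule that any component on a successful query is healthy, and the greedy selection are all assumptions made for illustration.

```python
from collections import Counter

def localize(failed_paths, ok_paths):
    """Greedy sketch (hypothetical): repeatedly blame the component that
    appears on the most still-unexplained failed queries."""
    # Simplifying assumption: any component on a successful query is healthy.
    healthy = set().union(*ok_paths)
    # Each symptom depends on the hosts and links its query traversed.
    unexplained = [set(path) - healthy for path in failed_paths]
    unexplained = [deps for deps in unexplained if deps]
    suspects = []
    while unexplained:
        counts = Counter(c for deps in unexplained for c in deps)
        # Highest count wins; ties are broken by name so the result is deterministic.
        best = min(counts, key=lambda c: (-counts[c], c))
        suspects.append(best)
        # Symptoms that depend on the chosen component are now explained.
        unexplained = [deps for deps in unexplained if best not in deps]
    return suspects

# Hosts are "A", "B", ...; overlay links are written "A->B" (illustrative data).
failed = [["A", "A->B", "B", "B->C", "C"], ["D", "D->B", "B"], ["E", "E->F", "F"]]
ok = [["A", "A->D", "D"], ["E", "E->C", "C"]]
print(localize(failed, ok))  # ['B', 'E->F']
```

Here host B is blamed first because it lies on two failed queries, and the link E->F is then chosen (by the name tie-break) to explain the remaining symptom. A probabilistic model would rank suspects by likelihood instead of this greedy count.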
Domains
Digital Libraries [cs.DL]