Failure Prediction for Proactive Fault Management

Start date: 30 September 2010

End date: 1 October 2010

SPEAKER: Miroslaw Malek, Humboldt-Universität, Berlin
DATE: Thursday, September 30th 2010
PLACE: USI Università della Svizzera italiana, room A24, Red Building (Via G. Buffi 13)
TIME: 10.30

Assuring business continuity is and will remain a key challenge due to ever-increasing system complexity, growing connectivity and interoperability, dynamicity (frequent configurations, reconfigurations, updates, upgrades and patches) and the proliferation of systems in all walks of life.
We present a brief overview of computer/communication systems engineering, focusing on business processes, services, operating systems and the IT hardware infrastructure, and introduce the concept of translucency, which helps decide at what level the most cost-efficient dependability gains can be achieved.
We argue that, given current complexity levels and the necessity of dealing with time, classical synthesis and analysis methods must be complemented by management and by empirical, data-driven approaches, which require monitoring, online measurement, online analysis, diagnosis, failure prediction and decision making to support recovery and nonstop computing and communication.
We address the problem of proactive fault management by demonstrating how runtime monitoring, variable selection, model re-evaluation and preventive measures triggered by failure prediction mechanisms lead to a significant increase in availability. We present a brief taxonomy of such approaches, and we propose and evaluate two techniques that model and predict the occurrence of failures as a function of discrete and continuous measurements of system variables. The two modelling approaches are a function approximation technique utilising Universal Basis Functions (UBF) and a Hidden Markov Model. Both methods are data-driven rather than analytical and can handle large numbers of variables and large volumes of data. They offer the potential to capture the underlying dynamics of high-dimensional and noisy systems.
Next, we show how such modelling techniques have been applied to real data from a commercial telecommunication platform. The data includes event-based log files and measured system states. Results are presented in terms of precision, recall, F-measure and cumulative cost. Our findings demonstrate how predictive technologies combined with effective failure avoidance and recovery methods can boost dependability and performance, especially in cloud, multicore and many-core computing environments. By using the presented techniques, the cumulative system downtime may be reduced by an order of magnitude. In conclusion, the main research challenges for proactive fault management in next-generation systems will be presented.
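For readers unfamiliar with the evaluation metrics named above, the following minimal Python sketch shows how precision, recall and F-measure are commonly computed for a failure predictor. It uses the standard textbook definitions; the class name, function names and the counts in the example are illustrative assumptions and are not taken from the talk or from the telecommunication data set.

```python
from dataclasses import dataclass


@dataclass
class PredictionCounts:
    """Hypothetical outcome counts for a failure predictor."""
    true_positives: int   # warnings that were followed by a real failure
    false_positives: int  # warnings that were not followed by a failure
    false_negatives: int  # failures that occurred without a warning


def precision(c: PredictionCounts) -> float:
    """Fraction of raised warnings that were correct."""
    denom = c.true_positives + c.false_positives
    return c.true_positives / denom if denom else 0.0


def recall(c: PredictionCounts) -> float:
    """Fraction of actual failures that were predicted."""
    denom = c.true_positives + c.false_negatives
    return c.true_positives / denom if denom else 0.0


def f_measure(c: PredictionCounts) -> float:
    """Harmonic mean of precision and recall."""
    p, r = precision(c), recall(c)
    return 2 * p * r / (p + r) if (p + r) else 0.0


# Made-up counts, for illustration only.
counts = PredictionCounts(true_positives=8, false_positives=2, false_negatives=4)
print(f"precision={precision(counts):.3f}")
print(f"recall={recall(counts):.3f}")
print(f"F-measure={f_measure(counts):.3f}")
```

A predictor that warns rarely tends to score high on precision and low on recall, and one that warns often tends to do the opposite. The F-measure summarises this trade-off in a single number.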

Miroslaw Malek is professor and Chair of Computer Architecture and Communication at the Department of Computer Science at Humboldt University in Berlin. His research interests focus on dependable architectures and services in parallel, distributed and embedded computing and communication environments. He has participated in two pioneering parallel computer projects, contributed to the theory and practice of parallel network design, developed the comparison-based method for system diagnosis, co-developed comprehensive WSI, network testing and failure prediction techniques, proposed the consensus-based framework for the design of responsive (fault-tolerant, real-time) computer systems and made numerous other contributions, reflected in over 200 publications. He has supervised over 25 Ph.D. dissertations (ten of his students are professors) and has founded, organized and co-organized numerous workshops and conferences. He has served on the editorial boards of several journals and is a consultant to governments and companies on technical and strategic issues in information technology. Malek received his Ph.D. in Computer Science from the Technical University of Wroclaw in Poland and spent 17 years as professor at the University of Texas at Austin. He has also been, among others, visiting professor at Stanford, Università di Roma "La Sapienza", Keio University, Technical University in Vienna, New York University and Chinese University of Hong Kong, and guest researcher at Bell Laboratories and IBM T.J. Watson Research Center.

HOST: Prof. Fernando Pedone