Self-adaptivity of Applications on Network on Chip Multiprocessors: The Case of Fault-tolerant Kahn Process Networks

Decanato - Facoltà di scienze informatiche

Data d'inizio: 19 Maggio 2015

Data di fine: 20 Maggio 2015

You are cordially invited to attend the PhD Dissertation Defense of Onur DERIN on Tuesday, May 19th 2015 at 13h30 in room SI-003 (Informatics building)

Abstract:
Technology scaling accompanied with higher operating frequencies and the ability to integrate more functionality in the same chip has been the driving force behind delivering higher performance computing systems at lower costs. Embedded computing systems, which have been riding the same wave of success, have evolved into complex architectures encompassing a high number of cores interconnected by an on-chip network (usually identified as Multiprocessor System-on-Chip). However these trends are hindered by issues that arise as technology scaling continues towards deep submicron scales. Firstly, growing complexity of these systems and the variability introduced by process technologies make it ever harder to perform a thorough optimization of the system at design time. Secondly, designers are faced with a reliability wall that emerges as age-related degradation reduces the lifetime of transistors, and as the probability of defects escaping post-manufacturing testing is increased.

In this thesis, we take on these challenges within the context of streaming applications running in network-on-chip based parallel (not necessarily homogeneous) systems-on-chip that adopt the no-remote memory access model. In particular, this thesis tackles two main problems: (1) fault-aware online task remapping, (2) application-level self-adaptation for quality management. For the former, by viewing fault tolerance as a self-adaptation aspect, we adopt a cross-layer approach that aims at graceful performance degradation by addressing permanent faults in processing elements mostly at system-level, in particular by exploiting redundancy available in multi-core platforms. We propose an optimal solution based on an integer linear programming formulation (suitable for design time adoption) as well as heuristic-based solutions to be used at run-time. We assess the impact of our approach on the lifetime reliability. We propose two recovery schemes based on a checkpoint-and-rollback and a rollforward technique. For the latter, we propose two variants of a monitor-controller-adapter loop that adapts application-level parameters to meet performance goals. We demonstrate not only that fault tolerance and self-adaptivity can be achieved in embedded platforms, but also that it can be done without incurring large overheads. In addressing these problems, we present techniques which have been realized (depending on their characteristics) in the form of a design tool, a run-time library or a hardware core to be added to the basic architecture.

Dissertation Committee:

  • Prof. Mariagiovanna Sami, Politecnico di Milano, Italy (Research Advisor)
  • Prof. Matthias Hauswirth, Università della Svizzera italiana, Switzerland (Internal Member)
  • Prof. Miroslaw Malek, Università della Svizzera italiana, Switzerland (Internal Member)
  • Prof. Fernando Pedone, Università della Svizzera italiana, Switzerland (Internal Member)
  • Prof. Peter Marwedel, Technische Universität Dortmund, Germany (External Member)