HPC-SFI: System-Level Fault Injection for High Performance Computing Systems - Network and Parallel Computing
Conference Papers Year : 2018

HPC-SFI: System-Level Fault Injection for High Performance Computing Systems

Yanqi Wang
Yi Liu

Abstract

Resilience, or fault tolerance, has become a key challenge for large-scale parallel systems. To ensure the reliability of high performance computing (HPC) systems, various techniques have been proposed, such as hardware-level fault tolerance, checkpointing, replication, and algorithm-based fault tolerance. Many software systems also monitor and handle system failures, for example the management and job-scheduling systems of HPC installations. Evaluating the effectiveness of these systems requires a tool that can inject failures into an HPC system. This paper proposes HPC-SFI, a system-level fault injection tool for HPC systems. HPC-SFI can generate three kinds of system failures: in-node faults, failures in the interconnection network, and failures of the storage or parallel file system. In addition, HPC-SFI can inject system faults in a pseudo-random manner according to pre-defined parameters and probabilities. Preliminary experimental results demonstrate the effectiveness of the tool.
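
As an illustration of what pseudo-random injection driven by pre-defined parameters and probabilities could look like, here is a minimal Python sketch. It is not HPC-SFI's implementation: the function `plan_faults`, the `FaultEvent` record, the fault-kind labels, and the exponential inter-arrival assumption are all hypothetical, and the code only builds a schedule without injecting anything.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class FaultEvent:
    time_s: float  # simulated injection time, in seconds
    kind: str      # hypothetical label: "in_node", "interconnect" or "storage"
    target: str    # node name for in-node faults, otherwise the subsystem label


def plan_faults(nodes, weights, mean_interval_s, horizon_s, seed):
    """Build a reproducible fault schedule (assumed model, nothing is injected).

    nodes           -- list of node names that in-node faults may target
    weights         -- dict mapping fault kind to its relative probability
    mean_interval_s -- assumed mean time between faults (exponential arrivals)
    horizon_s       -- length of the planned experiment, in seconds
    seed            -- seed of the pseudo-random generator, for repeatability
    """
    rng = random.Random(seed)
    kinds = list(weights)
    probs = [weights[k] for k in kinds]
    events = []
    t = rng.expovariate(1.0 / mean_interval_s)
    while t < horizon_s:
        # Pick the fault kind according to the configured probabilities.
        kind = rng.choices(kinds, weights=probs, k=1)[0]
        target = rng.choice(nodes) if kind == "in_node" else kind
        events.append(FaultEvent(t, kind, target))
        t += rng.expovariate(1.0 / mean_interval_s)
    return events
```

Calling `plan_faults(["n0", "n1"], {"in_node": 0.6, "interconnect": 0.3, "storage": 0.1}, 60.0, 3600.0, seed=1)` twice yields the same schedule, which is the property that makes a pseudo-random injection campaign repeatable across runs.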
Main file: 477597_1_En_9_Chapter.pdf (537.28 KB)
Origin Files produced by the author(s)

Dates and versions

hal-02279558, version 1 (05-09-2019)

Cite

Yanqi Wang, Qi Zhang, Yi Liu, Depei Qian. HPC-SFI: System-Level Fault Injection for High Performance Computing Systems. 15th IFIP International Conference on Network and Parallel Computing (NPC), Nov 2018, Muroran, Japan. pp.103-113, ⟨10.1007/978-3-030-05677-3_9⟩. ⟨hal-02279558⟩