Failure Avoidance in MPI Applications Using an Application-Level Approach
Publication Details
Output type: Journal article
Author list: Cores I, Rodriguez G, Gonzalez P, Martin MJ
Publisher: Oxford University Press
Publication year: 2014
Journal: The Computer Journal (0010-4620)
Volume number: 57
Issue number: 1
Start page: 100
End page: 114
Number of pages: 15
ISSN: 0010-4620
eISSN: 1460-2067
Languages: English-Great Britain (EN-GB)
Unpaywall Data
Open access status: bronze
Full text URL: https://academic.oup.com/comjnl/article-pdf/57/1/100/1141882/bxs158.pdf
Abstract
Execution times of large-scale parallel applications in computational science and engineering are usually longer than the mean time between failures (MTBF). Applications must therefore tolerate hardware failures so that a machine failure does not wipe out all the computation already done. Checkpointing and rollback recovery is one of the most popular techniques for providing fault tolerance support to parallel applications. However, when a failure occurs, most checkpointing mechanisms require a complete restart of the parallel application from the last checkpoint. Recent advances in the prediction of hardware failures have led to the development of proactive process migration approaches, in which tasks are migrated preventively when node failures are anticipated, avoiding a restart of the whole application. The work presented in this paper extends an application-level checkpointing framework to proactively migrate Message Passing Interface (MPI) processes when notified of impending failures, without having to restart the entire application. The main features of the proposed solution are: low overhead in failure-free executions, by avoiding the checkpoint dumping associated with rollback strategies; low overhead at migration time, by means of a lightweight, asynchronous protocol for reaching a consistent global state; transparency for the user, thanks to the use of a compiler tool and a runtime library; and portability, as it is not tied to a particular architecture, operating system or MPI implementation.
Keywords
checkpointing, failure avoidance, message-passing, proactive migration