Failure Avoidance in MPI Applications Using an Application-Level Approach


Publication Details

Output type: Journal article

Author list: Cores I, Rodriguez G, Gonzalez P, Baykal B

Publisher: Oxford University Press

Publication year: 2014

Journal: The Computer Journal (0010-4620)

Volume number: 57

Issue number: 1

Start page: 100

End page: 114

Number of pages: 15

ISSN: 0010-4620

eISSN: 1460-2067

Languages: English-Great Britain (EN-GB)


Unpaywall Data

Open access status: bronze

Full text URL: https://academic.oup.com/comjnl/article-pdf/57/1/100/1141882/bxs158.pdf


Abstract

Execution times of large-scale computational science and engineering parallel applications are usually longer than the mean time between failures. For this reason, applications must tolerate hardware failures so that a machine failure does not wipe out all the computation completed so far. Checkpointing with rollback recovery is one of the most popular techniques for providing fault tolerance support to parallel applications. However, when a failure occurs, most checkpointing mechanisms require a complete restart of the parallel application from the last checkpoint. New advances in the prediction of hardware failures have led to the development of proactive process migration approaches, where tasks are migrated preventively when node failures are anticipated, avoiding the restart of the whole application. The work presented in this paper extends an application-level checkpointing framework to proactively migrate message passing interface (MPI) processes when it is notified of impending failures, without having to restart the entire application. The main features of the proposed solution are: low overhead in failure-free executions, avoiding the checkpoint dumping associated with rollback strategies; low overhead at migration time, by means of a lightweight, asynchronous protocol to achieve a consistent global state; transparency for the user, thanks to the use of a compiler tool and a runtime library; and portability, as it is not locked into a particular architecture, operating system or MPI implementation.


Keywords

checkpointing, failure avoidance, message-passing, proactive migration




Last updated on 2025-01-07 at 00:29