Thesis Year: 2014

Resilient and energy-efficient scheduling algorithms at scale

Algorithmes d'ordonnancement fiables et efficaces énergétiquement à l'échelle

Abstract

This thesis deals with two issues for future Exascale platforms, namely resilience and energy, addressed in two main parts. The first part focuses on the importance of reliability for future Exascale platforms, while the second part discusses how to reduce the energy consumption of these platforms. Given the relative trends in the reliability of individual components on one side, and in the number of components on the other side, the reliability of the entire platform is expected to decrease, due to probabilistic amplification: the mean time between failures on Exascale systems is expected to be shorter than the time needed to take a system checkpoint.

In the first part of this thesis, we focus on the optimal placement of periodic coordinated checkpoints to minimize execution time. In a general context, we start by improving the first-order period formula obtained by Young and Daly. We then consider fault predictors, software tools used by system administrators that try to predict (through the study of past events) where and when faults will strike. These predictors are imperfect: they only predict a fraction of the failures, called the recall, and among their predictions only a fraction, called the precision, correspond to actual faults. Finally, we also consider fault predictors that do not give the exact moment at which a fault may occur, but instead provide an interval of time during which the fault is likely to strike. In these contexts, we propose efficient algorithms, and give a first-order optimal formula for the amount of work that should be done between two checkpoints.

We then revisit traditional checkpointing and rollback-recovery strategies, with a focus on silent data corruptions. Unlike fail-stop failures, such latent errors cannot be detected immediately, and a mechanism to detect them must be provided. We consider two models: (i) errors are detected after some delay following a probability distribution (typically, an exponential distribution); (ii) errors are detected through some verification mechanism. In both cases, we compute the optimal period that minimizes the waste, i.e., the fraction of time during which nodes do not perform useful computations. In practice, only a fixed number of checkpoints can be kept in memory, and the first model may lead to an irrecoverable failure; in this case, we compute the minimum period required for an acceptable risk. For the second model, there is no risk of irrecoverable failure, owing to the verification mechanism, but the corresponding overhead is accounted for in the waste. Finally, both models are instantiated using realistic scenarios and application/architecture parameters.
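As a rough illustration of the kind of first-order analysis mentioned above (this is the classical Young/Daly approximation, not the refined formulas derived in the thesis), the following Python sketch computes the checkpointing period and the resulting waste for illustrative parameter values:

```python
from math import sqrt

def young_daly_period(mttf: float, c: float) -> float:
    """Classical first-order period T = sqrt(2 * MTTF * C), valid when C << MTTF."""
    return sqrt(2.0 * mttf * c)

def first_order_waste(period: float, mttf: float, c: float, r: float) -> float:
    """First-order waste: a checkpoint cost C paid every period T,
    plus an expected loss of (T/2 + R) per failure, i.e. (T/2 + R)/MTTF per time unit."""
    return c / period + (period / 2.0 + r) / mttf

# Illustrative values (not taken from the thesis): platform MTTF of one hour,
# checkpoint and recovery costs of 60 seconds each.
mttf, c, r = 3600.0, 60.0, 60.0
t_opt = young_daly_period(mttf, c)
print(f"T_opt ~ {t_opt:.0f} s, waste ~ {first_order_waste(t_opt, mttf, c, r):.1%}")
```

This sketch only covers the unpredicted fail-stop case; the thesis refines the period when a fault predictor with a given recall and precision is available.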
While reliability is a major concern for Exascale, another key challenge is to minimize energy consumption, which we address in the second part of the thesis. The principal reason is that Exascale systems are expected to rely on components with requirements similar to those of current Petascale systems, but at an extreme scale, namely 1000 times today's capacity. To be able to bring the necessary power to these systems, the thermal power per $cm^2$ needs to be reduced drastically. One of the most power-consuming components of today's systems is the processor: even when idle, it dissipates a significant fraction of the total power. However, for future Exascale systems, the power dissipated to execute I/O transfers is likely to play an even more important role, because the relative cost of communications is expected to increase dramatically, both in terms of latency and consumed energy.

The speed-scaling technique, also known as dynamic voltage and frequency scaling (DVFS), consists in lowering the voltage of the processor, hence reducing its execution speed. It is commonly assumed that the voltage is proportional to the frequency, implying a dynamic power that is cubic in the frequency. Unfortunately, it has also been pointed out that DVFS increases the probability of failures exponentially, and that this probability cannot be neglected in a large-scale computing framework. In this context, we consider the speed-scaling technique coupled with reliability-increasing techniques such as re-execution, replication, or checkpointing. For these different problems, we propose various algorithms whose efficiency is shown either through thorough simulations or through approximation results relative to the optimal solution. Finally, we consider the different energy costs involved in periodic coordinated checkpointing and compute the optimal period that minimizes energy consumption, as we did for execution time.
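To make the speed-scaling trade-off concrete, here is a minimal, self-contained sketch under simple assumptions: a dynamic power proportional to the cube of the speed, a constant static power, Poisson faults whose rate grows as the speed is lowered, and a crude "one re-execution on failure" policy. The constants and the failure-rate law are illustrative assumptions, not the models calibrated in the thesis.

```python
from math import exp

# Energy to execute `work` operations at normalized speed `s`:
# execution time is work/s, power is p_static + kappa * s^3,
# so the dynamic part of the energy scales as s^2 * work.
def energy(work: float, s: float, p_static: float = 0.1, kappa: float = 1.0) -> float:
    time = work / s
    return (p_static + kappa * s**3) * time

# Assumed failure-rate law (a form commonly used in the DVFS literature):
# the fault rate increases exponentially as the speed is lowered below s_max.
def fault_rate(s: float, lam0: float = 1e-7, d: float = 3.0,
               s_min: float = 0.5, s_max: float = 1.0) -> float:
    return lam0 * 10 ** (d * (s_max - s) / (s_max - s_min))

# Expected energy under a crude policy: if at least one fault strikes during
# the execution (Poisson process), the whole execution is re-executed once.
def expected_energy(work: float, s: float) -> float:
    p_fail = 1.0 - exp(-fault_rate(s) * work / s)
    return energy(work, s) * (1.0 + p_fail)

for s in (0.6, 0.8, 1.0):
    print(f"speed {s:.1f}: expected energy {expected_energy(1e4, s):,.0f}")
```

Lowering the speed reduces the dynamic energy quadratically but lengthens the execution and raises the fault rate; this tension is precisely what the algorithms of this part have to resolve.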
In this thesis, I considered, from a theoretical point of view, two important problems for future so-called Exascale platforms: the restrictions linked to their reliability, as well as the energy constraints. In the first part of this thesis, I studied the optimal placement of checkpoints with the goal of minimizing the total execution time; in particular, I considered periodic and coordinated checkpoints. I considered fault predictors capable of predicting, imperfectly, the faults arriving on the platform, and in this context I designed efficient algorithms to solve my problems. In a second step, I considered silent faults. Such faults can only be detected by a verification mechanism. When one of these faults is detected, the user must roll back to the most recent checkpoint that was not affected by the fault, if such a checkpoint exists! In this context, I again proposed first-order optimal algorithms that combine checkpoints and verification points. In the second part of this thesis, I considered energy problems related to these same platforms. These critical problems must be linked to the reliability problems of the previous part. In this context, I coupled energy-reduction techniques with reliability-increasing techniques such as re-execution, replication, and checkpointing. For these different problems, I provided algorithms whose efficiency has been shown either through simulations or through mathematical proofs.
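As a toy illustration of the rollback mechanism described above (keeping only a fixed number of checkpoints in memory and returning to the most recent one not affected by a detected silent error), here is a minimal Python sketch; the class and its interface are invented for the example and are not taken from the thesis:

```python
from collections import deque

class CheckpointStore:
    """Keep at most `capacity` checkpoints; roll back past a silent error."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.checkpoints = deque()            # (time_taken, state), oldest first

    def save(self, time_taken: float, state) -> None:
        if len(self.checkpoints) == self.capacity:
            self.checkpoints.popleft()        # evict the oldest checkpoint
        self.checkpoints.append((time_taken, state))

    def rollback(self, t_err: float):
        # Discard every checkpoint taken after the silent error struck,
        # since it may have saved an already corrupted state.
        while self.checkpoints and self.checkpoints[-1][0] >= t_err:
            self.checkpoints.pop()
        if not self.checkpoints:
            raise RuntimeError("irrecoverable failure: no clean checkpoint left")
        return self.checkpoints[-1]           # most recent clean checkpoint

store = CheckpointStore(capacity=3)
for t in (10.0, 20.0, 30.0, 40.0):
    store.save(t, state=f"state@{t}")
print(store.rollback(t_err=25.0))             # -> (20.0, 'state@20.0')
```

If every stored checkpoint was taken after the error struck, no clean state remains: this is the irrecoverable-failure situation whose risk is bounded in the first part of the thesis.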
Main file
thesis.pdf (3.41 MB)

Dates and versions

tel-01075111, version 1 (16-10-2014)

License

Attribution (CC BY)

Identifiers

  • HAL Id: tel-01075111, version 1

Cite

Guillaume Aupy. Resilient and energy-efficient scheduling algorithms at scale. Data Structures and Algorithms [cs.DS]. École Normale Supérieure de Lyon, 2014. English. ⟨NNT : 2014ENSL0928⟩. ⟨tel-01075111⟩
