-------- Original Message --------
Subject: [computational.science] CFP: The 2nd International Workshop on Resiliency in High Performance Computing (Resilience 2009)
Date: Wed, 28 Jan 2009 20:52:51 -0500
From: Christian Engelmann engelmannc@ornl.gov
Organization: "OptimaNumerics"
To: Computational Science Mailing List computational.science@lists.optimanumerics.com
We apologize if you receive multiple copies of this CFP.
-----------------------------------------------------------------------
Call for Papers
2nd International Workshop on Resiliency in High Performance Computing
(Resilience 2009)
http://xcr.cenit.latech.edu/resilience2009

in conjunction with the
International Symposium on High Performance Distributed Computing (HPDC)
June 9-13, 2009
Munich, Germany
Recent trends in high-performance computing (HPC) systems clearly indicate that future increases in performance, beyond those resulting from improvements in single-processor performance, will be achieved through corresponding increases in system scale, i.e., by using a significantly larger component count. As the raw computational performance of the world's fastest HPC systems increases from today's tera-scale to next-generation peta-scale capability and beyond, their number of computational, networking, and storage components will grow from the ten to one hundred thousand compute nodes of today's systems to several hundred thousand compute nodes and more in the foreseeable future. This substantial growth in system scale, and the resulting component count, poses a challenge for HPC system and application software with respect to fault tolerance and resilience.
Furthermore, recent experiences on extreme-scale HPC systems with non-recoverable soft errors, i.e., bit flips in memory, cache, registers, and logic, have added another major source of concern. The probability of such errors grows not only with system size, but also with increasing architectural vulnerability caused by employing accelerators, such as FPGAs and GPUs, and by shrinking nanometer technology. Reactive fault tolerance technologies, such as checkpoint/restart, are unable to handle high failure rates due to their associated overheads, while proactive resiliency technologies, such as preemptive migration, simply fail because random soft errors cannot be predicted. Moreover, soft errors may even remain undetected, resulting in silent data corruption.
The goal of this workshop is to bring together experts in the area of fault tolerance and resiliency for HPC to present the latest achievements and to discuss the challenges ahead. Accepted papers will be included with the HPDC conference proceedings published by ACM. Resilience 2009 is the follow-on to the successful Resilience 2008 workshop (http://xcr.cenit.latech.edu/resilience2008), held in conjunction with CCGrid in Lyon, France.
Important Dates:
- Paper Submission Deadline : February 25, 2009
- Notification Deadline     : March 18, 2009
- Camera Ready Deadline     : April 2, 2009
Submission Guidelines:
Original, unpublished work is required. Submissions shall be a maximum of 10 ACM SIG style pages (http://www.acm.org/sigs/publications/proceedings-templates), including tables and illustrations. All submitted manuscripts will be reviewed by a distinguished international program committee. Accepted contributions will be published with the HPDC conference proceedings through ACM. Papers should be submitted electronically via https://ssl.linklings.net/conferences/hpdc.
Topics of interest include, but are not limited to:
- Reports on current HPC system and application resiliency
- HPC resiliency metrics and standards
- HPC system and application resiliency analysis
- HPC system and application-level fault handling and anticipation
- HPC system and application health monitoring
- Resiliency for HPC file and storage systems
- System-level checkpoint/restart for HPC
- System-level preemptive migration for HPC
- Algorithm-based resiliency for HPC
- Fault tolerant MPI concepts and solutions
- Soft error detection and recovery in HPC systems
- HPC system and application log analysis
- Statistical methods to identify failure root causes
- Fault injection studies in HPC environments
- High availability solutions for HPC systems
- Reliability and availability analysis
- Hardware for fault detection and recovery
General Co-Chairs:
- Stephen L. Scott
  Computer Science and Mathematics Division
  Oak Ridge National Laboratory
  scottsl@ornl.gov
- Chokchai (Box) Leangsuksun
  SWEPCO Endowed Associate Professor of Computer Science,
  Louisiana Tech University, USA
  box@latech.edu
Program Chair:
- Christian Engelmann
  Computer Science and Mathematics Division
  Oak Ridge National Laboratory
  engelmannc@ornl.gov
Program Committee:
- Ann Gentile, Sandia National Laboratory, USA
- Aurelien Bouteiller, University of Tennessee, USA
- Chokchai (Box) Leangsuksun, Louisiana Tech University, USA
- Christian Engelmann, Oak Ridge National Laboratory, USA
- Daniel S. Katz, Louisiana State University, USA
- Dan Stanzione, Arizona State University, USA
- Franck Cappello, INRIA, France
- Geoffroy Vallee, Oak Ridge National Laboratory, USA
- George Bosilca, University of Tennessee, USA
- George Ostrouchov, Oak Ridge National Laboratory, USA
- Greg Bronevetsky, Lawrence Livermore National Laboratory, USA
- Gregory M. Thorson, SGI, USA
- Hong Ong, Oak Ridge National Laboratory, USA
- Jim Brandt, Sandia National Laboratory, USA
- John T. Daly, Center for Exceptional Computing, USA
- Jon Stearley, Sandia National Laboratory, USA
- Li Ou, Dell, USA
- Mihaela Paun, Louisiana Tech University, USA
- Nathan DeBardeleben, Los Alamos National Laboratory, USA
- Paul Hargrove, Lawrence Berkeley National Laboratory, USA
- Stephen Poole, Oak Ridge National Laboratory, USA
- Stephen L. Scott, Oak Ridge National Laboratory, USA
- Sudharshan Vazhkudai, Oak Ridge National Laboratory, USA
- Thomas Naughton, Oak Ridge National Laboratory, USA
- Tong Liu, Mellanox, USA
- Xian-He Sun, Illinois Institute of Technology, USA
- Xubin (Ben) He, Tennessee Tech University, USA
- Yung-Chin Fang, Dell, USA
- Zhiling Lan, Illinois Institute of Technology, USA