We apologize if you receive multiple copies of this call for
papers.
--------------------------------------------------------------------------------
12th Workshop on Resiliency in High Performance Computing (Resilience)
in Clusters, Clouds, and Grids
<https://www.csm.ornl.gov/srt/conferences/Resilience/2019>
in conjunction with
the 25th International European Conference on Parallel and Distributed
Computing (Euro-Par), Göttingen, Germany
August 26 - 30, 2019
<http://2019.euro-par.org>
Overview:
Resilience is a critical challenge as high performance computing
(HPC) systems continue to increase component counts, individual
component reliability decreases (for example, due to shrinking process
technology and near-threshold voltage (NTV) operation), software
complexity increases, and architectures become more heterogeneous.
Application correctness and execution efficiency, in spite of
frequent faults, errors, and failures, are essential to ensure the
success of the extreme-scale HPC systems, cluster computing
environments, Grid computing infrastructures, and Cloud computing
services.
A fault (e.g., a bug or stuck bit) is the cause of an error; its
manifestation as a state change is considered an error (e.g., a
bad value or incorrect execution); and the transition to an
incorrect service is observed as a failure (e.g., an application
abort or system crash). A failure in a computing system is
typically observed through an application abort or a full/partial
service or system outage. A detectable correctable error is often
transparently handled by hardware, such as a single bit flip in
memory that is protected with single-error correction double-error
detection (SECDED) error correcting code (ECC). A detectable
uncorrectable error (DUE) typically results in a failure, such as
multiple bit flips in the same addressable word that escape SECDED
ECC correction, but not detection, and ultimately cause an
application abort. An undetectable error (UE) may result in silent
data corruption (SDC), e.g., an incorrect application output.
There are many other types of hardware and software faults,
errors, and failures in computing systems.
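The SECDED behavior described above can be illustrated with a minimal
sketch. It assumes a toy extended Hamming code with 4 data bits and 4
check bits; memory ECC typically protects much wider words (for
example, 64 data bits with 8 check bits), and the function names
encode and decode are hypothetical, chosen only for this example.

```python
def encode(nibble: int) -> int:
    """Encode 4 data bits into an 8-bit extended Hamming codeword."""
    bits = [0] * 8  # index 0: overall parity; indices 1..7: Hamming positions
    # Data bits live at the non-power-of-two positions 3, 5, 6, 7.
    bits[3], bits[5], bits[6], bits[7] = [(nibble >> i) & 1 for i in range(4)]
    # Each check bit covers the positions whose index has that bit set.
    bits[1] = bits[3] ^ bits[5] ^ bits[7]
    bits[2] = bits[3] ^ bits[6] ^ bits[7]
    bits[4] = bits[5] ^ bits[6] ^ bits[7]
    # Overall parity bit makes the total number of set bits even.
    bits[0] = sum(bits[1:]) & 1
    return sum(b << i for i, b in enumerate(bits))


def decode(word: int):
    """Return (data, status); data is None for an uncorrectable error."""
    bits = [(word >> i) & 1 for i in range(8)]
    # The syndrome is the XOR of the indices of all set bits in 1..7;
    # it is 0 for a valid codeword and equals the flipped position
    # after a single bit flip.
    syndrome = 0
    for pos in range(1, 8):
        if bits[pos]:
            syndrome ^= pos
    odd_parity = sum(bits) & 1
    if syndrome == 0 and not odd_parity:
        status = "ok"
    elif odd_parity:
        # Single bit flip: correct it (syndrome 0 means the overall
        # parity bit itself was flipped).
        bits[syndrome] ^= 1
        status = "corrected"
    else:
        # Nonzero syndrome with even parity: two bits flipped.
        # Detected, but not correctable (a DUE).
        return None, "uncorrectable"
    data = bits[3] | (bits[5] << 1) | (bits[6] << 2) | (bits[7] << 3)
    return data, status


if __name__ == "__main__":
    word = encode(0b1011)
    print(decode(word))                  # no fault injected
    print(decode(word ^ 0b00100000))     # one bit flipped
    print(decode(word ^ 0b00100100))     # two bits flipped
```

In this sketch, a single flip is corrected transparently and a double
flip is detected but not corrected. Three or more flips can be
miscorrected into a wrong value that is reported as valid, which
corresponds to the silent data corruption case.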
Resilience for HPC systems encompasses a wide spectrum of
fundamental and applied research and development, including
theoretical foundations, fault detection and prediction,
monitoring and control, end-to-end data integrity, enabling
infrastructure, and resilient solvers and algorithm-based fault
tolerance. This workshop brings together experts in the community
to further research and development in HPC resilience and to
facilitate exchanges across the computational paradigms of
extreme-scale HPC, cluster computing, Grid computing, and Cloud
computing.
Submission Guidelines:
Authors are invited to submit papers electronically in English in
PDF format. Submitted manuscripts should be structured as
technical papers and BETWEEN 10 AND 12 PAGES, including figures,
tables and references, using Springer's Lecture Notes in Computer
Science (LNCS) format at
<http://www.springer.com/computer/lncs?SGWID=0-164-6-793341-0>.
Papers with fewer than 10 or more than 12 pages will not be
accepted due to publisher guidelines. Submissions should include
an abstract, keywords, and the e-mail address of the corresponding
author. Papers not conforming to these guidelines may be returned
without review. All manuscripts will be reviewed and will be
judged on correctness, originality, technical strength,
significance, quality of presentation, and interest and relevance
to the conference attendees. Submitted papers must represent
original unpublished research that is not currently under review
for any other conference or journal. Papers not following these
guidelines will be rejected without review and further action may
be taken, including (but not limited to) notifications sent to the
heads of the institutions of the authors and sponsors of the
conference. Submissions received after the due date or not
appropriately structured may also not be considered. The
proceedings will be published in Springer's LNCS as
post-conference proceedings. At least one author of an accepted
paper must register for and attend the workshop for inclusion in
the proceedings. Authors may contact the workshop program chairs
for more information.
Important websites:
- Resilience 2019 Website:
<https://www.csm.ornl.gov/srt/conferences/Resilience/2019>
- Resilience 2019 Submissions:
<https://easychair.org/my/conference.cgi?conf=europar2019workshops>
- Euro-Par 2019 website:
<http://2019.euro-par.org>
Topics of interest include, but are not limited to:
- Theoretical foundations for resilience:
- Metrics and measurement
- Statistics and optimization
- Simulation and emulation
- Formal methods
- Efficiency modeling and uncertainty quantification
- Fault detection and prediction:
- Statistical analyses
- Machine learning
- Anomaly detection
- Data and information collection
- Visualization
- Monitoring and control for resilience:
- Platform and application monitoring
- Response and recovery
- RAS theory and performability
- Application and platform knobs
- Tunable fidelity and quality of service
- End-to-end data integrity:
- Fault tolerant design
- Degraded modes
- Forward migration and verification
- Fault injection
- Soft errors
- Silent data corruption
- Enabling infrastructure for resilience:
- RAS systems
- System software and middleware
- Programming models
- Tools
- Next-generation architectures, including heterogeneous
architectures
- Resilient solvers and algorithm-based fault tolerance:
- Algorithmic detection and correction of hard and soft faults
- Resilient algorithms
- Fault tolerant numerical methods
- Robust iterative algorithms
- Scalability of resilient solvers and algorithm-based fault
tolerance
Important Dates:
- Workshop papers due: May 10, 2019
- Workshop author notification: June 28, 2019
- Workshop author registration: July 15, 2019
- Workshop paper (for informal workshop proceedings): July 22,
2019
- Workshop date: August 26 or 27, 2019
- Workshop camera-ready papers: TBD (after the conference)
General Co-Chairs:
- Stephen L. Scott
Senior Research Scientist - Systems Research Team
Tennessee Tech University and Oak Ridge National Laboratory, USA
scottsl@ornl.gov
- Chokchai (Box) Leangsuksun
SWEPCO Endowed Associate Professor of Computer Science
Louisiana Tech University, USA
box@latech.edu
Program Co-Chairs:
- Patrick G. Bridges
University of New Mexico, USA
bridges@cs.unm.edu
- Christian Engelmann
Oak Ridge National Laboratory, USA
engelmannc@ornl.gov
Program Committee:
- Rizwan Ashraf, Oak Ridge National Laboratory, USA
- Wesley Bland, Intel Corporation, USA
- Hans-Joachim Bungartz, Technical University of Munich, Germany
- Marc Casas, Barcelona Supercomputing Center, Spain
- Robert Clay, Sandia National Laboratories, USA
- Nathan DeBardeleben, Los Alamos National Laboratory, USA
- James Elliott, Sandia National Laboratories, USA
- Kurt Ferreira, Sandia National Laboratories, USA
- Saurabh Hukerikar, NVIDIA, USA
- Dieter Kranzlmueller, Ludwig-Maximilians University of Munich,
Germany
- Ignacio Laguna, Lawrence Livermore National Laboratory, USA
- Scott Levy, University of New Mexico, USA
- Dirk Pflueger, University of Stuttgart, Germany
- Alexander Reinefeld, Zuse Institute Berlin, Germany
- Rolf Riesen, Intel Corporation, USA
- Yves Robert, ENS Lyon, France
- Thomas Ropars, Universite Grenoble Alpes, France
- Martin Schulz, Technical University of Munich, Germany
- Keita Teranishi, Sandia National Laboratories, USA
--
Christian Engelmann, Ph.D.
Senior R&D Staff Scientist
Computer Science Research Group
Computer Science and Mathematics Division
Oak Ridge National Laboratory
Mail: P.O. Box 2008, Oak Ridge, TN 37831-6173, USA
Phone: +1 (865) 574-3132 / Fax: +1 (865) 576-5491
e-Mail: engelmannc@ornl.gov / Home: www.christian-engelmann.info