We apologize if you receive multiple copies of this call for papers.
The workshop paper deadline has been extended to May 11, 2018 (no further
extensions).
--------------------------------------------------------------------------------
11th Workshop on Resiliency in High Performance Computing (Resilience)
in Clusters, Clouds, and Grids
<https://www.csm.ornl.gov/srt/conferences/Resilience/2018>
in conjunction with
the 24th International European Conference on Parallel and Distributed
Computing (Euro-Par), Turin, Italy
August 27 - 31, 2018
<https://europar2018.org>
Overview:
Resilience is a critical challenge as high performance computing (HPC)
systems continue to increase component counts, individual component
reliability decreases (for example, due to shrinking process technology
and near-threshold voltage (NTV) operation), and software complexity
increases. Application correctness and execution efficiency, in spite of
frequent faults, errors, and failures, are essential to ensure the success
of extreme-scale HPC systems, cluster computing environments, Grid
computing infrastructures, and Cloud computing services.
While a fault (e.g., a bug or stuck bit) is the cause of an error, its
manifestation as a state change is considered an error (e.g., a bad value
or incorrect execution), and the transition to an incorrect service is
observed as a failure (e.g., an application abort or system crash). A
failure in a computing system is typically observed through an application
abort or a full/partial service or system outage. A detectable correctable
error is often transparently handled by hardware, such as a single bit flip
in memory that is protected with single-error correction double-error
detection (SECDED) error correcting code (ECC). A detectable uncorrectable
error (DUE) typically results in a failure, such as multiple bit flips in
the same addressable word that escape SECDED ECC correction, but not
detection, and ultimately cause an application abort. An undetectable error
(UE) may result in silent data corruption (SDC), e.g., an incorrect
application output. There are many other types of hardware and software
faults, errors, and failures in computing systems.
Resilience for HPC systems encompasses a wide spectrum of fundamental and
applied research and development, including theoretical foundations, fault
detection and prediction, monitoring and control, end-to-end data integrity,
enabling infrastructure, and resilient solvers and algorithm-based fault
tolerance. This workshop brings together experts in the community to further
research and development in HPC resilience and to facilitate exchanges
across the computational paradigms of extreme-scale HPC, cluster computing,
Grid computing, and Cloud computing.
Submission Guidelines:
Authors are invited to submit papers electronically in English in PDF
format. Submitted manuscripts should be structured as technical papers and
BETWEEN 10 AND 12 PAGES, including figures, tables and references, using
Springer's Lecture Notes in Computer Science (LNCS) format at
<http://www.springer.com/computer/lncs?SGWID=0-164-6-793341-0>.
Papers with fewer than 10 or more than 12 pages will not be accepted due to
publisher guidelines. Submissions should include an abstract, keywords, and
the e-mail address of the corresponding author. Papers not conforming to
these guidelines may be returned without review. All manuscripts will be
reviewed and will be judged on correctness, originality, technical
strength, significance, quality of presentation, and interest and relevance
to the conference attendees. Submitted papers must represent original
unpublished research that is not currently under review for any other
conference or journal. Papers not following these guidelines will be
rejected without review and further action may be taken, including (but not
limited to) notifications sent to the heads of the institutions of the
authors and sponsors of the conference. Submissions received after the due
date or not appropriately structured may also not be considered. The
proceedings will be published in Springer's LNCS as post-conference
proceedings. At least one author of an accepted paper must register for and
attend the workshop for inclusion in the proceedings. Authors may contact
the workshop program chairs for more information.
Important Websites:
- Resilience 2018 Website:
<https://www.csm.ornl.gov/srt/conferences/Resilience/2018>
- Resilience 2018 Submissions:
<https://easychair.org/conferences/?conf=europar2018ws>
- Euro-Par 2018 Website:
<https://europar2018.org>
Topics of interest include, but are not limited to:
- Theoretical foundations for resilience:
  - Metrics and measurement
  - Statistics and optimization
  - Simulation and emulation
  - Formal methods
  - Efficiency modeling and uncertainty quantification
- Fault detection and prediction:
  - Statistical analyses
  - Machine learning
  - Anomaly detection
  - Data and information collection
  - Visualization
- Monitoring and control for resilience:
  - Platform and application monitoring
  - Response and recovery
  - RAS theory and performability
  - Application and platform knobs
  - Tunable fidelity and quality of service
- End-to-end data integrity:
  - Fault tolerant design
  - Degraded modes
  - Forward migration and verification
  - Fault injection
  - Soft errors
  - Silent data corruption
- Enabling infrastructure for resilience:
  - RAS systems
  - System software and middleware
  - Programming models
  - Tools
  - Next-generation architectures
- Resilient solvers and algorithm-based fault tolerance:
  - Algorithmic detection and correction of hard and soft faults
  - Resilient algorithms
  - Fault tolerant numerical methods
  - Robust iterative algorithms
  - Scalability of resilient solvers and algorithm-based fault tolerance
Important Dates:
- Workshop papers due: May 11, 2018 (no further extensions)
- Workshop author notification: June 15, 2018
- Workshop early registration: TBD
- Workshop paper (for informal workshop proceedings): July 6, 2018
- Workshop date: August 27-28, 2018
- Workshop camera-ready papers: October 2, 2018
General Co-Chairs:
- Stephen L. Scott
  Senior Research Scientist - Systems Research Team
  Tennessee Tech University and Oak Ridge National Laboratory, USA
  scottsl@ornl.gov
- Chokchai (Box) Leangsuksun
  SWEPCO Endowed Associate Professor of Computer Science
  Louisiana Tech University, USA
  box@latech.edu
Program Co-Chairs:
- Patrick G. Bridges
  University of New Mexico, USA
  bridges@cs.unm.edu
- Christian Engelmann
  Oak Ridge National Laboratory, USA
  engelmannc@ornl.gov
Program Committee:
- Ferrol Aderholdt, Oak Ridge National Laboratory, USA
- Rizwan Ashraf, Oak Ridge National Laboratory, USA
- Wesley Bland, Intel Corporation, USA
- Hans-Joachim Bungartz, Technical University of Munich, Germany
- Marc Casas, Barcelona Supercomputing Center, Spain
- Zizhong Chen, University of California at Riverside, USA
- Robert Clay, Sandia National Laboratories, USA
- Miguel Correia, Universidade de Lisboa, Portugal
- Nathan DeBardeleben, Los Alamos National Laboratory, USA
- James Elliott, Sandia National Laboratories, USA
- Kurt Ferreira, Sandia National Laboratories, USA
- Saurabh Hukerikar, NVIDIA, USA
- Dieter Kranzlmueller, Ludwig-Maximilians University of Munich, Germany
- Ignacio Laguna, Lawrence Livermore National Laboratory, USA
- Scott Levy, University of New Mexico, USA
- Dirk Pflueger, University of Stuttgart, Germany
- Alexander Reinefeld, Zuse Institute Berlin, Germany
- Rolf Riesen, Intel Corporation, USA
- Yves Robert, ENS Lyon, France
- Thomas Ropars, Universite Grenoble Alpes, France
- Martin Schulz, Technical University of Munich, Germany
- Keita Teranishi, Sandia National Laboratories, USA