[Cfp] [Fwd: [computational.science] 1st CFP: Resilience@CCGrid 2010]

19 Oct 2009


      -------- Original Message --------
Subject: 	[computational.science] 1st CFP: Resilience@CCGrid 2010
Date: 	Sun, 18 Oct 2009 16:35:05 -0400
From: 	Christian Engelmann <engelmannc@ornl.gov>
Organization: 	"ICCSA"
To: 	Computational Science Mailing List <computational.science@lists.iccsa.org>
Call for Papers
3rd International Workshop on Resiliency in High Performance Computing
                            (Resilience 2010)
               http://xcr.cenit.latech.edu/resilience2010
                         in conjunction with the
                10th IEEE/ACM International Symposium on
             Cluster, Cloud and Grid Computing (CCGrid 2010)
                  http://www.manjrasoft.com/ccgrid2010
             May 17-20, 2010, Melbourne, Victoria, Australia
Clusters, Clouds, and Grids are three different computational paradigms
with the intent or potential to support High Performance Computing
(HPC). Currently, they consist of hardware, management, and usage
models particular to different computational regimes, e.g., high
performance cluster systems designed to support tightly coupled
scientific simulation codes typically utilize high-speed interconnects,
whereas commercial cloud systems designed to support software as a
service (SaaS) do not. However, in order to support HPC, all must at least
utilize large numbers of resources and hence effective HPC in any of
these paradigms must address the issue of resiliency at large-scale.
Recent trends in HPC systems have clearly indicated that future
increases in performance, in excess of those resulting from
improvements in single-processor performance, will be achieved through
corresponding increases in system scale, i.e., using a significantly
larger component count. As the raw computational performance of these
HPC systems increases from today's tera- and peta-scale to
next-generation multi peta-scale capability and beyond, their number of
computational, networking, and storage components will grow from the
ten-to-one-hundred thousand compute nodes of today's systems to several
hundreds of thousands of compute nodes and more in the foreseeable
future. This substantial growth in system scale, and the resulting
component count, poses a challenge for HPC system and application
software with respect to fault tolerance and resilience.
Furthermore, recent experiences on extreme-scale HPC systems with
non-recoverable soft errors, i.e., bit flips in memory, cache,
registers, and logic, have added another major source of concern. The
probability of such errors not only grows with system size, but also
with increasing architectural vulnerability caused by employing
accelerators, such as FPGAs and GPUs, and by shrinking nanometer
technology. Reactive fault tolerance technologies, such as
checkpoint/restart, are unable to handle high failure rates due to
associated overheads, while proactive resiliency technologies, such as
migration, simply fail as random soft errors can't be predicted.
Moreover, soft errors may even remain undetected, resulting in silent
data corruption.
Resilience 2010 is the follow-on workshop to the successful Resilience
2009 held with HPDC in Munich, Germany, and the earlier Resilience 2008
held in conjunction with CCGrid in Lyon, France.
Important Web sites:
- Resilience 2010 : http://xcr.cenit.latech.edu/resilience2010
- CCGrid 2010     : http://www.manjrasoft.com/ccgrid2010
Prior workshop Web sites:
- Resilience 2009 : http://xcr.cenit.latech.edu/resilience2009
- Resilience 2008 : http://xcr.cenit.latech.edu/resilience2008
Important dates:
- Paper submission deadline : December 6, 2009 (firm)
- Notification deadline     : December 18, 2009
- Camera ready deadline     : January 25, 2010
Submission guidelines:
Authors are invited to submit papers electronically. Submitted
manuscripts should be structured as technical papers and may not exceed
6 letter size (8.5 x 11) pages including figures, tables and references
using the IEEE format for conference proceedings (print area of 6-1/2
inches (16.51 cm) wide by 8-7/8 inches (22.51 cm) high, two-column
format with columns 3-1/16 inches (7.85 cm) wide with a 3/8 inch (0.81
cm) space between them, single-spaced 10-point Times fully justified
text). Submissions not conforming to these guidelines may be returned
without review. Authors should submit the manuscript in PDF format and
make sure that the file will print on a printer that uses letter size
(8.5 x 11) paper. The official language of the meeting is English. All
manuscripts will be reviewed and will be judged on correctness,
originality, technical strength, significance, quality of presentation,
and interest and relevance to the conference attendees.
Submitted papers must represent original unpublished research that is
not currently under review for any other conference or journal. Papers
not following these guidelines will be rejected without review and
further action may be taken, including (but not limited to)
notifications sent to the heads of the institutions of the authors and
sponsors of the conference. Submissions received after the due date,
exceeding the length limit, or not appropriately structured may also not be
considered. At least one author of an accepted paper must register for
and attend the workshop. Authors may contact the workshop program chair
for more information. The proceedings will be published through the
IEEE Computer Society Press, USA and will be made available online
through the IEEE Digital Library.
Papers should be submitted electronically in the IEEE conference
proceedings style as PDF to the workshop submission Web site at
http://www.easychair.org/conferences/?conf=resilience2010. For
manuscript preparation with LaTeX, use the newer unofficial template on
CTAN at http://www.ctan.org/tex-archive/macros/latex/contrib/IEEEconf or
the older official IEEE conference proceedings template available at
ftp://pubftp.computer.org/Press/Outgoing/proceedings/IEEE_CS_Latex8.5x11.zip.
For Microsoft Word, use the official proceedings template available at
ftp://pubftp.computer.org/Press/Outgoing/proceedings/instruct8.5x11.doc.
Topics of interest include, but are not limited to:
- Reports on current HPC system and application resiliency
- HPC resiliency metrics and standards
- HPC system and application resiliency analysis
- HPC system and application-level fault handling and anticipation
- HPC system and application health monitoring
- Resiliency for HPC file and storage systems
- System-level checkpoint/restart for HPC
- System-level migration for HPC
- Algorithm-based resiliency fundamentals for HPC (not Hadoop)
- Fault tolerant MPI concepts and solutions
- Soft error detection and recovery in HPC systems
- HPC system and application log analysis
- Statistical methods to identify failure root causes
- Fault injection studies in HPC environments
- High availability solutions for HPC systems
- Reliability and availability analysis
- Hardware for fault detection and recovery
- Resource management for system resiliency and availability
General Co-Chairs:
- Stephen L. Scott
  Computer Science and Mathematics Division
  Oak Ridge National Laboratory, USA
  scottsl@ornl.gov
- Chokchai (Box) Leangsuksun
  SWEPCO Endowed Associate Professor of Computer Science
  Louisiana Tech University, USA
  box@latech.edu
Program Chair:
- Christian Engelmann
  Computer Science and Mathematics Division
  Oak Ridge National Laboratory, USA
  engelmannc@ornl.gov
Publication Co-Chairs:
- James Brandt
  Sandia National Laboratories, USA
  brandt@sandia.gov
- Ann Gentile
  Sandia National Laboratories, USA
  gentile@sandia.gov
Program Committee:
- George Bosilca, University of Tennessee, USA
- Greg Bronevetsky, Lawrence Livermore National Laboratory, USA
- Franck Cappello, INRIA Paris, France
- Kasidit Chanchio, Thammasat University, Thailand
- Zizhong Chen, Colorado School of Mines, USA
- Nathan DeBardeleben, Los Alamos National Laboratory, USA
- Christian Engelmann, Oak Ridge National Laboratory, USA
- Yung-Chin Fang, Dell, USA
- Ann Gentile, Sandia National Laboratories, USA
- Paul Hargrove, Lawrence Berkeley National Laboratory, USA
- Xubin He, Tennessee Tech University, USA
- Daniel S. Katz, University of Chicago, USA
- Dieter Kranzlmueller, LMU/LRZ Munich, Germany
- Zhiling Lan, Illinois Institute of Technology, USA
- Chokchai (Box) Leangsuksun, Louisiana Tech University, USA
- Celso Mendes, University of Illinois at Urbana Champaign, USA
- Christine Morin, INRIA Rennes, France
- Thomas Naughton, Oak Ridge National Laboratory, USA
- George Ostrouchov, Oak Ridge National Laboratory, USA
- Li Ou, Dell, USA
- DK Panda, The Ohio State University, USA
- Mihaela Paun, Louisiana Tech University, USA
- Rolf Riesen, Sandia National Laboratories, USA
- Stephen L. Scott, Oak Ridge National Laboratory, USA
- Dan Stanzione, Texas Advanced Computing Center, USA
- Jon Stearley, Sandia National Laboratories, USA
- Xian-He Sun, Illinois Institute of Technology, USA
- Gregory M. Thorson, SGI, USA
- Geoffroy Vallee, Oak Ridge National Laboratory, USA
- Sudharshan Vazhkudai, Oak Ridge National Laboratory, USA
-- 
-----------------------------------------------------------------------
Dr. Christian Engelmann                        Phone: +1 (865) 574-3132
Research and Development Staff Member            Fax: +1 (865) 576-5491
Oak Ridge National Laboratory                    One Bethel Valley Road
mailto:engelmannc@ornl.gov                       P.O. Box 2008, MS-6173
http://www.csm.ornl.gov/~engelman              Oak Ridge, TN 37831, USA
-----------------------------------------------------------------------

