GVR is available as open source software. Please download it now and try it!
Global View Resilience (GVR) is a new approach that exploits a global view data model (global naming of data, consistency, and distributed layout), adding reliability to globally visible distributed arrays. Key novel features in GVR include:
- multi-version arrays, with each array's versioning rate controlled separately by the application (multi-stream)
- flexible multi-version recovery
- unified error signalling and handling for flexible cross-layer error recovery.
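The first two features can be pictured with a toy, single-process sketch. It is not the GVR API: the class `VersionedArray`, its methods `snapshot` and `restore`, and the versioning intervals are all hypothetical names and values chosen for illustration. Each array keeps its own version history, and the application decides how often each one is versioned.

```python
class VersionedArray:
    """Toy stand-in for a multi-version array (hypothetical, not the GVR API)."""

    def __init__(self, size, fill=0.0):
        self.data = [fill] * size   # current working copy
        self.versions = []          # (label, snapshot) pairs, oldest first

    def snapshot(self, label):
        """Record the current contents as a new version."""
        self.versions.append((label, list(self.data)))

    def restore(self, steps_back=1):
        """Roll the working copy back; steps_back=1 is the newest version."""
        label, saved = self.versions[-steps_back]
        self.data = list(saved)
        return label


def run(iterations=12):
    # Two arrays versioned at independent, application-chosen rates.
    positions = VersionedArray(4)
    forces = VersionedArray(4)
    for it in range(1, iterations + 1):
        positions.data = [x + 1.0 for x in positions.data]
        forces.data = [it * 0.5] * 4
        if it % 2 == 0:     # assumed: this array is versioned often
            positions.snapshot(it)
        if it % 6 == 0:     # assumed: this array is versioned rarely
            forces.snapshot(it)
    return positions, forces


if __name__ == "__main__":
    positions, forces = run()
    print(len(positions.versions), len(forces.versions))
    # Multi-version recovery: go back two versions instead of only the latest.
    print(positions.restore(steps_back=2), positions.data)
```

Keeping several versions per array is what lets recovery reach past the most recent snapshot, which matters when an error is detected only after newer versions have already been taken.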
With a global versioned array as a portable abstraction, GVR
enables application programmers to manage reliability (and its
overhead) in a flexible, portable fashion, tapping their deep
scientific and application code insights. We will research
algorithms and a runtime that map and adapt the
application/system’s reliability deployment based on
application-specified reliability priorities. The unified
error handling framework enables application error detection
(checking) and recovery routines that handle diverse classes
of errors with a single application recovery routine. This
architecture enables applications and systems to work in
concert, exploiting semantics (algorithmic or even scientific
domain) and key capabilities (e.g., fast error detection in
hardware) to dramatically increase the range of errors that
can be detected and corrected.
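The idea of routing errors from different layers into one application recovery routine can be sketched as follows. This is an illustrative model only, not GVR's interface: `ErrorSignal`, `application_check`, `run_with_recovery`, and the injected faults are hypothetical. A simulated lower-layer report and an application-level check both raise the same kind of signal, and a single recovery path restores the last good state and retries the step.

```python
import math


class ErrorSignal(Exception):
    """Hypothetical uniform error descriptor shared by all layers."""

    def __init__(self, layer, detail):
        super().__init__(f"{layer}: {detail}")
        self.layer = layer
        self.detail = detail


def application_check(values):
    """Application-level detector: flag non-finite entries."""
    if any(not math.isfinite(v) for v in values):
        raise ErrorSignal("application", "non-finite value in state")


def run_with_recovery(steps, hardware_faults=(), silent_corruptions=()):
    pending_hw = set(hardware_faults)       # steps with a simulated reported error
    pending_sdc = set(silent_corruptions)   # steps with a simulated silent corruption
    state = [0.0, 0.0]
    last_good = list(state)
    log = []
    step = 1
    while step <= steps:
        try:
            state = [v + 1.0 for v in state]
            if step in pending_sdc:
                pending_sdc.discard(step)
                state[0] = float("nan")     # corruption only the application check sees
            if step in pending_hw:
                pending_hw.discard(step)
                raise ErrorSignal("hardware", f"error reported at step {step}")
            application_check(state)
        except ErrorSignal as err:
            # Single recovery routine, regardless of which layer signalled.
            log.append((step, err.layer))
            state = list(last_good)
            continue                        # retry the same step
        last_good = list(state)
        step += 1
    return state, log


if __name__ == "__main__":
    print(run_with_recovery(5, hardware_faults=(2,), silent_corruptions=(4,)))
```

Because both detectors raise the same signal type, adding another detector does not require a new recovery routine in this sketch.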
Publications
- Aiman Fang, "Application-based Focused Recovery (ABFR): Convenient Management of Latent Error Resilience Using Application Knowledge", PhD Dissertation, University of Chicago, Department of Computer Science, July 23, 2018.
- Aiman Fang and Andrew A. Chien, "ABFR: Convenient
Management of Latent Error Resilience using Application
Knowledge", in ACM/IEEE Conference on High-Performance Distributed
Computing, June 2018, Tempe, AZ.
- A. Cavelan, A. Fang, A. Chien, and Y. Robert.
Resilient N-Body Tree Computations with Algorithm-Based
Focused Recovery: Model and Performance Analysis, in
8th International Workshop on Performance Modeling,
Benchmarking and Simulation of High Performance Computer
Systems (PMBS17) held as part of SC17, Denver, CO, USA,
November 2017.
- Aiman Fang, Aurelien Cavelan, Yves Robert, and Andrew
A. Chien, Resilience for Stencil Computations
with Latent Errors, in International Conference on Parallel
Processing (ICPP), Bristol, United Kingdom, August 2017.
- Anshu Dubey, Hajime Fujita, Daniel T. Graves, Andrew Chien, and Devesh Tiwari.
Granularity and the cost of error recovery in resilient AMR scientific
applications. In Proceedings of the International Conference for
High Performance Computing, Networking, Storage and Analysis (SC '16).
- Nan Dun, Dirk Pleiter, Aiman Fang, Nicolas Vandenbergen, and Andrew A. Chien,
Multi-Versioning Performance Opportunities in BGAS System for Resilience,
International Supercomputing Conference (ISC), June 2016, Frankfurt, Germany.
- Hajime Fujita, Kamil Iskra, Pavan Balaji, and Andrew A. Chien,
Versioning Architectures for Local and Global Memory,
in Proceedings of the International Conference on Parallel and Distributed Systems (ICPADS), December 2015, Melbourne, Australia.
- Aiman Fang, Hajime Fujita, and Andrew A. Chien,
Towards Understanding Post-Recovery Efficiency for Shrinking and Non-Shrinking Recovery,
in Proceedings of the 8th Workshop on Resiliency in High Performance Computing (Resilience) in Clusters, Clouds, and Grids, at Euro-Par 2015, Vienna, Austria, August 24, 2015
- Anshu Dubey, Hajime Fujita, Zachary Rubenstein, Brian Van Straalen, and Andrew Chien.
A Case Study Of Application Structure Aware Resilience Through Differentiated State Saving And Recovery,
in Proceedings of the 8th Workshop on Resiliency in High Performance Computing (Resilience) in Clusters, Clouds, and Grids, at Euro-Par 2015, Vienna, Austria, August 24, 2015
- Hajime Fujita, Kamil Iskra, Pavan Balaji, and Andrew A. Chien,
Empirical Characterization of Versioning Architectures,
in Proceedings of IEEE Cluster, September 8-10, 2015, Chicago.
- A. Chien, P. Balaji, N. Dun, A. Fang, H. Fujita, K. Iskra, Z. Rubenstein, Z. Zheng, J. Hammond, I. Laguna, D. Richards, A. Dubey, B. van Straalen, M. Hoemmen, M. Heroux, K. Teranishi, A. Siegel.
Exploring Versioning for Resilience in Scientific Applications: Global-view Resilience,
submitted March 2015,
published in International Journal of High-Performance Computing Applications
(IJHPCA), September 2016.
(Best overall project summary)
- Aiman Fang and Andrew A. Chien,
How Much SSD Is Useful for Resilience in Supercomputers,
in ACM Symposium on Fault-tolerance at Extreme-Scale (FTXS) associated with HPDC 2015, Portland, Oregon, June 15, 2015
(Slides)
- Aiman Fang, "How Much SSD Is Useful for Resilience in Supercomputers", Master's Thesis, Department of Computer Science, University of Chicago, April 2015.
- Nan Dun, Hajime Fujita, John R. Tramm, Andrew A. Chien, Andrew R. Siegel,
Data Decomposition in Monte Carlo Neutron Transport Simulations using Global View Arrays,
International Journal of High Performance Computing Applications, March 2015.
- A. Chien, P. Balaji, P. Beckman, N. Dun, A. Fang, H. Fujita, K. Iskra, Z. Rubenstein, Z. Zheng, R. Schreiber, J. Hammond, J. Dinan, I. Laguna, D. Richards, A. Dubey, B. van Straalen, M. Hoemmen, M. Heroux, K. Teranishi, A. Siegel, and J. Tramm,
Versioned Distributed Arrays for Resilience in Scientific Applications: Global View Resilience,
in International Conference on Computational Science (ICCS 2015), Reykjavik, Iceland, June 2015.
- Hajime Fujita, Nan Dun, Zachary Rubenstein, and Andrew A. Chien. Log-Structured Global Array for Efficient Multi-Version Snapshots, IEEE CCGrid 2015, May 2015. Also UChicago CS Tech Report 2014-16, November 2014.
- Hajime Fujita, Nan Dun, Aiman Fang, Zachary A. Rubenstein, Ziming
Zheng, Kamil Iskra, Jeff Hammond, Anshu Dubey, Pavan Balaji, Andrew A.
Chien,
Using Global View Resilience (GVR) to add Resilience to Exascale Applications,
SC14, Nov 2014 (Best Poster Finalist!)
- The GVR Team,
Global View Resilience (GVR) Documentation, Release 1.0,
University of Chicago, Computer Science Technical Report 2014-10.
- Nan Dun, Hajime Fujita, John Tramm, Andrew A. Chien, and Andrew R. Siegel. Data Decomposition in Monte Carlo Particle Transport Simulations using Global View Arrays, UChicago CS Tech Report 2014-09, May 2014.
- The GVR Team,
How Applications Use GVR: Use Cases,
University of Chicago, Computer Science Technical Report 2014-06.
- The GVR Team,
Global View Resilience, API Documentation R0.8.1-rc0,
University of Chicago, Computer Science Technical Report 2014-05.
- Aiman Fang and Andrew A. Chien,
Applying GVR to Molecular Dynamics: Enabling Resilience for Scientific Computations,
Tech Report, University of Chicago, Dept of Computer Science,
CS-TR-2014-04,
April 2014.
- Ziming Zheng, Andrew A. Chien, Keita Teranishi, "Fault Tolerance in an Inner-Outer Solver: a GVR-enabled Case Study", in Proceedings of VECPAR 2014, July 2014, Eugene, Oregon. Proceedings available from Springer-Verlag Lecture Notes in Computer Science.
- Z. Rubenstein, Error Checking and Snapshot-based Recovery in Preconditioned Conjugate Gradient Solver, Master's Thesis, University of Chicago, Department of Computer Science, March 2014.
- Z. Rubenstein, J. Dinan, H. Fujita, Z. Zheng, A. Chien,
Error Checking and Snapshot-Based Recovery in a
Preconditioned Conjugate Gradient Solver,
University of Chicago, Department of Computer Science Technical Report 2013-11, December 2013
- Wesley Bland, Aurelien Bouteiller, Thomas Herault, Joshua Hursey, George Bosilca, and Jack J. Dongarra. An evaluation of User-Level Failure Mitigation support in MPI. Computing, 95(12):1171–1184, 2013.
- Ziming Zheng, Zachary Rubenstein, and Andrew A. Chien,
GVR-Enabled Trilinos: An Outside-In Approach for Resilient Computing, in the
SIAM Conference on Parallel Processing, February 2014, Portland, Oregon.
- Ziming Zheng, Andrew A. Chien, Mark Hoemmen, Keita Teranishi, "Fault Tolerance in an Inner-Outer Solver: a GVR-enabled Case Study", available as Technical Report from University of Chicago Department of Computer Science, CS-TR-2014-01, January 2014.
- Guoming Lu, Ziming Zheng, and Andrew A. Chien,
When are Multiple Checkpoints Needed?,
in 3rd Workshop for Fault-tolerance at Extreme Scale (FTXS), at IEEE Conference on High Performance Distributed Computing, June 2013, New York, New York.
- Hajime Fujita, Robert Schreiber, Andrew A. Chien,
It's Time for New Programming Models for Unreliable Hardware, in
ASPLOS 2013 Provocative Ideas session, March 18, 2013.
- Sean Hogan, Jeff Hammond, and Andrew A. Chien, An Evaluation of Difference and Threshold Techniques for Efficient Checkpointing, 2nd Workshop on Fault-Tolerance at Extreme Scale
FTXS 2012 at DSN 2012, June 2012, Boston, Massachusetts.