RelSim: Computational Framework for Lifetime Reliability Modeling of Heterogeneous Accelerator Systems
The objective of processor design and operation has shifted from simply escalating performance to improving energy efficiency, as physical constraints such as thermal and power limits have become limiting factors. Heterogeneous processor designs are widely adopted across computing systems, from mobile and embedded devices to server-class processors, to enhance energy efficiency and performance. Heterogeneous processing units (PUs) let a system improve computational efficiency by executing workloads on multiple types of PUs specialized for different tasks. Despite this computational advantage, architectural heterogeneity complicates system-level lifetime reliability. Evaluating the reliability of a heterogeneous processor composed of multiple PUs, each subject to diverse failure mechanisms and different operating conditions, is a complex and computationally challenging problem. Pre-silicon analysis makes it possible to predict the lifetime reliability consequences of processor designs and operations under foreseen operating conditions, which helps reduce the development cost and time incurred in post-silicon phases.

RelSim is a framework for the lifetime reliability modeling and evaluation of heterogeneous computing systems. The framework can be flexibly configured to model various heterogeneous processor designs with different mixes of failure models (e.g., electromigration, gate oxide breakdown) and statistical distributions (e.g., lognormal, Weibull) under user-defined execution scenarios. The framework takes three sets of inputs: i) definitions of failure models, ii) system specifications, and iii) use case conditions. The definitions of failure models specify the model parameters and statistical distributions used to simulate failure mechanisms. System specifications include the number of PUs and the size of heterogeneous components, as well as the operating voltage, temperature, and stress time at which the targeted lifetime of a system is defined (i.e., product specifications). Lastly, use case conditions describe the operation scenarios under which system reliability is to be evaluated. RelSim runs a large set of Monte Carlo simulations to estimate the lifetime reliability of the heterogeneous system, and it uses multi-threaded acceleration on CPUs or GPUs to speed up the compute-intensive statistical calculations. The framework is also configurable to simulate a variety of dynamic reliability management (DRM) schemes such as replacement (e.g., spare components), rotation, and k-out-of-n (e.g., graceful degradation) models.
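To make the Monte Carlo idea above concrete, the following is a minimal, single-threaded C++ sketch of estimating the reliability of a k-out-of-n system whose PUs fail according to Weibull or lognormal distributions. It is an illustration only and is not RelSim's code or API: the names PuModel, sample_ttf, and estimate_reliability are hypothetical, and the sketch assumes independent PUs and ignores the voltage, temperature, and stress-time dependence that the framework takes as inputs.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <random>
#include <stdexcept>
#include <vector>

// Hypothetical description of one processing unit (PU); not a RelSim type.
struct PuModel {
    enum class Dist { Weibull, Lognormal };
    Dist dist;
    double p1;  // Weibull: shape;  lognormal: mean of ln(time-to-failure)
    double p2;  // Weibull: scale;  lognormal: std. dev. of ln(time-to-failure)
};

// Draw one time-to-failure (TTF) sample for a PU from its distribution.
double sample_ttf(const PuModel& pu, std::mt19937_64& rng) {
    if (pu.dist == PuModel::Dist::Weibull) {
        std::weibull_distribution<double> d(pu.p1, pu.p2);  // (shape, scale)
        return d(rng);
    }
    std::lognormal_distribution<double> d(pu.p1, pu.p2);    // (mu, sigma)
    return d(rng);
}

// Estimate P(system lifetime > target) for a k-out-of-n system, i.e., the
// system is considered functional while at least k of its n PUs are alive.
// PUs are assumed to fail independently (a simplifying assumption).
double estimate_reliability(const std::vector<PuModel>& pus, std::size_t k,
                            double target, std::size_t trials,
                            std::uint64_t seed) {
    if (k == 0 || k > pus.size() || trials == 0) {
        throw std::invalid_argument("require 1 <= k <= n and trials > 0");
    }
    std::mt19937_64 rng(seed);
    std::vector<double> ttf(pus.size());
    std::size_t survived = 0;
    for (std::size_t t = 0; t < trials; ++t) {
        for (std::size_t i = 0; i < pus.size(); ++i) {
            ttf[i] = sample_ttf(pus[i], rng);
        }
        // The system lifetime equals the k-th largest TTF: once fewer than
        // k PUs remain alive, the system has failed.
        auto kth = ttf.begin() + static_cast<std::ptrdiff_t>(k - 1);
        std::nth_element(ttf.begin(), kth, ttf.end(), std::greater<double>());
        if (*kth > target) {
            ++survived;
        }
    }
    return static_cast<double>(survived) / static_cast<double>(trials);
}
```

As a sanity check, a single Weibull PU with shape 1 and scale 1 is exponentially distributed, so estimate_reliability with k = 1 and target = 1 should approach exp(-1), roughly 0.37, as the number of trials grows. Each trial is independent of the others, which is what makes this kind of simulation a natural fit for the multi-threaded CPU or GPU acceleration described above.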
 

Prerequisites, Download, and Build
RelSim uses g++ and nvcc to compile C++ and CUDA code, respectively. The latest release of the RelSim framework is v1.1 (as of Sept. 2022). To obtain this version of RelSim, use the following git command. Alternatively, you may get the latest stable copy of the RelSim framework from the master branch by omitting the --branch option from the command below.

$ git clone --branch v1.1 https://github.com/yonsei-icsl/relsim

To build and run an example model on NVIDIA GPUs, use the following commands.

$ cd relsim/
$ make
$ ./relsim

Alternatively, RelSim can be built for CPUs by passing the target=cpu option to make.

$ cd relsim/
$ make target=cpu
$ ./relsim

 

Documentation
RelSim v1.1 is the latest release (as of Sept. 2022). For instructions regarding the installation and execution of RelSim, visit the GitHub repository: https://github.com/yonsei-icsl/relsim. A related publication is currently under review.

@article{jung_icsl2022,
    author    = {S. Jung and Y. Chon and J. Hwang and B. Kim and A. Trivedi and W. Song},
    title     = {{Computational Framework for Lifetime Reliability Assessment of Heterogeneous Computing Processors}},
    journal   = {under review},
    month     = {Feb.},
    year      = {2022},
}