Organizers: Behzad Salami (BSC), Leonardo Bautista-Gomez (BSC), Osman Unsal (BSC)
Abstract: FPGAs are becoming increasingly popular in edge, IoT, and cloud platforms, thanks to their higher power efficiency than GPUs and greater flexibility than ASICs. However, energy efficiency and error resiliency, two main concerns for modern FPGAs, have not yet been thoroughly investigated. (i) To improve FPGAs’ energy efficiency, we propose undervolting, i.e., scaling the supply voltage below the nominal level set by the manufacturer. Conservative design margins aim to guarantee FPGAs’ correct execution under various operating conditions, accounting for worst-case voltage noise, harsh environmental conditions, workload variability, and inherent within-die or die-to-die variability. However, such conservative guardbanding of voltage and frequency limits energy efficiency. We will present our recent findings on the voltage characterization of modern FPGAs and show how eliminating these margins improves the energy efficiency of such devices. (ii) To improve FPGAs’ error resiliency, we propose checkpointing support for FPGAs, providing a mechanism in which the developer has a clear overview of the system and a well-defined path to develop applications. Our checkpointing mechanism is based on FTI, our open-source multi-level checkpointing tool. In this mechanism, FTI exploits the fact that data is accessible from the CPU before and after the termination of a task. Therefore, it is sufficient to write data to the checkpoint only when it resides in memory directly accessible from the CPU, and FTI already provides such support. However, compiling an MPI library for FPGA systems is not straightforward, since both MPI and FTI typically must be cross-compiled for the ARM architecture of the processing system (PS) of the FPGA platform. To make our fault tolerance library easier for developers to use, we provide an extension to our build script that semi-automatically builds the library for such systems.
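The checkpointing idea in (ii) can be illustrated with a minimal Python sketch. It is not FTI and does not use FTI's API. The names `run_on_accelerator`, `HostCheckpoint`, `run`, and the `fail_at` parameter are hypothetical and introduced only for illustration, and the accelerator task is simulated on the CPU. The sketch assumes that data is back in host memory after each offloaded task returns, so a checkpoint is written only at that point, and a restarted run resumes from the last saved step.

```python
import os
import pickle
import tempfile


def run_on_accelerator(data):
    # Hypothetical stand-in for an offloaded FPGA task (simulated on the CPU):
    # it doubles every element. While it "runs", data is treated as device-resident.
    return [2 * x for x in data]


class HostCheckpoint:
    """Stores application state only when it resides in CPU-accessible memory."""

    def __init__(self, path):
        self.path = path

    def save(self, step, data):
        # Write to a temporary file, then rename, so a crash mid-write
        # never leaves a truncated checkpoint behind.
        tmp = self.path + ".tmp"
        with open(tmp, "wb") as f:
            pickle.dump({"step": step, "data": data}, f)
        os.replace(tmp, self.path)

    def load(self):
        if not os.path.exists(self.path):
            return None
        with open(self.path, "rb") as f:
            return pickle.load(f)


def run(ckpt, initial, total_steps, fail_at=None):
    # Resume from the last checkpoint if one exists, otherwise start fresh.
    state = ckpt.load()
    if state is not None:
        step, data = state["step"], state["data"]
    else:
        step, data = 0, list(initial)

    while step < total_steps:
        if fail_at is not None and step == fail_at:
            raise RuntimeError("simulated fault")  # injected failure for the demo
        data = run_on_accelerator(data)  # task executes; result returns to host
        step += 1
        ckpt.save(step, data)            # data is host-resident here, so save it
    return data
```

A usage example under the same assumptions: call `run(ckpt, [1, 2], 3, fail_at=2)` with `ckpt = HostCheckpoint(os.path.join(tempfile.mkdtemp(), "ckpt.bin"))`, catch the simulated fault, then call `run(ckpt, [1, 2], 3)` again. The second call resumes from step 2 instead of recomputing the first two tasks.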