Recently, system integrators have dramatically increased their efforts in heterogeneous computing by integrating heterogeneous cores on die (ARM), utilizing general-purpose GPUs (NVIDIA), combining CPUs and GPUs on the same die (Intel, AMD), leveraging FPGAs (Altera, Xilinx), integrating CPUs with FPGAs (Xilinx), and coupling FPGAs and CPUs in the same package (IBM-Altera, Intel-Altera). Heterogeneity aims to solve the problems associated with the end of Moore's Law by incorporating more specialized compute units in the system hardware and by utilizing the most efficient compute unit. However, while software-stack support for heterogeneity is relatively well developed for performance, software-stack support for power- and energy-efficient computing is severely lacking. The primary ambition of the LEGaTO project is to address this challenge by starting with a mature Made-in-Europe software stack and optimizing it to support energy-efficient computing on a commercial, cutting-edge, European-developed CPU-GPU-FPGA heterogeneous hardware substrate. In this talk I will present examples of how FPGAs are being utilized in the LEGaTO project for energy savings, with specific cases covering programming environment support, FPGA undervolting, flexible communication for cloud-to-edge HPC, and FPGA checkpointing.
Organizers: Behzad Salami (BSC), Leonardo Bautista-Gomez (BSC), Osman Unsal (BSC)
Abstract: FPGAs are becoming increasingly popular in edge, IoT, and cloud platforms, thanks to their higher power efficiency than GPUs and greater flexibility than ASICs. However, energy efficiency and error resiliency are two main concerns for modern FPGAs that have not yet been thoroughly investigated. (i) To improve FPGAs' energy efficiency, we propose undervolting, i.e., scaling the supply voltage below the nominal level set by the manufacturer. Conservative design margins aim to guarantee correct execution under various operating conditions, accounting for worst-case voltage noise, harsh environmental conditions, workload variability, and inherent within-die or die-to-die variability. However, such conservative guard-banding of voltage and frequency limits energy efficiency. We will present our recent findings on the voltage characterization of modern FPGAs and show how eliminating these margins improves the energy efficiency of such devices. (ii) To improve FPGAs' error resiliency, we propose checkpointing support for FPGAs, providing a mechanism in which the developer has a clear overview of the system and a well-defined path to develop applications. Our checkpointing mechanism is based on FTI, our open-source multi-level checkpointing tool. In this mechanism, FTI exploits the fact that data is accessible from the CPU before and after the execution of a task. It is therefore sufficient to write data to the checkpoint only while it resides in memory directly accessible from the CPU, and FTI already supports this case. However, compiling an MPI library for FPGA systems is not straightforward, since both MPI and FTI typically have to be cross-compiled for the ARM architecture of the processing system (PS) of the FPGA platform. To make it easier for developers to use our fault-tolerance library, we provide an extension to our build script that semi-automatically builds the library for such a system.
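The task-boundary idea behind the checkpointing mechanism can be illustrated with a minimal sketch. It is written in Python rather than against FTI's actual C API, and every name in it (`save_checkpoint`, `load_checkpoint`, `offloaded_task`, `run`, the `fail_at` parameter) is hypothetical. The accelerator is simulated by a plain function, and the only assumption carried over from the abstract is that data is visible in host memory before and after each offloaded task, so that is the only place a checkpoint is taken.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Atomically write host-resident state to a checkpoint file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    # Rename last, so a crash mid-write never leaves a partial checkpoint.
    os.replace(tmp, path)


def load_checkpoint(path):
    """Return the last saved state, or None if no checkpoint exists."""
    if not os.path.exists(path):
        return None
    with open(path) as f:
        return json.load(f)


def offloaded_task(data):
    """Stand-in for an accelerator kernel (hypothetical): takes input from
    host memory and returns the result to host memory."""
    return [x + 1 for x in data]


def run(path, initial, total_steps, fail_at=None):
    """Run tasks one after another, checkpointing only at task boundaries."""
    state = load_checkpoint(path) or {"step": 0, "data": list(initial)}
    while state["step"] < total_steps:
        if fail_at is not None and state["step"] == fail_at:
            raise RuntimeError("simulated failure")
        # While the task runs, its data may live on the device; nothing is
        # checkpointed here.
        state["data"] = offloaded_task(state["data"])
        state["step"] += 1
        # The result is back in host memory, so it is safe to checkpoint.
        save_checkpoint(path, state)
    return state["data"]
```

Calling `run` again with the same checkpoint path after a failure resumes from the last completed task instead of starting over. A real deployment would rely on the checkpointing library's own protect and snapshot calls instead of this hand-written file handling.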