Large Scale HPC
FPGA acceleration has become a key enabler for solving today's bleeding-edge computational problems. Deep learning with neural networks, data mining, cloud computing and scientific research are just a few of the fields where traditional servers lack computational power despite consuming a lot of energy. Recent tremendous advances in FPGA technology have opened the door to its use in HPC applications. Aldec's scalable FPGA accelerators are ideal for large-scale HPC applications. The current generation of FPGA accelerator boards features low-power Xilinx® UltraScale™ FPGAs, which provide outstanding computational capabilities with a power efficiency not achievable with GPU-based accelerators.

FPGA Accelerators
HES-XCVU9P-QDR - low-profile form factor board with PCIe x16 that can be installed directly inside servers used in data centers. On this board, the FPGA is paired with high-bandwidth QDR-II+ memories that provide high throughput for algorithm acceleration.
HES-XCVU9P-ZU7EV - board with a separate host interface chip, a Xilinx Zynq UltraScale+ XCZU7, and a second FPGA, a Xilinx UltraScale+ XCVU9P, dedicated entirely to the user's application. Its logic resources include a large number of DSP blocks (6840), making it ideal for DSP and computer vision applications.
HES-US-440 - stand-alone board with an external PCIe x8 cable connection. It contains the largest Xilinx Virtex UltraScale device, with an unprecedented capacity of 5.5 million logic cells, up to 64 GB of DDR4 memory in two modules, and fast RLDRAM. It is intended for accelerating very complex algorithms, or algorithms that can benefit from a large number of replicated instances of the kernel.
HES-XCKU11P-DDR4 - low-profile form factor board with PCIe x16 that can be installed directly inside servers used for HPC/HFT. It includes a Kintex UltraScale+ FPGA, from the family with the best price/performance/watt balance. Two QSFP-DD ports provide high-bandwidth, low-latency communication (up to 400 Gbps).
Feature | HES-XCVU9P-QDR | HES-XCVU9P-ZU7EV | HES-US-440 | HES-XCKU11P-DDR4
Logic Cells | 2.5 million | 2.5 million | 5.5 million | 653,000
DSP Blocks | 6840 | 6840 | 2880 | 2928
On-chip RAM | 75.9 Mb BlockRAM, 270 Mb UltraRAM | 75.9 Mb BlockRAM, 270 Mb UltraRAM | 88.6 Mb BlockRAM | 21.1 Mb BlockRAM, 22.5 Mb UltraRAM
Off-chip memory | 432 Mb QDR-II (3x 144 Mb); or, in the *-DDR version, 32 Gb DDR4 (2x 16Gb) and 144 Mb QDR-II | 32GB DDR4 (2x 16GB), 2x 576Mb RLD3 | 32GB DDR4 (2x 16GB), 1152 Mb RLD3 (2x 576Mb) | SODIMM DDR4 memory socket, 512 Mb Flash memory, 2x 64 kb I2C EEPROM
Host Interface | PCI Express x16, gen3 | PCI Express x8, gen3 (Zynq UltraScale+ XCZU7) | PCI Express x8, gen2 (Zynq-7000 XC7Z100) | PCIe x16 gen3 endpoint or PCIe x8 gen4; 2x QSFP-DD (total up to 400 Gbps)

Host Interface
Connecting the FPGA accelerator board to a host workstation via PCIe is not trivial and, if done from the ground up, requires extensive knowledge of hardware design. Software developers need a ready-to-use hardware platform that spares them low-level hardware integration work. With this use model in mind, Aldec provides the HES Proto-AXI interface, which hides low-level PCI Express implementation details and saves development time. The user receives the HES Proto-AXI IP core, which is based on the AMBA AXI standard and bridges accelerated algorithm kernels to the PCIe bus of the host computer. HES Proto-AXI has been optimized to achieve data throughput above 2 GB/s for transfers between the host and the HES board. It provides an easy-to-use memory-mapped interface for integration with the compute device, and it can also be easily converted to a streaming AXI interface. HES Proto-AXI also simplifies the use of external on-board memories such as DDR3, DDR4 or QDR-II: it contains the appropriate controllers and provides memory access through the same AXI interface.

Quick Integration
An algorithm can be converted for the FPGA directly from C using Xilinx High Level Synthesis (HLS) or similar tools, and then easily integrated with the HES Proto-AXI infrastructure.
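As an illustration of that starting point, here is a minimal sketch of a C kernel prepared for Vivado HLS. The function `vadd`, its arguments, the `MAX_N` bound and the bundle names `gmem` and `control` are hypothetical. The `#pragma HLS INTERFACE` directives are Vivado HLS syntax that requests AXI4 master ports for the arrays and an AXI4-Lite slave port for the scalar argument and block control; whether this exact interface mix is what HES Proto-AXI expects is an assumption. A regular C compiler ignores the HLS pragmas, so the same source also serves as a software reference model.

```c
#include <assert.h>

/* Hypothetical upper bound on the vector length (assumption). */
#define MAX_N 1024

/* Hypothetical example kernel: out[i] = a[i] + b[i].
 * The interface choice below is an assumption, not taken from Aldec
 * documentation. Outside Vivado HLS the pragmas are ignored. */
void vadd(const int *a, const int *b, int *out, int n)
{
/* AXI4 master ports: the kernel reads and writes buffers in memory. */
#pragma HLS INTERFACE m_axi port=a offset=slave bundle=gmem
#pragma HLS INTERFACE m_axi port=b offset=slave bundle=gmem
#pragma HLS INTERFACE m_axi port=out offset=slave bundle=gmem
/* AXI4-Lite slave: buffer addresses, n, and start/done control. */
#pragma HLS INTERFACE s_axilite port=a bundle=control
#pragma HLS INTERFACE s_axilite port=b bundle=control
#pragma HLS INTERFACE s_axilite port=out bundle=control
#pragma HLS INTERFACE s_axilite port=n bundle=control
#pragma HLS INTERFACE s_axilite port=return bundle=control

    /* Precondition, checked when the kernel runs as plain C. */
    assert(n >= 0 && n <= MAX_N);

    for (int i = 0; i < n; i++) {
#pragma HLS PIPELINE II=1
        out[i] = a[i] + b[i];
    }
}
```

Because the pragmas have no effect outside HLS, the kernel can be called and tested on the host like any other C function before it is synthesized.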
In addition, the provided high-level C API is easy to use on both Linux and Windows, and there is no need to develop low-level PCIe drivers.

An example HPC design flow is based on the Xilinx Vivado HLS tool, which compiles C code directly into a hardware description language (HDL) design that runs in the FPGA. The flow is divided into five stages and, as you will see, it is well integrated with the Aldec HPC platform components.

Convert
The program or algorithm to be accelerated is partitioned into two parts: one designated for acceleration and one that runs on the host. The partitioning can be based on profiling results that identify the computationally intensive pieces of the C code. Next, the Xilinx Vivado HLS tool converts that C code to Verilog or VHDL RTL code suitable for further automatic processing (synthesis and implementation in the FPGA). The user should choose to include an AMBA AXI interface in the RTL code, as it is required for the next stage.

Integrate
Once the HDL code is available, it needs to be integrated with Aldec HES Proto-AXI, that is, its AMBA AXI ports are connected to HES Proto-AXI. An HDL editor, such as the one in Aldec's Riviera-PRO, is sufficient for this stage. Concurrently, the main application intended to run on the host computer is modified to replace calls to the algorithm functions with counterparts that use the FPGA via the HES Proto-AXI API.

Simulate
Before running the whole project on the FPGA accelerator board, you can check it for integration and connectivity mistakes using Aldec's high-performance Riviera-PRO simulator and the HES Proto-AXI simulation model included in the Large Scale HPC solution.

Configure
The fourth stage is automatic synthesis and implementation in the Xilinx Vivado environment, which generates the FPGA bitstream and configuration files for your main application.

Run
Aldec provides a run-time environment that makes using the FPGA accelerator boards straightforward.
The PCI Express device driver is installed, and accelerator board housekeeping functions are included in the Proto-AXI API library that is linked with your program. When you launch your main application on the host computer, the FPGA is configured automatically, so no special knowledge of FPGA operation or programming is required. This makes it a very convenient environment for software developers.

Main Features
Choice of several FPGA accelerator boards to match project requirements
Scalability with support for multiple-board configurations
Support for hot reconfiguration of the FPGA
Integration with the FPGA development and verification environment

Solution Contents
HES-HPC FPGA accelerator board
HES Proto-AXI host interface module and software stack
AXI Bus Functional Model (BFM) for RTL simulation
Riviera-PRO high-performance HDL simulator
Reference designs, technical documentation, tutorials and white papers
Integration services
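The Integrate stage of the design flow modifies the host application so that calls to an algorithm function go to an FPGA-backed counterpart. One common way to do this with minimal changes is a function pointer chosen once at start-up. The sketch below assumes that pattern: `hes_board_available` and `vadd_fpga` are hypothetical placeholders, not HES Proto-AXI API functions, and the real transfer and kernel-start calls are omitted because their names are not given here. With no board detected, the sketch falls back to the original software routine.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Common signature shared by the software and FPGA implementations. */
typedef void (*vadd_fn)(const int *a, const int *b, int *out, int n);

/* Original host-only implementation of the algorithm. */
static void vadd_sw(const int *a, const int *b, int *out, int n)
{
    assert(n >= 0);
    for (int i = 0; i < n; i++)
        out[i] = a[i] + b[i];
}

/* HYPOTHETICAL probe: a real version would ask the HES Proto-AXI
 * library whether a configured board is present. This sketch always
 * reports that no board is available. */
static int hes_board_available(void)
{
    return 0;
}

/* HYPOTHETICAL accelerated counterpart. A real version would move a and b
 * to the board, start the kernel and read back out through the HES
 * Proto-AXI C API; those calls are deliberately not shown. */
static void vadd_fpga(const int *a, const int *b, int *out, int n)
{
    (void)a;
    (void)b;
    (void)out;
    (void)n;
    fprintf(stderr, "vadd_fpga: Proto-AXI transfer code not implemented\n");
    abort();
}

/* Choose the implementation once at start-up; every call site then goes
 * through the returned pointer and needs no further changes. */
static vadd_fn select_vadd(void)
{
    return hes_board_available() ? vadd_fpga : vadd_sw;
}
```

Keeping the software version as a fallback also provides a reference for checking FPGA results on the same inputs.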