FPGA accelerator for floating-point matrix multiplication

IET Computers & Digital Techniques

This study presents the architecture and implementation of a field-programmable gate array (FPGA) accelerator for double-precision floating-point matrix multiplication. The architecture is oriented towards minimising resource utilisation and maximising clock frequency. It employs a block matrix multiplication algorithm that returns the result blocks to the host processor as soon as they are computed. This avoids output buffering and simplifies placement and routing on the chip. The authors show that such an architecture is especially well suited to full-duplex communication links between the accelerator and the host processor. The architecture requires the result blocks to be accumulated by the host processor; however, the authors show that typically more than 99% of all arithmetic operations are performed by the accelerator. The implementation focuses on efficient use of embedded FPGA resources in order to allow for a large number of processing elements (PEs). Each PE uses eight Virtex-6 DSP blocks. Both adders and multipliers are deeply pipelined and use several FPGA-specific techniques to achieve small area and high clock frequency. Finally, the authors quantify the performance of the accelerator implemented in a Xilinx Virtex-6 FPGA, with 252 PEs running at 403 MHz (achieving 203.1 GFLOPS), by comparing it with the double-precision matrix multiplication function from the MKL, ACML, GotoBLAS and ATLAS libraries executing on Intel Core2Quad and AMD Phenom X4 microprocessors running at 2.8 GHz. The accelerator performs 4.5 times faster than the fastest processor/library pair.
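
The division of labour described above can be illustrated with a minimal Python sketch. It is not the authors' design: the blocking scheme (splitting the inner dimension into panels of width `kb`), the function names and the operation-counting convention are all assumptions made for illustration. The function `accelerator_block_product` merely stands in for the FPGA, and `peak_gflops` assumes each PE completes one multiplication and one addition per clock cycle, an assumption that happens to reproduce the quoted 203.1 GFLOPS.

```python
# Illustrative sketch only. The blocking scheme and all names here are
# assumptions, not the architecture described in the paper.

def accelerator_block_product(a_panel, b_panel):
    """Stand-in for the FPGA: multiply an (m x kb) panel by a (kb x n) panel.

    Returns the partial result block and the number of floating-point
    operations (multiplications plus additions) used to produce it.
    """
    m, kb, n = len(a_panel), len(b_panel), len(b_panel[0])
    block = [[0.0] * n for _ in range(m)]
    flops = 0
    for i in range(m):
        for j in range(n):
            acc = a_panel[i][0] * b_panel[0][j]       # first product
            flops += 1
            for k in range(1, kb):
                acc += a_panel[i][k] * b_panel[k][j]  # multiply and add
                flops += 2
            block[i][j] = acc
    return block, flops


def host_blocked_matmul(a, b, kb):
    """Host side: send panels to the accelerator, accumulate returned blocks.

    Each partial block is added into C as soon as it comes back, so the
    accelerator side needs no output buffer for partial sums.
    """
    m, depth, n = len(a), len(b), len(b[0])
    c = [[0.0] * n for _ in range(m)]
    accel_flops = 0
    host_flops = 0
    for k0 in range(0, depth, kb):
        a_panel = [row[k0:k0 + kb] for row in a]
        b_panel = b[k0:k0 + kb]
        block, flops = accelerator_block_product(a_panel, b_panel)
        accel_flops += flops
        for i in range(m):
            for j in range(n):
                c[i][j] += block[i][j]                # host accumulation
        host_flops += m * n                           # one add per element
    return c, accel_flops, host_flops


def peak_gflops(num_pes, clock_mhz, flops_per_pe_per_cycle=2):
    """Peak throughput, assuming one multiply and one add per PE per cycle."""
    return num_pes * clock_mhz * flops_per_pe_per_cycle / 1000.0


if __name__ == "__main__":
    a = [[1.0, 2.0, 3.0, 4.0], [5.0, 6.0, 7.0, 8.0]]
    b = [[1.0, 0.0], [0.0, 1.0], [2.0, 0.0], [0.0, 2.0]]
    c, accel, host = host_blocked_matmul(a, b, kb=2)
    print(c, accel, host)
    print(peak_gflops(252, 403))
```

Under this counting, each panel costs the accelerator `m * n * (2 * kb - 1)` operations and the host `m * n` additions, so the accelerator's share is `(2 * kb - 1) / (2 * kb)`. That share exceeds 99% once `kb` is larger than 50, which is in the spirit of the figure quoted above, although the block sizes actually used by the authors are not given here.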
