Exploiting mixed-mode parallelism for matrix operations on the HERA architecture through reconfiguration


Recent advances in multi-million-gate platform field-programmable gate arrays (FPGAs) have made it possible to design and implement complex parallel systems on a programmable chip that also incorporate hardware floating-point units (FPUs). Such designs take advantage of resource reconfiguration. In contrast to the majority of the FPGA community, which still employs reconfigurable logic to build algorithm-specific circuitry, our FPGA-based mixed-mode reconfigurable computing machine is user programmable and can implement several parallel execution modes simultaneously. The heterogeneous reconfigurable architecture (HERA) machine supports the single-instruction, multiple-data (SIMD), multiple-instruction, multiple-data (MIMD) and multiple-SIMD (M-SIMD) execution modes. Each processing element (PE) is centred on a single-precision IEEE 754 FPU with tightly coupled local memory, and can switch dynamically between SIMD and MIMD at runtime. Mixed-mode parallelism has the potential to match the execution mode to the characteristics of each subtask in an application, and thus to sustain high performance. HERA's performance is evaluated with two common computation-intensive benchmarks: matrix–matrix multiplication (MMM) and LU factorisation of sparse doubly-bordered-block-diagonal (DBBD) matrices. Experimental results with electrical power network matrices show that mixed-mode scheduling of the LU factorisation yields speedups of about 19% and 15.5% over the SIMD and MIMD implementations, respectively.
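The mixed-mode idea described above pairs task parallelism across the independent diagonal blocks of a DBBD matrix (MIMD-like work) with data parallelism inside each regular elimination update (SIMD-like work). As a rough host-CPU illustration only, and not the HERA implementation or the paper's algorithm, the following C sketch mimics that split with OpenMP; the helper names lu_block and factor_dbbd_diagonal, the unpivoted dense factorisation and the OpenMP scheduling are all assumptions made for the example.

/*
 * Hedged sketch: a host-side analogue of mixed-mode scheduling for a
 * doubly-bordered-block-diagonal (DBBD) matrix, using OpenMP on a CPU
 * rather than the HERA FPGA fabric.  Unpivoted LU and the block layout
 * are illustrative assumptions, not the paper's method.
 */
#include <stddef.h>

/* Dense, unpivoted LU of an n x n block stored row-major in a[]. */
static void lu_block(double *a, size_t n)
{
    for (size_t k = 0; k < n; ++k) {
        for (size_t i = k + 1; i < n; ++i) {
            a[i*n + k] /= a[k*n + k];               /* multiplier (column of L) */
            for (size_t j = k + 1; j < n; ++j)      /* regular, SIMD-friendly update */
                a[i*n + j] -= a[i*n + k] * a[k*n + j];
        }
    }
}

/*
 * MIMD-like phase: the diagonal blocks of a DBBD matrix are mutually
 * independent, so each can be factored by a different processor (or PE
 * group) running its own instruction stream.
 */
void factor_dbbd_diagonal(double **blocks, const size_t *sizes, int nblocks)
{
    #pragma omp parallel for schedule(dynamic)
    for (int b = 0; b < nblocks; ++b)
        lu_block(blocks[b], sizes[b]);
    /* The border rows/columns and the final coupling block would then be
     * updated and factored; that step is regular and data-parallel, which
     * is where an SIMD mode pays off. */
}

In this host analogue the dynamic OpenMP schedule stands in for HERA's runtime mode switching: uneven block sizes favour independent instruction streams, while the regular inner update favours lockstep execution.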

Inspec keywords: mixed analogue-digital integrated circuits; parallel architectures; matrix decomposition; sparse matrices; floating point arithmetic; reconfigurable architectures; field programmable gate arrays

Other keywords: LU factorisation; programmable chip; FPGA; mixed-mode parallelism; matrix-matrix multiplication; mixed-mode reconfigurable computing machine; multiple-instruction multiple-data execution; hardware floating-point units; multiple-SIMD; MIMD; heterogeneous reconfigurable architecture; SIMD; multimillion-gate platform field programmable gate arrays; sparse doubly-bordered-block-diagonal matrices; single-instruction multiple-data execution

Subjects: Digital arithmetic methods; Logic and switching circuits; Parallel architecture; Linear algebra (numerical analysis); Logic circuits; Mixed analogue-digital circuits
