Exploiting mixed-mode parallelism for matrix operations on the HERA architecture through reconfiguration


Recent advances in multi-million-gate platform field-programmable gate arrays (FPGAs) have made it possible to design and implement complex parallel systems on a programmable chip that also incorporate hardware floating-point units (FPUs). Such designs take advantage of resource reconfiguration. In contrast to the majority of the FPGA community, which still employs reconfigurable logic to build algorithm-specific circuitry, our FPGA-based mixed-mode reconfigurable computing machine is user programmable and can implement several parallel execution modes simultaneously. The heterogeneous reconfigurable architecture (HERA) machine supports the single-instruction, multiple-data (SIMD), multiple-instruction, multiple-data (MIMD) and multiple-SIMD (M-SIMD) execution modes. Each processing element (PE) is centred on a single-precision IEEE 754 FPU with tightly coupled local memory, and can switch dynamically between SIMD and MIMD at runtime. Mixed-mode parallelism has the potential to match the execution mode to the characteristics of each subtask in an application, and thus to sustain high performance. HERA's performance is evaluated with two common computation-intensive benchmarks: matrix–matrix multiplication (MMM) and LU factorisation of sparse doubly-bordered-block-diagonal (DBBD) matrices. Experimental results with electrical power network matrices show that mixed-mode scheduling of the LU factorisation yields speedups of about 19% and 15.5% over the SIMD and MIMD implementations, respectively.
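The mixed-mode idea described above pairs task parallelism across the independent diagonal blocks of a DBBD matrix (MIMD-like work) with data parallelism inside each regular elimination update (SIMD-like work). As a rough host-CPU illustration only, and not the HERA implementation or the paper's algorithm, the following C sketch mimics that split with OpenMP; the helper names lu_block and factor_dbbd_diagonal, the unpivoted dense factorisation and the OpenMP scheduling are all assumptions made for the example.

/*
 * Hedged sketch: a host-side analogue of mixed-mode scheduling for a
 * doubly-bordered-block-diagonal (DBBD) matrix, using OpenMP on a CPU
 * rather than the HERA FPGA fabric.  Unpivoted LU and the block layout
 * are illustrative assumptions, not the paper's method.
 */
#include <stddef.h>

/* Dense, unpivoted LU of an n x n block stored row-major in a[]. */
static void lu_block(double *a, size_t n)
{
    for (size_t k = 0; k < n; ++k) {
        for (size_t i = k + 1; i < n; ++i) {
            a[i*n + k] /= a[k*n + k];               /* multiplier (column of L) */
            for (size_t j = k + 1; j < n; ++j)      /* regular, SIMD-friendly update */
                a[i*n + j] -= a[i*n + k] * a[k*n + j];
        }
    }
}

/*
 * MIMD-like phase: the diagonal blocks of a DBBD matrix are mutually
 * independent, so each can be factored by a different processor (or PE
 * group) running its own instruction stream.
 */
void factor_dbbd_diagonal(double **blocks, const size_t *sizes, int nblocks)
{
    #pragma omp parallel for schedule(dynamic)
    for (int b = 0; b < nblocks; ++b)
        lu_block(blocks[b], sizes[b]);
    /* The border rows/columns and the final coupling block would then be
     * updated and factored; that step is regular and data-parallel, which
     * is where an SIMD mode pays off. */
}

In this host analogue the dynamic OpenMP schedule stands in for HERA's runtime mode switching: uneven block sizes favour independent instruction streams, while the regular inner update favours lockstep execution.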

Inspec keywords: mixed analogue-digital integrated circuits; parallel architectures; matrix decomposition; sparse matrices; floating point arithmetic; reconfigurable architectures; field programmable gate arrays

Other keywords: LU factorisation; programmable chip; FPGA; mixed-mode parallelism; matrix-matrix multiplication; mixed-mode reconfigurable computing machine; multiple-instruction multiple-data execution; hardware floating-point units; multiple-SIMD; MIMD; heterogeneous reconfigurable architecture; SIMD; multimillion-gate platform field programmable gate arrays; sparse doubly-bordered-block-diagonal matrices; single-instruction multiple-data execution

Subjects: Digital arithmetic methods; Logic and switching circuits; Parallel architecture; Linear algebra (numerical analysis); Logic circuits; Mixed analogue-digital circuits
