In this study, the authors evaluate the performance of the PowerXCell 8i processor, which is based on the Cell Broadband Engine architecture. For this purpose, they chose an algorithm for the k-nearest neighbour problem and optimised it to exploit the facilities provided by this architecture efficiently. The performance of the PowerXCell 8i was evaluated by executing the algorithm with single- and double-precision calculations, in both cases with and without SIMDisation. For single-precision calculations, the authors achieved a maximum speed-up of 43.85 with SIMDisation by activating 6 synergistic processor elements (SPEs) and 39.73 without SIMDisation by activating 16 SPEs. For double-precision calculations, they achieved a maximum speed-up of 34.79 with SIMDisation by activating 9 SPEs and 32.71 without SIMDisation by activating 12 SPEs. These values are relative to execution on the PowerPC processor element (PPE) and are due to the way the SPE cores access main memory, through DMA transfers that are performed in parallel with the computing operations. The authors conclude that this processor can be used efficiently for executing algorithms that require intensive computation on huge data volumes.
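For readers unfamiliar with the k-nearest neighbour problem, the following is a minimal scalar sketch in portable C of a brute-force search. It is not the authors' implementation: the function names `sq_dist` and `knn_bruteforce`, the squared Euclidean metric, and the row-major point layout are all assumptions made for illustration, and the sketch uses no SPE intrinsics, SIMD instructions, or DMA transfers.

```c
#include <assert.h>
#include <stddef.h>

/* Squared Euclidean distance between two dim-dimensional points.
 * The metric is an assumption; the abstract does not name one. */
static float sq_dist(const float *a, const float *b, size_t dim)
{
    float acc = 0.0f;
    for (size_t i = 0; i < dim; ++i) {
        float d = a[i] - b[i];
        acc += d * d;
    }
    return acc;
}

/* Hypothetical brute-force k-NN.
 * points   : n points stored row-major, dim floats each
 * query    : one point of dim floats
 * out_idx  : receives the indices of the k closest points, nearest first
 * out_dist : receives the matching squared distances (k floats) */
void knn_bruteforce(const float *points, size_t n, size_t dim,
                    const float *query, size_t k,
                    size_t *out_idx, float *out_dist)
{
    assert(k > 0 && k <= n);

    size_t found = 0; /* number of valid entries in the result arrays */
    for (size_t p = 0; p < n; ++p) {
        float d = sq_dist(points + p * dim, query, dim);

        /* Skip candidates no better than the current k-th neighbour. */
        if (found == k && d >= out_dist[k - 1])
            continue;

        /* Insert into the sorted result list, dropping the worst entry
         * once the list already holds k neighbours. */
        size_t pos = (found < k) ? found++ : k - 1;
        while (pos > 0 && out_dist[pos - 1] > d) {
            out_dist[pos] = out_dist[pos - 1];
            out_idx[pos] = out_idx[pos - 1];
            --pos;
        }
        out_dist[pos] = d;
        out_idx[pos] = p;
    }
}
```

The distance loop is the part that dominates the run time for large data sets. Its iterations over points are independent of one another, which is why a workload of this kind lends itself to being split across several cores and to short-vector SIMD arithmetic.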


Intensive computing on a large data volume with a short-vector single instruction multiple data processor