© The Institution of Engineering and Technology
As a new platform for high-performance and general-purpose computing, the graphics processing unit (GPU) is one of the most promising candidates for rapid improvement in peak processing speed, low latency, and high performance. As GPUs employ multithreading to hide latency, each single instruction multiple thread (SIMT) core contains only a small private data cache. Hence, in many applications these cores communicate through the global memory. Access to this shared memory takes a long time and consumes a large amount of power. Moreover, memory bandwidth is limited, which is quite challenging in parallel processing. Memory requests that miss in the last-level cache and then access the slow off-chip memory harm power and performance significantly. In this research, the authors introduce a low-overhead mechanism to reduce off-chip memory requests that are triggered by miss events in on-chip caches. The authors propose a cluster-based architecture that captures the similarity of memory requests between SIMT cores and services missed requests with data held by adjacent cores. Simulation results reveal that the proposed architecture improves the geometric mean of instructions per cycle by 6.3% for the evaluated benchmarks, with a maximum gain of 22%, while the geometric mean of the total energy consumption overhead is 4.8% for the evaluated applications.
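The core idea described above — servicing a private-cache miss from an adjacent core in the same cluster before issuing an off-chip request — can be illustrated with a minimal simulation sketch. This is not the authors' actual design: the class names, the LRU cache model, the cluster size, and the probe-all-neighbours policy are all illustrative assumptions.

```python
# Hypothetical sketch of the cluster-based miss-handling idea: on a
# private-cache miss, probe the caches of adjacent cores in the same
# cluster before falling back to slow off-chip memory.
from collections import OrderedDict

class SIMTCoreCache:
    """Small private data cache, modelled as an LRU map of block addresses."""
    def __init__(self, capacity=4):
        self.capacity = capacity
        self.blocks = OrderedDict()  # block address -> data

    def lookup(self, addr):
        if addr in self.blocks:
            self.blocks.move_to_end(addr)  # refresh LRU position
            return self.blocks[addr]
        return None

    def fill(self, addr, data):
        self.blocks[addr] = data
        self.blocks.move_to_end(addr)
        if len(self.blocks) > self.capacity:
            self.blocks.popitem(last=False)  # evict least-recently used block

class Cluster:
    """A group of SIMT cores whose private caches service each other's misses."""
    def __init__(self, num_cores=4):
        self.caches = [SIMTCoreCache() for _ in range(num_cores)]
        self.off_chip_requests = 0

    def access(self, core_id, addr):
        # 1. Try the requesting core's own private cache.
        data = self.caches[core_id].lookup(addr)
        if data is not None:
            return data, "local hit"
        # 2. Probe adjacent cores in the cluster, exploiting the
        #    inter-core similarity of memory requests.
        for other, cache in enumerate(self.caches):
            if other == core_id:
                continue
            data = cache.lookup(addr)
            if data is not None:
                self.caches[core_id].fill(addr, data)
                return data, "remote hit"
        # 3. Fall back to the slow, power-hungry off-chip memory.
        self.off_chip_requests += 1
        data = f"mem[{addr:#x}]"
        self.caches[core_id].fill(addr, data)
        return data, "off-chip"
```

In this toy model, once core 0 has fetched a block off-chip, a later request for the same block by core 1 is satisfied by core 0's cache, so only one off-chip request is issued for the two accesses — the effect the proposed architecture aims for.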