© The Institution of Engineering and Technology
As a new platform for high-performance and general-purpose computing, the graphics processing unit (GPU) is one of the most promising candidates for rapid improvement in peak processing speed, low latency, and high performance. As GPUs employ multithreading to hide latency, each single instruction multiple thread (SIMT) core contains only a small private data cache. Hence, in many applications these cores communicate through the global memory. Access to this shared memory takes a long time and consumes a large amount of power. Moreover, memory bandwidth is limited, which is quite challenging in parallel processing. Memory requests that miss in the last-level cache and then access the slow off-chip memory harm power and performance significantly. In this research, the authors introduce a low-overhead mechanism to reduce off-chip memory requests that are triggered by miss events in on-chip caches. The authors propose a cluster-based architecture that captures the similarity of memory requests between SIMT cores and services missed requests with data held by adjacent cores. Simulation results reveal that the proposed architecture improves the geometric mean of instructions per cycle by 6.3% for the evaluated benchmarks, with a maximum gain of 22%, while the geometric mean of the total energy consumption overhead is 4.8% for the evaluated applications.
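The core idea described above — servicing a private-cache miss from an adjacent core in the same cluster before issuing an off-chip request — can be illustrated with a minimal simulation sketch. This is not the authors' actual design: the class names, the LRU cache model, the cluster size, and the probe-all-neighbours policy are all illustrative assumptions.

```python
# Hypothetical sketch of the cluster-based miss-handling idea: on a
# private-cache miss, probe the caches of adjacent cores in the same
# cluster before falling back to slow off-chip memory.
from collections import OrderedDict

class SIMTCoreCache:
    """Small private data cache, modelled as an LRU map of block addresses."""
    def __init__(self, capacity=4):
        self.capacity = capacity
        self.blocks = OrderedDict()  # block address -> data

    def lookup(self, addr):
        if addr in self.blocks:
            self.blocks.move_to_end(addr)  # refresh LRU position
            return self.blocks[addr]
        return None

    def fill(self, addr, data):
        self.blocks[addr] = data
        self.blocks.move_to_end(addr)
        if len(self.blocks) > self.capacity:
            self.blocks.popitem(last=False)  # evict least-recently used block

class Cluster:
    """A group of SIMT cores whose private caches service each other's misses."""
    def __init__(self, num_cores=4):
        self.caches = [SIMTCoreCache() for _ in range(num_cores)]
        self.off_chip_requests = 0

    def access(self, core_id, addr):
        # 1. Try the requesting core's own private cache.
        data = self.caches[core_id].lookup(addr)
        if data is not None:
            return data, "local hit"
        # 2. Probe adjacent cores in the cluster, exploiting the
        #    inter-core similarity of memory requests.
        for other, cache in enumerate(self.caches):
            if other == core_id:
                continue
            data = cache.lookup(addr)
            if data is not None:
                self.caches[core_id].fill(addr, data)
                return data, "remote hit"
        # 3. Fall back to the slow, power-hungry off-chip memory.
        self.off_chip_requests += 1
        data = f"mem[{addr:#x}]"
        self.caches[core_id].fill(addr, data)
        return data, "off-chip"
```

In this toy model, once core 0 has fetched a block off-chip, a later request for the same block by core 1 is satisfied by core 0's cache, so only one off-chip request is issued for the two accesses — the effect the proposed architecture aims for.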