Locality-protected cache allocation scheme with low overhead on GPUs


Graphics processing units (GPUs) play an increasingly important role in parallel computing. With their multi-threaded execution model, GPUs can accelerate many parallel programmes while saving energy. In contrast to their strong computing power, however, GPUs have limited on-chip memory, which is easily exhausted. The throughput-oriented GPU execution model launches thousands of hardware threads, which may access the small cache simultaneously, causing cache thrashing and contention that limit GPU performance. Motivated by these issues, the authors put forward a locality-protected cache allocation method based on the instruction programme counter (LPC) that exploits data locality in the L1 data cache with very low hardware overhead. First, a simple programme counter (PC)-based locality detector collects reuse information for each cache line. Then, a hardware-efficient prioritised cache allocation unit combines this reuse information with time-stamp information to predict the reuse probability of each cache line and to evict the line with the least reuse probability. Experiments on a simulator show that LPC provides up to a 17.8% speedup and an average improvement of 5.0% over the baseline method, with very low overhead.
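The mechanism described above can be illustrated with a small software sketch. This is not the authors' hardware design, only a simplified behavioural model under assumed details: each cache line records the PC of the load that filled it plus a time stamp, a per-PC table counts observed reuse, and on a miss the victim is the line whose fill PC has shown the least reuse, with ties broken by age.

```python
# Behavioural sketch of a PC-based locality-protected replacement policy.
# Assumptions (not from the paper): a fully associative cache, a per-PC
# reuse counter as the locality detector, and (reuse, age) as the
# eviction priority.
from collections import defaultdict

class LPCCache:
    def __init__(self, num_lines):
        self.num_lines = num_lines
        self.lines = {}                 # addr -> (fill_pc, time_stamp)
        self.reuse = defaultdict(int)   # fill_pc -> observed reuse count
        self.clock = 0

    def access(self, addr, pc):
        """Return True on a hit, False on a miss; update reuse state."""
        self.clock += 1
        if addr in self.lines:
            fill_pc, _ = self.lines[addr]
            self.reuse[fill_pc] += 1    # data filled by this PC is reused
            self.lines[addr] = (fill_pc, self.clock)
            return True
        if len(self.lines) >= self.num_lines:
            # Evict the line with the least predicted reuse; among equally
            # unpromising lines, evict the oldest (smallest time stamp).
            victim = min(self.lines,
                         key=lambda a: (self.reuse[self.lines[a][0]],
                                        self.lines[a][1]))
            del self.lines[victim]
        self.lines[addr] = (pc, self.clock)
        return False
```

In this model a streaming load PC whose lines are never re-referenced loses the eviction contest against a PC whose lines are hit repeatedly, so high-locality lines are protected from thrashing by the thousands of competing threads.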


