Locality-protected cache allocation scheme with low overhead on GPUs

Graphics processing units (GPUs) are playing an increasingly important role in parallel computing. With their multi-threaded execution model, GPUs can accelerate many parallel programs and save energy. In contrast to their strong computing power, GPUs have limited on-chip memory, which is easily exhausted. The throughput-oriented execution model of a GPU introduces thousands of hardware threads that may access the small cache simultaneously, causing cache thrashing and contention that limit GPU performance. Motivated by these issues, the authors put forward a locality-protected cache allocation method based on the instruction program counter (LPC) to exploit data locality in the L1 data cache with very low hardware overhead. First, they use a simple program counter (PC)-based locality detector to collect reuse information for each cache line. Then, a hardware-efficient prioritised cache allocation unit is proposed that combines this reuse information with time-stamp information to predict the reuse possibility of each cache line and to evict the line with the least reuse possibility. Experiments on a simulator show that LPC provides up to a 17.8% speedup and an average improvement of 5.0% over the baseline method with very low overhead.
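As a rough software analogue of the mechanism the abstract describes, the sketch below models a PC-indexed table of saturating reuse counters together with per-line fill-PC and time-stamp fields, and picks an eviction victim by lowest predicted reuse with a time-stamp tie-break. It is a minimal illustration under stated assumptions, not the paper's hardware design: the names (CacheLine, LocalityTable, pickVictim), the 256-entry table, and the 2-bit counters are all assumptions for the sake of the example.

// Sketch (C++) of PC-based locality-protected victim selection.
// Table size, counter width and all identifiers are assumptions,
// not the paper's RTL.
#include <array>
#include <cstddef>
#include <cstdint>
#include <vector>

struct CacheLine {
    uint64_t tag = 0;
    uint64_t fillPC = 0;    // PC of the load that allocated this line
    uint64_t lastTouch = 0; // time stamp of the most recent access
    bool valid = false;
};

// PC-indexed table of 2-bit saturating counters estimating how often
// lines filled by a given PC are reused before being evicted.
class LocalityTable {
    static constexpr size_t kEntries = 256; // assumed table size
    std::array<uint8_t, kEntries> counters_{};
    static size_t index(uint64_t pc) { return (pc >> 2) % kEntries; }
public:
    void recordReuse(uint64_t pc) {        // cache hit: the fill PC shows locality
        if (counters_[index(pc)] < 3) ++counters_[index(pc)];
    }
    void recordDeadEviction(uint64_t pc) { // line evicted without any reuse
        if (counters_[index(pc)] > 0) --counters_[index(pc)];
    }
    uint8_t reuseScore(uint64_t pc) const { return counters_[index(pc)]; }
};

// Victim selection: fill a free way if one exists; otherwise evict the
// line whose fill PC predicts the least reuse, breaking ties by evicting
// the line with the oldest time stamp (an LRU-like fallback).
size_t pickVictim(const std::vector<CacheLine>& set, const LocalityTable& lt) {
    for (size_t i = 0; i < set.size(); ++i)
        if (!set[i].valid) return i;
    size_t victim = 0;
    for (size_t i = 1; i < set.size(); ++i) {
        uint8_t si = lt.reuseScore(set[i].fillPC);
        uint8_t sv = lt.reuseScore(set[victim].fillPC);
        if (si < sv || (si == sv && set[i].lastTouch < set[victim].lastTouch))
            victim = i;
    }
    return victim;
}

Because the reuse counter, not the time stamp, dominates the comparison, lines allocated by high-reuse PCs stay protected even when thousands of threads stream other data through the set; the time stamp only arbitrates between equally scored lines.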
