ECAP: energy-efficient caching for prefetch blocks in tiled chip multiprocessors

As the number of processing cores has grown, performance has increased, but energy consumption and memory access latency have become crucial factors in determining system performance. In a tiled chip multiprocessor, tiles are interconnected by a network and different applications run on different tiles. Non-uniform load distribution across applications results in varying L1 cache usage patterns. An application with a larger memory footprint uses most of its L1 cache. Prefetching on top of such an application may pollute the cache by evicting useful demand blocks. This generates further cache misses, which increase network traffic. An inefficient prefetch block placement strategy may therefore generate additional traffic that increases congestion and power consumption in the network. It also lowers the packet movement rate, which increases the miss penalty at the cores and thereby degrades the Average Memory Access Time (AMAT). The authors propose ECAP, an energy-efficient caching strategy for prefetch blocks. To place the prefetch blocks of tiles running heavy applications, it uses the less-used cache sets of nearby tiles running light applications as virtual cache memories. Compared with the conventional prefetch placement technique, ECAP reduces AMAT, router power, and link power in the network-on-chip (NoC) by 23.54%, 14.42%, and 27%, respectively.
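
The placement idea in the abstract can be illustrated with a minimal sketch. It is not the authors' algorithm: the abstract does not state how a host tile is chosen, so the row-major 2D mesh, the per-tile usage fraction, the `light_threshold` and `max_hops` parameters, and the function names below are all assumptions made for illustration.

```python
from typing import List, Optional


def manhattan(a: int, b: int, width: int) -> int:
    """Hop count between tiles a and b in a row-major 2D mesh (assumed topology)."""
    ax, ay = a % width, a // width
    bx, by = b % width, b // width
    return abs(ax - bx) + abs(ay - by)


def pick_host_tile(
    home: int,
    l1_usage: List[float],
    width: int,
    light_threshold: float = 0.5,  # assumed cutoff for a "light" tile
    max_hops: int = 2,             # assumed meaning of "nearby"
) -> Optional[int]:
    """Return a nearby lightly used tile to host prefetch blocks for `home`.

    l1_usage[t] is a hypothetical fraction (0.0 to 1.0) of tile t's L1 cache
    in use. Returns None when no tile qualifies, in which case the prefetch
    block would be placed in the home tile's own L1 as usual.
    """
    candidates = [
        (manhattan(home, t, width), use, t)
        for t, use in enumerate(l1_usage)
        if t != home
        and use < light_threshold
        and manhattan(home, t, width) <= max_hops
    ]
    if not candidates:
        return None
    # Prefer the closest tile; break ties by lower usage, then by tile id.
    return min(candidates)[2]
```

Preferring the closest qualifying tile keeps the extra hops for a remote prefetch hit small, which matches the abstract's concern with network traffic and AMAT; a real design would also need to track which remote sets hold the blocks.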

Inspec keywords: energy conservation; power consumption; multiprocessing systems; telecommunication traffic; cache storage; power aware computing; virtual storage; microprocessor chips

Other keywords: average memory access time; system performance; cache pollution; virtual cache memories; packet movement rate; energy-efficient caching for prefetch blocks; energy consumption; cache misses; prefetching; nonuniform load distribution; L1 cache usage pattern; memory access latency; memory footprint; tiled chip multiprocessor; prefetch block placement strategy; ECAP; prefetch placement technique; power consumption; network traffic

Subjects: File organisation; Memory circuits; Microprocessor chips; Multiprocessing systems; Semiconductor storage; Electrical/electronic equipment (energy utilisation); Microprocessors and microcomputers; Performance evaluation and testing

http://iet.metastore.ingenta.com/content/journals/10.1049/iet-cdt.2019.0035