ECAP: energy-efficient caching for prefetch blocks in tiled chip multiprocessors

IET Computers & Digital Techniques


As the number of processing cores has grown, performance has increased, but energy consumption and memory access latency have become crucial factors in determining system performance. In a tiled chip multiprocessor, tiles are interconnected by a network-on-chip (NoC), and different applications run on different tiles. A non-uniform load distribution across applications results in varying L1 cache usage patterns: an application with a larger memory footprint uses most of its L1 cache. Prefetching on top of such an application may pollute the cache by evicting useful demand blocks. The evictions generate further cache misses, which increase network traffic. An inefficient prefetch block placement strategy may therefore generate additional traffic that increases congestion and power consumption in the network. The congestion also slows packet movement, which raises the miss penalty at the cores and thereby degrades the Average Memory Access Time (AMAT). The authors propose ECAP, an energy-efficient caching strategy for prefetch blocks. To place the prefetch blocks of tiles running heavy applications, ECAP uses the less-used cache sets of nearby tiles running light applications as virtual cache memories. Compared with the conventional prefetch placement technique, ECAP reduces AMAT, router power, and link power in the NoC by 23.54%, 14.42%, and 27%, respectively.
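
The placement idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' algorithm: the `Tile` fields, the light/heavy flag, the per-set usage counter, the hop limit `max_hops`, and the associativity `ways` are all hypothetical names and parameters introduced here for illustration. The sketch only shows the general shape of the decision: for a prefetch block that maps to a given cache set, look for a nearby tile running a light application whose corresponding set still has room, and fall back to the requesting tile's own cache otherwise.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Tile:
    x: int                 # mesh column (assumed 2D mesh layout)
    y: int                 # mesh row
    is_light: bool         # assumption: tile is classified as running a light application
    set_usage: List[int]   # assumption: occupied ways per L1 set


def manhattan(a: Tile, b: Tile) -> int:
    """Hop distance between two tiles in a 2D mesh."""
    return abs(a.x - b.x) + abs(a.y - b.y)


def pick_host_tile(home: Tile, tiles: List[Tile], set_index: int,
                   max_hops: int = 2, ways: int = 4) -> Tile:
    """Choose where to place a prefetch block that maps to `set_index`.

    Hypothetical policy: among light tiles within `max_hops` of `home`
    whose matching set is not full, prefer the least-used set and break
    ties by distance. If no such tile exists, keep the block at `home`.
    """
    candidates = [
        t for t in tiles
        if t is not home
        and t.is_light
        and manhattan(home, t) <= max_hops
        and t.set_usage[set_index] < ways
    ]
    if not candidates:
        return home
    return min(candidates,
               key=lambda t: (t.set_usage[set_index], manhattan(home, t)))
```

In this sketch, a heavy tile at (0, 0) with light neighbours at (0, 1) and (1, 0) would send a prefetch block to whichever neighbour has fewer occupied ways in the target set. The actual classification of applications, the usage metric, and the lookup path for remotely placed blocks are described in the paper itself and are not reproduced here.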

http://iet.metastore.ingenta.com/content/journals/10.1049/iet-cdt.2019.0035