Impact of on-chip network parameters on NUCA cache performances

A. Bardine; M. Comparetti; P. Foglia; G. Gabrielli; C.A. Prete

DOI: 10.1049/iet-cdt.2008.0078

Journal: IET Computers & Digital Techniques


Author(s): A. Bardine¹; M. Comparetti¹; P. Foglia¹; G. Gabrielli¹; C.A. Prete¹
- Affiliations: 1: Dipartimento di Ingegneria dell'Informazione, Università di Pisa, Pisa, Italy
Source: Volume 3, Issue 5, September 2009, pp. 501–512
DOI: 10.1049/iet-cdt.2008.0078, Print ISSN 1751-8601, Online ISSN 1751-861X


Non-uniform cache architectures (NUCAs) are a novel design paradigm for large last-level on-chip caches, introduced to deliver low access latencies in wire-delay-dominated environments. The cache is partitioned into sub-banks, so the access latency depends on the physical position of the requested data. NUCA caches typically employ a switched network, made up of links and routers with buffered queues, to connect the sub-banks to the cache controller, and the characteristics of these network elements can affect the performance of the entire system. This work analyses how two router parameters, cut-through latency and buffering capacity, affect the overall performance of NUCA-based systems in the single-processor case, assuming a reference NUCA organisation proposed in the literature. The analysis uses a cycle-accurate, execution-driven simulator of the entire system and real workloads. The results indicate that the system is highly sensitive to the cut-through latency, which limits the effectiveness of the NUCA solution, and that a modest buffering capacity is sufficient to achieve good performance. Consequently, we propose an alternative clustered NUCA organisation that limits the average number of hops experienced by cache accesses. This organisation performs better and scales better as the cut-through latency increases, which simplifies router implementation, and it is also more effective than the hybrid network, another latency-reduction solution proposed in the literature.
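To make the dependence on hop count and cut-through latency concrete, the following is a minimal back-of-the-envelope sketch, not the cycle-accurate simulator used in the article. Everything in it is an assumption made for illustration: the function names, the 4×4 bank grid, the controller placement, the one-cycle link latency and the three-cycle bank access time are hypothetical, and contention and buffering are ignored entirely.

```python
def round_trip_latency(hops, cut_through, link=1, bank=3):
    """Cycles for a request to reach a bank `hops` routers away and return.

    Assumed toy model: every hop costs one router traversal
    (`cut_through` cycles) plus one link traversal (`link` cycles),
    paid once on the way in and once on the way back, plus a fixed
    bank access time. Contention and queueing are ignored.
    """
    return 2 * hops * (cut_through + link) + bank


def mean_hops(rows, cols, controller_col):
    """Average hop count over all banks of a rows x cols grid.

    Assumed layout: the controller attaches below row 0 at column
    `controller_col`, so bank (r, c) is (r + 1) vertical hops plus
    |c - controller_col| horizontal hops away.
    """
    total = sum((r + 1) + abs(c - controller_col)
                for r in range(rows) for c in range(cols))
    return total / (rows * cols)


if __name__ == "__main__":
    hops = mean_hops(rows=4, cols=4, controller_col=0)
    for cut_through in (1, 2, 4):
        # Latency grows linearly in cut_through, scaled by the hop count.
        print(cut_through, round_trip_latency(hops, cut_through))
```

In this toy model each extra cycle of cut-through latency adds two cycles per hop to the round trip, so any organisation that lowers the average hop count also lowers the sensitivity to router latency. This is the intuition behind limiting hops, under the stated simplifications.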

