Impact of on-chip network parameters on NUCA cache performances

IET Computers & Digital Techniques

Non-uniform cache architectures (NUCAs) are a novel design paradigm for large last-level on-chip caches, introduced to deliver low access latencies in wire-delay-dominated environments. A NUCA cache is partitioned into sub-banks, and its access latency depends on the physical position of the requested data. Typically, NUCA caches employ a switched network, made up of links and routers with buffered queues, to connect the sub-banks to the cache controller, so the characteristics of the network elements can affect the performance of the entire system. This work analyses how two parameters of the network routers, namely cut-through latency and buffering capacity, affect the overall performance of NUCA-based systems in the single-processor case, assuming a reference NUCA organisation proposed in the literature. The analysis is performed with a cycle-accurate, execution-driven simulator of the entire system running real workloads. The results indicate that the system is highly sensitive to the cut-through latency, which limits the effectiveness of the NUCA solution, and that a modest buffering capacity is sufficient to achieve a good performance level. Consequently, this work proposes an alternative clustered NUCA organisation that limits the average number of hops experienced by cache accesses. This organisation performs better and scales better as the cut-through latency increases, which simplifies the implementation of routers, and it is also more effective than another latency-reduction solution proposed in the literature (the hybrid network).
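
The abstract's central observation, that router cut-through latency is paid at every hop and so weighs heavily on accesses to distant banks, can be illustrated with a toy latency model. The sketch below is not the cycle-accurate simulator used in the article and reproduces none of its results. It assumes a 2D mesh of banks, a controller attached below row 0, dimension-ordered routing, a fixed per-link latency, uniformly distributed accesses and no contention, so buffering capacity is not modelled. All names (`RouterParams`, `hops_to_bank`, `access_latency`, `mean_latency`) and all numeric values are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RouterParams:
    """Hypothetical per-hop costs, in cycles."""
    cut_through_latency: int  # cycles spent traversing one router
    link_latency: int = 1     # cycles spent on one link (assumed value)


def hops_to_bank(row: int, col: int, controller_col: int) -> int:
    """Routers crossed from the controller to bank (row, col).

    Assumes the controller sits below row 0 at column `controller_col`
    and that requests travel along the row first, then up the column.
    """
    return abs(col - controller_col) + row + 1


def access_latency(row: int, col: int, controller_col: int,
                   bank_latency: int, params: RouterParams) -> int:
    """Unloaded round-trip latency of one access (no queueing delay)."""
    hops = hops_to_bank(row, col, controller_col)
    one_way = hops * (params.cut_through_latency + params.link_latency)
    return 2 * one_way + bank_latency


def mean_latency(rows: int, cols: int, controller_col: int,
                 bank_latency: int, params: RouterParams) -> float:
    """Average latency over all banks, assuming uniform accesses."""
    total = sum(
        access_latency(r, c, controller_col, bank_latency, params)
        for r in range(rows)
        for c in range(cols)
    )
    return total / (rows * cols)


if __name__ == "__main__":
    # Illustrative sweep: a large mesh versus a smaller cluster-sized mesh.
    for ct in (0, 1, 2, 3):
        p = RouterParams(cut_through_latency=ct)
        large = mean_latency(16, 8, 4, bank_latency=3, params=p)
        small = mean_latency(4, 8, 4, bank_latency=3, params=p)
        print(f"cut-through={ct}: 16x8 mesh {large:.1f}, 4x8 mesh {small:.1f}")
```

In this model the mean latency grows linearly with the cut-through latency, with a slope proportional to the average hop count. That is the intuition behind an organisation that limits the number of hops: a smaller average hop count flattens the slope. Contention effects, which the article studies through buffering capacity, would need a queueing or cycle-level model instead.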

http://iet.metastore.ingenta.com/content/journals/10.1049/iet-cdt.2008.0078