Exploiting temporal loads for low latency and high bandwidth memory

IEE Proceedings - Computers and Digital Techniques
DOI: 10.1049/ip-cdt_20045124

Increasing clock frequencies and issue rates aggravates the memory latency problem and imposes higher memory bandwidth requirements. Although caches can be multi-ported to provide high memory bandwidth, their access latency grows with the number of ports, which limits their potential. The paper proposes a novel technique, the ‘temporal load cache architecture’, to reduce load latencies and provide higher memory bandwidth. The key motivation is that temporal loads – dynamic instances of a static load instruction that access the same address as the last dynamic instance of that static load – constitute 48% of all dynamic loads on average for the SPEC2000 benchmarks. When a load is predicted to be temporal, the data it is predicted to access are read early in the pipeline from a small temporal load cache that stores the temporal data. The proposed architecture has two main advantages. First, since instructions dependent on a temporal load receive their data early in the pipeline, they can issue as soon as their remaining data dependences and resource conflicts are resolved. Second, since a large fraction of loads is filtered out by the temporal load cache, the main data cache can better service the remaining (non-temporal) loads, providing higher memory bandwidth. Experimental results show that the proposed temporal load cache architecture improves performance by 8.3% on average for the SPEC2000 integer benchmarks.
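
To make the mechanism concrete, the following Python sketch (not taken from the paper; the trace format, table size, and names are illustrative assumptions) shows how the fraction of temporal loads can be measured over a load trace, and how a small PC-indexed temporal load cache could supply data for temporal loads while filtering them away from the main data cache.

# Minimal illustrative sketch, not the paper's implementation: the trace
# format, table size and naming are assumptions made here for clarity.
from collections import namedtuple

Load = namedtuple("Load", "pc address value")   # one dynamic load instance

class TemporalLoadCache:
    """Small direct-mapped table indexed by the load's PC.

    Each entry remembers the address and data seen by the last dynamic
    instance of that static load; a hit means the current instance is
    temporal and its data can be supplied early in the pipeline instead
    of going to the main data cache."""

    def __init__(self, entries=64):
        self.entries = entries
        self.table = [None] * entries           # slot: (pc, address, value)

    def lookup(self, pc, address):
        slot = self.table[pc % self.entries]
        if slot is not None and slot[0] == pc and slot[1] == address:
            return slot[2]                      # temporal hit: data available early
        return None                             # miss: use the main data cache

    def update(self, pc, address, value):
        self.table[pc % self.entries] = (pc, address, value)

def temporal_fraction(trace):
    """Fraction of dynamic loads whose address matches the previous dynamic
    instance of the same static load (the paper's definition of temporal)."""
    last_addr = {}
    temporal = 0
    for ld in trace:
        if last_addr.get(ld.pc) == ld.address:
            temporal += 1
        last_addr[ld.pc] = ld.address
    return temporal / len(trace) if trace else 0.0

if __name__ == "__main__":
    # Toy trace: the static load at PC 0x400 keeps re-reading the same
    # address, e.g. a spilled local variable or a loop-invariant global.
    trace = [Load(0x400, 0x1000, 7), Load(0x408, 0x2000, 3),
             Load(0x400, 0x1000, 7), Load(0x400, 0x1000, 7)]
    tlc, hits = TemporalLoadCache(), 0
    for ld in trace:
        if tlc.lookup(ld.pc, ld.address) is not None:
            hits += 1                           # filtered from the main data cache
        tlc.update(ld.pc, ld.address, ld.value)
    print(f"temporal fraction: {temporal_fraction(trace):.2f}, TLC hits: {hits}")

In real hardware the hit/miss decision would be a prediction made before the load's address is computed and verified later in the pipeline; the sketch checks the address directly only to keep the example short.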
