Simulation study of memory performance of SMP multiprocessors running a TPC-W workload

IEE Proceedings - Computers and Digital Techniques

The infrastructure that supports electronic commerce is one of the areas where more processing power is needed. A multiprocessor system can offer advantages for running electronic commerce applications. The memory performance of an electronic commerce server, i.e. a system running electronic commerce applications, is evaluated for a shared-bus multiprocessor architecture. The software architecture of this server is based on a three-tier model, and the workloads have been set up as specified by the TPC-W benchmark. The hardware configurations are a single SMP running tiers two and three, and two SMPs each running a single tier. The influence of the memory subsystem on performance and scalability is analysed, and several solutions aimed at reducing memory latency are considered. After initial experiments that validate the methodology, choices of cache parameters, scheduling algorithm and coherence protocol are explored to enhance performance and scalability. As in previous studies on shared-bus multiprocessors, memory performance was found to be highly influenced by cache parameters. As the machine is scaled, coherence overhead weighs increasingly on memory performance. False sharing in the kernel is among the main causes of this overhead. Unlike in previous studies, passive sharing, i.e. the useless sharing of the private data of migrating processes, is shown to be an important factor influencing performance. This is especially true for multiprocessors with a higher number of processors: increasing the number of processors produces real benefits only if advanced techniques for reducing the coherence overhead are properly adopted. Scheduling techniques that limit process migration may reduce passive sharing, while techniques that restructure kernel data may reduce false-sharing misses. However, even when process migration is reduced through cache-affinity techniques, standard coherence protocols such as MESI do not allow the best performance. Coherence protocols such as PSCR and AMSD produce performance benefits. PSCR, in particular, eliminates coherence overhead due to passive sharing and minimises the number of coherence misses. The adoption of PSCR and cache-affinity scheduling allows multiprocessor scalability to be extended to 20 processors for a 128-bit shared bus and current values of the main-memory-to-processor speed gap.
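
The notion of passive sharing described above can be illustrated with a toy model. The sketch below is not the trace-driven simulator used in the study, and it does not implement MESI, PSCR or AMSD; it assumes a simplified write-invalidate rule in which a write on one processor invalidates every stale copy of that line held in another cache. The function name, the schedule encoding and the line addresses are hypothetical and chosen only for illustration. It counts the invalidations caused purely by a process's private data being left behind when the process migrates.

```python
from collections import defaultdict


def count_passive_sharing_invalidations(schedule, private_lines):
    """Count invalidations of stale private copies left behind by migration.

    schedule: sequence of CPU ids, one per time slice, saying where the
              (single) process runs in that slice.
    private_lines: cache-line addresses touched only by this process.

    Toy assumption: in every time slice the process writes each of its
    private lines once, and a write invalidates any copy in another cache.
    """
    caches = defaultdict(set)  # cpu id -> set of cached line addresses
    invalidations = 0
    for cpu in schedule:
        for line in private_lines:
            # Simplified write-invalidate: drop the line from other caches.
            for other, lines in caches.items():
                if other != cpu and line in lines:
                    lines.remove(line)
                    invalidations += 1
            caches[cpu].add(line)
    return invalidations


# The process bounces between CPU 0 and CPU 1 on every time slice.
migrating = count_passive_sharing_invalidations([0, 1, 0, 1], range(4))

# Cache-affinity scheduling keeps the process on CPU 0.
pinned = count_passive_sharing_invalidations([0, 0, 0, 0], range(4))

print(migrating, pinned)
```

In this toy setting every migration forces one invalidation per private line, although no other process ever uses the data, whereas the pinned schedule generates none. This is only meant to convey why limiting migration reduces passive sharing; it says nothing about the magnitude of the effect measured in the study.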

http://iet.metastore.ingenta.com/content/journals/10.1049/ip-cdt_20040349