Simulation study of memory performance of SMP multiprocessors running a TPC-W workload

IEE Proceedings - Computers and Digital Techniques


The infrastructure that supports electronic commerce is one of the areas where more processing power is needed, and a multiprocessor system can offer advantages for running electronic commerce applications. The memory performance of an electronic commerce server, i.e. a system running electronic commerce applications, is evaluated for a shared-bus multiprocessor architecture. The software architecture of this server is based on a three-tier model, and the workloads have been set up as specified by the TPC-W benchmark. The hardware configurations are a single SMP running tiers two and three, and two SMPs each running a single tier. The influence of the memory subsystem on performance and scalability is analysed, and several solutions aimed at reducing memory latency are considered. After initial experiments, which validate the methodology, choices of cache, scheduling algorithm and coherence protocol are explored to enhance performance and scalability. As in previous studies on shared-bus multiprocessors, memory performance was found to be highly influenced by cache parameters. As the machine is scaled up, the coherence overhead weighs more and more on memory performance. False sharing in the kernel is among the main causes of this overhead. In contrast to previous studies, passive sharing, i.e. the useless sharing of the private data of migrating processes, is shown to be an important factor that influences performance. This is especially true for multiprocessors with a higher number of processors: increasing the number of processors produces real benefits only if advanced techniques for reducing the coherence overhead are properly adopted. Scheduling techniques that limit process migration may reduce passive sharing, while techniques that restructure the kernel data may reduce false sharing misses.
However, even when process migration is reduced through cache-affinity techniques, standard coherence protocols such as MESI do not deliver the best performance. Coherence protocols such as PSCR and AMSD produce performance benefits. PSCR, in particular, eliminates the coherence overhead due to passive sharing and minimises the number of coherence misses. The adoption of PSCR and cache-affinity scheduling allows multiprocessor scalability to be extended to 20 processors for a 128-bit shared bus and current values of the main-memory-to-processor speed gap.
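
As a rough illustration of the passive sharing described above, the following Python sketch counts the invalidations that a write-invalidate protocol would issue for the private data of a single migrating process. It is a toy model under stated assumptions, not the trace-driven simulator used in the study: the function name, its parameters and the random migration policy are hypothetical, and cache capacity and replacement are ignored, so stale copies are never evicted.

```python
import random


def count_passive_invalidations(num_cpus, num_steps, migrate_prob,
                                lines_per_process=8, seed=0):
    """Count invalidations caused only by the private data of one process.

    Toy assumptions (not taken from the article): the process writes every
    one of its private cache lines at each step, and before each step it
    migrates to a different CPU with probability `migrate_prob`.
    """
    rng = random.Random(seed)
    # holders[line] is the set of CPUs whose cache holds a copy of that line.
    holders = [set() for _ in range(lines_per_process)]
    cpu = 0
    invalidations = 0
    for _ in range(num_steps):
        if num_cpus > 1 and rng.random() < migrate_prob:
            # Migration: the copies left in the old cache become stale.
            cpu = rng.choice([c for c in range(num_cpus) if c != cpu])
        for line in range(lines_per_process):
            # Write-invalidate: a write must invalidate every remote copy,
            # even though the data is private to this process.
            if holders[line] - {cpu}:
                invalidations += 1
            holders[line] = {cpu}
    return invalidations


if __name__ == "__main__":
    for p in (0.0, 0.1, 1.0):
        print(p, count_passive_invalidations(4, 1000, p))
```

In this model a process that never migrates generates no coherence traffic for its private lines, while every migration costs one invalidation per private line. That is the overhead that limiting migration (cache-affinity scheduling) reduces, and that a protocol aware of private data aims to avoid altogether.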

http://iet.metastore.ingenta.com/content/journals/10.1049/ip-cdt_20040349