The infrastructure supporting electronic commerce is one of the areas where more processing power is needed, and a multiprocessor system can offer advantages for running electronic commerce applications. The memory performance of an electronic commerce server, i.e. a system running electronic commerce applications, is evaluated for a shared-bus multiprocessor architecture. The software architecture of this server is based on a three-tier model, and the workloads have been set up as specified by the TPC-W benchmark. Two hardware configurations are considered: a single SMP running tiers two and three, and two SMPs each running a single tier. The influence of the memory subsystem on performance and scalability is analysed, and several solutions aimed at reducing memory latency are considered. After initial experiments, which validate the methodology, choices for cache, scheduling algorithm and coherence protocol are explored to enhance performance and scalability. As in previous studies on shared-bus multiprocessors, memory performance was found to be highly influenced by cache parameters. As the machine is scaled up, the coherence overhead weighs more and more on memory performance. False sharing in the kernel is among the main causes of this overhead. Unlike in previous studies, passive sharing, i.e. the useless sharing of the private data of migrating processes, is shown to be an important factor influencing performance. This is especially true for multiprocessors with a higher number of processors: an increase in the number of processors produces real benefits only if advanced techniques for reducing the coherence overhead are properly adopted. Scheduling techniques that limit process migration may reduce passive sharing, while techniques that restructure kernel data may reduce false-sharing misses.
However, even when process migration is reduced through cache-affinity techniques, standard coherence protocols such as MESI do not deliver the best performance. Coherence protocols such as PSCR and AMSD produce performance benefits. PSCR, in particular, eliminates the coherence overhead due to passive sharing and minimises the number of coherence misses. The adoption of PSCR and cache-affinity scheduling allows multiprocessor scalability to be extended to 20 processors for a 128-bit shared bus and current values of the main-memory-to-processor speed gap.
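
As a rough illustration of passive sharing, the toy model below counts the coherence actions caused purely by a process migrating between processors while it writes only its own private data. It is a sketch under strong assumptions, not the simulator or protocol used in the study: caches are unbounded, every access is a write, the protocol is a bare write-invalidate scheme rather than full MESI, and the function name `passive_sharing_ops` and its parameters are hypothetical.

```python
def passive_sharing_ops(schedule, n_private_blocks, n_cpus=2):
    """Count invalidations of stale private data left behind by migration.

    schedule          -- CPU index the single process runs on in each time slice
    n_private_blocks  -- number of private cache blocks the process writes per slice
    """
    # One unbounded cache per CPU, modelled as the set of blocks it holds.
    caches = [set() for _ in range(n_cpus)]
    coherence_ops = 0

    for cpu in schedule:
        for block in range(n_private_blocks):
            # A write must invalidate every copy held by another cache.
            # The data is private, so these copies exist only because the
            # process ran there earlier: this is passive sharing.
            for other in range(n_cpus):
                if other != cpu and block in caches[other]:
                    caches[other].discard(block)
                    coherence_ops += 1
            caches[cpu].add(block)

    return coherence_ops


# The process bounces between CPU 0 and CPU 1 on every time slice.
migrating = passive_sharing_ops([0, 1, 0, 1], n_private_blocks=4)
# Cache-affinity scheduling keeps the process on CPU 0.
affinity = passive_sharing_ops([0, 0, 0, 0], n_private_blocks=4)

print("migrating:", migrating)  # 12
print("affinity: ", affinity)   # 0
```

In this toy setting the migrating schedule pays one invalidation per private block on every slice after the first, while the affinity schedule pays none. This mirrors, in a very simplified way, why limiting process migration reduces passive sharing.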


Simulation study of memory performance of SMP multiprocessors running a TPC-W workload
