Soft-error reliable architecture for future microprocessors

IET Computers & Digital Techniques

A transient (soft) error is a device failure caused by a transient hardware fault, which arises when high-energy particles such as neutrons and alpha particles strike the device. In this study, the authors propose two fault-tolerant architecture schemes. The first, REMO, is a hardware-based solution that combines the best features of space and time redundancy. REMO provides very high fault coverage with minimal overheads in performance, power and area. The second, REMORA, combines the best features of hardware and software approaches to fault tolerance, and eliminates the persistent problem of unprotected code found in software approaches. Simulation results on the SPEC2006 benchmark suite indicate that REMO incurs an area increase of about 6%, a power overhead of 9% despite redundant execution, and a negligible performance penalty during a fault-free run. In REMORA, the performance degradation rises to 12% and the code size inflation is close to 12%, owing to the additional signature instructions inserted into the application program. The authors explore the possibility of eliminating this penalty by embedding the signatures in control flow instructions. The power and area overheads of REMORA are on par with those of REMO.
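
The abstract does not describe how REMO is implemented, so the following is only a minimal software sketch of the general time-redundancy idea it refers to: run an operation twice, compare the results, and re-execute on a mismatch on the assumption that the fault was transient. All names (`flip_bit`, `execute`, `redundant_execute`, `fault_bit`, `max_retries`) are hypothetical and are not taken from the paper.

```python
import operator


def flip_bit(value: int, bit: int) -> int:
    """Model a particle strike by inverting one bit of an integer result."""
    return value ^ (1 << bit)


def execute(op, a: int, b: int, fault_bit=None) -> int:
    """Run one operation; optionally corrupt the result to emulate a transient fault."""
    result = op(a, b)
    return result if fault_bit is None else flip_bit(result, fault_bit)


def redundant_execute(op, a: int, b: int, fault_bit=None, max_retries: int = 1):
    """Execute twice and compare; on a mismatch, re-execute both copies.

    Returns (result, fault_detected). This is an illustrative assumption about
    detect-and-retry behaviour, not the REMO microarchitecture.
    """
    first = execute(op, a, b, fault_bit)  # copy that may be hit by the fault
    second = execute(op, a, b)            # redundant copy
    if first == second:
        return first, False
    for _ in range(max_retries):
        # A transient fault is assumed not to recur on re-execution.
        first = execute(op, a, b)
        second = execute(op, a, b)
        if first == second:
            return first, True
    raise RuntimeError("results still disagree after retry")


if __name__ == "__main__":
    print(redundant_execute(operator.add, 2, 3))               # fault-free run
    print(redundant_execute(operator.add, 2, 3, fault_bit=1))  # injected bit flip
```

In this sketch the fault-free path costs one extra execution and one comparison, which is the overhead that redundant-execution hardware aims to hide; the figures quoted in the abstract come from the authors' simulations, not from this example.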