Dataflow and microarchitecture co-optimisation for sparse CNN on distributed processing element accelerator

Accelerators that exploit the sparsity of both activation data and network structure in convolutional neural networks (CNNs) have demonstrated efficient CNN processing with superior performance. Previous studies have identified three critical concerns when designing accelerators for sparse CNNs: data reuse, parallel computing performance, and effective sparse computation. Each of these factors has been addressed in earlier accelerator designs, but no design has considered all of them at the same time. This study provides analytical approaches and experimental results that reveal insights for sparse CNN accelerator design. The authors show that all of these architectural aspects, including their mutual effects, must be considered together to avoid performance pitfalls. Based on the proposed analytical approach, they propose enhancement techniques that are co-designed across the factors discussed in this study. The improved architecture achieves up to 1.5× data reuse and/or 1.55× performance improvement over state-of-the-art sparse CNN accelerators while maintaining equal area and energy cost.
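To make the notion of effective sparse computation concrete, the minimal Python sketch below (not taken from the paper; all names and the data layout are illustrative assumptions) shows how a processing element can skip ineffectual multiply-accumulate operations by iterating only over non-zero weights stored in a compressed (value, index) form and discarding pairs whose activation is also zero.

# Illustrative sketch only: a simplified zero-skipping multiply-accumulate,
# assuming weights are kept in a CSR-like (value, index) compressed form.
# Names and structure are hypothetical, not the paper's design.

def sparse_mac(weight_vals, weight_idxs, activations):
    """Accumulate products of non-zero weights with their matching activations,
    skipping any pair whose activation is zero (an ineffectual computation)."""
    acc = 0
    for w, idx in zip(weight_vals, weight_idxs):
        a = activations[idx]
        if a != 0:  # skip ineffectual MACs caused by activation sparsity
            acc += w * a
    return acc

# Example: a pruned 8-element weight vector keeps only 3 non-zero values.
weight_vals = [0.5, -1.0, 0.25]                 # non-zero weight values
weight_idxs = [1, 4, 6]                         # their positions in the dense vector
activations = [0, 2.0, 0, 0, 0, 3.0, 4.0, 0]    # post-ReLU activations (many zeros)

print(sparse_mac(weight_vals, weight_idxs, activations))  # 0.5*2.0 + 0.25*4.0 = 2.0

Such zero-skipping reduces the effectual work per processing element, but its benefit depends on how the resulting irregular access pattern interacts with data reuse and parallel workload distribution, which is why the abstract stresses considering the design factors jointly rather than in isolation.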

Inspec keywords: neural net architecture; AI chips; optimisation; parallel architectures; convolutional neural nets

Other keywords: performance improvement; convolutional neural networks; sparse CNN; parallel computing performance; CNN accelerators; sparse computation; distributed processing element accelerator; microarchitecture co-optimisation; critical design concerns; data reuse

Subjects: Microprocessor chips; Neural computing techniques; Optimisation techniques; Parallel architecture
