This is an open access article published by the IET under the Creative Commons Attribution-NonCommercial License (http://creativecommons.org/licenses/by-nc/3.0/)
Statistical speech reconstruction for larynx-related dysphonia has achieved good performance using Gaussian mixture models and, more recently, restricted Boltzmann machine arrays; however, deep neural network (DNN)-based systems have been hampered by the limited amount of training data available from individual voice-loss patients. The authors propose a novel DNN structure that allows a partially supervised training approach on spectral features from smaller data sets, yielding results that compare favourably with the current state of the art.
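The core operation described above is a regression: a network maps source spectral features (e.g. from whispered or dysphonic speech) to target spectral features of phonated speech, trained on a small set of paired frames. The following is a minimal sketch of that idea only; the layer sizes, synthetic paired data, and plain gradient descent are assumptions for illustration and do not reproduce the authors' partially supervised DNN structure.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic paired frames standing in for aligned spectral features:
# 200 examples, 24-dim source -> 24-dim target (both hypothetical sizes).
n, d = 200, 24
X = rng.standard_normal((n, d))
W_true = rng.standard_normal((d, d)) * 0.3
Y = np.tanh(X @ W_true)  # stand-in for the aligned target spectra

# One hidden layer of 32 tanh units with a linear output layer.
h = 32
W1 = rng.standard_normal((d, h)) * 0.1; b1 = np.zeros(h)
W2 = rng.standard_normal((h, d)) * 0.1; b2 = np.zeros(d)

def forward(X):
    H = np.tanh(X @ W1 + b1)
    return H, H @ W2 + b2

# Plain full-batch gradient descent on mean-squared error.
lr = 0.05
for epoch in range(500):
    H, P = forward(X)
    err = P - Y                       # dMSE/dP (up to a constant factor)
    gW2 = H.T @ err / n; gb2 = err.mean(0)
    dH = (err @ W2.T) * (1 - H ** 2)  # back-propagate through tanh
    gW1 = X.T @ dH / n; gb1 = dH.mean(0)
    W1 -= lr * gW1; b1 -= lr * gb1
    W2 -= lr * gW2; b2 -= lr * gb2

mse = ((forward(X)[1] - Y) ** 2).mean()
print(f"final training MSE: {mse:.4f}")
```

In practice the converted spectral envelope would then be passed to a vocoder (the paper's references use TANDEM-STRAIGHT-style analysis/synthesis) to resynthesise audible speech; with very small patient-specific data sets, the regularisation and partially supervised pre-training the authors propose matter far more than this toy fit suggests.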