Speech reconstruction using a deep partially supervised neural network
- Author(s): Ian McLoughlin 1,2; Jingjie Li 2; Yan Song 2; Hamid R. Sharifzadeh 3
- Affiliations:
  1: School of Computing, The University of Kent, Medway, UK
  2: National Engineering Laboratory of Speech and Language Information Processing, The University of Science and Technology of China, Hefei, Anhui, People's Republic of China
  3: Signal Processing Laboratory, Unitec Institute of Technology, Auckland, New Zealand
- Source: Volume 4, Issue 4, August 2017, pp. 129–133
- DOI: 10.1049/htl.2016.0103, Online ISSN 2053-3713
Statistical speech reconstruction for larynx-related dysphonia has achieved good performance using Gaussian mixture models and, more recently, restricted Boltzmann machine arrays; however, deep neural network (DNN)-based systems have been hampered by the limited amount of training data available from individual voice-loss patients. The authors propose a novel DNN structure that allows a partially supervised training approach on spectral features from smaller data sets, yielding very good results compared with the current state-of-the-art.
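The paper itself does not include code, but the idea the abstract describes (making DNN training feasible when only a small amount of patient-specific data is available) can be illustrated with a toy sketch: unsupervised layer-wise pretraining on plentiful unlabelled spectral frames, followed by supervised fine-tuning of an output layer on a small set of paired examples. This is a minimal NumPy illustration of that general two-stage scheme, not the authors' actual architecture; every function name, dimension, and hyperparameter here is a made-up assumption.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def pretrain_autoencoder(X, hidden, epochs=200, lr=0.1):
    """Unsupervised stage: learn one hidden layer as a tied-weight
    autoencoder on plentiful unlabelled spectral frames."""
    d = X.shape[1]
    W = rng.normal(0.0, 0.1, (d, hidden))
    b = np.zeros(hidden)   # encoder bias
    c = np.zeros(d)        # decoder bias
    for _ in range(epochs):
        H = sigmoid(X @ W + b)        # encode
        R = H @ W.T + c               # decode with tied weights
        err = R - X                   # reconstruction error
        dH = (err @ W) * H * (1.0 - H)
        gW = X.T @ dH + err.T @ H     # encoder + decoder gradient terms
        W -= lr * gW / len(X)
        b -= lr * dH.mean(axis=0)
        c -= lr * err.mean(axis=0)
    return W, b

def finetune_output(H, Y, epochs=2000, lr=0.05):
    """Supervised stage: fit a linear output layer on the small set of
    paired (impaired input, target speech) features, with the
    pretrained hidden layer held fixed."""
    V = rng.normal(0.0, 0.1, (H.shape[1], Y.shape[1]))
    for _ in range(epochs):
        err = H @ V - Y
        V -= lr * (H.T @ err) / len(H)
    return V

# Toy data: 200 unlabelled 16-dim frames, but only 20 labelled pairs.
X_unlab = rng.normal(size=(200, 16))
X_pair = X_unlab[:20]
Y_pair = 0.5 * (X_pair @ rng.normal(size=(16, 8)))  # synthetic targets

W, b = pretrain_autoencoder(X_unlab, hidden=12)
H = sigmoid(X_pair @ W + b)          # features from the pretrained layer
V = finetune_output(H, Y_pair)
mse = np.mean((H @ V - Y_pair) ** 2)
print(f"fine-tuned MSE on paired data: {mse:.4f}")
```

The point of the sketch is only that the hidden representation is learned without any paired data, so the scarce patient-specific pairs are spent on the (much smaller) supervised stage; the paper's deep network and its partially supervised training procedure differ in detail.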
Inspec keywords: medical signal processing; speech processing; medical disorders; Boltzmann machines
Other keywords: voice-loss patients; statistical speech reconstruction; Gaussian mixture models; deep partially supervised neural network; partially supervised training approach; larynx related dysphonia; restricted Boltzmann machine arrays; DNN structure
Subjects: Speech and biocommunications; Biology and medical computing; Neural computing techniques; Biomedical engineering; Speech and audio signal processing; Biomedical measurement and imaging