Visual voice activity detection with optical flow

IET Image Processing

Current voice activity detection methods generally use only acoustic information. They are therefore susceptible to misclassification when other acoustic sources are present, such as another speaker or non-stationary noise. To address this issue, the authors propose a new method of voice activity detection that relies solely on visual information taken from the speaker's mouth region. Such video information is not affected by the acoustic environment. Simulations show that a high percentage of correct silence detection (CSD) can be obtained with a low percentage of false silence detection (FSD). Comparisons with two other visual voice activity detectors show the proposed method to be consistently more accurate, yielding on average a 4% improvement in CSD. The usefulness of the method is confirmed by applying it to a previously published audio–visual convolutive blind source separation algorithm to increase the intelligibility of a speaker.
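
The abstract reports results as CSD and FSD percentages but does not define them here. The sketch below shows one plausible way to compute such figures from frame-level labels. It assumes that CSD is the share of truly silent frames labelled silent and that FSD is the share of speech frames wrongly labelled silent; these definitions, along with the function name `silence_detection_rates` and its parameters, are illustrative assumptions and are not taken from the article.

```python
from typing import Sequence, Tuple


def silence_detection_rates(
    truth_silent: Sequence[bool], predicted_silent: Sequence[bool]
) -> Tuple[float, float]:
    """Return (CSD %, FSD %) under the assumed definitions.

    truth_silent[i] is True when frame i is silent in the ground truth.
    predicted_silent[i] is True when the detector labels frame i silent.
    Assumed: CSD = silent frames detected as silent / all silent frames,
             FSD = speech frames detected as silent / all speech frames.
    """
    if len(truth_silent) != len(predicted_silent):
        raise ValueError("label sequences must have the same length")

    silent_total = sum(1 for t in truth_silent if t)
    speech_total = len(truth_silent) - silent_total

    pairs = list(zip(truth_silent, predicted_silent))
    correct_silence = sum(1 for t, p in pairs if t and p)
    false_silence = sum(1 for t, p in pairs if not t and p)

    # Report 0.0 when a class is absent rather than dividing by zero.
    csd = 100.0 * correct_silence / silent_total if silent_total else 0.0
    fsd = 100.0 * false_silence / speech_total if speech_total else 0.0
    return csd, fsd


if __name__ == "__main__":
    truth = [True, True, True, True, False, False, False, False]
    predicted = [True, True, True, False, True, False, False, False]
    csd, fsd = silence_detection_rates(truth, predicted)
    print(f"CSD = {csd:.1f}%, FSD = {fsd:.1f}%")
```

Under these assumed definitions, a good detector pushes CSD towards 100% while keeping FSD near 0%, which matches the trade-off the abstract describes.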


