Speech-driven facial animation using a hierarchical model

Abstract

A system capable of producing near video-realistic animation of a speaker given only speech input is presented. The audio input is a continuous speech signal that requires no phonetic labelling, and the system is speaker-independent. Only a short video training corpus of a subject speaking a list of viseme-targeted words is needed to achieve convincingly realistic facial synthesis. The system learns the natural mouth and face dynamics of a speaker, allowing new facial poses, unseen in the training video, to be synthesised. To achieve this, the authors have developed a novel approach that uses a hierarchical, nonlinear principal components analysis (PCA) model which couples speech and appearance. Animation of the different facial areas defined by the hierarchy is performed separately, and the results are merged in post-processing using an algorithm that combines texture and shape PCA data. It is shown that the model can synthesise videos of a speaker using new audio segments from both previously heard and unheard speakers.
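
The idea of coupling speech and appearance in one PCA model can be illustrated with a minimal sketch. This is not the authors' method: it uses a single linear PCA in place of the hierarchical, nonlinear model described above, and the function names (`fit_coupled_pca`, `predict_appearance`) and the feature layout are hypothetical. It assumes each training frame provides one speech feature vector and one appearance parameter vector.

```python
import numpy as np

def fit_coupled_pca(speech, appearance, n_components):
    """Fit one linear PCA to concatenated [speech | appearance] frames.

    speech:     (n_frames, d_speech) array of audio features (assumed layout)
    appearance: (n_frames, d_app) array of face parameters (assumed layout)
    """
    joint = np.hstack([speech, appearance])
    mean = joint.mean(axis=0)
    # SVD of the centred data; rows of vt are the principal directions.
    _, _, vt = np.linalg.svd(joint - mean, full_matrices=False)
    basis = vt[:n_components].T  # shape (d_speech + d_app, n_components)
    return mean, basis, speech.shape[1]

def predict_appearance(model, speech_new):
    """Estimate appearance parameters for new speech frames."""
    mean, basis, d_speech = model
    basis_s, basis_a = basis[:d_speech], basis[d_speech:]
    # Least-squares PCA coefficients that best explain the speech part.
    coeffs, *_ = np.linalg.lstsq(
        basis_s, (speech_new - mean[:d_speech]).T, rcond=None
    )
    # Reconstruct the appearance part from the same coefficients.
    return mean[d_speech:] + (basis_a @ coeffs).T
```

Given training arrays `speech` and `appearance`, a call such as `predict_appearance(fit_coupled_pca(speech, appearance, 2), speech_new)` returns one appearance vector per new speech frame. A nonlinear variant would typically replace the single linear subspace with several local ones, but that step is outside this sketch.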
