BALDI (talking head) speaking Czech (2001)
Baldi is a computer-animated talking head developed at the University of California at Santa Cruz (http://cslu.cse.ogi.edu/toolkit) in the late 1990s. Baldi produces realistic animation of face, mouth and tongue movements synchronized with either synthetic or natural speech. Baldi was developed primarily as an aid for teaching hearing-impaired children, but it may have a much broader range of uses, e.g. as an animated agent in information kiosks, as a tool supporting the perception of synthesized speech in noisy conditions, as an artificial tutor in foreign language learning, or as a character in computer games. The Baldi software consists of two main programs: the 3D face animator and the speech synchronization tool. They run on the widely used MS Windows (95/98/2000) platform. The original version of Baldi could 'speak' only English. Later, a Spanish version was also developed. The goal of our work was to teach Baldi to speak Czech.
Teaching Baldi the Czech language
In the first step we simply tried to mimic Czech using the original software. A Czech sentence was transcribed into a sequence of English phonemes and played back by the English TTS system synchronized with Baldi. The key problem was to find the English phonemes most similar to the Czech ones, both acoustically and visually. Where there was no direct correspondence, e.g. for the uniquely Czech consonants "ř", "ď", "ť" and "ň", we had to use the closest English consonants, such as "r", "d", "t" and "n". The next step was to bypass the Baldi synchronization module and thus take over control of Baldi ourselves. In this way we wanted to avoid the problems of manually transcribing Czech sentences into corresponding English phonemes, as described above. For this we used the SOB files that the original package uses for the final control of Baldi's movements. Such a file includes information on the speech signal (either synthesized or recorded) and the positions of the individual phonemes.
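The consonant substitution described above can be sketched in a few lines of Python. This is only an illustration: the table contains just the four pairs named in the text, the names `CZECH_TO_ENGLISH_FALLBACK` and `substitute_czech_consonants` are hypothetical, and the function works on letters rather than on a real phoneme transcription, which the actual procedure would require.

```python
# Hypothetical fallback table: Czech consonants without an English
# counterpart, mapped to the closest English consonant named in the text.
CZECH_TO_ENGLISH_FALLBACK = {
    "ř": "r",
    "ď": "d",
    "ť": "t",
    "ň": "n",
}


def substitute_czech_consonants(text: str) -> str:
    """Replace Czech-only consonants with their closest English stand-ins.

    Assumption: letter-level substitution on lowercased text; all other
    characters (including accented vowels) are passed through unchanged.
    """
    return "".join(CZECH_TO_ENGLISH_FALLBACK.get(ch, ch) for ch in text.lower())


if __name__ == "__main__":
    print(substitute_czech_consonants("řeka"))
```

A complete mapping would also have to cover vowels and the remaining consonants, and would have to be checked against both the acoustic and the visual similarity mentioned in the text.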
Three tasks for Baldi
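We wanted the Czech version of the Baldi software to be usable in three different tasks: animating synthetic speech produced by a Czech TTS system, animating recorded speech whose text transcription is known, and animating recorded speech whose content is unknown.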
Note that the original software handled only the first two tasks (for English, of course). For the first task we used the Czech TTS system developed at the Institute of Radioelectronics (URE) in Prague. This system uses a Czech diphone inventory and a cepstral representation of the speech signal. The synthesis module provides the diphone time boundaries, from which we estimate the phoneme boundaries needed in section 4 of the corresponding SOB file. In the second task we benefit from the fact that the content (text transcription) of the utterance is known. We can therefore apply a text-to-phoneme transcription followed by a phoneme-to-signal alignment procedure. Both procedures were developed earlier within the Aligner project. The third task was the most difficult one because we have no information about the content of the utterance, so we must rely on an estimate of what was spoken. In our case this is done by running a speech recognizer that searches for the most probable sequence of phonemes in the signal. We use our own recognition engine working with an inventory of HMMs trained for Czech phonemes. The performance of the phoneme recognition is not high: about 72 % of phonemes are classified correctly. However, detailed analysis shows that this is not critical, since most recognition errors are confusions between very similar phonemes (e.g. the short and long versions of the same vowel).
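The estimation of phoneme boundaries from diphone boundaries in the first task can be illustrated with a minimal Python sketch. It rests on an assumption that is not stated in the text: each diphone runs from the middle of one phoneme to the middle of the next, so the boundary between the two phonemes is placed at the temporal midpoint of the diphone. The function name, the tuple layout and the "#" silence symbol are hypothetical, and the sketch does not reproduce the SOB file format.

```python
from typing import List, Tuple

# (left phoneme, right phoneme, start time, end time), times in milliseconds
Diphone = Tuple[str, str, float, float]
# (phoneme, start time, end time)
Segment = Tuple[str, float, float]


def phonemes_from_diphones(diphones: List[Diphone]) -> List[Segment]:
    """Estimate phoneme segments from a contiguous diphone sequence.

    Assumption: the transition between the two phonemes of a diphone
    lies at the midpoint of that diphone.
    """
    if not diphones:
        return []
    segments: List[Segment] = []
    seg_start = diphones[0][2]
    for left, _right, start, end in diphones:
        cut = (start + end) / 2.0  # assumed phoneme boundary
        segments.append((left, seg_start, cut))
        seg_start = cut
    # The right-hand phoneme of the last diphone closes the utterance.
    last = diphones[-1]
    segments.append((last[1], seg_start, last[3]))
    return segments


if __name__ == "__main__":
    example = [("#", "a", 0.0, 100.0), ("a", "h", 100.0, 300.0), ("h", "#", 300.0, 400.0)]
    for phoneme, start, end in phonemes_from_diphones(example):
        print(phoneme, start, end)
```

A real system could refine the midpoint assumption, for example by storing the transition point of each diphone in the inventory.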