A. Installing and starting the software

Start the installation by running Setup.exe. The setup program installs both the software and several files with sample data. VISPER consists of 5 executable files: Visper, Vsignal, DTW, Vmarkov1 and Vmarkov2. The whole system is launched by starting Visper.exe.

To end your work with the VISPER system press the Exit button in the Visper Setup tool window.

B. The VISPER tools

Visper Setup: sets up conditions for experiments, starts the other tools
Signal Profiler: enables data recording and analysis, outputs recognition results, gives access to the other tools
DTW Explorer: demonstrates visualized DTW matching
Visual Markov: provides visualized HMM training and matching
(All the above tools are displayed in the same size and colors as they will appear on your computer.)

C. Basic features of the speech processing tools

Speech data: sampling at 8000 Hz, 16-bit resolution, 20 ms frames with a 10 ms overlap
Speech parameters: 20 features - 8 LPC cepstrum coefficients, 8 delta cepstrum coefficients, (log) energy, delta energy, delta-delta energy, and a spectral variation function coefficient (measuring the local change of the cepstrum)
Endpoint detection: based on energy, max. duration of a speech token: 2000 ms
DTW algorithms: linear time warping, DTW with Itakura constraints, DTW with no slope constraints and a selectable range parameter
HMMs: continuous whole-word HMMs with up to 12 states and 3 mixtures
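
The framing figures above imply 160-sample frames advanced by 80 samples, and at most 199 frames per speech token. The following Python sketch works through that arithmetic. The function names, the energy floor and the simple threshold rule for endpoint detection are assumptions made for illustration; the manual only states that the detector is based on energy.

```python
import math

SAMPLE_RATE = 8000      # Hz, from the list above
FRAME_MS = 20           # frame length
STEP_MS = 10            # frame advance implied by a 10 ms overlap
MAX_TOKEN_MS = 2000     # longest speech token

FRAME_LEN = SAMPLE_RATE * FRAME_MS // 1000               # 160 samples
FRAME_STEP = SAMPLE_RATE * STEP_MS // 1000               # 80 samples
MAX_FRAMES = 1 + (MAX_TOKEN_MS - FRAME_MS) // STEP_MS    # 199 frames


def split_into_frames(samples):
    """Cut a list of samples into overlapping frames (incomplete tail dropped)."""
    if len(samples) < FRAME_LEN:
        return []
    count = 1 + (len(samples) - FRAME_LEN) // FRAME_STEP
    return [samples[i * FRAME_STEP:i * FRAME_STEP + FRAME_LEN] for i in range(count)]


def log_energy(frame, floor=1e-10):
    """Log of the summed squared samples; the floor (an assumption) avoids log(0)."""
    return math.log(sum(s * s for s in frame) + floor)


def detect_endpoints(energies, threshold):
    """Hypothetical detector: first and last frame whose energy exceeds the threshold.

    The token is cut to MAX_FRAMES frames; returns None if no frame is active.
    """
    active = [i for i, e in enumerate(energies) if e > threshold]
    if not active:
        return None
    first = active[0]
    last = min(active[-1], first + MAX_FRAMES - 1)
    return first, last
```

For example, a 2000 ms recording (16000 samples) passed to split_into_frames yields 199 frames of 160 samples each.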

D. Operating modes

The VISPER system offers 10 operating modes. All of them can be launched from the starting page of the Visper Setup tool. You also return to that page whenever you press the Begin Again button in the Visper Setup window.

The operating modes have been arranged according to the complexity of the tasks. In single modes (1-5) only one tool is engaged at a time, while in combined modes (6-9) the Signal Profiler runs in co-operation with the other tools. Here is the list of the modes:

1. Speech Recording mode: allows you to define vocabularies and record data
2. Signal Exploring mode: recorded data can be observed and analysed
3. DTW Matching mode: runs the DTW procedures on prerecorded reference and test data
4. HMM Training mode: trains continuous HMMs with the given parameters
5. HMM Matching mode: runs HMM matching procedures on trained models and prerecorded speech data
6. Pick & Match (HMM) mode: allows you to see prerecorded data in the Signal Profiler window and run a recognition test on it by 'jumping' into the Visual Markov tool
7. Pick & Match (DTW) mode: the same as above but with the DTW Explorer tool
8. Speak & Recognize (HMM) mode: allows you to say words in the Signal Profiler window, see recognition results there and jump into the Visual Markov tool (if interested)
9. Speak & Recognize (DTW) mode: the same as above but with the DTW Explorer tool
10. Database Maintenance mode: allows you to clean the disk safely if some part of the database becomes obsolete

E. Setting up conditions for the experiments

Each of the above modes can be set up within a Visper Setup procedure that involves several simple steps. The steps have their own pages (tab stops) where you are supposed to make some choices. After doing so, press the Next button to proceed. To help the user, each page carries a blue-print prompt explaining what should be done. During the setup procedure you can see the selections already made in the lower part of the Visper Setup tool window. If you want to change any of the previously made selections, press the Begin Again button. The setup procedure usually takes no more than a few seconds.

F. Recording new speech data

Recording is done in the Speech Recording mode. After selecting it on the Visper Setup initial page, press the Next button to get to the next page - the vocabulary selection page. If you want to add a new recording to an already existing vocabulary, select the vocabulary name and press the Next button.

If you wish to create a new vocabulary, check this option and press the Next button. You will arrive at the new vocabulary page. Enter the vocabulary name and then type the vocabulary items. These should be either single words or groups of words linked by an underscore (like "go_down"). If necessary, any of the items can be easily edited. To close the vocabulary, press the Next button.

The data you want to record will be stored in one file. One file always keeps one repetition of the given vocabulary uttered by a specific speaker. Thus the next step consists in typing the name of the file that will be used for the new recording. This is to be done on the file page.

After that you will arrive at the setup end page, where you have a last chance to check that your settings are correct. If so, press the Next button to start the Signal Profiler tool.

The Signal Profiler will prompt you to start recording. When you are ready, press the Next Word button and see which word is to be said. When you say it, the speech signal is automatically detected and displayed. Optionally it is also replayed. If you are not satisfied with the recording, press the Repeat button. Continue this procedure until the whole vocabulary is recorded. Then you should leave the Signal Profiler to start another action. If you have not saved the recording before, you are reminded to do so.
Note: During the recording session you can explore the recorded signal in the same way as in the Signal Exploring mode.

G. Observing the already recorded speech data

The Signal Exploring mode allows you to see and analyze the already recorded data. After selecting this mode, you will be asked to choose an existing vocabulary on the vocabulary selection page and an existing file on the file page. Check the selection on the setup end page and start the Signal Profiler tool.

Now the Signal Profiler will allow you to explore any of the word recordings stored in the given file. The words can be accessed either sequentially, by pressing the Next/Previous Word button, or by choosing the wanted word in the combo box.

The Signal Profiler window provides you with several signal analysis panels:

The largest one displays the speech signal corresponding to the selected word. The signal is partitioned into 3 zones - a main zone containing the detected speech signal and two margin zones with pieces of the signal preceding and following the speech. (These margin zones are always 10 frames long and allow you to check the performance of the built-in endpoint detector.) Any frame of the signal can be selected (see the brown-color zone) and displayed in detail in the frame panel. Moving the mouse pointer along the signal waveform results in a synchronized show of frames in the frame panel.

Optionally, the frame panel can show some other plots associated with the displayed frame, e.g. the Hamming windowed signal or the FFT plot. To use this option, click the right mouse button on the panel and follow the prompts in a dialogue box.

A group of six similar panels serves for displaying features selected from the given 20-feature set. Again, by clicking the right mouse button on each of the panels, you will be given a chance to select a feature and also its color.

The last panel displays a rough estimate of the signal spectrogram. By clicking the right mouse button on the panel you will see a dialogue box that allows you to choose either the power spectrum or the amplitude spectrum. Optionally, a 3D plot of the power spectrum can also be shown.

H. Working with the DTW Explorer

The DTW Explorer illustrates the problem of matching two parameterized speech signals by means of time warping algorithms. Though the DTW technique has already become obsolete for practical applications, knowledge of it is still essential for understanding the principles of modern speech recognition techniques based on HMMs.

The best way to introduce the DTW technique is to launch the DTW Matching mode on the Visper Setup initial page. As usual you will pass through the vocabulary selection page. On the file page you will be asked to select one file that will provide test words for the DTW matching experiments. Then you will arrive at the reference file page. Here you should select at least one file (the maximum is five) that will provide reference templates for the matching. Usually, the reference file and the test file should be different. Selection or deselection of the files is done using the appropriate buttons. The next stop is the work directory page. The chosen directory (either existing or new) will be used for storing working copies of the speech data. This is because the next page, the feature selection page, allows you to select an arbitrary subset of features for your experiments, which results in temporary files being created in the work directory. Then check the selection on the setup end page and start the DTW Explorer tool.

The DTW Explorer starts with selecting the first word from the test file and getting ready for a visualized match with a selected reference template. Remember that for your convenience, whenever you (or the system) selects a new test word (or a new test algorithm), the system immediately finds the best reference with the lowest distance and offers it as a default for the match. In this way you can do very fast 'recognition tests' just by pressing the Next Word button and seeing which reference is offered for the match.

The DTW Explorer's screen consists again of several panels.

In the left part, there is a pair of speech signals to be matched. The upper is the tested word (see it above), the lower is a reference. Both of them are represented by a time plot of a selected feature (e.g. energy). The plot, though just a simplified one-dimensional projection of the signal, gives the user at least an approximate view of the signal contours and duration. Both the test and reference signals occupy two similar panels. The upper one always shows the original, the lower displays the signal after the warping procedure. The length (in number of frames) of the original and warped signal is also indicated.

The time-aligned signals as well as the local, accumulated and global distances are displayed in the lowest of the panels (figure above). All the distances are of the Euclidean type. The plot is animated synchronously with the main DTW plot shown below.

The large panel in the right half of the Explorer's screen provides a more detailed view of the time-alignment procedure. This is achieved by visualizing the space where the test-reference mapping is searched. The space is defined by a matrix of local distances computed for all pairs of test and reference frames. The DTW Explorer offers three visualization modes: a) the classic plot of the warping path, b) a color map indicating the distances as in cartographic maps - see figure above left, c) a 3D (mountain-like) plot - the right figure.
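
The search through the matrix of local distances can be sketched in Python as follows. This is a minimal illustration, not VISPER's own code: it uses Euclidean local distances as stated above, allows horizontal, vertical and diagonal steps (no slope constraints), and models the range parameter as a hypothetical `band` argument limiting how far the path may stray from the diagonal. The Itakura variant is not covered, and the global distance is assumed here to be the plain accumulated sum along the path.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two feature vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def dtw(test, ref, band=None):
    """Return (accumulated distance, warping path) for two frame sequences.

    The path is a list of (test_index, ref_index) pairs. `band`, if given,
    forbids cells with |test_index - ref_index| > band (an assumed form of
    the range parameter).
    """
    n, m = len(test), len(ref)
    inf = float("inf")
    local = [[euclidean(t, r) for r in ref] for t in test]
    acc = [[inf] * m for _ in range(n)]
    back = [[None] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            if band is not None and abs(i - j) > band:
                continue
            if i == 0 and j == 0:
                acc[0][0] = local[0][0]
                continue
            candidates = []
            if i > 0:
                candidates.append((acc[i - 1][j], (i - 1, j)))
            if j > 0:
                candidates.append((acc[i][j - 1], (i, j - 1)))
            if i > 0 and j > 0:
                candidates.append((acc[i - 1][j - 1], (i - 1, j - 1)))
            best, prev = min(candidates)
            if best < inf:
                acc[i][j] = best + local[i][j]
                back[i][j] = prev
    if acc[n - 1][m - 1] == inf:
        raise ValueError("band too narrow for these sequence lengths")
    # Trace the path back from the final cell to the first one.
    path, cell = [], (n - 1, m - 1)
    while cell is not None:
        path.append(cell)
        cell = back[cell[0]][cell[1]]
    path.reverse()
    return acc[n - 1][m - 1], path


def best_reference(test, references, band=None):
    """Name of the reference template (dict: name -> frames) with the lowest distance."""
    return min(references, key=lambda name: dtw(test, references[name], band)[0])
```

Calling best_reference on each test word mimics the quick 'recognition test' described earlier, where the template with the lowest distance is offered as the default for the match.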

A wide variety of system options allows for extensive investigation and experimental work. You can set up various DTW algorithms using the method dialogue box, and select which plot type or which reference will be employed in the visualized match. Both speaker-dependent and multi-speaker (multi-reference) tests can be simulated.

The run of the matching procedure can be controlled by buttons with self-explanatory names, such as Pause, Continue and Step. To unload the DTW Explorer and return to the Visper Setup screen, press the VISPER button.

I. Training continuous HMMs

The HMM training and testing procedures are provided by the Visual Markov tool. In fact, the tool consists of two separate programs, Vmarkov1 and Vmarkov2, the first one specialized in the training, the second one in the testing. However, both of them show nearly the same face to the user so that we can speak just about one tool.

Let us start with the HMM Training mode. From the Visper Setup's initial page you pass through the vocabulary selection page and arrive at the reference file page. Here you should select the files (the maximum is 20) that will provide data for training word models. Selection or deselection of the files is done using the appropriate buttons. The next stops of the setup journey are the work directory page and the feature selection page. The work directory is particularly important in HMM recognition tests because you can have different types of models in different directories - all ready for immediate use in recognition experiments. After you pass through the setup end page, the Visual Markov's training tool is launched.

After loading, the Visual Markov is ready to start training the first word, as you can read in the white prompt line. To do so, just press the Train button. This starts an animated movie that shows the iterative training procedure. The procedure consists of a standard initialization part followed by reestimation based on the Baum-Welch algorithm. The iterative procedure in either part finishes when a) the number of iterations exceeds a limit of 15 steps, or b) there is no significant change in the total likelihood score, or c) the next iteration produces a lower likelihood value. The run of the procedure can be controlled by buttons with self-explanatory names, such as Repeat, Pause, Continue, Step Ahead, Step Back and Break. To train the next word, use the appropriate button.
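
The three stopping conditions can be pictured with the short Python sketch below. It is only an illustration under stated assumptions: `step` stands for a hypothetical function that performs one training iteration and returns the total log likelihood, `min_gain` is an assumed threshold for what counts as a 'significant change', and discarding an iteration that lowers the score is likewise an assumption.

```python
def train_until_converged(step, max_iterations=15, min_gain=1e-4):
    """Run step() repeatedly and return (accepted scores, reason for stopping).

    step() is a placeholder for one initialization or Baum-Welch iteration
    that returns the total log likelihood after that iteration.
    """
    scores = []
    while True:
        # a) the iteration limit has been reached
        if len(scores) >= max_iterations:
            return scores, "iteration limit"
        score = step()
        # c) the new iteration is worse than the previous one (assumed discarded)
        if scores and score < scores[-1]:
            return scores, "likelihood decreased"
        scores.append(score)
        # b) the improvement is too small to matter (threshold is an assumption)
        if len(scores) >= 2 and scores[-1] - scores[-2] < min_gain:
            return scores, "no significant change"
```

The same loop would apply to either part of the procedure, since both parts are said to use the same finishing rules.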

What types of HMMs are used in the Visual Markov? The models are left-to-right whole-word HMMs with transitions limited either to the same state or to the next state on the right. The output probabilities are continuous multi-mixture Gaussians with diagonal covariance matrices. How are the state parameters displayed?

Above is one of the eight state panels. Each of them serves for visualizing the parameters of the specified model state. (State S7 is displayed in the figure above.) The output probability function - defined as a function of N features in an (N+1)-dimensional space - is represented by its 3D cuts, i.e. as a function of two selected features (cep1 and cep2 in the figure above). (These cuts are correct because the covariance matrix is diagonal.) The number in the upper right corner denotes the loop transition probability.

The ninth panel is used to show some statistics concerning the training procedure, such as the model's name, its parameters, current training part, iteration step and the current (log) likelihood score. During the animated training procedure, which can take some tens of seconds, the user can investigate the evolution of all the model parameters. The 'movie' is particularly interesting in the case of a multi-speaker multi-mixture model.
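
Why a diagonal covariance matrix makes two-feature views meaningful can be seen in a small Python sketch. With a diagonal covariance each Gaussian component is a product of independent one-dimensional Gaussians, so restricting the product to two features gives the density over just those features. The function names and the `features` argument are hypothetical, and the sketch assumes that the panel plots exactly this two-feature density.

```python
import math


def gaussian_1d(x, mean, var):
    """One-dimensional Gaussian density."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2 * math.pi * var)


def mixture_density(x, weights, means, variances, features=None):
    """Density of a diagonal-covariance Gaussian mixture at the vector x.

    `features` lists the feature indices to include; by default all are used.
    Passing two indices gives the density over that pair of features only.
    """
    if features is None:
        features = range(len(x))
    total = 0.0
    for w, mu, var in zip(weights, means, variances):
        component = 1.0
        for k in features:
            component *= gaussian_1d(x[k], mu[k], var[k])
        total += w * component
    return total
```

For a single-mixture state, the full density equals the density over the two displayed features multiplied by the density over the remaining ones, which is what the diagonal structure guarantees.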

There are two types of menu options available for the training. The first, available through the model dialogue box, allows you to set up the model parameters. The other, the display dialogue box, controls the display settings, such as the number and position of the displayed states as well as the pairs of features to be displayed.

The models can be trained either individually - using the Train and Next Model buttons - or you can use the menu option that will cause automatic training of all the models in the given vocabulary.

J. Testing continuous HMMs

After you have trained a set of HMMs, you can test them. The simplest way to do it is the HMM Matching mode chosen on the Visper Setup's initial page. Select the vocabulary on the vocabulary selection page and the test file on the file page. The next step is selecting the directory on the model directory page. Here you must choose a directory with the desired set of models. If you want to see the parameters of the models in the specified directory, use the Models button. Complete the setup end page, which will launch the Visual Markov testing tool.

Its screen is very similar to that of the training tool. It has just several different control buttons whose meaning, however, is familiar to those who have already run the DTW Explorer. Now, the main task accomplished by the Visual Markov is to demonstrate the Viterbi match between a test word and a selected model. As we know, the Viterbi algorithm searches for the most likely alignment between the word frames and model states. The likelihood value is then used for finding the best-fitting model that becomes the winning candidate of the recognition process.

How to illustrate the Viterbi search? In the Visual Markov, we aim at demonstrating the results of the search, not the search itself. Thus the matching procedure can run fast in the background outputting just the data that are necessary for displaying the alignment and for ordering the candidates. To present the match, the Visual Markov aligns the word frames with the corresponding model states and then shows the local likelihoods achieved in each frame. The likelihood value is found as the point on the state pdf whose position corresponds to the frame vector. The point is highlighted by a green ball (see figure below).

Travelling through the model, frame after frame, state after state, the green ball collects likelihood values. The closer it is to the top of the Gaussians, the higher the contribution to the total likelihood score. The current numbers and coordinates of the frame vectors as well as the current values of the total score are displayed in the ninth panel (see figure below).
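
The alignment that the green ball follows can be computed with a Viterbi search restricted to the left-to-right structure of the models. The Python sketch below is an illustration only: it assumes the path starts in the first state and ends in the last one, works with log values, and adds log transition probabilities to the score, which is the textbook formulation rather than a documented detail of the Visual Markov. All names are hypothetical.

```python
NEG_INF = float("-inf")


def viterbi_left_to_right(frame_loglik, log_loop, log_next):
    """Best alignment of frames to states for a left-to-right HMM.

    frame_loglik[t][s]: log output likelihood of frame t in state s
    log_loop[s]:        log probability of staying in state s
    log_next[s]:        log probability of moving from state s to state s + 1
    Returns (total log score, list with the state index of every frame).
    """
    n_frames, n_states = len(frame_loglik), len(frame_loglik[0])
    score = [[NEG_INF] * n_states for _ in range(n_frames)]
    moved = [[False] * n_states for _ in range(n_frames)]
    score[0][0] = frame_loglik[0][0]
    for t in range(1, n_frames):
        for s in range(n_states):
            stay = score[t - 1][s] + log_loop[s]
            move = score[t - 1][s - 1] + log_next[s - 1] if s > 0 else NEG_INF
            if move > stay:
                best, moved[t][s] = move, True
            else:
                best = stay
            score[t][s] = best + frame_loglik[t][s]
    if score[n_frames - 1][n_states - 1] == NEG_INF:
        raise ValueError("the word has fewer frames than the model has states")
    # Walk backwards from the last state to recover the state of each frame.
    states = [n_states - 1]
    for t in range(n_frames - 1, 0, -1):
        states.append(states[-1] - 1 if moved[t][states[-1]] else states[-1])
    states.reverse()
    return score[n_frames - 1][n_states - 1], states
```

Running this for every model in the vocabulary and keeping the one with the highest total score corresponds to choosing the winning candidate.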

The matching procedure is animated so that you can follow the green ball on its way. By selecting different models you can understand why some models achieve higher scores than the others. There is a rule similar to the DTW Explorer: Whenever a test word is selected, the system immediately finds the best model and offers it as a default for the match. The other models are available through a dialogue box that appears after you press the Select Model button. There is also a display option dialogue similar to that of the training tool.

K. Running combined modes

All the previously described modes employ just one tool at a time. This is good mainly for introductory lessons. However, for more advanced investigations it may be useful to combine several tools. For example, it is interesting to see the recorded signal first and then to investigate how it is classified by means of the DTW or HMM technique. Or, if some utterance is classified wrongly, you may want to hear or see the signal to be sure that it was recorded well.

This is possible within the combined modes 6 and 7. These modes allow you to start the Signal Profiler with the selected file and then jump temporarily either to the DTW Explorer or to the Visual Markov. Switching between the tools is done by pressing the appropriate buttons.

L. Running real-time recognition modes

Modes 8 and 9 are similar to the previous ones, but they allow you to do experiments with on-line speech data. After pressing the Speak button in the Signal Profiler you can say a word that is immediately recognized. You see the recognition result in the combo box. Nevertheless, you can still go to the recognition tools (the DTW Explorer or the Visual Markov) to observe the whole classification procedure. See a sample screen of the Speak & Recognize (HMM) mode.

M. Database maintenance

To help the user in maintaining the speech databases we have included a simple utility that makes disk cleaning safe and easy. The utility is a part of the Visper Setup tool. It has a simple design that allows you to delete single files, work directories or even the directories containing data belonging to a specific vocabulary.
