Market Overview - Speech Recognition: Speakers' Corner

Fancy learning about speech recognition? Then hunker down and listen to Chris Long as he talks qwerty to his PC

Like so many boffin-oriented technologies, speech recognition has suffered at the hands of science fiction entertainment, with the grey area between what is possible and what is wholly impossible hijacked by people's imaginations. Two prime examples are HAL, the mentally unstable talking computer in 2001, and the computer on board the USS Enterprise, which seems to be able to discern when people are talking to it even if they are in conversation with someone else.

This wouldn't normally be a problem, but trying to sell speech products to people who have only seen the Star Trek version of the technology can prove a difficult task.

All the same, a sure sign that the market is maturing is that people correct you if you call it voice recognition. And it is a mark of both speech and voice technologies that we can easily define them.

For anyone who wants clarification, voice recognition is just that: recognising someone's voice, usually as part of a security system. Speech recognition, on the other hand, understands the words and either acts on them or turns them into text.

One story about speech and voice recognition, which is probably apocryphal but serves to highlight the problems with both technologies, concerns a computer expert. He rigged up a sophisticated control system at his house that among other things would recognise his voice. On returning home, the expert would speak to the computer, and it, upon recognition, would let him in.

One evening, after a night imbibing alcohol with fellow boffins, he arrived home and shouted: "Hallo computer, open the door please." Which, of course, when channelled through his addled speech system came out as: "Hesho coompooor opn dr pls."

The computer, baffled by this gibbering idiot at the front door, did no such thing and our scientist spent the night on the doorstep.

This example indicates the difficulty in the reasonably simple process of understanding what someone says: we do it without problems every day and can even work out what someone is saying from only a few words. Computers, having nowhere near our own processing power, have a harder time of it. All the same, things are looking up for the speech recognition market after some years in the research and development wilderness.

The market is healthier and fitter than ever before, and as a result speech recognition is appearing on more and more desktops. And it may be a surprise to some that this change is more or less down to IBM and its RS/6000 dictation product, which it pitched at about £800 to £1,000, forcing players like Dragon and Kurzweil to cut prices to remain in the game.

Philips came to the market from the other end, moving from dictation machines to the next logical step, computerised dictation systems. All attention was focused on the development of better and faster recognition engines.

In the meantime, the escalation in processing power has allowed manufacturers to offload specific tasks in the recognition process to the system hardware. This in turn has brought down the price of the products. Some systems are now using the multimedia bits of MMX processors. IBM has a system that needs either an MMX processor or a more powerful non-MMX processor, giving some indication that these systems are somewhat dependent on processor power.

Whereas four years ago a fully configured speech system on a PC cost about £5,000, speech recognition software with a 30,000-word vocabulary is now available for about £100: all you need to do is add a PC. Last year IBM sold more than 250,000 copies of its speech recognition product, going a long way to show how popular such applications are becoming.

There are several different approaches to speech recognition. Each one defines the application, the use to which it is put, and the amount of technology needed. These distinctions are: speaker dependent or independent recognition, continuous or discrete speaking, and large or limited vocabularies.

Taking each of these distinctions in turn, speaker dependency can be defined as the degree to which the computer recognises the words spoken without requiring some form of word training. In theory, anyone should be able to use a speaker independent system immediately.

In essence, speaker independence acts like an automatic phone system trying to understand simple words straight away. You might be given a list of responses like "yes" and "no" and the system will understand only these words.

Such systems have a limited vocabulary but need no training and will have a go at understanding anyone. They are effectively trained to listen for a set of vocal sounds within certain parameters, which define all sorts of things from voice pitch to accent.
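To make the idea concrete, here is a toy sketch in Python (entirely hypothetical, not lifted from any real product) of how a limited-vocabulary, speaker independent system behaves: it compares whatever rough transcription the audio front end produces against its short list of allowed words and simply rejects anything it is not confident about.

from difflib import SequenceMatcher

# Toy speaker independent matcher: a tiny fixed vocabulary, no training,
# and a confidence threshold so it will have a go at anyone's voice but
# refuses to guess wildly. Purely illustrative.
VOCABULARY = ["yes", "no", "help", "operator"]
CONFIDENCE_THRESHOLD = 0.6

def recognise(rough_transcription):
    """Return the best vocabulary match, or None if nothing is convincing."""
    scored = [(SequenceMatcher(None, rough_transcription, word).ratio(), word)
              for word in VOCABULARY]
    score, word = max(scored)
    return word if score >= CONFIDENCE_THRESHOLD else None

print(recognise("yess"))       # 'yes'
print(recognise("nope"))       # 'no' - close enough to clear the threshold
print(recognise("aardvark"))   # None - outside the vocabulary, so rejected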

At the other end of the spectrum is the speaker dependent system. Here the computer hasn't the faintest idea what you are talking about and you train it by associating your word sounds with its vocabulary. Thus you say "aardvark" and you have to find aardvark in its vocabulary and connect the two. It effectively has a recording of your voice and will play back the connected word, so you could quite conceivably record one word and associate it with another, a system more useful for politicians.

Although speech engines are usually touted as speaker independent or speaker dependent, there is a third, more popular approach: in between. In this middle or adaptive approach, the system has some idea of what you are saying, with some sounds hardwired into it for easy match-up, but it will take a bit of training (typically an hour or two) before the system is fully operational. For the training, you read a list of words at it and the system matches them up to its vocabulary, so if you were to say "spam" it would offer "swim, swam, spam, mam" and you select the right one.
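As a rough illustration of that training step, the hypothetical sketch below has the engine offer its closest-sounding vocabulary entries for a spoken word and remember whichever one the speaker confirms.

from difflib import get_close_matches

# Hypothetical adaptive training step: the engine proposes its nearest
# vocabulary entries and the speaker confirms the right one, which is then
# stored as that speaker's pronunciation.
VOCABULARY = ["swim", "swam", "spam", "mam", "span", "sham"]
speaker_profile = {}   # heard sound -> confirmed word, built up during training

def offer_candidates(heard):
    """The engine's best guesses for a word read out during training."""
    return get_close_matches(heard, VOCABULARY, n=4, cutoff=0.4)

def confirm(heard, chosen):
    """The speaker picks the correct word; remember it for this speaker."""
    speaker_profile[heard] = chosen

print(offer_candidates("spam"))   # something like ['spam', 'swam', 'span', 'sham']
confirm("spam", "spam")
print(speaker_profile)            # {'spam': 'spam'}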

The spectrum of recognition systems is tied in with the required applications. First off is the simplest system: command and control. This is where you tell the system to do something. Thus command and control speech recognition is closer in concept to voice-operated macros or hot keys, where a command is connected to a set of actions. For example, you can say "paragraph" and it puts in an indent and two carriage returns. Typically these systems are less sophisticated.
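A minimal sketch of the command and control idea, assuming a made-up editor and a recogniser that hands over one word at a time, might look like this: each recognised word simply triggers a canned macro.

# Hypothetical command and control setup: each recognised word is bound to
# a macro, much like a voice-operated hot key.
document = []

def do_paragraph():
    document.extend(["\n", "\n", "    "])   # two carriage returns and an indent

def do_save():
    print("(pretend the document was saved)")

MACROS = {
    "paragraph": do_paragraph,
    "save": do_save,
}

def on_recognised(word):
    """Called by the (imaginary) recogniser whenever it hears a command word."""
    action = MACROS.get(word)
    if action is not None:
        action()
    else:
        print("No macro bound to", repr(word))

on_recognised("paragraph")
on_recognised("save")
print(repr("".join(document)))   # '\n\n    '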

Running parallel to command and control systems are dictation or speech systems, where you tell the computer what you want to write and the system converts it to words on the screen. This puts more stress on the technology but, according to some, it is faster to talk into a computer than it is to hunt and peck with your index fingers on the keyboard, although many remain unconvinced.

Now that a new generation of powerful dictation systems has started to come through at much reduced prices, there is a serious alternative to typing.

There are two approaches to getting the words on the screen: batch and immediate. Batch is where you talk to the system and it records what you say; it then goes away, processes the file and comes up with the words. Immediate is just as it sounds: it throws up words as it figures them out, in effect as you talk.
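The difference is easy to sketch. In the hypothetical example below, decode() stands in for the real recognition engine; the batch route hoards the whole recording before producing any text, while the immediate route hands back each word as soon as it is worked out.

# decode() stands in for the recognition engine; assume one chunk of audio
# decodes to one word. The chunks themselves are just placeholders.
def decode(chunk):
    return chunk.upper()

audio_chunks = ["the", "cat", "sat", "on", "the", "mat"]

def batch(chunks):
    """Record everything first, then go away, process the lot and return it."""
    recording = list(chunks)                  # nothing appears while you speak
    return " ".join(decode(c) for c in recording)

def immediate(chunks):
    """Throw up each word as soon as it is worked out, in effect as you talk."""
    for c in chunks:
        yield decode(c)

print(batch(audio_chunks))
for word in immediate(audio_chunks):
    print(word, end=" ")
print()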

Some people claim that the batch system is better because it doesn't distract the speaker by putting words on the screen when they are talking, though these people are usually manufacturers of batch systems.

Speech systems fall into two categories: discrete, and continuous or natural speech. At this point it is worth knowing some of the problems a computer has when trying to decipher what we are saying. We tend to pronounce words differently in phrases than individually; for example, "would you" becomes "wouldja". This is called co-articulation and makes it doubly difficult both to identify the phonemes (speech sounds) in what we say and to decide which phonemes belong with which words (see box).
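A toy example shows why this is awkward. With the made-up pronunciation table below, the same unbroken run of sounds can be carved up into words in more than one way, and the engine has to pick between them.

# Made-up table of how runs of sound might map back onto words. Real
# engines use phoneme models; plain strings are used here purely to show
# the ambiguity.
SOUND_TO_WORDS = {
    "wouldja": ["would you"],   # the slurred phrase heard as one lump
    "would": ["would"],
    "ja": ["ya"],               # or the tail could be a little word of its own
}

def segmentations(sounds, so_far=()):
    """Every way of carving an unbroken sound string into known chunks."""
    if not sounds:
        return [so_far]
    results = []
    for i in range(1, len(sounds) + 1):
        chunk = sounds[:i]
        for words in SOUND_TO_WORDS.get(chunk, []):
            results += segmentations(sounds[i:], so_far + (words,))
    return results

print(segmentations("wouldja"))   # [('would', 'ya'), ('would you',)]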

Thus to solve problems like this you have to help the computer along a bit.

Discrete speech recognition, for example, requires you to talk... very... slowly, usually with a tenth of a second break between each word. The pauses provide clues to the recognition engine about where each word starts and finishes, and thus reduce the amount of processing power necessary for accurate results.
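The following sketch (made-up energy figures, not real audio) shows why those pauses are such a help: a simple loudness threshold is enough to find where one word stops and the next starts.

# Made-up energy levels for a short stretch of speech; anything quieter
# than the threshold counts as one of those deliberate pauses.
SILENCE_THRESHOLD = 0.1
energy = [0.0, 0.6, 0.7, 0.5, 0.0, 0.0, 0.8, 0.9, 0.0, 0.0, 0.4, 0.3, 0.0]

def split_on_pauses(samples, threshold=SILENCE_THRESHOLD):
    """Return one run of samples per spoken word, split at the silences."""
    words, current = [], []
    for level in samples:
        if level > threshold:
            current.append(level)      # still inside a word
        elif current:
            words.append(current)      # hit a pause, so the word just ended
            current = []
    if current:
        words.append(current)
    return words

print(len(split_on_pauses(energy)), "words found")   # 3 words found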

With the pauses between words, discrete speech engines limit speech to an effective rate of about 50 to 75 words per minute, a lot less than the average 200 or so words per minute in normal speech.

Natural speech programs are relatively new on the PC and there are now systems that can get up to 140 words a minute. So, you banter away to yourself and the software frantically works out what you are saying and puts it up on the screen. A couple of years ago this would have been very difficult if not impossible.

The final part of speech recognition is the words that are contained in the program. These fall into large or limited dictionary systems.

When the system looks at the sentence it has just received, it has to parse it (work out the grammatical structure of what it is looking at) so it can figure out what is being said. Thus it looks up the words it thinks you have spoken in a dictionary and tries to make some sense of them. This isn't to say that it tries to understand what you are saying; it merely tries to reproduce the words you have spoken on the screen.

Large vocabulary systems use an appallingly complex system called N-gram modelling, in which N represents the number of words the system looks at and tries to decode at one time; on current systems it is generally three words.

The model, through extrapolation, attempts to guess the next word that will be spoken, while also checking the three words to make sure they are correct. For example, if "cat sat off" appeared, it would check its grammar system and work out that it was much more likely to be "cat sat on". And if the next word was "the" and after that "mat", it would pat itself on the back for a job well done.
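A very rough sketch of the idea, using simple trigram counts rather than the full probabilistic models commercial engines rely on, might look like this: the decoding whose three-word runs have been seen most often in training text wins.

from collections import Counter

# A tiny invented 'training text'; real systems count trigrams over
# millions of words.
corpus = ("the cat sat on the mat . the dog sat on the rug . "
          "the cat sat on the sofa").split()

trigram_counts = Counter(zip(corpus, corpus[1:], corpus[2:]))

def score(words):
    """Add up how often each three-word run in a candidate has been seen."""
    return sum(trigram_counts[t] for t in zip(words, words[1:], words[2:]))

print(score("the cat sat off".split()))   # lower: 'cat sat off' was never seen
print(score("the cat sat on".split()))    # higher: 'cat sat on' is familiar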

To take the sample number in N-gram to four words would require an order of magnitude increase in processing and memory power. Manufacturers don't expect to go beyond three in the near future.

Small vocabulary systems understand only a few hundred words, and are almost certainly application specific, for example wordprocessing or legal letter writing. With a limited set of words to choose from, and a smaller set of relationships between these words (like, say, "file, open"), these systems are often just right for command and control systems.
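A sketch of such a grammar, entirely invented for illustration, might be no more than a table of which words are allowed to follow which, so the engine only ever has to choose between a handful of possibilities.

# Invented command grammar: each first word lists the handful of words
# allowed to follow it.
GRAMMAR = {
    "file":   {"open", "close", "save", "print"},
    "edit":   {"cut", "copy", "paste"},
    "format": {"bold", "italic"},
}

def is_valid_command(first, second):
    """Accept a phrase only if the grammar allows that pair of words."""
    return second in GRAMMAR.get(first, set())

print(is_valid_command("file", "open"))    # True: 'file, open' is in the grammar
print(is_valid_command("file", "paste"))   # False: never a legal pair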

The vocabulary is important because it defines where a product is used. For example, the legal and medical worlds have very specific vocabularies, and so speech systems aimed at these areas offer specific dictionaries. Thus users can produce language specific documents. The demand for these vertical solutions means that they are currently the main focus of commercial speech recognition systems.

A growing number of products allow you to swap between a specialised dictionary and a normal speech dictionary, but these tend to be more advanced programs. This setup, combined with the natural speech system, points the way to the future.

Obviously we won't be saying goodbye to the keyboard for a good while, if ever, but we will certainly be talking to our PCs a lot more.