What Exactly is an Acoustic Model?
The first piece is called the acoustic model, and basically what that is is a model of all of the basic sounds of the language. The third piece is the language model: a model of how we put words together into phrases and sentences in the language. All of these are statistical models, so although the language model is capturing, in a sense, the grammatical constraints of the language, it's doing it in a statistical way, based on being fed lots of data. That's what data-driven approaches are: approaches based on building large statistical models of the language by feeding them lots of data. So, for example, the language model might learn that if the recognizer thinks it just recognized "the dog" and is now trying to figure out the next word, "ran" is more likely than "pan" or "can," just because of what we know about how English is actually used. Pronunciation brings its own complications. There are words like "route" that have alternate pronunciations -- you can say "root" or "rout," and both are correct -- and English has a lot of homophones, words that sound the same but mean different things.
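The "the dog ran" example above can be sketched with a simple bigram model. This is a minimal illustration, not Google's system: the tiny corpus and the helper function are made up, and real language models are trained on vastly more data with smoothing for unseen word pairs.

```python
from collections import Counter, defaultdict

# Toy corpus standing in for a large training set (illustrative only).
corpus = [
    "the dog ran home",
    "the dog ran away",
    "the dog ran fast",
    "the old pan sat on the stove",
    "you can do it",
]

# Count bigrams: how often word b follows word a in the training data.
bigrams = defaultdict(Counter)
for sentence in corpus:
    words = sentence.split()
    for a, b in zip(words, words[1:]):
        bigrams[a][b] += 1

def next_word_prob(prev, cand):
    """Estimate P(cand | prev) from raw bigram counts (no smoothing)."""
    total = sum(bigrams[prev].values())
    return bigrams[prev][cand] / total if total else 0.0

# After "dog", this data makes "ran" far more likely than "pan" or "can".
print(next_word_prob("dog", "ran"))  # 1.0 in this toy corpus
print(next_word_prob("dog", "pan"))  # 0.0
```

The recognizer combines scores like these with the acoustic evidence, so an acoustically ambiguous word can be resolved by what usually follows "the dog."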
The second piece is the lexical model, and what that is is a definition, for all of the words in the language, of how they get pronounced. So the lexical models are built by stringing together acoustic models, the language model is built by stringing together word models, and it all gets compiled into one enormous representation of spoken English, let's say. That becomes the model that gets learned from data, and it's what the recognizer searches when some acoustics come in and it needs to find its best guess at what just got said. In the Android developer's kit, Google makes two such models available: the Freeform model and the Web search model.

To understand English, there are many hurdles one must overcome, and these are the challenges Cohen and his team face at Google. Think of "to," "two" and "too": words that sound identical but mean different things. People speaking with an accent or in a regional dialect may pronounce words in a way that's dramatically different from the standard pronunciation. Twenty or 30 years ago, there were sort of two camps, and Google's approach grew out of the data-driven one. What I mean by that is that rather than having people go in and try to program all these rules, all of these descriptions of how language works, we tried to build models that we could feed lots and lots of data, so that the models learn about the structure of speech from the data.
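The lexical model described above can be pictured as a pronunciation dictionary. This is a minimal sketch assuming ARPAbet-style phoneme symbols; the word list and helper functions are invented for illustration, while real systems use large dictionaries (CMUdict is a well-known public example).

```python
# A tiny pronunciation lexicon mapping words to phoneme strings
# (ARPAbet-style symbols; entries here are illustrative only).
lexicon = {
    # "route" has two accepted pronunciations: "root" and "rout".
    "route": ["R UW T", "R AW T"],
    # Homophones share a pronunciation but differ in spelling/meaning.
    "to":  ["T UW"],
    "two": ["T UW"],
    "too": ["T UW"],
    "dog": ["D AO G"],
    "ran": ["R AE N"],
}

def pronunciations(word):
    """Return all listed pronunciations for a word (lowercased lookup)."""
    return lexicon.get(word.lower(), [])

def homophones(word):
    """Find other words sharing any pronunciation with `word`."""
    mine = set(pronunciations(word))
    return sorted(w for w in lexicon
                  if w != word.lower() and mine & set(lexicon[w]))

print(pronunciations("route"))  # ['R UW T', 'R AW T']
print(homophones("two"))        # ['to', 'too']
```

In the compiled recognizer, each phoneme string is replaced by the corresponding chain of acoustic models, which is what "stringing together acoustic models" amounts to.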
One of the fundamental things, given the kind of data-driven approach we take, is that we try to have very large, broad training sets. We all work at the boundary of trying to understand language and its structure, and of trying to develop machine-learning algorithms: how do we come up with a better model that better captures the structure of speech, such that when we feed it lots and lots of data, the model both changes its structure and alters its internal parameters to become a better, richer model of the language? How do we alter the model so that it does a better job with the longer-distance constraints that matter, and captures them? We have enough instances of Brooklyn accents -- and not just thanks to me -- we have enough people from Brooklyn who have spoken to our systems that we do a good job when people with Brooklyn accents talk to them. We have large amounts of data coming in from all kinds of people, with all kinds of accents, saying all kinds of things, and so on, and the most important thing is to have good coverage in your training set of whatever is coming in.
For example, we have something called delta features, so we not only look at what the acoustics are at this moment, but at the trajectory of those acoustics. As for the term "language model": it has been used loosely, and it has meant a couple of different things over time. In the most general sense, you could think of it as a description of what we might expect in terms of what word strings can happen. In an early, constrained application, you might expect most people to say "A," "B" or "C," or they might say "I want A" or "B, please" -- things that, because of the application, were fairly predictable. More recently -- in the last 25 years or so -- those two camps came together, and we learned certain things from the linguists about the structure of speech, like the fact I mentioned earlier: the production of any particular phoneme is very influenced by the phonemes that surround it.
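The delta-feature idea can be sketched in a few lines. This is a simplified illustration, not Google's implementation: the frame values are made up, and the delta is computed as a plain central difference, whereas production systems typically use a regression over a window of frames.

```python
import numpy as np

def add_deltas(frames):
    """Append a delta (local time-derivative) to each frame's features.

    frames: (T, D) array of per-frame acoustic features.
    Returns (T, 2*D): [static, delta] per frame, using a central
    difference with edge padding at the boundaries.
    """
    padded = np.pad(frames, ((1, 1), (0, 0)), mode="edge")
    deltas = (padded[2:] - padded[:-2]) / 2.0
    return np.hstack([frames, deltas])

# One made-up feature dimension over four frames: a rising trajectory.
frames = np.array([[1.0], [2.0], [4.0], [7.0]])
feats = add_deltas(frames)
print(feats.shape)  # (4, 2): each frame now carries value + trajectory
```

The point is that two frames with the same instantaneous value but different trajectories (rising vs. falling) become distinguishable to the acoustic model.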