How voice2json Works
At a high level, voice2json transforms audio data (voice commands) into JSON events.
The voice commands are specified beforehand in a compact, text-based format:
[LightState]
states = (on | off)
turn (<states>){state} [the] light
This format supports:
- [optional words]
- (alternative | choices)
- rules: name = body
- rule references: <rule name>
- tags: (value){name}
- substitutions: input:output
- slot lists: $movies
- number sequences: 1..100
- converters: TEXT!float
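As a hedged illustration, a template file combining several of these constructs might look like the following. The SetTimer and PlayMovie intents, the movies slot list, and the specific substitutions are hypothetical examples rather than part of the template above:

[LightState]
states = (on:ON | off:OFF)
turn (<states>){state} [the] light

[SetTimer]
set [a] timer for (2..60){minutes!int} minutes

[PlayMovie]
play ($movies){movie_name}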
During training, voice2json generates artifacts that can recognize and decode the specified voice commands. If these commands change, voice2json must be re-trained.
Core Components
voice2json’s core functionality can be broken down into speech and intent recognition components.
When voice commands are recognized by the speech component, the transcription is given to the intent recognizer to process. The final result is a structured JSON event with:
- An intent name
- Recognized slots/entities
- Optional metadata about the speech recognition process
  - Input text, time, tokens, etc.
For example:
{
"text": "turn on the light",
"intent": {
"name": "LightState"
},
"slots": {
"state": "on"
}
}
Speech to Text
The offline transcription of voice commands in voice2json is handled by one of three open source systems:
- Pocketsphinx
  - CMU (2000)
- Kaldi
  - Johns Hopkins (2009)
- DeepSpeech
  - Mozilla (v0.6, 2019)
Pocketsphinx and Kaldi both require:
- An acoustic model
  - Maps audio features to phonemes
- A pronunciation dictionary
  - Maps phonemes to words
- A language model
  - Describes how often words follow other words
DeepSpeech combines the acoustic model and pronunciation dictionary into a single neural network. It still uses a language model, however.
Acoustic Model
An acoustic model maps acoustic/speech features to likely phonemes in a given language.
Typically, Mel-frequency cepstrum coefficients (abbreviated MFCCs) are used as acoustic features. These mathematically highlight useful aspects of human speech.
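As a rough sketch of this feature-extraction step (assuming the librosa Python library is available; the file name is hypothetical), computing MFCCs might look like:

import librosa

# Load a mono recording of a voice command at 16 kHz (path is hypothetical).
audio, sample_rate = librosa.load("turn_on_the_light.wav", sr=16000)

# Compute 13 Mel-frequency cepstrum coefficients per analysis frame.
mfccs = librosa.feature.mfcc(y=audio, sr=sample_rate, n_mfcc=13)

print(mfccs.shape)  # (13, number_of_frames)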
Phonemes are language (and even locale) specific. They are the indivisible units of word pronunciation. Determining a language’s phonemes requires a linguistic analysis, and there may be some debate over the final set. Individual human languages typically have no more than a few dozen phonemes. The set of all possible phonemes can be represented using the International Phonetic Alphabet.
An acoustic model is a statistical mapping between audio features (MFCCs) and one or more phonemes. This mapping is learned from a large collection of speech examples along with their corresponding transcriptions. A pre-built pronunciation dictionary is needed to map transcriptions back to phonemes before a model can be trained. Collecting, transcribing, and validating these large speech data sets is a limiting factor in open source speech recognition.
Pronunciation Dictionary
A dictionary that maps sequences of phonemes to words is needed both to train an acoustic model and to do speech recognition. More than one mapping (pronunciation) is possible for each word.
For practical purposes, let’s consider a word to be just the “stuff between whitespace” in text. Regardless of how exactly you define what a “word” is, what matters most is consistency: someone needs to decide if compound words (like “pre-built”), contractions, etc. are single (“prebuilt”) or multiple words (“pre” and “built”).
Below is a table of example phonemes for U.S. English from the CMU Pronouncing Dictionary.
Phoneme | Word | Pronunciation |
---|---|---|
AA | odd | AA D |
AE | at | AE T |
AH | hut | HH AH T |
AO | ought | AO T |
AW | cow | K AW |
AY | hide | HH AY D |
B | be | B IY |
CH | cheese | CH IY Z |
D | dee | D IY |
DH | thee | DH IY |
EH | Ed | EH D |
ER | hurt | HH ER T |
EY | ate | EY T |
F | fee | F IY |
G | green | G R IY N |
HH | he | HH IY |
IH | it | IH T |
IY | eat | IY T |
JH | gee | JH IY |
K | key | K IY |
L | lee | L IY |
M | me | M IY |
N | knee | N IY |
NG | ping | P IH NG |
OW | oat | OW T |
OY | toy | T OY |
P | pee | P IY |
R | read | R IY D |
S | sea | S IY |
SH | she | SH IY |
T | tea | T IY |
TH | theta | TH EY T AH |
UH | hood | HH UH D |
UW | two | T UW |
V | vee | V IY |
W | we | W IY |
Y | yield | Y IY L D |
Z | zee | Z IY |
ZH | seizure | S IY ZH ER |
More recent versions of this dictionary include stress, indicating which parts of the word are emphasized during pronunciation.
During training, voice2json copies pronunciations for every word in your voice command templates from a large pre-built pronunciation dictionary. Words that can’t be found in this dictionary have their pronunciations guessed using a pre-trained grapheme to phoneme model.
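Conceptually, this training step looks something like the sketch below (not voice2json’s actual code; guess_pronunciations is a hypothetical stand-in for the grapheme to phoneme model described next):

def build_custom_dictionary(template_words, big_dictionary, guess_pronunciations):
    # Copy known pronunciations; fall back to a grapheme-to-phoneme guess.
    custom = {}
    for word in sorted(template_words):
        if word in big_dictionary:
            custom[word] = big_dictionary[word]        # may hold several pronunciations
        else:
            custom[word] = guess_pronunciations(word)  # G2P guess for unknown words
    return custom

# A tiny slice of a large pre-built dictionary (CMU-style phonemes).
big_dictionary = {
    "turn":  [["T", "ER", "N"]],
    "on":    [["AA", "N"], ["AO", "N"]],
    "off":   [["AO", "F"]],
    "the":   [["DH", "AH"], ["DH", "IY"]],
    "light": [["L", "AY", "T"]],
}

words_in_templates = {"turn", "on", "off", "the", "light"}
custom = build_custom_dictionary(words_in_templates, big_dictionary, lambda w: [["?"]])
print(custom["light"])  # [['L', 'AY', 'T']]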
Grapheme to Phoneme
A grapheme to phoneme (G2P) model can be used to guess the phonetic pronunciation of words. This is a statistical model that maps sequences of characters (graphemes) to sequences of phonemes, and is typically trained from a large pre-built pronunciation dictionary. voice2json uses a tool called Phonetisaurus for this purpose.
Language Model
A language model describes how often some words follow others. Models commonly consider sequences of one to three words in a row.
Language models are created from a large text corpus, such as books, news sites, Wikipedia, etc. Not all word combinations will be present in the training material, so their probabilities have to be estimated with a heuristic.
Below is a made-up example of word singleton/pair/triplet probabilities for a corpus that only contains the words “sod”, “sawed”, “that”, “that’s”, and “odd”.
0.2 sod
0.2 sawed
0.2 that
0.2 that's
0.2 odd
0.25 that's odd
0.25 that sawed
0.25 that sod
0.25 odd that
0.5 that's odd that
0.5 that sod that
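The numbers above are invented, but the counting behind such a model can be sketched in a few lines of Python (unigrams and bigrams only; this is not how voice2json actually builds its models, and the toy corpus will not reproduce the made-up values exactly):

from collections import Counter

corpus = [
    ["that's", "odd", "that"],
    ["that", "sod", "that"],
    ["that", "sawed"],
]

unigrams = Counter(word for sentence in corpus for word in sentence)
bigrams = Counter(
    pair for sentence in corpus for pair in zip(sentence, sentence[1:])
)

total_words = sum(unigrams.values())
for word, count in unigrams.items():
    print(f"P({word}) = {count / total_words:.2f}")

for (first, second), count in bigrams.items():
    # P(second | first) = count(first second) / count(first)
    print(f"P({second} | {first}) = {count / unigrams[first]:.2f}")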
During speech recognition, incoming phonemes may match more than one word from the pronunciation dictionary. The language model helps narrow down the possibilities by telling the speech recognizer that some word combinations are very unlikely and can be ignored.
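For example, “that’s odd” and “that sod” sound very similar; a recognizer combines acoustic and language model scores to decide between such candidates. The numbers below are made up purely for illustration:

import math

candidates = {
    "that's odd": {"acoustic": 0.40, "language": 0.25},
    "that sod":   {"acoustic": 0.42, "language": 0.05},
}

def combined_score(scores, lm_weight=1.0):
    # Real recognizers work with weighted sums of log probabilities.
    return math.log(scores["acoustic"]) + lm_weight * math.log(scores["language"])

best = max(candidates, key=lambda text: combined_score(candidates[text]))
print(best)  # "that's odd" wins despite a slightly lower acoustic score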
Sentence Fragments
The language model does not contain probabilities for entire sentences, only sentence fragments. Getting a complete sentence from the speech recognizer requires a few tricks:
- Adding virtual start/stop sentence “words” (<s>, </s>)
  - <s> what time is the start of a sentence “what time…”
  - is it </s> is the end of a sentence “…is it?”
- Use sliding time windows (see the sketch after this list)
  - Fragments are stitched together using overlapping windows
  - “what time”, “time is”, “is it” for the sentence “what time is it”
- Breaking audio at long pauses or always assuming a single sentence
  - You can always assume the first “word” is <s> (start of sentence)
  - Where to put </s> (end of sentence), though?
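The sliding-window trick can be sketched as follows: overlapping fragments are stitched back together by matching the overlap between neighbors (a simplified illustration, not voice2json’s code):

def stitch(fragments):
    words = list(fragments[0])
    for fragment in fragments[1:]:
        # Find the largest suffix of the sentence so far that is also
        # a prefix of the next fragment, then append only the new words.
        overlap = 0
        for size in range(min(len(words), len(fragment)), 0, -1):
            if words[-size:] == list(fragment[:size]):
                overlap = size
                break
        words.extend(fragment[overlap:])
    return words

fragments = [("what", "time"), ("time", "is"), ("is", "it")]
print(" ".join(stitch(fragments)))  # what time is it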
When using these tricks, the recognized “sentences” may still be nonsensical and have little to do with previous sentences. For example:
that sod that that sod that sawed...
Modern transformer neural networks can handle long-term dependencies within and between sentences much better, but:
- They require a huge amount of training data
- They can be slow/resource intensive to (re-)train and execute without specialized hardware
For voice2json’s intended use (pre-specified, short voice commands), the tricks above are usually good enough. While cloud services can be used with voice2json, there are trade-offs in privacy and resiliency (loss of Internet or cloud account).
Language Model Training
During training, voice2json generates a custom language model based on your voice command templates (usually in ARPA format). Thanks to the opengrm library, voice2json can take the intermediary sentence graph produced during the initial stages of training and directly generate a language model! This enables voice2json to train in seconds, even for millions of possible voice commands.
Language Model Mixing
voice2json’s custom language model can optionally be mixed with a much larger, pre-built language model. Depending on how much weight is given to either model, this will increase the probability of your voice commands against a background of general sentences in the profile’s language.
When mixed appropriately, voice2json is capable of (nearly) open-ended speech recognition with a preference for the user’s voice commands. Unfortunately, this usually results in lower speech recognition performance and many more intent recognition failures, since the intent recognizer is only trained on the user’s voice commands.
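The mixing itself is conceptually a weighted interpolation of the two models. Below is an illustrative sketch only; voice2json delegates the real work to its language modeling tools, and the weight value and model functions here are arbitrary:

def mixed_probability(word, history, p_custom, p_general, weight=0.95):
    # A weight close to 1.0 strongly prefers the custom voice-command model;
    # lower values let more general sentences through.
    return weight * p_custom(word, history) + (1.0 - weight) * p_general(word, history)

# Hypothetical model functions, just for illustration.
def p_custom(word, history):
    return 0.5 if word in {"turn", "on", "off", "the", "light"} else 0.0

def p_general(word, history):
    return 0.0001

print(mixed_probability("light", ("the",), p_custom, p_general))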
Text to Intent
The speech recognition system(s) in voice2json produce text transcriptions that are then given to an intent recognition system. When both speech and intent systems are trained together from the same template file, all valid commands (with minor variations) should be correctly translated to JSON events.
voice2json
transforms the set of possible voice commands into a graph that acts as a finite state transducer (FST). When given a valid sentence as input, this transducer will output the (transformed) sentence along with “meta” words that provide the sentence’s intent and named entities.
As an example, consider the sentence template below for a LightState intent:
[LightState]
states = (on | off)
turn (<states>){state} [the] light
When trained with this template, voice2json will generate a graph like the one sketched below:
Each state is labeled with a number, and edges (arrows) have labels as well. The edge labels have a special format, which represent the input required to traverse the edge and the corresponding output. A colon (“:”) separates the input/output words on an edge, and is omitted when both input and output are the same. Output “words” that begin with two underscores (“__”) are “meta” words that provide additional information about the recognized sentence.
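Since the graph itself is hard to show in text, here is a hand-written sketch of roughly what its edges contain for the LightState template. The state numbers are illustrative, and edges with an empty input only emit a meta word:

# (from_state, to_state, input_word, output_word)
edges = [
    (0, 1, "",      "__label__LightState"),  # emit the intent label, consume nothing
    (1, 2, "turn",  "turn"),
    (2, 3, "",      "__begin__state"),       # start of the {state} tag
    (3, 4, "on",    "on"),
    (3, 4, "off",   "off"),
    (4, 5, "",      "__end__state"),         # end of the {state} tag
    (5, 6, "the",   "the"),                  # the optional word...
    (5, 6, "",      ""),                     # ...or skip it entirely
    (6, 7, "light", "light"),
]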
The FST above will accept all possible sentences in the template file:
- turn on the light
- turn on light
- turn off the light
- turn off light
This is the output when each sentence is accepted by the FST:
Input | Output |
---|---|
turn on the light | __label__LightState turn __begin__state on __end__state the light |
turn on light | __label__LightState turn __begin__state on __end__state light |
turn off the light | __label__LightState turn __begin__state off __end__state the light |
turn off light | __label__LightState turn __begin__state off __end__state light |
The __label__ notation is taken from fasttext, a highly performant sentence classification framework. A single meta __label__ word is produced for each sentence, labeling it with the proper intent name.
The __begin__ and __end__ meta words are used by voice2json to construct the JSON event for each sentence. They mark the beginning and end of a tagged block of text in the original template file – e.g., (on | off){state}. These begin/end symbols can be easily translated into a common scheme for annotating text corpora (IOB) in order to train a Named Entity Recognizer (NER). flair can read such corpora, for example, and train NERs using PyTorch.
The voice2json NLU library currently uses the following set of meta words:
- __label__INTENT
  - Sentence belongs to intent named INTENT
- __begin__TAG
  - Beginning of tag named TAG
- __end__TAG
  - End of tag named TAG
- __convert__CONV
  - Beginning of converter named CONV
- __converted__CONV
  - End of converter named CONV
- __source__SLOT
  - Name of slot list where text came from
- __unpack__PAYLOAD
  - Decodes PAYLOAD as a base64-encoded string and then interprets it as an edge label
fsticuffs
voice2json’s FST-based intent recognizer is called fsticuffs. It takes the intent graph generated during training and uses it to convert transcriptions from the speech system into JSON events.
Intent recognition is done by simply running the transcription through the intent graph and parsing the output words (and meta words). The transcription “turn on the light” is split (by whitespace) into the words turn, on, the, light.
Following a path through the example intent graph above with the words as input symbols, this will output:
__label__LightState turn __begin__state on __end__state the light
A fairly simple state machine receives these symbols/words and constructs a structured intent that is ultimately converted to JSON. The intent’s name and named entities are recovered using the __label__, __begin__, and __end__ meta words. All non-meta words are collected for the final text string, which includes substitutions and conversions (a sketch of this parsing loop appears after the JSON example below). The final output is something like this:
{
"text": "turn on the light",
"intent": {
"name": "LightState"
},
"slots": {
"state": "on"
}
}
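A hedged sketch of such a state machine is shown below. This is illustrative only, not the fsticuffs source code:

import json

def parse_output(symbols):
    intent = {"text": "", "intent": {"name": ""}, "slots": {}}
    words = []        # non-meta words for the final text
    tag_name = None   # currently open tag, if any
    tag_words = []

    for symbol in symbols:
        if symbol.startswith("__label__"):
            intent["intent"]["name"] = symbol[len("__label__"):]
        elif symbol.startswith("__begin__"):
            tag_name = symbol[len("__begin__"):]
            tag_words = []
        elif symbol.startswith("__end__"):
            intent["slots"][tag_name] = " ".join(tag_words)
            tag_name = None
        elif symbol:  # ordinary (non-meta, non-empty) word
            words.append(symbol)
            if tag_name is not None:
                tag_words.append(symbol)

    intent["text"] = " ".join(words)
    return intent

symbols = ["__label__LightState", "turn", "__begin__state", "on",
           "__end__state", "the", "light"]
print(json.dumps(parse_output(symbols), indent=2))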
Fuzzy FSTs
What if fsticuffs were to receive the transcription “would you turn on the light”? This is not a valid example voice command, but seems reasonable to accept via text input (e.g., chat).
Because would and you are not words encoded in the intent graph, the FST will fail to recognize the sentence. To deal with this, voice2json allows stop words to be silently passed over during recognition if they would not have been accepted. This “fuzzy” recognition mode is slower, but allows many more sentences to be accepted.
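Conceptually, the fuzzy mode behaves like the simplified sketch below, which drops stop words that cannot be accepted. For brevity, this checks against a flat vocabulary rather than the current graph state, and the stop word list is a hypothetical example:

STOP_WORDS = {"would", "you", "please", "a", "an"}

def fuzzy_filter(words, accepted_words):
    kept = []
    for word in words:
        if word in accepted_words:
            kept.append(word)
        elif word in STOP_WORDS:
            continue  # silently pass over a stop word that would be rejected
        else:
            return None  # an unknown, non-stop word: recognition fails
    return kept

accepted = {"turn", "on", "off", "the", "light"}
print(fuzzy_filter("would you turn on the light".split(), accepted))
# ['turn', 'on', 'the', 'light']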
Conclusion
When trained, voice2json produces the following artifacts:
- A pronunciation dictionary containing only the words from your voice command templates
  - Words missing from the dictionary have their pronunciations guessed using a grapheme to phoneme model
- An intent graph that is used to recognize intents from sentences
  - Can optionally ignore common words to allow for “fuzzier” recognition
- A language model generated directly from the intent graph using opengrm
  - This may be optionally mixed with a large pre-built language model