Data Formats
voice2json strives to use only common data formats, preferably text-based. Some artifacts generated during training, such as your language model, are even usable by other speech systems.
Audio
voice2json expects 16-bit 16 kHz mono audio as input. When WAV data is provided in a different format, it is automatically converted with sox.
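If you are unsure whether a WAV file already matches this format, you can check it with Python's built-in wave module before handing it to voice2json (a minimal sketch; test.wav is just a placeholder name):

```python
import wave

def needs_conversion(wav_path: str) -> bool:
    """Return True if the WAV file is not already 16-bit 16 kHz mono."""
    with wave.open(wav_path, "rb") as wav_file:
        return not (
            wav_file.getsampwidth() == 2          # 16-bit samples (2 bytes)
            and wav_file.getframerate() == 16000  # 16 kHz sample rate
            and wav_file.getnchannels() == 1      # mono
        )

print(needs_conversion("test.wav"))
```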
Transcriptions
The transcribe-wav command produces JSON in the following format:
{
    "text": "transcription text",
    "transcribe_seconds": 0.123,
    "wav_name": "test.wav",
    "wav_seconds": 1.456
}
where
- text is the most likely transcription of the audio data (string)
- transcribe_seconds is the number of seconds it took to transcribe (number)
- wav_name is the name of the WAV file (string)
- wav_seconds is the duration of the WAV audio in seconds (number)
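As an illustration, here is one way to call transcribe-wav from Python and parse its JSON output (a sketch only; it assumes voice2json is installed on your PATH and that test.wav exists):

```python
import json
import subprocess

# Transcribe a WAV file; transcribe-wav prints one JSON object per line.
result = subprocess.run(
    ["voice2json", "transcribe-wav", "test.wav"],
    capture_output=True,
    text=True,
    check=True,
)

for line in result.stdout.splitlines():
    transcription = json.loads(line)
    print(transcription["text"], "-", transcription["wav_seconds"], "seconds of audio")
```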
Intents
The recognize-intent command produces JSON in the following format:
{
    "intent": {
        "name": "NameOfIntent",
        "confidence": 1.0
    },
    "entities": [
        { "entity": "entity_1", "value": "value_1", "raw_value": "value_1",
          "start": 0, "end": 1, "raw_start": 0, "raw_end": 1 },
        { "entity": "entity_2", "value": "value_2", "raw_value": "value_2",
          "start": 0, "end": 1, "raw_start": 0, "raw_end": 1 }
    ],
    "slots": {
        "entity_1": "value_1",
        "entity_2": "value_2"
    },
    "text": "transcription text with substitutions",
    "raw_text": "transcription text without substitutions",
    "tokens": ["transcription", "text", "with", "substitutions"],
    "raw_tokens": ["transcription", "text", "without", "substitutions"],
    "recognize_seconds": 0.001
}
where
- intent describes the recognized intent (object)
  - name is the name of the recognized intent (section headers in your sentences.ini) (string)
  - confidence is a value between 0 and 1, with 1 being maximally confident (number)
- entities is a list of recognized entities (list)
  - entity is the name of the slot (string)
  - value is the (substituted) value (string)
  - raw_value is the (non-substituted) value (string)
  - start is the zero-based start index of the entity in text (number)
  - raw_start is the zero-based start index of the entity in raw_text (number)
  - end is the zero-based end index (exclusive) of the entity in text (number)
  - raw_end is the zero-based end index (exclusive) of the entity in raw_text (number)
- slots is a dictionary of entities/values (object)
  - Assumes one value per entity. See entities for the complete list.
- text is the input text with substitutions (string)
- raw_text is the input text without substitutions
- tokens is the list of words/tokens in text
- raw_tokens is the list of words/tokens in raw_text
- recognize_seconds is the number of seconds it took to recognize the intent and slots (number)
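For example, you might feed a transcription into recognize-intent and act on the resulting JSON like this (a sketch; it assumes voice2json is on your PATH, and the ChangeLightState intent and its slots come from a hypothetical sentences.ini):

```python
import json
import subprocess

# recognize-intent reads transcription JSON objects (like those produced by
# transcribe-wav) on standard input, one per line.
transcription = json.dumps({"text": "turn on the living room lamp"})

proc = subprocess.run(
    ["voice2json", "recognize-intent"],
    input=transcription,
    capture_output=True,
    text=True,
    check=True,
)

intent = json.loads(proc.stdout)
if intent["intent"]["name"] == "ChangeLightState":
    # slots collapses the entities list into a simple name -> value mapping
    print(intent["slots"])
```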
Pronunciation Dictionaries
Dictionaries are expected in plaintext, with the following format:
word1 P1 P2 P3
word2 P1 P4 P5
...
Each line starts with a word, followed by whitespace and then a whitespace-separated list of phonemes. These phonemes must match what the acoustic model was trained to recognize.
Multiple pronunciations for the same word are possible, and may optionally contain an index:
word P1 P2 P3
word(1) P2 P2 P3
word(2) P3 P2 P3
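For example, a dictionary in this format can be parsed with a few lines of Python (a minimal sketch that handles only the word/phoneme layout described above):

```python
import re
from collections import defaultdict

def load_dictionary(path):
    """Map each word to a list of pronunciations (each a list of phonemes)."""
    pronunciations = defaultdict(list)
    with open(path, encoding="utf-8") as dict_file:
        for line in dict_file:
            parts = line.split()
            if not parts:
                continue
            word, phonemes = parts[0], parts[1:]
            # Strip an optional pronunciation index, e.g. "word(2)" -> "word"
            word = re.sub(r"\(\d+\)$", "", word)
            pronunciations[word].append(phonemes)
    return pronunciations

# pronunciations["word"] == [["P1", "P2", "P3"], ["P2", "P2", "P3"], ...]
pronunciations = load_dictionary("base_dictionary.txt")
```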
A voice2json profile will typically contain 3 dictionaries:
- base_dictionary.txt - A large, pre-built dictionary with most of the words in a given language
- custom_words.txt - A small, user-defined dictionary with custom words or pronunciations
- dictionary.txt - Contains exactly the vocabulary needed for a profile
  - Automatically generated by train-profile
Sounds Like Pronunciations
voice2json supports an alternative way of specifying custom word pronunciations. In a file named sounds_like.txt in your profile (see training.sounds-like-file), you can describe how a word should be pronounced by referencing other words:
unknown_word1 known_word1 [known_word2] ...
...
For example, the singer Beyoncé sounds like a combination of the words “bee yawn say”:
beyoncé bee yawn say
During training, voice2json will look up the pronunciations for the known words and construct a pronunciation for the unknown word from them. The base dictionary and your custom words are consulted for known word pronunciations.
You may reference a specific pronunciation for a known word using the word(n) syntax, where n is 1-based. Pronunciations are loaded in line order from base_dictionary.txt first and then custom_words.txt. For example, read(2) will reference the second pronunciation of the word “read”. Without an (n), all pronunciations found will be used.
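The construction is essentially a concatenation of the referenced pronunciations. The sketch below illustrates the idea with a toy dictionary; the ARPAbet-style phonemes are only examples, and voice2json's real lookup also considers every pronunciation of an unindexed reference rather than just the first:

```python
import re

# Toy phonetic dictionary (word -> list of pronunciations) standing in for
# base_dictionary.txt and custom_words.txt; phonemes are illustrative only.
pronunciations = {
    "bee": [["B", "IY"]],
    "yawn": [["Y", "AO", "N"]],
    "say": [["S", "EY"]],
}

def sounds_like(reference_words, pronunciations):
    """Build a pronunciation by concatenating referenced known words."""
    phonemes = []
    for ref in reference_words:
        match = re.fullmatch(r"(.+)\((\d+)\)", ref)
        if match:  # word(n) syntax: use the n-th pronunciation (1-based)
            word, index = match.group(1), int(match.group(2))
            phonemes.extend(pronunciations[word][index - 1])
        else:      # no index: this sketch simply uses the first pronunciation
            phonemes.extend(pronunciations[ref][0])
    return phonemes

# beyoncé bee yawn say  ->  B IY Y AO N S EY
print(sounds_like(["bee", "yawn", "say"], pronunciations))
```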
Phoneme Literals
You can interject phonetic chunks into these pronunciations too. For example, the word “hooiser” sounds like “who” and the “-zure” in “azure”:
hooiser who /Z 3/
Text between slashes (/) will be interpreted as phonemes in the configured speech system.
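A simple way to separate the word references from the phoneme literals is to tokenize on the slashes (an illustrative tokenizer, not voice2json's own parser):

```python
import re

def split_sounds_like(pronunciation):
    """Split a sounds_like entry into ("word", ...) and ("phonemes", ...) chunks."""
    tokens = []
    for chunk in re.findall(r"/[^/]*/|\S+", pronunciation):
        if chunk.startswith("/") and chunk.endswith("/"):
            tokens.append(("phonemes", chunk.strip("/").split()))
        else:
            tokens.append(("word", chunk))
    return tokens

# hooiser who /Z 3/
print(split_sounds_like("who /Z 3/"))
# [('word', 'who'), ('phonemes', ['Z', '3'])]
```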
Word Segments
If a grapheme-to-phoneme alignment corpus is available (training.grapheme-to-phoneme-corpus), segments of words can also be used for pronunciations. Using the “hooiser” example above, we can replace the phonemes with:
hooiser who a>zure<
This will combine the pronunciation of “who” from the current phonetic dictionaries (base_dictionary.txt and custom_words.txt) and the “-zure” from the word “azure”.
The brackets point >at< the segment of the word that you want to contribute to the pronunciation. This is accomplished using a grapheme-to-phoneme alignment corpus generated with phonetisaurus and the base_dictionary.txt file. In the a>zure< example, the word “azure” is located in the alignment corpus, and the output phonemes aligned with the letters “zure” are used.
Language Models
Language models must be in plaintext ARPA format.
A voice2json profile will typically contain 2 language models:
- base_language_model.txt - A large, pre-built language model that summarizes a given language
  - Used when the --open flag is given to transcribe-wav
  - Used during language model mixing
- language_model.txt - Summarizes the valid voice commands for a profile
  - Automatically generated by train-profile
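Since ARPA models are plain text, they are easy to inspect. For example, you can read the n-gram counts out of the \data\ header (a minimal sketch; the file name comes from the list above):

```python
# Read n-gram counts from the \data\ header of an ARPA language model.
ngram_counts = {}
with open("language_model.txt", encoding="utf-8") as lm_file:
    for line in lm_file:
        line = line.strip()
        if line.startswith("ngram "):
            # Header lines look like "ngram 1=12345"
            order, count = line[len("ngram "):].split("=")
            ngram_counts[int(order)] = int(count)
        elif line.startswith("\\1-grams:"):
            break  # the header is over once the first n-gram section starts

print(ngram_counts)  # e.g. {1: 1234, 2: 5678, 3: 9012}
```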
When using DeepSpeech, language models are automatically converted to binary format using KenLM.
Grapheme To Phoneme Models
A grapheme-to-phoneme (g2p) model helps guess the pronunciations of words outside of the dictionary. These models are trained on each profile’s base_dictionary.txt file using phonetisaurus and saved in the OpenFST binary format.
G2P prediction can also be done using transformer models.
Phoneme Maps
Each profile contains one or more text files that contain a mapping from the phonemes present in the profile’s pronunciation dictionaries to either eSpeak (espeak_phonemes.txt) or MaryTTS (marytts_phonemes.txt). The format is simple:
S1 D1
S2 D2
...
where S1 is a source dictionary phoneme and D1 is a destination eSpeak/MaryTTS phoneme. These mappings are produced manually, and may not be perfect. The goal is to help users hear how the speech recognizer is expecting a word to be pronounced and to control how words are spoken.
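For example, a phoneme map in this format can be loaded and applied to a dictionary pronunciation like this (a minimal sketch; the phonemes shown are only illustrative):

```python
def load_phoneme_map(path):
    """Load source -> destination phoneme pairs from a phoneme map file."""
    mapping = {}
    with open(path, encoding="utf-8") as map_file:
        for line in map_file:
            parts = line.split()
            if len(parts) == 2:
                source, destination = parts
                mapping[source] = destination
    return mapping

phoneme_map = load_phoneme_map("espeak_phonemes.txt")

# Translate a pronunciation from the profile's dictionary phoneme by phoneme,
# keeping any phoneme that has no mapping as-is.
dictionary_phonemes = ["HH", "AH", "L", "OW"]  # illustrative ARPAbet-style phonemes
espeak_phonemes = [phoneme_map.get(p, p) for p in dictionary_phonemes]
print(" ".join(espeak_phonemes))
```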