Data Formats
voice2json strives to use only common data formats, preferably text-based. Some artifacts generated during training, such as your language model, are even usable by other speech systems.
Audio
voice2json expects 16-bit, 16 kHz mono audio as input. When WAV data is provided in a different format, it is automatically converted with sox.
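If you want to perform the same conversion yourself, a minimal Python sketch (assuming sox is installed and on your PATH; the file names are placeholders) might look like:

import subprocess

def to_16bit_16khz_mono(in_wav: str, out_wav: str) -> None:
    """Convert a WAV file to 16-bit, 16 kHz, mono audio with sox."""
    subprocess.run(
        ["sox", in_wav, "-r", "16000", "-e", "signed-integer", "-b", "16", "-c", "1", out_wav],
        check=True,
    )

to_16bit_16khz_mono("input.wav", "converted.wav")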
Transcriptions
The transcribe-wav command produces JSON in the following format:
{
    "text": "transcription text",
    "transcribe_seconds": 0.123,
    "wav_name": "test.wav",
    "wav_seconds": 1.456
}
where
- text is the most likely transcription of the audio data (string)
- transcribe_seconds is the number of seconds it took to transcribe (number)
- wav_name is the name of the WAV file (string)
- wav_seconds is the duration of the WAV audio in seconds (number)
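As a rough usage sketch, the output can be consumed from Python like this (assuming voice2json is installed and a profile has been trained; test.wav is a placeholder file name):

import json
import subprocess

# Run transcribe-wav on a WAV file; one JSON object is printed per input file
result = subprocess.run(
    ["voice2json", "transcribe-wav", "test.wav"],
    capture_output=True,
    text=True,
    check=True,
)

for line in result.stdout.splitlines():
    transcription = json.loads(line)
    print(transcription["wav_name"], "->", transcription["text"])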
Intents
The recognize-intent command produces JSON in the following format:
{
    "intent": {
        "name": "NameOfIntent",
        "confidence": 1.0
    },
    "entities": [
        { "entity": "entity_1", "value": "value_1", "raw_value": "value_1",
          "start": 0, "end": 1, "raw_start": 0, "raw_end": 1 },
        { "entity": "entity_2", "value": "value_2", "raw_value": "value_2",
          "start": 0, "end": 1, "raw_start": 0, "raw_end": 1 }
    ],
    "slots": {
        "entity_1": "value_1",
        "entity_2": "value_2"
    },
    "text": "transcription text with substitutions",
    "raw_text": "transcription text without substitutions",
    "tokens": ["transcription", "text", "with", "substitutions"],
    "raw_tokens": ["transcription", "text", "without", "substitutions"],
    "recognize_seconds": 0.001
}
where
- intent describes the recognized intent (object)
    - name is the name of the recognized intent (section headers in your sentences.ini) (string)
    - confidence is a value between 0 and 1, with 1 being maximally confident (number)
- entities is a list of recognized entities (list)
    - entity is the name of the slot (string)
    - value is the (substituted) value (string)
    - raw_value is the (non-substituted) value (string)
    - start is the zero-based start index of the entity in text (number)
    - raw_start is the zero-based start index of the entity in raw_text (number)
    - end is the zero-based end index (exclusive) of the entity in text (number)
    - raw_end is the zero-based end index (exclusive) of the entity in raw_text (number)
- slots is a dictionary of entities/values (object)
    - Assumes one value per entity. See entities for the complete list.
- text is the input text with substitutions (string)
- raw_text is the input text without substitutions (string)
- tokens is the list of words/tokens in text (list)
- raw_tokens is the list of words/tokens in raw_text (list)
- recognize_seconds is the number of seconds it took to recognize the intent and slots (number)
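The slots dictionary is usually the easiest part to work with from code. The snippet below is a sketch that assumes the recognize-intent output has been saved to a file named intents.jsonl, one JSON object per line (the file name is a placeholder):

import json

with open("intents.jsonl", "r", encoding="utf-8") as intent_file:
    for line in intent_file:
        result = json.loads(line)
        print("Intent:", result["intent"]["name"],
              "confidence:", result["intent"]["confidence"])
        for slot_name, slot_value in result["slots"].items():
            print(f"  {slot_name} = {slot_value}")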
Pronunciation Dictionaries
Dictionaries are expected in plaintext, with the following format:
word1 P1 P2 P3
word2 P1 P4 P5
...
Each line starts with a word followed, after some whitespace, by a list of phonemes (separated by whitespace). These phonemes must match what the acoustic model was trained to recognize.
Multiple pronunciations for the same word are possible, and may optionally contain an index:
word P1 P2 P3
word(1) P2 P2 P3
word(2) P3 P2 P3
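As an illustration only (not part of voice2json's API), here is a small Python sketch that loads this format, stripping the optional (n) index from repeated words:

import re
from collections import defaultdict
from typing import Dict, List

def load_pronunciations(path: str) -> Dict[str, List[List[str]]]:
    """Map each word to its list of pronunciations (each a list of phonemes)."""
    pronunciations: Dict[str, List[List[str]]] = defaultdict(list)
    with open(path, "r", encoding="utf-8") as dict_file:
        for line in dict_file:
            parts = line.split()
            if not parts:
                continue
            word, phonemes = parts[0], parts[1:]
            word = re.sub(r"\(\d+\)$", "", word)  # strip an optional (n) index
            pronunciations[word].append(phonemes)
    return dict(pronunciations)

# e.g. load_pronunciations("base_dictionary.txt")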
A voice2json profile will typically contain 3 dictionaries:
base_dictionary.txt
- A large, pre-built dictionary with most of the words in a given language
custom_words.txt
- A small, user-defined dictionary with custom words or pronunciations
dictionary.txt
- Contains exactly the vocabulary needed for a profile
- Automatically generated by train-profile
Sounds Like Pronunciations
voice2json supports an alternative way of specifying custom word pronunciations. In a file named sounds_like.txt in your profile (see training.sounds-like-file), you can describe how a word should be pronounced by referencing other words:
unknown_word1 known_word1 [known_word2] ...
...
For example, the singer Beyoncé sounds like a combination of the words “bee yawn say”:
beyoncé bee yawn say
During training, voice2json will look up the pronunciations of the known words and construct a pronunciation for the unknown word from them. The base dictionary and your custom words are consulted for known word pronunciations.
You may reference a specific pronunciation of a known word using the word(n) syntax, where n is 1-based. Pronunciations are loaded in line order from base_dictionary.txt first and then custom_words.txt. For example, read(2) will reference the second pronunciation of the word “read”. Without an (n), all pronunciations found will be used.
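To make the lookup-and-concatenate idea concrete, here is a simplified sketch that ignores word(n) indexes, phoneme literals, and word segments; the ARPAbet-style phonemes are only for illustration:

from typing import Dict, List, Tuple

def resolve_sounds_like(line: str, pronunciations: Dict[str, List[str]]) -> Tuple[str, List[str]]:
    """Build a pronunciation for the unknown word from the known words' phonemes."""
    unknown_word, *known_words = line.split()
    phonemes: List[str] = []
    for word in known_words:
        phonemes.extend(pronunciations[word])
    return unknown_word, phonemes

example_dict = {"bee": ["B", "IY"], "yawn": ["Y", "AO", "N"], "say": ["S", "EY"]}
print(resolve_sounds_like("beyoncé bee yawn say", example_dict))
# ('beyoncé', ['B', 'IY', 'Y', 'AO', 'N', 'S', 'EY'])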
Phoneme Literals
You can interject phonetic chunks into these pronunciations too. For example, the word “hooiser” sounds like “who” and the “-zure” in “azure”:
hooiser who /Z 3/
Text between slashes (/) will be interpreted as phonemes in the configured speech system.
Word Segments
If a grapheme-to-phoneme alignment corpus is available (training.grapheme-to-phoneme-corpus), segments of words can also be used for pronunciations. Using the “hooiser” example above, we can replace the phonemes with:
hooiser who a>zure<
This will combine the pronunciation of “who” from the current phonetic dictionaries (base_dictionary.txt and custom_words.txt) and the “-zure” from the word “azure”.
The brackets point >at< the segment of the word that you want to contribute to the pronunciation. This is accomplished using a grapheme-to-phoneme alignment corpus generated with phonetisaurus and the base_dictionary.txt file. In the a>zure< example, the word “azure” is located in the alignment corpus, and the output phonemes aligned to the letters “zure” are used.
Language Models
Language models must be in plaintext ARPA format.
A voice2json profile will typically contain 2 language models:
base_language_model.txt
- A large, pre-built language model that summarizes a given language
- Used when the --open flag is given to transcribe-wav
- Used during language model mixing
language_model.txt
- Summarizes the valid voice commands for a profile
- Automatically generated by train-profile
When using DeepSpeech, language models are automatically converted to binary format using KenLM.
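For reference, the \data\ header of a plaintext ARPA file lists the n-gram counts. A small sketch that reads just that header (assuming the standard ARPA layout; the file name is a placeholder):

from typing import Dict

def arpa_ngram_counts(path: str) -> Dict[int, int]:
    """Read n-gram counts from the \\data\\ header of an ARPA language model."""
    counts: Dict[int, int] = {}
    in_data = False
    with open(path, "r", encoding="utf-8") as lm_file:
        for line in lm_file:
            line = line.strip()
            if line == "\\data\\":
                in_data = True
            elif in_data and line.startswith("ngram"):
                order, count = line[len("ngram"):].split("=")
                counts[int(order)] = int(count)
            elif in_data and not line:
                break  # blank line ends the header
    return counts

# e.g. arpa_ngram_counts("language_model.txt")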
Grapheme To Phoneme Models
A grapheme-to-phoneme (g2p) model helps guess the pronunciations of words outside of the dictionary. These models are trained on each profile’s base_dictionary.txt file using phonetisaurus and saved in the OpenFST binary format.
G2P prediction can also be done using transformer models.
Phoneme Maps
Each profile contains one or more text files that map the phonemes in the profile’s pronunciation dictionaries to either eSpeak (espeak_phonemes.txt) or MaryTTS (marytts_phonemes.txt) phonemes. The format is simple:
S1 D1
S2 D2
...
where S1 is a source dictionary phoneme and D1 is a destination eSpeak/MaryTTS phoneme. These mappings are produced manually and may not be perfect. The goal is to help users hear how the speech recognizer expects a word to be pronounced and to control how words are spoken.
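A loading sketch, assuming one whitespace-separated source/destination pair per line:

from typing import Dict

def load_phoneme_map(path: str) -> Dict[str, str]:
    """Map dictionary phonemes to eSpeak or MaryTTS phonemes."""
    mapping: Dict[str, str] = {}
    with open(path, "r", encoding="utf-8") as map_file:
        for line in map_file:
            parts = line.split()
            if len(parts) >= 2:
                mapping[parts[0]] = parts[1]
    return mapping

# e.g. load_phoneme_map("espeak_phonemes.txt")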