Data Formats
voice2json strives to use only common data formats, preferably text-based. Some artifacts generated during training, such as your language model, are even usable by other speech systems.
Audio
voice2json expects 16-bit 16 kHz mono audio as input. When WAV data is provided in a different format, it is automatically converted with sox.
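If you are unsure whether a WAV file already matches this format, you can check it with Python's built-in wave module before handing it to voice2json (a minimal sketch; test.wav is just a placeholder name):

```python
import wave

def needs_conversion(wav_path: str) -> bool:
    """Return True if the WAV file is not already 16-bit 16 kHz mono."""
    with wave.open(wav_path, "rb") as wav_file:
        return not (
            wav_file.getsampwidth() == 2          # 16-bit samples (2 bytes)
            and wav_file.getframerate() == 16000  # 16 kHz sample rate
            and wav_file.getnchannels() == 1      # mono
        )

print(needs_conversion("test.wav"))
```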
Transcriptions
The transcribe-wav command produces JSON in the following format:
{
    "text": "transcription text",
    "transcribe_seconds": 0.123,
    "wav_name": "test.wav",
    "wav_seconds": 1.456
}
where
- text is the most likely transcription of the audio data (string)
- transcribe_seconds is the number of seconds it took to transcribe (number)
- wav_name is the name of the WAV file (string)
- wav_seconds is the duration of the WAV audio in seconds (number)
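As an illustration, here is one way to call transcribe-wav from Python and parse its JSON output (a sketch only; it assumes voice2json is installed on your PATH and that test.wav exists):

```python
import json
import subprocess

# Transcribe a WAV file; transcribe-wav prints one JSON object per line.
result = subprocess.run(
    ["voice2json", "transcribe-wav", "test.wav"],
    capture_output=True,
    text=True,
    check=True,
)

for line in result.stdout.splitlines():
    transcription = json.loads(line)
    print(transcription["text"], "-", transcription["wav_seconds"], "seconds of audio")
```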
Intents
The recognize-intent command produces JSON in the following format:
{
    "intent": {
        "name": "NameOfIntent",
        "confidence": 1.0
    },
    "entities": [
        { "entity": "entity_1", "value": "value_1", "raw_value": "value_1",
          "start": 0, "end": 1, "raw_start": 0, "raw_end": 1 },
        { "entity": "entity_2", "value": "value_2", "raw_value": "value_2",
          "start": 0, "end": 1, "raw_start": 0, "raw_end": 1 }
    ],
    "slots": {
        "entity_1": "value_1",
        "entity_2": "value_2"
    },
    "text": "transcription text with substitutions",
    "raw_text": "transcription text without substitutions",
    "tokens": ["transcription", "text", "with", "substitutions"],
    "raw_tokens": ["transcription", "text", "without", "substitutions"],
    "recognize_seconds": 0.001
}
where
- intent describes the recognized intent (object)
  - name is the name of the recognized intent (section headers in your sentences.ini) (string)
  - confidence is a value between 0 and 1, with 1 being maximally confident (number)
- entities is a list of recognized entities (list)
  - entity is the name of the slot (string)
  - value is the (substituted) value (string)
  - raw_value is the (non-substituted) value (string)
  - start is the zero-based start index of the entity in text (number)
  - raw_start is the zero-based start index of the entity in raw_text (number)
  - end is the zero-based end index (exclusive) of the entity in text (number)
  - raw_end is the zero-based end index (exclusive) of the entity in raw_text (number)
- slots is a dictionary of entities/values (object)
  - Assumes one value per entity. See entities for the complete list.
- text is the input text with substitutions (string)
- raw_text is the input text without substitutions
- tokens is the list of words/tokens in text
- raw_tokens is the list of words/tokens in raw_text
- recognize_seconds is the number of seconds it took to recognize the intent and slots (number)
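For example, you might feed a transcription into recognize-intent and act on the resulting JSON like this (a sketch; it assumes voice2json is on your PATH, and the ChangeLightState intent and its slots come from a hypothetical sentences.ini):

```python
import json
import subprocess

# recognize-intent reads transcription JSON objects (like those produced by
# transcribe-wav) on standard input, one per line.
transcription = json.dumps({"text": "turn on the living room lamp"})

proc = subprocess.run(
    ["voice2json", "recognize-intent"],
    input=transcription,
    capture_output=True,
    text=True,
    check=True,
)

intent = json.loads(proc.stdout)
if intent["intent"]["name"] == "ChangeLightState":
    # slots collapses the entities list into a simple name -> value mapping
    print(intent["slots"])
```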
Pronunciation Dictionaries
Dictionaries are expected in plaintext, with the following format:
word1 P1 P2 P3
word2 P1 P4 P5
...
Each line starts with a word, followed by whitespace and then a whitespace-separated list of phonemes. These phonemes must match what the acoustic model was trained to recognize.
Multiple pronunciations for the same word are possible, and may optionally contain an index:
word P1 P2 P3
word(1) P2 P2 P3
word(2) P3 P2 P3
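For example, a dictionary in this format can be parsed with a few lines of Python (a minimal sketch that handles only the word/phoneme layout described above):

```python
import re
from collections import defaultdict

def load_dictionary(path):
    """Map each word to a list of pronunciations (each a list of phonemes)."""
    pronunciations = defaultdict(list)
    with open(path, encoding="utf-8") as dict_file:
        for line in dict_file:
            parts = line.split()
            if not parts:
                continue
            word, phonemes = parts[0], parts[1:]
            # Strip an optional pronunciation index, e.g. "word(2)" -> "word"
            word = re.sub(r"\(\d+\)$", "", word)
            pronunciations[word].append(phonemes)
    return pronunciations

# pronunciations["word"] == [["P1", "P2", "P3"], ["P2", "P2", "P3"], ...]
pronunciations = load_dictionary("base_dictionary.txt")
```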
A voice2json profile will typically contain 3 dictionaries:
- base_dictionary.txt - A large, pre-built dictionary with most of the words in a given language
- custom_words.txt - A small, user-defined dictionary with custom words or pronunciations
- dictionary.txt - Contains exactly the vocabulary needed for a profile
  - Automatically generated by train-profile
Sounds Like Pronunciations
voice2json supports an alternative way of specifying custom word pronunciations. In a file named sounds_like.txt in your profile (see training.sounds-like-file), you can describe how a word should be pronounced by referencing other words:
unknown_word1 known_word1 [known_word2] ...
...
For example, the singer Beyoncé sounds like a combination of the words “bee yawn say”:
beyoncé bee yawn say
During training, voice2json will look up the pronunciations for the known words and construct a pronunciation for the unknown word from them. The base dictionary and your custom words are consulted for known word pronunciations.
You may reference a specific pronunciation for a known word using the word(n) syntax, where n is 1-based. Pronunciations are loaded in line order from base_dictionary.txt first and then custom_words.txt. For example, read(2) will reference the second pronunciation of the word “read”. Without an (n), all pronunciations found will be used.
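The construction is essentially a concatenation of the referenced pronunciations. The sketch below illustrates the idea with a toy dictionary; the ARPAbet-style phonemes are only examples, and voice2json's real lookup also considers every pronunciation of an unindexed reference rather than just the first:

```python
import re

# Toy phonetic dictionary (word -> list of pronunciations) standing in for
# base_dictionary.txt and custom_words.txt; phonemes are illustrative only.
pronunciations = {
    "bee": [["B", "IY"]],
    "yawn": [["Y", "AO", "N"]],
    "say": [["S", "EY"]],
}

def sounds_like(reference_words, pronunciations):
    """Build a pronunciation by concatenating referenced known words."""
    phonemes = []
    for ref in reference_words:
        match = re.fullmatch(r"(.+)\((\d+)\)", ref)
        if match:  # word(n) syntax: use the n-th pronunciation (1-based)
            word, index = match.group(1), int(match.group(2))
            phonemes.extend(pronunciations[word][index - 1])
        else:      # no index: this sketch simply uses the first pronunciation
            phonemes.extend(pronunciations[ref][0])
    return phonemes

# beyoncé bee yawn say  ->  B IY Y AO N S EY
print(sounds_like(["bee", "yawn", "say"], pronunciations))
```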
Phoneme Literals
You can interject phonetic chunks into these pronunciations too. For example, the word “hooiser” sounds like “who” and the “-zure” in “azure”:
hooiser who /Z 3/
Text between slashes (/) will be interpreted as phonemes in the configured speech system.
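A simple way to separate the word references from the phoneme literals is to tokenize on the slashes (an illustrative tokenizer, not voice2json's own parser):

```python
import re

def split_sounds_like(pronunciation):
    """Split a sounds_like entry into ("word", ...) and ("phonemes", ...) chunks."""
    tokens = []
    for chunk in re.findall(r"/[^/]*/|\S+", pronunciation):
        if chunk.startswith("/") and chunk.endswith("/"):
            tokens.append(("phonemes", chunk.strip("/").split()))
        else:
            tokens.append(("word", chunk))
    return tokens

# hooiser who /Z 3/
print(split_sounds_like("who /Z 3/"))
# [('word', 'who'), ('phonemes', ['Z', '3'])]
```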
Word Segments
If a grapheme-to-phoneme alignment corpus is available (training.grapheme-to-phoneme-corpus), segments of words can also be used for pronunciations. Using the “hooiser” example above, we can replace the phonemes with:
hooiser who a>zure<
This will combine the pronunciation of “who” from the current phonetic dictionaries (base_dictionary.txt and custom_words.txt) and the “-zure” from the word “azure”.
The brackets point >at< the segment of the word that you want to contribute to the pronunciation. This is accomplished using a grapheme-to-phoneme alignment corpus generated with phonetisaurus and the base_dictionary.txt file. In the a>zure< example, the word “azure” is located in the alignment corpus, and the output phonemes aligned with the letters “zure” are used.
Language Models
Language models must be in plaintext ARPA format.
A voice2json profile will typically contain 2 language models:
- base_language_model.txt - A large, pre-built language model that summarizes a given language
  - Used when the --open flag is given to transcribe-wav
  - Used during language model mixing
- language_model.txt - Summarizes the valid voice commands for a profile
  - Automatically generated by train-profile
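Since ARPA models are plain text, they are easy to inspect. For example, you can read the n-gram counts out of the \data\ header (a minimal sketch; the file name comes from the list above):

```python
# Read n-gram counts from the \data\ header of an ARPA language model.
ngram_counts = {}
with open("language_model.txt", encoding="utf-8") as lm_file:
    for line in lm_file:
        line = line.strip()
        if line.startswith("ngram "):
            # Header lines look like "ngram 1=12345"
            order, count = line[len("ngram "):].split("=")
            ngram_counts[int(order)] = int(count)
        elif line.startswith("\\1-grams:"):
            break  # the header is over once the first n-gram section starts

print(ngram_counts)  # e.g. {1: 1234, 2: 5678, 3: 9012}
```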
When using DeepSpeech, language models are automatically converted to binary format using KenLM.
Grapheme To Phoneme Models
A grapheme-to-phoneme (g2p) model helps guess the pronunciations of words outside of the dictionary. These models are trained on each profile’s base_dictionary.txt file using phonetisaurus and saved in the OpenFST binary format.
G2P prediction can also be done using transformer models.
Phoneme Maps
Each profile contains one or more text files that contain a mapping from the phonemes present in the profile’s pronunciation dictionaries to either eSpeak (espeak_phonemes.txt) or MaryTTS (marytts_phonemes.txt). The format is simple:
S1 D1
S2 D2
...
where S1 is a source dictionary phoneme and D1 is a destination eSpeak/MaryTTS phoneme. These mappings are produced manually, and may not be perfect. The goal is to help users hear how the speech recognizer is expecting a word to be pronounced and to control how words are spoken.
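For example, a phoneme map in this format can be loaded and applied to a dictionary pronunciation like this (a minimal sketch; the phonemes shown are only illustrative):

```python
def load_phoneme_map(path):
    """Load source -> destination phoneme pairs from a phoneme map file."""
    mapping = {}
    with open(path, encoding="utf-8") as map_file:
        for line in map_file:
            parts = line.split()
            if len(parts) == 2:
                source, destination = parts
                mapping[source] = destination
    return mapping

phoneme_map = load_phoneme_map("espeak_phonemes.txt")

# Translate a pronunciation from the profile's dictionary phoneme by phoneme,
# keeping any phoneme that has no mapping as-is.
dictionary_phonemes = ["HH", "AH", "L", "OW"]  # illustrative ARPAbet-style phonemes
espeak_phonemes = [phoneme_map.get(p, p) for p in dictionary_phonemes]
print(" ".join(espeak_phonemes))
```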