Data Formats

voice2json strives to use only common data formats, preferably text-based. Some artifacts generated during training, such as your language model, are even usable by other speech systems.


voice2json expects 16-bit, 16 kHz mono audio as input. WAV data provided in any other format is automatically converted with sox.
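
To check ahead of time whether a WAV file already matches the expected format, a minimal sketch using only Python's standard wave module could look like the following. The helper names is_expected_format and make_silence are illustrative and are not part of voice2json.

```python
import io
import wave


def is_expected_format(wav_bytes: bytes) -> bool:
    """Return True if the WAV data is 16-bit, 16 kHz, mono."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as wav_file:
        return (
            wav_file.getsampwidth() == 2  # 2 bytes = 16 bits per sample
            and wav_file.getframerate() == 16000
            and wav_file.getnchannels() == 1
        )


def make_silence(rate: int, channels: int, seconds: float = 0.1) -> bytes:
    """Build a short silent 16-bit WAV in memory (for demonstration only)."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav_file:
        wav_file.setnchannels(channels)
        wav_file.setsampwidth(2)
        wav_file.setframerate(rate)
        wav_file.writeframes(b"\x00\x00" * channels * int(rate * seconds))
    return buffer.getvalue()


if __name__ == "__main__":
    print(is_expected_format(make_silence(16000, 1)))
    print(is_expected_format(make_silence(44100, 2)))
```

Audio that fails this check would still be accepted, since the conversion described above happens automatically.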


The transcribe-wav command produces JSON in the following format:

    {
        "text": "transcription text",
        "transcribe_seconds": 0.123,
        "wav_name": "test.wav",
        "wav_seconds": 1.456
    }

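As a rough illustration of consuming this output, the sketch below parses one result and computes how long transcription took relative to the audio length. It assumes each result arrives as a single JSON object on one line; the function name real_time_factor is hypothetical.

```python
import json


def real_time_factor(json_line: str) -> float:
    """Ratio of transcription time to audio duration (lower is faster)."""
    result = json.loads(json_line)
    return result["transcribe_seconds"] / result["wav_seconds"]


# Sample values copied from the format shown above.
SAMPLE_LINE = (
    '{"text": "transcription text", "transcribe_seconds": 0.123, '
    '"wav_name": "test.wav", "wav_seconds": 1.456}'
)

if __name__ == "__main__":
    print(round(real_time_factor(SAMPLE_LINE), 3))
```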


The recognize-intent command produces JSON in the following format:

    {
        "intent": {
            "name": "NameOfIntent",
            "confidence": 1.0
        },
        "entities": [
            { "entity": "entity_1", "value": "value_1", "raw_value": "value_1",
              "start": 0, "end": 1, "raw_start": 0, "raw_end": 1 },
            { "entity": "entity_2", "value": "value_2", "raw_value": "value_2",
              "start": 0, "end": 1, "raw_start": 0, "raw_end": 1 }
        ],
        "slots": {
            "entity_1": "value_1",
            "entity_2": "value_2"
        },
        "text": "transcription text with substitutions",
        "raw_text": "transcription text without substitutions",
        "tokens": ["transcription", "text", "with", "substitutions"],
        "raw_tokens": ["transcription", "text", "without", "substitutions"],
        "recognize_seconds": 0.001
    }

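In the example above, slots reads like a name-to-value summary of the entities list. The following sketch parses a trimmed copy of that sample and rebuilds the same mapping from entities. That slots always mirrors entities this way is an assumption drawn from the example, and slots_from_entities is an illustrative name.

```python
import json

# A trimmed copy of the sample output shown above.
SAMPLE_INTENT = """
{
    "intent": { "name": "NameOfIntent", "confidence": 1.0 },
    "entities": [
        { "entity": "entity_1", "value": "value_1", "raw_value": "value_1",
          "start": 0, "end": 1, "raw_start": 0, "raw_end": 1 },
        { "entity": "entity_2", "value": "value_2", "raw_value": "value_2",
          "start": 0, "end": 1, "raw_start": 0, "raw_end": 1 }
    ],
    "slots": { "entity_1": "value_1", "entity_2": "value_2" },
    "text": "transcription text with substitutions"
}
"""


def slots_from_entities(intent: dict) -> dict:
    """Collapse the entities list into an entity-name -> value mapping."""
    return {entity["entity"]: entity["value"] for entity in intent["entities"]}


if __name__ == "__main__":
    parsed = json.loads(SAMPLE_INTENT)
    print(parsed["intent"]["name"], slots_from_entities(parsed))
```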

Pronunciation Dictionaries

Dictionaries are expected in plaintext, with the following format:

word1 P1 P2 P3
word2 P1 P4 P5

Each line starts with a word, followed by whitespace and then a whitespace-separated list of phonemes. These phonemes must match what the acoustic model was trained to recognize.

Multiple pronunciations for the same word are possible, and may optionally contain an index:

word P1 P2 P3
word(1) P2 P2 P3
word(2) P3 P2 P3
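
A minimal sketch of reading this format, assuming the optional (n) suffix only distinguishes alternatives and can be dropped once the pronunciations are grouped by word. The function name parse_dictionary is illustrative, not a voice2json API.

```python
import re
from collections import defaultdict

# Matches an optional pronunciation index such as "word(2)".
_INDEXED_WORD = re.compile(r"^(.+)\((\d+)\)$")


def parse_dictionary(lines):
    """Group pronunciations by word, preserving line order."""
    pronunciations = defaultdict(list)
    for line in lines:
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank lines and words without phonemes
        word, phonemes = fields[0], fields[1:]
        match = _INDEXED_WORD.match(word)
        if match:
            word = match.group(1)  # drop the "(n)" suffix
        pronunciations[word].append(phonemes)
    return dict(pronunciations)


EXAMPLE_LINES = [
    "word P1 P2 P3",
    "word(1) P2 P2 P3",
    "word(2) P3 P2 P3",
]

if __name__ == "__main__":
    print(parse_dictionary(EXAMPLE_LINES))
```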

A voice2json profile will typically contain 3 dictionaries:

  1. base_dictionary.txt
    • A large, pre-built dictionary with most of the words in a given language
  2. custom_words.txt
    • A small, user-defined dictionary with custom words or pronunciations
  3. dictionary.txt
    • Contains exactly the vocabulary needed for a profile
    • Automatically generated by train-profile

Sounds Like Pronunciations

voice2json supports an alternative way of specifying custom word pronunciations. In a file named sounds_like.txt in your profile (see training.sounds-like-file), you can describe how a word should be pronounced by referencing other words:

unknown_word1 known_word1 [known_word2] ...

For example, the singer Beyoncé sounds like a combination of the words “bee yawn say”:

beyoncé bee yawn say

During training, voice2json will look up the pronunciations for the known words and construct a pronunciation for the unknown word from them. The base dictionary and your custom words are consulted for known word pronunciations.

You may reference a specific pronunciation for a known word using the word(n) syntax, where n is 1-based. Pronunciations are loaded in line order from base_dictionary.txt first and then custom_words.txt. For example, read(2) will reference the second pronunciation of the word “read”. Without an (n), all pronunciations found will be used.
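
The lookup described above can be sketched as follows. This is not voice2json's implementation: the function name, the in-memory dictionary layout, and the choice to combine alternatives as a Cartesian product are all assumptions, and the phonemes in the sample dictionary are purely illustrative.

```python
import itertools
import re

# Matches the word(n) syntax, where n is 1-based.
_INDEXED_WORD = re.compile(r"^(.+)\((\d+)\)$")


def resolve_sounds_like(known_words, dictionary):
    """Build candidate pronunciations by concatenating known-word phonemes.

    dictionary maps a word to its pronunciations in load order.
    """
    options = []
    for token in known_words:
        match = _INDEXED_WORD.match(token)
        if match:
            word, n = match.group(1), int(match.group(2))
            options.append([dictionary[word][n - 1]])  # one specific pronunciation
        else:
            options.append(dictionary[token])  # every pronunciation found
    return [
        [phoneme for part in combination for phoneme in part]
        for combination in itertools.product(*options)
    ]


# Illustrative phonemes only; real ones depend on the acoustic model.
SAMPLE_DICTIONARY = {
    "bee": [["B", "IY"]],
    "yawn": [["Y", "AO", "N"]],
    "say": [["S", "EY"]],
    "read": [["R", "EH", "D"], ["R", "IY", "D"]],
}

if __name__ == "__main__":
    print(resolve_sounds_like(["bee", "yawn", "say"], SAMPLE_DICTIONARY))
    print(resolve_sounds_like(["read(2)"], SAMPLE_DICTIONARY))
```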

Phoneme Literals

You can interject phonetic chunks into these pronunciations too. For example, the word “hooiser” sounds like “who” and the “-zure” in “azure”:

hooiser who /Z 3/

Text between slashes (/) will be interpreted as phonemes in the configured speech system.
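
One way to separate phoneme literals from ordinary known words is a small tokenizer like the sketch below. The function name and the ("word" / "phonemes") tagging are hypothetical choices for illustration.

```python
import re

# Either a /slash-delimited/ phoneme literal or a plain whitespace-free token.
_TOKEN = re.compile(r"/([^/]*)/|(\S+)")


def parse_sounds_like_line(line):
    """Split a sounds_like.txt line into the unknown word and its parts."""
    fields = line.split(maxsplit=1)
    unknown_word = fields[0]
    rest = fields[1] if len(fields) > 1 else ""
    parts = []
    for match in _TOKEN.finditer(rest):
        if match.group(1) is not None:
            parts.append(("phonemes", match.group(1).split()))
        else:
            parts.append(("word", match.group(2)))
    return unknown_word, parts


if __name__ == "__main__":
    print(parse_sounds_like_line("hooiser who /Z 3/"))
```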

Word Segments

If a grapheme-to-phoneme alignment corpus is available (training.grapheme-to-phoneme-corpus), segments of words can also be used for pronunciations. Using the “hooiser” example above, we can replace the phonemes with:

hooiser who a>zure<

This will combine the pronunciation of “who” from the current phonetic dictionaries (base_dictionary.txt and custom_words.txt) and the “-zure” from the word “azure”.

The brackets point >at< the segment of the word that you want to contribute to the pronunciation. This is accomplished using a grapheme-to-phoneme alignment corpus generated using phonetisaurus and the base_dictionary.txt file. In the a>zure< example, the word “azure” is located in the alignment corpus, and the phonemes aligned with the letters “zure” in it are used.
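
To make the bracket notation concrete, the sketch below extracts the marked segment and then selects the phonemes aligned with it. The alignment is represented here as a plain list of (graphemes, phonemes) pairs, which is a simplified stand-in and not the actual phonetisaurus corpus format. The function names and the sample phonemes are likewise hypothetical.

```python
def parse_segment(token):
    """Return (word, segment, start offset) for a token such as 'a>zure<'."""
    start = token.index(">")
    end = token.index("<")
    word = token.replace(">", "").replace("<", "")
    segment = token[start + 1:end]
    return word, segment, start  # start = number of letters before the segment


def segment_phonemes(alignment, start, length):
    """Collect phonemes whose graphemes fall entirely inside the segment.

    alignment is a list of (graphemes, phonemes) pairs covering the word in order.
    """
    phonemes = []
    position = 0
    for graphemes, aligned in alignment:
        if start <= position and position + len(graphemes) <= start + length:
            phonemes.extend(aligned)
        position += len(graphemes)
    return phonemes


# Hypothetical alignment for "azure"; real alignments come from the corpus.
AZURE_ALIGNMENT = [("a", ["AE"]), ("z", ["ZH"]), ("ure", ["ER"])]

if __name__ == "__main__":
    word, segment, start = parse_segment("a>zure<")
    print(word, segment, segment_phonemes(AZURE_ALIGNMENT, start, len(segment)))
```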

Language Models

Language models must be in plaintext ARPA format.
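
For readers unfamiliar with ARPA files, they begin with a \data\ header that lists how many n-grams of each order follow. The sketch below reads just that header from a tiny hand-written sample. The sample model and the function name ngram_counts are illustrative and were not produced by voice2json.

```python
import re

# A tiny hand-written ARPA-style sample (log10 probabilities, optional backoffs).
SAMPLE_ARPA = r"""
\data\
ngram 1=3
ngram 2=2

\1-grams:
-0.4771 <s> -0.3010
-0.4771 </s>
-0.4771 hello -0.3010

\2-grams:
-0.3010 <s> hello
-0.3010 hello </s>

\end\
"""

_COUNT_LINE = re.compile(r"ngram (\d+)=(\d+)$")


def ngram_counts(arpa_text):
    """Read the n-gram counts declared in the \\data\\ header."""
    counts = {}
    in_header = False
    for line in arpa_text.splitlines():
        line = line.strip()
        if line == "\\data\\":
            in_header = True
        elif in_header:
            match = _COUNT_LINE.match(line)
            if match:
                counts[int(match.group(1))] = int(match.group(2))
            elif line.startswith("\\"):
                break  # reached the first n-gram section
    return counts


if __name__ == "__main__":
    print(ngram_counts(SAMPLE_ARPA))
```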

A voice2json profile will typically contain 2 language models:

  1. base_language_model.txt
  2. language_model.txt
    • Summarizes the valid voice commands for a profile
    • Automatically generated by train-profile

When using DeepSpeech, language models are automatically converted to binary format using KenLM.

Grapheme To Phoneme Models

A grapheme-to-phoneme (g2p) model helps guess the pronunciations of words outside of the dictionary. These models are trained on each profile’s base_dictionary.txt file using phonetisaurus and saved in the OpenFST binary format.

G2P prediction can also be done using transformer models.

Phoneme Maps

Each profile contains one or more text files that map the phonemes present in the profile’s pronunciation dictionaries to either eSpeak (espeak_phonemes.txt) or MaryTTS (marytts_phonemes.txt) phonemes. The format is simple:

S1 D1
S2 D2

where S1 is a source dictionary phoneme and D1 is a destination eSpeak/MaryTTS phoneme. These mappings are produced manually, and may not be perfect. The goal is to help users hear how the speech recognizer is expecting a word to be pronounced and to control how words are spoken.
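
A minimal sketch of loading such a map and applying it to a pronunciation, assuming one source and one destination phoneme per line and leaving unmapped phonemes unchanged. Both function names are illustrative.

```python
def load_phoneme_map(lines):
    """Parse 'source destination' lines into a dictionary."""
    mapping = {}
    for line in lines:
        fields = line.split()
        if len(fields) >= 2:
            mapping[fields[0]] = fields[1]
    return mapping


def map_pronunciation(phonemes, mapping):
    """Translate dictionary phonemes, keeping any that have no mapping."""
    return [mapping.get(phoneme, phoneme) for phoneme in phonemes]


EXAMPLE_MAP_LINES = ["S1 D1", "S2 D2"]

if __name__ == "__main__":
    phoneme_map = load_phoneme_map(EXAMPLE_MAP_LINES)
    print(map_pronunciation(["S1", "S2", "S3"], phoneme_map))
```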