
Data Formats

voice2json strives to use only common data formats, preferably text-based. Some artifacts generated during training, such as your language model, are even usable by other speech systems.


voice2json expects 16-bit, 16 kHz mono audio as input. WAV data provided in any other format is automatically converted with sox.
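
To see whether a WAV file already matches the expected format (and so would not need conversion), its header can be inspected. The following is a minimal sketch using Python's standard `wave` module; the function names `is_expected_format` and `make_silence` are hypothetical helpers and are not part of voice2json.

```python
import io
import wave


def is_expected_format(wav_bytes: bytes) -> bool:
    """Return True if the WAV data is 16-bit, 16 kHz, mono."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as wav_file:
        return (
            wav_file.getsampwidth() == 2  # 2 bytes per sample = 16-bit
            and wav_file.getframerate() == 16000
            and wav_file.getnchannels() == 1
        )


def make_silence(rate: int, channels: int, seconds: float = 0.1) -> bytes:
    """Build a short, silent 16-bit WAV in memory (for demonstration only)."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav_file:
        wav_file.setnchannels(channels)
        wav_file.setsampwidth(2)
        wav_file.setframerate(rate)
        wav_file.writeframes(b"\x00\x00" * int(rate * seconds) * channels)
    return buffer.getvalue()


if __name__ == "__main__":
    print(is_expected_format(make_silence(16000, 1)))
    print(is_expected_format(make_silence(44100, 2)))
```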


The transcribe-wav command produces JSON in the following format:

    {
        "text": "transcription text",
        "transcribe_seconds": 0.123,
        "wav_name": "test.wav",
        "wav_seconds": 1.456
    }
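
As an illustration of consuming this output, the sketch below parses one transcription result and computes a real-time factor (time spent transcribing divided by audio length). It assumes each result is available as a single JSON object string; `real_time_factor` is a hypothetical helper name, not a voice2json function.

```python
import json


def real_time_factor(result_json: str) -> float:
    """Return transcribe_seconds / wav_seconds for one transcription result."""
    result = json.loads(result_json)
    return result["transcribe_seconds"] / result["wav_seconds"]


if __name__ == "__main__":
    # Sample values taken from the format shown above.
    sample = json.dumps(
        {
            "text": "transcription text",
            "transcribe_seconds": 0.123,
            "wav_name": "test.wav",
            "wav_seconds": 1.456,
        }
    )
    print(f"real-time factor: {real_time_factor(sample):.3f}")
```

A value below 1.0 means the audio was transcribed faster than its own duration.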



The recognize-intent command produces JSON in the following format:

    {
        "intent": {
            "name": "NameOfIntent",
            "confidence": 1.0
        },
        "entities": [
            { "entity": "entity_1", "value": "value_1", "raw_value": "value_1",
              "start": 0, "end": 1, "raw_start": 0, "raw_end": 1 },
            { "entity": "entity_2", "value": "value_2", "raw_value": "value_2",
              "start": 0, "end": 1, "raw_start": 0, "raw_end": 1 }
        ],
        "slots": {
            "entity_1": "value_1",
            "entity_2": "value_2"
        },
        "text": "transcription text with substitutions",
        "raw_text": "transcription text without substitutions",
        "tokens": ["transcription", "text", "with", "substitutions"],
        "raw_tokens": ["transcription", "text", "without", "substitutions"],
        "recognize_seconds": 0.001
    }
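
A program that reacts to recognized intents usually only needs the intent name and the slots. The sketch below shows one way to pull those out, assuming the result is available as a single JSON object string. The helper name `extract_command` and the `min_confidence` threshold are illustrative assumptions, not part of voice2json.

```python
import json
from typing import Dict, Optional, Tuple


def extract_command(
    result_json: str, min_confidence: float = 0.5
) -> Optional[Tuple[str, Dict[str, str]]]:
    """Return (intent name, slots), or None if confidence is too low."""
    result = json.loads(result_json)
    intent = result.get("intent", {})
    if not intent.get("name") or intent.get("confidence", 0.0) < min_confidence:
        return None
    return intent["name"], result.get("slots", {})


if __name__ == "__main__":
    # Abbreviated sample using the field names from the format above.
    sample = json.dumps(
        {
            "intent": {"name": "NameOfIntent", "confidence": 1.0},
            "slots": {"entity_1": "value_1", "entity_2": "value_2"},
            "text": "transcription text with substitutions",
        }
    )
    print(extract_command(sample))
```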


Pronunciation Dictionaries

Dictionaries are expected in plaintext, with the following format:

word1 P1 P2 P3
word2 P1 P4 P5

Each line starts with a word, followed by whitespace and a whitespace-separated list of phonemes. These phonemes must match what the acoustic model was trained to recognize.

A word may have multiple pronunciations, one per line, and each line may optionally mark the word with an index:

word P1 P2 P3
word(1) P2 P2 P3
word(2) P3 P2 P3
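
The format above can be read with a few lines of code. This is a minimal sketch, not voice2json's own loader: `read_dictionary` is a hypothetical helper, and skipping lines with no phonemes is an assumption made for the example.

```python
import re
from collections import defaultdict
from typing import Dict, Iterable, List

# Matches an optional pronunciation index such as "word(2)".
_INDEXED_WORD = re.compile(r"^(.+)\((\d+)\)$")


def read_dictionary(lines: Iterable[str]) -> Dict[str, List[List[str]]]:
    """Map each word to the list of its pronunciations (lists of phonemes)."""
    pronunciations: Dict[str, List[List[str]]] = defaultdict(list)
    for line in lines:
        parts = line.split()
        if len(parts) < 2:
            continue  # assumption: ignore blank lines and words without phonemes
        word, phonemes = parts[0], parts[1:]
        match = _INDEXED_WORD.match(word)
        if match:
            word = match.group(1)  # drop the "(N)" index
        pronunciations[word].append(phonemes)
    return dict(pronunciations)


if __name__ == "__main__":
    example = ["word P1 P2 P3", "word(1) P2 P2 P3", "word(2) P3 P2 P3"]
    for word, variants in read_dictionary(example).items():
        print(word, variants)
```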

A voice2json profile will typically contain 3 dictionaries:

  1. base_dictionary.txt
    • A large, pre-built dictionary with most of the words in a given language
  2. custom_words.txt
    • A small, user-defined dictionary with custom words or pronunciations
  3. dictionary.txt
    • Contains exactly the vocabulary needed for a profile
    • Automatically generated by train-profile

Language Models

Language models must be in plaintext ARPA format.
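
An ARPA file begins with a `\data\` header listing how many n-grams of each order it contains, followed by one section per order and a closing `\end\` line. As a rough illustration, the sketch below reads only those header counts. The helper name `read_ngram_counts` and the sample model contents are made up for the example.

```python
import re
from typing import Dict, Iterable

_COUNT_LINE = re.compile(r"^ngram\s+(\d+)\s*=\s*(\d+)$")


def read_ngram_counts(lines: Iterable[str]) -> Dict[int, int]:
    """Return {n-gram order: count} from the \\data\\ header of an ARPA file."""
    counts: Dict[int, int] = {}
    in_header = False
    for raw_line in lines:
        line = raw_line.strip()
        if line == "\\data\\":
            in_header = True
        elif in_header:
            match = _COUNT_LINE.match(line)
            if match:
                counts[int(match.group(1))] = int(match.group(2))
            elif line:
                break  # first non-blank, non-count line ends the header
    return counts


if __name__ == "__main__":
    # Tiny, invented model used only to exercise the parser.
    example = [
        "\\data\\",
        "ngram 1=3",
        "ngram 2=2",
        "",
        "\\1-grams:",
        "-0.5 <s> -0.3",
        "-0.5 hello -0.3",
        "-0.5 </s>",
        "",
        "\\2-grams:",
        "-0.2 <s> hello",
        "-0.2 hello </s>",
        "",
        "\\end\\",
    ]
    print(read_ngram_counts(example))
```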

A voice2json profile will typically contain 2 language models:

  1. base_language_model.txt
  2. language_model.txt
    • Summarizes the valid voice commands for a profile
    • Automatically generated by train-profile

When using DeepSpeech, language models are automatically converted to binary format using KenLM.

Grapheme To Phoneme Models

A grapheme-to-phoneme (g2p) model helps guess the pronunciations of words outside of the dictionary. These models are trained on each profile’s base_dictionary.txt file using phonetisaurus and saved in the OpenFST binary format.

G2P prediction can also be done using transformer models.

Phoneme Maps

Each profile contains one or more text files that map the phonemes used in the profile’s pronunciation dictionaries to either eSpeak (espeak_phonemes.txt) or MaryTTS (marytts_phonemes.txt) phonemes. The format is simple:

S1 D1
S2 D2

where S1 is a source dictionary phoneme and D1 is a destination eSpeak/MaryTTS phoneme. These mappings are produced manually, and may not be perfect. The goal is to help users hear how the speech recognizer is expecting a word to be pronounced and to control how words are spoken.
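
A phoneme map in this format can be loaded and applied to a pronunciation as sketched below. The helper names `read_phoneme_map` and `map_phonemes` are hypothetical, and passing unmapped phonemes through unchanged is an assumption made for the example rather than documented voice2json behavior.

```python
from typing import Dict, Iterable, List


def read_phoneme_map(lines: Iterable[str]) -> Dict[str, str]:
    """Parse 'SOURCE DESTINATION' lines into a dictionary."""
    mapping: Dict[str, str] = {}
    for line in lines:
        parts = line.split(maxsplit=1)
        if len(parts) == 2:
            mapping[parts[0]] = parts[1].strip()
    return mapping


def map_phonemes(phonemes: Iterable[str], mapping: Dict[str, str]) -> List[str]:
    """Translate dictionary phonemes; unmapped phonemes are kept as-is (assumption)."""
    return [mapping.get(phoneme, phoneme) for phoneme in phonemes]


if __name__ == "__main__":
    mapping = read_phoneme_map(["S1 D1", "S2 D2"])
    print(map_phonemes(["S1", "S2", "S3"], mapping))
```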