Command-Line Tools
$ voice2json [--debug] [--profile <PROFILE>] <COMMAND> [<COMMAND_ARG>...]
The <PROFILE> can be:
- A supported language name like en or fr
- The name of a known profile, such as de_kaldi-zamia
- A directory with a profile.yml, a sentences.ini, and other language-specific files
If no <PROFILE> is given, the $XDG_CONFIG_HOME/voice2json directory is used first if it exists. Otherwise, the default U.S. English profile is used.
The following commands are available:
- download-profile - Download missing files for a profile
- train-profile - Generate speech/intent artifacts
- transcribe-wav - Transcribe WAV file to text
- transcribe-stream - Transcribe live audio stream to text
- recognize-intent - Recognize intent from JSON or text
- wait-wake - Listen to live audio stream for wake word
- record-command - Record voice command from live audio stream
- pronounce-word - Look up or guess how a word is pronounced
- speak-sentence - Speak a sentence using text-to-speech
- generate-examples - Generate random intents
- record-examples - Generate and record speech examples
- test-examples - Test recorded speech examples
- show-documentation - Run HTTP server locally with documentation
- print-profile - Print profile settings
- print-downloads - Print profile file download information
- print-files - Print user profile files for backup
- print-version - Print voice2json version and exit
download-profile
Downloads missing language-specific files from GitHub.
$ voice2json --profile en-us_kaldi-zamia download-profile
Output:
Downloaded 10 file(s) /home/user/.local/share/voice2json/en-us_kaldi-zamia
The --profile argument can be one of the supported languages (like en or fr), or one of the known profile names like de_kaldi-zamia.
train-profile
Generates all necessary artifacts in a profile for speech/intent recognition.
$ voice2json train-profile
Output:
Training completed in 0.9538522080001712 second(s)
Settings that control where generated artifacts are saved are in the training
section of your profile.
Slots Directory
If your sentences.ini file contains slot references, voice2json
will look for text files in a directory named slots
in your profile (set training.slots-directory
to change). If you reference $movies
, then slots/movies
should exist with one item per line. When these files change, you should re-train.
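For instance, a hypothetical PlayMovie intent could reference a $movies slot like this (the intent name, tag, and movie titles are placeholders, assuming the slot reference syntax from the template documentation):
$ cat sentences.ini
[PlayMovie]
play ($movies){movie_name}
$ cat slots/movies
primer
moon
chronicle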
Slot Programs
If a slot cannot be found in your training.slots-directory
, then voice2json
will search for a program in training.slot-programs-directory
(slot_programs
by default). If you reference $movies
in your sentences.ini, then slot_programs/movies
should be an executable program that will output values, one per line. These programs are executed every time you re-train.
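As a minimal sketch, a slot program is just an executable that prints one value per line; this hypothetical $movies program derives titles from file names (the media path is made up):
$ cat slot_programs/movies
#!/usr/bin/env bash
# Output one slot value per line; voice2json captures stdout at training time.
for f in /media/movies/*.mp4; do
    basename "${f}" .mp4
done
$ chmod +x slot_programs/movies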
Intent Whitelist
If a file named intent_whitelist
exists in your profile (set training.intent-whitelist
to change), then voice2json
will only consider the intents listed in it (one per line). If this file is missing (the default), then all intents from sentences.ini are considered. When this file changes, you should re-train.
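For example, to train only two of the intents from sentences.ini (the names here are assumed to match yours):
$ cat intent_whitelist
ChangeLightState
GetTime
$ voice2json train-profile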
Language Model Mixing
voice2json
is designed to only recognize the voice commands you specify in sentences.ini. All of the supported speech systems are capable of transcribing open-ended speech, however. But what if you want to recognize somewhat open-ended speech that’s still focused on your voice commands?
In every profile, voice2json
includes a “base” dictionary and language model. The former contains the pronunciations of all possible words. The latter is a large language model trained on a very large corpus of text in the profile’s language (usually books and web pages).
During training, voice2json
can mix the large, open ended language model with the one generated specifically for your voice commands. You specify a mixture weight, which controls how much of an influence the large language model has (see training.base-language-model-weight
). A mixture weight of 0 makes voice2json
sensitive only to your voice commands, which is the default. A mixture weight of 0.05, on the other hand, adds a 5% influence from the large language model.
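Profile settings use dotted paths that map onto nested keys in profile.yml, so a 5% mixture corresponds roughly to the following (a sketch, assuming the dotted name maps directly onto YAML nesting as elsewhere in the profile):
$ cat profile.yml
training:
  base-language-model-weight: 0.05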
To see the effect of language model mixing, consider a simple sentences.ini
file:
[ChangeLightState]
turn (on){state} the living room lamp
This will only allow voice2json
to recognize the voice command “turn on the living room lamp”. If we train voice2json
and transcribe a WAV file with this command, the output is no surprise:
$ time voice2json train-profile
...
real 0m0.688s
$ voice2json transcribe-wav \
turn_on_living_room_lamp.wav | \
jq -r .text
turn on the living room lamp
Now let’s do speech to text on a variation of the command, a WAV file with the speech “would you please turn on the living room lamp”:
$ voice2json transcribe-wav \
would_you_please_turn_on_living_room_lamp.wav | \
jq -r .text
turn the turn on the living room lamp
The word salad here is because we’re trying to recognize a voice command that was not present in sentences.ini
(technically, it’s because n-gram models are kind of dumb). We could always add it to sentences.ini
, of course. There may be cases, however, where we cannot anticipate all of the variations of a voice command. For these cases, you should increase the training.base-language-model-weight
in your profile to something above 0. Let’s set it to 0.05 (5% mixture) and re-train:
$ time voice2json train-profile
...
real 1m3.221s
Note that training took significantly longer (a full minute!) because of the size of the base language model. Now, let’s test our two WAV files again:
$ voice2json transcribe-wav \
turn_on_living_room_lamp.wav | \
jq -r .text
turn on the living room lamp
$ voice2json transcribe-wav \
would_you_please_turn_on_living_room_lamp.wav | \
jq -r .text
would you please turn on the living room lamp
Great! voice2json
was able to transcribe a sentence that it wasn’t explicitly trained on. If you’re trying this at home, you surely noticed that it took a lot longer to process the WAV files too (probably 3-4x longer). In practice, it’s not recommended to do mixed language modeling on lower-end hardware like a Raspberry Pi. If you need more open-ended speech recognition, try turning voice2json
into a network service.
The Elephant in the Room
This isn’t the end of the story for open-ended speech recognition in voice2json
, however. What about intent recognition? When the set of possible voice commands is known ahead of time, it’s relatively easy to know what to do with each and every sentence. The flexibility gained from mixing in a base language model unfortunately places a larger burden on the intent recognizer.
In our ChangeLightState
example above, we’re fortunate that everything still works as expected:
$ voice2json recognize-intent -t \
'would you please turn on the living room lamp' | \
jq .
outputs:
{
"text": "turn on the living room lamp",
"raw_text": "would you please turn on the living room lamp",
"intent": {
"name": "ChangeLightState",
"confidence": 0.5
},
"entities": [
{
"entity": "state",
"value": "on",
"start": 5,
"end": 7
},
{
"entity": "name",
"value": "living room lamp",
"start": 12,
"end": 28
}
],
"tokens": [
"turn",
"on",
"the",
"living",
"room",
"lamp"
],
"slots": {
"state": "on",
"name": "living room lamp"
}
}
This only works because fuzzy recognition is enabled. Notice the text
property? All the “problematic” words have simply been dropped! If you need something more sophisticated, consider training a Rasa NLU bot using generated examples.
transcribe-wav
Transcribes WAV file(s) or raw audio data. Outputs a single line of jsonl for each transcription (format description).
WAV data from stdin
Reads a WAV file from standard in and transcribes it.
$ voice2json transcribe-wav < turn-on-the-light.wav
Output:
{"text": "turn on the light", "transcribe_seconds": 0.123, "wav_seconds": 1.456}
Note: No wav_name
property is provided when WAV data comes from standard in.
Files as arguments
Reads one or more WAV files and transcribes each of them in turn.
$ voice2json transcribe-wav \
turn-on-the-light.wav \
what-time-is-it.wav
Output:
{"text": "turn on the light", "transcribe_seconds": 0.123, "wav_seconds": 1.456, "wav_name": "turn-on-the-light.wav"}
{"text": "what time is it", "transcribe_seconds": 0.123, "wav_seconds": 1.456, "wav_name": "what-time-is-it.wav"}
Files from stdin
Reads one or more WAV file paths from standard in and transcribes each of them in turn. If arguments are also provided, they will be processed first.
$ voice2json transcribe-wav --stdin-files
...
turn-on-the-light.wav
what-time-is-it.wav
<CTRL-D>
Output:
{"text": "turn on the light", "transcribe_seconds": 0.123, "wav_seconds": 1.456, "wav_name": "turn-on-the-light.wav"}
{"text": "what time is it", "transcribe_seconds": 0.123, "wav_seconds": 1.456, "wav_name": "what-time-is-it.wav"}
Open Transcription
When given the --open
argument, transcribe-wav
will ignore your custom voice commands and instead use the large, pre-trained speech model present in your profile. Do this if you want to use voice2json
for general transcription tasks that are not domain specific. Keep in mind, of course, that this is not what voice2json
is optimized for!
If you want the best of both worlds (transcriptions focused on a particular domain, but still able to accommodate general speech), check out language model mixing. This comes at a performance cost, however, in training, loading, and transcription times. Consider using transcribe-wav
as a service to avoid re-loading your mixed speech model.
transcribe-stream
Transcribes voice commands from a live audio stream, automatically detecting speech and silence. Outputs a single line of jsonl for each transcription (format description).
$ voice2json transcribe-stream
{"text": "turn off the living room lamp", "likelihood": 1, "transcribe_seconds": 2.333360348999122, "wav_seconds": 2.407, "tokens": null, "timeout": false}
Like transcribe-wav
, transcribe-stream
accepts a --open
argument for open transcription.
Like wait-wake
, transcribe-stream
also accepts a --exit-count
argument for exiting once a specific number of voice commands have been recorded and transcribed.
Stream Events
If you need to react to the voice command starting and stopping, use --event-sink
to direct events to a file (same events as record-command
). With process substitution, you can easily publish these events to MQTT:
$ voice2json transcribe-stream \
--event-sink >(mosquitto_pub -l -t stream/events/topic)
The -l
argument to mosquitto_pub
will cause it to read lines from standard in and send them as separate messages. Catching these events in Node-RED is straightforward with an MQTT input node subscribed to the same topic.
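To check that the events are arriving, you can subscribe to the same (hypothetical) topic from another terminal; the lines you see are the same events described under record-command below:
$ mosquitto_sub -t stream/events/topic
{"event": "speech", "time_seconds": 0.24}
{"event": "started", "time_seconds": 0.54}
...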
Saving Voice Commands
Using the --wav-sink
argument, you can save voice commands to WAV file(s) as they’re spoken. If the argument to --wav-sink
is an existing directory, each voice command will be written to that directory with a name formatted according to --wav-filename
(.wav
is automatically appended):
$ voice2json transcribe-stream \
--wav-sink /path/to/existing/directory/ \
--wav-filename '%Y%m%d-%H%M%S'
recognize-intent
Recognizes an intent and slots from JSON or plaintext. Outputs a single line of jsonl for each input line (format description).
Inputs can be provided either as arguments or lines via standard in.
JSON input
Input is a single line of jsonl per sentence, minimally with a text
property (like the output of transcribe-wav).
$ voice2json recognize-intent '{ "text": "turn on the light" }'
Output:
{"text": "turn on the living room lamp", "intent": {"name": "LightState", "confidence": 1.0}, "entities": [{"entity": "state", "value": "on"}], "slots": {"state": "on"}, "recognize_seconds": 0.001}
Plaintext input
Input is a single line of plaintext per sentence.
$ voice2json recognize-intent --text-input 'turn on the light' 'turn off the light'
Output:
{"text": "turn on the living room lamp", "intent": {"name": "LightState", "confidence": 1.0}, "entities": [{"entity": "state", "value": "on"}], "slots": {"state": "on"}, "recognize_seconds": 0.001}
{"text": "turn off the living room lamp", "intent": {"name": "LightState", "confidence": 1.0}, "entities": [{"entity": "state", "value": "off"}], "slots": {"state": "off"}, "recognize_seconds": 0.001}
Number Replacement
For most profile languages, voice2json
supports replacing numbers in the input text (e.g., “75”) with words (“seventy five”). You can enable this for sentences given to recognize-intent
by adding the --replace-numbers
argument:
$ voice2json recognize-intent --replace-numbers --text-input 'set the temperature to 75'
For English, this will perform intent recognition on the sentence “set the temperature to seventy five”. See the number replacement section in the template language documentation for how to do this automatically in your sentences.ini.
Intent Filter
You can filter which intents are eligible for recognition using --intent-filter
:
$ voice2json recognize-intent \
--text 'some text to recognize' \
--intent-filter 'Intent1' 'Intent2'
Only the intent names provided will be checked. Intent names are case sensitive, and should match your sentences.ini
file.
wait-wake
Listens to a live audio stream for a wake word using Mycroft Precise (default phrase is “hey mycroft”). Outputs a single line of jsonl each time the wake word is detected.
$ voice2json wait-wake
Once the wake word is spoken, voice2json
will output:
{ "keyword": "/path/to/model_file.pb", "detect_seconds": 1.2345, "detect_timestamp": 1234567890 }
where keyword
is the path to the detected keyword file and detect_seconds
is the time of detection relative to when voice2json
was started.
The detect_timestamp field is computed with time.time().
Custom Wake Word
You can train your own wake word or use one of the pre-trained model files from Mycroft AI.
Exit Count
Providing a --exit-count <N>
argument to wait-wake
tells voice2json
to automatically exit after the wake word has been detected N
times. This is useful when you want to use wait-wake
in a shell script.
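For example, a minimal wake-then-listen loop (just a sketch combining commands from this page):
#!/usr/bin/env bash
# Wait for the wake word once, then record and transcribe a single voice command.
while true; do
    voice2json wait-wake --exit-count 1 > /dev/null
    voice2json record-command | voice2json transcribe-wav | jq -r .text
done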
Audio Sources
By default, the wait-wake, record-command, and record-examples commands execute the program defined in the audio.record-command
section of your profile to record audio. You can customize/change this program or provide a different source of audio data with the --audio-source
argument, which expects a file path or “-“ for standard in. Through process substitution or Unix pipes, this can be used to receive microphone audio streamed over a network.
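As an illustrative sketch (the remote host is hypothetical, and the raw 16 kHz 16-bit mono format is an assumption; check your profile’s audio settings), a remote microphone can be streamed in over SSH with process substitution:
$ voice2json transcribe-stream \
    --audio-source <(ssh pi@remote-host arecord -q -r 16000 -c 1 -f S16_LE -t raw)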
record-command
Records from a live audio stream until a voice command has been spoken. Outputs WAV audio data containing just the voice command.
$ voice2json record-command > my-voice-command.wav
record-command
uses the webrtcvad library to detect live speech. Once speech has been detected, voice2json
begins recording until there is silence. If speech goes on too long, a timeout is reached and recording stops. The profile settings under the voice-command
section control exactly how many seconds of speech and silence are needed to segment live audio.
See audio sources for a description of how record-command
gets audio input.
Redirecting WAV Output
The --wav-sink
argument lets you change where record-command
, pronounce-word
, and speak-sentence
write their output WAV data. When this is set to something other than “-“ (standard out), record-command
will output lines of JSON to standard out that describe events in the live speech.
$ voice2json record-command \
--audio-source <(sox turn-on-the-living-room-lamp.wav -t raw -) \
--wav-sink /dev/null
will output something like:
{"event": "speech", "time_seconds": 0.24}
{"event": "started", "time_seconds": 0.54}
{"event": "silence", "time_seconds": 4.5}
{"event": "speech", "time_seconds": 4.619999999999999}
{"event": "silence", "time_seconds": 4.799999999999998}
{"event": "stopped", "time_seconds": 5.279999999999995}
where event
is either “speech”, “started”, “silence”, “stopped”, or “timeout”. The “started” and “stopped” events refer to the start/stop of the detected voice command. The time_seconds
property is the time of the event relative to the start of the WAV file (time 0).
pronounce-word
Uses eSpeak or MaryTTS to pronounce words the same way that the speech recognizer is expecting to hear them. This depends on manually created phoneme maps in each profile.
Words can be provided either as arguments or lines via standard in. You can also save output to a WAV file.
Assuming you’re using the en-us_pocketsphinx-cmu profile:
$ voice2json pronounce-word hello
Output:
hello HH AH L OW
hello HH EH L OW
In addition to text output, you should have heard both pronunciations of “hello”. These came from the base_dictionary.txt
included in the profile.
If you pass a --marytts
argument, voice2json
will try to contact a MaryTTS server running locally on port 59125. This can be changed using the marytts.process-url
in your profile.
Unknown Words
The same pronounce-word
command works for words that are probably not in your phonetic dictionary:
$ voice2json pronounce-word raxacoricofallipatorius
Output:
raxacoricofallipatorius R AE K S AH K AO R IH K AO F AE L AH P AH T AO R IY IH S
raxacoricofallipatorius R AE K S AH K AO R IY K OW F AE L AH P AH T AO R IY IH S
raxacoricofallipatorius R AE K S AH K AO R AH K OW F AE L AH P AH T AO R IY IH S
raxacoricofallipatorius R AE K S AH K AA R IH K AO F AE L AH P AH T AO R IY IH S
raxacoricofallipatorius R AE K S AH K AO R IH K OW F AE L AH P AH T AO R IY IH S
This produced 5 pronunciation guesses using phonetisaurus and the grapheme-to-phoneme model provided with the profile (g2p.fst
).
If you want to hear a specific pronunciation, just provide it with the word:
$ voice2json pronounce-word 'moogle M UW G AH L'
You can save these pronunciations in the custom_words.txt
file in your profile. Make sure to re-train.
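For example, the pronunciation above goes in custom_words.txt using the same word-then-phonemes format as base_dictionary.txt:
$ cat custom_words.txt
moogle M UW G AH L
$ voice2json train-profile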
Sounds Like Pronunciations
By passing the --sounds-like
flag to pronounce-word
, you can provide word pronunciations using other words or word segments.
speak-sentence
Speaks a full sentence using either eSpeak or MaryTTS (make sure to set text-to-speech.marytts.voice
in your profile).
Sentences can be provided either as arguments or lines via standard in. You can also save output to a WAV file.
$ voice2json speak-sentence 'hello world!'
If you pass a --marytts
argument, voice2json
will try to contact a MaryTTS server running locally on port 59125. This can be changed using the marytts.process-url
in your profile.
MaryTTS User Dictionaries
If you’ve added custom words to your profile (in custom_words.txt
), voice2json
will try to use your profile’s MaryTTS phoneme map to generate a user dictionary so words will be pronounced as you’ve specified. If you don’t want this, set text-to-speech.marytts.dictionary-file
to the empty string (""
) in your profile.
generate-examples
Generates random intents and slots from your profile. Outputs a single line of jsonl for each intent line (format description).
$ voice2json generate-examples --number 1 | jq .
Output (formatted with jq):
{
"text": "turn on the light",
"intent": {
"name": "LightState",
"confidence": 1
},
"entities": [
{
"entity": "state",
"value": "on",
"start": 5,
"end": 7
}
],
"slots": {
"state": "off"
},
"tokens": [
"turn",
"on",
"the",
"light"
]
}
IOB Format
If the --iob
argument is given, generate-examples
will output examples in an inside-outside-beginning format with 3 tab-separated sections:
- The words themselves, surrounded by BS (begin sentence) and ES tokens
- The tag for each word, one of O (outside), B-<NAME> (begin NAME), or I-<NAME> (inside NAME)
- The intent name
$ voice2json generate-examples --number 1 --iob
Output:
BS turn off the light ES<TAB>O O B-state O O O<TAB>LightState
Other Formats
See the Rasa NLU bot recipe for an example of transforming voice2json
examples into Rasa NLU’s Markdown training data format.
record-examples
Generates random example sentences from sentences.ini and prompts you to record them. Saves WAV files, transcriptions, and expected intents (as JSON events) to a directory.
$ voice2json record-examples --directory /path/to/examples/
You will be prompted with a random sentence. Once you press ENTER, voice2json
will begin recording. When you press ENTER again, the recorded audio will be saved to a WAV file in the provided --directory
(default is the current directory). When you’re finished recording examples, press CTRL+C to exit.
A directory of recorded examples can be used for performance testing.
test-examples
Transcribes and performs intent recognition on all WAV files in a directory (usually recorded with record-examples). Outputs a JSON report with speech/intent recognition details and accuracy statistics (including word error rate).
$ voice2json test-examples --directory /path/to/examples/
outputs something like:
{
"num_wavs": 1,
"num_words": 0,
"num_entities": 0,
"correct_transcriptions": 0,
"correct_intent_names": 0,
"correct_words": 0,
"correct_entities": 0,
"transcription_accuracy": 0.123,
"intent_accuracy": 0,
"entity_accuracy": 0,
"intent_entity_accuracy": 0,
"average_transcription_speedup": 1.0
"actual": {
"example-1.wav": {
...
"word_error": {
"reference": ["..."],
"hypothesis": ["..."],
"words": 0,
"matches": 0,
"errors": 0
}
}
},
"expected": {
"example-1.wav": {
...
}
}
where actual
provides details of the transcription/intent recognition of the examples, and expected
is simply pulled from the provided transcription/intent files. The remaining properties are statistics that describe the overall accuracy of the examples relative to expectations.
Report Format
The statistics of the report contain:
- num_wavs - total number of WAV files that were tested (number)
- num_words - total number of expected words across all test WAVs (number)
- num_entities - total number of distinct entity/value pairs across all test WAVs (number)
- correct_transcriptions - number of WAV files whose actual transcriptions exactly matched expectations (number)
- correct_intent_names - number of WAV files whose actual intent exactly matched expectations (number)
- correct_entities - number of entity/value pairs that exactly matched expectations if and only if the actual intent matched too (number)
- transcription_accuracy - correct words / num words (number, 1 = perfect)
- intent_entity_accuracy - correct intents + entities / num wavs (number, 1 = perfect)
- intent_accuracy - correct intents / num wavs (number, 1 = perfect)
- entity_accuracy - correct entities / num entities (number, 1 = perfect)
- average_transcription_speedup - average WAV duration / transcription time (number, higher is faster than real-time)
The actual section of the report contains the recognized intent of each WAV file as well as a word_error section with:
- reference - words from expected transcription
- hypothesis - words from actual transcription
- words - number of expected words (number)
- matches - number of correct words (number)
- errors - number of incorrect words (number)
The expected
section is just the intent or transcription recorded in the examples directory alongside each WAV file. For example, a WAV file named example-1.wav
should ideally have an example-1.json
file with an expected intent. Failing that, an example-1.txt
file with the transcription must be present.
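A sketch of what those companion files might contain (the intent JSON follows the same format as recognize-intent output; the values are illustrative):
$ cat example-1.txt
turn on the light
$ cat example-1.json
{"text": "turn on the light", "intent": {"name": "LightState"}, "entities": [{"entity": "state", "value": "on"}], "slots": {"state": "on"}}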
show-documentation
Runs a local HTTP server with this documentation. The default port is 8000, which can be changed with --port
:
$ voice2json show-documentation --port 8000
The documentation should now be accessible at http://localhost:8000
If you’re running voice2json inside Docker, make sure you use -p to expose the correct port via docker run.
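For example, assuming the synesthesiam/voice2json image from the install documentation (your run command may include additional volume arguments):
$ docker run -i -t -p 8000:8000 synesthesiam/voice2json \
    show-documentation --port 8000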
print-downloads
Prints download information for profile files. Outputs a single line of jsonl for each file. Can be used to download missing profile files and verify them.
$ voice2json print-downloads [OPTIONS] <PROFILE> [<PROFILE>] ...
Downloading a New Profile
The most common use case for print-downloads
is to download the required files for a specific profile. Rather than downloading the full .tar.gz
for a profile (hundreds of MB at least), you can exclude files you don’t need using these options:
- --no-mixed-language-model - Exclude files needed for language model mixing
- --no-open-transcription - Exclude files needed for open transcription
- --no-grapheme-to-phoneme - Exclude files needed for guessing word pronunciations
- --no-text-to-speech - Exclude files needed for text to speech
The --with-examples
option includes example sentences.ini
and custom_words.txt
files in the download list.
If you only plan to use voice2json
for custom voice commands, the following command will print the required files (with examples) for the U.S. English Pocketsphinx profile:
$ voice2json print-downloads \
--no-mixed-language-model \
--no-open-transcription \
--with-examples \
en-us_pocketsphinx-cmu
{"bytes": 1537, "sha256": "49181202f2b991d25f6cac8cd1705994494b9600d4311794ecbb9fcf8b188aef", "file": "LICENSE", "profile": "en-us_pocketsphinx-cmu", "url": "https://github.com/synesthesiam/en-us_pocketsphinx-cmu/raw/master/LICENSE", "profile-directory": "/home/hansenm/.config/voice2json"}
...
Using the information provided in each line, a small Bash script can be used to actually download the files (requires curl
and jq
):
$ voice2json print-downloads \
--no-mixed-language-model \
--no-open-transcription \
--with-examples \
en-us_pocketsphinx-cmu | \
while read -r json; do
# Source URL
url="$(echo "${json}" | jq --raw-output .url)"
# Destination directory and file path
profile_dir="$(echo "${json}" | jq --raw-output '.["profile-directory"]')"
dest_file="$(echo "${json}" | jq --raw-output .file)"
dest_file="${profile_dir}/${dest_file}"
# Directory of destination file
dest_dir="$(dirname "${dest_file}")"
echo "${url} => ${dest_file}"
# Create destination directory and download file
mkdir -p "${dest_dir}"
curl -sSfL -o "${dest_file}" "${url}"
done
This will download about half as many bytes as are needed for the complete .tar.gz.
Note: the script above overwrites any existing files and does not verify the download sizes/SHA256 sums.
Use the --only-missing
flag to only print download information for profile files that do not already exist.
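Along the same lines as the download script above, a small sketch can verify the files afterwards by comparing the reported sha256 values against sha256sum:
$ voice2json print-downloads en-us_pocketsphinx-cmu | \
    while read -r json; do
        # Reconstruct the destination path from the download information
        file="$(echo "${json}" | jq --raw-output .file)"
        profile_dir="$(echo "${json}" | jq --raw-output '.["profile-directory"]')"
        expected="$(echo "${json}" | jq --raw-output .sha256)"
        # Compare the expected SHA256 sum against the actual file on disk
        actual="$(sha256sum "${profile_dir}/${file}" | cut -d' ' -f1)"
        if [ "${expected}" != "${actual}" ]; then
            echo "MISMATCH: ${profile_dir}/${file}"
        fi
    done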
Download Format
Each line from print-downloads
is a JSON object with the following fields:
- bytes - expected size of the file in bytes (number)
- sha256 - expected SHA256 sum of the file (string)
- url - URL to download the file (string)
- file - path of the file relative to the profile directory (string)
- profile - name of the profile (string)
- profile-directory - directory of the profile (string)
print-files
Prints absolute paths to user-created files in your profile that should be backed up.
You can back up to a .tar.gz
like this:
$ voice2json print-files | tar -czf /path/to/profile_backup.tar.gz -T -
Includes:
- Training sentences (sentences.ini) and slot files (slots/)
- Custom word pronunciations (custom_words.txt, sounds_like.txt)
- Slot programs (slot_programs/)
- Converters (converters/)
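To restore such a backup later, note that GNU tar strips the leading / when creating the archive, so extracting relative to / puts the files back at their original absolute paths:
$ tar -xzf /path/to/profile_backup.tar.gz -C /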
print-profile
Prints all profile settings as JSON to the console. This is a combination of the default settings and what’s provided in profile.yml.
$ voice2json print-profile | jq .
Output:
{
"language": {
"name": "english",
"code": "en-us"
},
"speech-to-text": {
...
},
"intent-recognition": {
...
},
"training": {
...
},
"wake-word": {
...
},
"voice-command": {
...
},
"text-to-speech": {
...
},
"audio": {
...
},
...
}