Skip to the content.

Home • Recipes

voice2json Recipes

Below are small demonstrations of how to use voice2json for a specific problem or as part of a larger system.

Picard’s Tea

This is a simple, fun example to recognize orders for tea from folks like Jean-Luc Picard. It accepts the infamous “tea, earl grey, hot” order as well as a few others, such as “tea, green, lukewarm”.

type = (earl grey) | green | black
temperature = hot | lukewarm | cold
tea (<type>){type} (<temperature>){temperature}

Put this in your sentences.ini and re-train:

voice2json train-profile

Record, transcribe, and recognize a single voice command with:

$ voice2json record-command \
    | voice2json transcribe-wav \
    | voice2json recognize-intent

Saying “tea, earl grey, hot” will output something like:

  "text": "tea earl grey hot",
  "intent": {
    "name": "MakeTea",
  "slots": {
    "type": "earl grey",
    "temperature": "hot"

The strength of voice2json is its ability to be quickly customized for how you expect a command to be given.

Create an MQTT Transcription Service

voice2json is designed to work well in Unix-style workflows. Many of the commands consume and produce jsonl, which makes them interoperable with other line-oriented command-line tools.

The mosquitto_pub and mosquitto_sub commands included in the mosquitto-clients package enable other programs to send and receive messages over the MQTT protocol. These programs can then easily participate in an IoT system, such as a Node-RED flow.

For this recipe, start by installing the MQTT client commands:

$ sudo apt-get install mosquitto-clients

We can create a simple transcription service using the transcribe-wav voice2json command. This service will receive file paths on a transcription-request topic, and send the text transcription out on a transcription-response topic.

$ mosquitto_sub -t 'transcription-request' | \
      voice2json transcribe-wav --stdin-files | \
      while read -r json; \
          do echo "$json" | jq --raw-output .text; \
      done | \
      mosquitto_pub -l -t 'transcription-response'

We use the --stdin-files argument of transcribe-wav to make it read file paths on standard in and emit a single line of JSON for each transcription. The excellent jq tool is used to extract the text of the transcription (--raw-otuput emits the value without quotes).

With the service running, open a separate terminal and subscribe to the transcription-response topic:

$ mosquitto_sub -t 'transcription-response'

Finally, in yet another terminal, send a transcription-request with a WAV file path on your system:

$ mosquitto_pub -t 'transcription-request' \
    -m '/path/to/turn-on-the-light.wav'

In the terminal subscribed to transcription-response messages, you should see the text transcription printed:

turn on the light

Launch a Program via Voice

Let’s use voice2json to launch programs using voice commands. This will follow a typical voice assistant flow, meaning we will:

  1. Wait for a wake word to be spoken
  2. Record the voice command
  3. Recognize and handle the intent

The script realizes these steps in a bash while loop using the wait-wake and record-command voice2json commands to do steps 1 and 2. For step 3, transcribe-wav and recognize-intent are used.

Our sentences.ini file is very simple:

(start | run | launch) ($program){program}

We keep the list of supported programs in a slots/program file:

web: browser:firefox
file: browser:nemo
text: editor:xed

Note the use of substitutions to map spoken program names (e.g., “web browser”, “mail”) to actual binary names (firefox, thunderbird). A few words were added to custom_words.txt using pronounce-word to guess their pronunciations.

After following the installation instructions, we can execute the script. After saying the wake word (“hey mycroft” by default), you should be able to say “run firefox” and have it launch a Firefox window.

Set and Run Timers

A common task for voice assistants is to set timers. Here, we demonstrate a “simple” timer that supports a single timer that’s less than 10 hours in one second increments:

hour_expr = (1..9){hours} [and (a half){minutes:30!int}] (hour | hours)
minute_expr = (1..59){minutes} [and (a half){seconds:30!int}] (minute | minutes)
second_expr = (1..59){seconds} (second | seconds)

time_expr = ((<hour_expr> [[and] <minute_expr>] [[and] <second_expr>]) | (<minute_expr> [[and] <second_expr>]) | <second_expr>)

set [a] timer for <time_expr>

There are over 8 million possible sentences here, such as “set a timer for two hours and ten and a half minutes”. This template makes use of number ranges and converters to relieve the burden on the intent handler. Because hours, minutes, and seconds are kept in separate slots, these numbers can simply be converted to seconds and summed:

#!/usr/bin/env python3
import sys
import json
import time

for line in sys.stdin:
    intent = json.loads(line)

    # Extract time integers
    hours = int(intent["slots"].get("hours", 0))
    minutes = int(intent["slots"].get("minutes", 0))
    seconds = int(intent["slots"].get("seconds", 0))

    # Compute total number of seconds to wait
    total_seconds = (hours * 60 * 60) + (minutes * 60) + seconds

    # Wait
    print(f"Waiting for {total_seconds} second(s)")

After following the installation instructions, execute the script. It will wait for a “wake up” MQTT message on the timer/wake-up topic. If you’d like to use a wake word instead, see the launch program example.

When the wake up message is received, you can say something like “set a timer for five seconds”. After an acknowledgment beep, the example will wait the appropriate amount of time and then play an alarm sound (three short beeps). A response MQTT message is also published on the timer/alarm topic after the timer has finished, allowing a Node-RED or other IoT software to respond.

Parallel WAV Recognition

Want to recognize intents in a large number of WAV files as fast as possible? You can use the GNU Parallel utility with voice2json to put those extra CPU cores to good use!

$ find /path/to/wav/files/ -name '*.wav' | \
      tee wav-file-names.txt | \
      parallel -k --pipe -n 10 \
         'voice2json transcribe-wav --stdin-files | voice2json recognize-intent'

This will run up to 10 copies of voice2json in parallel and output a line of JSON per WAV file in the same order as they were printed by the find command. For convenience, the file names are saved to a text file named wav-file-names.txt.

If you want to check voice2json’s performance on a directory of WAV files and transcriptions, see the test-examples command.

Train a Rasa NLU Bot

Intent recognition in voice2json is not very flexible. Similar words and phrasings cannot be substituted, and there is little room for error. If your voice command system will also be accessible via chat, you may want to use a proper natural language understanding system.

voice2json can generate training examples for machine learning systems like Rasa NLU. In this recipe, we use the default English sentences:

what time is it
tell me the time

whats the temperature
how (hot | cold) is it

is the garage door (open | closed)

light_name = ((living room lamp | garage light) {name}) | <ChangeLightColor.light_name>
light_state = (on | off) {state}

turn <light_state> [the] <light_name>
turn [the] <light_name> <light_state>

light_name = (bedroom light) {name}
color = (red | green | blue) {color}

set [the] <light_name> [to] <color>
make [the] <light_name> <color>

For ease of installation, create a rasa script that calls out to the official Rasa Docker image:

#!/usr/bin/env bash
docker run -it -v "$(pwd):/app" -p 5005:5005 rasa/rasa:latest-spacy-en "$@"

Note that the current directory is mounted and port 5005 is exposed. Next, we create a config.yml:

language: "en"

pipeline: "pretrained_embeddings_spacy"

We use the generate-examples command to randomly generate up to 5,000 example intents with slots. Beware that no attempt is made in this toy example to balance classes.

$ mkdir -p data && \
    voice2json train-profile && \
    voice2json generate-examples -n 5000 | \
      python3 > data/

Next, we train a model. This can take a few minutes, depending on your hardware:

$ mkdir -p models && \
    ./rasa train nlu

Once your model is trained, you can run a test shell:

$ mkdir -p models && \
      ./rasa shell nlu

Try typing in sentences and checking the output.

Intent HTTP Server

If you want to recognize intents remotely, you should use Rasa’s HTTP Server.

$ ./rasa run -m models --enable-api

With that running, you can POST some JSON to port 5005 in a different terminal and get a JSON response:

$ curl -X POST -d '{ "text": "turn on the living room lamp" }' localhost:5005/model/parse

You can easily combine this with voice2json to do transcription + intent recognition:

$ voice2json transcribe-wav \
     ../../etc/test/turn_on_living_room_lamp.wav | \
  curl -X POST -d @- localhost:5005/model/parse


  "intent": {
    "name": "ChangeLightState",
    "confidence": 0.9986116877633666
  "entities": [
      "start": 5,
      "end": 7,
      "value": "on",
      "entity": "state",
      "confidence": 0.9990940955808785,
      "extractor": "CRFEntityExtractor"
      "start": 12,
      "end": 28,
      "value": "living room lamp",
      "entity": "name",
      "confidence": 0.9989133507400977,
      "extractor": "CRFEntityExtractor"
  "intent_ranking": [
      "name": "ChangeLightState",
      "confidence": 0.9986116877633666
      "name": "GetGarageState",
      "confidence": 0.0005631913057901469
      "name": "GetTemperature",
      "confidence": 0.0005114253499747637
      "name": "GetTime",
      "confidence": 0.00030957597780200693
      "name": "ChangeLightColor",
      "confidence": 4.119603066327378e-06
  "text": "turn on the living room lamp"

Try something that voice2json would choke on, like “please turn off the light in the living room”:

  "intent": {
    "name": "ChangeLightState",
    "confidence": 0.9504002047698142
  "entities": [
      "start": 12,
      "end": 15,
      "value": "off",
      "entity": "state",
      "confidence": 0.9991999541256443,
      "extractor": "CRFEntityExtractor"
      "start": 33,
      "end": 44,
      "value": "living room",
      "entity": "name",
      "confidence": 0.0,
      "extractor": "CRFEntityExtractor"
  "intent_ranking": [
      "name": "ChangeLightState",
      "confidence": 0.9504002047698142
      "name": "GetTemperature",
      "confidence": 0.016191147989239697
      "name": "ChangeLightColor",
      "confidence": 0.014916606955255965
      "name": "GetTime",
      "confidence": 0.014345667003407515
      "name": "GetGarageState",
      "confidence": 0.004146373282282381
  "text": "please turn off the light in the living room"

Happy recognizing!

Stream Microphone Audio Over a Network

Using the gst-launch command from GStreamer, you can stream raw audio data from your microphone to another machine over a UDP socket:

gst-launch-1.0 \
    pulsesrc ! \
    audioconvert ! \
    audioresample ! \
    audio/x-raw, rate=16000, channels=1, format=S16LE ! \
    udpsink host=<Destination IP> port=<Destination Port>

where <Destination IP> is the IP address of the machine with voice2json and <Destination Port> is a free port on that machine.

On the destination machine, run:

$ gst-launch-1.0 \
     udpsrc port=<Destination Port> ! \
     rawaudioparse use-sink-caps=false format=pcm pcm-format=s16le sample-rate=16000 num-channels=1 ! \
     queue ! \
     audioconvert ! \
     audioresample ! \
     filesink location=/dev/stdout | \
  voice2json <Command> --audio-source -

where <Destination IP> matches the first command and <Command> is wait-wake, record-command, or record-examples.

See the GStreamer multiudpsink plugin for streaming to multiple machines simultaneously (it also has multicast support).

Fluent AI Dataset

The good folks at Fluent AI have a speech command dataset available for community use. The training set includes over 23,000 spoken examples, and the test set has about 3,800 commands. Each command has at most three attributes: action, object, and location; for example: “turn on (action) the lights (object) in the kitchen (location)”. The object and location may be omitted in certain commands, but the action (intent) is always present.

Using ~100 lines in sentences.ini (excluding comments), I’m able to get 98.7% accuracy on the test set, which is as accurate as the end-to-end system trained in’s published paper! While the sentences voice2json was trained with had to be hand-tuned to fit the test set, it also did not require any audio training data.

If you’d like to reproduce my results, follow the installation instructions and double-check my work :)