
About voice2json

voice2json was created and is currently maintained by Michael Hansen.



Version Changes

1.0 to 2.0

Added

Changed

Removed


License

voice2json itself is licensed under the MIT license, so feel free to do what you want with the code.

Please see the individual licenses for the supporting tools and language-specific profiles as well.

MIT License

Copyright 2019-2020 Michael Hansen (hansen.mike@gmail.com)

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the “Software”), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED “AS IS”, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.


Supporting Tools

The following tools/libraries help to support voice2json:


History

Most of the enabling technologies in voice2json were developed and explored in the Rhasspy voice assistant.


Rhasspy was originally inspired by Jasper, an “open source platform for developing always-on, voice-controlled applications”. Rhasspy’s original architecture (v1) was close to Jasper’s, though the two systems handled speech/intent recognition in very different ways.

Jasper

Jasper runs on the Raspberry Pi and is extensible through custom Python modules. It’s also highly configurable, featuring multiple speech recognition engines, text to speech systems, and integration with online services (Facebook, Spotify, etc.).

Speech recognition in Jasper is done using pocketsphinx, specifically with its keyword search mode. User modules declare a list of WORDS that Jasper should listen for. At runtime, Jasper listens for the union of all modules’ WORDS, and transcriptions are passed to each module’s isValid function in PRIORITY order. When one returns True, Jasper calls that module’s handle function to perform its intended action(s).

# ---------------------
# Example Jasper Module
# ---------------------

import re

# Orders modules in case of a conflict
PRIORITY = 1

# Bag of words for keyword search
WORDS = ["MEANING", "OF", "LIFE"]

# Return True if the transcription is valid for this module
def isValid(text):
    return bool(re.search(r"\bmeaning of life\b", text, re.IGNORECASE))

# Handle the transcription
def handle(text, mic, profile):
    mic.say("It's 42")

Rhasspy v1

The first version of Rhasspy (originally named wraspy) followed Jasper in its use of pocketsphinx for speech recognition, but with an ARPA language model instead of just keywords. Rhasspy user modules (similar in spirit to Jasper’s) provided a set of training sentences that were compiled into a statistical model using cmuclmtk, a language modeling toolkit.

Inspired by the Markdown-like language used in the rasaNLU training data format, sentences were annotated with extra information to aid in post-processing. For example, the sentence turn on the living room lamp might be annotated as turn [on](state) the [living room lamp](name). While pocketsphinx would recognize the bare sentence, the Rhasspy user module would receive a pre-processed intent with state and name slots. This greatly simplified the user module handling code, since the intent’s slots could be used directly instead of requiring upfront text processing in each and every module.

Example part of a user module:

def get_training_phrases(self):
    """Return a list of annotated training phrases."""
    phrases = []

    # Create an open/closed question for each door
    for door_name in self.doors:
        for state in ["open", "closed"]:
            phrases.append(
                "is the [{0}](location) door [{1}](state)?".format(door_name, state)
            )

    return phrases
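
The annotation format itself is simple enough to illustrate in a few lines of code. Below is a minimal sketch (not Rhasspy’s actual parser) of how an annotated phrase can be split into bare text plus slots; the parse_annotated helper is hypothetical:

import re

# Matches Markdown-style annotations like [on](state)
ANNOTATION = re.compile(r"\[([^\]]+)\]\(([^)]+)\)")

def parse_annotated(sentence):
    """Split an annotated training phrase into bare text and a slot dict (illustrative only)."""
    slots = {tag: value for value, tag in ANNOTATION.findall(sentence)}
    text = ANNOTATION.sub(r"\1", sentence)
    return text, slots

text, slots = parse_annotated("turn [on](state) the [living room lamp](name)")
print(text)   # turn on the living room lamp
print(slots)  # {'state': 'on', 'name': 'living room lamp'}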

Some limitations of this approach became apparent with use, however.

Rhasspy v1 used many of the same sub-systems as Jasper, such as pocketsphinx for wake word detection, phonetisaurus for guessing unknown word pronunciations, and MaryTTS for text to speech.

Rhasspy v1.5

To address the limitations of v1, a version of Rhasspy was developed as a set of custom components in Home Assistant, an open source IoT framework for home automation. In this version of Rhasspy (dubbed v1.5 here), users ran Rhasspy as part of their Home Assistant processes, controlling lights, etc. with voice commands.

In contrast to v1, there were no user modules in v1.5. Annotated training sentences were provided in a single Markdown file, and intents were handled directly with the built-in scripting capability of Home Assistant automations. This allowed non-programmers to extend Rhasspy, and dramatically increased the reach of intent handling beyond simple Python functions.

Example training sentences:

## intent:ChangeLightState
- turn [on](state) the [living room lamp](name)
- turn [off](state) the [living room lamp](name)

Support for new sub-systems was added in v1.5, specifically the snowboy and Mycroft Precise wake word systems. Some additional capabilities were also introduced, such as the ability to “mix” the language model generated from training sentences with a larger, pre-trained language model (usually generated from books, newspapers, etc.).
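
Conceptually, this mixing is a weighted interpolation between two language models. The sketch below illustrates the idea on plain word-probability tables; real mixing operates on ARPA n-gram files, and the mix_probabilities helper and the 0.95 weight are hypothetical:

def mix_probabilities(custom_probs, base_probs, lam=0.95):
    """Linearly interpolate two word-probability tables (illustrative only).

    custom_probs: probabilities estimated from the user's training sentences
    base_probs:   probabilities from a large pre-trained model
    lam:          weight given to the custom model (hypothetical value)
    """
    words = set(custom_probs) | set(base_probs)
    return {
        w: lam * custom_probs.get(w, 0.0) + (1.0 - lam) * base_probs.get(w, 0.0)
        for w in words
    }

mixed = mix_probabilities({"lamp": 0.4, "turn": 0.3}, {"lamp": 0.001, "weather": 0.002})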

While it was met with some interest from the Home Assistant community, Rhasspy v1.5 could not be used in Hass.io, a Docker-based Home Assistant virtual appliance. Additionally, Snips.ai already had a great deal of momentum in this space (offline, Raspberry Pi IoT) for English and French users. With this in mind, Rhasspy pivoted to working as a Hass.io add-on with a greater focus on non-English speakers.

Rhasspy v2

Version 2 of Rhasspy was re-written from scratch using an actor model, where each actor runs in a separate thread and passes messages to other actors. All sub-systems were represented as stateful actors, handling messages differently depending on their current states. A central Dialogue Manager actor was responsible for creating/configuring sub-actors according to a user’s profile, and for responding to requests from the user via a web interface.

(Diagram: Rhasspy v2 architecture)

Messages between actors included audio data, requests/responses, errors, and internal state information. Every behavior in Rhasspy v2 was accomplished with the coordination of several actors.
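
As a rough illustration of this pattern (not Rhasspy’s actual actor classes), each actor can be modeled as a thread that drains its own message queue; the Actor and Transcriber names below are hypothetical:

import queue
import threading

class Actor(threading.Thread):
    """Minimal actor: a thread that processes messages from its own queue."""

    def __init__(self):
        super().__init__(daemon=True)
        self.inbox = queue.Queue()

    def send(self, message):
        self.inbox.put(message)

    def run(self):
        while True:
            message = self.inbox.get()
            if message is None:  # poison pill stops the actor
                break
            self.on_message(message)

    def on_message(self, message):
        raise NotImplementedError

class Transcriber(Actor):
    """Hypothetical speech-to-text actor that replies to the requesting actor."""

    def on_message(self, message):
        sender, audio = message
        sender.send(("transcription", f"<text for {len(audio)} bytes of audio>"))

In Rhasspy v2, the Dialogue Manager played the coordinating role, creating actors like these and routing messages between them.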

The notion of a profile, borrowed originally from Jasper, was extended in v2 to allow for different languages. As of August 2019, Rhasspy v2 supported 13 languages (with varying degrees of success). Compatible pocketsphinx models were available for many of the desired languages, but it was eventually necessary to add support for Kaldi, a speech recognition toolkit from Johns Hopkins. With the Kaldi acoustic models released for the Montreal Forced Aligner, Rhasspy gained access to many languages that are not commonly supported.

A major change from v1.5 was the introduction of sentences.ini, a new format for specifying training sentences. This format uses simplified JSGF grammars to concisely describe sentences with optional words, alternative clauses, and re-usable rules. These sentence templates are grouped into ini-style blocks, each of which represents an intent.

[ChangeLightState]
states = (on | off)
turn (<states>){state} the (living room lamp){name}

During the training process, Rhasspy v2 generated all possible annotated sentences from sentences.ini, and used them to train both a speech and intent recognizer. Transcriptions from the speech recognizer were fed directly into the intent recognizer, which had been trained to receive them!
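
The expansion step can be sketched for the example above. The simplified expander below handles only <rule> references and (a | b) alternatives, leaving the {slot} annotations in place; it is illustrative only, not how Rhasspy actually parses JSGF:

import itertools
import re

def expand(template, rules):
    """Expand a simplified template: <rule> references and (a | b) alternatives."""
    # Substitute rule references such as <states>
    for name, body in rules.items():
        template = template.replace(f"<{name}>", body)

    # Split into literal text and (a | b) alternative groups
    parts = re.split(r"(\([^)]*\))", template)
    choices = [
        [alt.strip() for alt in part[1:-1].split("|")] if part.startswith("(") else [part]
        for part in parts
    ]
    return ["".join(combo) for combo in itertools.product(*choices)]

rules = {"states": "on | off"}
print(expand("turn (<states>){state} the (living room lamp){name}", rules))
# ['turn on{state} the living room lamp{name}', 'turn off{state} the living room lamp{name}']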


Besides the addition of Kaldi, Rhasspy v2 included support for multiple intent recognizers and integration with Snips.ai via the Hermes protocol. This MQTT-based protocol allowed Rhasspy to receive remote microphone input, play sounds/speak text remotely, and be woken up by a Snips.ai server. Because of Rhasspy’s extended language support, this made it possible for Snips.ai users to swap out the speech-to-text module for Rhasspy while keeping the rest of their set-up intact.
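
For illustration, listening for these intent messages might look like the sketch below, assuming a paho-mqtt 1.x client and the Hermes convention of publishing recognized intents under hermes/intent/<intentName>; the payload field shown is illustrative:

import json

import paho.mqtt.client as mqtt

def on_connect(client, userdata, flags, rc):
    # Hermes publishes recognized intents under hermes/intent/<intentName>
    client.subscribe("hermes/intent/#")

def on_message(client, userdata, msg):
    payload = json.loads(msg.payload)
    # Field names are illustrative; see the Hermes protocol docs for the full schema
    print(msg.topic, payload.get("input"))

client = mqtt.Client()
client.on_connect = on_connect
client.on_message = on_message
client.connect("localhost", 1883)
client.loop_forever()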

Through its REST API and a websocket connection, Rhasspy was also able to interact directly with Node-RED, allowing users to create custom flows graphically. These flows could respond to recognized intents from Rhasspy, further extending Rhasspy beyond only devices that Home Assistant could control.
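
As a sketch of that kind of interaction, a recognized intent can be requested over HTTP; the port (12101) and endpoint (/api/text-to-intent) below are assumptions about the Rhasspy server’s REST API and should be adjusted to match a running instance:

import requests

# Hypothetical endpoint and port; adjust to the actual Rhasspy server configuration
response = requests.post(
    "http://localhost:12101/api/text-to-intent",
    data="turn on the living room lamp",
)
print(response.json())  # recognized intent as JSON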

The Road to voice2json

Rhasspy v2 represented a significant leap forward from v1, but there was still much to do. voice2json is a distillation of Rhasspy: it takes the core operations and makes them usable as traditional Linux command-line tools. In the future, Rhasspy will be based on voice2json, focusing more on the voice assistant/integration aspects of IoT.