Max – #noAlexa

Voice assistants are a thing, as you can see with the success of Alexa and Mark Zuckerberg’s Jarvis demo. When I saw this, Max – our voice assistant – was already up and running in its initial setup 😉

Ivonne created a Node.js project that runs an always-listening service to power our digital assistant. It is built to run on a Raspberry Pi 3 with Raspbian Jessie. We call him Max, inspired by the comic ‘The Thirteenth Floor’, in which Max serves as the Maxwell building’s main computer.

Below you will learn which components we used to create our servant.

General Architecture and Approach

The system is based on a Raspberry Pi. It is set up with an active loudspeaker connected to the line-out, a USB microphone (SAMSON Meteorite USB Condenser Microphone) and a blink(1) mk2 USB notification light. The Raspberry Pi runs from an 8 GB SD card and connects to the network using its built-in Wi-Fi.

To create Max, the Node process listens to the microphone to detect hotwords. Once a hotword is identified, the process actively records the speech and forwards it to an online speech-to-text and natural language processing service to get the speaker’s intent. The intent and entities are then used to apply business logic and trigger actions.

To signal the system state to the user, the blink(1) lights up in different colors to indicate standby/listening/processing/… status.
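As a rough illustration of the status signalling described above, here is a minimal sketch that maps system states to light colors. The state names, the RGB values and the stub light object are assumptions made for this example; the post does not list the real ones. The stub only mimics a `setRGB(r, g, b)` style call so the sketch runs without hardware.

```javascript
// Hypothetical state names and RGB values; the real ones are not given in the post.
const STATE_COLORS = {
  standby: [0, 0, 255],
  listening: [0, 255, 0],
  processing: [255, 160, 0],
  error: [255, 0, 0],
};

// Returns the RGB triple for a state, falling back to "off" for unknown states.
function colorFor(state) {
  return STATE_COLORS[state] || [0, 0, 0];
}

// `light` is any object with a setRGB(r, g, b) method.
function showState(light, state) {
  const [r, g, b] = colorFor(state);
  light.setRGB(r, g, b);
  return [r, g, b];
}

// Stub light that just remembers the last color, standing in for real hardware.
const stubLight = {
  last: null,
  setRGB(r, g, b) {
    this.last = [r, g, b];
  },
};

showState(stubLight, 'listening');
```

Keeping the mapping in one table means a new state only needs a new entry, and the rest of the code never deals with raw color values.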

Max also uses text-to-speech to give spoken feedback to the user.

The first use case was a TV remote control for a Samsung TV, which can process key commands via TCP socket and SOAP web services.

Components

Audio recording

Max uses arecord, wrapped by the Node “mic” package. The input stream of mic is piped to the hotword detector. Once a hotword is detected, Max switches to active listening and processes recorded data until “mic” emits a silence event or a maximum recording time is reached.
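The switch between standby and active listening can be pictured as a small state machine. The sketch below is not Max's actual code: the class, its method names and the time limit are hypothetical, and timestamps are passed in explicitly so it runs without a microphone.

```javascript
// Hypothetical limit; the post does not state the real maximum recording time.
const MAX_RECORDING_MS = 8000;

class Recorder {
  constructor(maxMs = MAX_RECORDING_MS) {
    this.maxMs = maxMs;
    this.state = 'standby'; // 'standby' or 'listening'
    this.startedAt = 0;
    this.chunks = [];
  }

  // Called when the hotword detector fires.
  onHotword(now) {
    this.state = 'listening';
    this.startedAt = now;
    this.chunks = [];
  }

  // Called for every audio chunk. Returns the finished recording once the
  // time limit is reached, otherwise null.
  onData(chunk, now) {
    if (this.state !== 'listening') return null; // standby: chunk is dropped
    this.chunks.push(chunk);
    return now - this.startedAt >= this.maxMs ? this.finish() : null;
  }

  // Called when the microphone reports silence.
  onSilence() {
    return this.state === 'listening' ? this.finish() : null;
  }

  // Joins the collected chunks and returns to standby.
  finish() {
    const recording = Buffer.concat(this.chunks);
    this.state = 'standby';
    this.chunks = [];
    return recording;
  }
}

// Usage: a recording that ends on silence.
const rec = new Recorder(1000);
rec.onData(Buffer.from('ignored'), 0);
rec.onHotword(0);
rec.onData(Buffer.from('tv '), 100);
rec.onData(Buffer.from('on'), 200);
const utterance = rec.onSilence();
```

Both exit conditions go through the same `finish` step, so the rest of the pipeline does not need to know whether a recording ended by silence or by timeout.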

Hotword detection

Max uses snowboy for hotword detection, with personal models (pmdl) created on the snowboy website.

Speech processing

Mic delivers data events, and the data chunks are collected in a buffer array. Once the hotword detection has triggered, the chunks collected after the hotword and all further speech are streamed to wit.ai. This allows continuous speech: no pause is needed after the hotword to indicate active listening, which makes the interaction very fluid.
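One way to picture this buffering is sketched below. It is an illustration under assumptions, not the project's implementation: the class name, the cap on retained chunks and the idea that the number of chunks received after the hotword is known are all hypothetical, and the `sink` callback stands in for the upload stream to the speech service.

```javascript
class ChunkForwarder {
  // `sink` receives every chunk that should go to the speech service.
  // `keep` is a hypothetical cap on how many recent chunks are retained.
  constructor(sink, keep = 50) {
    this.sink = sink;
    this.keep = keep;
    this.pending = [];
    this.streaming = false;
  }

  // Called for every audio chunk from the microphone.
  onData(chunk) {
    if (this.streaming) {
      this.sink(chunk); // live speech goes straight through
      return;
    }
    this.pending.push(chunk);
    if (this.pending.length > this.keep) this.pending.shift();
  }

  // `tailChunks` is the number of buffered chunks that arrived after the
  // hotword ended but before the detector reported it (assumed to be known).
  onHotword(tailChunks) {
    const tail = tailChunks > 0 ? this.pending.slice(-tailChunks) : [];
    tail.forEach((chunk) => this.sink(chunk));
    this.pending = [];
    this.streaming = true;
  }

  // Called when the utterance is over; buffering starts again.
  onEnd() {
    this.streaming = false;
  }
}

// Usage with strings standing in for audio chunks.
const sent = [];
const forwarder = new ChunkForwarder((chunk) => sent.push(chunk));
['noise', 'max', 'turn'].forEach((chunk) => forwarder.onData(chunk));
forwarder.onHotword(1);
forwarder.onData('on tv');
forwarder.onEnd();
```

Because the chunk that followed the hotword is flushed first, speech that began before the detector fired is not lost, which is what makes a pause after the hotword unnecessary.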

Speech-to-text is powered by wit.ai, which is trained for the required intents and entities.

Business Logic (Bot-logic)

The logic checks the data and internal context to derive actions and control the system state. A local MongoDB is used for persistence of context and other data.

The bot-logic components are built as Node.js modules that handle the trained intents. For each intent a new sub-module is created, and the context store lets all modules share data and insights into the user’s needs.
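The one-module-per-intent idea with a shared context can be sketched roughly as follows. The intent names, entity fields and reply texts are made up for this example, and a plain `Map` stands in for the MongoDB-backed context store.

```javascript
// Shared context; a Map stands in for the MongoDB-backed store here.
const context = new Map();

// One handler per trained intent. Intent and entity names are hypothetical.
const handlers = {
  tv_channel(entities, ctx) {
    ctx.set('currentChannel', entities.channel);
    return `Switching to channel ${entities.channel}`;
  },
  tv_status(_entities, ctx) {
    const channel = ctx.get('currentChannel');
    return channel ? `Currently on channel ${channel}` : 'No channel known yet';
  },
};

// Routes a recognized { intent, entities } result to the matching handler.
function dispatch(result, ctx = context) {
  const known = Object.prototype.hasOwnProperty.call(handlers, result.intent);
  if (!known) return 'Sorry, I did not understand that';
  return handlers[result.intent](result.entities || {}, ctx);
}

// Usage: the second intent relies on what the first one stored.
const first = dispatch({ intent: 'tv_channel', entities: { channel: 5 } });
const second = dispatch({ intent: 'tv_status' });
```

Adding a capability then means adding one handler, while the context lets a later request build on an earlier one.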

Sound Output

To give Max a voice, espeak was used initially. Max now uses the Pico TTS libraries to provide a soft (female) voice output. Pico TTS is accessed through the Node speaky module, which wraps the Pico libraries.

Besides voice, Max plays short sound files, e.g. to signal error states.

blink(1)

blink(1) provides a Node package for controlling the blink(1) lights.

Samsung TV

For pre-2014 TV sets, channels can be controlled via the send-key API using the samsung-remote npm module. In addition, the (non-public) SOAP API is used.

Ready, Set, Action

To see what Max looks and sounds like now, check the tweet below. As we are German, Max speaks German, too.

We are expanding Max’s use cases over time. Currently Max can control the TV, handle reminders, send Slack messages, provide weather and time information, query Wikipedia and read the latest news pushed by Twitter feeds. Because Max can distinguish between Ivonne and myself, the services are customized to our preferences.
