

Speech recognition is useful in VR not only for simulating conversations with AI agents, but also for letting the user communicate with any application that offers a great number of options. Typing out a response or command might be too impractical, and overcrowding the application with buttons or other GUI elements could quickly become confusing. But anyone who is capable of speech can easily speak while in a VR experience. Unity Labs’ virtual reality (VR) authoring platform Carte Blanche will have a personal assistant called U, with whom the user will be able to speak in order to easily perform certain actions. We at Labs have been researching speech recognition and analysis tools that could be used to implement these voice commands.

The first section of this article presents the concepts and theory behind speech recognition. It serves as a primer, introducing related concepts and providing links for the reader to get more information on this field. The second section presents a Unity Asset Store package and public repository we are making available, which provide a wrapper for several speech-to-text solutions and a sample scene that compares the text transcriptions from each API.

How do speech recognition and semantic analysis work? Speech recognition is the transcription of speech to text by a program. Semantic analysis goes a step further by attempting to determine the intended meaning of that text. Although we humans solve these tasks intuitively and without much apparent effort, getting a program to perform both poses problems that are much more difficult than one might think. Even the best speech recognition and semantic analysis software today is far from perfect. One vital component of today’s statistically based speech recognition is acoustic modeling: starting from a waveform and determining the probabilities of distinct speech sounds, or phonemes (e.g. “s”, “p”, “ē”, and “CH” in “speech”), occurring at discrete moments in time.
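To make the acoustic-modeling idea concrete, here is a deliberately simplified sketch: short frames of a waveform are scored against hand-made per-phoneme feature templates, and a softmax turns the scores into per-frame phoneme probabilities. The features, templates, and phoneme set are all invented for illustration; real acoustic models are trained statistically (classically GMM-HMMs, today usually neural networks) on spectral features.

```python
import math

# Toy acoustic model: score frames of a waveform against hand-made
# phoneme "templates" and turn the scores into probabilities with a
# softmax. Everything here is illustrative, not a real recognizer.

PHONEMES = ["s", "p", "ee", "ch"]

# Hypothetical per-phoneme templates over two crude features:
# (frame energy, zero-crossing rate).
TEMPLATES = {
    "s": [0.1, 0.6],   # fricative: low energy, many zero crossings
    "p": [0.05, 0.1],  # plosive: brief, low energy
    "ee": [0.8, 0.2],  # vowel: high energy, few zero crossings
    "ch": [0.3, 0.5],
}

def frame_features(frame):
    """Two crude features for one frame: energy and zero-crossing rate."""
    energy = sum(x * x for x in frame) / len(frame)
    zcr = sum(1 for a, b in zip(frame, frame[1:]) if a * b < 0) / len(frame)
    return [energy, zcr]

def softmax(scores):
    """Convert raw similarity scores into probabilities that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def phoneme_probabilities(waveform, frame_size=160):
    """For each frame, return a dict mapping phoneme -> probability."""
    results = []
    for start in range(0, len(waveform) - frame_size + 1, frame_size):
        feats = frame_features(waveform[start:start + frame_size])
        # Negative squared distance to each template as a similarity score.
        scores = [-sum((f - t) ** 2 for f, t in zip(feats, TEMPLATES[p]))
                  for p in PHONEMES]
        results.append(dict(zip(PHONEMES, softmax(scores))))
    return results
```

A real system would feed these per-frame probabilities into a pronunciation dictionary and a language model to decode whole words, rather than reading phonemes off frame by frame.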
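The transcription-versus-meaning distinction can likewise be sketched in miniature: once a recognizer has produced text, a separate step tries to map that text to the user's intent. The intents and patterns below are invented for illustration; production systems use trained natural-language-understanding models rather than a handful of regular expressions.

```python
import re

# Toy "semantic analysis": given text already transcribed by a speech
# recognizer, guess the user's intent with simple pattern matching.
# The intent names and patterns are made up for this example.
INTENT_PATTERNS = {
    "create_object": re.compile(r"\b(make|create|add)\b.*\b(cube|sphere|light)\b"),
    "delete_object": re.compile(r"\b(delete|remove)\b"),
    "change_color": re.compile(r"\bpaint\b|\bcolou?r\b"),
}

def analyze(transcription):
    """Return the first intent whose pattern matches, or None."""
    text = transcription.lower()
    for intent, pattern in INTENT_PATTERNS.items():
        if pattern.search(text):
            return intent
    return None
```

For example, `analyze("Please create a red cube")` returns `"create_object"`, while `analyze("hello there")` returns `None` — the brittleness of such keyword matching is exactly why real semantic analysis is hard.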
