Software Requirements Specification for Voice Interface Library

Author: Eric Meinhardt

Change Record

2013.06.20 - Draft 1 completed.

2013.06.27 - Draft 2 - first revisions - completed. Typos fixed, hyperlinks changed, Franco’s architecture comments integrated, added example of macro use.

2013.07.11 - Draft 3. Name change, made Aouda.X suit simply an example of an analogue suit rather than the only specific target for integration, minor edits for clarity.


Purpose & Scope

This document is intended to detail, for developers of the ERAS voice interface library and of other ERAS C3 components, what the voice interface library is expected to be needed for, relevant details about who is expected to have those needs, and what features the voice interface has (or will have) to meet those anticipated needs. This requirements specification covers a software library and its associated documentation.


Glossary

ELIZA effect
The human tendency to ascribe human qualities and behaviors to computers or software.

ERAS C3
European Mars Analogue Station (ERAS) Command, Control, and Communication.

EVA
Extra-vehicular activity. Doing stuff in a space suit.

HUD
Heads-up display.

IMS
Italian Mars Society.

OeWF
The Austrian Space Forum ("Österreichisches Weltraum Forum" in German).

General Description

Problem Statement

Surface-exploration tasks may well require astronauts conducting extra-vehicular activity to give direct instructions to rovers

  • without anything resembling a keyboard-pointer-screen interface
  • without requiring the assistance of a human intermediary teleoperating the rover via something like a keyboard-pointer-screen interface

Voice interaction allows for natural real-time, hands-free command of a rover with minimal learning curve and without a need for dedicated, single-use hardware.

Functional Description

The voice interface library will support the creation of software agents capable of

  1. accepting an audio signal,
  2. deciding whether the audio signal is addressed to the software agent,
  3. attempting to map a relevant signal onto a word sequence,
  4. determining the rover action request associated with the understood word sequence,
  5. and providing appropriate feedback to the user.

The voice interface library's first release should allow for the simplest possible solution to these requirements. Features that involve detailed (or more speculative) knowledge of the capabilities of the rover software agent(s) that control actuators (effectors) or take input from sensors (e.g. complex inference about context, or richer feedback to the speaker about the internal state of the rover software) are not design goals for the initial release; creating modularity and abstractions that make it easier to add or interface with such code in later updates is.
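A minimal sketch of how these five responsibilities might be kept modular is shown below; the class and method names are illustrative placeholders, not existing ERAS interfaces.

    # Illustrative skeleton of the five-stage pipeline described above;
    # every name here is a hypothetical placeholder.
    class VoicePipeline:

        def __init__(self, addressee_classifier, recognizer, interpreter, feedback):
            # Each stage is injected so it can be replaced or extended in
            # later releases (e.g. a dialog manager, richer feedback).
            self.addressee_classifier = addressee_classifier
            self.recognizer = recognizer
            self.interpreter = interpreter
            self.feedback = feedback

        def handle(self, audio_signal):
            # 1. Accept an audio signal; 2. decide whether it is addressed
            # to this agent.
            if not self.addressee_classifier.is_addressed_to_me(audio_signal):
                return None
            # 3. Map the relevant signal onto a word sequence.
            words = self.recognizer.transcribe(audio_signal)
            # 4. Determine the rover action request for that word sequence.
            request = self.interpreter.to_action_request(words)
            # 5. Provide appropriate feedback to the user.
            return self.feedback.for_request(words, request)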

For now, advanced features (in approximately this order of priority) include

  1. different types, degrees, and mediums of feedback (text, artificial speech, and more demanding graphical displays), with implementation independent of specific details about any particular analogue spacesuit or component thereof (e.g. the Aouda.X HUD and/or MARVIN).

  2. support for a dialog manager (for managing conversation-related inference) and other more advanced natural language processing capabilities built on top of other components of the rover software executive

  3. easy-to-use, low maintenance learning mechanisms, starting with the capacity for simple user-definable macros that can be ‘written’ entirely in the field and ‘on the fly’. For example, suppose an astronaut decides, in the field, that she or he wants the rover to take two pictures (each with different camera settings) with, say, him- or herself at the center, at multiple locations. Without macros and without pre-EVA scripting of this task, the astronaut will have to go through this loop

    1. With the rover following, proceed to the next location where pictures are desired.
    2. Tell the rover to take a picture of the astronaut with parameter set 1.
    3. Tell the rover to take a picture of the astronaut with parameter set 2.

    every time a pair of pictures at a new location is desired. With the ability to record simple macros, the astronaut can instead tell the rover to 'start recording', give it the instructions - in the example above, 'Follow closely.' ... 'Stop.' ... 'Take a picture of me using <settings abc>.' ... 'Take a picture of me using <settings xyz>.' ... 'Stop recording. Label this macro <macro-name>.' - and then simply tell the rover to invoke the macro at each new location (a minimal sketch of such a recorder follows this list).
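A minimal sketch of such a field-recordable macro facility, assuming each recognized command is already available as an action request (all names below are illustrative):

    # Illustrative macro recorder: 'start recording' buffers subsequent
    # action requests until 'stop recording', which saves them under a
    # user-chosen label; 'invoke' replays them in order.
    class MacroRecorder:

        def __init__(self):
            self.macros = {}       # label -> list of action requests
            self.recording = None  # buffer while a macro is being recorded

        def start_recording(self):
            self.recording = []

        def observe(self, action_request):
            # Called for every interpreted command; buffered while recording.
            if self.recording is not None:
                self.recording.append(action_request)

        def stop_recording(self, label):
            self.macros[label] = self.recording
            self.recording = None

        def invoke(self, label, execute):
            # Replay the saved requests through the caller-supplied executor,
            # e.g. the function that forwards requests to the rover planner.
            for action_request in self.macros[label]:
                execute(action_request)

In the photo example above, the saved macro would contain the 'follow', 'stop', and two 'take a picture' requests, replayed in order at each new location.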


The voice interface library is intended to be written in Python. An instance will be hosted on an onboard computer of an analogue space suit (e.g. Aouda.X) running Ubuntu 12.04 (LTS), wrapped in a Tango distributed control system object, and should interact well with other elements of the ERAS C3 Prototype, namely the rover executive / planning agent.
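As a rough illustration of that hosting arrangement, a Tango device wrapping a voice interface instance might look like the sketch below, assuming PyTango's high-level server API; the device class and its single command are placeholders, not the project's actual interface.

    # Hypothetical Tango device exposing a voice interface instance.
    from tango import DevState
    from tango.server import Device, command, run

    class VoiceInterfaceDevice(Device):

        def init_device(self):
            Device.init_device(self)
            self.set_state(DevState.ON)

        @command(dtype_in=str, dtype_out=str)
        def ProcessUtterance(self, audio_uri):
            # Placeholder: run recognition/interpretation on the referenced
            # audio and return a feedback string for the astronaut.
            return "RECEIVED: " + audio_uri

    if __name__ == "__main__":
        run((VoiceInterfaceDevice,))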

User classes & objectives

(Analogue) Astronaut

Ideally, astronauts will use speech to direct the rover the way they are used to using speech with other human beings (see the ELIZA effect), though most will probably have had enough experience talking to phone-based dialog agents and/or smartphone assistants to lower their expectations.

In more detail, this means a voice interface instance must provide feedback (answering the user question "Did the rover hear me and understand what I asked?") and demand as little extra explicitness as possible beyond what a conversation with a human being would contain; i.e. a voice interface instance should have some means of modeling conversational context.

As with any control interface, astronauts want to know as clearly as possible what options they have to direct the rover at any given moment (i.e. what the rover is listening for - e.g. the rover might understand a request to turn, but not what direction or how far) and what the limits are on how they can pursue those options (i.e. what they can reasonably expect the rover will or will not understand).

Users will be expected to have extensive opportunities to learn the capabilities and limitations of the voice interface library and also to provide more than enough training data for speech recognition models prior to field testing.

Functional Requirements

Requirement 1: Receive audio stream


Description

The voice interface instance should be able to receive a local audio stream.


Criticality

High. This is an essential feature.


Dependencies

This functional requirement depends on an interface requirement: interfacing with other Tango (ERAS C3) objects. (See the software interface requirements.)

Requirement 2: Classify audio signal addressee


Description

A voice interface agent (instance) needs to be able to determine whether or not the stream it's receiving contains linguistic content directed at it.


Criticality

High. This is an essential feature.


Dependencies

This functional requirement depends on receiving an audio stream (the first functional requirement).

Requirement 3: Map relevant signal to word sequence


Description

A voice interface software agent needs to infer from the audio signal what a matching word sequence is; library availability and efficiency-vs.-effectiveness trade-offs will determine how complex this needs to be (e.g. committing to a single most-probable word sequence from t=0 forward vs. holding some small number of candidate word sequences in parallel and dynamically reranking them as the signal unfolds).


Criticality

High. This is an essential feature.


Dependencies

Requirement 3 depends on requirement 2 (identifying whether a linguistic utterance is a rover command).

NB Requirement 2 can be viewed as a strict subset (albeit one notable enough to pick out) of requirement 3: in its simplest form (starting any command for some rover with the rover’s name), the spoken form corresponding to an address is a command to ‘listen carefully to the rest of what I [the current speaker] have to say.’
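To make the second option in requirement 3 concrete, the sketch below holds an n-best list of candidate transcriptions and reranks it with a task-specific score; the scoring heuristic (a fixed bonus for known command phrases) is purely illustrative.

    # Illustrative n-best reranking: candidate transcriptions arrive with
    # recognizer scores and are reordered with a task-specific bonus.
    def rerank(nbest, known_commands=("stop", "halt", "follow me"), bonus=1.0):
        """nbest: list of (word_sequence, recognizer_score) pairs."""
        def task_score(item):
            words, recognizer_score = item
            return recognizer_score + (bonus if words.lower() in known_commands else 0.0)
        return sorted(nbest, key=task_score, reverse=True)

    # Example: the acoustically best guess loses to a known command.
    # rerank([("hall", 0.51), ("halt", 0.49)]) -> [("halt", 0.49), ("hall", 0.51)]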

Requirement 4: Map word sequence to action request


Description

Given a word sequence (or probability estimates over a small number of the most probable word sequences), a voice interface software agent must attempt to determine what action(s) is (are) being requested of the rover.


Criticality

High. This is an essential feature.


Dependencies

This functional requirement depends on requirement 3 (mapping a signal deemed relevant to a word sequence).
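In its simplest form, the mapping in requirement 4 could be a small table of patterns; the commands, pattern syntax, and request structure below are illustrative only, not a committed interface.

    # Illustrative word-sequence-to-action-request mapping.
    import re

    COMMAND_PATTERNS = [
        (re.compile(r"^stop$|^halt$|^don'?t move$"),
         lambda m: {"action": "stop"}),
        (re.compile(r"^rotate (left|right) (\d+) degrees$"),
         lambda m: {"action": "rotate", "direction": m.group(1),
                    "degrees": int(m.group(2))}),
        (re.compile(r"^follow me( closely| exactly)?$"),
         lambda m: {"action": "follow", "target": "speaker"}),
    ]

    def to_action_request(word_sequence):
        # Strip punctuation and case before matching.
        text = re.sub(r"[^\w\s']", "", word_sequence).strip().lower()
        for pattern, build in COMMAND_PATTERNS:
            match = pattern.match(text)
            if match:
                return build(match)
        return None  # not understood; triggers the feedback path (requirement 5)

    # to_action_request("Rotate left 90 degrees.")
    #   -> {'action': 'rotate', 'direction': 'left', 'degrees': 90}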

Requirement 5: User Feedback


Description

The voice recognition software agent may not recognize or understand some or all of an utterance it believes is directed at it; in such cases, the voice interface instance ought to be able to provide appropriate feedback to users.

To start, a voice interface agent will be able to send text error messages that are more useful and informative to an end user - one who knows little or nothing about the ERAS voice interface library, Tango, or how voice recognition works - than what a developer would use for debugging; stack traces and programmer/scientific jargon will NOT be acceptable. These messages can be transmitted (and presented) either as text or as synthesized speech.


Criticality

High. This is an essential feature.


Dependencies

This functional requirement is a real-time error-recovery mechanism; at least one of requirements 1-4 needs some minimum level of functionality before development of feedback messages makes much sense. That said, requirements 3 and 4 - mapping an audio signal to a word sequence and interpreting the requested action associated with that sequence - will likely be the functional requirements most subject to errors, and therefore the ones users will most want feedback on.
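One possible shape for such plain-language feedback is a fixed mapping from internal failure categories to messages a non-developer can act on; the categories and wording below are placeholders, not a committed message set.

    # Illustrative mapping of internal failure categories to user messages;
    # stack traces and jargon stay in the developer log, not the HUD.
    USER_MESSAGES = {
        "no_audio":        "I can't hear you. Please check your microphone.",
        "not_understood":  "I didn't understand that. Please repeat the command.",
        "unknown_command": "I heard you, but I don't know how to do that.",
        "low_confidence":  "I'm not sure what you asked. Did you mean: {guess}?",
    }

    def user_feedback(category, **details):
        template = USER_MESSAGES.get(category,
                                     "Something went wrong. Please try again.")
        return template.format(**details)

    # user_feedback("low_confidence", guess="take a picture of me")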

Requirement 6: Rich Feedback


Description

Synthesized speech (minimally, text-to-speech versions of the text error messages), feedback of varying length and detail, context-based inference, and non-verbal graphical feedback are variations in feedback that will allow an astronaut to more easily understand why the voice interface (or the rover) is not understanding or complying with a request, and what they can do to change this, as circumstances allow.


Criticality

Medium. Graceful recovery from failure will be important in avoiding frustration on the part of users in the face of brittle technology.


Dependencies

This functional requirement is dependent on requirement 5 and the software interface requirements.

Requirement 7: Enhanced Natural Language Processing & Understanding


Description

The minimal specifications don't make use of any particularly complex natural language technologies other than speech recognition (knowledge of a language's phonetics and phonology); syntax is represented in a very simple, impoverished, and inflexible form, and 'conversation' is a rather lop-sided affair. A detailed and robust rover executive with an explicit ontology of objects in the world, a model of itself and of speakers, and a more detailed grammar (of the language in question and of the speech patterns of the astronauts actually on the mission) would allow for a more natural interface with less of a learning curve; astronauts would likely spend less time worrying about how much they need to adjust their phrasing for the primitiveness of the rover and what the recognized types and sequences of magic words are to make it do their bidding.

Specifically, a part-of-speech tagger, proper-name identification / named-entity extraction, more complex syntactic and semantic parsers, and a dialog manager - the latter interfacing with a planning agent (and its associated formal framework) - are starting areas for growth. In particular, investing time in developing a dialog manager (or its prerequisites) may be the single most worthwhile investment in additional functionality for the voice interface, permitting less code to do more work (instead of explicitly and duplicatively hand-coding the edge cases - e.g. error recovery - for each type of task).



Criticality

Partly dependent on testing; if a simpler system is good enough for the intended uses, adding more complicated natural language processing components may at worst compromise performance (NLP/NLU is CPU-intensive and might become a bottleneck in voice command processing) and the learning/training curve (the system may take a long time to gather enough data from users to function correctly where a simpler system works well enough 'out of the box'), and will of course add to the software development and maintenance burden.


Dependencies

This functional requirement requires functional requirements 1-5 at minimum, and potentially further knowledge of planned aspects of the rover software executive.

Requirement 8: Learning mechanisms


Description

After initial testing of each feature and use-case scenario, bottlenecks in functionality should become identifiable (at least earlier in the data flow, given the nature of compounding errors and the dependencies among functional requirements 1-5). The ability to learn from each episode of feature use - both improving the voice interface instance's statistical models of speech, language, and understanding, and adding new 'vocabulary' items (named entities and 'voice macros') - could be an important means of minimizing astronaut frustration and effort while maximizing an astronaut's ability to direct a rover as they please.

Specific areas of improvement are below:

  • speech models
    • speaker-specific supervised training (having a user read aloud from a set of known texts) is normal for some speech recognition models; whether the models available in the open-source speech recognition libraries likely to be used support such training is unknown at the time of writing; in any case, such training does not take very long before substantial gains in accuracy are realized.
    • NB that language variety models (simplistically, “dialect”) are probably not worth pursuing unless there are large numbers of people in testing or use that fall into language variety clusters where performance is sufficiently poor when accent is not modeled (at all or explicitly).
  • classifying speech as rover-directed or not

  • grammar extensions - more general, flexible models of language will permit astronauts to interact more naturally, rather than trying to remember the hyperspecific, stilted forms that the rover recognizes.

  • vocabulary - astronauts will be able to add new atomic items (e.g. location names) to a voice interface agent’s knowledgebase and more complex procedures (e.g. let the sequence of actions a, b, and then c be called ‘X’) composed of simpler actions each associated with a voice command.


Criticality

Medium/low; depends on how well or poorly the other features function and how important extending the grammar and/or vocabulary seems after testing of more primitive versions.


Dependencies

Low/medium. This feature could plausibly be examined and worked on as each of requirements 1-5, 6, and 7 is completed, although some analysis will require the first five to be done.

Interface Requirements

User Interfaces

The user is assumed to have a microphone and at least speakers; a visual interface capable of displaying at least text is presumed but not required at this point.

Software Interfaces

The Tango object representing the server hosting the voice interface instance should have access to appropriate (currently not well defined) Tango objects related to a rover, and a flexible number of slots for Tango objects for suit-related interfaces, such as receiving microphone audio and/or updates about the state of an astronaut - useful for modeling utterance context.

Externally, the voice interface instance Tango object should have exposed methods for the rover planner/executive to call for the purpose of deciding what feedback to send to a user.

Performance Requirements

Lag-to-Feedback (s)

Time from the end of a speaker's utterance to the onset of the voice agent's feedback transmission. A user ought to receive some feedback within at most a few seconds, even for particularly complex commands or noisy input; feedback time for basic, short commands in typical conditions ought to be less than that. Testing will firm up whether these performance targets are too generous or too stringent.

Word recognition error rate on actual rover-directed speech (%)

A reasonable goal, based on consultation of a review of early/mid-2000s NASA technology and field tests ([1]), is for around 6.5% or fewer of actual rover-directed words to be incorrectly recognized. A possible catch here is that the IMS/OeWF volunteers may have varying types and degrees of accented English.

False accept rate (attending to non-rover-directed speech)

‘False accepts’ occur when a rover voice agent misclassifies an utterance as a request directed at it. A reasonable goal based on consultation of [1] is for <10% of all utterances to be incorrectly classified by the rover voice agent as directed at the rover.

An easy fix to start with is a prefix keyword (think Star Trek's "Computer, ..."): prefixing every command with the name of the rover being addressed. Depending on how annoying this proves, a separate classifier can be trained later to classify incoming utterances.
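A minimal sketch of that prefix-keyword check follows (the treatment of the rover name is illustrative); a trained classifier could later replace it behind the same function.

    # Illustrative prefix-keyword ('wake word') check for requirement 2.
    def is_addressed_to_me(word_sequence, my_name):
        words = word_sequence.strip().lower().split()
        return bool(words) and words[0].rstrip(",") == my_name.lower()

    # is_addressed_to_me("Rover, take a picture of me.", "rover")  -> True
    # is_addressed_to_me("I told the rover to stop.", "rover")     -> False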

False reject rate (ignoring rover-directed speech)

‘False rejects’ occur when a rover voice agent misclassifies an utterance as NOT directed at it. A reasonable goal based on consultation of [1] is for <10% of all utterances to be incorrectly classified by the intended rover voice agent as directed to someone else.

Development and Test Factors

Standards Compliance

All code will adhere to the guidelines outlined in the ERAS Software Engineering Practices Guidelines.

In addition, the voice interface instance ought to be able to support receiving audio in a number of well-supported, non-proprietary audio formats - WAV, AAC, Ogg Vorbis.
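One way to accept these formats uniformly is to normalize incoming audio to a single PCM/WAV form before recognition; the sketch below assumes the pydub library (with an ffmpeg backend), which is only one possible choice, and the 16 kHz mono target is likewise just an example.

    # Illustrative normalization of incoming audio (WAV, AAC, Ogg Vorbis, ...)
    # to mono 16 kHz WAV before recognition. Assumes pydub + ffmpeg.
    from pydub import AudioSegment

    def normalize_audio(in_path, out_path="normalized.wav"):
        clip = AudioSegment.from_file(in_path)  # container/codec inferred
        clip = clip.set_channels(1).set_frame_rate(16000)
        clip.export(out_path, format="wav")
        return out_path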

Software validation and verification

The voice interface library code will be unit-tested, tested behaviorally against use cases using speech recorded on inexpensive consumer-model laptop microphones, possibly tested in simulation (provided a simulation exists at some point), and later field-tested by IMS/OeWF volunteers.


The minimum schedule can be found in [3]; the minimum milestones most salient to the voice library are below.

  • June 27: First Draft of Design Study finished. Coding begins, moving through use cases with repository updates at least every two weeks.
  • July 29: Design Study Review completed: Design Study doc frozen on repository, server prototype up and running in Tango.
  • Aug 2: Mid-term evaluation.
  • Aug 15: “GSoC on Mars” paper and presentation for 2013 Mars Society convention in Boulder ready.
  • Sep 16: Final server version up and running, all validation tests OK with satisfactory coverage.
  • Sep 23: User/Maintenance Manual frozen.
  • Sep 27: Final evaluation.
  • Oct 2013: Project integration on Bergamo C3 prototype.
  • Within 2013?: Field testing with OeWF.

The preferred schedule, intended to provide some slack for unanticipated difficulties, is below.

Use-Case Models

Use Case: Important features common to all use cases


Actors

One or more astronauts/IMS or OeWF volunteers conducting (mock) EVA and using one or more rovers to assist them.

Contextual Goals

Direct the operation of a rover using naturalistic voice commands.




Preconditions

The voice interface instance needs a functioning audio stream input.


Basic course of events

  1. Audio is transmitted from the astronaut(s)' microphone to the server hosting the voice interface agent.
  2. Language in the audio is classified as rover-directed or not.
  3. Rover-directed speech is mapped onto words (the mapping mechanism is deliberately underspecified).
  4. The voice interface instance decides what to do with the utterance and therefore what kind of feedback to give the user:
    1. Utterances the voice interface instance is confident it understood:
      1. The utterance is mapped onto an action request.
      2. The action request gets passed on to the rover executive (planning agent).
      3. The rover executive then passes on to the voice interface instance whether the request will be executed, if there’s a conflict and the voice interface instance should ask for confirmation/clarification, or if the request cannot be completed.
      4. Whatever action the rover planning agent takes, the voice interface instance then decides appropriate feedback to pass onto the user.
        1. If the request will be straightforwardly granted, a short restatement including parameters (e.g. distance to move or rotate, destination) will be forwarded by the voice interface instance to the astronauts on EVA.
          • Alternately, to cut down on useless chatter, if there is some kind of HUD indicator of what each rover on EVA is doing (i.e. a short status summary), updating this could be a better alternative than HUD text or synthesized speech.
        2. If there’s a conflict, the voice interface instance should pass on a message (via text-in-HUD or via synthesized speech) as to what conflicts with the request (e.g. “CONFLICT: Travel to <name-of-requested-destination> conflicts with existing goal <goal id/description>.”) and ask for confirmation of the request (e.g. “CONFIRM?: Travel to <destination-name>.”)
        3. If the request cannot be complied with (due to precondition violation distinct from a goal conflict), the voice interface agent should pass along a message explaining as much: “REQUEST DENIED: <explanation - precondition xyz violated.>”
        4. If the request was only partially understood or understood with confidence less than a to-be-experimentally-determined threshold, then the voice interface instance should request clarification of the remaining parameters while clarifying what it already understands. For example, “Travel where?” “Move forward how far?” “Track what?” “Follow who?”
    2. Utterances the voice interface instance is NOT confident it understood:
      • The voice interface instance requests clarification a limited number of times:
        • successful clarification puts the voice interface instance back at 4.1 above;
        • otherwise, once the limit is reached or the user breaks the clarification dialog loop (e.g. via “No more questions.”, “Start over.”, “Shut up.”), the instance returns to a state where it waits for a new command (this branching is sketched below).
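The confidence-threshold branching in step 4 might look something like the sketch below; the threshold, retry limit, and object interfaces are placeholders to be determined experimentally.

    # Illustrative dispatch for step 4: confident interpretations go to the
    # rover executive; low-confidence ones enter a bounded clarification loop.
    CONFIDENCE_THRESHOLD = 0.7  # placeholder; to be tuned in testing
    MAX_CLARIFICATIONS = 2      # placeholder retry limit

    def handle_interpretation(interpretation, executive, feedback):
        attempts = 0
        while interpretation.confidence < CONFIDENCE_THRESHOLD:
            if attempts >= MAX_CLARIFICATIONS or interpretation.cancelled_by_user:
                return feedback.wait_for_new_command()               # step 4.2
            interpretation = feedback.ask_clarification(interpretation)
            attempts += 1
        verdict = executive.submit(interpretation.action_request)    # step 4.1
        return feedback.report(verdict)  # granted / conflict / denied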


Postconditions

The rover passes on the request as understood to the rover’s planning agent, waits for feedback from the planning agent, and passes it along to the user.


Note that none of the trigger utterance example lists are intended to be exhaustive.

Use Case: Directing rover movement




Preconditions

The rover must be capable of the requested movement and the requested movement should not conflict with other current or near-future goals.

Examples of naturalistic and realistic trigger utterances

Examples of less definite duration, goal directed instructions

  • “<Head/go> (back) to(wards) <named-landmark, the-habitat, the-next-waypoint, astronaut’s-name, other-rover’s-name>.”
  • “Come here.”
  • “Follow me (closely, exactly).”

Notably, a (very) rough grammar template for this sort of command consists of some movement word, a target phrase (possibly including prepositions or adverbs), and optional arguments indicating the manner in which the rover should pursue movement towards the target.
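For a first pass, that template can be captured with a single pattern; the vocabulary below is illustrative and far from exhaustive.

    # Illustrative pattern for the goal-directed movement template:
    # <movement word> <target phrase> (<manner>).
    import re

    MOVEMENT_PATTERN = re.compile(
        r"^(?P<verb>head|go|come|follow)\s+"
        r"(?:back\s+)?(?:to(?:wards)?\s+)?"
        r"(?P<target>the habitat|the next waypoint|me|here|\w+)"
        r"(?:\s+(?P<manner>closely|exactly))?$"
    )

    # MOVEMENT_PATTERN.match("follow me closely").groupdict()
    #   -> {'verb': 'follow', 'target': 'me', 'manner': 'closely'}
    # MOVEMENT_PATTERN.match("go back to the habitat").groupdict()
    #   -> {'verb': 'go', 'target': 'the habitat', 'manner': None}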

Examples of definite, direct instructions

  • “Rotate <left, right> <# degrees>.”
  • “Go <forward, backward> <a certain number of meters or centimeters>.”
  • “Don’t move.”
  • “Stop.”
  • “Halt.”


Postconditions

An appropriate movement request is made to the rover planner and appropriate feedback reaches the user.

Use Case: Image recording


Actors

Human user making an action request, rover voice interaction agent, and (potentially) a target.




Preconditions

The webcam must be operational and the requested use of it should not conflict with other current or future goals.

Examples of realistic, naturalistic trigger utterances

(NB that almost all of these are of a goal-directed nature.)

  • “<Watch/Record> <named-entity> (for-some-duration).”
  • “Take a <photo, panorama, video, capture> of <named-entity> (and label it <name for photo/data capture>).”
  • “End/Stop recording.”
  • “Delete the last <capture/image/recording>.”


Postconditions

An appropriate webcam action request is passed on to the planner and feedback is forwarded to the astronauts.

Use Case: Report details on rover state


Actors

(Analogue) astronaut.




Preconditions

The voice interface must be operational and able to get a response from the rover executive.

Examples of trigger utterances

  • “What’s your current status, <rover name>?”
  • “Run <name of diagnostic routine>.”
  • “What’s the status of your <webcam, other rover-software-or-hardware-component>?”


Anything more than a short list of simple requests is going to start approaching menu navigation - operating something like a console, all by voice. The scope and feasibility of this are only determinable via testing (in simulation or otherwise): what would astronauts do to diagnose or repair a rover if one too heavy to drag back to safety breaks down in the field?


Postconditions

A more or less detailed message of what is or isn't OK with the requested item is sent via text to the astronaut's HUD or via voice synthesis to the current common voice channel.