We introduce Vocal Sandbox, a framework for enabling seamless human-robot collaboration in situated environments. Systems in our framework are characterized by their ability to adapt and continually learn at multiple levels of abstraction from diverse teaching modalities such as spoken dialogue, object keypoints, and kinesthetic demonstrations. To enable such adaptation, we design lightweight and interpretable learning algorithms that allow users to build an understanding of and co-adapt to a robot's capabilities in real-time, as they teach new concepts. For example, after demonstrating a new low-level skill for "tracking around" an object, users are provided with trajectory visualizations of the robot's intended motion when asked to track a new object. Similarly, users teach high-level planning behaviors through spoken dialogue, using pretrained language models to synthesize behaviors such as "packing an object away" as compositions of low-level skills - concepts that can be reused and built upon. We evaluate Vocal Sandbox in two settings: collaborative gift bag assembly and LEGO stop-motion animation. In the first setting, we run systematic ablations and user studies with 8 non-expert participants, highlighting the impact of multi-level teaching. Across 23 hours of total robot interaction time, users teach 17 new high-level behaviors with an average of 16 novel low-level skills, requiring 22.1% less active supervision compared to baselines. Qualitatively, users strongly prefer Vocal Sandbox systems due to their ease of use (+31.2%), helpfulness (+13.0%), and overall performance (+18.2%). Finally, we pair an experienced system user with a robot to film a stop-motion animation; over two hours of continuous collaboration, the user teaches progressively more complex motion skills to produce a 52-second (232-frame) movie.
Vocal Sandbox is a framework for human-robot collaboration that enables robots to adapt and continually learn from situated interactions. In this example, an expert articulates individual LEGO structures for each frame of a stop-motion film, while a robot arm controls the camera. Users teach the robot new high-level behaviors and low-level skills through mixed-modality interactions such as language instructions and demonstrations. The robot learns from this feedback online, scaling to more complex tasks as the collaboration continues.
Vocal Sandbox systems consist of two key components: 1) a language model task planner that maps user intents to sequences of high-level behaviors (plans), and 2) a low-level skill policy that maps individual skills output by the language model to real-world robot behavior (in this example, the skill policy is implemented as a library of Dynamic Movement Primitives (DMPs)).
We seed a language model planner with an API specification that defines plans as sequences of functions invoked with different arguments [Left]. Given utterances that successfully map to plans, we visualize an interpretable trace on the GUI. If an utterance cannot be parsed, we synthesize new functions and arguments by soliciting user feedback.
In the following code blocks, we provide the actual GPT-3.5 Turbo (gpt-3.5-turbo-1106) prompts that we use for generation and teaching in our gift-bag assembly setting:
from typing import Dict, List

# Utility Function for "Python-izing" Objects as Literal Types
def pythonize_types(types: Dict[str, List[Dict[str, str]]]) -> str:
    py_str = "# Python Enums defining the various known objects in the scene\n\n"

    # Create Enums for each Type Class
    py_str += "# Enums for Various Object Types\n"
    for type_cls, element_list in types.items():
        py_str += f"class {type_cls}(Enum):\n"
        for element in element_list:
            py_str += f"    {element['name']} = auto() # {element['docstring']}\n"
        py_str += "\n"

    return py_str.strip()
# Initial "Seed" Objects in the Environment
TYPE_DEFINITIONS = {
"object": [
{"name": "CANDY", "docstring": "A gummy, sandwich-shaped candy."},
{"name": "GIFT_BAG", "docstring": "A gift bag that can hold items."},
]
}
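For reference, `pythonize_types(TYPE_DEFINITIONS)` renders these seed objects as the Enum block below (modulo exact whitespace), which is spliced verbatim into the System Prompt that follows:

# Python Enums defining the various known objects in the scene

# Enums for Various Object Types
class object(Enum):
    CANDY = auto() # A gummy, sandwich-shaped candy.
    GIFT_BAG = auto() # A gift bag that can hold items.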
# Base System Prompt -- with "Python-ized" Types
BASE_SYSTEM_PROMPT = (
    "You are a reliable code interface that will be representing a robot arm in a collaborative interaction "
    "with a user.\n\n"
    "In today's session, the user and robot arm will be working together to wrap gifts. "
    "On the table are various gift-wrapping related objects.\n\n"
    "You will have access to a Python API defining some objects and high-level functions for "
    "controlling the robot.\n\n"
    "```python\n"
    f"{pythonize_types(TYPE_DEFINITIONS)}\n"
    "```\n\n"
    "Given a spoken utterance from the user, your job is to identify the correct sequence of function calls and "
    "arguments from the API, returning the appropriate API call in JSON. Note that the speech-to-text engine is "
    "not perfect! Do your best to handle ambiguities, for example:\n"
    "\t- 'Put the carrots in the back' --> 'Put the carrots in the bag' (hard 'g')\n"
    "\t- 'Throw the popcorn in the in' --> 'Throw the popcorn in the bin' (soft 'b')\n\n"
    "If an object is not in the API, you should not fail. Instead, return a new object, which will be added to the API in the future. "
    "Even if you are not sure, respond as best you can to user inputs."
)
# In-Context Examples -- `make_example` (not shown) packs an (utterance, function, JSON arguments,
# tool-call ID) tuple into a few-shot chat example
ICL_EXAMPLES = [
    {"role": "system", "content": BASE_SYSTEM_PROMPT},
    make_example("release", "release", "{}", "1"),
    make_example("grasp", "grasp", "{}", "2"),
    make_example("go home", "go_home", "{}", "3"),
    make_example("go to the bag", "goto", "{'object': 'GIFT_BAG'}", "5"),
    make_example("go away!", "go_home", "{}", "6"),
    make_example("grab the gummy", "pickup", "{'object': 'CANDY'}", "7"),
]
Note that the System Prompt explicitly encodes the arguments/literals defined in the API; these are continually updated as new literals are defined by the user (e.g., `TOY_CAR`) following the example above. The System Prompt also specifically encodes handling for common speech-to-text errors.
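As a rough sketch of what this continual update looks like (the helper name `register_object` and the docstring text are illustrative assumptions, not the released implementation), teaching a new literal amounts to appending to `TYPE_DEFINITIONS` and re-rendering the System Prompt:

# Hypothetical sketch -- illustrative only, not the released code: teaching a new literal such as
# `TOY_CAR` appends to TYPE_DEFINITIONS and refreshes the System Prompt in the chat history.
def register_object(name: str, docstring: str) -> None:
    old_enum_block = pythonize_types(TYPE_DEFINITIONS)
    TYPE_DEFINITIONS["object"].append({"name": name, "docstring": docstring})
    new_enum_block = pythonize_types(TYPE_DEFINITIONS)

    # Swap the refreshed Enum block into the System Prompt carried as ICL_EXAMPLES[0]
    refreshed_prompt = ICL_EXAMPLES[0]["content"].replace(old_enum_block, new_enum_block)
    ICL_EXAMPLES[0] = {"role": "system", "content": refreshed_prompt}

# Example: the user teaches a toy car as a new object the robot can reference
register_object("TOY_CAR", "A small toy car that fits inside the gift bag.")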
We pair this System Prompt with the actual "functions" (behaviors/skills) in the API specification. These are encoded via OpenAI's Function Calling Format, and are similarly updated continuously.
# Initial Seed "Functions" (Primitives)
FUNCTIONS = [
{
"type": "function",
"function": {
"name": "go_home",
"description": "Return to a neutral home position (compliant)."
}
},
{
"type": "function",
"function": {
"name": "goto",
"description": "Move directly to the specified `Object` (compliant).",
"parameters": {
"type": "object",
"properties": {
"object": {
"type": "string",
"description": "An object in the scene (e.g., RIGHT_HAND)."
},
},
"required": ["object"],
}
}
},
{
"type": "function",
"function": {
"name": "grasp",
"description": "Close the gripper at the current position, potentially grasping an object (non-compliant)."
}
},
{
"type": "function",
"function": {
"name": "release",
"description": "Release the currently held object (if any) by fully opening the gripper (compliant)."
}
},
{
"type": "function",
"function": {
"name": "pickup",
"description": "Go to and pick up the specified object (non-compliant).",
"parameters": {
"type": "object",
"properties": {
"object": {
"type": "string",
"description": "An object in the scene (e.g., SCISSORS)."
}
},
"required": ["object"]
}
}
},
]
Given the above, we can generate a plan (a sequence of tool calls with the appropriate arguments) for a new user instruction as follows:
# OpenAI Chat Completion Invocation -- all responses are added to `ICL_EXAMPLES` as running memory
from openai import OpenAI

openai_client = OpenAI(api_key=openai_api_key, organization=organization_id)
llm_response = openai_client.chat.completions.create(
    model="gpt-3.5-turbo-1106",
    messages=[*ICL_EXAMPLES, {"role": "user", "content": "{USER_UTTERANCE}"}],
    temperature=0.2,
    tools=FUNCTIONS,
    tool_choice="auto",
)
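To make the "running memory" comment concrete, here is a minimal sketch of how the returned tool calls might be unpacked, appended to the chat history, and dispatched to the low-level skill policy; `SKILL_LIBRARY` is a hypothetical stand-in for the library of DMP-backed skills, not the paper's exact implementation:

import json

# Minimal sketch (illustrative): `SKILL_LIBRARY` maps function names to low-level skill callables
SKILL_LIBRARY = {
    "go_home": lambda: print("[skill] go_home()"),
    "pickup": lambda object: print(f"[skill] pickup({object})"),
    # ... one entry per function in `FUNCTIONS` ...
}

assistant_message = llm_response.choices[0].message
ICL_EXAMPLES.append({"role": "user", "content": "{USER_UTTERANCE}"})
ICL_EXAMPLES.append(assistant_message)  # running memory of the planner's own responses

for tool_call in assistant_message.tool_calls or []:
    skill_name = tool_call.function.name                   # e.g., "pickup"
    skill_args = json.loads(tool_call.function.arguments)  # e.g., {"object": "CANDY"}
    SKILL_LIBRARY[skill_name](**skill_args)                # execute the corresponding low-level skill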
Finally, a key component of our framework is the ability to teach new high-level behaviors; to do this, we define a special `TEACH()` function that automatically generates the new specification (name, docstring, type signature). We call this explicitly when the user indicates they want to "teach" a new behavior.
TEACH_FUNCTION = [
    {
        "type": "function",
        "function": {
            "name": "teach_function",
            "description": "Signal the user that the behavior or skill they mentioned is not represented in the set of known functions, and needs to be explicitly taught.",
            "parameters": {
                "type": "object",
                "properties": {
                    "new_function_name": {
                        "type": "string",
                        "description": "Informative Python function name for the new behavior/skill that the user needs to add (e.g., `bring_to_user`).",
                    },
                    "new_function_signature": {
                        "type": "string",
                        "description": "List of arguments from the command for the new function (e.g., '[SCISSORS, RIBBON]' or '[]').",
                    },
                    "new_function_description": {
                        "type": "string",
                        "description": "Short description to populate the docstring for the new function (e.g., 'Pick up the specified object and bring it to the user (compliant).').",
                    },
                },
                "required": ["new_function_name", "new_function_signature", "new_function_description"],
            },
        },
    },
]
# Invoking the Teach Function
teach_response = openai_client.chat.completions.create(
    model="gpt-3.5-turbo-1106",
    messages=[*ICL_EXAMPLES, {"role": "user", "content": "{TEACHING_TRACE}"}],
    temperature=0.2,
    tools=TEACH_FUNCTION,
    tool_choice={"type": "function", "function": {"name": "teach_function"}},  # Force invocation
)
The synthesized function is then added to `FUNCTIONS` immediately, so that it can be used as soon as the user provides their next utterance.
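As a rough illustration of this step (simplified relative to the released implementation; in particular, the argument handling here is an assumption), the forced `teach_function` call can be converted into a new entry in the live function-calling spec:

import json

# Rough sketch (argument handling is an assumption for illustration): convert the forced
# `teach_function` call into a new OpenAI function spec and append it to the live API.
teach_args = json.loads(teach_response.choices[0].message.tool_calls[0].function.arguments)

new_function = {
    "type": "function",
    "function": {
        "name": teach_args["new_function_name"],
        "description": teach_args["new_function_description"],
    },
}

# If the synthesized signature takes arguments (e.g., '[SCISSORS, RIBBON]'), expose an `object`
# parameter mirroring the seed primitives above.
if teach_args["new_function_signature"] not in ("", "[]"):
    new_function["function"]["parameters"] = {
        "type": "object",
        "properties": {
            "object": {"type": "string", "description": "An object in the scene (e.g., SCISSORS)."},
        },
        "required": ["object"],
    }

FUNCTIONS.append(new_function)  # usable as soon as the user's next utterance arrives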
We summarize the quantitative results from our user study (N = 8) above. We report robot supervision time [Left], behavior complexity (depth of new functions defined) [Middle] and skill failures [Right]. Over time, users working with Vocal Sandbox systems teach more complex high-level behaviors, see fewer skill failures, and need to supervise the robot for shorter periods of time compared to baselines.
We additionally provide illustrative videos showing various users working with our proposed Vocal Sandbox system from our user study.
These sections provide only complementary details on the full implementation of the Vocal Sandbox framework, and briefly summarize the results from our two experimental settings. Please consult our paper for the complete details!
We additionally provide illustrative videos showing a timelapse of the two hour long continuous collaboration episode for our stop-motion animation application, along with the final minute-long movie.