Embodied-RAG: General Non-parametric Embodied Memory for Retrieval and Generation

Quanting Xie *
So Yeon Min *
Tianyi Zhang
Aarav Bajaj
Ruslan Salakhutdinov
Matthew Johnson-Roberson
Yonatan Bisk
Carnegie Mellon University

[Paper]
[Code coming soon]

Video Presentation




What is needed to construct non-parametric memory for an embodied agent?



There is no limit to how much a robot might explore and learn, but all of that knowledge needs to be searchable and actionable. Within language research, retrieval-augmented generation (RAG) has become the workhorse of large-scale non-parametric knowledge; however, existing techniques do not directly transfer to the embodied domain, where data is multimodal and highly correlated, and perception requires abstraction. To address these challenges, we introduce Embodied-RAG, a framework that enhances the foundation model of an embodied agent with a non-parametric memory system capable of autonomously constructing hierarchical knowledge for both navigation and language generation. Embodied-RAG handles a full range of spatial and semantic resolutions across diverse environments and query types, whether for a specific object or a holistic description of ambiance. At its core, Embodied-RAG's memory is structured as a semantic forest, storing language descriptions at varying levels of detail. This hierarchical organization allows the system to efficiently generate context-sensitive outputs across different robotic platforms. We demonstrate that Embodied-RAG effectively bridges RAG to the robotics domain, successfully handling over 200 explanation and navigation queries across 19 environments, highlighting its promise as a general-purpose non-parametric memory system for embodied agents.

Task

The Embodied-RAG benchmark contains queries drawn from the cross-product of {explicit, implicit, global} question types and {navigation action, language} output types.

A task consists of:

      1. A natural-language query (explicit, implicit, or global).
      2. An expected output: navigation actions plus a text description of the retrieved location (explicit/implicit queries), or a holistic text answer (global queries).

Example tasks are shown in Fig. 1, with instances of explicit, implicit, and global queries. Spatially, the queries range from specific regions small enough to contain certain objects to global regions encompassing the entire scene. Linguistically, global queries are closer to retrieval-augmented generation tasks, while explicit/implicit ones are more retrieval-focused. Explicit and implicit queries are navigational tasks that expect navigation actions and text descriptions of the retrieved location. Global queries, on the other hand, are explanation tasks requiring text generation at a more holistic level. There are no global navigation tasks, as they pertain to larger areas, sometimes the entire environment.
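
For concreteness, a benchmark instance from this cross-product could be represented as in the sketch below. This is a minimal sketch: the `Task` schema and its field names are illustrative assumptions, not the benchmark's released format, and the example queries are taken from the lists later on this page.

      from dataclasses import dataclass
      from typing import Literal

      # Hypothetical schema for one benchmark task; field names are illustrative.
      @dataclass
      class Task:
          query: str                                       # natural-language query
          query_type: Literal["explicit", "implicit", "global"]
          output_type: Literal["navigation", "language"]   # expected generation

      tasks = [
          Task("Find me a water fountain.", "explicit", "navigation"),
          Task("Find me a quiet place to read books.", "implicit", "navigation"),
          Task("Can you describe the overall atmosphere of this environment?",
               "global", "language"),
      ]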



Method


We propose Embodied-RAG, which integrates a topological map with a semantic forest for retrieval-augmented navigation and reasoning. The method operates in three stages, mirroring the Memory, Retrieval, and Generation prompts below:

      1. Memory: as the robot explores, its camera observations are captioned and attached to a topological map, and these descriptions are recursively abstracted into a hierarchical semantic forest.
      2. Retrieval: given a query, the agent traverses the semantic forest from coarse to fine, selecting the best-matching node at each level.
      3. Generation: the agent navigates to the retrieved location (explicit/implicit queries) or produces a holistic text answer from the retrieved descriptions (global queries).
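
To make the retrieval stage concrete, the sketch below descends the semantic forest from coarse to fine, asking an LLM to pick the best child at each level with the Selection Prompt shown later on this page. This is a minimal sketch under assumed interfaces: the `Node` structure, the `llm` callable, and the answer-parsing regex are illustrative, not the released implementation.

      import re
      from dataclasses import dataclass, field

      @dataclass
      class Node:
          description: str    # language description at this abstraction level
          children: list = field(default_factory=list)   # empty for leaf waypoints

      def parse_node_index(answer: str) -> int:
          """Extract i from an answer formatted as '..., Node: <node_i>'."""
          match = re.search(r"Node:\s*<?(?:node_)?(\d+)>?", answer)
          return int(match.group(1)) if match else 0

      def retrieve(root: Node, query: str, llm) -> Node:
          """Coarse-to-fine descent: at each level, the LLM selects the child
          whose description best satisfies the query (Selection Prompt)."""
          node = root
          while node.children:
              options = "\n".join(f"<node_{i}>: {c.description}"
                                  for i, c in enumerate(node.children))
              answer = llm(f"Given the environment descriptions: {options}, "
                           f"select the best one to satisfy the query: {query} "
                           f"and show your reasoning. Structure your answer like "
                           f"this: Reasoning: <reasoning>, Node: <node_i>.")
              node = node.children[parse_node_index(answer)]
          return node   # a leaf: a concrete location in the topological map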


Results

Result Table: Comparison of methods across different tasks.
Result Figure: Example of reasoning output for generation tasks.

Quantitative Results

The result table above compares Embodied-RAG with RAG and Semantic Match across explicit, implicit, and global retrieval tasks. Embodied-RAG consistently outperforms both baselines in small and large environments. Explicit queries show strong performance across all methods, with Embodied-RAG slightly improving over RAG. For implicit queries, Embodied-RAG maintains high success rates, while RAG and Semantic Match drop significantly, particularly in larger environments. On global queries, Embodied-RAG achieves the highest Likert-scale scores, while Semantic Match is inapplicable because it cannot summarize or reason globally.

Qualitative Results

In qualitative comparisons, Embodied-RAG demonstrates superior reasoning capabilities, particularly for implicit and global queries (see the result figure above). For the query “Find where I can buy some drinks?”, Embodied-RAG accurately identifies food service areas, while the baselines retrieve less appropriate results such as refrigerators or water fountains. Similarly, for the query “Find somewhere to take a nap outside”, Embodied-RAG identifies public parks, while RAG and Semantic Match, lacking global context, incorrectly suggest private backyards.



More Demos


Premapping an outdoor-indoor multifunctional environment


Premapping an indoor environment


Implicit Query Demo 2


Explicit Query



Example Queries and Prompts

Embodied Generation Task Examples

Queries

Explicit Retrieval Task

      1. Find me a water fountain.
      2. Find me a sofa.
      3. Find me a fire hydrant.
      

Implicit Retrieval Task

      1. I am dehydrated, find me somewhere.
      2. Find me a quiet place to read books.
      3. Find me a place suitable for having a group discussion.
      4. Find me a place suitable for camping.
      5. Find me somewhere I can play with my children but not on the grass.
      

Global Retrieval Task

      1. Can you describe the overall atmosphere of this environment?
      2. How are the safety features in this environment?
      3. Is this environment prepared for a fire hazard?
      4. Can you describe the overall plant trends in this environment?
      5. Can you describe different areas in this environment?
      

Embodied-RAG Prompts

Memory

Caption Prompt

You are a robot equipped with three front cameras. Given three images, describe the objects you see in a single list, and then describe their spatial relationships.

Abstraction Prompt

Given the environment descriptions {environment descriptions}, can you abstract them into a more general form? Try to infer the environment's intrinsic properties.
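
As a rough illustration of how this prompt could drive memory construction, the sketch below builds the semantic forest bottom-up: nodes are grouped and each group is summarized into a parent. It reuses the illustrative `Node` class from the Method section's retrieval sketch, and the `cluster` grouping function is a hypothetical stand-in for the paper's actual grouping criterion.

      def build_semantic_forest(leaves, llm, cluster):
          """Bottom-up abstraction: repeatedly group nodes (e.g., by spatial
          proximity) and summarize each group into a parent node using the
          Abstraction Prompt above. Returns the roots of the forest."""
          level = list(leaves)   # Node objects holding leaf captions
          while len(level) > 1:
              next_level = []
              for group in cluster(level):
                  descriptions = "; ".join(n.description for n in group)
                  summary = llm(f"Given the environment descriptions {descriptions}, "
                                f"can you abstract them into a more general form? "
                                f"Try to infer the environment's intrinsic properties.")
                  next_level.append(Node(description=summary, children=list(group)))
              if len(next_level) >= len(level):   # no merging occurred; stop
                  break
              level = next_level
          return level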

Retrieval

Selection Prompt

Given the environment descriptions: {environment descriptions}, select the best one to satisfy the query: {query} and show your reasoning. Structure your answer like this: Reasoning: <reasoning>, Node: <node_1>.

Generation

Global Answer Generation

Utilize all available environment descriptions: {environment descriptions}, to provide a comprehensive answer to the question: {query}.
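
A minimal sketch of the global-generation path, under the same assumed `Node` and `llm` interfaces as the sketches above: descriptions are gathered from the semantic forest (the depth cutoff here is an illustrative choice) and passed to the prompt above in a single call.

      def answer_global_query(roots, query, llm, max_depth=2):
          """Flatten descriptions from the semantic forest down to max_depth,
          then generate one holistic answer with the generation prompt above."""
          collected = []
          def walk(node, depth):
              collected.append(node.description)
              if depth < max_depth:
                  for child in node.children:
                      walk(child, depth + 1)
          for root in roots:
              walk(root, 0)
          context = "\n".join(collected)
          return llm(f"Utilize all available environment descriptions: {context}, "
                     f"to provide a comprehensive answer to the question: {query}.")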

Environments

19 diverse environments: 6 real and 13 simulated (AirSim and Habitat Sim)


Paper and BibTeX

[Paper]

Citation
 
Xie, Q., Min, S. Y., Zhang, T., Bajaj, A., Salakhutdinov, R., Johnson-Roberson, M., Bisk, Y.
Embodied-RAG: General Non-parametric Embodied Memory for Retrieval and Generation.

        @misc{xie2024embodiedraggeneralnonparametricembodied,
          title={Embodied-RAG: General non-parametric Embodied Memory for Retrieval and Generation}, 
          author={Quanting Xie and So Yeon Min and Tianyi Zhang and Aarav Bajaj and Ruslan Salakhutdinov and Matthew Johnson-Roberson and Yonatan Bisk},
          year={2024},
          eprint={2409.18313},
          archivePrefix={arXiv},
          primaryClass={cs.RO},
          url={https://arxiv.org/abs/2409.18313}, 
        }