Embodied-RAG: General Non-parametric Embodied Memory for Retrieval and Generation

Quanting Xie
So Yeon Min
Pengliang Ji
Yue Yang
Tianyi Zhang
Aarav Bajaj
Ruslan Salakhutdinov
Matthew Johnson-Roberson
Yonatan Bisk
Carnegie Mellon University

[Paper]
[Code coming soon]

Video Presentation




What is needed to construct non-parametric memory for an embodied agent?



There is no limit to how much a robot might explore and learn, but all of that knowledge needs to be searchable and actionable. Within language research, retrieval augmented generation (RAG) has become the workhorse of large-scale non-parametric knowledge; however, existing techniques do not directly transfer to the embodied domain, which is multimodal, where data is highly correlated, and perception requires abstraction. To address these challenges, we introduce Embodied-RAG, a framework that enhances the foundational model of an embodied agent with a non-parametric memory system capable of autonomously constructing hierarchical knowledge for both navigation and language generation. Embodied-RAG handles a full range of spatial and semantic resolutions across diverse environments and query types, whether for a specific object or a holistic description of ambiance. At its core, Embodied-RAG's memory is structured as a semantic forest, storing language descriptions at varying levels of detail. This hierarchical organization allows the system to efficiently generate context-sensitive outputs across different robotic platforms. We demonstrate that Embodied-RAG effectively bridges RAG to the robotics domain, successfully handling over 250 explanation and navigation queries across kilometer-level environments, highlighting its promise as a general-purpose non-parametric system for embodied agents.
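
To make the memory structure concrete, here is a minimal sketch of a semantic-forest node in Python; the class and field names are illustrative assumptions, not the paper's implementation.

    from __future__ import annotations
    from dataclasses import dataclass, field

    @dataclass
    class MemoryNode:
        """One node in the semantic forest: a language description tied to space."""
        description: str                       # caption (leaf) or abstracted summary (parent)
        position: tuple[float, float] | None   # (x, y) for leaves; None for abstract nodes
        children: list[MemoryNode] = field(default_factory=list)

        @property
        def is_leaf(self) -> bool:
            return not self.children

    # Leaves hold raw captions gathered during exploration; parents hold coarser summaries.
    bench = MemoryNode("A wooden bench under a maple tree.", (12.4, -3.1))
    fountain = MemoryNode("A stone water fountain beside a paved path.", (14.0, -2.2))
    corner = MemoryNode("A quiet park corner with seating and water.", None, [bench, fountain])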

Method


We propose Embodied-RAG, which integrates a topological map and a semantic forest for retrieval-augmented navigation and reasoning. The method operates in three stages: (1) memory construction, where the robot builds a topological map during exploration, captions each node, and recursively abstracts the captions into a semantic forest; (2) retrieval, where a query is answered by traversing the forest top-down, selecting the most relevant branch at each level; and (3) generation, where the retrieved nodes ground either a navigation target or a language response.
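
The sketch below traces these three stages end to end in Python; vlm_caption and llm are placeholder stubs we introduce for illustration (any VLM/LLM client could be substituted), and the memory is flattened to a single level of abstraction for brevity.

    def vlm_caption(images) -> str:
        """Stub: a VLM would describe the three front-camera views here."""
        return "A wooden bench under a maple tree beside a paved path."

    def llm(prompt: str) -> str:
        """Stub: any chat-completion client could be swapped in."""
        return "A quiet park corner with seating."

    def build_memory(observations):
        # Stage 1: caption each topological node, then abstract the captions upward.
        captions = [vlm_caption(images) for images in observations]
        summary = llm(f"Given the environment descriptions {captions}, "
                      "can you abstract them into a more general form?")
        return {"description": summary, "children": captions}

    def retrieve(memory, query: str) -> str:
        # Stage 2: the LLM selects the child description best matching the query.
        return llm(f"Given the environment descriptions: {memory['children']}, "
                   f"select the best one to satisfy the query: {query}")

    def generate(memory, query: str) -> str:
        # Stage 3: the retrieved descriptions ground a comprehensive language answer.
        return llm(f"Utilize all available environment descriptions: "
                   f"{memory['children']}, to answer: {query}")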


Results

Result Figure: Comparison of methods across different tasks.

Result Speed: Comparison of memory-building speed against baselines.

Result Global: Example reasoning output for a global query; Embodied-RAG reasons about the overall environment and provides a comprehensive answer.

Quantitative Results

As shown in the Result Figure, we evaluate Embodied-RAG's performance against several baselines (Naive-RAG, GraphRAG, and LightRAG) across different query types. The baselines show notably poor performance for explicit and implicit queries, primarily due to their limitations in chunking multimodal embodied data into text, which often leads to retrieval failures. However, for global queries, LightRAG and GraphRAG demonstrate better performance than Naive-RAG, showcasing the advantages of their graph structures in generating holistic environmental responses.

Embodied-RAG consistently outperforms all baselines, particularly in explicit and implicit queries across all input types. For global queries, especially under spatial constraint conditions, Embodied-RAG demonstrates superior performance thanks to its flexible hybrid re-ranking approach during retrieval. Furthermore, when provided with additional sensor data, Embodied-RAG's performance shows significant improvement, while baseline performance remains unchanged, highlighting the system's effective integration of multimodal information for complex query understanding and response generation.
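
The text above characterizes the re-ranking only as a flexible hybrid; the scoring below is one plausible reading, assumed for illustration, that blends embedding similarity with spatial proximity whenever the query carries a spatial constraint.

    import math

    def hybrid_rerank(candidates, query_embedding, query_position=None, alpha=0.7):
        """Rank memory nodes by semantic similarity, optionally blended with proximity.

        candidates: dicts with an 'embedding' (list[float]) and a 'position' ((x, y)).
        alpha: weight on semantic similarity versus spatial closeness.
        """
        def cosine(a, b):
            dot = sum(x * y for x, y in zip(a, b))
            norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
            return dot / (norm + 1e-9)

        def score(node):
            sem = cosine(node["embedding"], query_embedding)
            if query_position is None:                # no spatial constraint in the query
                return sem
            dist = math.dist(node["position"], query_position)
            return alpha * sem + (1 - alpha) / (1.0 + dist)   # nearer nodes score higher

        return sorted(candidates, key=score, reverse=True)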

Qualitative Results

In qualitative comparisons, Embodied-RAG demonstrates superior reasoning capabilities, particularly for global queries (see Result Global). For the query "Describe the environment holistically", Embodied-RAG identifies the clustered regions and provides a comprehensive description of the environment, while the baselines retrieve less appropriate results, such as isolated segments of the environment.



More Demos


Premapping an outdoor-indoor multifunctional environment


Real-time retrieval in a 3,000-node, 1 km-diameter environment


Implicit Query Demo 1


Explicit Query Demo


Implicit Query Demo 2

Example Queries and Prompt

Embodied Generation Task Examples

Queries

Explicit Retrieval Task

      1. Find me a water fountain.
      2. Find me a sofa.
      3. Find me a fire hydrant.
      

Implicit Retrieval Task

      1. I am dehydrated, find me somewhere.
      2. Find me a quiet place to read books.
      3. Find me a place suitable for having a group discussion.
      4. Find me a place suitable for camping.
      5. Find me somewhere I can play with my children but not on the grass.
      

Global Retrieval Task

      1. Can you describe the overall atmosphere of this environment?
      2. How are the safety features in this environment?
      3. Is this environment prepared for a fire hazard?
      4. Can you describe the overall plant trends in this environment?
      5. Can you describe different areas in this environment?
      

Embodied-RAG Prompts

Memory

Caption Prompt

You are a robot equipped with three front cameras. Given three images, describe the objects you see in a single list, and then describe their spatial relationships.

Abstraction Prompt

Given the environment descriptions {environment descriptions}, can you abstract them into a more general form? Try to infer the environment's intrinsic properties.
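
A minimal sketch of how this prompt could drive bottom-up forest construction; the clustering criterion (e.g., grouping spatially close nodes) and the dict-based node layout are our assumptions.

    def abstract(descriptions, llm):
        # Fill the Abstraction Prompt above with the children's descriptions.
        return llm(f"Given the environment descriptions {descriptions}, can you "
                   "abstract them into a more general form? Try to infer the "
                   "environment's intrinsic properties.")

    def build_semantic_forest(leaves, cluster, llm):
        """Cluster sibling nodes and summarize each cluster, level by level."""
        level = leaves
        while True:
            groups = cluster(level)            # e.g., group spatially close nodes
            if len(groups) >= len(level):      # no further abstraction possible
                return level                   # current nodes are the forest roots
            level = [{"description": abstract([n["description"] for n in g], llm),
                      "children": g}
                     for g in groups]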

Retrieval

Selection Prompt

Given the environment descriptions: {environment descriptions}, select the best one to satisfy the query: {query} and show your reasoning. Structure your answer like this: Reasoning: <reasoning>, Node: <node_1>.
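
One way this prompt could drive top-down retrieval is sketched below; the regex over the "Node: <node_k>" format and the traversal loop are our assumptions about how the structured answer is consumed.

    import re

    def select_child(children, query, llm):
        descriptions = [c["description"] for c in children]
        reply = llm(f"Given the environment descriptions: {descriptions}, select "
                    f"the best one to satisfy the query: {query} and show your "
                    "reasoning. Structure your answer like this: "
                    "Reasoning: <reasoning>, Node: <node_1>.")
        match = re.search(r"node_(\d+)", reply)
        index = int(match.group(1)) - 1 if match else 0    # fall back to the first child
        return children[min(max(index, 0), len(children) - 1)]

    def descend(root, query, llm):
        node = root
        while node.get("children"):            # keep selecting until a leaf is reached
            node = select_child(node["children"], query, llm)
        return node                            # a leaf maps to a navigable location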

Generation

Global Answer Generation

Utilize all available environment descriptions: {environment descriptions}, to provide a comprehensive answer to the question: {query}.
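
A small usage sketch, assuming the descriptions fed to this prompt come from an upper level of the forest (here, the roots' immediate children); the actual level selection may differ.

    def answer_global(roots, query, llm):
        # Collect descriptions one level below the roots; leafless roots stand in
        # for themselves. This level choice is our simplifying assumption.
        descriptions = [child["description"]
                        for root in roots
                        for child in (root.get("children") or [root])]
        return llm(f"Utilize all available environment descriptions: "
                   f"{descriptions}, to provide a comprehensive answer to the "
                   f"question: {query}.")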

Environments

19 diverse environments: 6 real and 13 simulated (AirSim and Habitat-Sim)


Paper and Bibtex

[Paper]

Citation
 
Xie, Q., Min, S. Y., Ji, P., Yang, Y., Zhang, T., Bajaj, A., Salakhutdinov, R., Johnson-Roberson, M., and Bisk, Y.
Embodied-RAG: General Non-parametric Embodied Memory for Retrieval and Generation.

        @misc{xie2024embodiedraggeneralnonparametricembodied,
          title={Embodied-RAG: General non-parametric Embodied Memory for Retrieval and Generation}, 
          author={Quanting Xie and So Yeon Min and Pengliang Ji and Yue Yang and Tianyi Zhang and Aarav Bajaj and Ruslan Salakhutdinov and Matthew Johnson-Roberson and Yonatan Bisk},
          year={2024},
          eprint={2409.18313},
          archivePrefix={arXiv},
          primaryClass={cs.RO},
          url={https://arxiv.org/abs/2409.18313}, 
        }