Embodied-RAG: General Non-parametric Embodied Memory for Retrieval and Generation

Quanting Xie *
So Yeon Min *
Tianyi Zhang
Aarav Bajaj
Ruslan Salakhutdinov
Matthew Johnson-Roberson
Yonatan Bisk
Carnegie Mellon University

[Paper]
[Code coming soon]

Video Presentation




What is needed to construct non-parametric memory for an embodied agent?



There is no limit to how much a robot might explore and learn, but all of that knowledge needs to be searchable and actionable. Within language research, retrieval-augmented generation (RAG) has become the workhorse of large-scale non-parametric knowledge; however, existing techniques do not directly transfer to the embodied domain, where data is multimodal and highly correlated, and perception requires abstraction. To address these challenges, we introduce Embodied-RAG, a framework that enhances the foundation model of an embodied agent with a non-parametric memory system capable of autonomously constructing hierarchical knowledge for both navigation and language generation. Embodied-RAG handles a full range of spatial and semantic resolutions across diverse environments and query types, whether for a specific object or a holistic description of ambiance. At its core, Embodied-RAG's memory is structured as a semantic forest, storing language descriptions at varying levels of detail. This hierarchical organization allows the system to efficiently generate context-sensitive outputs across different robotic platforms. We demonstrate that Embodied-RAG effectively bridges RAG to the robotics domain, successfully handling over 200 explanation and navigation queries across 19 environments, highlighting its promise as a general-purpose non-parametric memory system for embodied agents.

Task

The Embodied-RAG benchmark contains queries drawn from the cross-product of {explicit, implicit, global} question types and {navigation action, language} output types.

A task consists of:

      1. A natural-language query (explicit, implicit, or global).
      2. An expected output: navigation actions plus a text description of the retrieved location (explicit/implicit queries), or a holistic text answer (global queries).

Example tasks are shown in Fig. 1, with instances of explicit, implicit, and global queries. Spatially, the queries range from specific regions small enough to contain certain objects to global regions encompassing the entire scene. Linguistically, global queries are closer to retrieval-augmented generation tasks, while explicit/implicit ones are more retrieval-focused. Explicit and implicit queries are navigational tasks that expect navigation actions and text descriptions of the retrieved location. Global queries, on the other hand, are explanation tasks requiring text generation at a more holistic level. There are no global navigation tasks, as they pertain to larger areas, sometimes the entire environment.
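
For concreteness, a benchmark instance from this cross-product could be represented as in the sketch below. This is a minimal sketch: the `Task` schema and its field names are illustrative assumptions, not the benchmark's released format, and the example queries are taken from the lists later on this page.

      from dataclasses import dataclass
      from typing import Literal

      # Hypothetical schema for one benchmark task; field names are illustrative.
      @dataclass
      class Task:
          query: str                                       # natural-language query
          query_type: Literal["explicit", "implicit", "global"]
          output_type: Literal["navigation", "language"]   # expected generation

      tasks = [
          Task("Find me a water fountain.", "explicit", "navigation"),
          Task("Find me a quiet place to read books.", "implicit", "navigation"),
          Task("Can you describe the overall atmosphere of this environment?",
               "global", "language"),
      ]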



Method


We propose Embodied-RAG, which integrates a topological map with a semantic forest for retrieval-augmented navigation and reasoning. The method operates in three stages, mirroring the Memory, Retrieval, and Generation prompts below:

      1. Memory: as the robot explores, its camera observations are captioned and attached to a topological map, and these descriptions are recursively abstracted into a hierarchical semantic forest.
      2. Retrieval: given a query, the agent traverses the semantic forest from coarse to fine, selecting the best-matching node at each level.
      3. Generation: the agent navigates to the retrieved location (explicit/implicit queries) or produces a holistic text answer from the retrieved descriptions (global queries).
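
To make the retrieval stage concrete, the sketch below descends the semantic forest from coarse to fine, asking an LLM to pick the best child at each level with the Selection Prompt shown later on this page. This is a minimal sketch under assumed interfaces: the `Node` structure, the `llm` callable, and the answer-parsing regex are illustrative, not the released implementation.

      import re
      from dataclasses import dataclass, field

      @dataclass
      class Node:
          description: str    # language description at this abstraction level
          children: list = field(default_factory=list)   # empty for leaf waypoints

      def parse_node_index(answer: str) -> int:
          """Extract i from an answer formatted as '..., Node: <node_i>'."""
          match = re.search(r"Node:\s*<?(?:node_)?(\d+)>?", answer)
          return int(match.group(1)) if match else 0

      def retrieve(root: Node, query: str, llm) -> Node:
          """Coarse-to-fine descent: at each level, the LLM selects the child
          whose description best satisfies the query (Selection Prompt)."""
          node = root
          while node.children:
              options = "\n".join(f"<node_{i}>: {c.description}"
                                  for i, c in enumerate(node.children))
              answer = llm(f"Given the environment descriptions: {options}, "
                           f"select the best one to satisfy the query: {query} "
                           f"and show your reasoning. Structure your answer like "
                           f"this: Reasoning: <reasoning>, Node: <node_i>.")
              node = node.children[parse_node_index(answer)]
          return node   # a leaf: a concrete location in the topological map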


Results

Result Table: Comparison of methods across different tasks.
Result Figure: Example of reasoning output for generation tasks.

Quantitative Results

The result table above compares Embodied-RAG with RAG and Semantic Match across explicit, implicit, and global retrieval tasks. Embodied-RAG consistently outperforms both baselines in small and large environments. Explicit queries show strong performance across all methods, with Embodied-RAG slightly improving over RAG. For implicit queries, Embodied-RAG maintains high success rates, while RAG and Semantic Match drop significantly, particularly in larger environments. On global queries, Embodied-RAG achieves the highest Likert-scale scores, while Semantic Match is inapplicable because it cannot summarize or reason globally.

Qualitative Results

In qualitative comparisons, Embodied-RAG demonstrates superior reasoning capabilities, particularly for implicit and global queries (see the result figure above). For the query “Find where I can buy some drinks?”, Embodied-RAG accurately identifies food service areas, while the baselines retrieve less appropriate results such as refrigerators or water fountains. Similarly, for the query “Find somewhere to take a nap outside”, Embodied-RAG identifies public parks, while RAG and Semantic Match, lacking global context, incorrectly suggest private backyards.



More Demos


Premapping an outdoor-indoor multifunctional environment


Premapping an indoor environment


Implicit Query Demo 2


Explicit Query



Example Queries and Prompts

Embodied Generation Task Examples

Queries

Explicit Retrieval Task

      1. Find me a water fountain.
      2. Find me a sofa.
      3. Find me a fire hydrant.
      

Implicit Retrieval Task

      1. I am dehydrated, find me somewhere.
      2. Find me a quiet place to read books.
      3. Find me a place suitable for having a group discussion.
      4. Find me a place suitable for camping.
      5. Find me somewhere I can play with my children but not on the grass.
      

Global Retrieval Task

      1. Can you describe the overall atmosphere of this environment?
      2. How are the safety features in this environment?
      3. Is this environment prepared for a fire hazard?
      4. Can you describe the overall plant trends in this environment?
      5. Can you describe different areas in this environment?
      

Embodied-RAG Prompts

Memory

Caption Prompt

You are a robot equipped with three front cameras. Given three images, describe the objects you see in a single list, and then describe their spatial relationships.

Abstraction Prompt

Given the environment descriptions {environment descriptions}, can you abstract them into a more general form? Try to infer the environment's intrinsic properties.
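
As a rough illustration of how this prompt could drive memory construction, the sketch below builds the semantic forest bottom-up: nodes are grouped and each group is summarized into a parent. It reuses the illustrative `Node` class from the Method section's retrieval sketch, and the `cluster` grouping function is a hypothetical stand-in for the paper's actual grouping criterion.

      def build_semantic_forest(leaves, llm, cluster):
          """Bottom-up abstraction: repeatedly group nodes (e.g., by spatial
          proximity) and summarize each group into a parent node using the
          Abstraction Prompt above. Returns the roots of the forest."""
          level = list(leaves)   # Node objects holding leaf captions
          while len(level) > 1:
              next_level = []
              for group in cluster(level):
                  descriptions = "; ".join(n.description for n in group)
                  summary = llm(f"Given the environment descriptions {descriptions}, "
                                f"can you abstract them into a more general form? "
                                f"Try to infer the environment's intrinsic properties.")
                  next_level.append(Node(description=summary, children=list(group)))
              if len(next_level) >= len(level):   # no merging occurred; stop
                  break
              level = next_level
          return level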

Retrieval

Selection Prompt

Given the environment descriptions: {environment descriptions}, select the best one to satisfy the query: {query} and show your reasoning. Structure your answer like this: Reasoning: <reasoning>, Node: <node_1>.

Generation

Global Answer Generation

Utilize all available environment descriptions: {environment descriptions}, to provide a comprehensive answer to the question: {query}.
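
A minimal sketch of the global-generation path, under the same assumed `Node` and `llm` interfaces as the sketches above: descriptions are gathered from the semantic forest (the depth cutoff here is an illustrative choice) and passed to the prompt above in a single call.

      def answer_global_query(roots, query, llm, max_depth=2):
          """Flatten descriptions from the semantic forest down to max_depth,
          then generate one holistic answer with the generation prompt above."""
          collected = []
          def walk(node, depth):
              collected.append(node.description)
              if depth < max_depth:
                  for child in node.children:
                      walk(child, depth + 1)
          for root in roots:
              walk(root, 0)
          context = "\n".join(collected)
          return llm(f"Utilize all available environment descriptions: {context}, "
                     f"to provide a comprehensive answer to the question: {query}.")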

Environments

19 diverse environments: 6 real and 13 simulated (AirSim and Habitat Sim)


Paper and BibTeX

[Paper]

Citation
 
Xie, Q., Min, S. Y., Zhang, T., Bajaj, A., Salakhutdinov, R., Johnson-Roberson, M., Bisk, Y.
Embodied-RAG: General Non-parametric Embodied Memory for Retrieval and Generation.

        @misc{xie2024embodiedraggeneralnonparametricembodied,
          title={Embodied-RAG: General non-parametric Embodied Memory for Retrieval and Generation}, 
          author={Quanting Xie and So Yeon Min and Tianyi Zhang and Aarav Bajaj and Ruslan Salakhutdinov and Matthew Johnson-Roberson and Yonatan Bisk},
          year={2024},
          eprint={2409.18313},
          archivePrefix={arXiv},
          primaryClass={cs.RO},
          url={https://arxiv.org/abs/2409.18313}, 
        }