We propose Embodied-RAG, which integrates a topological map and a semantic forest for retrieval-augmented navigation and reasoning. The method operates in three stages:
[Figures: the three stages of Embodied-RAG]
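To make the two data structures concrete, here is a minimal sketch of how a topological map and a semantic forest over it could be represented. The class names, fields (`position`, `caption`, `summary`), and the 2-D coordinates are assumptions for illustration, not the authors' implementation.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set, Tuple


@dataclass
class MapNode:
    """One mapped place in the topological map (hypothetical fields)."""
    node_id: int
    position: Tuple[float, float]  # assumed 2-D coordinates in metres
    caption: str                   # text description of what was observed here


@dataclass
class TopologicalMap:
    nodes: Dict[int, MapNode] = field(default_factory=dict)
    edges: Dict[int, Set[int]] = field(default_factory=dict)

    def add_node(self, node: MapNode) -> None:
        self.nodes[node.node_id] = node
        self.edges.setdefault(node.node_id, set())

    def connect(self, a: int, b: int) -> None:
        # Undirected edge between two mapped places.
        self.edges[a].add(b)
        self.edges[b].add(a)


@dataclass
class ForestNode:
    """Semantic-forest node: a leaf points at one MapNode, an inner node summarizes its children."""
    summary: str
    children: List["ForestNode"] = field(default_factory=list)
    map_node_id: Optional[int] = None

    def leaf_ids(self) -> List[int]:
        # Collect the ids of all map nodes covered by this subtree, left to right.
        if not self.children:
            return [] if self.map_node_id is None else [self.map_node_id]
        ids: List[int] = []
        for child in self.children:
            ids.extend(child.leaf_ids())
        return ids


# Tiny hand-built example: two places and one parent that abstracts them.
tmap = TopologicalMap()
tmap.add_node(MapNode(0, (0.0, 0.0), "water fountain next to a bench"))
tmap.add_node(MapNode(1, (3.0, 0.0), "sofa in a lounge"))
tmap.connect(0, 1)

leaves = [ForestNode(summary=n.caption, map_node_id=n.node_id) for n in tmap.nodes.values()]
root = ForestNode(summary="indoor rest area", children=leaves)
```

In this sketch the map keeps the spatial connectivity, while each forest root gives a coarse description that can be refined by descending to its children.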
As shown in the Result Figure, we evaluate Embodied-RAG against several baselines (Naive-RAG, GraphRAG, and LightRAG) across different query types. The baselines perform notably poorly on explicit and implicit queries, primarily because chunking multimodal embodied data into text often leads to retrieval failures. For global queries, however, LightRAG and GraphRAG outperform Naive-RAG, which shows the advantage of their graph structures in generating holistic responses about the environment.
Embodied-RAG consistently outperforms all baselines, particularly on explicit and implicit queries, across all input types. For global queries, especially under spatial constraints, Embodied-RAG performs better thanks to its flexible hybrid re-ranking during retrieval. Furthermore, when given additional sensor data, Embodied-RAG improves significantly while baseline performance remains unchanged, which highlights its effective integration of multimodal information for complex query understanding and response generation.
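The text does not spell out how the hybrid re-ranking is computed. As a rough illustration only, the sketch below assumes a weighted sum of semantic similarity (cosine over embeddings) and spatial proximity (exponential decay with distance). The function name `hybrid_rerank`, the weight `alpha`, and the length scale `scale_m` are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Candidate = Tuple[str, Sequence[float], Tuple[float, float]]  # (node_id, embedding, position)


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def hybrid_rerank(
    query_vec: Sequence[float],
    query_pos: Tuple[float, float],
    candidates: List[Candidate],
    alpha: float = 0.5,     # assumed weight: 1.0 = purely semantic, 0.0 = purely spatial
    scale_m: float = 50.0,  # assumed distance scale in metres
) -> List[str]:
    """Return node ids sorted by a weighted mix of semantic and spatial scores."""
    scored = []
    for node_id, embedding, position in candidates:
        semantic = cosine(query_vec, embedding)
        proximity = math.exp(-math.dist(query_pos, position) / scale_m)
        scored.append((alpha * semantic + (1.0 - alpha) * proximity, node_id))
    scored.sort(reverse=True)
    return [node_id for _, node_id in scored]


candidates = [
    ("far_match", [1.0, 0.0], (100.0, 0.0)),
    ("near_match", [1.0, 0.0], (5.0, 0.0)),
    ("near_other", [0.0, 1.0], (5.0, 0.0)),
]
balanced = hybrid_rerank([1.0, 0.0], (0.0, 0.0), candidates, alpha=0.5)
spatial_heavy = hybrid_rerank([1.0, 0.0], (0.0, 0.0), candidates, alpha=0.1)
```

Lowering `alpha` in this sketch moves nearby but less relevant nodes ahead of distant relevant ones, which is the kind of trade-off a spatially constrained query would need.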
In qualitative comparisons, Embodied-RAG demonstrates superior reasoning, particularly for global queries (see Result Global). For the query "Describe the environment holistically", Embodied-RAG identifies the clustered regions and provides a comprehensive description of the environment, whereas the baselines retrieve less appropriate results, such as segments of the environment.
Premapping an outdoor-indoor multifunctional environment
Real-time retrieval in a 3,000-node, 1 km diameter environment
Implicit Query Demo 1
Explicit Query
Implicit Query Demo 2
Explicit queries:
1. Find me a water fountain.
2. Find me a sofa.
3. Find me a fire hydrant.

Implicit queries:
1. I am dehydrated, find me somewhere.
2. Find me a quiet place to read books.
3. Find me a place suitable for having a group discussion.
4. Find me a place suitable for camping.
5. Find me somewhere I can play with my children but not on the grass.

Global queries:
1. Can you describe the overall atmosphere of this environment?
2. How are the safety features in this environment?
3. Is this environment prepared for a fire hazard?
4. Can you describe the overall plant trends in this environment?
5. Can you describe different areas in this environment?
You are a robot equipped with three front cameras. Given three images, describe the objects you see in a single list, and then describe their spatial relationships.
Given the environment descriptions {environment descriptions}, can you abstract them into a more general form? Try to infer the environment's intrinsic properties.
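One way an abstraction prompt like this could be applied is level by level, turning groups of descriptions into more general parent descriptions until a single root remains. The sketch below assumes a caller-supplied `llm` callable (prompt in, text out) and a fixed `group_size`; both are hypothetical, and the section does not state how descriptions are actually grouped.

```python
from typing import Callable, List

ABSTRACT_PROMPT = (
    "Given the environment descriptions {environment descriptions}, can you abstract "
    "them into a more general form? Try to infer the environment's intrinsic properties."
)


def build_abstract_prompt(descriptions: List[str]) -> str:
    # Plain string replacement, since the placeholder contains a space.
    return ABSTRACT_PROMPT.replace("{environment descriptions}", "; ".join(descriptions))


def abstract_levels(
    leaves: List[str],
    llm: Callable[[str], str],  # hypothetical model call: prompt text -> summary text
    group_size: int = 2,        # assumed fixed grouping, for illustration only
) -> List[List[str]]:
    """Return all levels of descriptions, from the leaves up to a single root."""
    if group_size < 2:
        raise ValueError("group_size must be at least 2 so each level shrinks")
    levels = [list(leaves)]
    while len(levels[-1]) > 1:
        current = levels[-1]
        parents = [
            llm(build_abstract_prompt(current[i:i + group_size]))
            for i in range(0, len(current), group_size)
        ]
        levels.append(parents)
    return levels


# Stand-in for a real model so the sketch runs offline: it just counts its calls.
calls: List[str] = []


def fake_llm(prompt: str) -> str:
    calls.append(prompt)
    return f"summary {len(calls)}"


levels = abstract_levels(["bench by a lawn", "oak trees", "picnic table"], fake_llm)
```

With three leaf descriptions and `group_size=2`, this produces three levels: the leaves, two intermediate summaries, and one root summary.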
Given the environment descriptions: {environment descriptions}, select the best one to satisfy the query: {query} and show your reasoning. Structure your answer like this: Reasoning: <reasoning> , Node: <node_1>.
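Because the selection prompt asks for a fixed answer layout, the reply can be parsed mechanically. The following sketch fills the template and extracts the reasoning and node name with a regular expression. The helper names, the `node_id: text` listing format, and the tolerance for optional angle brackets are assumptions.

```python
import re
from typing import List, Optional, Tuple

SELECT_PROMPT = (
    "Given the environment descriptions: {environment descriptions}, select the best one "
    "to satisfy the query: {query} and show your reasoning. Structure your answer like this: "
    "Reasoning: <reasoning> , Node: <node_1>."
)

# Accepts "Reasoning: ... , Node: node_17" as well as "Node: <node_17>".
_ANSWER_RE = re.compile(
    r"Reasoning:\s*(?P<reasoning>.*?)\s*,?\s*Node:\s*<?(?P<node>[\w-]+)>?",
    re.DOTALL,
)


def fill_selection_prompt(descriptions: List[Tuple[str, str]], query: str) -> str:
    """descriptions is a list of (node_id, text) pairs; the listing format is an assumption."""
    listed = "; ".join(f"{node_id}: {text}" for node_id, text in descriptions)
    return (
        SELECT_PROMPT
        .replace("{environment descriptions}", listed)
        .replace("{query}", query)
    )


def parse_selection(reply: str) -> Optional[Tuple[str, str]]:
    """Return (reasoning, node) or None if the reply does not follow the layout."""
    match = _ANSWER_RE.search(reply)
    if match is None:
        return None
    return match.group("reasoning"), match.group("node")


prompt = fill_selection_prompt(
    [("node_3", "quiet reading corner"), ("node_17", "water fountain")],
    "I am dehydrated, find me somewhere.",
)
parsed = parse_selection("Reasoning: The fountain provides drinking water. , Node: node_17")
```

Returning `None` for malformed replies lets the caller decide whether to retry the model or fall back to another candidate.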
Utilize all available environment descriptions: {environment descriptions}, to provide a comprehensive answer to the question: {query}.