The Embodied-RAG benchmark contains queries drawn from the cross-product of {explicit, implicit, global} question types with {navigational action, language} generation outputs.
Example tasks are shown in Fig. 1, with instances of explicit, implicit, and global queries. Spatially, the queries range from regions small enough to contain a specific object to global regions encompassing the entire scene. Linguistically, global queries are closer to retrieval-augmented generation tasks, while explicit and implicit queries are more retrieval-focused. Explicit and implicit queries are navigational tasks that expect navigation actions and a text description of the retrieved location. Global queries, by contrast, are explanation tasks requiring text generation at a more holistic level. There are no global navigation tasks, because global queries pertain to large areas, sometimes the entire environment.
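As a concrete illustration, the task space described above can be enumerated as a cross-product with the global-navigation cell removed. This is a minimal sketch; the identifiers below are illustrative, not taken from the benchmark's codebase.

```python
from itertools import product

QUERY_TYPES = ("explicit", "implicit", "global")
OUTPUT_TYPES = ("navigational action", "language")

# Global queries are explanation tasks, so the (global, navigational action)
# combination is excluded from the benchmark's task space.
TASKS = [
    (query, output)
    for query, output in product(QUERY_TYPES, OUTPUT_TYPES)
    if not (query == "global" and output == "navigational action")
]
```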
We propose Embodied-RAG, which integrates a topological map and a semantic forest for retrieval-augmented navigation and reasoning. The method operates in three stages: (1) memory building, where the robot constructs a topological map whose nodes are captioned from camera images and then recursively abstracts those captions into a semantic forest; (2) retrieval, where the forest is traversed from its roots toward the leaves to find the nodes most relevant to a query; and (3) generation, which produces either navigation actions toward the retrieved node or a language response.
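To make these stages concrete, the following minimal sketch shows the two memory structures they operate over. The class and field names (MapNode, ForestNode, caption, summary) are illustrative assumptions, not the paper's actual implementation.

```python
from dataclasses import dataclass, field

@dataclass
class MapNode:
    """Leaf of the memory: one pose in the topological map."""
    node_id: str
    position: tuple  # (x, y) in the map frame
    caption: str     # VLM description of the images captured at this pose

@dataclass
class ForestNode:
    """Inner node of the semantic forest: an abstraction over its children."""
    summary: str
    children: list = field(default_factory=list)  # ForestNode or MapNode
```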
The Result Table presents the performance of Embodied-RAG compared to the RAG and Semantic Match baselines across explicit, implicit, and global retrieval tasks. Embodied-RAG consistently outperforms both baselines in small and large environments. On explicit queries, all methods perform strongly, with Embodied-RAG slightly ahead of RAG. On implicit queries, Embodied-RAG maintains high success rates, while RAG and Semantic Match drop significantly, particularly in larger environments. On global queries, Embodied-RAG achieves the highest Likert-scale scores; Semantic Match is inapplicable here because it cannot summarize or reason globally.
In qualitative comparisons, Embodied-RAG demonstrates superior reasoning, particularly on implicit and global queries (see the Result Figure). For the query “Find where I can buy some drinks?”, Embodied-RAG correctly identifies food service areas, while the baselines retrieve less appropriate results such as refrigerators or water fountains. Similarly, for the query “Find somewhere to take a nap outside”, Embodied-RAG identifies public parks, while RAG and Semantic Match, lacking global context, incorrectly suggest private backyards.
Demo videos: premapping an outdoor-indoor multifunctional environment; premapping an indoor environment; implicit query demo 2; explicit query demo.
Explicit queries: 1. Find me a water fountain. 2. Find me a sofa. 3. Find me a fire hydrant.
Implicit queries: 1. I am dehydrated, find me somewhere. 2. Find me a quiet place to read books. 3. Find me a place suitable for having a group discussion. 4. Find me a place suitable for camping. 5. Find me somewhere I can play with my children but not on the grass.
Global queries: 1. Can you describe the overall atmosphere of this environment? 2. How are the safety features in this environment? 3. Is this environment prepared for a fire hazard? 4. Can you describe the overall plant trends in this environment? 5. Can you describe different areas in this environment?
Scene description prompt: You are a robot equipped with three front cameras. Given three images, describe the objects you see in a single list, and then describe their spatial relationships.
Abstraction prompt: Given the environment descriptions {environment descriptions}, can you abstract them into a more general form? Try to infer the environment's intrinsic properties.
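As a rough sketch of how this abstraction prompt could drive bottom-up construction of the semantic forest (reusing the MapNode/ForestNode classes sketched earlier): cluster the current level of nodes, summarize each cluster with the prompt, and recurse until a single root remains. The llm and cluster_fn callables are hypothetical stand-ins; the paper's actual clustering and prompting pipeline may differ.

```python
def build_forest(leaves, llm, cluster_fn):
    """Recursively abstract node descriptions into a semantic forest.

    leaves:     list of MapNode (captioned topological-map nodes)
    llm:        hypothetical callable, prompt string -> completion string
    cluster_fn: hypothetical callable grouping a level of nodes into
                spatially coherent clusters (a list of lists)
    """
    level = list(leaves)
    while len(level) > 1:
        next_level = []
        for group in cluster_fn(level):
            descriptions = "\n".join(
                n.summary if isinstance(n, ForestNode) else n.caption
                for n in group
            )
            prompt = (
                f"Given the environment descriptions {descriptions}, can you "
                "abstract them into a more general form? Try to infer the "
                "environment's intrinsic properties."
            )
            next_level.append(ForestNode(summary=llm(prompt), children=group))
        level = next_level
    return level[0]  # root node summarizing the entire environment
```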
Retrieval prompt: Given the environment descriptions: {environment descriptions}, select the best one to satisfy the query: {query} and show your reasoning. Structure your answer like this: Reasoning: <reasoning> , Node: <node_1>.
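The structured "Reasoning: ..., Node: ..." format makes the answer machine-parsable. Below is a hedged sketch of the top-down retrieval loop this enables, again assuming the hypothetical llm callable and the forest sketched above; a regex recovers the selected child index from the reply.

```python
import re

def retrieve(root, query, llm):
    """Descend the semantic forest, letting the LLM pick a child per level."""
    node = root
    while isinstance(node, ForestNode) and node.children:
        listing = "\n".join(
            f"<node_{i}>: "
            + (c.summary if isinstance(c, ForestNode) else c.caption)
            for i, c in enumerate(node.children)
        )
        prompt = (
            f"Given the environment descriptions: {listing}, select the best "
            f"one to satisfy the query: {query} and show your reasoning. "
            "Structure your answer like this: "
            "Reasoning: <reasoning> , Node: <node_1>."
        )
        answer = llm(prompt)
        match = re.search(r"node_(\d+)", answer)
        index = int(match.group(1)) if match else 0  # fall back if malformed
        node = node.children[min(index, len(node.children) - 1)]
    return node  # a MapNode: the navigation goal for explicit/implicit queries
```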
Global generation prompt: Utilize all available environment descriptions: {environment descriptions}, to provide a comprehensive answer to the question: {query}.
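For global queries, one plausible reading of this prompt is a single generation call over descriptions gathered from a shallow level of the forest rather than a root-to-leaf descent. The depth parameter and traversal below are illustrative assumptions, not the paper's confirmed procedure.

```python
def answer_global(root, query, llm, depth=2):
    """Answer a holistic query from summaries near the top of the forest."""
    frontier = [root]
    for _ in range(depth):  # expand a few levels to gather broader context
        expanded = [
            child
            for node in frontier
            for child in (node.children if isinstance(node, ForestNode) else [node])
        ]
        frontier = expanded or frontier
    descriptions = "\n".join(
        n.summary if isinstance(n, ForestNode) else n.caption for n in frontier
    )
    prompt = (
        f"Utilize all available environment descriptions: {descriptions}, "
        f"to provide a comprehensive answer to the question: {query}."
    )
    return llm(prompt)
```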