Authors: Vardhan Dongre, Dilek Hakkani-Tür, Gokhan Tur

Date: March 8, 2025

Reimagining Language in Embodied AI: Beyond Commands to Cognitive Partnership

In the rapidly evolving landscape of artificial intelligence (AI), we find ourselves at a pivotal moment where language capabilities and physical embodiment are poised to converge in unprecedented ways. While Large Language Models (LLMs) have revolutionized how machines process and generate language (OpenAI et al. 2024; Wei et al. 2022), a crucial frontier remains unconquered: the seamless integration of this linguistic prowess with physical embodiment. The vision of embodied conversational agents discussed by Cassell et al. (2000) can be extended to robots (physical embodiments) that not only execute tasks in your environment but also engage in natural, contextually grounded dialogue with their human partners, representing the next great challenge in AI.

Just as humans learn language through physical interaction with their environment, grounding words in actions, perceptions, and experiences, these future agents must bridge the gap between abstract language understanding and concrete physical reality. Recent advances in foundation models (Zhou et al. 2024; Hu et al. 2024) have given us a glimpse of what's possible in language understanding, but the journey toward embodied agents that can converse naturally while navigating the physical world is just beginning. This intersection of language and embodiment isn't merely about functional communication (Hu et al. 2025) – giving and receiving instructions – but about creating agents that can engage in fluid, natural dialogue (Dongre et al. 2024), ask clarifying questions when uncertain (Singh et al. 2022; Ren et al. 2023; Min et al. 2025), and build shared understanding through interaction (Narayan-Chen et al. 2020; Bara et al. 2021). As we stand at this threshold, the challenge before us is clear: how do we move from today's largely disconnected language and robotics systems toward truly integrated agents that can participate in the kind of rich, grounded conversations that characterize human interaction?

Artist: WARPSoL

“Embodied agents are intelligent systems integrated with a physical form that enables them to interact with their surroundings. Unlike traditional AI systems, embodied agents have the ability to perceive through sensors, reason, and act within their environments through actuators. These agents combine sensory inputs (e.g., vision, touch, and sound) with advanced decision-making algorithms to perform tasks ranging from navigation and manipulation to complex problem-solving. Embodied Conversational Agents are specifically humanlike in the way they use language along with their embodiment during conversations.”
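
To make this perceive-reason-act framing concrete, here is a minimal, illustrative Python sketch (not from the original post) of one way such a loop might be structured. The Observation fields, the perceive/act placeholders, and the ScriptedPolicy stand-in for an LLM/VLM-backed reasoning module are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional, Protocol, Tuple


@dataclass
class Observation:
    """Bundle of multimodal sensor readings (illustrative fields only)."""
    image: bytes              # e.g., a camera frame
    audio: bytes              # e.g., a microphone buffer
    utterance: Optional[str]  # last transcribed user utterance, if any


class Policy(Protocol):
    """Anything that maps an observation to an action and an optional spoken reply."""
    def decide(self, obs: Observation) -> Tuple[str, Optional[str]]: ...


class ScriptedPolicy:
    """Trivial stand-in for an LLM/VLM-backed reasoning module."""
    def decide(self, obs: Observation) -> Tuple[str, Optional[str]]:
        if obs.utterance:
            return "navigate_to(desk)", f"Okay, I heard: '{obs.utterance}'."
        return "idle", None


def perceive() -> Observation:
    # Placeholder: a real agent would read cameras, microphones, and an ASR module here.
    return Observation(image=b"", audio=b"", utterance="Please tidy the desk.")


def act(action: str, reply: Optional[str]) -> None:
    # Placeholder: a real agent would drive actuators and a text-to-speech system here.
    print(f"executing: {action}")
    if reply:
        print(f"saying: {reply}")


def run_episode(policy: Policy, max_steps: int = 3) -> None:
    """The core perceive -> reason -> act loop of an embodied agent."""
    for _ in range(max_steps):
        obs = perceive()
        action, reply = policy.decide(obs)
        act(action, reply)


if __name__ == "__main__":
    run_episode(ScriptedPolicy())
```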

Embodied Conversational Reasoning Demonstration by Figure 01 + OpenAI. Credits: Figure AI

The Promise of Foundation Models: A New Horizon

The quest for general-purpose robots—capable of seamlessly operating in diverse environments and performing an array of complex tasks—has long been an aspiration in robotics. While traditional methods have made significant strides, they face challenges in generalization, data scarcity, and adaptability. Enter foundation models, including LLMs and Vision-Language Models (VLMs), which are now emerging as transformative tools in this space (Hu et al. 2024). The advent of foundation models has dramatically reshaped our understanding of what's possible in human-machine communication. These models demonstrate remarkable capabilities in understanding context, maintaining coherent conversations, and even exhibiting common-sense reasoning. However, their knowledge remains largely disconnected from the physical world – they can discuss actions without truly understanding their physical implications or grounding them in real-world experience.

Figure 1: A Typical Architecture of Foundation Models enabling embodied systems

This disconnection highlights one of the central challenges in developing conversational embodied agents: bridging the gap between language understanding and physical reality. It's not enough for an agent to process language fluently; it must understand how words relate to actions, objects, and physical constraints. When a human says, "Could you help me organize these books?" the agent needs to understand not just the words, but the physical implications – recognizing books, understanding spatial relationships, and knowing the practical constraints of book arrangement.
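
As a rough illustration of what such grounding involves, the following small Python sketch (not from the post) uses a deliberately naive keyword matcher in place of a learned, VLM-based grounder: the agent ties the words in a request to objects it has detected, checks a simple physical constraint, and falls back to a clarifying question when grounding fails. All object fields and thresholds here are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class DetectedObject:
    label: str
    position: Tuple[float, float, float]  # metres in the robot frame (illustrative)
    graspable: bool


def ground_instruction(instruction: str, scene: List[DetectedObject]) -> dict:
    """Map words in an instruction to perceived objects and check feasibility.

    Returns either a grounded plan or a clarification request. The keyword
    matching below is a placeholder for a learned language grounder.
    """
    text = instruction.lower()
    targets = [obj for obj in scene if obj.label in text]
    if not targets:
        return {"clarify": "I don't see the objects you mentioned. "
                           "Could you point to them or describe where they are?"}
    too_heavy = [obj for obj in targets if not obj.graspable]
    if too_heavy:
        names = ", ".join(obj.label for obj in too_heavy)
        return {"clarify": f"The {names} looks too big for me to lift. Should I push it instead?"}
    # Grounded plan: pick up each referenced object, then place it on the shelf.
    steps = [("pick", obj.label, obj.position) for obj in targets]
    steps.append(("place", "shelf", None))
    return {"plan": steps}


scene = [
    DetectedObject("book", (0.4, -0.1, 0.02), graspable=True),
    DetectedObject("book", (0.6, 0.2, 0.02), graspable=True),
]
print(ground_instruction("Could you help me organize these books?", scene))
```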

A Case for Conversationality in Embodied Agents


The conversational abilities of embodied agents are critical for seamless interaction with humans. These agents leverage conversational AI to understand natural language commands, ask clarifying questions, and provide feedback. With the integration of LLMs, such as GPT and PaLM-E (Driess et al. 2023), embodied agents have become proficient at handling complex queries, generating human-like responses, and even engaging in multi-turn dialogues (Team Scotty bot, 2023). LLMs enhance these capabilities by improving contextual understanding, reasoning, and the ability to adapt to different conversational styles (see Figure 2). For example, an embodied agent in a home setting can ask a user about their preferences for cleaning or cooking tasks, explain its decisions, or even learn user-specific patterns over time. This synergy between conversational AI and embodied intelligence is changing the landscape of human-robot interaction, making robots more intuitive, accessible, and aligned with human needs.
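
A sketch of what such a multi-turn exchange could look like in code is shown below; it is purely illustrative, and query_llm is a hypothetical offline stand-in for a real model call rather than any specific API.

```python
def query_llm(prompt: str) -> str:
    """Hypothetical stand-in for an LLM call; a real system would use a model client.

    The response is faked so the example runs offline: the 'model' asks about the
    user's preference once, then commits to an action on the next turn.
    """
    if "preference: unknown" in prompt:
        return "ASK: Should I vacuum the living room before or after lunch?"
    return "ACT: schedule_vacuum(living_room, after_lunch)"


def dialogue_turn(user_utterance: str, memory: dict) -> str:
    """One turn of an LLM-mediated conversation with a simple preference memory."""
    prompt = (
        f"user said: {user_utterance}\n"
        f"preference: {memory.get('cleaning_time', 'unknown')}\n"
        "Reply with ASK: <question> or ACT: <action>."
    )
    reply = query_llm(prompt)
    if reply.startswith("ASK:"):
        return reply[4:].strip()                # speak the clarifying question
    return f"[executing] {reply[4:].strip()}"   # hand the action to the planner


memory: dict = {}
print(dialogue_turn("Can you clean up today?", memory))    # agent asks about the preference
memory["cleaning_time"] = "after lunch"                    # the user's answer is stored
print(dialogue_turn("After lunch works for me.", memory))  # agent acts on the stored preference
```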

Figure 2: Capabilities that LLMs bring to embodied agents, with representative references for each: Planning [1–4], Interactivity [1–6], Memory [1–6], Policy [1–5], Decision-Making [1–6]

Trends in EAI Research: Evolution of Resources & Themes in Language + Embodied AI

The progression of language in embodied AI (EAI) has followed a clear trajectory from simple instruction following to increasingly interactive scenarios (see Plot 1). At its foundation lies navigation, where systems learn to follow route instructions in both indoor and outdoor environments. What began with simple point-to-point navigation has evolved into open-vocabulary manipulation scenarios (Yenamandra et al. 2024) where agents must understand novel object references. Early benchmarks like R2R (Room-to-Room) and Room-Across-Room (Anderson et al. 2018; Ku et al. 2020) focused on single-turn navigation instructions, where agents learned to follow natural language directives in simulated environments. This evolved with datasets like ALFRED (Shridhar et al. 2020) and TOUCHDOWN (Chen et al. 2019), which introduced more complex multi-step task completion and outdoor navigation, respectively. The field then saw a shift toward interactive scenarios as CVDN (Thomason et al. 2020) and TEACh (Padmakumar et al. 2021) introduced dialogue-based navigation and task completion, though these datasets remain limited in the depth and naturalness of their interactions.