Ph.D. Public Defense

Adaptive Perception for Efficient Spatio-Temporal Language Grounding in Dynamic Environments

Siddharth Patki

Supervised by Thomas Howard

Thursday, August 17, 2023
1 p.m.

601 Computer Studies Building

As robots become increasingly prevalent in shared spaces such as homes and offices, efficient and effective human-robot collaboration has become imperative. A key aspect of such collaboration is the ability of robots to understand and interpret human instructions. However, the challenges associated with understanding natural language instructions in cluttered and dynamic environments remain significant.

Recent approaches to grounded language understanding reason only in the context of an instantaneous state of the world. Though this allows a variety of utterances to be interpreted in the current context, these models fail on utterances that require knowledge of the past dynamics of the world, hindering effective human-robot collaboration in dynamic environments. Extending contemporary models to reason about such utterances introduces non-trivial challenges pertaining to both world state estimation and symbol grounding. With respect to world modeling in particular, constructing a comprehensive model of the dynamic world that tracks the states of all objects in the robot's workspace is computationally expensive and scales poorly with increasing clutter in the environment. On the other hand, a poorly detailed model of the environment limits the diversity of utterances that can be interpreted and executed. A fundamental research question, then, is how to reason efficiently over this rich information so that robots can execute a variety of instructions in highly cluttered and dynamic worlds.

In this thesis I present an arc of research that investigates how the information in language can be used to construct task-specific representations of the world, enabling faster and more accurate symbol grounding in cluttered and dynamic environments. First, this thesis presents a learned model of language and perception called Language Guided Adaptive Perception (LG-AP) that allows language to steer the interpretation of raw observations, creating world models that are minimal but sufficient for grounding robot instructions in static environments. Second, this thesis presents a novel approach called Language Guided Temporally Adaptive Perception (LG-TAP) that facilitates the construction of temporally compact models of dynamic worlds through closed-loop grounding and perception. The document closes with a discussion of the synergies among these contributions and of how adapting perception by exploiting the information in language can improve the runtime efficiency and accuracy of robot instruction following in static and dynamic environments with a high degree of clutter.
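To make the core idea concrete, the following is a minimal illustrative sketch, not the thesis implementation: the detector names, the toy symbol extractor, and the WorldModel class are all hypothetical stand-ins. It shows how symbols inferred from an instruction can gate which perception routines run, so the resulting world model contains only task-relevant objects.

    # Illustrative sketch of language-guided adaptive perception.
    # All names here (DETECTORS, extract_symbols, WorldModel) are
    # hypothetical stand-ins, not the thesis implementation.

    from dataclasses import dataclass, field

    # Hypothetical registry mapping perception symbols to detector callables.
    DETECTORS = {
        "cube":  lambda obs: [o for o in obs if o["shape"] == "cube"],
        "ball":  lambda obs: [o for o in obs if o["shape"] == "sphere"],
        "crate": lambda obs: [o for o in obs if o["shape"] == "box"],
    }

    @dataclass
    class WorldModel:
        objects: list = field(default_factory=list)

    def extract_symbols(instruction: str) -> set:
        """Toy stand-in for the learned language model: activate only the
        perception symbols whose lexical form appears in the instruction."""
        words = instruction.lower().split()
        return {sym for sym in DETECTORS if sym in words}

    def adaptive_perception(instruction: str, raw_observations: list) -> WorldModel:
        """Run only the detectors the instruction requires, yielding a world
        model that is minimal but sufficient for grounding the instruction."""
        model = WorldModel()
        for symbol in extract_symbols(instruction):
            model.objects.extend(DETECTORS[symbol](raw_observations))
        return model

    if __name__ == "__main__":
        obs = [{"shape": "cube", "id": 1}, {"shape": "sphere", "id": 2},
               {"shape": "box", "id": 3}]
        # Only the cube detector runs; the sphere and box are never modeled.
        print(adaptive_perception("pick up the cube", obs).objects)

In the temporal setting described above, one would expect the same gating to extend over time, maintaining object tracks only for the symbols an instruction requires and updating them in a closed loop between grounding and perception.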