Language & Vision
Interactive Audio-Haptic Perception for Blind and Low-Vision Users
PI: Kathleen McKeown
Co-PI: Tatiana Emmanouil, Nikolaus Kriegeskorte, and Brian Smith
Abstract
This project seeks to build an assistive technology that gives blind and low-vision (BLV) people an audio-haptic sense of vision that is as interactive and flexible as the human visual system. It will enable interactive shifts of spatial focus to “zoom in” or “pan” to different areas as needed, and provide information that ranges from the sensory (shape, texture, color) to the semantic (objects and their parts), the relational (arrows of the scene graph), and the contextual (past, future, knowledge base). Computer vision (including panoptic segmentation and scene understanding) and large language models are now sufficiently advanced and well integrated to provide this information from camera input and learned contextual knowledge. The challenge is to make this information interactively accessible to BLV people to help them experience the visual world and, in particular, complex forms of art. Our system design will draw on cognitive and neuroscientific insights, specifically studies on scene gist, saliency, attention, and eye movements, with this work led by cognitive psychologist Emmanouil and computational neuroscientist Kriegeskorte.
Publications
Mechanisms of Audiovisual Language Integration in Humans and AI
PI: Tony Ro
Co-PI: Tatiana Emmanouil, Eva Dyer, Julia Hirschberg, and Christos H. Papadimitriou
Abstract
In this new project, we ask how biological and artificial systems integrate linguistic input with visual context to construct coherent multimodal representations. In natural communication, language rarely occurs in isolation. Visual information, such as a speaker’s gestures, facial expressions, and/or the surrounding scene, provides essential context that shapes or alters comprehension, particularly under conditions of ambiguity or noise [1,2]. Despite its importance, the neural mechanisms that support context-sensitive audiovisual integration across time remain poorly understood.
Furthermore, in artificial systems, large language models excel at local token prediction, and their multimodal extensions, visual language models (VLMs), often outperform unimodal systems on visually grounded tasks (e.g., image-text reasoning and audiovisual scene understanding [3,4]). However, it remains unclear whether VLMs integrate multimodal contexts in a manner that resembles human cognitive and neural processing or whether their performance reflects superficial statistical regularities.
