Speaker: Katherine Xu – Language and Vision Working Group
March 23 @ 3:00 pm - 4:00 pm

Title: Are Vision-Language Models Checking or Looking?
Abstract:
Today’s AI vision systems are trained on vast amounts of data, yet it remains unclear whether they simply retrieve memorized answers or actively reason. We conjecture that hallucinations and limited creativity in these models stem from an over-reliance on superficial “checking” rather than active “looking.” Checking retrieves the most probable memorized association, which often fails when novel inputs mismatch stored patterns. In contrast, looking involves reasoning on the fly: iteratively sampling information, revising interpretations, and integrating evidence across modalities. First, I will share our recent work on Vibe Spaces for creatively connecting visual concepts. Second, I will propose visual humor as a lens for probing these cross-modal reasoning deficits. I will conclude with early findings from my ongoing research to open a discussion on potential collaborative directions for our working group.
Zoom: link available upon request at [email protected]
