
PI: Richard Zemel

co-PI: Kathleen McKeown
Abstract:
Modern multimodal vision-language models (VLMs) are increasingly capable of performing tasks across the spectrum of human interests, including games (e.g., card games, chess), abstract reasoning tasks (e.g., summarization, mathematical problem solving), and embodied intelligence tasks (e.g., pick-and-place). However, each of these tasks generally requires a specialized VLM with task-specific training to perform well, and VLM post-training typically involves a single phase of fine-tuning or reinforcement learning (RL) that yields proficiency in only one or a few narrow task domains. In contrast, humans excel at many of these tasks at once, especially those they perform, and thus improve at, frequently. As an example, the figure below (from here) shows that the average adult human spends several hours a day on household chores, work, education, and leisure, often interleaving these activities.