
Hee Seung (Will) Hwang will present his General Exam "COMPACT: COMPositional Atomic-to-Complex Visual Capability Tuning" on Monday, October 6, 2025 at 1:00 PM in CS 402.

Committee Members: Olga Russakovsky (advisor), Zhuang Liu, Manoel Horta Ribeiro

Abstract:
Visual Instruction Tuning (VIT) data for Multi-modal Large Language Models (MLLMs) have continuously scaled in volume, with a "the more the merrier" mentality toward data. However, a natural question is whether this blind scaling is wasteful, and whether leveraging more informative samples could enable faster training and perhaps even more accurate models. In this paper, we present COMPACT (COMPositional Atomic-to-complex Visual Capability Tuning), a VIT data recipe that explicitly controls training sample complexity by combining multiple atomic visual capabilities in a single training example. Concretely, we synthesize rich and informative text questions for each image, allowing us to significantly reduce the number of training examples required for effective visual instruction tuning. COMPACT demonstrates superior data efficiency compared to existing data reduction methods. When applied to the LLaVA-665K VIT dataset, COMPACT reduces the data budget by 90% while still achieving 100.18% of the full VIT performance (compared to only 97.47% by the state of the art) across ten multimodal benchmarks. Further, training on COMPACT data even improves performance over training with full-scale data on particularly complex benchmarks such as MM-Vet (8.6%) and MMStar (2.9%). COMPACT offers a scalable and efficient synthetic data generation recipe for improving on visual language tasks.

Reading List: https://docs.google.com/document/d/1lhdxFBNmEkr1Kw_D6Te-4Bp4Btp7Zp7u1UAI38hG...

Everyone is invited to attend the talk, and those faculty wishing to remain for the oral exam following are welcome to do so.