Look Before You Zoom: Adaptive Routing for the Resolution-Context Trade-off in Visual RAG

Published in EMM-QA @ ICML 2026

Vision-language models can struggle when query-relevant objects are very small. This paper introduces ViRGo, a lightweight adaptive routing framework that selects among global perception, patch-based retrieval, and attention-based retrieval according to object scale and semantic confidence.
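To make the routing idea concrete, here is a minimal sketch of a three-way router driven by the two signals the abstract names. It is an illustration only, not ViRGo's actual decision rule: the names `Route`, `RoutingSignals`, and `choose_route`, the threshold values, and the mapping from signals to strategies are all assumptions made for this example.

```python
from dataclasses import dataclass
from enum import Enum


class Route(Enum):
    """The three strategies named in the abstract."""
    GLOBAL = "global_perception"
    PATCH = "patch_based_retrieval"
    ATTENTION = "attention_based_retrieval"


@dataclass(frozen=True)
class RoutingSignals:
    # Hypothetical signal: estimated fraction of the image area
    # covered by the query-relevant object, in [0, 1].
    object_scale: float
    # Hypothetical signal: how confident the model is that it has
    # located the relevant content, in [0, 1].
    semantic_confidence: float


def choose_route(
    signals: RoutingSignals,
    scale_threshold: float = 0.05,       # assumed value, not from the paper
    confidence_threshold: float = 0.5,   # assumed value, not from the paper
) -> Route:
    """Pick a strategy from object scale and semantic confidence.

    Assumed mapping (for illustration only):
      - large object                   -> global perception
      - small object, high confidence  -> attention-based retrieval
      - small object, low confidence   -> patch-based retrieval
    """
    for name, value in (
        ("object_scale", signals.object_scale),
        ("semantic_confidence", signals.semantic_confidence),
    ):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be in [0, 1], got {value}")

    if signals.object_scale >= scale_threshold:
        return Route.GLOBAL
    if signals.semantic_confidence >= confidence_threshold:
        return Route.ATTENTION
    return Route.PATCH


if __name__ == "__main__":
    print(choose_route(RoutingSignals(object_scale=0.40, semantic_confidence=0.9)))
    print(choose_route(RoutingSignals(object_scale=0.01, semantic_confidence=0.9)))
    print(choose_route(RoutingSignals(object_scale=0.01, semantic_confidence=0.1)))
```

Keeping the router as a pure function of two scalar signals is what would make such a scheme lightweight: the expensive retrieval paths run only when the cheap check says the global view is unlikely to suffice.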

The paper is available on arXiv: https://arxiv.org/abs/2606.21968

Recommended citation: Oanh N. Tran, Thanh Quoc Hung Le, Oscar Chew, Kuan-Hao Huang, and Khoa D. Doan. "Look Before You Zoom: Adaptive Routing for the Resolution-Context Trade-off in Visual RAG." ICML 2026 Workshop on Efficient Multimodal Question Answering, 2026. https://arxiv.org/abs/2606.21968