tl;dr: We introduce TWIN, a large-scale dataset of 561K image-pair queries that trains VLMs to detect subtle visual differences by deciding whether two similar images show the same object. Post-training VLMs on TWIN significantly boosts fine-grained perception across diverse domains, evaluated with our new FGVQA benchmark suite.
Consider the two vacuum cleaners shown here. We ask a simple question: do these two images show the exact same vacuum cleaner? At a glance, they share the same brand and similar colors, but they are clearly not the same physical object. A human quickly notices differences in dustbin geometry, handle design, and color accents—subtle cues that distinguish one instance from another. Yet when asked this question, Qwen2.5-VL, a strong open-source VLM, answers yes. The model not only reaches the wrong conclusion but also reveals flawed visual reasoning. In this work, we aim to address such shortcomings in fine-grained visual understanding by introducing TWIN, a large-scale training dataset for fine-grained VQA, and FGVQA, an accompanying benchmark suite to measure progress.
We attribute the limitations of current open-source VLMs in fine-grained perception partly to their training data. Most large-scale image–text corpora emphasize general visual reasoning – such as spatial relations, common knowledge, grounding, or mathematical reasoning – over detailed visual discrimination. While these datasets enable broad understanding, they provide little incentive to attend to subtle, instance-level differences. As a concrete example, the recent open-source VLM PerceptionLM [1] documents its exact training data (shown below):
Among all these datasets, which span tens of millions of data points, only two emphasize fine-grained understanding: SpotTheDiff [2] and Birds-to-Words [3]. Together they amount to fewer than $35$K samples, and we find them to be considerably easier than our new dataset TWIN. To improve fine-grained perception in VLMs, we need more large-scale training datasets that emphasize fine-grained image understanding!
We present TWIN, a large-scale VQA dataset for advancing fine-grained visual understanding in VLMs. TWIN contains $561{,}000$ instance-centric queries in which a model must judge whether two similar-looking images depict the same object instance. This design rewards attention to nuanced, instance-level details such as shape, texture, and part geometry, going beyond category-level understanding.
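To make the query format concrete, here is a minimal sketch of how a single TWIN example could be represented in code. The field names, file paths, and prompt wording are illustrative assumptions, not the released schema.

```python
from dataclasses import dataclass

# Hypothetical record layout for one TWIN query; the field names and
# example paths are illustrative, not the dataset's actual schema.
@dataclass
class TwinQuery:
    image_a: str  # first image of the pair
    image_b: str  # second, visually similar image
    label: str    # "yes" if both images show the same object instance, else "no"

# An assumed prompt in the spirit of the task description above.
PROMPT = (
    "Do these two images show the exact same object instance? "
    "Explain the visual evidence, then answer yes or no."
)

example = TwinQuery(
    image_a="images/vacuum_001_left.jpg",   # hypothetical file names
    image_b="images/vacuum_001_right.jpg",
    label="no",                             # similar model line, different instance
)
```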
To improve fine-grained understanding in VLMs and evaluate the impact of our new dataset, we post-train existing models on TWIN. Our task requires recognizing subtle attributes to distinguish similar instances, and we hypothesize that optimizing for this task enhances broader fine-grained abilities. We post-train with reinforcement learning as it has been shown to improve model capabilities while preserving prior skills.
Given an image pair $(I_1, I_2)$ with ground-truth label $y \in \{\text{yes}, \text{no}\}$, a VLM $\pi_\theta$, parametrized by $\theta$, is prompted to produce a textual explanation and a final answer $\hat{y}$ indicating whether both images depict the same instance. We use a binary outcome reward that compares the prediction to the ground truth: $R(y, \hat{y}) = \mathbf{1}\{y = \hat{y}\}$. Importantly, supervision relies only on pairwise assignments, without any descriptive textual annotations. We tune the VLM $\pi_\theta$ on this task using GRPO.
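As an illustration, the binary outcome reward takes only a few lines. The sketch below assumes the model is prompted to end its response with a final "Answer: yes" or "Answer: no" line; that parsing convention, and the function name, are assumptions made for illustration rather than the exact setup used for GRPO training.

```python
import re

def outcome_reward(ground_truth: str, completion: str) -> float:
    """Binary outcome reward R(y, y_hat) = 1{y == y_hat}.

    Assumes the model ends its response with a line like "Answer: yes" or
    "Answer: no"; this parsing convention is an illustrative assumption,
    not necessarily the exact format used during training.
    """
    match = re.search(r"answer:\s*(yes|no)", completion.lower())
    if match is None:
        return 0.0  # unparseable responses earn no reward
    return 1.0 if match.group(1) == ground_truth.lower() else 0.0

# Sanity checks on the reward logic.
assert outcome_reward("no", "The dustbins differ in shape. Answer: no") == 1.0
assert outcome_reward("no", "They look identical to me. Answer: yes") == 0.0
```

Because the reward depends only on the final yes/no verdict, the textual explanation is shaped indirectly: the model may cite any visual evidence, as long as it reaches the correct conclusion. In practice, a function like this would be wrapped to match the reward interface of whichever GRPO implementation is used.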
Fine-grained understanding is a general skill: models attuned to subtle differences should generalize across domains. To evaluate this, we introduce FGVQA, a suite of fine-grained VQA benchmarks. FGVQA repurposes recognition and retrieval datasets, totaling $12{,}000$ queries spanning retail products; animals and plants; landmarks; birds; and art.
FGVQA is composed of six benchmarks: ILIAS and TWIN-Eval (in-domain with TWIN), INQUIRE (animal and plant species), CUB (birds), MET (art), and LANDMARKS (landmarks).
The breadth of FGVQA enables assessment of cross-domain generalization. For each benchmark, we construct two query types: pair queries, which ask whether two images depict the same object instance, and multi queries, which ask how many of several candidate images depict the same instance as a reference image.
Each dataset includes $1{,}000$ balanced examples per query type: pair queries are split evenly between positive and negative cases, while multi queries are distributed uniformly across answer counts ($250$ each for answers $0$ through $3$).
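For intuition, here is a minimal sketch of how such balanced query sets could be assembled. The function names, input structures (pre-mined positive and negative image pairs, and multi-image queries grouped by their correct count), and defaults are assumptions for illustration, not the actual FGVQA construction pipeline.

```python
import random

def build_pair_queries(positive_pairs, negative_pairs, n=1000, seed=0):
    """Sample a balanced pair-query set: half "yes" pairs (same instance),
    half "no" pairs (similar but different instances).

    `positive_pairs` / `negative_pairs` are assumed to be lists of
    (image_1, image_2) tuples mined beforehand.
    """
    rng = random.Random(seed)
    queries = [{"images": p, "answer": "yes"} for p in rng.sample(positive_pairs, n // 2)]
    queries += [{"images": p, "answer": "no"} for p in rng.sample(negative_pairs, n // 2)]
    rng.shuffle(queries)
    return queries

def build_multi_queries(queries_by_count, n_per_count=250, seed=0):
    """Sample multi queries uniformly over the answer counts 0-3.

    `queries_by_count[k]` is assumed to hold candidate queries whose correct
    answer is that exactly k of the shown images match the reference.
    """
    rng = random.Random(seed)
    queries = []
    for k in range(4):
        for q in rng.sample(queries_by_count[k], n_per_count):
            queries.append({**q, "answer": k})
    rng.shuffle(queries)
    return queries
```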
We compare the outputs of base vs. TWIN post-trained models on the FGVQA benchmark suite below.
Quantitative Results. We post-train Qwen2.5-VL 3B Instruct on TWIN and compare performance on FGVQA to assess whether training on TWIN improves fine-grained perception across domains. Our direct comparisons show substantial gains from training on TWIN: we observe large improvements on the in-domain ILIAS (+18.3%) and TWIN-Eval (+17.2%) benchmarks. Importantly, the improvements transfer to unseen domains, with substantial gains on animal and plant species (INQUIRE & CUB) and on art and landmarks (MET & LANDMARKS), which are distinct from the objects in TWIN. We include comparisons with a smaller InternVL3.5 1B model in the paper. Crucially, improved performance on FGVQA does not compromise performance on general VQA benchmarks, which we also report in the paper.
Importance of data scale. We conduct a scaling analysis to highlight the importance of collecting TWIN at scale: we train a Qwen2.5-VL 3B Instruct model on varying numbers of pairs from TWIN. Performance improves consistently across all datasets as the training set grows from $5$K to $561$K samples, reinforcing our decision to collect TWIN at scale. Notably, scale also improves performance on CUB and INQUIRE, which feature domains not represented in TWIN.
We introduce TWIN, a large-scale VQA dataset of $561{,}000$ queries designed for improving fine-grained perception in VLMs. To measure progress on fine-grained understanding, we additionally introduce FGVQA, a benchmark suite for precise visual understanding across a wide range of domains. Current open-source VLMs struggle on FGVQA, but post-training them on TWIN substantially improves fine-grained reasoning, even on unseen domains. We envision TWIN as a drop-in addition to VLM training corpora and hope FGVQA serves as a benchmark suite for measuring progress in fine-grained VQA.
Acknowledgements
We thank Aadarsh Sahoo, Ilona Demler, and Ziqi Ma for their feedback on the project. The project is funded by Meta through the LLM evaluation research grant and partly through Caltech’s CAST program. We also thank Google’s Gemma Academic program for granting us API credits for their LLMs.
References
[1] PerceptionLM: Open-Access Data and Models for Detailed Visual Understanding. Cho et al., 2025.
[2] Learning to Describe Differences Between Pairs of Similar Images. Jhamtani and Berg-Kirkpatrick, EMNLP 2018.
[3] Neural Naturalist: Generating Fine-Grained Image Comparisons. Forbes et al., EMNLP 2019.