TWIN

Same or Not? Enhancing Visual Perception in Vision-Language Models

Caltech

tl;dr: We introduce TWIN, a large-scale dataset of 561K image-pair queries that trains VLMs to detect subtle visual differences by deciding whether two similar images show the same object. Post-training VLMs on TWIN significantly boosts fine-grained perception across diverse domains, evaluated with our new FGVQA benchmark suite.

Same or Not?

TWIN example
Figure 1: Example from the TWIN dataset.

Consider the two vacuum cleaners shown here. We ask a simple question: do these two images show the exact same vacuum cleaner? At a glance, they share the same brand and similar colors, but they are clearly not the same physical object. A human quickly notices differences in dustbin geometry, handle design, and color accents—subtle cues that distinguish one instance from another. Yet when asked this question, Qwen2.5-VL, a strong open-source VLM, answers yes. The model not only reaches the wrong conclusion but also reveals flawed visual reasoning. In this work, we aim to address such shortcomings in fine-grained visual understanding by introducing TWIN, a large-scale training dataset for fine-grained VQA, and FGVQA, an accompanying benchmark suite to measure progress.

Why do Open-source VLMs Struggle?

We attribute the limitations of current open-source VLMs in fine-grained perception partly to their training data. Most large-scale image–text corpora emphasize general visual reasoning – such as spatial relations, common knowledge, grounding, or mathematical reasoning – over detailed visual discrimination. While these datasets enable broad understanding, they provide little incentive to attend to subtle, instance-level differences. As a concrete example, the recent open-source VLM PerceptionLM [1] details its exact training data (shown below):

PerceptionLM omits fine-grained understanding.
Figure 2: Training datasets used by PerceptionLM [1]. Only two datasets emphasize fine-grained image understanding.

Among these datasets, which together span tens of millions of data points, only two emphasize fine-grained understanding: SpotTheDiff [2] and Birds-To-Words [3]. Together they contain fewer than $35$K samples, and we find them considerably easier than our new dataset TWIN. To improve fine-grained perception in VLMs, we need more large-scale training datasets that emphasize fine-grained image understanding!

The TWIN Dataset

We present TWIN, a large-scale VQA dataset for advancing fine-grained visual understanding in VLMs. TWIN introduces $561{,}000$ instance-centric queries where models are tasked to judge whether two similar-looking images depict the same object instance. This design rewards attention to nuanced, instance-level details such as shape, texture, and part geometry, going beyond category-level understanding.

TWIN dataset overview
Figure 3: TWIN is a large-scale VQA dataset for fine-grained visual understanding, where VLMs determine whether two images depict the same instance. TWIN contains $561$K pairwise VQA queries across $1{,}836$ object instances, spanning $36$ categories of common objects and over $22$K images.

Sourcing Instances. We define an instance as a set of images of the same physical object under varied viewpoints, lighting, and backgrounds. We source object instances across diverse categories from Amazon Reviews [4]. From this definition, we design a VQA task where a VLM receives two images and determines if they depict the same instance.
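
To make the setup concrete, here is a minimal sketch of how an instance and the resulting pairwise query could be represented. The field names and question wording are illustrative assumptions, not the released data format.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Instance:
    """A set of images of one physical object under varied viewpoints, lighting, and backgrounds."""
    instance_id: str        # hypothetical identifier, e.g. derived from an Amazon product listing
    category: str           # object category, e.g. "vacuum cleaner"
    image_paths: List[str]  # photos of this exact object

@dataclass
class PairQuery:
    """A single TWIN-style VQA query over two images."""
    image_a: str
    image_b: str
    label: str              # "yes" if both images show the same instance, otherwise "no"

# Illustrative question posed to the VLM alongside the two images (wording is an assumption).
QUESTION = "Do these two images show the exact same object instance? Answer yes or no."
```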

Hard Negative Pairs. A balanced dataset for fine-grained understanding requires both positive and negative pairs. If all examples were positive, the task would be trivial. Likewise, random negatives are often too easy (e.g., a mug paired with a fan). We therefore focus on hard negatives – distinct objects that appear similar. We collect these hard negatives with the help of human annotators.

Statistics and Scalability. TWIN features $561$K pairwise VQA queries, including $22{,}157$ unique images of $1{,}836$ object instances. Our instances span a wide range of household object categories. Importantly, our pairwise formulation enables TWIN to scale favorably with the number of object instances.
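
The sketch below illustrates why the pairwise formulation scales favorably: every unordered pair of images within an instance yields a positive query, and every annotator-flagged look-alike instance yields many hard-negative queries. This is an assumed construction for illustration, not the authors' exact pipeline.

```python
import itertools
import random
from typing import Dict, List, Tuple

def build_pair_queries(
    instances: Dict[str, List[str]],       # instance_id -> image paths for that object
    hard_negatives: Dict[str, List[str]],  # instance_id -> annotator-flagged look-alike instance ids
) -> List[Tuple[str, str, str]]:
    """Assemble (image_a, image_b, label) queries; an assumed sketch, not the released pipeline."""
    queries = []
    for inst_id, images in instances.items():
        # Positive pairs: all unordered image pairs within the same instance, k*(k-1)/2 per instance.
        for img_a, img_b in itertools.combinations(images, 2):
            queries.append((img_a, img_b, "yes"))
        # Hard negative pairs: images of this instance paired with images of similar-looking instances.
        for neg_id in hard_negatives.get(inst_id, []):
            for img_a in images:
                queries.append((img_a, random.choice(instances[neg_id]), "no"))
    return queries
```

Because each instance pairs with many of its own images and with images of its look-alikes, the number of queries grows much faster than the number of images, which is how $1{,}836$ instances and roughly $22$K images yield $561$K pairwise queries.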

Post-Training VLMs with TWIN

To improve fine-grained understanding in VLMs and evaluate the impact of our new dataset, we post-train existing models on TWIN. Our task requires recognizing subtle attributes to distinguish similar instances, and we hypothesize that optimizing for this task enhances broader fine-grained abilities. We post-train with reinforcement learning as it has been shown to improve model capabilities while preserving prior skills.

Post-Training VLMs on TWIN
Figure 4: We post-train VLMs using reinforcement learning on TWIN. Reward is computed by comparing the predicted answer with the ground truth pair assignment.

Given an image pair $(I_1, I_2)$ with ground-truth label $y \in \{\text{yes}, \text{no}\}$, a VLM $\pi_\theta$, parametrized by $\theta$, is prompted to produce a textual explanation and a final answer $\hat{y}$ indicating whether both images depict the same instance. We use a binary outcome reward that compares the prediction with the ground truth: $R(y, \hat{y}) = \mathbf{1}\{y = \hat{y}\}$. Importantly, supervision relies only on pairwise assignments, without any descriptive textual annotations. We tune the VLM $\pi_\theta$ on this task using GRPO.
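
A minimal sketch of the binary outcome reward is shown below. The answer-parsing convention ("Answer: yes/no" at the end of the response) is our assumption; the authors' exact prompt and extraction may differ.

```python
import re

def outcome_reward(model_output: str, ground_truth: str) -> float:
    """Binary outcome reward R(y, y_hat) = 1{y = y_hat} for GRPO rollouts.

    Assumes (hypothetically) that the model is prompted to end its response with a
    final answer such as "Answer: yes" or "Answer: no".
    """
    # Look at the text after the last occurrence of "answer" (falls back to the whole output).
    tail = model_output.lower().rsplit("answer", 1)[-1]
    match = re.search(r"\b(yes|no)\b", tail)
    prediction = match.group(1) if match else None
    return 1.0 if prediction == ground_truth.strip().lower() else 0.0

# Example rollouts: the explanation concludes with a final answer compared against the label.
assert outcome_reward("The dustbins and handles differ. Answer: no", "no") == 1.0
assert outcome_reward("Same brand, same part geometry. Answer: yes", "no") == 0.0
```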

The FGVQA Benchmark Suite

Fine-grained understanding is a general skill: models attuned to subtle differences should generalize across domains. To evaluate this, we introduce FGVQA, a suite of fine-grained VQA benchmarks. FGVQA repurposes recognition and retrieval datasets, totaling $12{,}000$ queries spanning retail products; animals and plants; landmarks; birds; and art.

FGVQA Benchmark Suite
Figure 5: FGVQA is a suite of fine-grained VQA benchmarks spanning retail products; animals and plants; landmarks; birds; and art. We include two query types: pair (top row), where a VLM judges if two images depict the same instance, species, or landmark, and multi where it counts how many images match a reference.

FGVQA is composed of:

  • TWIN-Eval is the evaluation set of TWIN. It is collected identically to TWIN, but features distinct instances and images.
  • ILIAS [5] is a large-scale test dataset for instance-level image retrieval. It predominantly features images of retail products taken in various contexts, backgrounds, and lighting.
  • Google Landmarks v2 [6] is a landmark recognition dataset featuring human-made and natural landmarks. The dataset has been used in both classification and retrieval settings.
  • MET [7] is an image retrieval dataset featuring artwork from the Metropolitan Museum of Art in New York. The dataset features images of the same art piece or sculpture from varying viewpoints, emphasizing multi-view consistency in retrieval.
  • CUB [8] is a fine-grained classification dataset that focuses on identifying bird species from images.
  • Inquire [9] is a benchmark for natural world image retrieval, featuring images of animal and plant species sourced from the iNaturalist [10] dataset.

The breadth of FGVQA enables assessment of cross-domain generalization. For each benchmark, we construct two query types:

  • Pair queries show two images and ask whether they depict the same instance, artwork, or species.
  • Multi queries provide a reference image and three candidates, and ask how many match the reference.

Each dataset includes $1{,}000$ balanced examples per query type: pair queries are split evenly between positive and negative cases, while multi queries are distributed uniformly across answer counts ($250$ each for answers $0$ through $3$).
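
To illustrate how such queries can be built from a recognition or retrieval dataset grouped by instance, species, or landmark, here is a hedged sketch. For simplicity it draws negatives and distractors at random, whereas the actual benchmark construction may select harder, visually similar ones.

```python
import random
from typing import Dict, List, Tuple

def sample_pair_query(groups: Dict[str, List[str]], positive: bool) -> Tuple[str, str, str]:
    """One pair query: two images and a yes/no label (sketch under assumed grouping by identity)."""
    if positive:
        gid = random.choice([g for g, imgs in groups.items() if len(imgs) >= 2])
        img_a, img_b = random.sample(groups[gid], 2)
        return img_a, img_b, "yes"
    gid_a, gid_b = random.sample(list(groups), 2)
    return random.choice(groups[gid_a]), random.choice(groups[gid_b]), "no"

def sample_multi_query(groups: Dict[str, List[str]], num_matches: int) -> Tuple[str, List[str], int]:
    """One multi query: a reference image, three candidates, and the answer count (0-3)."""
    ref_gid = random.choice([g for g, imgs in groups.items() if len(imgs) > num_matches])
    ref, *matches = random.sample(groups[ref_gid], num_matches + 1)
    other_gids = random.sample([g for g in groups if g != ref_gid], 3 - num_matches)
    candidates = matches + [random.choice(groups[g]) for g in other_gids]
    random.shuffle(candidates)
    return ref, candidates, num_matches

# Balanced construction per dataset: 500 positive and 500 negative pair queries,
# and 250 multi queries for each answer count in {0, 1, 2, 3}.
```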

Training on TWIN improves fine-grained understanding.

We compare the outputs of base vs. TWIN post-trained models on the FGVQA benchmark suite below.

TWIN example
Figure 6: Qualitative results on FGVQA.

Quantitative Results. We post-train Qwen2.5-VL 3B Instruct on TWIN and compare performance on FGVQA to assess whether training on TWIN improves fine-grained perception across domains. Our direct comparisons show substantial gains from training on TWIN. We observe large gains on the in-domain benchmarks ILIAS (+18.3%) and TWIN-Eval (+17.2%). The improvements also transfer to unseen domains: we observe substantial gains on animal and plant species (Inquire & CUB) and on art and landmarks (MET & Google Landmarks), which are distinct from the objects in TWIN. We include comparisons with a smaller InternVL3.5 1B model in the paper. Importantly, improved performance on FGVQA does not compromise performance on general VQA benchmarks, which we report in the paper.

Importance of data scale. We conduct a scaling analysis to highlight the importance of collecting TWIN at scale. We train a Qwen2.5-VL 3B Instruct model on varying numbers of pairs from TWIN. Performance improves consistently across all datasets from $5$K to $561$K samples, reinforcing our decision to collect TWIN at scale. Notably, scale also improves performance on CUB and Inquire, which feature domains not represented in TWIN.

Conclusion

We introduce TWIN, a large-scale VQA dataset of $561{,}000$ queries designed for improving fine-grained perception in VLMs. To measure progress on fine-grained understanding, we additionally introduce FGVQA, a benchmark suite for precise visual understanding across a wide range of domains. Current open-source VLMs struggle on FGVQA, but post-training them on TWIN substantially improves fine-grained reasoning, even on unseen domains. We envision TWIN as a drop-in addition to VLM training corpora and hope FGVQA serves as a benchmark suite for measuring progress in fine-grained VQA.

Acknowledgements

We thank Aadarsh Sahoo, Ilona Demler, and Ziqi Ma for their feedback on the project. The project is funded by Meta through the LLM evaluation research grant and partly through Caltech’s CAST program. We also thank Google’s Gemma Academic program for granting us API credits for their LLMs.

References
  1. Jang Hyun Cho, Andrea Madotto, Effrosyni Mavroudi, Triantafyllos Afouras, Tushar Nagarajan, Muhammad Maaz, Yale Song, Tengyu Ma, Shuming Hu, Suyog Jain, et al. PerceptionLM: Open-access data and models for detailed visual understanding. arXiv preprint, 2025.
  2. Harsh Jhamtani and Taylor Berg-Kirkpatrick. Learning to describe differences between pairs of similar images. EMNLP, 2018.
  3. Maxwell Forbes, Christine Kaeser-Chen, Piyush Sharma, and Serge Belongie. Neural naturalist: Generating fine-grained image comparisons. EMNLP, 2019.
  4. Yupeng Hou, Jiacheng Li, Zhankui He, An Yan, Xiusi Chen, and Julian McAuley. Bridging language and items for retrieval and recommendation. arXiv preprint, 2024.
  5. Giorgos Kordopatis-Zilos, Vladan Stojnić, Anna Manko, Pavel Suma, Nikolaos-Antonios Ypsilantis, Nikos Efthymiadis, Zakaria Laskar, Jiri Matas, Ondrej Chum, and Giorgos Tolias. ILIAS: Instance-level image retrieval at scale. CVPR, 2025.
  6. Tobias Weyand, Andre Araujo, Bingyi Cao, and Jack Sim. Google Landmarks Dataset v2 - a large-scale benchmark for instance-level recognition and retrieval. CVPR, 2020.
  7. Nikolaos-Antonios Ypsilantis, Noa Garcia, Guangxing Han, Sarah Ibrahimi, Nanne van Noord, and Giorgos Tolias. The MET dataset: Instance-level recognition for artworks. NeurIPS, 2021.
  8. Catherine Wah, Steve Branson, Peter Welinder, Pietro Perona, and Serge Belongie. The Caltech-UCSD Birds-200-2011 dataset. 2011.
  9. Edward Vendrow, Omiros Pantazis, Alexander Shepard, Gabriel Brostow, Kate E Jones, Oisin Mac Aodha, Sara Beery, and Grant Van Horn. INQUIRE: A natural world text-to-image retrieval benchmark. NeurIPS, 2024.
  10. Grant Van Horn, Oisin Mac Aodha, Yang Song, Yin Cui, Chen Sun, Alex Shepard, Hartwig Adam, Pietro Perona, and Serge Belongie. The iNaturalist species classification and detection dataset. CVPR, 2018.

BibTeX

TODO.