tl;dr: We introduce VALOR, an annotation-free framework that boosts both visual reasoning and grounding by training with AI verifiers instead of human labels. A language model verifier improves reasoning through reinforcement learning, while a vision-language verifier enhances grounding via automatic hard-negative mining. The result is a stronger visual reasoning system that outperforms open-source and proprietary models across a suite of visual reasoning benchmarks.
The progress of LLMs in text-based reasoning in 2024-2025 has been remarkable. In the course of a year, we saw LLMs go from general-purpose chatbots to solving competition-level math problems, writing sophisticated code in large repositories, and answering expert-level questions in the sciences. This rate of progress is apparent when comparing benchmark results from different model releases. For example, GPT-4o achieved just 15.0% on the math problems in AIME 2025, while OpenAI's later GPT-5-mini achieves a staggering 94.0%. Similar jumps appear on the coding benchmark LiveCodeBench and the graduate-level science benchmark GPQA.
Unfortunately, progress in visual reasoning has been significantly slower. Comparing the same two models, the improvements are much more modest: +7.3% on BLINK, +3.5% on VSR, and a slight decrease on CountBenchQA. Note, however, that in absolute terms most of these visual benchmarks are easier: GPT-4o already attains relatively high accuracy on BLINK, VSR, and CountBenchQA, whereas its performance on the text-only benchmarks is substantially lower. On the harder 3D spatial reasoning benchmark Omni3D-Bench, accuracy remains quite low: GPT-5-mini reaches 40.9%, only 5.9% above GPT-4o. This discrepancy is not surprising; visual reasoning is challenging, requiring precise object grounding and an understanding of complex spatial relationships, both of which remain difficult for current models.
Methods to improve visual reasoning broadly fall into two categories. The first integrates grounding with language reasoning, where vision-language models (VLMs) generate chain-of-thought explanations in text. Examples include Thinking with Images [1], GRIT [2], and Visually Grounded RL [3]. These methods can handle simple spatial relations, but suffer from weak visual understanding and logical errors. For instance, in the example above, GPT-5-Thinking ignores real-world 3D object sizes and considers only pixel-wise dimensions, incorrectly concluding that the coffee table is one-sixth the size of the sofa. These methods are also data-hungry, requiring extensive supervision.
Another line of work uses LLMs for program synthesis with vision specialists. Examples include VADAR [4], VisProg [5], and ViperGPT [6]. These training-free approaches rely on proprietary LLMs and on pre-trained specialists that are poorly suited to visual and spatial reasoning.
We introduce VALOR, a scalable, annotation-free training framework that tackles spatial reasoning from images by combining LLM-powered reasoning with specialized tool use. VALOR employs an LLM to generate plans and executable programs, and invokes vision specialists to execute them. Both the reasoning model and the visual grounding model are tuned for the task via a label-free training paradigm: multimodal verifiers critique model outputs, and their feedback serves as a learning signal to improve both components, the LLM responsible for logic and the vision specialists responsible for grounding. We name our approach VALOR as it integrates Verifiers for Annotation-free LOgic and Reasoning.
Plan and Code Generation. Given a query, the LLM generates a natural language plan followed by a corresponding program in Python; a sketch of such a program follows the API list below. Available to the LLM are the APIs of three function calls:

- GD_DETECT returns the bounding boxes of all object instances matching a noun description, e.g., GD_DETECT("CAR"), using a GroundingDINO model [7].
- DEPTH returns the depth of a pixel in the image, e.g., DEPTH(IMAGE, X, Y), using MoGe2 [8].
- VQA returns an object's attribute (e.g., color) from an input image crop around the object, e.g., VQA(IMAGE_CROP, "WHAT IS THE COLOR OF THE OBJECT IN THE IMAGE?"), using GPT-5-mini [9].
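To make this concrete, here is a minimal sketch of the kind of program the LLM might emit. It assumes the executor injects GD_DETECT, DEPTH, and the input IMAGE into the program's namespace, and that boxes are (x1, y1, x2, y2) tuples; the query and helper function are illustrative, not taken from VALOR's actual outputs.

```python
# Hypothetical generated program for the query:
# "Is the helmet closer to the camera than the chair?"
# GD_DETECT, DEPTH, and IMAGE are assumed to be bound by the executor.

helmets = GD_DETECT("HELMET")   # bounding boxes of all helmets
chairs = GD_DETECT("CHAIR")     # bounding boxes of all chairs

def center(box):
    """Return the pixel coordinates of a box center."""
    x1, y1, x2, y2 = box
    return (x1 + x2) / 2, (y1 + y2) / 2

hx, hy = center(helmets[0])
cx, cy = center(chairs[0])

# Compare depths at the two box centers; smaller depth means closer to the camera.
answer = "yes" if DEPTH(IMAGE, hx, hy) < DEPTH(IMAGE, cx, cy) else "no"
```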
VALOR leverages LLM verifiers as a reward signal to improve reasoning via reinforcement learning. The LLM verifier critiques model outputs across a rubric of six criteria targeting specific aspects of spatial reasoning. Our reward is composed of six binary rewards. The format reward $r_{\mathrm{fmt}}$ is 1 if the output wraps the plan and final answer in <plan>...</plan> and <answer>...</answer> tags, and 0 otherwise. The remaining five rewards are judged by the LLM verifier, covering the generated code ($r_{\mathrm{sn}}$), the logic of the plan ($r_{\mathrm{log}}$), its handling of object attributes ($r_{\mathrm{att}}$) and spatial relations ($r_{\mathrm{sp}}$), and the adherence of the code to the plan ($r_{\mathrm{ad}}$).
Our final reward is:

\[ R(q,p,c) = r_{\mathrm{fmt}}(p,c) \cdot \big[ \lambda_{\mathrm{sn}}\, r_{\mathrm{sn}}(c) + \lambda_{\mathrm{log}}\, r_{\mathrm{log}}(q,p) + \lambda_{\mathrm{att}}\, r_{\mathrm{att}}(q,p) + \lambda_{\mathrm{sp}}\, r_{\mathrm{sp}}(q,p) + \lambda_{\mathrm{ad}}\, r_{\mathrm{ad}}(p,c) \big] \]

The format reward $r_{\mathrm{fmt}}$ acts as a hard constraint and is applied as a multiplier, while the weighted sum of the remaining rewards evaluates content quality. All $r_k \in \{0, 1\}$ and $\sum_k \lambda_k = 1.0$.
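As a sanity check on the formula, here is a minimal sketch of the composite reward in Python. The weight values are illustrative assumptions (the actual $\lambda_k$ are not specified above); they only need to sum to 1.0.

```python
# Illustrative weights; the real lambda values are not given in the text.
LAMBDAS = {"sn": 0.2, "log": 0.2, "att": 0.2, "sp": 0.2, "ad": 0.2}

def reward(r_fmt: int, r: dict) -> float:
    """r_fmt gates everything; the remaining binary rewards form a weighted sum."""
    assert abs(sum(LAMBDAS.values()) - 1.0) < 1e-9
    return r_fmt * sum(LAMBDAS[k] * r[k] for k in LAMBDAS)

# A well-formatted rollout failing only the spatial criterion scores 0.8;
# a malformed rollout scores 0 regardless of content quality.
print(reward(1, {"sn": 1, "log": 1, "att": 1, "sp": 0, "ad": 1}))  # 0.8
print(reward(0, {"sn": 1, "log": 1, "att": 1, "sp": 1, "ad": 1}))  # 0.0
```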
In addition to logic, visual reasoning relies on accurate grounding. Modern detectors like GroundingDINO, trained on web data, are error-prone and struggle to generalize beyond their training domains. Fine-tuning with domain-specific labels can mitigate these issues, but collecting such annotations is labor-intensive. We propose an alternative: improving visual grounding through VLM verifiers. Vision specialists make predictions, VLM verifiers evaluate them, and the verified detections augment the specialists' training set. This approach requires no manual annotations and scales across domains without additional labels.
Our approach for verifier-improved visual grounding relies on image-query pairs $\{(I_j, q_j)\}_{j=1}^{M}$. For each query $q_j$, our LLM reasoning model generates a plan and code, $(p_j, c_j)$. From code $c_j$, we parse all grounding queries, e.g., GD_DETECT("HELMET"), and execute them with a pre-trained detector. To ensure high recall, we lower the detector's confidence threshold. This leads to overprediction, which we validate with a frozen VLM verifier in three steps.
Confirmed detections form a new training set, which we use to fine-tune a pre-trained GroundingDINO detector.
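The mining loop itself is simple enough to sketch. Below is a hypothetical implementation under a few assumptions: the detector and VLM verifier are passed in as callables, grounding calls are recovered with a regex over the generated code, and the 0.15 threshold is illustrative.

```python
import re

def mine_detections(image_code_pairs, detector, vlm_verify, conf_threshold=0.15):
    """Build a label-free detection training set from VLM-confirmed boxes."""
    dataset = []
    for image, code in image_code_pairs:
        # Recover every grounding query, e.g. GD_DETECT("HELMET"), from the program.
        for noun in re.findall(r'GD_DETECT\("([^"]+)"\)', code):
            # A low confidence threshold trades precision for recall,
            # deliberately overpredicting candidate boxes.
            candidates = detector(image, noun, conf_threshold=conf_threshold)
            # The frozen VLM verifier keeps only the boxes it judges correct.
            confirmed = [box for box in candidates if vlm_verify(image, box, noun)]
            dataset.extend((image, noun, box) for box in confirmed)
    return dataset  # used to fine-tune the pre-trained GroundingDINO detector
```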
The predicted Python programs, which invoke our vision-specialist APIs, are executed to produce answers to visual reasoning queries.
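For illustration, a minimal executor might inject the specialist APIs into the program's namespace and read off a designated output variable. The `answer` variable name is an assumed convention, not necessarily VALOR's actual interface.

```python
def run_program(code: str, image, apis: dict):
    """Execute an LLM-generated program with the specialist APIs injected.

    `apis` maps names like "GD_DETECT", "DEPTH", and "VQA" to callables
    wrapping GroundingDINO, MoGe2, and GPT-5-mini. The program is assumed
    to store its result in a variable named `answer` (an illustrative
    convention).
    """
    env = {"IMAGE": image, **apis}
    exec(code, env)          # run the generated Python in this namespace
    return env.get("answer")
```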
A natural question is whether VALOR can ever outperform the verifiers it uses during training. To answer this, we first note that VALOR uses multimodal verifiers to select and critique data, not to generate it. Thus, VALOR is bounded not by an LLM/VLM's generation ability, but by its verification ability. This distinction matters because we find tasks where VLMs are better verifiers than generators. As a concrete example, VALOR uses GPT-5-mini as the VLM verifier for improving the visual grounding module. Although highly effective at evaluating object detections, it often struggles to generate bounding boxes itself. In the figure above, GPT-5-mini frequently outputs misaligned or overly large boxes, failing to localize objects that VALOR (trained with GPT-5-mini as a verifier) correctly detects. In short, a VLM can provide reliable binary judgments about correctness even when its own grounding predictions are imperfect.
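To illustrate what such a binary judgment can look like in practice, here is a hypothetical vlm_verify implementation usable in the mining loop above; the prompt wording, the PIL-style crop, and the query_vlm helper are all assumptions, not VALOR's actual interface.

```python
def vlm_verify(image, box, noun, query_vlm) -> bool:
    """Ask a VLM to judge a single detection: verification, not generation.

    `query_vlm` is an assumed callable that sends an image and a text prompt
    to a VLM (e.g., GPT-5-mini) and returns its text reply.
    """
    x1, y1, x2, y2 = box
    crop = image.crop((x1, y1, x2, y2))  # PIL-style crop around the candidate box
    reply = query_vlm(crop, f"Does this image show a {noun}? Answer yes or no.")
    return reply.strip().lower().startswith("yes")
```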
We compare VALOR to a series of models on visual reasoning benchmarks below.
We evaluate a series of open-source models, as well as VALOR, across a wide range of spatial reasoning benchmarks. Each LLM is used in a language-only setting and is prompted to generate Python programs that invoke an API of vision specialist models (detection, depth estimation, VQA), as described above. We execute the generated programs to measure accuracy on each benchmark.
Among the open-source models we evaluate (Llama3.2-11B, Gemma3-12B, and Qwen3-8B), Qwen3 consistently performs the best. Despite our using the instruction-tuned variants, Gemma3 and Llama3.2 routinely ignore our system prompts: both models frequently overwrite the input image path, define "placeholder" values, or argue that the query is impossible and refuse to answer altogether. Qwen3, in contrast, consistently produces reasonable programs, but mishandles nuanced details in the query and fails to use tools effectively. We believe these issues can be addressed via post-training, so we build VALOR on the capable Qwen3 model.
We compare VALOR-RL with Qwen3 to isolate the impact of verifier-improved reasoning. VALOR-RL uses a verifier-trained Qwen3 model with the same vision specialist models. Thus any improvements from Qwen3 to VALOR-RL stem from our LLM-verifier guided training. VALOR-RL shows gains over Qwen3: +3.4% on BLINK, +2.1% on VSR, and +1.3% on RoboSpatial. Most notably, VALOR-RL greatly improves on Omni3D-Bench (+6.4%), our most reasoning-intensive benchmark. On counting tasks TallyQA and CountBenchQA, reasoning is less critical, and VALOR-RL matches Qwen3.
In the plot above we compare VALOR, our final method, to VALOR-RL. The two variants execute identical programs, but VALOR uses the verifier-improved visual grounding module. VALOR yields strong gains across the board, particularly on grounding-focused benchmarks: +8.3% on CountBenchQA, +7.7% on RoboSpatial, and +5.3% on VSR. Improvements on Omni3D-Bench are smaller, as its complex queries make reasoning, rather than grounding, the main challenge for smaller LLMs. Notably, improving visual grounding for spatial reasoning does not harm general object detection; our training slightly boosts performance on the COCO validation set, from 48.4 to 48.7 mAP.
We introduce VALOR, an annotation-free training paradigm for visual reasoning that leverages multimodal verifiers to improve LLM reasoning and visual grounding, leading to significant improvements on a wide range of spatial reasoning benchmarks. We find that VLMs/LLMs are increasingly capable verifiers, not merely generators; indeed, there are tasks where they are excellent verifiers but mediocre generators (e.g., object detection). This suggests an alternative route to improving reasoning in the visual domain: leveraging the multimodal verification capabilities of these models to enable training in domains where ground truth is unavailable.
Acknowledgements

We thank Aadarsh Sahoo, Ilona Demler, and Ziqi Ma for their feedback on the project. The project is funded by Meta through the LLM evaluation research grant and partly through Caltech's CAST program. We also thank Google's Gemma Academic program for granting us API credits for their LLMs.
References
coming soon.