tl;dr: We introduce VALOR, an annotation-free framework that boosts both visual reasoning and grounding by training with AI verifiers instead of human labels. A language model verifier improves reasoning through reinforcement learning, while a vision-language verifier enhances grounding via automatic hard-negative mining. The result is a stronger visual reasoning system that outperforms open-source and proprietary models across a suite of visual reasoning benchmarks.
The progress of LLMs in text-based reasoning in 2024-2025 has been remarkable. In the course of a year, we saw LLMs go from general-purpose chatbots to solving competition-level math problems, writing sophisticated code in large repositories, and answering expert-level questions in the sciences. This rate of progress is apparent when comparing benchmark results from different model releases. For example, GPT-4o achieved just 15.0% on the math problems in AIME 2025, while OpenAI's later GPT-5-mini achieves a staggering 94.0%. Similar jumps appear on the coding benchmark LiveCodeBench and the graduate-level science benchmark GPQA.
Unfortunately, progress in visual reasoning has been significantly slower. Comparing the same two models, the improvements are much more modest: +7.3% on BLINK, +3.5% on VSR, and a slight decrease on CountBenchQA. Note, however, that in absolute terms most of these visual benchmarks are easier: GPT-4o already attains relatively high accuracy on BLINK, VSR, and CountBenchQA, whereas its performance on the text-only benchmarks is substantially lower. On the harder 3D spatial reasoning benchmark Omni3D-Bench, accuracy remains quite low: GPT-5-mini reaches 40.9%, only 5.9% above GPT-4o. This discrepancy is not surprising; visual reasoning is challenging, requiring precise object grounding and an understanding of complex spatial relationships, both of which remain difficult for current models.
Methods to improve visual reasoning broadly fall into two categories. The first integrates grounding with language reasoning, where vision-language models (VLMs) generate chain-of-thought explanations in text. Examples include Thinking with Images [1], GRIT [2], and Visually Grounded RL [3]. These methods can handle simple spatial relations, but suffer from weak visual understanding and logical errors. For instance, in the example above, GPT-5-Thinking ignores real-world 3D object sizes and considers only pixel-wise dimensions, incorrectly concluding that the coffee table is one-sixth the size of the sofa. These methods are also data-hungry, requiring extensive supervision.
Another line of work uses LLMs for program synthesis with vision specialists. Examples include VADAR [4], VisProg [5], and ViperGPT [6]. These training-free approaches rely on proprietary LLMs and on pre-trained specialists that are poorly suited to visual and spatial reasoning.
We introduce VALOR, a scalable, annotation-free training framework that tackles spatial reasoning from images by combining LLM-powered reasoning with specialized tool use. VALOR employs an LLM to generate plans and executable programs, and invokes vision specialists to execute them. Both the reasoning model and the visual grounding model are tuned for the task via a label-free training paradigm: multimodal verifiers critique model outputs, and their feedback serves as a learning signal to improve both components, the LLM responsible for logic and the vision specialists responsible for grounding. We name our approach VALOR as it integrates Verifiers for Annotation-free LOgic and Reasoning.
Plan and Code Generation. Given a query, the LLM generates a natural language plan followed by a corresponding program in Python; a sketch of such a program follows the API list below. Available to the LLM are the APIs of three function calls:

- GD_DETECT returns the bounding boxes of all object instances matching a noun description, e.g., GD_DETECT("CAR"), using a GroundingDINO model [7].
- DEPTH returns the depth of a pixel in the image, e.g., DEPTH(IMAGE, X, Y), using MoGe2 [8].
- VQA returns an object's attribute (e.g., color) from an input image crop around the object, e.g., VQA(IMAGE_CROP, "WHAT IS THE COLOR OF THE OBJECT IN THE IMAGE?"), using GPT-5-mini [9].
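To make this concrete, here is a minimal sketch of the kind of program the LLM might emit. It assumes the executor injects GD_DETECT, DEPTH, and the input IMAGE into the program's namespace, and that boxes are (x1, y1, x2, y2) tuples; the query and helper function are illustrative, not taken from VALOR's actual outputs.

```python
# Hypothetical generated program for the query:
# "Is the helmet closer to the camera than the chair?"
# GD_DETECT, DEPTH, and IMAGE are assumed to be bound by the executor.

helmets = GD_DETECT("HELMET")   # bounding boxes of all helmets
chairs = GD_DETECT("CHAIR")     # bounding boxes of all chairs

def center(box):
    """Return the pixel coordinates of a box center."""
    x1, y1, x2, y2 = box
    return (x1 + x2) / 2, (y1 + y2) / 2

hx, hy = center(helmets[0])
cx, cy = center(chairs[0])

# Compare depths at the two box centers; smaller depth means closer to the camera.
answer = "yes" if DEPTH(IMAGE, hx, hy) < DEPTH(IMAGE, cx, cy) else "no"
```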
VALOR leverages LLM verifiers as a reward signal to improve reasoning via reinforcement learning. The LLM verifier critiques model outputs across a rubric of six criteria targeting specific aspects of spatial reasoning. Our reward is composed of six binary rewards. The format reward $r_{\mathrm{fmt}}$ is 1 if the output wraps the plan and final answer in <plan>...</plan> and <answer>...</answer> tags, and 0 otherwise. The remaining five rewards are judged by the LLM verifier, covering the generated code ($r_{\mathrm{sn}}$), the logic of the plan ($r_{\mathrm{log}}$), its handling of object attributes ($r_{\mathrm{att}}$) and spatial relations ($r_{\mathrm{sp}}$), and the adherence of the code to the plan ($r_{\mathrm{ad}}$).
Our final reward is:

\[ R(q,p,c) = r_{\mathrm{fmt}}(p,c) \cdot \big[ \lambda_{\mathrm{sn}}\, r_{\mathrm{sn}}(c) + \lambda_{\mathrm{log}}\, r_{\mathrm{log}}(q,p) + \lambda_{\mathrm{att}}\, r_{\mathrm{att}}(q,p) + \lambda_{\mathrm{sp}}\, r_{\mathrm{sp}}(q,p) + \lambda_{\mathrm{ad}}\, r_{\mathrm{ad}}(p,c) \big] \]

The format reward $r_{\mathrm{fmt}}$ acts as a hard constraint and is applied as a multiplier, while the weighted sum of the remaining rewards evaluates content quality. All $r_k \in \{0, 1\}$ and $\sum_k \lambda_k = 1.0$.
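As a sanity check on the formula, here is a minimal sketch of the composite reward in Python. The weight values are illustrative assumptions (the actual $\lambda_k$ are not specified above); they only need to sum to 1.0.

```python
# Illustrative weights; the real lambda values are not given in the text.
LAMBDAS = {"sn": 0.2, "log": 0.2, "att": 0.2, "sp": 0.2, "ad": 0.2}

def reward(r_fmt: int, r: dict) -> float:
    """r_fmt gates everything; the remaining binary rewards form a weighted sum."""
    assert abs(sum(LAMBDAS.values()) - 1.0) < 1e-9
    return r_fmt * sum(LAMBDAS[k] * r[k] for k in LAMBDAS)

# A well-formatted rollout failing only the spatial criterion scores 0.8;
# a malformed rollout scores 0 regardless of content quality.
print(reward(1, {"sn": 1, "log": 1, "att": 1, "sp": 0, "ad": 1}))  # 0.8
print(reward(0, {"sn": 1, "log": 1, "att": 1, "sp": 1, "ad": 1}))  # 0.0
```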
In addition to logic, visual reasoning relies on accurate grounding. Modern detectors like GroundingDINO, trained on web data, are error-prone and struggle to generalize beyond their training domains. Fine-tuning with domain-specific labels can mitigate these issues, but collecting such annotations is labor-intensive. We propose an alternative: improving visual grounding through VLM verifiers. Vision specialists make predictions, VLM verifiers evaluate them, and the verified detections augment the specialists' training set. This approach requires no manual annotations and scales across domains without additional labels.
Our approach for verifier-improved visual grounding relies on image-query pairs $\{(I_j, q_j)\}_{j=1}^{M}$. For each query $q_j$, our LLM reasoning model generates a plan and code, $(p_j, c_j)$. From code $c_j$, we parse all grounding queries, e.g., GD_DETECT("HELMET"), and execute them with a pre-trained detector. To ensure high recall, we lower the detector's confidence threshold. This leads to overprediction, which we validate with a frozen VLM verifier in three steps.
Confirmed detections form a new training set, which we use to fine-tune a pre-trained GroundingDINO detector.
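The mining loop itself is simple enough to sketch. Below is a hypothetical implementation under a few assumptions: the detector and VLM verifier are passed in as callables, grounding calls are recovered with a regex over the generated code, and the 0.15 threshold is illustrative.

```python
import re

def mine_detections(image_code_pairs, detector, vlm_verify, conf_threshold=0.15):
    """Build a label-free detection training set from VLM-confirmed boxes."""
    dataset = []
    for image, code in image_code_pairs:
        # Recover every grounding query, e.g. GD_DETECT("HELMET"), from the program.
        for noun in re.findall(r'GD_DETECT\("([^"]+)"\)', code):
            # A low confidence threshold trades precision for recall,
            # deliberately overpredicting candidate boxes.
            candidates = detector(image, noun, conf_threshold=conf_threshold)
            # The frozen VLM verifier keeps only the boxes it judges correct.
            confirmed = [box for box in candidates if vlm_verify(image, box, noun)]
            dataset.extend((image, noun, box) for box in confirmed)
    return dataset  # used to fine-tune the pre-trained GroundingDINO detector
```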
The predicted Python programs, which invoke our vision-specialist APIs, are executed to produce answers to visual reasoning queries.
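For illustration, a minimal executor might inject the specialist APIs into the program's namespace and read off a designated output variable. The `answer` variable name is an assumed convention, not necessarily VALOR's actual interface.

```python
def run_program(code: str, image, apis: dict):
    """Execute an LLM-generated program with the specialist APIs injected.

    `apis` maps names like "GD_DETECT", "DEPTH", and "VQA" to callables
    wrapping GroundingDINO, MoGe2, and GPT-5-mini. The program is assumed
    to store its result in a variable named `answer` (an illustrative
    convention).
    """
    env = {"IMAGE": image, **apis}
    exec(code, env)          # run the generated Python in this namespace
    return env.get("answer")
```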
A natural question is whether VALOR can ever outperform the verifiers it uses during training. To answer this, we first note that VALOR uses multimodal verifiers to select and critique data, not to generate it. Thus, VALOR is bounded not by an LLM/VLM's generation ability, but by its verification ability. This distinction matters because we find tasks where VLMs are better verifiers than generators. As a concrete example, VALOR uses GPT-5-mini as the VLM verifier for improving the visual grounding module. Although highly effective at evaluating object detections, it often struggles to generate bounding boxes itself. In the figure above, GPT-5-mini frequently outputs misaligned or overly large boxes, failing to localize objects that VALOR (trained with GPT-5-mini as a verifier) correctly detects. In short, a VLM can provide reliable binary judgments about correctness even when its own grounding predictions are imperfect.
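To illustrate what such a binary judgment can look like in practice, here is a hypothetical vlm_verify implementation usable in the mining loop above; the prompt wording, the PIL-style crop, and the query_vlm helper are all assumptions, not VALOR's actual interface.

```python
def vlm_verify(image, box, noun, query_vlm) -> bool:
    """Ask a VLM to judge a single detection: verification, not generation.

    `query_vlm` is an assumed callable that sends an image and a text prompt
    to a VLM (e.g., GPT-5-mini) and returns its text reply.
    """
    x1, y1, x2, y2 = box
    crop = image.crop((x1, y1, x2, y2))  # PIL-style crop around the candidate box
    reply = query_vlm(crop, f"Does this image show a {noun}? Answer yes or no.")
    return reply.strip().lower().startswith("yes")
```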
We compare VALOR to a series of models on visual reasoning benchmarks below.
We evaluate a series of open-source models, as well as VALOR, across a wide range of spatial reasoning benchmarks. Each LLM is used in a language-only setting and is prompted to generate Python programs that invoke an API of vision specialist models (detection, depth estimation, VQA), as described above. We execute the generated programs to measure accuracy on each benchmark.
Among the open-source models we evaluate (Llama3.2-11B, Gemma3-12B, and Qwen3-8B), Qwen3 consistently performs the best. Despite our using the instruction-tuned variants, Gemma3 and Llama3.2 routinely ignore our system prompts: both models frequently overwrite the input image path, define "placeholder" values, or argue that the query is impossible and refuse to answer altogether. Qwen3, in contrast, consistently produces reasonable programs, but mishandles nuanced details in the query and fails to use tools effectively. We believe these issues can be addressed via post-training, so we build VALOR on the capable Qwen3 model.
We compare VALOR-RL with Qwen3 to isolate the impact of verifier-improved reasoning. VALOR-RL uses a verifier-trained Qwen3 model with the same vision specialist models. Thus any improvements from Qwen3 to VALOR-RL stem from our LLM-verifier guided training. VALOR-RL shows gains over Qwen3: +3.4% on BLINK, +2.1% on VSR, and +1.3% on RoboSpatial. Most notably, VALOR-RL greatly improves on Omni3D-Bench (+6.4%), our most reasoning-intensive benchmark. On counting tasks TallyQA and CountBenchQA, reasoning is less critical, and VALOR-RL matches Qwen3.
In the plot above we compare VALOR, our final method, to VALOR-RL. The two variants execute identical programs, but VALOR uses the verifier-improved visual grounding module. VALOR yields strong gains across the board, particularly on grounding-focused benchmarks: +8.3% on CountBenchQA, +7.7% on RoboSpatial, and +5.3% on VSR. Improvements on Omni3D-Bench are smaller, as its complex queries make reasoning, rather than grounding, the main challenge for smaller LLMs. Notably, improving visual grounding for spatial reasoning does not harm general object detection; our training slightly boosts performance on the COCO validation set, from 48.4 to 48.7 mAP.
We introduce VALOR, an annotation-free training paradigm for visual reasoning that leverages multimodal verifiers to improve LLM reasoning and visual grounding, leading to significant improvements on a wide range of spatial reasoning benchmarks. We find that VLMs/LLMs are increasingly capable verifiers, not merely generators; indeed, there are tasks where they are excellent verifiers but mediocre generators (e.g., object detection). This suggests an alternative route to improving reasoning in the visual domain: leveraging the multimodal verification capabilities of these models to enable training in domains where ground truth is unavailable.
Acknowledgements

We thank Aadarsh Sahoo, Ilona Demler, and Ziqi Ma for their feedback on the project. The project is funded by Meta through the LLM evaluation research grant and partly through Caltech's CAST program. We also thank Google's Gemma Academic program for granting us API credits for their LLMs.
References
coming soon.