VADAR: Visual Agentic AI for Spatial Reasoning with a Dynamic API

Caltech

tl;dr VADAR is a new program synthesis approach for 3D spatial reasoning: LLM agents collaboratively generate a Pythonic API that solves common reasoning subproblems and composes them to tackle challenging 3D spatial queries.

Abstract

Visual reasoning is essential for embodied agents in 3D environments. While vision-language models can answer image-based questions, they struggle with 3D spatial reasoning. To address this, we propose an agentic program synthesis approach where LLM agents collaboratively generate a Pythonic API, enabling dynamic function creation for solving subproblems. Unlike static, human-defined APIs, our method adapts to diverse queries. We also introduce a new benchmark for 3D understanding, requiring multi-step grounding and inference. Our approach outperforms prior zero-shot models, demonstrating its effectiveness for 3D spatial reasoning.

Approach

VADAR leverages an agentic program synthesis approach to produce a dynamic API that can be extended to address new queries requiring novel skills. The API breaks complex reasoning problems down into simpler subproblems that can be addressed with vision specialist modules (e.g., object detection), whose outputs are then composed via program synthesis. The generated API is written in Python, and programs are tested and executed by Pythonic agents.
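To make this concrete, here is a minimal, hypothetical sketch of such a synthesized program; the grounding stub (loc_3d), the generated helper (distance_3d), and the query are illustrative assumptions on our part, not VADAR's actual generated API.

```python
# Hypothetical sketch of a VADAR-style synthesized program; loc_3d and
# distance_3d are illustrative assumptions, not the exact API the agents produce.

def loc_3d(image, name: str) -> tuple[float, float, float]:
    """Stub for a grounding subroutine: detect `name` in the image, then
    unproject its 2D center to a 3D point using predicted depth."""
    raise NotImplementedError("composes detection + depth specialist modules")

def distance_3d(image, obj_a: str, obj_b: str) -> float:
    """A generated API method for a recurring subproblem: 3D distance."""
    pa, pb = loc_3d(image, obj_a), loc_3d(image, obj_b)
    return sum((u - v) ** 2 for u, v in zip(pa, pb)) ** 0.5

def program(image):
    """A program synthesized for: 'How far is the chair from the table?'"""
    return distance_3d(image, "chair", "table")
```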


DFS Method Implementation

To ensure that no method implementation calls a method that has not yet been implemented, we build a dependency tree over the generated signatures and implement the methods via a depth-first traversal, so callees are always implemented before their callers.
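A minimal sketch of this traversal is below, assuming a deps map from each generated method signature to the signatures its implementation calls (all names are hypothetical, and we assume the dependency graph is acyclic, as a tree is):

```python
# Sketch of dependency-ordered implementation via post-order DFS:
# every method is implemented only after all of its callees.

def implementation_order(deps: dict[str, list[str]]) -> list[str]:
    order: list[str] = []
    visited: set[str] = set()

    def visit(sig: str) -> None:
        if sig in visited:
            return
        visited.add(sig)
        for callee in deps.get(sig, []):  # recurse into dependencies first
            visit(callee)
        order.append(sig)                 # post-order: sig after its callees

    for sig in deps:
        visit(sig)
    return order

# Example: distance_3d calls loc_3d, which calls two specialist wrappers.
deps = {"distance_3d": ["loc_3d"], "loc_3d": ["detect", "depth"]}
print(implementation_order(deps))
# -> ['detect', 'depth', 'loc_3d', 'distance_3d']
```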

Omni3D-Bench

To further assess AI capabilities for 3D understanding, we introduce a new benchmark of queries involving multiple steps of grounding and inference. Omni3D-Bench features 500 challenging, non-templated queries with images sourced from Omni3D, a dataset of images from diverse real-world scenes with 3D object annotations. The queries test 3D reasoning: they require grounding objects in 3D and combining predicted attributes to reason about distances and dimensions. We show more samples from Omni3D-Bench below.
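As an invented illustration of the multi-step composition such queries demand (the query and numbers below are ours, not benchmark samples), answering "How many boxes, stacked, would reach the top of the door?" requires grounding both objects in 3D, estimating their heights, and combining the predictions arithmetically:

```python
# Invented numbers for illustration only; not Omni3D-Bench data.
box_height_m = 0.4    # predicted 3D height of the box (meters)
door_height_m = 2.0   # predicted 3D height of the door (meters)
print(round(door_height_m / box_height_m))  # -> 5 boxes stacked reach the door
```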


Results

Here we show example programs generated by VADAR along with their execution outputs. We suggest zooming in to read the programs clearly!

VADAR correctly handles the non-templated, complex queries about 3D spatial relationships in Omni3D-Bench. Importantly, VADAR does not rely on priors over object sizes and can therefore handle hypothetical queries.
VADAR solves complex CLEVR queries involving multiple reasoning steps with zero supervision.
We also show results on GQA and on the concurrent VSI-Bench. Unlike Omni3D-Bench and CLEVR, GQA primarily focuses on object appearance rather than 3D spatial reasoning.

Failure Cases

We show examples of failure cases below. The most common failures stem from errors in the vision specialist modules (e.g., missed detections, incorrect VQA responses); severe occlusions are particularly problematic for these specialists. Additionally, we find that VADAR often struggles with queries requiring five or more inference steps (e.g., "There is a yellow cylinder to the right of the cube that is behind the purple block; is there a brown object in front of it?").

BibTeX

@misc{marsili2025visualagenticaispatial,
    title={Visual Agentic AI for Spatial Reasoning with a Dynamic API}, 
    author={Damiano Marsili and Rohun Agrawal and Yisong Yue and Georgia Gkioxari},
    year={2025},
    eprint={2502.06787},
    archivePrefix={arXiv},
    primaryClass={cs.CV},
    url={https://arxiv.org/abs/2502.06787}, 
}