Aligning Text, Images, and 3D Structure Token-by-Token

Caltech

tl;dr We propose a unified LLM framework that aligns language, images, and structured 3D scenes to perform core 3D tasks.

Abstract

Creating machines capable of understanding the world in 3D is essential for assisting designers who build and edit 3D environments and robots that navigate and interact within three-dimensional space. Inspired by advances in language and image modeling, we investigate the potential of autoregressive models for a new modality: structured 3D scenes. To this end, we propose a unified LLM framework that aligns language, images, and 3D scenes, and we provide a detailed "cookbook" outlining critical design choices for achieving optimal training and performance, addressing key questions related to data representation, modality-specific objectives, and more. We evaluate performance across four core 3D tasks – rendering, recognition, instruction-following, and question-answering – and four 3D datasets, synthetic and real-world. We extend our approach to reconstruct complex 3D object shapes by enriching our 3D modality with quantized shape encodings, and show our model's effectiveness on real-world 3D object recognition tasks.

Approach

Kyvo: a decoder-only transformer that aligns a structured 3D modality with language and vision. This 3D modality represents scenes as lists of objects, each defined by its 3D shape, type, 3D position, pose, and size. Kyvo unifies the token space of images, text, and 3D to enable a variety of complex visual 3D tasks.
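To make the list-of-objects idea concrete, here is a minimal sketch of serializing a structured 3D scene into a flat token sequence for an autoregressive model. The delimiter tokens, field names, and two-decimal quantization below are illustrative assumptions, not Kyvo's actual tokenization scheme.

```python
def serialize_scene(objects):
    """Flatten a list of object dicts into a discrete token sequence."""
    tokens = ["<scene>"]
    for obj in objects:
        tokens += ["<object>", obj["shape"]]
        # Quantize continuous position values to fixed-precision strings
        # so each coordinate becomes a single discrete token.
        tokens += [f"{v:.2f}" for v in obj["position"]]
        tokens += [f"{obj['pose']:.2f}", obj["size"], "</object>"]
    tokens.append("</scene>")
    return tokens

scene = [{"shape": "bottle", "position": (1.25, -0.50, 0.00),
          "pose": 90.0, "size": "large"}]
print(serialize_scene(scene))
```

Because every field is mapped to a token from a shared vocabulary, the same decoder can read and emit scenes alongside text and image tokens.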


Unified Shape and Scene Reconstruction

From a single image, Kyvo reconstructs 3D object shape and layout.


Shape and Scene Rendering

Given 3D assets and the desired 3D location and pose, Kyvo predicts the image autoregressively.


Qualitative Examples

Example image generations for the rendering task on CLEVR. The model takes a 3D scene as input and produces a corresponding image. Additionally, we show the ground-truth image rendered using Blender.
Given a single input image, Kyvo predicts shape sequences and reconstructs individual objects (bottle, cheeseburger, etc.) along with their 3D locations and poses via our structured 3D modality, effectively reconstructing the 3D scene with consistent spatial relations between the objects, visualized using Blender.
Example image generations for the rendering task on ObjaWorld. The model takes a 3D scene with embedded shape encodings as input and produces a corresponding image. Additionally, we show the ground-truth image rendered using Blender.
Example image generations for the rendering task on ObjaWorld. The model takes a 3D scene as input and produces a corresponding image. Additionally, we show the ground-truth image rendered using Blender.
Two example predictions from the recognition task on ObjaWorld. The colored numbers indicate object matching between the predicted and ground-truth scenes, based on the criteria for Jaccard Index as defined in Algorithm 1 in the paper.
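The exact matching criteria are given in Algorithm 1 in the paper; the following is only a simplified illustration of a Jaccard-style score, where a predicted object counts as matched if its attributes agree with an unused ground-truth object and their positions fall within a tolerance. The `pos_tol` threshold and shape-only attribute check are placeholder assumptions.

```python
import math

def jaccard_index(pred, gt, pos_tol=0.5):
    """Greedy one-to-one matching, then intersection over union of object sets."""
    matched, used = 0, set()
    for p in pred:
        for i, g in enumerate(gt):
            if i in used:
                continue
            same_attrs = p["shape"] == g["shape"]
            close = math.dist(p["position"], g["position"]) <= pos_tol
            if same_attrs and close:
                matched += 1
                used.add(i)
                break
    return matched / (len(pred) + len(gt) - matched)

pred = [{"shape": "vase", "position": (0.0, 1.0)}]
gt = [{"shape": "vase", "position": (0.1, 1.1)},
      {"shape": "barrel", "position": (2.0, 2.0)}]
print(jaccard_index(pred, gt))  # 1 match, 2 objects total -> 0.5
```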
Example cases from the question-answering task on CLEVR. The model takes an image, a 3D scene, and a question as input to generate the corresponding answer.

Interactive Rendering Demos

Below is a JSON panel with exactly 3 objects. Each object has fixed xy-coords, but you can change its shape to balloon, barrel, vase, bottle, or basketball. Click "Generate" to retrieve the corresponding pre-computed ObjaWorld image.

ObjaWorld Scene Objects

Generated Image (Kyvo)

Generated Output

Below is a JSON panel with exactly 3 objects. The first and third are read-only; you can modify only the middle object. Choose a shape, material, color, and size for the second object. The xy-coords are fixed. Click "Generate" to retrieve the corresponding pre-computed image.

Scene Objects (Modify the middle one)

{
    "shape": "sphere",
    "material": "metal",
    "color": "red",
    "size": "large",
    "xy-coords": "(-1.60,-2.20)"
}
{
    "shape": "",
    "material": "",
    "color": "",
    "size": "",
    "xy-coords": "(2.40,-2.35)"
}
{
    "shape": "cylinder",
    "material": "rubber",
    "color": "yellow",
    "size": "large",
    "xy-coords": "(2.75,0.35)"
}

Generated Image (Kyvo)

Generated Output

Ground-truth Image (Blender)

Ground-truth Output

Varying the X-coordinate (Rolling Ball Demo)

Below is another interactive demo that lets you vary the x-coordinate from -3.00 to +3.00 in steps of 0.05. For each x value, we display a pair of images: generated vs. ground truth.
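The sweep above corresponds to a simple arithmetic grid of 121 x values; a one-line sketch (with rounding to avoid floating-point drift in the displayed labels):

```python
# x-coordinate grid for the rolling-ball demo: -3.00 to +3.00 in steps of 0.05.
xs = [round(-3.00 + 0.05 * i, 2) for i in range(121)]
print(len(xs))        # 121 values
print(xs[0], xs[-1])  # -3.0 3.0
```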

0.00

Generated Image (Kyvo)

Generated Output

Ground-truth Image (Blender)

Ground-truth Output

BibTeX

@misc{sahoo2025aligningtextimages3d,
  title={Aligning Text, Images, and 3D Structure Token-by-Token}, 
  author={Aadarsh Sahoo and Vansh Tibrewal and Georgia Gkioxari},
  year={2025},
  eprint={2506.08002},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2506.08002}, 
}