Steer3D, our feedforward model that injects text steering into pretrained image-to-3D models, edits diverse objects. Below we show Steer3D's predictions on examples from Edit3D-Bench.
Despite being trained only on synthetic data built from Objaverse assets, Steer3D generalizes to "in-the-wild" objects, such as objects from iPhone photos or online images.
Steer3D adapts ControlNet to 3D generation, injecting text steerability into pretrained image-to-3D models. As shown below, given an image (e.g. of a crab), existing image-to-3D models can generate a 3D crab that looks like the image. Steer3D lets the user edit the 3D crab with language, such as "replacing its legs with sleek robotic limbs colored silver". The new crab aligns with the editing text while remaining consistent with the original crab. Steer3D is trained on 100k-scale synthetic data generated by our automated data engine, which combines existing image-to-3D models and vision-language models to produce editing pairs that are diverse, consistent, and correct. Both our scalable data engine and our data-efficient architecture design contribute to a strong editing model.
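To make the ControlNet-style injection concrete, here is a minimal PyTorch sketch of how a trainable, zero-initialized control branch could feed an edit-text embedding into a frozen image-to-3D backbone. The module names (`FrozenBackboneBlock`, `ControlBranch`), the dimensions, and the single-block setup are illustrative assumptions, not the actual Steer3D implementation.

```python
import copy
import torch
import torch.nn as nn

class FrozenBackboneBlock(nn.Module):
    """Stand-in for one transformer block of a pretrained image-to-3D model."""
    def __init__(self, dim: int):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, x):
        h = self.norm1(x)
        x = x + self.attn(h, h, h)[0]
        return x + self.mlp(self.norm2(x))

class ControlBranch(nn.Module):
    """Trainable copy of a backbone block plus a zero-initialized output projection,
    so the text-conditioned edit signal starts out as a no-op (hypothetical sketch)."""
    def __init__(self, block: FrozenBackboneBlock, dim: int, text_dim: int):
        super().__init__()
        self.block = copy.deepcopy(block)           # trainable copy of the frozen block
        for p in self.block.parameters():
            p.requires_grad_(True)
        self.text_proj = nn.Linear(text_dim, dim)   # maps the edit-text embedding into latent space
        self.zero_out = nn.Linear(dim, dim)         # zero-init: contributes nothing at initialization
        nn.init.zeros_(self.zero_out.weight)
        nn.init.zeros_(self.zero_out.bias)

    def forward(self, x, text_emb):
        # Broadcast-add the text conditioning, run the trainable copy, project through the zero layer.
        h = x + self.text_proj(text_emb).unsqueeze(1)
        return self.zero_out(self.block(h))

# Usage: frozen backbone block + control branch, with the edit signal injected residually.
dim, text_dim = 256, 512
backbone_block = FrozenBackboneBlock(dim)
for p in backbone_block.parameters():
    p.requires_grad_(False)                         # pretrained backbone stays frozen
control = ControlBranch(backbone_block, dim, text_dim)

latents = torch.randn(2, 1024, dim)                 # hypothetical 3D latent tokens
text_emb = torch.randn(2, text_dim)                 # hypothetical edit-text embedding
out = backbone_block(latents) + control(latents, text_emb)
print(out.shape)  # torch.Size([2, 1024, 256])
```

Because the output projection starts at zero, the pretrained model's behavior is preserved at initialization and the text-driven edit is learned purely as a residual on top of it.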
To facilitate data-efficient training, we design a ControlNet-based architecture that leverages the shape and geometry priors of pretrained image-to-3D models. The architecture is shown below. We design a two-stage training recipe based on flow-matching training and Direct Preference Optimization (DPO) to avoid the trivial local minimum of "no edit". More details can be found in the paper!
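Below is a simplified sketch of what the two-stage recipe could look like as training objectives: a conditional flow-matching loss for stage one, and a diffusion/flow-style DPO loss for stage two in which the "win" sample is a genuinely edited output and the "lose" sample a near-identity one. The function signatures, the specific DPO formulation, and the stand-in models are assumptions for illustration, not the exact objectives from the paper.

```python
import torch
import torch.nn.functional as F

def flow_matching_loss(velocity_model, x0, x1, cond):
    """Stage 1 (sketch): conditional flow matching on editing pairs.
    x0 is noise, x1 the edited 3D latent; the model regresses the velocity x1 - x0."""
    t = torch.rand(x0.shape[0], device=x0.device).view(-1, 1, 1)
    xt = (1 - t) * x0 + t * x1                   # linear interpolation path between noise and target
    v_pred = velocity_model(xt, t.flatten(), cond)
    return F.mse_loss(v_pred, x1 - x0)

def dpo_loss(velocity_model, ref_model, xt, t, cond, v_win, v_lose, beta=0.1):
    """Stage 2 (sketch): preference optimization in the diffusion-DPO style.
    Preferring a genuinely edited target over a near-identity one discourages
    the trivial "no edit" minimum; a frozen reference model anchors the comparison."""
    def err(m, target):
        return ((m(xt, t, cond) - target) ** 2).flatten(1).mean(dim=1)
    pi_w, pi_l = err(velocity_model, v_win), err(velocity_model, v_lose)
    with torch.no_grad():
        ref_w, ref_l = err(ref_model, v_win), err(ref_model, v_lose)
    # Lower error on the preferred target, relative to the reference, is rewarded.
    logits = -beta * ((pi_w - ref_w) - (pi_l - ref_l))
    return -F.logsigmoid(logits).mean()

# Minimal usage with stand-in models (shapes only; not the actual Steer3D networks).
model = lambda x, t, c: x * 0.0                  # hypothetical velocity predictor
ref = lambda x, t, c: x * 0.0                    # hypothetical frozen reference copy
x0, x1 = torch.randn(2, 64, 16), torch.randn(2, 64, 16)
print(flow_matching_loss(model, x0, x1, cond=None))
print(dpo_loss(model, ref, x0, torch.rand(2), None, v_win=x1, v_lose=x0))
```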
We build a data engine that generates synthetic data and applies a two-stage filter to provide diverse, consistent, and correct editing pairs as our training data. Check out the paper for our scaling analysis, which demonstrates the importance of this scalable data strategy!
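As a rough illustration, the two-stage filter could be organized as below, with one stage scoring consistency with the source object and one stage scoring correctness against the editing text. The criteria, thresholds, and the `EditingPair` structure are hypothetical; a real pipeline would obtain the scores from a vision-language model rather than the dummy scorers used here.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class EditingPair:
    source_renders: list      # renders of the original 3D asset
    edited_renders: list      # renders of the candidate edited asset
    edit_text: str            # the editing instruction

def two_stage_filter(
    pairs: List[EditingPair],
    consistency_score: Callable[[EditingPair], float],  # hypothetical VLM scorer: are unedited regions preserved?
    correctness_score: Callable[[EditingPair], float],  # hypothetical VLM scorer: does the change match the text?
    consistency_thresh: float = 0.7,
    correctness_thresh: float = 0.7,
) -> List[EditingPair]:
    """Sketch of a two-stage filter over candidate editing pairs:
    stage 1 keeps pairs that stay consistent with the source object,
    stage 2 keeps pairs whose edit is correct with respect to the editing text."""
    stage1 = [p for p in pairs if consistency_score(p) >= consistency_thresh]
    return [p for p in stage1 if correctness_score(p) >= correctness_thresh]

# Usage with dummy scorers standing in for VLM queries.
dummy = lambda p: 0.9
pairs = [EditingPair([], [], "replacing its legs with sleek robotic limbs colored silver")]
print(len(two_stage_filter(pairs, dummy, dummy)))  # 1
```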
@misc{ma2025feedforward3deditingtextsteerable,
title={Feedforward 3D Editing via Text-Steerable Image-to-3D},
author={Ziqi Ma and Hongqiao Chen and Yisong Yue and Georgia Gkioxari},
year={2025},
eprint={2512.13678},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2512.13678},
}