Feedforward 3D Editing via Text-Steerable Image-to-3D

California Institute of Technology

We present Steer3D: a method to add text steerability to pretrained image-to-3D generative models. Steer3D adapts the ControlNet architecture to 3D generation. Trained only on 100k-scale synthetic data generated by our data engine, Steer3D can perform diverse edits and even generalizes to objects in iPhone photos or online images. Scroll down for an interactive demo!

Feedforward 3D Editing on Edit3D-Bench

Steer3D, our feedforward model which injects text steering into pretrained image-to-3D models, can edit diverse objects. Below we show Steer3D's predictions on examples from Edit3D-Bench.

Replace the natural antlers with glowing neon blue
Replace legs with sleek robotic limbs colored silver
Remove the green top part of the pumpkin
Remove the knob from the side of the cup
Replace the spherical studs on the body with bright LED lights
Replace the hollow backrest with a transparent blue glass panel
Replace the black lampshade with a metallic gold lampshade
Remove one leg from the bench
Add a wooden lid on top of the barrel
Replace the flat disc surface with a textured carbon fiber
Make the texture of the teapot rough and rustic
Add short hair on top of the head
Remove the string from the crossbow
Make the finish of the armrests a polished chrome

3D Editing of In-The-Wild Objects

Steer3D, despite being trained only on synthetic data built from Objaverse assets, generalizes to "in-the-wild" objects, such as objects from iPhone photos or online images.


Approach

Steer3D adapts ControlNet to 3D generation, injecting text steerability into pretrained image-to-3D models. Given an image (e.g. of a crab), existing image-to-3D models can generate a 3D crab that looks like the image. Steer3D lets the user edit the 3D crab with language, such as "replace its legs with sleek robotic limbs colored silver". The new crab aligns with the editing text while remaining consistent with the original crab. Steer3D is trained on 100k-scale synthetic data generated by our automated data engine, which combines existing image-to-3D models and vision-language models to produce editing pairs that are diverse, consistent, and correct. Both the scalable data engine and the data-efficient architecture design contribute to a strong editing model.
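
To make the pipeline concrete, below is a minimal sketch of the inference flow, assuming a ControlNet-style residual injection into a frozen image-to-3D backbone. Every name here (backbone, controlnet, text_encoder, and their methods) is an illustrative placeholder, not the released API.

# Minimal sketch of a ControlNet-style text-steered image-to-3D inference
# loop. All classes, methods, and signatures are illustrative placeholders.

def edit_3d(image, edit_text, backbone, controlnet, text_encoder):
    """Generate an edited 3D asset from an input image and an edit prompt."""
    # Condition the pretrained backbone on the input image, exactly as in
    # ordinary image-to-3D generation.
    image_cond = backbone.encode_image(image)

    # Encode the edit instruction; the ControlNet branch injects this
    # signal into the frozen backbone at every sampling step.
    text_cond = text_encoder(edit_text)

    latent = backbone.init_latent()
    for t in backbone.timesteps():
        # The trainable ControlNet copy of the backbone blocks produces
        # residual features that are added to the frozen backbone's
        # activations, steering generation toward the requested edit.
        residuals = controlnet(latent, t, image_cond, text_cond)
        velocity = backbone.denoise(latent, t, image_cond, extra=residuals)
        latent = backbone.step(latent, velocity, t)

    # Decode the final latent into the edited 3D asset.
    return backbone.decode(latent)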


Architecture and Recipe

To facilitate data-efficient training, we design a ControlNet-based architecture that leverages the shape and geometry prior of pretrained image-to-3D models. We design a two-stage training recipe, flow-matching training followed by Direct Preference Optimization (DPO), to avoid the trivial local minimum of "no edit". More details can be found in the paper!
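
As a rough illustration of the two stages, the sketch below pairs a conditional flow-matching loss with a Diffusion-DPO-style preference loss that favors the edited latent over the "no edit" latent. Tensor shapes, the weighting, and the model signature model(x_t, t, cond) are simplifying assumptions, not the exact training code.

import torch
import torch.nn.functional as F

# Stage 1: conditional flow matching. The model predicts the velocity that
# transports noise x0 to the edited latent x1 along a straight path.
def flow_matching_loss(model, x1, cond):
    x0 = torch.randn_like(x1)                       # noise sample
    t = torch.rand(x1.shape[0], device=x1.device)   # random time in [0, 1]
    t_ = t.view(-1, *([1] * (x1.dim() - 1)))
    xt = (1 - t_) * x0 + t_ * x1                    # linear interpolation
    return F.mse_loss(model(xt, t, cond), x1 - x0)  # regress the velocity

# Stage 2: DPO-style preference loss that prefers the edited latent x_w
# over the "no edit" latent x_l, relative to a frozen reference model.
# This penalizes the trivial local minimum of copying the input shape.
def dpo_loss(model, ref_model, x_w, x_l, cond, beta=0.1):
    def per_sample_err(m, xt, t, x, x0):
        err = (m(xt, t, cond) - (x - x0)) ** 2
        return err.flatten(1).mean(dim=1)

    def policy_minus_ref(x):
        # Share noise and timestep between policy and reference so their
        # flow-matching errors are directly comparable.
        x0 = torch.randn_like(x)
        t = torch.rand(x.shape[0], device=x.device)
        t_ = t.view(-1, *([1] * (x.dim() - 1)))
        xt = (1 - t_) * x0 + t_ * x
        with torch.no_grad():
            ref_err = per_sample_err(ref_model, xt, t, x, x0)
        return per_sample_err(model, xt, t, x, x0) - ref_err

    # Lower error on the winner (edited) and higher error on the loser
    # ("no edit") drives the logits up, as in Diffusion-DPO.
    logits = policy_minus_ref(x_l) - policy_minus_ref(x_w)
    return -F.logsigmoid(beta * logits).mean()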


Data Engine

We build a data engine that generates synthetic data and applies a two-stage filter, yielding diverse, consistent, and correct editing pairs as our training data. Check out the paper for the scaling analysis that backs up the importance of this scalable data strategy!
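
One plausible shape for such a data engine is sketched below: propose an edit, synthesize the edited 3D asset with existing models, then keep the pair only if it passes both a consistency check and a correctness check. Every helper here (render, apply_text_edit, the VLM scoring calls, the thresholds) is a hypothetical stand-in for the components described above.

# Illustrative data-engine loop with a two-stage filter. All helper names
# and thresholds are hypothetical stand-ins, not the actual pipeline.

def build_editing_pairs(assets, propose_edit, apply_text_edit, image_to_3d,
                        vlm, render, tau_consistency=0.8, tau_correctness=0.8):
    pairs = []
    for asset in assets:
        image = render(asset)                  # render the source 3D asset
        prompt = propose_edit(image)           # VLM proposes a diverse edit
        edited_image = apply_text_edit(image, prompt)  # 2D editing model
        edited_asset = image_to_3d(edited_image)       # lift the edit to 3D
        edited_view = render(edited_asset)

        # Filter stage 1 (consistency): everything the prompt does not
        # mention must stay unchanged between source and edited asset.
        if vlm.score_consistency(image, edited_view, prompt) < tau_consistency:
            continue

        # Filter stage 2 (correctness): the requested edit must actually
        # be realized in the edited asset.
        if vlm.score_correctness(edited_view, prompt) < tau_correctness:
            continue

        pairs.append((asset, prompt, edited_asset))
    return pairs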


BibTeX

@misc{ma2025feedforward3deditingtextsteerable,
  title={Feedforward 3D Editing via Text-Steerable Image-to-3D},
  author={Ziqi Ma and Hongqiao Chen and Yisong Yue and Georgia Gkioxari},
  year={2025},
  eprint={2512.13678},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2512.13678},
}