Conversational image segmentation grounds abstract, intent-driven concepts into pixel-accurate masks. Prior work on referring image grounding focuses on categorical and spatial queries (e.g., "left-most apple") and overlooks functional and physical reasoning (e.g., "where can I safely store the knife?").
We address this gap and introduce Conversational Image Segmentation (CIS) and ConvSeg, a benchmark spanning entities, spatial relations, intent, affordances, functions, safety, and physical reasoning. We also present ConvSeg-Net, which fuses strong segmentation priors with language understanding, and an AI-powered data engine that generates prompt–mask pairs without human supervision. We show that ConvSeg-Net achieves significant gains on ConvSeg and maintains strong performance on existing benchmarks.
ConvSeg-Net generalizes across diverse reasoning tasks.
"Segment the elephant acting as the vanguard of the herd."
"Identify the sphere farthest away from the polar bear."
"Segment the figure actively wearing eye protection."
"Segment surfaces suitable for holding a stable cup."
"Segment the animal posing the greatest immediate collision risk."
"Segment the object serving a substituted functional role."
"Segment the object used to gain attention."
"Segment vessels currently holding a liquid beverage."
"Segment the moving object with the highest momentum."
"Identify the upholstered furniture piece."
"Segment primary furniture intended for concealed storage."
"Segment objects most likely to tip over."
Existing segmentation benchmarks are heavily skewed toward simple entities and spatial relations. To measure progress in grounding abstract concepts, we introduce ConvSeg, a benchmark featuring balanced coverage across five concept families: Entities, Spatial & Layout, Relations & Events, Affordances & Functions, and Physics & Safety.
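To make the balanced coverage concrete, here is a minimal sketch of how a mask metric (e.g., IoU) could be aggregated per concept family; the field names and the per_family_miou helper are illustrative assumptions, not the benchmark's released tooling.

# Hypothetical per-family aggregation of a mask metric over ConvSeg results.
from collections import defaultdict

FAMILIES = [
    "Entities",
    "Spatial & Layout",
    "Relations & Events",
    "Affordances & Functions",
    "Physics & Safety",
]

def per_family_miou(results):
    """results: iterable of dicts with keys 'concept_family' and 'iou'."""
    buckets = defaultdict(list)
    for r in results:
        buckets[r["concept_family"]].append(r["iou"])
    # Report mean IoU separately for each family that has samples.
    return {f: sum(buckets[f]) / len(buckets[f]) for f in FAMILIES if buckets[f]}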
Collecting pixel-accurate masks and realistic, reasoning-rich prompts at scale is prohibitively expensive with human annotators. We introduce a fully automated, VLM-driven data engine that synthesizes high-quality prompt–mask pairs without human supervision. By leveraging high-performing VLMs (such as Gemini-2.5-Flash) in an iterative generate-and-verify loop, we scale training data diversity across all five concept families.
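A minimal sketch of such a generate-and-verify loop is shown below; the callables generate_prompt, propose_mask, and verify_pair stand in for the VLM and segmenter calls, and their names and signatures are assumptions for illustration rather than the engine's actual interface.

# Sketch of an iterative generate-and-verify loop for synthesizing prompt-mask pairs.
from dataclasses import dataclass
from typing import Callable, Optional

import numpy as np


@dataclass
class PromptMaskPair:
    image_path: str
    prompt: str           # reasoning-rich conversational query
    mask: np.ndarray      # binary segmentation mask (H, W)
    concept_family: str   # one of the five ConvSeg concept families


def synthesize_pair(
    image_path: str,
    concept_family: str,
    generate_prompt: Callable[[str, str], str],
    propose_mask: Callable[[str, str], np.ndarray],
    verify_pair: Callable[[str, str, np.ndarray], bool],
    max_rounds: int = 3,
) -> Optional[PromptMaskPair]:
    """Iterate generate -> ground -> verify; keep only pairs the verifier accepts."""
    for _ in range(max_rounds):
        prompt = generate_prompt(image_path, concept_family)  # VLM drafts a query
        mask = propose_mask(image_path, prompt)               # grounder predicts a mask
        if verify_pair(image_path, prompt, mask):             # independent VLM consistency check
            return PromptMaskPair(image_path, prompt, mask, concept_family)
    return None  # no verified pair after max_rounds: discard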
ConvSeg-Net is a single-pass model that grounds conversational concepts into pixels. It avoids complex, multi-turn tool-use workflows, favoring an end-to-end approach. We fuse SAM2 (for strong segmentation priors) with a Qwen2.5-VL-3B prompt encoder (for visual reasoning).
Architecture: The image is processed by SAM2's image encoder. The prompt and image are jointly processed by the Qwen2.5-VL vision-language backbone to produce text embeddings. Lightweight adapters project these embeddings into the SAM2 mask decoder, where they condition the mask generation via cross-attention.
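The PyTorch sketch below illustrates this fusion, assuming the SAM2 encoder/decoder and the Qwen2.5-VL backbone are available as black-box modules; the class name, adapter shape, and dimensions (vlm_dim=2048, prompt_dim=256) are illustrative assumptions, not the released ConvSeg-Net code.

# Minimal fusion sketch: project VLM text embeddings into SAM2's prompt-token space.
import torch.nn as nn


class ConvSegNetSketch(nn.Module):
    def __init__(self, sam_image_encoder, sam_mask_decoder, vlm_backbone,
                 vlm_dim=2048, prompt_dim=256):
        super().__init__()
        self.sam_image_encoder = sam_image_encoder  # SAM2 image encoder (strong segmentation prior)
        self.sam_mask_decoder = sam_mask_decoder    # SAM2 mask decoder; cross-attends to prompt tokens
        self.vlm = vlm_backbone                     # Qwen2.5-VL prompt encoder (visual reasoning)
        # Lightweight adapter: map VLM text embeddings into the decoder's prompt space.
        self.adapter = nn.Sequential(
            nn.Linear(vlm_dim, prompt_dim),
            nn.GELU(),
            nn.Linear(prompt_dim, prompt_dim),
        )

    def forward(self, image, prompt_text):
        image_embed = self.sam_image_encoder(image)   # dense image features
        text_embed = self.vlm(image, prompt_text)     # (B, T, vlm_dim) reasoning tokens
        prompt_tokens = self.adapter(text_embed)      # (B, T, prompt_dim)
        # Mask generation is conditioned on the projected tokens via cross-attention.
        return self.sam_mask_decoder(image_embed, prompt_tokens)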
Curriculum Learning: To handle the complexity of abstract reasoning, we employ a two-phase training curriculum.
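The two phases are not enumerated in this excerpt; the skeleton below is a generic illustration that assumes an easier-to-harder split (concrete entity and spatial prompts first, then the full abstract-reasoning mix), and the phase names, epoch counts, and the train_one_epoch callable are hypothetical.

# Generic two-phase curriculum skeleton over samples with a concept_family attribute
# (e.g., the PromptMaskPair sketch above). Phase definitions are illustrative only.
EASY_FAMILIES = ("Entities", "Spatial & Layout")


def build_curriculum(dataset):
    """Return (phase_name, samples) pairs in training order."""
    easy = [s for s in dataset if s.concept_family in EASY_FAMILIES]
    hard = [s for s in dataset if s.concept_family not in EASY_FAMILIES]
    return [
        ("phase_1_grounding", easy),         # warm up on concrete prompts
        ("phase_2_reasoning", easy + hard),  # then train on the full reasoning-rich mix
    ]


def train(model, dataset, train_one_epoch, epochs_per_phase=(2, 8)):
    for (_, samples), n_epochs in zip(build_curriculum(dataset), epochs_per_phase):
        for _ in range(n_epochs):
            train_one_epoch(model, samples)  # optimizer and loss details omitted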
@article{sahoo2026conversational,
title={Conversational Image Segmentation: Grounding Abstract Concepts with Scalable Supervision},
author={Sahoo, Aadarsh and Gkioxari, Georgia},
journal={arXiv preprint},
year={2026}
}