Conversational Image Segmentation:
Grounding Abstract Concepts with Scalable Supervision

California Institute of Technology

Conversational Image Segmentation (CIS) grounds abstract, intent-oriented concepts into pixel-accurate masks, reasoning about affordances, physics, and functional properties.

Abstract

Conversational image segmentation grounds abstract, intent-driven concepts into pixel-accurate masks. Prior work on referring image grounding focuses on categorical and spatial queries (e.g., "left-most apple") and overlooks functional and physical reasoning (e.g., "where can I safely store the knife?").

We address this gap by introducing the task of Conversational Image Segmentation (CIS) and ConvSeg, a benchmark spanning entities, spatial relations, intent, affordances, functions, safety, and physical reasoning. We also present ConvSeg-Net, which fuses strong segmentation priors with language understanding, and an AI-powered data engine that generates prompt–mask pairs without human supervision. We show that ConvSeg-Net achieves significant gains on ConvSeg while maintaining strong performance on existing benchmarks.

Interactive Demo

Select an image from the row below, then choose a conversational prompt to see how ConvSeg-Net reasons.

Qualitative Results

ConvSeg-Net generalizes across diverse reasoning tasks.
Hover over cards to compare. Click to view in high resolution.

The ConvSeg Benchmark

Existing segmentation benchmarks are heavily skewed toward simple entities and spatial relations. To measure progress in grounding abstract concepts, we introduce ConvSeg, a benchmark featuring balanced coverage across five concept families: Entities, Spatial & Layout, Relations & Events, Affordances & Functions, and Physics & Safety.

Concept Coverage in Benchmarks
Distribution of concepts across existing benchmarks versus ConvSeg. Prior datasets primarily focus on entities/spatial relations, whereas ConvSeg offers near-uniform coverage.
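For concreteness, a single ConvSeg-style record could look like the sketch below. The field names and mask encoding are illustrative assumptions, not the released annotation schema.

sample = {
    "image": "images/kitchen_0042.jpg",              # hypothetical file name
    "prompt": "Where can I safely store the knife?",
    "concept_family": "Physics & Safety",            # one of the five concept families
    "mask": {"format": "rle", "counts": "...", "size": [480, 640]},  # binary mask encoding
}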

The Conversational Data Engine

Collecting pixel-accurate masks and realistic, reasoning-rich prompts at scale is prohibitively expensive with human annotators. We introduce a fully automated, VLM-driven data engine that synthesizes high-quality prompt–mask pairs without human supervision. By leveraging high-performing VLMs (such as Gemini-2.5-Flash) in an iterative generate-and-verify loop, we scale training data diversity across the five concept families.
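A minimal sketch of such a generate-and-verify loop is shown below. The helper callables (propose_prompt, propose_mask, verify_pair) and the retry budget are assumptions for exposition, not the engine's actual interface.

from dataclasses import dataclass
from typing import Callable, List

CONCEPT_FAMILIES = ["Entities", "Spatial & Layout", "Relations & Events",
                    "Affordances & Functions", "Physics & Safety"]

@dataclass
class Sample:
    image_path: str
    prompt: str       # conversational query, e.g. "Where can I safely store the knife?"
    mask: object      # binary mask (e.g., an RLE string or a numpy array)
    concept: str      # one of the five concept families

def build_samples(image_paths: List[str],
                  propose_prompt: Callable,   # VLM call: (image, concept, feedback) -> prompt
                  propose_mask: Callable,     # grounder call: (image, prompt) -> mask
                  verify_pair: Callable,      # VLM judge: (image, prompt, mask) -> (ok, feedback)
                  max_rounds: int = 3) -> List[Sample]:
    """Iterative generate-and-verify loop: keep only prompt-mask pairs the verifier accepts."""
    accepted = []
    for path in image_paths:
        for concept in CONCEPT_FAMILIES:
            feedback = None
            for _ in range(max_rounds):
                prompt = propose_prompt(path, concept, feedback)
                mask = propose_mask(path, prompt)
                ok, feedback = verify_pair(path, prompt, mask)
                if ok:
                    accepted.append(Sample(path, prompt, mask, concept))
                    break  # accepted; rejected rounds feed their feedback into the next proposal
    return accepted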

Model Architecture & Training

ConvSeg-Net is a single-pass model that grounds conversational concepts into pixels. It avoids complex, multi-turn tool-use workflows, favoring an end-to-end approach. We fuse SAM2 (for strong segmentation priors) with a Qwen2.5-VL-3B prompt encoder (for visual reasoning).

Model Architecture Diagram

Architecture: The image is processed by SAM2's image encoder. The prompt and image are jointly processed by the Qwen2.5-VL vision-language backbone to produce text embeddings. Lightweight adapters project these embeddings into the SAM2 mask decoder, where they condition the mask generation via cross-attention.
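Below is a schematic, PyTorch-style sketch of this single-pass flow. The module boundaries, dimensions, and number of prompt tokens are illustrative assumptions; the placeholders stand in for the actual SAM2 and Qwen2.5-VL components rather than reproducing their interfaces.

import torch.nn as nn

class ConvSegNetSketch(nn.Module):
    """Single-pass fusion of a SAM2-style decoder with a VLM prompt encoder (illustrative only)."""

    def __init__(self, image_encoder, mask_decoder, vlm_backbone,
                 vlm_dim=2048, prompt_dim=256, num_prompt_tokens=8):
        super().__init__()
        self.image_encoder = image_encoder   # SAM2-style encoder: strong segmentation prior
        self.mask_decoder = mask_decoder     # SAM2-style decoder: cross-attends over prompt tokens
        self.vlm = vlm_backbone              # Qwen2.5-VL-style backbone: joint image+text reasoning
        self.num_prompt_tokens = num_prompt_tokens
        # Lightweight adapter: project VLM hidden states into the decoder's prompt-embedding space.
        self.adapter = nn.Sequential(
            nn.Linear(vlm_dim, prompt_dim),
            nn.GELU(),
            nn.Linear(prompt_dim, prompt_dim),
        )

    def forward(self, image, prompt_text):
        image_embeds = self.image_encoder(image)                # (B, C, H', W') segmentation features
        vlm_hidden = self.vlm(image, prompt_text)               # (B, T, vlm_dim) joint reasoning states
        last_tokens = vlm_hidden[:, -self.num_prompt_tokens:]   # compact summary of the prompt
        prompt_tokens = self.adapter(last_tokens)               # (B, K, prompt_dim)
        return self.mask_decoder(image_embeds, prompt_tokens)   # (B, 1, H, W) predicted mask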

Curriculum Learning: To handle the complexity of abstract reasoning, we employ a two-phase training curriculum (see the sketch after this list):

  1. Phase 1: Literal & Basic Concepts. The model first learns to segment literal concepts and basic referring expressions using standard datasets (COCO, RefCOCO).
  2. Phase 2: Conversational Concepts. We then train on our generated conversational data to learn abstract reasoning (affordances, physics), while mixing in foundational data to preserve general segmentation capabilities.
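A minimal sketch of this schedule follows; the epoch counts, the replay ratio, and the train_epoch helper supplied by the caller are assumptions for illustration, not the reported training recipe.

import random

def run_curriculum(model, literal_data, conversational_data, train_epoch,
                   phase1_epochs=10, phase2_epochs=10, replay_ratio=0.25):
    """Two-phase curriculum: literal grounding first, then conversational concepts
    mixed with a fraction of foundational data to preserve general segmentation."""
    # Phase 1: literal concepts and basic referring expressions (e.g., COCO, RefCOCO).
    for _ in range(phase1_epochs):
        train_epoch(model, literal_data)

    # Phase 2: generated conversational data, with literal samples replayed at a fixed ratio.
    for _ in range(phase2_epochs):
        n_replay = min(int(replay_ratio * len(conversational_data)), len(literal_data))
        mixed = list(conversational_data) + random.sample(list(literal_data), n_replay)
        random.shuffle(mixed)
        train_epoch(model, mixed)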

BibTeX

@article{sahoo2026conversational,
  title={Conversational Image Segmentation: Grounding Abstract Concepts with Scalable Supervision}, 
  author={Sahoo, Aadarsh and Gkioxari, Georgia},
  journal={arXiv preprint},
  year={2026}
}