Conversational Image Segmentation:
Grounding Abstract Concepts with Scalable Supervision

California Institute of Technology

Conversational Image Segmentation (CIS) grounds abstract, intent-oriented concepts into pixel-accurate masks, reasoning about affordances, physics, and functional properties.

Abstract

Conversational image segmentation grounds abstract, intent-driven concepts into pixel-accurate masks. Prior work on referring image grounding focuses on categorical and spatial queries (e.g., "left-most apple") and overlooks functional and physical reasoning (e.g., "where can I safely store the knife?").

We address this gap by introducing the task of Conversational Image Segmentation (CIS) and ConvSeg, a benchmark spanning entities, spatial relations, intent, affordances, functions, safety, and physical reasoning. We also present ConvSeg-Net, which fuses strong segmentation priors with language understanding, and an AI-powered data engine that generates prompt–mask pairs without human supervision. We show that ConvSeg-Net achieves significant gains on ConvSeg while maintaining strong performance on existing benchmarks.

Interactive Demo

Select an image from the row below, then choose a conversational prompt to see how ConvSeg-Net reasons.

Qualitative Results

ConvSeg-Net generalizes across diverse reasoning tasks.
Hover over cards to compare. Click to view in high resolution.

The ConvSeg Benchmark

Existing segmentation benchmarks are heavily skewed toward simple entities and spatial relations. To measure progress in grounding abstract concepts, we introduce ConvSeg, a benchmark featuring balanced coverage across five concept families: Entities, Spatial & Layout, Relations & Events, Affordances & Functions, and Physics & Safety.

Concept Coverage in Benchmarks
Distribution of concepts across existing benchmarks versus ConvSeg. Prior datasets primarily focus on entities/spatial relations, whereas ConvSeg offers near-uniform coverage.
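For concreteness, a single ConvSeg-style record could look like the sketch below. The field names and mask encoding are illustrative assumptions, not the released annotation schema.

sample = {
    "image": "images/kitchen_0042.jpg",              # hypothetical file name
    "prompt": "Where can I safely store the knife?",
    "concept_family": "Physics & Safety",            # one of the five concept families
    "mask": {"format": "rle", "counts": "...", "size": [480, 640]},  # binary mask encoding
}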

The Conversational Data Engine

Collecting pixel-accurate masks and realistic, reasoning-rich prompts at scale is prohibitively expensive with human annotators. We introduce a fully automated, VLM-driven data engine that synthesizes high-quality prompt–mask pairs without human supervision. By leveraging high-performing VLMs (such as Gemini-2.5-Flash) in an iterative generate-and-verify loop, we scale training data diversity across the five concept families.
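A minimal sketch of such a generate-and-verify loop is shown below. The helper callables (propose_prompt, propose_mask, verify_pair) and the retry budget are assumptions for exposition, not the engine's actual interface.

from dataclasses import dataclass
from typing import Callable, List

CONCEPT_FAMILIES = ["Entities", "Spatial & Layout", "Relations & Events",
                    "Affordances & Functions", "Physics & Safety"]

@dataclass
class Sample:
    image_path: str
    prompt: str       # conversational query, e.g. "Where can I safely store the knife?"
    mask: object      # binary mask (e.g., an RLE string or a numpy array)
    concept: str      # one of the five concept families

def build_samples(image_paths: List[str],
                  propose_prompt: Callable,   # VLM call: (image, concept, feedback) -> prompt
                  propose_mask: Callable,     # grounder call: (image, prompt) -> mask
                  verify_pair: Callable,      # VLM judge: (image, prompt, mask) -> (ok, feedback)
                  max_rounds: int = 3) -> List[Sample]:
    """Iterative generate-and-verify loop: keep only prompt-mask pairs the verifier accepts."""
    accepted = []
    for path in image_paths:
        for concept in CONCEPT_FAMILIES:
            feedback = None
            for _ in range(max_rounds):
                prompt = propose_prompt(path, concept, feedback)
                mask = propose_mask(path, prompt)
                ok, feedback = verify_pair(path, prompt, mask)
                if ok:
                    accepted.append(Sample(path, prompt, mask, concept))
                    break  # accepted; rejected rounds feed their feedback into the next proposal
    return accepted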

Model Architecture & Training

ConvSeg-Net is a single-pass model that grounds conversational concepts into pixels. It avoids complex, multi-turn tool-use workflows, favoring an end-to-end approach. We fuse SAM2 (for strong segmentation priors) with a Qwen2.5-VL-3B prompt encoder (for visual reasoning).

Model Architecture Diagram

Architecture: The image is processed by SAM2's image encoder. The prompt and image are jointly processed by the Qwen2.5-VL vision-language backbone to produce text embeddings. Lightweight adapters project these embeddings into the SAM2 mask decoder, where they condition the mask generation via cross-attention.
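Below is a schematic, PyTorch-style sketch of this single-pass flow. The module boundaries, dimensions, and number of prompt tokens are illustrative assumptions; the placeholders stand in for the actual SAM2 and Qwen2.5-VL components rather than reproducing their interfaces.

import torch.nn as nn

class ConvSegNetSketch(nn.Module):
    """Single-pass fusion of a SAM2-style decoder with a VLM prompt encoder (illustrative only)."""

    def __init__(self, image_encoder, mask_decoder, vlm_backbone,
                 vlm_dim=2048, prompt_dim=256, num_prompt_tokens=8):
        super().__init__()
        self.image_encoder = image_encoder   # SAM2-style encoder: strong segmentation prior
        self.mask_decoder = mask_decoder     # SAM2-style decoder: cross-attends over prompt tokens
        self.vlm = vlm_backbone              # Qwen2.5-VL-style backbone: joint image+text reasoning
        self.num_prompt_tokens = num_prompt_tokens
        # Lightweight adapter: project VLM hidden states into the decoder's prompt-embedding space.
        self.adapter = nn.Sequential(
            nn.Linear(vlm_dim, prompt_dim),
            nn.GELU(),
            nn.Linear(prompt_dim, prompt_dim),
        )

    def forward(self, image, prompt_text):
        image_embeds = self.image_encoder(image)                # (B, C, H', W') segmentation features
        vlm_hidden = self.vlm(image, prompt_text)               # (B, T, vlm_dim) joint reasoning states
        last_tokens = vlm_hidden[:, -self.num_prompt_tokens:]   # compact summary of the prompt
        prompt_tokens = self.adapter(last_tokens)               # (B, K, prompt_dim)
        return self.mask_decoder(image_embeds, prompt_tokens)   # (B, 1, H, W) predicted mask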

Curriculum Learning: To handle the complexity of abstract reasoning, we employ a two-phase training curriculum (see the sketch after this list):

  1. Phase 1: Literal & Basic Concepts. The model first learns to segment literal concepts and basic referring expressions using standard datasets (COCO, RefCOCO).
  2. Phase 2: Conversational Concepts. We then train on our generated conversational data to learn abstract reasoning (affordances, physics), while mixing in foundational data to preserve general segmentation capabilities.
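A minimal sketch of this schedule follows; the epoch counts, the replay ratio, and the train_epoch helper supplied by the caller are assumptions for illustration, not the reported training recipe.

import random

def run_curriculum(model, literal_data, conversational_data, train_epoch,
                   phase1_epochs=10, phase2_epochs=10, replay_ratio=0.25):
    """Two-phase curriculum: literal grounding first, then conversational concepts
    mixed with a fraction of foundational data to preserve general segmentation."""
    # Phase 1: literal concepts and basic referring expressions (e.g., COCO, RefCOCO).
    for _ in range(phase1_epochs):
        train_epoch(model, literal_data)

    # Phase 2: generated conversational data, with literal samples replayed at a fixed ratio.
    for _ in range(phase2_epochs):
        n_replay = min(int(replay_ratio * len(conversational_data)), len(literal_data))
        mixed = list(conversational_data) + random.sample(list(literal_data), n_replay)
        random.shuffle(mixed)
        train_epoch(model, mixed)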

BibTeX

@article{sahoo2026conversational,
  title={Conversational Image Segmentation: Grounding Abstract Concepts with Scalable Supervision}, 
  author={Sahoo, Aadarsh and Gkioxari, Georgia},
  journal={arXiv preprint},
  year={2026}
}