Open-set image segmentation poses a significant challenge: existing methods often demand extensive training or fine-tuning, and they generally struggle to segment the same object consistently across diverse referring expressions. Motivated by this, we propose Segment Anyword, a novel training-free visual concept prompt learning approach for open-set language-grounded segmentation. It relies on token-level cross-attention maps from a frozen diffusion model to produce segmentation surrogates, or mask prompts, which are then refined into targeted object masks.
Initial prompts typically lack coherence and consistency as image-text complexity increases, resulting in suboptimal mask fragments. To tackle this issue, we further introduce a novel linguistic-guided visual prompt regularization that binds and clusters visual prompts based on sentence dependency and syntactic structural information, enabling the extraction of robust, noise-tolerant mask prompts and significant improvements in segmentation accuracy.
The proposed approach is effective, generalizes across different open-set segmentation tasks, and achieves state-of-the-art results of 52.5 (+6.8 relative) mIoU on Pascal Context 59, 67.73 (+25.73 relative) cIoU on gRefCOCO, and 67.4 (+1.1 relative to fine-tuned methods) mIoU on GranDf — the most complex open-set grounded segmentation benchmark in the field.
Segment Anyword operates entirely at test time: no training or fine-tuning is required, and fewer than 0.1M parameters (the token embeddings) are updated per image-text pair.
Given an image and a text description, Segment Anyword optimizes token-level textual embeddings for each visual concept using the frozen diffusion model's image reconstruction objective. Only the text embeddings are updated — the rest of the network remains entirely frozen.
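The inversion step can be sketched as follows. This is a minimal NumPy illustration under stated assumptions, not the paper's implementation: a frozen random linear map `W` stands in for the frozen diffusion denoiser, the image `x` is a flat vector, and only the token embedding `e` is updated by gradient descent on a reconstruction loss.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins: a frozen "renderer" W maps a token embedding to image space.
# In Segment Anyword this role is played by the frozen diffusion model;
# here it is a fixed random linear map purely for illustration.
d_embed, d_image = 8, 32
W = rng.standard_normal((d_image, d_embed))   # frozen, never updated
W0 = W.copy()                                 # kept to verify W stays frozen
x = rng.standard_normal(d_image)              # target image (flattened)

e = rng.standard_normal(d_embed)              # token embedding: the ONLY learnable part
lr = 1e-3

def recon_loss(e):
    r = W @ e - x
    return float(r @ r)

losses = [recon_loss(e)]
for _ in range(200):
    grad = 2.0 * W.T @ (W @ e - x)   # d/de of ||W e - x||^2
    e -= lr * grad                   # update the embedding only; W is untouched
    losses.append(recon_loss(e))
```

The key design point survives the simplification: the optimization state is tiny (one embedding per visual concept), which is why the method needs no training and updates fewer than 0.1M parameters per image-text pair.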
Averaged cross-attention maps are collected across all denoising time steps. Each map localizes one word token within the image, serving as a mask prompt surrogate that captures where in the image that concept appears.
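Averaging and peak-picking can be sketched in a few lines. The shapes below are assumptions for illustration (50 denoising steps, a 16x16 attention grid), and the synthetic "attended region" stands in for a real token's attention response.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical per-timestep cross-attention maps for ONE word token:
# T denoising steps, each a 16x16 spatial map (resolution is model-dependent).
T, H, W = 50, 16, 16
maps = rng.random((T, H, W))
maps[:, 5:9, 10:14] += 2.0        # pretend the token attends to one region

avg = maps.mean(axis=0)                            # average over all timesteps
avg = (avg - avg.min()) / (avg.max() - avg.min())  # normalize to [0, 1]

# The peak of the averaged map serves as a point "mask prompt" for this token.
peak = np.unravel_index(np.argmax(avg), avg.shape)
```

Averaging over timesteps suppresses the per-step noise of individual attention maps, so the surviving peak is a more reliable localization of the concept.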
Sentence dependency and syntax structures guide two regularizations: positive adjective binding (linking modifiers to their nouns) and negative mutual-exclusive binding (separating independent noun phrases). The refined prompts feed into SAM for precise, noise-tolerant mask generation.
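The two regularizations can be illustrated on a toy dependency parse. The parse below is hand-written for the sketch (in practice it would come from a syntactic parser), and the grouping logic is a simplified stand-in for the paper's regularization, not its actual implementation.

```python
# Toy dependency parse of "a brown dog pulling a red sled":
# (index, token, head_index, relation) tuples, hand-written for illustration.
parse = [
    (0, "a",       2, "det"),
    (1, "brown",   2, "amod"),
    (2, "dog",     2, "root"),
    (3, "pulling", 2, "acl"),
    (4, "a",       6, "det"),
    (5, "red",     6, "amod"),
    (6, "sled",    3, "obj"),
]

# Positive adjective binding: each "amod" modifier joins its noun head's group,
# so "brown" is clustered with "dog" and "red" with "sled".
groups = {i: {i} for i, _, _, rel in parse if rel in ("root", "obj")}  # noun heads
for i, tok, head, rel in parse:
    if rel == "amod":
        groups[head].add(i)

# Negative mutual-exclusive binding: distinct noun-head groups are marked as
# pairs whose visual prompts must stay separated.
exclusive_pairs = [(a, b) for a in groups for b in groups if a < b]
```

The grouped prompts are what get handed to SAM: one prompt cluster per noun phrase, with modifier tokens reinforcing their head noun rather than spawning competing masks.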
Unique capability: Segment Anyword can also segment predicate words such as "holding" or "pulling", linking subject and object entities through human-object interaction. This goes beyond what prior segmentation methods can do.
Most segmentation methods are designed exclusively for nouns and noun phrases. Segment Anyword is the first approach that can also localize abstract predicate words — verbs and gerunds that describe relations between entities.
Why this matters: Predicate words encode the semantic bridge between visual entities. By learning to localize predicates, Segment Anyword can reduce hallucinations in generative models, support relational visual grounding, and enable scientific knowledge discovery from experimental observations or textbook figures — capabilities beyond the reach of any prior segmentation method.
We acknowledge the limitations of Segment Anyword and present representative failure cases to support transparent evaluation.
Segment Anyword inherits a resolution constraint from the cross-attention mechanism of the underlying diffusion model: attention maps are computed at a fixed spatial resolution of 16×16 before upsampling. This limited resolution means that very small or structurally thin objects — such as scissors blades, wire-like structures, or objects occupying only a handful of pixels — cannot be precisely localized by the attention map alone.
When the target concept is tiny, the low-resolution attention map may activate over a broader, semantically related region rather than the exact object boundary. This leads to false positive prompt regions being passed to SAM, causing the final mask to spill into nearby areas or fail to isolate the intended object.
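The resolution constraint is easy to quantify with a small NumPy sketch. This is an illustration of the information loss from block-averaging down to a 16x16 grid, not the model's actual attention computation: a 2-pixel-wide line in a 512x512 image nearly vanishes at attention resolution.

```python
import numpy as np

# A 512x512 image containing a thin, wire-like object: a 2-pixel-wide line.
img = np.zeros((512, 512))
img[100:102, 50:450] = 1.0

# Downsample to a 16x16 grid by block averaging (32x32 blocks), mimicking the
# coarse spatial resolution at which cross-attention maps are computed.
coarse = img.reshape(16, 32, 16, 32).mean(axis=(1, 3))

# The line covers at most 2*32 = 64 of each 32*32 = 1024-pixel cell, so the
# strongest cell activation is only 64/1024 = 0.0625: easily lost to noise.
```

Any prompt extracted from `coarse` can at best say "somewhere in this 32x32-pixel cell", which is why thin or tiny objects produce diffuse prompts that SAM then over-segments.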
Addressing this limitation through a higher-resolution backbone or a hierarchical attention aggregation strategy is a promising direction for future work.
We have a list of interesting projects related to concept learning, prompt tuning, and their applications to novel content generation. You are welcome to check them out.
Multi-Concept Prompt Learning (MCPL) pioneers mask-free text-guided learning for multiple prompts from one scene. Our approach not only enhances current methodologies but also paves the way for novel applications, such as facilitating knowledge discovery through natural language-driven interactions between humans and machines.
We leverage cross-attention maps from a diffusion inversion process to guide open-set grounded segmentation. This inversion helps mitigate the sensitivity to ambiguous text prompts. The resulting cross-attention based visual point prompts are further regularized using linguistic syntax and dependency information.
Lavender (Language-and-Vision fine-tuning with Diffusion Aligner) is a simple supervised fine-tuning (SFT) method that boosts the performance of advanced vision-language models (VLMs) by leveraging state-of-the-art image generation models such as Stable Diffusion.
We present Causal-Adapter, a modular method that tames frozen text-to-image diffusion models for counterfactual image generation. The method enables causal interventions, consistently propagates their effects to dependent attributes, and preserves identity.
@inproceedings{liu2025seganyword,
  author    = {Liu, Zhihua and Saseendran, Amrutha and Tong, Lei and He, Xilin and
               Yousefi, Fariba and Burlutskiy, Nikolay and Oglic, Dino and Diethe, Tom
               and Teare, Philip and Zhou, Huiyu and Jin, Chen},
  title     = {Segment Anyword: Mask Prompt Inversion for Open-Set Grounded Segmentation},
  booktitle = {Proceedings of the 42nd International Conference on Machine Learning},
  year      = {2025},
}