Open-set image segmentation poses a significant challenge: existing methods often demand extensive training or fine-tuning, and they generally struggle to segment the same object consistently across diverse referring expressions. Motivated by this, we propose Segment Anyword, a novel training-free visual concept prompt learning approach for open-set language-grounded segmentation. It relies on token-level cross-attention maps from a frozen diffusion model to produce segmentation surrogates, or mask prompts, which are then refined into targeted object masks.
Initial prompts typically lack coherence and consistency as image-text complexity increases, resulting in suboptimal mask fragments. To tackle this issue, we further introduce a novel linguistic-guided visual prompt regularization that binds and clusters visual prompts based on sentence dependency and syntactic structural information, enabling the extraction of robust, noise-tolerant mask prompts and significant improvements in segmentation accuracy.
The proposed approach is effective, generalizes across different open-set segmentation tasks, and achieves state-of-the-art results of 52.5 (+6.8 relative) mIoU on Pascal Context 59, 67.73 (+25.73 relative) cIoU on gRefCOCO, and 67.4 (+1.1 relative to fine-tuned methods) mIoU on GranDf — the most complex open-set grounded segmentation benchmark in the field.
Segment Anyword operates entirely at test time: no training or fine-tuning is required, and fewer than 0.1M parameters (the token embeddings) are updated per image-text pair.
Given an image and a text description, Segment Anyword optimizes token-level textual embeddings for each visual concept using the frozen diffusion model's image reconstruction objective. Only the text embeddings are updated — the rest of the network remains entirely frozen.
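The inversion step can be sketched as follows. This is a minimal NumPy illustration under stated assumptions, not the paper's implementation: a frozen random linear map `W` stands in for the frozen diffusion denoiser, the image `x` is a flat vector, and only the token embedding `e` is updated by gradient descent on a reconstruction loss.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins: a frozen "renderer" W maps a token embedding to image space.
# In Segment Anyword this role is played by the frozen diffusion model;
# here it is a fixed random linear map purely for illustration.
d_embed, d_image = 8, 32
W = rng.standard_normal((d_image, d_embed))   # frozen, never updated
W0 = W.copy()                                 # kept to verify W stays frozen
x = rng.standard_normal(d_image)              # target image (flattened)

e = rng.standard_normal(d_embed)              # token embedding: the ONLY learnable part
lr = 1e-3

def recon_loss(e):
    r = W @ e - x
    return float(r @ r)

losses = [recon_loss(e)]
for _ in range(200):
    grad = 2.0 * W.T @ (W @ e - x)   # d/de of ||W e - x||^2
    e -= lr * grad                   # update the embedding only; W is untouched
    losses.append(recon_loss(e))
```

The key design point survives the simplification: the optimization state is tiny (one embedding per visual concept), which is why the method needs no training and updates fewer than 0.1M parameters per image-text pair.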
Averaged cross-attention maps are collected across all denoising time steps. Each map localizes one word token within the image, serving as a mask prompt surrogate that captures where in the image that concept appears.
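Averaging and peak-picking can be sketched in a few lines. The shapes below are assumptions for illustration (50 denoising steps, a 16x16 attention grid), and the synthetic "attended region" stands in for a real token's attention response.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical per-timestep cross-attention maps for ONE word token:
# T denoising steps, each a 16x16 spatial map (resolution is model-dependent).
T, H, W = 50, 16, 16
maps = rng.random((T, H, W))
maps[:, 5:9, 10:14] += 2.0        # pretend the token attends to one region

avg = maps.mean(axis=0)                            # average over all timesteps
avg = (avg - avg.min()) / (avg.max() - avg.min())  # normalize to [0, 1]

# The peak of the averaged map serves as a point "mask prompt" for this token.
peak = np.unravel_index(np.argmax(avg), avg.shape)
```

Averaging over timesteps suppresses the per-step noise of individual attention maps, so the surviving peak is a more reliable localization of the concept.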
Sentence dependency and syntax structures guide two regularizations: positive adjective binding (linking modifiers to their nouns) and negative mutual-exclusive binding (separating independent noun phrases). The refined prompts feed into SAM for precise, noise-tolerant mask generation.
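The two regularizations can be illustrated on a toy dependency parse. The parse below is hand-written for the sketch (in practice it would come from a syntactic parser), and the grouping logic is a simplified stand-in for the paper's regularization, not its actual implementation.

```python
# Toy dependency parse of "a brown dog pulling a red sled":
# (index, token, head_index, relation) tuples, hand-written for illustration.
parse = [
    (0, "a",       2, "det"),
    (1, "brown",   2, "amod"),
    (2, "dog",     2, "root"),
    (3, "pulling", 2, "acl"),
    (4, "a",       6, "det"),
    (5, "red",     6, "amod"),
    (6, "sled",    3, "obj"),
]

# Positive adjective binding: each "amod" modifier joins its noun head's group,
# so "brown" is clustered with "dog" and "red" with "sled".
groups = {i: {i} for i, _, _, rel in parse if rel in ("root", "obj")}  # noun heads
for i, tok, head, rel in parse:
    if rel == "amod":
        groups[head].add(i)

# Negative mutual-exclusive binding: distinct noun-head groups are marked as
# pairs whose visual prompts must stay separated.
exclusive_pairs = [(a, b) for a in groups for b in groups if a < b]
```

The grouped prompts are what get handed to SAM: one prompt cluster per noun phrase, with modifier tokens reinforcing their head noun rather than spawning competing masks.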
Unique capability: Segment Anyword can also segment predicate words such as "holding" or "pulling", linking subject and object entities through human-object interaction. This goes beyond what prior segmentation methods can do.
Most segmentation methods are designed exclusively for nouns and noun phrases. Segment Anyword is the first approach that can also localize abstract predicate words — verbs and gerunds that describe relations between entities.
Why this matters: Predicate words encode the semantic bridge between visual entities. By learning to localize predicates, Segment Anyword can reduce hallucinations in generative models, support relational visual grounding, and enable scientific knowledge discovery from experimental observations or textbook figures — capabilities beyond the reach of any prior segmentation method.
We acknowledge the limitations of Segment Anyword and present representative failure cases to support transparent evaluation.
Segment Anyword inherits a resolution constraint from the cross-attention mechanism of the underlying diffusion model: attention maps are computed at a fixed spatial resolution of 16×16 before upsampling. This limited resolution means that very small or structurally thin objects — such as scissors blades, wire-like structures, or objects occupying only a handful of pixels — cannot be precisely localized by the attention map alone.
When the target concept is tiny, the low-resolution attention map may activate over a broader, semantically related region rather than the exact object boundary. This leads to false positive prompt regions being passed to SAM, causing the final mask to spill into nearby areas or fail to isolate the intended object.
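The resolution constraint is easy to quantify with a small NumPy sketch. This is an illustration of the information loss from block-averaging down to a 16x16 grid, not the model's actual attention computation: a 2-pixel-wide line in a 512x512 image nearly vanishes at attention resolution.

```python
import numpy as np

# A 512x512 image containing a thin, wire-like object: a 2-pixel-wide line.
img = np.zeros((512, 512))
img[100:102, 50:450] = 1.0

# Downsample to a 16x16 grid by block averaging (32x32 blocks), mimicking the
# coarse spatial resolution at which cross-attention maps are computed.
coarse = img.reshape(16, 32, 16, 32).mean(axis=(1, 3))

# The line covers at most 2*32 = 64 of each 32*32 = 1024-pixel cell, so the
# strongest cell activation is only 64/1024 = 0.0625: easily lost to noise.
```

Any prompt extracted from `coarse` can at best say "somewhere in this 32x32-pixel cell", which is why thin or tiny objects produce diffuse prompts that SAM then over-segments.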
Addressing this limitation through a higher-resolution backbone or a hierarchical attention aggregation strategy is a promising direction for future work.
We have a list of interesting projects related to concept learning, prompt tuning, and their applications to novel content generation. You are welcome to check them out.
Multi-Concept Prompt Learning (MCPL) pioneers mask-free text-guided learning for multiple prompts from one scene. Our approach not only enhances current methodologies but also paves the way for novel applications, such as facilitating knowledge discovery through natural language-driven interactions between humans and machines.
We leverage cross-attention maps from a diffusion inversion process to guide open-set grounded segmentation. This inversion helps mitigate the sensitivity to ambiguous text prompts. The resulting cross-attention based visual point prompts are further regularized using linguistic syntax and dependency information.
Lavender (Language-and-Vision fine-tuning with Diffusion Aligner) is a simple supervised fine-tuning (SFT) method that boosts the performance of advanced vision-language models (VLMs) by leveraging state-of-the-art image generation models such as Stable Diffusion.
We present Causal-Adapter, a modular method that tames frozen text-to-image diffusion models for counterfactual image generation. The method enables causal interventions, consistently propagates their effects to dependent attributes, and preserves identity.
@inproceedings{liu2025seganyword,
  author    = {Liu, Zhihua and Saseendran, Amrutha and Tong, Lei and He, Xilin and
               Yousefi, Fariba and Burlutskiy, Nikolay and Oglic, Dino and Diethe, Tom
               and Teare, Philip and Zhou, Huiyu and Jin, Chen},
  title     = {Segment Anyword: Mask Prompt Inversion for Open-Set Grounded Segmentation},
  booktitle = {Proceedings of the 42nd International Conference on Machine Learning},
  year      = {2025},
}