What Do You Really Want? - Open-Vocabulary Intention-Guided Object Detection Meets Vision Language Models

Anonymous Author(s)
Anonymous Organization
ICRA 2026 (Under Review)

Fig. 1. Open-Vocabulary Intention-Guided Object Detection task overview. Users provide natural language instructions describing their intentions, and our model localizes objects in indoor scenes that can fulfill these intentions, bridging the gap between human intentions and object affordances.

Abstract

Understanding and responding to human intentions in visual environments is a crucial capability for embodied agents to provide intuitive assistance. We introduce Open-Vocabulary Intention-Guided Object Detection (OV-IGOD), a novel task requiring detection of objects that fulfill free-form human intentions. This task presents unique challenges beyond traditional object detection or visual grounding, as it demands understanding implicit needs, reasoning about object affordances, and precisely localizing relevant targets.

To advance research in this direction, we introduce OV-IGOD Bench, a novel benchmark featuring 9.3K images and 21.5K diverse intention annotations that preserve contextual relevance, spatial precision, and natural language variation. Our data construction pipeline employs a multi-stage approach combining detail caption generation by InternVL3-8B, intention prompting with GPT-4o, and quality review.

Leveraging this benchmark, we develop PF-Florence, which enhances the Florence-2 vision-language model with our proposed Prompted Feature-wise Linear Modulation (P-FiLM) mechanism. P-FiLM addresses limitations in conventional modulation approaches by incorporating learnable queries that selectively extract and utilize textual information for visual feature conditioning. Our method significantly outperforms existing approaches on standard object detection metrics. Real-world experiments further validate the practical applicability of our approach in zero-shot transfer settings.

OV-IGOD Benchmark Construction


Fig. 2. Our multi-stage data construction pipeline for OV-IGOD dataset. The process consists of three key stages: detail caption generation, intention generation, and quality checking, ultimately producing triplets of ⟨image, intention, bounding box⟩.

We build upon the SUN RGB-D dataset and construct OV-IGOD Bench through a novel multi-stage pipeline (a schematic sketch follows the list below):

  • Detail Caption Generation: Using InternVL3-8B to generate comprehensive scene descriptions capturing fine-grained visual details and spatial relationships.
  • Intention Generation: Employing GPT-4o with specialized prompting to create authentic human intentions that reflect natural language expressions of needs without explicitly naming target objects.
  • Quality Checking: Implementing rigorous validation across scene-context alignment, ambiguity assessment, and format diversity to ensure high annotation quality.
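
The sketch below illustrates how these three stages compose into ⟨image, intention, bounding box⟩ triplets. The helper functions (generate_caption, generate_intentions, passes_quality_check) are hypothetical stand-ins for the InternVL3-8B captioning, GPT-4o prompting, and quality review described above, not the actual implementation.

    # Hypothetical sketch of the three-stage construction pipeline described above.
    # generate_caption, generate_intentions, and passes_quality_check stand in for
    # InternVL3-8B captioning, GPT-4o prompting, and the quality review, respectively.
    def build_ovigod_samples(images_with_boxes, generate_caption,
                             generate_intentions, passes_quality_check):
        """Yield <image, intention, bounding box> triplets."""
        for image, objects in images_with_boxes:          # objects: {name: bbox} from SUN RGB-D
            caption = generate_caption(image)             # Stage 1: detailed scene caption
            for name, bbox in objects.items():
                # Stage 2: intentions that imply the target object without naming it.
                for intention in generate_intentions(caption, target=name):
                    # Stage 3: scene-context, ambiguity, and format-diversity checks.
                    if passes_quality_check(image, intention, bbox):
                        yield image, intention, bbox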

The resulting dataset contains 9.3K images with 21.5K intention annotations, averaging 2.31 intentions per image, with substantial diversity in linguistic structure (intention lengths of 9-24 words, mean 14.52).

PF-Florence: Prompted Feature-wise Linear Modulation


Fig. 3. Architecture overview of our PF-Florence model. (a) Overall design showing the sequence-to-sequence framework with tokenizer, vision encoder enhanced with P-FiLM layers, and transformer decoders for intention-guided object detection. (b) Integration of P-FiLM layers within the DaViT vision encoder. (c) Detailed structure of the P-FiLM layer with Prompted Generation Module (PGM).

Key Innovation: P-FiLM Mechanism

Our Prompted Feature-wise Linear Modulation (P-FiLM) addresses the limitations of conventional FiLM approaches through two components:

Prompted Generation Module (PGM)

  • Dynamically generates input-conditioned prompts from learnable components
  • Uses global average pooling and learnable queries to extract relevant textual information
  • Enables adaptive focus on different aspects of intention text based on visual context

Feature Modulation

  • Generates scaling (γ) and shift (β) parameters from prompt features
  • Modulates visual features through element-wise operations
  • Strategically integrated between the spatial and channel attention blocks of the DaViT encoder

Unlike conventional FiLM, which conditions on a static text representation, P-FiLM actively queries relevant information from the intention text, enabling better understanding of complex free-form intentions and superior performance in intention-guided object detection.
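
To make the mechanism concrete, below is a minimal PyTorch sketch of how a P-FiLM layer could be wired from the description above. Tensor shapes, the cross-attention used by the learnable queries, and module names are our assumptions for illustration, not the released implementation.

    # Minimal PyTorch sketch of a P-FiLM layer, based on the description above.
    # Shapes, the cross-attention used by the learnable queries, and module names
    # are assumptions, not the authors' code.
    import torch
    import torch.nn as nn

    class PromptedGenerationModule(nn.Module):
        """Generates input-conditioned prompt features from the intention text,
        using learnable queries conditioned on a pooled visual summary."""
        def __init__(self, dim, num_queries=4):
            super().__init__()
            self.queries = nn.Parameter(torch.randn(num_queries, dim) * 0.02)
            self.cross_attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
            self.proj = nn.Linear(dim, dim)

        def forward(self, visual, text):
            # visual: (B, N_v, C) patch tokens, text: (B, N_t, C) intention tokens
            b = visual.size(0)
            vis_summary = visual.mean(dim=1, keepdim=True)                  # global average pooling
            q = self.queries.unsqueeze(0).expand(b, -1, -1) + vis_summary   # condition queries on the image
            prompt, _ = self.cross_attn(q, text, text)                      # queries attend to the intention text
            return self.proj(prompt).mean(dim=1)                            # (B, C) prompt feature

    class PFiLMLayer(nn.Module):
        """FiLM-style modulation whose scale/shift come from the prompted feature."""
        def __init__(self, dim):
            super().__init__()
            self.pgm = PromptedGenerationModule(dim)
            self.to_gamma_beta = nn.Linear(dim, 2 * dim)

        def forward(self, visual, text):
            prompt = self.pgm(visual, text)                                 # (B, C)
            gamma, beta = self.to_gamma_beta(prompt).chunk(2, dim=-1)
            # Element-wise scale and shift of the visual tokens.
            return visual * (1 + gamma.unsqueeze(1)) + beta.unsqueeze(1)

    # Usage: one such layer would sit between the spatial and channel attention
    # blocks of a DaViT stage (the integration point described above).
    layer = PFiLMLayer(dim=256)
    vis = torch.randn(2, 196, 256)   # hypothetical patch tokens
    txt = torch.randn(2, 20, 256)    # hypothetical intention-text tokens
    out = layer(vis, txt)            # (2, 196, 256)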

Experimental Results

Quantitative Results

Our PF-Florence model achieves state-of-the-art performance on the OV-IGOD benchmark, significantly outperforming existing approaches:

Method                               mAP50:95   mAP50   mAP75
Direct Prediction Methods
  Qwen2.5-VL-7B                          3.94    6.88    3.12
  GPT-4V                                 5.31   10.47    6.28
Two-Stage Pipeline Methods
  Cambrian-1-8B + Grounding DINO        38.88   48.12   41.78
  Phi-4-multimodal + Grounding DINO     40.03   49.13   43.29
  Qwen2.5-VL-7B + Grounding DINO        39.92   49.30   42.18
  InternVL3-8B + Grounding DINO         37.61   46.65   40.61
  GPT-4V + Grounding DINO               40.75   49.21   42.72
PF-Florence (Ours)                      52.91   66.97   56.47

Compared with the strongest baseline (GPT-4V + Grounding DINO), our method improves mAP50:95 by +12.16 (a 30% relative gain), mAP50 by +17.76 (36%), and mAP75 by +13.75 (32%).
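
For reference, these deltas and relative gains follow directly from the table above:

    # Gains over GPT-4V + Grounding DINO, computed from the table above.
    ours     = {"mAP50:95": 52.91, "mAP50": 66.97, "mAP75": 56.47}
    baseline = {"mAP50:95": 40.75, "mAP50": 49.21, "mAP75": 42.72}
    for k in ours:
        delta = ours[k] - baseline[k]
        print(f"{k}: +{delta:.2f} ({delta / baseline[k]:.0%} relative)")
    # -> +12.16 (30%), +17.76 (36%), +13.75 (32%)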

Qualitative Results


Fig. 4. Qualitative comparison on OV-IGOD test cases. Our method demonstrates superior intention understanding and precise localization compared to baseline approaches. Green boxes indicate ground truth annotations, while red boxes show model predictions.

Our qualitative results show that PF-Florence consistently identifies objects that truly align with user intentions, while baseline methods often miss targets or identify contextually inappropriate alternatives. The P-FiLM mechanism enables precise understanding of complex free-form intentions and accurate object localization.

Real-World Experiments

To further validate the practical applicability of our approach, we conduct real-world experiments using naturally captured scenarios that reflect typical user intentions in everyday environments. Due to the limited sample size in real-world settings, we evaluate performance using Intersection over Union (IoU) as the primary metric, comparing our method against representative baselines from both single-stage and two-stage approaches.
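
The IoU we report is the standard box overlap ratio; a minimal reference implementation is shown below, assuming (x1, y1, x2, y2) box coordinates (the exact box format used in the real-world protocol is an assumption here).

    # Standard IoU between two axis-aligned boxes in (x1, y1, x2, y2) format.
    def box_iou(a, b):
        ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
        ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
        inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
        area_a = (a[2] - a[0]) * (a[3] - a[1])
        area_b = (b[2] - b[0]) * (b[3] - b[1])
        return inter / (area_a + area_b - inter + 1e-9)

    print(box_iou((10, 10, 110, 110), (30, 30, 130, 130)))  # ~0.47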

Quantitative Results

Our PF-Florence achieves superior performance in real-world scenarios, demonstrating strong practical applicability:

Method                      IoU (%)
GPT-4V                        9.35
GPT-4V + Grounding DINO      74.98
PF-Florence (Ours)           81.97

Our method achieves 81.97% IoU, substantially outperforming GPT-4V (9.35%) and GPT-4V + Grounding DINO (74.98%), a +6.99-point improvement over the best baseline.

Qualitative Analysis


Fig. 5. Qualitative comparison on real-world test cases. Green boxes indicate ground truth, red boxes show predictions.

Figure 5 demonstrates the practical advantages of our approach. Single-stage methods like GPT-4V fail to provide precise localization despite understanding textual intentions. Two-stage pipeline methods achieve better localization but suffer from intention-object misalignment—for example, when seeking "something to snack on", they incorrectly detect irrelevant objects like tennis balls alongside the intended snack. Our PF-Florence consistently identifies objects that truly align with user intentions while avoiding contextually irrelevant detections.

BibTeX

Coming Soon...

BibTeX citation will be available after publication.