Understanding and responding to human intentions in visual environments is a crucial capability for embodied agents to provide intuitive assistance. We introduce Open-Vocabulary Intention-Guided Object Detection (OV-IGOD), a novel task requiring detection of objects that fulfill free-form human intentions. This task presents unique challenges beyond traditional object detection or visual grounding, as it demands understanding implicit needs, reasoning about object affordances, and precisely localizing relevant targets.
To advance research in this direction, we introduce OV-IGOD Bench, a novel benchmark featuring 9.3k images and 21.5k diverse intention annotations that preserve contextual relevance, spatial precision, and natural language variation. Our data construction pipeline employs a multi-stage approach combining detail caption generation with InternVL3-8B, intention generation with GPT-4o, and quality checking.
Leveraging this benchmark, we develop PF-Florence, which enhances the Florence-2 vision-language model with our proposed Prompted Feature-wise Linear Modulation (P-FiLM) mechanism. P-FiLM addresses limitations in conventional modulation approaches by incorporating learnable queries that selectively extract and utilize textual information for visual feature conditioning. Our method significantly outperforms existing approaches on standard object detection metrics. Real-world experiments further validate the practical applicability of our approach in zero-shot transfer settings.
Fig. 2. Our multi-stage data construction pipeline for OV-IGOD dataset. The process consists of three key stages: detail caption generation, intention generation, and quality checking, ultimately producing triplets of ⟨image, intention, bounding box⟩.
We build upon the SUN-RGBD dataset and construct OV-IGOD Bench through a multi-stage pipeline (Fig. 2) consisting of three stages: detail caption generation with InternVL3-8B, intention generation with GPT-4o, and quality checking.
The resulting dataset contains 9.3K images with 21.5K intention annotations, averaging 2.31 intentions per image with significant diversity in linguistic structure (9-24 words, mean 14.52).
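For concreteness, below is a minimal sketch of how such a three-stage construction loop might be scripted. The model wrappers, prompts, and helper names (`caption_model.describe`, `llm.generate_intentions`, `reviewer.accept`) are hypothetical placeholders for illustration, not the released tooling.

```python
# Hypothetical sketch of the three-stage construction loop; model wrappers and
# prompts are placeholders, not the actual OV-IGOD Bench pipeline code.
from dataclasses import dataclass

@dataclass
class IntentionSample:
    image_path: str
    intention: str   # free-form intention, e.g. "I need something to dry my hands"
    bbox: tuple      # (x1, y1, x2, y2) of the object that fulfills the intention

def build_ovigod_samples(image_path, annotations, caption_model, llm, reviewer):
    """annotations: list of (category, bbox) pairs from SUN-RGBD."""
    # Stage 1: detail caption generation (e.g., with InternVL3-8B).
    caption = caption_model.describe(image_path)

    samples = []
    for category, bbox in annotations:
        # Stage 2: intention generation (e.g., prompting GPT-4o with the scene
        # caption and the target object so intentions stay contextually grounded).
        intentions = llm.generate_intentions(caption=caption, target=category)

        # Stage 3: quality checking - keep only intentions judged to be
        # fulfilled by this object in this particular scene.
        for intention in intentions:
            if reviewer.accept(image_path, intention, bbox):
                samples.append(IntentionSample(image_path, intention, bbox))
    return samples
```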
Fig. 3. Architecture overview of our PF-Florence model. (a) Overall design showing the sequence-to-sequence framework with tokenizer, vision encoder enhanced with P-FiLM layers, and transformer decoders for intention-guided object detection. (b) Integration of P-FiLM layers within the DaViT vision encoder. (c) Detailed structure of the P-FiLM layer with Prompted Generation Module (PGM).
Our Prompted Feature-wise Linear Modulation (P-FiLM) addresses the limitations of conventional FiLM conditioning. Whereas conventional FiLM applies static text representations to modulate visual features, P-FiLM uses learnable queries to actively extract the relevant information from the intention text, enabling better understanding of complex free-form intentions and superior performance in intention-guided object detection.
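As a rough illustration of this idea, the sketch below implements a FiLM-style layer in which a set of learnable queries cross-attends to the intention-text tokens, and the pooled result predicts the per-channel scale and shift applied to the visual tokens. The dimensions, number of queries, and pooling choice are assumptions for illustration, not the exact PF-Florence implementation.

```python
import torch
import torch.nn as nn

class PromptedFiLM(nn.Module):
    """FiLM-style conditioning in which learnable queries first extract the
    relevant parts of the intention text, then predict scale/shift parameters.
    Simplified sketch; hyperparameters are illustrative."""

    def __init__(self, vis_dim, txt_dim, num_queries=8, num_heads=8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, txt_dim) * 0.02)
        self.cross_attn = nn.MultiheadAttention(txt_dim, num_heads, batch_first=True)
        # Predict per-channel gamma (scale) and beta (shift) for the visual features.
        self.to_gamma_beta = nn.Linear(txt_dim, 2 * vis_dim)

    def forward(self, vis_feats, txt_feats, txt_pad_mask=None):
        # vis_feats: (B, N, vis_dim) visual tokens; txt_feats: (B, L, txt_dim) text tokens.
        # txt_pad_mask: (B, L) bool, True at padded text positions.
        B = vis_feats.size(0)
        q = self.queries.unsqueeze(0).expand(B, -1, -1)            # (B, Q, txt_dim)
        # Learnable queries selectively attend to the intention tokens.
        extracted, _ = self.cross_attn(q, txt_feats, txt_feats,
                                       key_padding_mask=txt_pad_mask)
        pooled = extracted.mean(dim=1)                              # (B, txt_dim)
        gamma, beta = self.to_gamma_beta(pooled).chunk(2, dim=-1)   # (B, vis_dim) each
        # Channel-wise modulation of the visual tokens (residual-style scaling).
        return vis_feats * (1 + gamma.unsqueeze(1)) + beta.unsqueeze(1)
```

In PF-Florence, such layers are integrated within the DaViT stages of the Florence-2 vision encoder (Fig. 3b); the sketch above omits that integration.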
Our PF-Florence model achieves state-of-the-art performance on the OV-IGOD benchmark, significantly outperforming existing approaches:
| Method | mAP50:95 | mAP50 | mAP75 |
|---|---|---|---|
| *Direct Prediction Methods* | | | |
| Qwen2.5-VL-7B | 3.94 | 6.88 | 3.12 |
| GPT-4V | 5.31 | 10.47 | 6.28 |
| *Two-Stage Pipeline Methods* | | | |
| Cambrian-1-8B + Grounding DINO | 38.88 | 48.12 | 41.78 |
| Phi-4-multimodal + Grounding DINO | 40.03 | 49.13 | 43.29 |
| Qwen2.5-VL-7B + Grounding DINO | 39.92 | 49.30 | 42.18 |
| InternVL3-8B + Grounding DINO | 37.61 | 46.65 | 40.61 |
| GPT-4V + Grounding DINO | 40.75 | 49.21 | 42.72 |
| PF-Florence (Ours) | 52.91 | 66.97 | 56.47 |
Compared to the best baseline (GPT-4V + Grounding DINO), our method improves mAP50:95 by +12.16 (30% relative), mAP50 by +17.76 (36% relative), and mAP75 by +13.75 (32% relative).
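The table reports standard COCO-style detection metrics. Assuming ground-truth annotations and predictions are exported in the usual COCO JSON format (the file names below are placeholders), these numbers can be reproduced with pycocotools:

```python
# Standard COCO-style evaluation with pycocotools; the annotation and result
# file paths are placeholders for the OV-IGOD test split and model predictions.
from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval

coco_gt = COCO("ovigod_test_annotations.json")             # ground-truth boxes
coco_dt = coco_gt.loadRes("pf_florence_predictions.json")  # detections with scores

evaluator = COCOeval(coco_gt, coco_dt, iouType="bbox")
evaluator.evaluate()
evaluator.accumulate()
evaluator.summarize()   # prints AP@[.50:.95], AP@.50, AP@.75, etc.
```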
Fig. 4. Qualitative comparison on OV-IGOD test cases. Our method demonstrates superior intention understanding and precise localization compared to baseline approaches. Green boxes indicate ground truth annotations, while red boxes show model predictions.
Our qualitative results show that PF-Florence consistently identifies objects that truly align with user intentions, while baseline methods often miss targets or identify contextually inappropriate alternatives. The P-FiLM mechanism enables precise understanding of complex free-form intentions and accurate object localization.
To further validate the practical applicability of our approach, we conduct real-world experiments using naturally captured scenarios that reflect typical user intentions in everyday environments. Due to the limited sample size in real-world settings, we evaluate performance using Intersection over Union (IoU) as the primary metric, comparing our method against representative baselines from both single-stage and two-stage approaches.
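For reference, IoU is the area of intersection between a predicted box and the ground-truth box divided by the area of their union; a minimal implementation:

```python
def box_iou(box_a, box_b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0
```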
Our PF-Florence achieves superior performance in real-world scenarios, demonstrating strong practical applicability:
| Method | IoU (%) |
|---|---|
| GPT-4V | 9.35 |
| GPT-4V + Grounding DINO | 74.98 |
| PF-Florence (Ours) | 81.97 |
Our method achieves 81.97% IoU, significantly outperforming GPT-4V (9.35%) and GPT-4V + Grounding DINO (74.98%), a 6.99-point gain over the best baseline.
Fig. 5. Qualitative comparison on real-world test cases. Green boxes indicate ground truth, red boxes show predictions.
Figure 5 demonstrates the practical advantages of our approach. Single-stage methods like GPT-4V fail to provide precise localization despite understanding textual intentions. Two-stage pipeline methods achieve better localization but suffer from intention-object misalignment—for example, when seeking "something to snack on", they incorrectly detect irrelevant objects like tennis balls alongside the intended snack. Our PF-Florence consistently identifies objects that truly align with user intentions while avoiding contextually irrelevant detections.
Coming Soon...
BibTeX citation will be available after publication.