Ciudad Real, Spain. Humberto Martínez Barberá (ed.), Pablo Bernal Polo (ed.), Nieves Pavón Pulido (ed.), 2025, ISBN 978-84-10327-06-1, pp. 119-131.

This work presents a privacy-preserving system for real-time scene understanding using Vision-Language Models (VLMs). Unlike conventional approaches, our method avoids storing raw video data, retaining only textual descriptions and event logs. To reduce computational cost while maintaining descriptive accuracy, we propose an efficient keyframe selection pipeline that filters video input before VLM processing. We evaluate three strategies: equidistant sampling (baseline), SSIM-based visual diversity, and CLIP-based semantic filtering. Experiments conducted on the Charades dataset show that CLIP-based selection consistently outperforms both the baseline and SSIM approaches, especially in scenarios involving fast motion or occluded actions. Furthermore, certain static scenes are accurately described by all methods, while distant or low-detail actions remain a challenge for every strategy. Notably, reducing the number of frames, regardless of the selection method, proves beneficial not only for computational efficiency but also for avoiding the overgeneration of irrelevant or hallucinated actions. By minimizing the number of frames processed while preserving semantic content, our system enables efficient and privacy-aware deployment of VLMs in smart home environments, paving the way for real-time monitoring, activity recognition, and scalable on-device inference.
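To make the keyframe selection idea concrete, the following is a minimal sketch of CLIP-based semantic filtering: frames are embedded with a CLIP image encoder and a frame is kept only when it is sufficiently dissimilar from the frames already selected. The model name, similarity threshold, and greedy heuristic are assumptions for illustration, not the exact pipeline evaluated in the paper.

```python
# Illustrative sketch of CLIP-based semantic keyframe filtering.
# The model checkpoint, threshold, and greedy-diversity rule are assumptions.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_keyframes(frames: list[Image.Image], sim_threshold: float = 0.92) -> list[int]:
    """Return indices of frames that are semantically distinct from those already kept."""
    with torch.no_grad():
        inputs = processor(images=frames, return_tensors="pt")
        emb = model.get_image_features(**inputs)        # (N, D) image embeddings
        emb = emb / emb.norm(dim=-1, keepdim=True)      # normalize for cosine similarity
    kept = [0]                                          # always keep the first frame
    for i in range(1, len(frames)):
        sims = emb[i] @ emb[kept].T                     # similarity to already-kept frames
        if sims.max().item() < sim_threshold:           # novel enough -> treat as keyframe
            kept.append(i)
    return kept
```

In this sketch, the selected keyframes would then be passed to the VLM for captioning and event logging; equidistant sampling or an SSIM-based diversity filter (e.g., using skimage.metrics.structural_similarity on consecutive frames) could be swapped in as drop-in alternatives to `clip_keyframes` for comparison.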