Efficient Real-Time Scene Description with Vision-Language Models

  • Abraham Casas [1] ; Jesus Martínez-Gómez [1] ; Luis de la Ossa [1]
    1. [1] Universidad de Castilla-La Mancha, Ciudad Real, Spain

  • Published in: Proceedings of the XXV International Workshop on Physical Agents / David Herrero Pérez (ed.), Humberto Martínez Barberá (ed.), Pablo Bernal Polo (ed.), Nieves Pavón Pulido (ed.), 2025, ISBN 978-84-10327-06-1, pp. 119-131
  • Language: English
  • DOI: 10.31428/10317/21055
  • Abstract
    • This work presents a privacy-preserving system for real-time scene understanding using Vision-Language Models (VLMs). Unlike conventional approaches, our method avoids storing raw video data, retaining only textual descriptions and event logs. To reduce computational cost while maintaining descriptive accuracy, we propose an efficient keyframe selection pipeline that filters video input before VLM processing. We evaluate three strategies: equidistant sampling (baseline), SSIM-based visual diversity, and CLIP-based semantic filtering. Experiments conducted on the Charades dataset show that CLIP-based selection consistently outperforms both the baseline and the SSIM approach, especially in scenarios involving fast motion or occluded actions. Furthermore, certain static scenes are accurately described by any method, while distant or low-detail actions remain a challenge for all strategies. Notably, reducing the number of frames, regardless of the selection method, proves beneficial not only for computational efficiency but also for avoiding overgeneration of irrelevant or hallucinated actions. By minimizing the number of frames processed while preserving semantic content, our system enables efficient and privacy-aware deployment of VLMs in smart home environments, paving the way for real-time monitoring, activity recognition, and scalable on-device inference.
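
The abstract names three keyframe selection strategies but does not say how each one is implemented. The Python sketch below is only an illustration of the general idea, not the authors' pipeline. `equidistant_indices` shows one plausible form of the equidistant baseline. `select_diverse` is a hypothetical greedy filter that drops a frame when it is too similar to the last kept frame; the greedy rule and the threshold are assumptions. The similarity function is pluggable: an SSIM score on raw frames or a cosine similarity on CLIP image embeddings could be passed in. Here `cosine_similarity` on toy vectors stands in for an embedding comparison.

```python
import math
from typing import Callable, List, Sequence


def equidistant_indices(n_frames: int, k: int) -> List[int]:
    """Return up to k frame indices spread evenly over range(n_frames)."""
    if n_frames <= 0 or k <= 0:
        return []
    k = min(k, n_frames)
    if k == 1:
        return [n_frames // 2]
    # Integer arithmetic keeps the first and last frame and avoids duplicates.
    return [(i * (n_frames - 1)) // (k - 1) for i in range(k)]


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def select_diverse(
    frames: Sequence[Sequence[float]],
    similarity: Callable[[Sequence[float], Sequence[float]], float],
    threshold: float,
) -> List[int]:
    """Greedy filter (assumed rule): keep a frame only if its similarity
    to the most recently kept frame is below the threshold."""
    kept: List[int] = []
    for i, frame in enumerate(frames):
        if not kept or similarity(frames[kept[-1]], frame) < threshold:
            kept.append(i)
    return kept


if __name__ == "__main__":
    # Toy "embeddings": two near-identical frames, then a scene change.
    toy = [[1.0, 0.0], [1.0, 0.01], [0.0, 1.0], [0.0, 1.0]]
    print(equidistant_indices(10, 4))
    print(select_diverse(toy, cosine_similarity, threshold=0.9))
```

In this sketch, only the frames returned by the selector would be passed on to the VLM, which is how a filter of this kind reduces the number of frames processed.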

