Ciudad Real, Spain. Humberto Martínez Barberá (ed.), Pablo Bernal Polo (ed.), Nieves Pavón Pulido (ed.), 2025, ISBN 978-84-10327-06-1, pp. 119-131.

This work presents a privacy-preserving system for real-time scene understanding using Vision-Language Models (VLMs). Unlike conventional approaches, our method avoids storing raw video data, retaining only textual descriptions and event logs. To reduce computational cost while maintaining descriptive accuracy, we propose an efficient keyframe selection pipeline that filters video input before VLM processing. We evaluate three strategies: equidistant sampling (baseline), SSIM-based visual diversity, and CLIP-based semantic filtering. Experiments conducted on the Charades dataset show that CLIP-based selection consistently outperforms both the baseline and SSIM approaches, especially in scenarios involving fast motion or occluded actions. Furthermore, certain static scenes are accurately described by all methods, while distant or low-detail actions remain a challenge for every strategy. Notably, reducing the number of frames, regardless of the selection method, proves beneficial not only for computational efficiency but also for avoiding the overgeneration of irrelevant or hallucinated actions. By minimizing the number of frames processed while preserving semantic content, our system enables efficient and privacy-aware deployment of VLMs in smart home environments, paving the way for real-time monitoring, activity recognition, and scalable on-device inference.
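To make the keyframe selection idea concrete, the following is a minimal sketch of CLIP-based semantic filtering: frames are embedded with a CLIP image encoder and a frame is kept only when it is sufficiently dissimilar from the frames already selected. The model name, similarity threshold, and greedy heuristic are assumptions for illustration, not the exact pipeline evaluated in the paper.

```python
# Illustrative sketch of CLIP-based semantic keyframe filtering.
# The model checkpoint, threshold, and greedy-diversity rule are assumptions.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_keyframes(frames: list[Image.Image], sim_threshold: float = 0.92) -> list[int]:
    """Return indices of frames that are semantically distinct from those already kept."""
    with torch.no_grad():
        inputs = processor(images=frames, return_tensors="pt")
        emb = model.get_image_features(**inputs)        # (N, D) image embeddings
        emb = emb / emb.norm(dim=-1, keepdim=True)      # normalize for cosine similarity
    kept = [0]                                          # always keep the first frame
    for i in range(1, len(frames)):
        sims = emb[i] @ emb[kept].T                     # similarity to already-kept frames
        if sims.max().item() < sim_threshold:           # novel enough -> treat as keyframe
            kept.append(i)
    return kept
```

In this sketch, the selected keyframes would then be passed to the VLM for captioning and event logging; equidistant sampling or an SSIM-based diversity filter (e.g., using skimage.metrics.structural_similarity on consecutive frames) could be swapped in as drop-in alternatives to `clip_keyframes` for comparison.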