Abstract of Recognizing action and activities from egocentric images

Alejandro Cartas Ayala

  • Egocentric action recognition consists of determining what a wearable camera user is doing from their own perspective. Its defining characteristic is that the wearer is only partially visible in the images, typically through their hands. As a result, recognition must rely solely on the user's interactions with objects, other people, and the scene. Egocentric action recognition has numerous assistive-technology applications, in particular in rehabilitation and preventive medicine.

    The type of egocentric camera determines the activities or actions that can be predicted. There are roughly two kinds: lifelogging cameras and video cameras. The former continuously take pictures every 20-30 seconds over day-long periods, and the resulting sequences are called visual lifelogs or photo-streams. Compared with video, they lack the motion information that has typically been used to disambiguate actions. We present several egocentric action recognition approaches for both settings.

    We first introduce an approach that classifies still images from lifelogs by combining a convolutional network and a random forest; a minimal sketch of this pipeline is given after the abstract. Since lifelogs show temporal coherence across consecutive images, we also present two architectures based on the long short-term memory (LSTM) network.

    In order to thoroughly measure their generalization performance, we introduce the largest photo-stream dataset for activity recognition. Our tests consider not only unseen days and multiple users, but also the effect of event time boundaries. We finally present domain adaptation strategies for dealing with images from unknown domains in a real-world scenario.

    Our work on egocentric action recognition from videos is primarily focused on object interactions. We present a deep network whose first level models person-to-object interactions and whose second level models sequences of actions as parts of a single activity. The spatial relationship between hands and objects is modeled using a region-based network, whereas the actions and activities are modeled using a hierarchical LSTM. Our last approach explores the importance of the audio produced by egocentric object manipulations. It combines a sparse temporal sampling strategy with a late fusion of audio, RGB, and temporal streams; a sketch of this fusion scheme is also given after the abstract. Experimental results on the EPIC-Kitchens dataset show that multimodal integration leads to better performance than unimodal approaches.
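
The following is a minimal, hypothetical sketch of the still-image pipeline mentioned above: an ImageNet-pretrained CNN is used as a fixed feature extractor and a random forest classifies each lifelog photo. The backbone, feature layer, and hyperparameters are illustrative assumptions rather than the exact configuration of the thesis; the LSTM variants would replace the per-image classifier with a recurrent head over consecutive photos.

```python
# Hypothetical sketch (not the exact thesis configuration): an ImageNet-
# pretrained CNN as a fixed feature extractor, followed by a random forest
# that classifies each lifelog photo independently.
import torch
import torchvision.models as models
import torchvision.transforms as T
from PIL import Image
from sklearn.ensemble import RandomForestClassifier

backbone = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
backbone.fc = torch.nn.Identity()   # expose the 512-d pooled features
backbone.eval()

preprocess = T.Compose([
    T.Resize(256), T.CenterCrop(224), T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def extract_features(image_paths):
    """Return one CNN feature vector per photo."""
    feats = []
    with torch.no_grad():
        for path in image_paths:
            x = preprocess(Image.open(path).convert("RGB")).unsqueeze(0)
            feats.append(backbone(x).squeeze(0).numpy())
    return feats

# train_paths/train_labels are assumed to come from an annotated lifelog set.
# clf = RandomForestClassifier(n_estimators=500, n_jobs=-1)
# clf.fit(extract_features(train_paths), train_labels)
# predicted_activities = clf.predict(extract_features(test_paths))
```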
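
Below is an equally hypothetical sketch of the multimodal idea in the last paragraph: a clip is sparsely sampled into a few segments, and the class scores of independent audio, RGB, and motion networks are fused late by averaging. The stand-in linear networks, feature dimensions, and equal fusion weights are assumptions for illustration only.

```python
# Hypothetical sketch of sparse temporal sampling with late fusion of the
# audio, RGB, and motion streams; the per-modality networks are stand-ins.
import torch
import torch.nn as nn

def sparse_sample(num_frames, num_segments=3):
    """Pick one random frame index from each of num_segments equal chunks
    (assumes num_frames >= num_segments)."""
    bounds = torch.linspace(0, num_frames, num_segments + 1).long()
    return [int(torch.randint(int(bounds[i]), int(bounds[i + 1]), (1,)))
            for i in range(num_segments)]

class LateFusion(nn.Module):
    """Average the class logits produced independently by each modality."""
    def __init__(self, rgb_net, flow_net, audio_net):
        super().__init__()
        self.rgb_net = rgb_net
        self.flow_net = flow_net
        self.audio_net = audio_net

    def forward(self, rgb_feats, flow_feats, audio_feats):
        return (self.rgb_net(rgb_feats)
                + self.flow_net(flow_feats)
                + self.audio_net(audio_feats)) / 3

# Toy usage with linear stand-ins for the three modality networks.
num_classes = 10
model = LateFusion(nn.Linear(2048, num_classes),
                   nn.Linear(1024, num_classes),
                   nn.Linear(128, num_classes))
frame_ids = sparse_sample(num_frames=900)   # e.g. a 30-second clip at 30 fps
scores = model(torch.randn(1, 2048), torch.randn(1, 1024), torch.randn(1, 128))
```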

