Reconocimiento de Acciones Humanas en Videos usando una Red Neuronal CNN LSTM Robusta

Carlos Ismael Orozco; Eduardo Xamena; María Elena Buemi; Julio Jacobo Berlles

Orozco, Carlos Ismael [1]; Xamena, Eduardo; Buemi, María Elena; Berlles, Julio Jacobo
1. [1] Universidad Nacional de Salta, Argentina
Published in: Ciencia y tecnología, ISSN 1850-0870, ISSN-e 2344-9217, No. 20, 2020, pp. 23-36
Language: Spanish
DOI: 10.18682/cyt.vi0.3288
Parallel titles:
- Human Action Recognition in Videos using a Robust CNN LSTM Approach
Abstract
  Action recognition in videos is currently a topic of interest in computer vision because of potential applications such as multimedia indexing and surveillance in public spaces. In this paper we (1) implement a CNN-LSTM architecture for this task: first, a pre-trained VGG16 convolutional neural network extracts features from the input video; then, an LSTM layer assigns the video sequence to a class. (2) We study how the number of LSTM units affects the performance of the system, using the KTH, UCF-11 and HMDB-51 datasets for the training and test phases. (3) We evaluate the system using accuracy as the metric, since the classes in these datasets are balanced. We obtain 93%, 91% and 47% accuracy on KTH, UCF-11 and HMDB-51 respectively, improving on state-of-the-art results for the first two. Beyond these results, the main contribution of this work lies in the evaluation of different CNN-LSTM architectures for the action recognition task.
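The pipeline in the abstract (per-frame features, then an LSTM, then a class decision scored by accuracy) can be illustrated with a minimal standard-library sketch. It is not the authors' implementation. The per-frame feature vectors are random stand-ins for the output of a pre-trained VGG16, the weights are random and untrained, and every name and size (`TinyLSTMClassifier`, `fake_video_features`, 16-dimensional features, 6 classes) is a hypothetical choice made for illustration. It only shows how data flows through the model and where the number of LSTM units enters.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix, vector):
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def random_matrix(rows, cols, rng, scale=0.1):
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


class TinyLSTMClassifier:
    """One LSTM layer followed by a softmax layer (hypothetical, untrained)."""

    def __init__(self, feature_dim, lstm_units, num_classes, seed=0):
        rng = random.Random(seed)
        self.units = lstm_units  # the quantity varied in the study
        # One weight matrix per gate, applied to the concatenation [h_prev, x_t].
        self.gates = {
            name: random_matrix(lstm_units, lstm_units + feature_dim, rng)
            for name in ("input", "forget", "output", "candidate")
        }
        self.bias = {name: [0.0] * lstm_units for name in self.gates}
        self.out_w = random_matrix(num_classes, lstm_units, rng)

    def _gate(self, name, hx, activation):
        pre = matvec(self.gates[name], hx)
        return [activation(p + b) for p, b in zip(pre, self.bias[name])]

    def predict_proba(self, frame_features):
        h = [0.0] * self.units  # hidden state
        c = [0.0] * self.units  # cell state
        for x in frame_features:  # one feature vector per video frame
            hx = h + list(x)
            i = self._gate("input", hx, sigmoid)
            f = self._gate("forget", hx, sigmoid)
            o = self._gate("output", hx, sigmoid)
            g = self._gate("candidate", hx, math.tanh)
            c = [fk * ck + ik * gk for fk, ck, ik, gk in zip(f, c, i, g)]
            h = [ok * math.tanh(ck) for ok, ck in zip(o, c)]
        # The last hidden state summarizes the whole sequence.
        return softmax(matvec(self.out_w, h))

    def predict(self, frame_features):
        probs = self.predict_proba(frame_features)
        return max(range(len(probs)), key=probs.__getitem__)


def accuracy(true_labels, predicted_labels):
    """Fraction of videos whose predicted class equals the true class."""
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)


def fake_video_features(num_frames, feature_dim, rng):
    """Random stand-in for the per-frame features a CNN would produce."""
    return [[rng.gauss(0.0, 1.0) for _ in range(feature_dim)] for _ in range(num_frames)]


if __name__ == "__main__":
    rng = random.Random(42)
    videos = [fake_video_features(8, 16, rng) for _ in range(4)]
    labels = [0, 1, 2, 3]  # made-up labels, for illustration only
    for units in (4, 8, 16):  # vary the number of LSTM units
        model = TinyLSTMClassifier(feature_dim=16, lstm_units=units, num_classes=6)
        preds = [model.predict(v) for v in videos]
        print(units, preds, accuracy(labels, preds))
```

Because the weights are untrained, the printed accuracies carry no meaning. In a real setup the features would come from the pre-trained network and the LSTM and output weights would be learned on the training split before accuracy is measured on the test split.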