Automatic understanding of human activity and action is an important and challenging research area of Computer Vision, with wide-ranging applications in video surveillance, motion analysis, virtual reality interfaces, robot navigation and recognition, video indexing, content-based video retrieval, HCI, health care, choreography, and sports video analysis. This thesis presents a series of techniques to solve the problem of human action recognition in video. The first approach towards this goal is based on a probabilistic optimization model of body parts using a Hidden Markov Model (HMM). This strong model-based approach is able to distinguish between similar actions by considering only the body parts that contribute most to the action, for example the legs for walking and jogging, or the arms for boxing and clapping. The next approach is based on the observation that action recognition can be performed using only the visual cue, i.e. the variation of human pose during the action, from just a few frames rather than the whole sequence. In this method, actions are represented by a Bag-of-key-poses model that captures the changes in human pose during an action.
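To illustrate the Bag-of-key-poses idea, the following Python sketch clusters per-frame pose descriptors into class-specific key poses and classifies a sequence by nearest-key-pose voting. This is a minimal sketch under stated assumptions, not the thesis implementation: it presumes each frame is already encoded as a fixed-length pose descriptor, and the number of key poses k and all function names are illustrative.

```python
# Minimal sketch of a Bag-of-key-poses classifier (illustrative only).
# Assumes each frame is already encoded as a fixed-length pose descriptor.
import numpy as np
from sklearn.cluster import KMeans

def learn_key_poses(train_sequences, labels, k=20):
    """Cluster pose descriptors per action class; cluster centers = key poses."""
    key_poses, key_labels = [], []
    for action in set(labels):
        frames = np.vstack([seq for seq, lab in zip(train_sequences, labels)
                            if lab == action])
        centers = KMeans(n_clusters=k, n_init=10).fit(frames).cluster_centers_
        key_poses.append(centers)
        key_labels.extend([action] * k)
    return np.vstack(key_poses), np.array(key_labels)

def classify(sequence, key_poses, key_labels):
    """Assign each frame to its nearest key pose and take a majority vote,
    so a decision can be made from only a few frames."""
    dists = np.linalg.norm(sequence[:, None, :] - key_poses[None, :, :], axis=2)
    votes = key_labels[dists.argmin(axis=1)]
    actions, counts = np.unique(votes, return_counts=True)
    return actions[counts.argmax()]
```

Because each frame votes independently, truncating the sequence only reduces the number of votes, which is consistent with the few-frames observation above.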
To tackle the problem of recognizing actions in complex scenes, we propose a model-free approach based on spatio-temporal interest points (STIPs) and local features. To this end, a novel selective STIP detector is proposed, which uses a mechanism similar to the non-classical receptive field inhibition exhibited by most orientation-selective neurons in the primary visual cortex. An extension of selective-STIP-based action recognition is then applied to human action recognition in multi-camera systems: selective STIPs from each camera viewpoint are combined using 3D reconstructed data to form 4D STIPs (3D space + time) for multi-view action recognition. The concluding part of the thesis is dedicated to continuous visual event recognition (CVER) on large-scale video datasets. This is an extremely challenging problem due to the required scalability, diverse real-world environments, and wide scene variability. To address these issues, a motion region extraction technique is applied as a preprocessing step. A max-margin generalized Hough transform framework is then used to learn the distribution of feature votes around the activity center, yielding an activity hypothesis that is verified by a Bag-of-Words + SVM action recognition system. We validate the proposed approaches on several benchmark action recognition datasets as well as on small-scale and large-scale activity recognition datasets. We obtain state-of-the-art results, which show the progressive improvement of our proposed techniques for human action and activity recognition in video.
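The selective STIP detector can be pictured as surround inhibition applied to an interest-point response map: a point in a cluttered, textured neighborhood is suppressed, while an isolated motion peak survives. The sketch below is a crude stand-in for that mechanism, assuming a precomputed 2D response map for one frame; the Gaussian surround, the inhibition strength alpha, and all names are assumptions, not the detector proposed in the thesis.

```python
# Illustrative sketch of surround inhibition on an interest-point response
# map, loosely mimicking non-classical receptive field inhibition.
import numpy as np
from scipy.ndimage import gaussian_filter, maximum_filter

def selective_interest_points(response, sigma=4.0, alpha=1.0, thresh=1e-3):
    """Suppress responses that sit inside textured surroundings.

    response : 2D array of spatio-temporal corner scores for one frame.
    alpha    : inhibition strength (assumed parameter, not from the thesis).
    """
    # Blurred response approximates average surround activity: high in
    # texture/clutter, low around an isolated peak.
    surround = gaussian_filter(response, sigma=sigma)
    inhibited = np.maximum(response - alpha * surround, 0.0)
    # Keep local maxima above a threshold as the selective interest points.
    peaks = (inhibited == maximum_filter(inhibited, size=9)) & (inhibited > thresh)
    return np.argwhere(peaks)
```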
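The voting stage of the generalized Hough transform framework can be sketched in the same spirit: each local feature casts weighted votes at learned offsets from its position, and the accumulator peak gives the activity-center hypothesis, which a Bag-of-Words + SVM classifier would then verify. In this sketch the per-word weights stand in for the max-margin learned vote distribution; the data layout and all names are hypothetical.

```python
# Sketch of Hough-style voting for an activity center (illustrative only).
import numpy as np

def hough_vote(words, positions, offsets, weights, grid_shape):
    """Accumulate weighted votes for the activity center.

    words     : visual-word id of each local feature.
    positions : (y, x) array per feature.
    offsets   : dict mapping word id -> list of learned (dy, dx) offsets.
    weights   : dict mapping word id -> vote weight (stand-in for the
                max-margin learned weights).
    """
    acc = np.zeros(grid_shape)
    for w, pos in zip(words, positions):
        for off in offsets.get(w, []):
            y, x = (np.asarray(pos) + np.asarray(off)).astype(int)
            if 0 <= y < grid_shape[0] and 0 <= x < grid_shape[1]:
                acc[y, x] += weights[w]
    # The accumulator peak is the activity-center hypothesis, to be
    # verified by a Bag-of-Words + SVM classifier on the hypothesized region.
    return np.unravel_index(acc.argmax(), acc.shape)
```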