Demystifying Human Action Recognition in Deep Learning with Space-Time Feature Descriptors

Research Paper (postgraduate), 2018, 33 pages

Computer Science - Internet, New Technologies



Abstract

1 Introduction

2 Background and Related Work

2.1 Feature Extraction and Descriptor Representation

2.1.1 Space-Time Interest Points (STIP)

2.1.2 Dense Sampling

2.1.3 Histogram of Oriented Gradients (HOG)

2.1.4 N-Jets

2.1.5 Histograms of Oriented Optical Flow (HOF)

2.1.6 Feature Combination

2.2 Learning Algorithms

2.2.1 Support Vector Machines (SVM)

2.2.2 Convolutional Neural Networks (CNN)

2.2.3 Recurrent Neural Networks (RNN)

2.3 Conclusion

3 Research Method

3.1 Research Hypothesis

3.2 Methodology

3.2.1 Phase 1: Implementation

3.2.2 Phase 2: Training

3.2.3 Phase 3: Testing

3.3 Motivation for Method

3.3.1 Features

3.3.2 Classifier

3.4 Conclusion

4 Research Plan

4.1 Deliverables

4.1.1 Phase 1: Implementation

4.1.2 Phase 2: Training

4.1.3 Phase 3: Testing

4.2 Potential Issues

4.2.1 Lengthy Training Time

4.2.2 Low Accuracies

4.3 Conclusion

5 Conclusion


Abstract

Human Action Recognition is the task of recognizing a set of actions being performed in a video sequence. Reliably and efficiently detecting and identifying actions in video could have vast impacts in the surveillance, security, healthcare and entertainment spaces. The problem addressed in this paper is to explore different engineered spatial and temporal image and video features (and combinations thereof) for the purposes of Human Action Recognition, as well as to explore different Deep Learning architectures for non-engineered features (and classification) that may be used in tandem with the handcrafted features. Further, comparisons between the different combinations of features will be made, and the best, most discriminative feature set will be identified.

Chapter 1: Introduction

Human action recognition is a widely-studied area of research in computer vision and machine learning, most likely because a viable solution could have vast impacts in the surveillance, entertainment, and healthcare spaces (Ke et al., 2013). In the surveillance space, a solution would allow for the detection of anomalous occurrences in surveillance footage, which could then trigger an alert to the relevant authorities/personnel. For instance, Facial Recognition is the new hot tech topic in China (see Figure 1.1).

[Figure is omitted from this preview]

Figure 1.1: A CCTV display using the facial-recognition system Face in Beijing (https://www.washingtonpost.com/news/world/wp/2018/01/07/).

Banks, airports, hotels and even public toilets are all trying to verify people's identities by analyzing their faces. Chinese scientists have used Facial Recognition and Artificial Intelligence to analyze and understand the mountain of incoming video evidence; to track suspects, spot suspicious behaviors and even predict crime; to coordinate the work of emergency services; and to monitor the comings and goings of the country's 1.4 billion people.

Further, automatic detection of unwanted events (e.g. shoplifting and fighting) would be possible. In the entertainment space, human-computer interaction would reach new levels of effectiveness, since reliable detection of emotion and user behaviour would be possible. In the healthcare space, assistance in the rehabilitation of patients would be realisable.

However, finding such a generalised solution is still an open problem, most likely because many common issues plague video-based Human Action Recognition. For instance, occlusions and variations in natural human appearance (such as differing clothing) are rife, and can dramatically affect the robustness and representational power of the extracted features. Additionally, there is the problem of perspective changes and viewpoint variation in the video sequences themselves, caused by camera angle changes and human pose changes. Most current solutions to Human Action Recognition are limited to a small set of possible viewpoints of the humans, and on datasets in which perspective changes, occlusions, and complicated backgrounds are rife (e.g. Hollywood2), state-of-the-art performance remains very poor.

The Human Action Recognition problem can be posed as attempting to estimate a mapping [formula is omitted from this preview]. The problem is typically divided into three main stages (Ke et al., 2013):

- Object segmentation

- Feature extraction and representation

- Classification
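As an illustrative sketch only, the three stages above can be viewed as pluggable functions composed into a pipeline. Every function body below is a trivial stand-in; the names and logic are assumptions for illustration, not taken from this paper:

```python
from typing import List, Sequence

# Hypothetical three-stage Human Action Recognition pipeline.
# Each stage body is a deliberately trivial placeholder.

def segment_key_frames(frames: Sequence[List[float]]) -> List[List[float]]:
    """Stage 1 (object segmentation): keep 'key' frames.
    Placeholder: keeps every other frame."""
    return [f for i, f in enumerate(frames) if i % 2 == 0]

def extract_features(key_frames: List[List[float]]) -> List[float]:
    """Stage 2 (feature extraction): reduce key frames to a descriptor.
    Placeholder: mean pixel intensity per frame, standing in for HOG/HOF."""
    return [sum(f) / len(f) for f in key_frames]

def classify(features: List[float], threshold: float = 0.5) -> str:
    """Stage 3 (classification): map the descriptor to an action label.
    Placeholder: a threshold rule standing in for an SVM or neural network."""
    mean = sum(features) / len(features)
    return "action" if mean > threshold else "no_action"

# Toy 4-frame 'video', each frame flattened to a list of pixel intensities.
video = [[0.9, 0.8], [0.1, 0.2], [0.7, 0.6], [0.3, 0.1]]
label = classify(extract_features(segment_key_frames(video)))
```

The point of the sketch is the staged composition: each placeholder can be swapped for a real segmenter, descriptor, or classifier without changing the surrounding pipeline.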

The object segmentation stage involves extracting key frames - frames in which actions of interest are occurring. These key frames can then be fed into the feature extraction phase, in which a robust, highly-discriminative set of features [formula is omitted from this preview] is extracted from each video sequence. The features can be extracted using any approach (e.g. gradient-based features), but should typically contain both spatial and temporal information. This set of features can then be combined and/or aggregated in some novel fashion.

The feature extraction and representation phase involves extracting powerful features that can then be fed into some classification algorithm. These features must capture as much salient space-time information as possible, while disregarding as much unimportant information as possible. In other words, the features need to capture both the pertinent motion and the shape/appearance information in the video sequence. Typically, these features are computed by extracting local descriptors from some set of detected interest points; however, other techniques such as body modeling and frequency domain approaches also exist (Ke et al., 2013). Once this novel set of feature descriptors has been computed, it can be used to train a classifier. Examples of function approximation and classification algorithms used to estimate f are Support Vector Machines and Neural Networks.
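As a concrete illustration of a gradient-based local descriptor in the spirit of HOG, the toy sketch below bins the gradient orientations of a single grayscale patch into a magnitude-weighted histogram. It deliberately omits the cell/block structure and contrast normalization of real HOG, and the function name and parameters are illustrative assumptions:

```python
import math
from typing import List

def orientation_histogram(patch: List[List[float]], bins: int = 8) -> List[float]:
    """Toy HOG-style descriptor: a histogram of gradient orientations,
    weighted by gradient magnitude, over one grayscale patch.
    Real HOG adds cells, overlapping blocks, and normalization."""
    h, w = len(patch), len(patch[0])
    hist = [0.0] * bins
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            gx = patch[y][x + 1] - patch[y][x - 1]  # horizontal gradient
            gy = patch[y + 1][x] - patch[y - 1][x]  # vertical gradient
            mag = math.hypot(gx, gy)
            ang = math.atan2(gy, gx) % math.pi      # unsigned orientation
            hist[min(int(ang / math.pi * bins), bins - 1)] += mag
    return hist

# A vertical edge: all gradient energy is horizontal (orientation 0),
# so it should land in the first bin.
patch = [[0.0, 0.0, 1.0, 1.0]] * 4
desc = orientation_histogram(patch)
```

A temporal analogue (HOF) applies the same binning to optical-flow vectors between frames instead of spatial gradients.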

In this paper, we will investigate a method to solve the problem of Human Action Recognition. Chapter 2 will present a detailed yet concise literature review of the relevant work that has been done in this problem domain in the fields of Machine Learning and Computer Vision. The current state-of-the-art approach to the problem will also be identified and discussed in Chapter 2. Chapter 3 will discuss the proposed methodologies being employed in this paper, as well as the techniques that will be used. Chapter 4 will present a research plan and timeline, including deliverables at each stage of development, as well as possible issues that could arise (and probable solutions to each). Lastly, Chapter 5 will give an overview of the proposed research and past research, with emphasis on the obtained results.

Chapter 2: Background and Related Work

Human Action Recognition has been widely studied in the Machine Learning and Computer Vision research communities. It is a problem domain that could have great impact if solved in a generalised manner, yet there is still much work to be done, as performance on most non-trivial benchmark datasets remains relatively poor. The core of any Human Action Recognition approach is feature extraction and representation: without very reliable, highly-discriminative features, performance degrades very quickly. Additionally, such a set of features should contain only the pertinent and salient information in the video sequence (in both the spatial and temporal domains), while ignoring as much irrelevant information and noise as possible.

Most approaches adopt a hand-crafted feature engineering approach for the feature extraction phase (Laptev, 2005; Schuldt et al., 2004; Wang et al., 2009; Kienzle et al., 2007), combining various features within the video sequence in a hierarchical manner. Often, a bag-of-features approach is then employed on this extracted feature set, which is fed into a novel classification algorithm such as a Support Vector Machine or Neural Network. However, with the recent onset of Deep Learning in academia and industry, feature extraction approaches leveraging Deep Learning techniques are more frequently being utilised. This is often in the form of an automated feature learning process whereby the Deep Learning algorithm is tasked with learning a novel set of features from the raw data/pixels (Ravanbakhsh et al., 2015). Otherwise, Deep Learning techniques may be used in one or more steps of the hierarchical feature extraction process.
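The bag-of-features step mentioned above can be sketched as follows: each local descriptor is assigned to its nearest codeword in a visual vocabulary, and the whole video is represented by a normalized histogram of codeword counts. In this sketch the vocabulary is hard-coded for illustration; in practice it would be learned from training descriptors (e.g. with k-means), and the function name is an assumption:

```python
from typing import List

def bag_of_features(descriptors: List[List[float]],
                    vocabulary: List[List[float]]) -> List[float]:
    """Assign each local descriptor to its nearest codeword (squared
    Euclidean distance) and return an L1-normalized count histogram."""
    hist = [0.0] * len(vocabulary)
    for d in descriptors:
        dists = [sum((a - b) ** 2 for a, b in zip(d, c)) for c in vocabulary]
        hist[dists.index(min(dists))] += 1.0
    total = sum(hist) or 1.0
    return [h / total for h in hist]

# Toy example: four 2-D local descriptors, a 2-word vocabulary.
vocab = [[0.0, 0.0], [1.0, 1.0]]
descs = [[0.1, 0.0], [0.9, 1.1], [1.0, 0.8], [0.2, 0.1]]
bof = bag_of_features(descs, vocab)
```

The resulting fixed-length histogram is what gets fed to the classifier, which is why bag-of-features pairs naturally with SVMs: it turns a variable number of local descriptors into one vector per video.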

Such multi-faceted approaches are almost always employed, as they can be hand-engineered to accommodate and include both the temporal and the spatial information. Thus, these approaches often yield the best performance and results, since they are able to distinguish between actions more reliably and robustly. Furthermore, one common intuition behind a hierarchical approach is the fact that most human actions consist of complex temporal compositions of simpler base actions (Ravanbakhsh et al., 2015). Either approach, however, can be effective if particular attention is given to the features fed into the models.

