IET Computer Vision
Volume 10, Issue 4, June 2016
Guest Editorial
- Author(s): Hui Wang ; Marcos Nieto ; Zhen Lei ; Suzanne Lyttle
- Source: IET Computer Vision, Volume 10, Issue 4, p. 235–236
- DOI: 10.1049/iet-cvi.2016.0102
- Type: Article
Video analytics revisited
- Author(s): Ayesha Choudhary and Santanu Chaudhury
- Source: IET Computer Vision, Volume 10, Issue 4, p. 237–249
- DOI: 10.1049/iet-cvi.2015.0321
- Type: Article
Video is rich in real-time visual content, yet difficult to interpret and analyse, and video collections necessarily involve large volumes of data. Video analytics strives to automatically discover patterns and correlations in this large volume of video data, helping the end-user to make informed and intelligent decisions and to predict future events from patterns discovered across space and time. In this study, the authors discuss various issues and problems in video analytics and proposed solutions, and present some of the important current applications of video analytics.
Human action recognition using histogram of motion intensity and direction from multiple views
- Author(s): SungYong Chun and Chan-Su Lee
- Source: IET Computer Vision, Volume 10, Issue 4, p. 250–257
- DOI: 10.1049/iet-cvi.2015.0233
- Type: Article
This study presents a human action recognition system for multi-view image sequences. The authors' approach is based on an estimation of local motion from multiple camera views. They propose a new motion descriptor, called the histogram of motion intensity and direction, to capture local motion characteristics of human activity. After image normalisation, they estimate motion flow using dense optical flow. Over regular grids, they extract local flow motion and estimate the dominant angle and intensity of the optical flow; the histogram of dominant angles and intensities serves as a descriptor for each sequence. After identifying head direction, they concatenate the descriptors from each view into a single feature vector for the multi-view sequences. Classification of the proposed feature vector using a support vector machine outperforms three-dimensional optical flow-based approaches, with lower computational requirements. The authors evaluated action recognition on the publicly available i3DPost and Institut National de Recherche en Informatique et en Automatique (INRIA) Xmas Motion Acquisition Sequences databases. Experimental results are competitive with the state of the art and validate the performance of the authors' approach.
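As an illustrative sketch only (not the authors' implementation), the grid-cell descriptor can be approximated by binning dense optical-flow vectors by direction and accumulating their magnitudes; the `(dx, dy)` input format and the bin count are assumptions for the example.

```python
import math

def hmid_cell(flow_vectors, n_bins=8):
    """Histogram of motion intensity over direction bins for one grid cell.
    flow_vectors: iterable of (dx, dy) dense optical-flow vectors."""
    hist = [0.0] * n_bins
    for dx, dy in flow_vectors:
        mag = math.hypot(dx, dy)
        if mag == 0.0:
            continue  # static pixels carry no motion evidence
        angle = math.atan2(dy, dx) % (2 * math.pi)  # map to [0, 2*pi)
        hist[int(angle / (2 * math.pi) * n_bins) % n_bins] += mag
    return hist

def dominant_motion(hist):
    """Dominant flow direction (bin index) and its accumulated intensity."""
    b = max(range(len(hist)), key=hist.__getitem__)
    return b, hist[b]
```

Concatenating such per-cell histograms over the grid, and then across views, yields a fixed-length sequence descriptor in the spirit of the paper.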
Video anomaly detection using deep incremental slow feature analysis network
- Author(s): Xing Hu ; Shiqiang Hu ; Yingping Huang ; Huanlong Zhang ; Hanbing Wu
- Source: IET Computer Vision, Volume 10, Issue 4, p. 258–267
- DOI: 10.1049/iet-cvi.2015.0271
- Type: Article
Existing anomaly detection (AD) approaches rely on various hand-crafted representations of video data, which can be costly, and choosing or designing a hand-crafted representation is difficult when facing a new dataset without prior knowledge. Motivated by feature learning, e.g. deep learning, and its ability to learn useful representations and model high-level abstractions directly from raw data, the authors investigate a universal approach: learning data-driven high-level representations for video AD without relying on hand-crafted representations. A deep incremental slow feature analysis (D-IncSFA) network is constructed and applied to directly learn progressively abstract and global high-level representations from raw data sequences. The D-IncSFA network functions as both feature extractor and anomaly detector, so that AD is completed in one step. The proposed approach can precisely detect global anomalies such as crowd panic. To detect local anomalies, a set of anomaly maps, produced from the network at different scales, is used. The approach is universal and convenient, working well in different types of scenarios with little human intervention and low memory and computational requirements. These advantages are validated by extensive experiments on several challenging datasets.
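The slow feature analysis principle underlying the network can be shown in miniature: among candidate feature signals, prefer the one whose output varies most slowly over time (lowest delta value). This toy sketch illustrates only the slowness criterion, not the D-IncSFA architecture itself.

```python
def delta_value(signal):
    """SFA slowness measure: mean squared temporal difference (lower = slower)."""
    diffs = [(b - a) ** 2 for a, b in zip(signal, signal[1:])]
    return sum(diffs) / len(diffs)

def slowest_feature(candidates):
    """Pick the candidate feature signal with the smallest delta value,
    i.e. the one SFA would rank as the most slowly varying."""
    return min(candidates, key=delta_value)
```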
Multiple deep features learning for object retrieval in surveillance videos
- Author(s): Haiyun Guo ; Jinqiao Wang ; Hanqing Lu
- Source: IET Computer Vision, Volume 10, Issue 4, p. 268–272
- DOI: 10.1049/iet-cvi.2015.0291
- Type: Article
Efficiently indexing and retrieving objects of interest from large-scale surveillance videos is a significant and challenging topic. In this study, the authors present an effective multiple deep features learning approach for object retrieval in surveillance videos. Based on a discriminative convolutional neural network (CNN), they learn multiple deep features to comprehensively describe the visual object. Specifically, they utilise a CNN model pre-trained on ImageNet ILSVRC12 and fine-tuned on their dataset to capture structure information, and train another CNN model, supervised by 11 colour names, to deliver colour information. To improve retrieval performance, the deep features are encoded into short binary codes by locality-sensitive hashing and fused for fast retrieval of the object of interest. Retrieval experiments are performed on a dataset of 100k objects extracted from multi-camera surveillance videos. Comparisons with other common visual features show the effectiveness of the proposed approach.
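The binary-encoding step can be sketched with sign-based random-hyperplane hashing, a standard locality-sensitive hashing scheme for cosine similarity; the dimensions, bit count, and seed below are illustrative, not the paper's settings.

```python
import random

def random_planes(dim, n_bits, seed=42):
    """Gaussian random hyperplanes for sign-based LSH."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]

def lsh_code(vec, planes):
    """Binary code: the sign of the projection onto each hyperplane."""
    return tuple(1 if sum(w * v for w, v in zip(plane, vec)) >= 0.0 else 0
                 for plane in planes)

def hamming(a, b):
    """Hamming distance between two binary codes (cheap to compare at scale)."""
    return sum(x != y for x, y in zip(a, b))
```

Deep feature vectors hashed this way can be ranked by Hamming distance to the query code, which is far faster than comparing the raw float features.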
Two-layer discriminative model for human activity recognition
- Author(s): Mouna Selmi ; Mounîm A. El-Yacoubi ; Bernadette Dorizzi
- Source: IET Computer Vision, Volume 10, Issue 4, p. 273–279
- DOI: 10.1049/iet-cvi.2015.0235
- Type: Article
Most recent methods for action/activity recognition, usually based on static classifiers, have achieved improvements by integrating the context of local interest point (IP) features, such as spatiotemporal IPs, by characterising their neighbourhood at different scales. In this study, the authors propose a new approach that explicitly models the sequential aspect of activities. First, a sliding-window segmentation technique splits the video stream into overlapping short segments. Each window is characterised by a local bag of words of IPs encoded by motion information. A first-layer support vector machine provides, for each window, a vector of conditional class probabilities that summarises all discriminant information relevant for sequence recognition. The sequence of these stochastic vectors is then fed to a hidden conditional random field for inference at the sequence level. The authors also show how their approach extends naturally to the joint segmentation and recognition of a sequence of action classes within a continuous video stream. They have tested their model on various human action and activity datasets, and the results compare favourably with the current state of the art.
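A minimal sketch of the first layer's framing (window size and step are assumed parameters, not the paper's): segment the stream into overlapping windows, then map raw per-window classifier scores into the conditional class-probability vectors that feed the sequence model.

```python
import math

def sliding_windows(n_frames, win, step):
    """Overlapping (start, end) frame segments covering a video stream."""
    return [(s, min(s + win, n_frames))
            for s in range(0, max(n_frames - win, 0) + 1, step)]

def softmax(scores):
    """Map raw per-class scores to a conditional class-probability vector
    (one such vector per window would be fed to the sequence model)."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]
```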
New fusional framework combining sparse selection and clustering for key frame extraction
- Author(s): Mengjuan Fei ; Wei Jiang ; Weijie Mao ; Zhendong Song
- Source: IET Computer Vision, Volume 10, Issue 4, p. 280–288
- DOI: 10.1049/iet-cvi.2015.0237
- Type: Article
Key frame extraction can facilitate rapid browsing and efficient video indexing in many applications. However, to be effective, key frames must preserve sufficient video content while also being compact and representative. This study proposes a syncretic key frame extraction framework that combines sparse selection (SS) and mutual information-based agglomerative hierarchical clustering (MIAHC) to generate effective video summaries. In the proposed framework, the SS algorithm is first applied to the original video sequences to obtain optimal key frames. Then, using content-loss minimisation and representativeness ranking, several candidate key frames are efficiently selected and grouped as initial clusters. A post-processor – an improved MIAHC – subsequently performs further processing to eliminate redundant images and generate the final key frames. The proposed framework overcomes issues such as information redundancy and computational complexity that afflict conventional SS methods by first obtaining candidate key frames instead of accurate key frames. Subsequently, application of the improved MIAHC to these candidate key frames rather than the original video not only results in the generation of accurate key frames, but also reduces the computation time for clustering large videos. The results of comparative experiments conducted on two benchmark datasets verify that the performance of the proposed SS–MIAHC framework is superior to that of conventional methods.
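The mutual-information affinity that drives the agglomerative clustering can be sketched from a joint grey-level histogram of two frames; this is a toy version, and the bin layout is an assumption for the example.

```python
import math

def mutual_information(joint):
    """Mutual information (in nats) from a joint histogram of two frames,
    given as a nested list of grey-level co-occurrence counts."""
    total = float(sum(sum(row) for row in joint))
    px = [sum(row) / total for row in joint]
    py = [sum(joint[i][j] for i in range(len(joint))) / total
          for j in range(len(joint[0]))]
    mi = 0.0
    for i, row in enumerate(joint):
        for j, count in enumerate(row):
            if count:
                pxy = count / total
                mi += pxy * math.log(pxy / (px[i] * py[j]))
    return mi
```

Frames with high mutual information are near-duplicates and get merged into one cluster; low mutual information signals distinct content and keeps frames apart.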
Multi-object tracking using dominant sets
- Author(s): Yonatan T. Tesfaye ; Eyasu Zemene ; Marcello Pelillo ; Andrea Prati
- Source: IET Computer Vision, Volume 10, Issue 4, p. 289–298
- DOI: 10.1049/iet-cvi.2015.0297
- Type: Article
Multi-object tracking is an interesting but challenging task in the field of computer vision. Most previous works based on data association merely consider the relationships between detection responses in a locally limited temporal domain, which makes them inherently prone to identity switches and to difficulties in handling long-term occlusions. In this study, a dominant set clustering based tracker is proposed, which formulates the tracking task as a problem of finding dominant sets in an auxiliary edge-weighted graph. Unlike most techniques limited in temporal locality (i.e. considering only a few frames), the authors utilise pairwise relationships (in appearance and position) between detections across the whole temporal span of the video for data association in a global manner. Meanwhile, a temporal sliding-window technique is used to find tracklets and perform further merging on them. This robust tracklet merging step makes the tracker more robust to long-term occlusions. The authors present results on three challenging datasets (PETS2009-S2L1, TUD-Stadtmitte, and the 'sunny day' sequence of the ETH dataset), showing significant improvements over several state-of-the-art methods.
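The dominant-set machinery the tracker builds on (Pavan and Pelillo's formulation) is commonly extracted with replicator dynamics on the affinity matrix; the toy affinities in the test are illustrative, not tracking data.

```python
def replicator_dynamics(A, iters=200):
    """Evolve a uniform distribution under discrete replicator dynamics.
    The support of the resulting fixed point approximates a dominant set
    of the symmetric affinity matrix A (nested list, zero diagonal)."""
    n = len(A)
    x = [1.0 / n] * n
    for _ in range(iters):
        Ax = [sum(A[i][j] * x[j] for j in range(n)) for i in range(n)]
        avg = sum(x[i] * Ax[i] for i in range(n))  # average payoff
        x = [x[i] * Ax[i] / avg for i in range(n)]
    return x
```

Detections whose weight survives the dynamics form one coherent cluster (one tracked identity); peeling that cluster off and iterating yields the remaining ones.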
Contextualised learning-free three-dimensional body pose estimation from two-dimensional body features in monocular images
- Author(s): Luis Unzueta ; Nerea Aranjuelo ; Jon Goenetxea ; Mikel Rodriguez ; Maria Teresa Linaza
- Source: IET Computer Vision, Volume 10, Issue 4, p. 299–307
- DOI: 10.1049/iet-cvi.2015.0283
- Type: Article
In this study, the authors present a learning-free method for inferring kinematically plausible three-dimensional (3D) human body poses contextualised in a predefined 3D world, given a set of 2D body features extracted from monocular images. This contextualisation has the advantage of providing further semantic information about the observed scene. The method consists of two main steps. First, the camera parameters are obtained by adjusting the reference floor of the predefined 3D world to four key-points in the image. Then, the person's body part lengths and pose are estimated by fitting a parametrised multi-body 3D kinematic model to the 2D image body features, which can be located by state-of-the-art body part detectors. The adjustment is carried out by a hierarchical optimisation procedure in which the model's scale variations are considered first and the body part lengths are then refined. At each iteration, tentative poses are inferred by a combination of efficient perspective-n-point camera pose estimation and constrained viewpoint-dependent inverse kinematics. Experimental results show that the method achieves good accuracy compared with state-of-the-art alternatives, without the need to learn 2D/3D mapping models from training data. The method is efficient, allowing its integration in video soft-sensing systems.
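The quantity such a model-fitting loop drives toward zero can be sketched as pinhole reprojection error between the projected 3D model joints and the detected 2D features; the focal length and principal point in the test are illustrative values, not calibration from the paper.

```python
import math

def project(point3d, f, cx, cy):
    """Pinhole projection of a camera-frame 3D point (X, Y, Z) to pixels."""
    X, Y, Z = point3d
    return (f * X / Z + cx, f * Y / Z + cy)

def reprojection_error(joints3d, joints2d, f, cx, cy):
    """Mean pixel distance between projected model joints and detected 2D
    features - the fitting residual a pose optimisation would minimise."""
    err = 0.0
    for p3, (u, v) in zip(joints3d, joints2d):
        pu, pv = project(p3, f, cx, cy)
        err += math.hypot(pu - u, pv - v)
    return err / len(joints2d)
```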
‘Owl’ and ‘Lizard’: patterns of head pose and eye pose in driver gaze classification
- Author(s): Lex Fridman ; Joonbum Lee ; Bryan Reimer ; Trent Victor
- Source: IET Computer Vision, Volume 10, Issue 4, p. 308–314
- DOI: 10.1049/iet-cvi.2015.0296
- Type: Article
Accurate, robust, inexpensive gaze tracking in the car can help keep a driver safe by facilitating the more effective study of how to improve (i) vehicle interfaces and (ii) the design of future advanced driver assistance systems. In this study, the authors estimate head pose and eye pose from monocular video using methods developed extensively in prior work and ask two new interesting questions. First, how much better can they classify driver gaze using head and eye pose versus just using head pose? Second, are there individual-specific gaze strategies that strongly correlate with how much gaze classification improves with the addition of eye pose information? The authors answer these questions by evaluating data drawn from an on-road study of 40 drivers. The main insight of the study is conveyed through the analogy of an ‘owl’ and ‘lizard’ which describes the degree to which the eyes and the head move when shifting gaze. When the head moves a lot (‘owl’), not much classification improvement is attained by estimating eye pose on top of head pose. On the other hand, when the head stays still and only the eyes move (‘lizard’), classification accuracy increases significantly from adding in eye pose. The authors characterise how that accuracy varies between people, gaze strategies, and gaze regions.
Forensic video solution using facial feature-based synoptic Video Footage Record
- Author(s): Yogameena Balasubramanian ; Kokila Sivasankaran ; Sindhu Priya Krishraj
- Source: IET Computer Vision, Volume 10, Issue 4, p. 315–322
- DOI: 10.1049/iet-cvi.2015.0238
- Type: Article
Person-specific identification is an important problem in computer vision, and forensic video analysis is a key tool in surveillance applications: a Video Footage Record of a specific person can support personalised monitoring. This study proposes a solution that identifies a specific person very quickly offline, which is valuable for early analysis of an incident or crime. The main idea is to reduce the enormous volume of video data using an object-based video synopsis. Viola-Jones face detection and deformable part-based models are then used to detect facial attributes, after which histogram of oriented gradients and oriented centre-symmetric local binary pattern features are extracted. A support vector machine classifier separates weak and strong features, and the strong features are used to recognise the person. The algorithm works well even in complicated situations such as changes in expression, pose and illumination, and even when the face is partially or fully occluded in a few frames: the synoptic video makes it possible to recognise the person from other frames in which they are not occluded. Experimental results on benchmark and real-time datasets demonstrate the effectiveness of the proposed algorithm.
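The centre-symmetric LBP idea can be sketched on a single 3x3 patch: opposite neighbour pairs are compared with each other rather than with the centre pixel, halving the code length relative to standard LBP. The zero threshold and 3x3 neighbourhood here are simplifying assumptions, not the paper's exact operator.

```python
def cs_lbp(patch, threshold=0.0):
    """Centre-symmetric LBP code for a 3x3 patch (nested list of grey values).
    Compares the four opposite neighbour pairs, giving a 4-bit code (0-15)."""
    # Eight neighbours in clockwise order, starting at the top-left corner.
    n = [patch[0][0], patch[0][1], patch[0][2], patch[1][2],
         patch[2][2], patch[2][1], patch[2][0], patch[1][0]]
    code = 0
    for i in range(4):
        if n[i] - n[i + 4] > threshold:
            code |= 1 << i
    return code
```

Histograms of these codes over face regions give a compact, illumination-robust texture descriptor.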
Facial video-based detection of physical fatigue for maximal muscle activity
- Author(s): Mohammad A. Haque ; Ramin Irani ; Kamal Nasrollahi ; Thomas B. Moeslund
- Source: IET Computer Vision, Volume 10, Issue 4, p. 323–330
- DOI: 10.1049/iet-cvi.2015.0215
- Type: Article
Physical fatigue reveals the health condition of a person during, for example, a health checkup, fitness assessment, or rehabilitation training. This study presents an efficient non-contact system for detecting non-localised physical fatigue from maximal muscle activity using facial videos acquired in a realistic environment with natural lighting, where subjects were allowed to voluntarily move their head, change their facial expression, and vary their pose. The proposed method tracks facial feature points by combining 'good features to track' and a 'supervised descent method' to address the challenges that arise in realistic scenarios. A face quality assessment module was also incorporated to reduce erroneous results by discarding low-quality faces that occur in a video sequence due to realistic lighting, head motion, and pose variation. Experimental results show that the proposed system outperforms existing video-based systems for physical fatigue detection.
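The 'good features to track' criterion ranks candidate points by the smaller eigenvalue of the local 2x2 gradient structure matrix, which has a closed form. A sketch of just that score (in practice the matrix entries come from summed image gradients over a window, which is assumed here):

```python
import math

def min_eigenvalue(gxx, gxy, gyy):
    """Smaller eigenvalue of the 2x2 structure matrix [[gxx, gxy], [gxy, gyy]]:
    the Shi-Tomasi 'good feature to track' corner score. Large values mean
    strong gradients in two directions, i.e. a reliably trackable point."""
    half_trace = (gxx + gyy) / 2.0
    det = gxx * gyy - gxy * gxy
    disc = math.sqrt(max(half_trace * half_trace - det, 0.0))
    return half_trace - disc
```

An edge has one strong gradient direction, so its smaller eigenvalue stays near zero and the point is rejected; a corner scores high on both.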