IET Computer Vision
Volume 11, Issue 7, October 2017
Unbiased evaluation of keypoint detectors with respect to rotation invariance
- Author(s): Karol Matusiak ; Piotr Skulimowski ; Pawel Strumillo
- Source: IET Computer Vision, Volume 11, Issue 7, p. 507–516
- DOI: 10.1049/iet-cvi.2016.0434
- Type: Article
The authors present the results of a comparative performance study of algorithms for detecting keypoints in digital images. The Harris, good features to track (GFTT), SIFT, SURF, FAST, ORB, BRISK, and MSER keypoint detectors were tested on two types of images: POV-Ray simulated images and photographs from the Caltech 256 image dataset. The repeatability of keypoint detection was tested for each detector over a series of images rotated in one-degree steps from 0 to 180° (3982 images in total). In the evaluation scenario, the authors adopted an original approach in which no single image is held out as the reference image. They conclude that the most computationally complex detector, SIFT, performs best under rotation of the images. However, the FAST and ORB detectors, while less computationally demanding, perform almost equally well; hence, they are viable choices for image processing tasks in mobile applications.
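As a rough illustration of this kind of repeatability measurement, the following sketch rotates an image, re-detects keypoints, and counts detections that reappear near their predicted positions. It is not the authors' exact protocol: the ORB detector, the 2-pixel tolerance, and the border handling are all illustrative assumptions.

```python
import cv2
import numpy as np

def repeatability(img, angle_deg, detector, tol=2.0):
    """Fraction of keypoints re-detected within tol pixels after rotation."""
    h, w = img.shape[:2]
    M = cv2.getRotationMatrix2D((w / 2.0, h / 2.0), angle_deg, 1.0)
    rotated = cv2.warpAffine(img, M, (w, h))

    kp_ref = detector.detect(img, None)
    kp_rot = detector.detect(rotated, None)
    if not kp_ref or not kp_rot:
        return 0.0

    # Predict where each reference keypoint should land in the rotated image.
    pts = np.float32([k.pt for k in kp_ref]).reshape(-1, 1, 2)
    predicted = cv2.transform(pts, M).reshape(-1, 2)
    detected = np.float32([k.pt for k in kp_rot])

    # A keypoint is repeated if some detection lies within tol pixels of its
    # predicted position; points that rotate out of frame simply count as
    # not repeated in this sketch.
    dists = np.linalg.norm(predicted[:, None, :] - detected[None, :, :], axis=2)
    return float((dists.min(axis=1) <= tol).mean())

img = cv2.imread('test.png', cv2.IMREAD_GRAYSCALE)
orb = cv2.ORB_create()
curve = [repeatability(img, a, orb) for a in range(181)]  # 0..180 degrees
```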
Adaptive regularised ℓ2-boosting on clustered sparse coefficients for single image super-resolution
- Author(s): Yulan Han ; Yongping Zhao ; Haifeng Yu
- Source: IET Computer Vision, Volume 11, Issue 7, p. 517–529
- DOI: 10.1049/iet-cvi.2016.0274
- Type: Article
In this study, the authors propose a novel approach for single image super-resolution. Their method is based on the idea of learning a mapping function that reveals the intrinsic relationship between the sparse coefficients of low-resolution (LR) and high-resolution (HR) image patch pairs with respect to their individual dictionaries. An adaptive regularised ℓ2-boosting algorithm is proposed to learn this mapping function. Specifically, to reduce time consumption, the authors cluster the training patches into several clusters. Within each cluster, a pair of dictionaries for LR and HR image patches is jointly trained, and the adaptive regularised ℓ2-boosting algorithm is then employed to obtain the mapping function. Thus, in the reconstruction stage, the corresponding HR image patch can be effectively estimated for each input LR image patch. Extensive experimental results demonstrate that the proposed method achieves quality comparable to that of the top methods.
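The mapping-learning idea can be illustrated with a toy stand-in: cluster the LR sparse codes and fit one regularised linear map per cluster. Ridge regression is used here purely as a placeholder for the authors' adaptive regularised ℓ2-boosting, and the dictionary training itself is assumed to have been done elsewhere.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import Ridge

def train_cluster_maps(codes_lr, codes_hr, n_clusters=8):
    """codes_lr, codes_hr: (n_patches, n_atoms) sparse codes of paired patches."""
    km = KMeans(n_clusters=n_clusters, n_init=10).fit(codes_lr)
    # One regularised LR -> HR coefficient map per cluster.
    maps = {c: Ridge(alpha=0.1).fit(codes_lr[km.labels_ == c],
                                    codes_hr[km.labels_ == c])
            for c in range(n_clusters)}
    return km, maps

def predict_hr_codes(km, maps, codes_lr_new):
    labels = km.predict(codes_lr_new)
    return np.vstack([maps[c].predict(x[None, :])
                      for c, x in zip(labels, codes_lr_new)])
```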
Human-action recognition using a multi-layered fusion scheme of Kinect modalities
- Author(s): Bassem Seddik ; Sami Gazzah ; Najoua Essoukri Ben Amara
- Source: IET Computer Vision, Volume 11, Issue 7, p. 530–540
- DOI: 10.1049/iet-cvi.2016.0326
- Type: Article
This study addresses the problem of efficiently combining the joint, RGB, and depth modalities of the Kinect sensor in order to recognise human actions. For this purpose, a multi-layered fusion scheme concatenates different specific features, builds specialised local and global SVM models, and then iteratively fuses their scores. The authors contribute on two levels: (i) they combine the performance of local descriptors with the strength of global bag-of-visual-words representations, and are then able to generate improved local decisions that allow the handling of noisy frames; (ii) they study the performance of multiple fusion schemes guided by different feature concatenations, concatenation of Fisher vector representations, and later iterative score fusion. To prove the efficiency of their approach, they evaluate it on two challenging public datasets: CAD-60 and CGC-2014. Competitive results are obtained on both benchmarks.
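A bare-bones version of the late score-fusion layer might look as follows. The modality-specific features, the SVM settings, and the fixed fusion weights are assumptions; the paper's iterative fusion is considerably richer.

```python
import numpy as np
from sklearn.svm import SVC

def train_modality_models(features_by_modality, labels):
    """features_by_modality: dict of name -> (n_samples, dim) feature matrix."""
    return {name: SVC(probability=True).fit(X, labels)
            for name, X in features_by_modality.items()}

def fuse_predict(models, features_by_modality, weights):
    fused = None
    for name, model in models.items():
        # predict_proba columns share the same class ordering across models
        # because every model was trained on the same label set.
        p = weights[name] * model.predict_proba(features_by_modality[name])
        fused = p if fused is None else fused + p
    classes = next(iter(models.values())).classes_
    return classes[fused.argmax(axis=1)]
```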
Spatio-temporal multi-scale motion descriptor from a spatially-constrained decomposition for online action recognition
- Author(s): Fabio Martínez ; Antoine Manzanera ; Eduardo Romero
- Source: IET Computer Vision, Volume 11, Issue 7, p. 541–549
- DOI: 10.1049/iet-cvi.2016.0055
- Type: Article
This study presents a spatio-temporal motion descriptor that is computed from a spatially-constrained decomposition and applied to online classification and recognition of human activities. The method starts by computing a dense optical flow without explicit spatial regularisation. Potential human actions are detected at each frame as spatially consistent moving regions of interest (RoIs). Each of these RoIs is then sequentially partitioned to obtain a spatial representation of small overlapped subregions of different sizes, and each subregion is characterised by a set of flow orientation histograms. A particular RoI is then described over time by a set of recursively calculated statistics that collect information from the temporal history of the orientation histograms, forming the action descriptor. At any time, the whole descriptor can be extracted and labelled by a previously trained support vector machine. The method was evaluated on three different public datasets: (i) the ViSOR dataset was used for global classification, obtaining an average accuracy of 95%, and for recognition in long sequences, achieving an average per-frame accuracy of 92.3%; (ii) the KTH dataset was used for global classification; and (iii) the UT datasets were used for the recognition task, obtaining an average accuracy of 80% (frame rate).
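The per-frame descriptor update can be sketched as a flow-orientation histogram combined with a recursively updated (exponentially decayed) statistic, so that a label can be produced at any time. The bin count, the Farneback parameters, and the decay factor below are illustrative, not the paper's values.

```python
import cv2
import numpy as np

def flow_orientation_hist(prev_gray, gray, bins=8):
    """Magnitude-weighted histogram of dense optical-flow orientations."""
    flow = cv2.calcOpticalFlowFarneback(prev_gray, gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    mag, ang = cv2.cartToPolar(flow[..., 0], flow[..., 1])
    hist, _ = np.histogram(ang, bins=bins, range=(0, 2 * np.pi), weights=mag)
    return hist / (hist.sum() + 1e-8)

def update_descriptor(running, hist, decay=0.9):
    # Recursive statistic over the histogram history: the descriptor is
    # available online at every frame, without buffering the whole sequence.
    return hist if running is None else decay * running + (1 - decay) * hist
```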
Adaptive skin detection using face location and facial structure estimation
- Author(s): Yong Luo and Ye-Peng Guan
- Source: IET Computer Vision, Volume 11, Issue 7, p. 550–559
- DOI: 10.1049/iet-cvi.2016.0295
- Type: Article
Reliable and accurate facial skin extraction is the most critical and urgent issue for adaptive skin detection. To resolve this issue, the authors propose an adaptive skin detection method using face location and facial structure estimation. The face location algorithm is developed to improve the reliability of face detection and to extract a face region with a high proportion of skin. Facial structure estimation is exploited to further reduce the impact of non-skin factors on dynamic skin colour modelling, so that the colour space distribution model of the extracted facial skin is very close to that of real facial skin. Finally, the skin in an image is obtained using a hybrid colour space strategy. Extensive experimental comparisons with state-of-the-art methods show the superior performance of the proposed method.
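The adaptive-modelling idea can be sketched as: locate a face, sample colour statistics from a skin-rich part of it, and threshold the whole image with the resulting model. The Haar cascade and the single YCrCb band below are stand-ins for the paper's face location algorithm and hybrid colour space strategy.

```python
import cv2
import numpy as np

def adaptive_skin_mask(bgr):
    gray = cv2.cvtColor(bgr, cv2.COLOR_BGR2GRAY)
    cascade = cv2.CascadeClassifier(
        cv2.data.haarcascades + 'haarcascade_frontalface_default.xml')
    faces = cascade.detectMultiScale(gray, 1.1, 5)
    if len(faces) == 0:
        return None
    x, y, w, h = faces[0]
    ycrcb = cv2.cvtColor(bgr, cv2.COLOR_BGR2YCrCb)
    # Sample the central part of the face as a high-skin-proportion region.
    roi = ycrcb[y + h // 4: y + 3 * h // 4, x + w // 4: x + 3 * w // 4]
    mean = roi.reshape(-1, 3).mean(axis=0)
    std = roi.reshape(-1, 3).std(axis=0)
    # Threshold the whole image with a band around the sampled statistics.
    lo = np.clip(mean - 2.5 * std, 0, 255).astype(np.uint8)
    hi = np.clip(mean + 2.5 * std, 0, 255).astype(np.uint8)
    return cv2.inRange(ycrcb, lo, hi)
```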
Human interaction recognition fusing multiple features of depth sequences
- Author(s): Jianjun Li ; Xia Mao ; Lijiang Chen ; Lan Wang
- Source: IET Computer Vision, Volume 11, Issue 7, p. 560–566
- DOI: 10.1049/iet-cvi.2017.0025
- Type: Article
Human interaction recognition plays a major role in building intelligent video surveillance systems. Recently, depth data captured by the emerging RGB-D sensors has begun to show its importance for human interaction recognition. This study proposes a novel framework for human interaction recognition using depth information, including an algorithm to reconstruct a depth sequence with as few key frames as possible. The proposed framework includes two essential modules. First, key frames are extracted under a sparsity constraint; then a fused multi-feature representation is constructed from two types of available features using max-pooling. Finally, the fused features are sent directly to an SVM for recognition of the human activity. The study explores a static and dynamic feature fusion method to improve recognition performance using the contextual relevance of continuous frames. A weight is used to fuse shape and optical flow features, which not only enhances the description of human behavioural characteristics in the spatio-temporal domain, but also effectively reduces the adverse impact of distorted points of interest on target recognition. Experimental results show that the proposed approach yields considerable accuracy improvement over state-of-the-art approaches on a public action dataset.
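The fusion step might be sketched as follows: max-pool each feature type over the selected key frames, then weight shape against optical-flow features before classification. Key-frame selection and the feature extractors are assumed given; the weight w is a tunable assumption, not the paper's value.

```python
import numpy as np
from sklearn.svm import SVC

def fuse_sequence(shape_feats, flow_feats, w=0.6):
    """shape_feats, flow_feats: (n_key_frames, dim) arrays for one sequence."""
    shape_vec = shape_feats.max(axis=0)  # max-pooling over key frames
    flow_vec = flow_feats.max(axis=0)
    # Weighted fusion of the static (shape) and dynamic (flow) features.
    return np.concatenate([w * shape_vec, (1 - w) * flow_vec])

# Usage sketch: X = np.vstack([fuse_sequence(s, f) for s, f in sequences])
#               clf = SVC().fit(X, y)
```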
A new low-complexity patch-based image super-resolution
- Author(s): Pejman Rasti ; Kamal Nasrollahi ; Olga Orlova ; Gert Tamberg ; Cagri Ozcinar ; Thomas B. Moeslund ; Gholamreza Anbarjafari
- Source: IET Computer Vision, Volume 11, Issue 7, p. 567–576
- DOI: 10.1049/iet-cvi.2016.0463
- Type: Article
In this study, a novel single image super-resolution (SR) method is proposed, which uses a dictionary generated from pairs of high-resolution (HR) images and their corresponding low-resolution (LR) representations. First, HR and LR dictionaries are created by dividing HR and LR images into patches. Afterwards, when performing SR, the distance between every patch of the input LR image and each of the available LR patches in the LR dictionary is calculated. The LR dictionary patch at minimum distance from the input LR patch is taken, and its counterpart from the HR dictionary is passed through an illumination enhancement process, resulting in consistent illumination between neighbouring patches. This process is applied to all patches of the LR image. Finally, in order to remove the blocking effect caused by merging the patches, an average of the obtained HR image and the interpolated image is calculated. Furthermore, it is shown that the size of the dictionaries can be reduced to a great degree, and the speed of the system is improved by 62.5%. Quantitative and qualitative analyses of the experimental results show the superiority of the proposed technique over conventional and state-of-the-art methods.
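The retrieval core reduces to a nearest-neighbour lookup between paired dictionaries, as in this sketch; patch extraction, the illumination enhancement step, and the final averaging are omitted here.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def super_resolve_patches(lr_patches, dict_lr, dict_hr):
    """lr_patches: (n, d) input patches; dict_lr: (m, d) LR dictionary;
    dict_hr: (m, D) HR patches paired one-to-one with dict_lr rows."""
    nn = NearestNeighbors(n_neighbors=1).fit(dict_lr)
    _, idx = nn.kneighbors(lr_patches)
    # Each input patch is replaced by the HR counterpart of its nearest
    # LR dictionary patch.
    return dict_hr[idx[:, 0]]
```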
Active learning with label correlation exploration for multi-label image classification
- Author(s): Jian Wu ; Chen Ye ; Victor S. Sheng ; Jing Zhang ; Pengpeng Zhao ; Zhiming Cui
- Source: IET Computer Vision, Volume 11, Issue 7, p. 577–584
- DOI: 10.1049/iet-cvi.2016.0243
- Type: Article
Multi-label image classification has recently attracted considerable attention in machine learning. Active learning is widely used in multi-label learning because it can effectively reduce the human annotation workload required to construct high-performance classifiers. However, annotation by experts is costly, especially when the number of labels in a dataset is large. Inspired by the idea of semi-supervised learning, in this study the authors propose a novel semi-supervised multi-label active learning (SSMAL) method that combines automated annotation with human annotation to reduce the annotation workload associated with the active learning process. In SSMAL, they capture three aspects of potentially useful information – classification prediction information, label correlation information, and example spatial information – and use this information to develop an effective strategy for automated annotation of selected unlabelled example-label pairs. The experimental results obtained in this study demonstrate the effectiveness of the proposed approach.
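The automated-annotation idea can be sketched as splitting unlabelled example-label pairs by classifier confidence: confident pairs are annotated automatically and uncertain ones are sent to the human oracle. The one-vs-rest model and the thresholds below are assumptions; the paper's strategy additionally exploits label correlation and spatial information.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

def split_pairs(model, X_unlabelled, low=0.1, high=0.9):
    """Split example-label pairs into auto-annotatable and human-query sets."""
    # One probability column per label from the fitted one-vs-rest model.
    proba = np.stack([est.predict_proba(X_unlabelled)[:, 1]
                      for est in model.estimators_], axis=1)
    confident = (proba <= low) | (proba >= high)  # annotate automatically
    query = ~confident                            # send to the human oracle
    auto_labels = (proba >= high).astype(int)
    return confident, query, auto_labels

# Usage sketch, with X_l/Y_l a labelled seed set in multilabel-indicator form:
# model = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X_l, Y_l)
# confident, query, auto_labels = split_pairs(model, X_u)
```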
Compact discriminative object representation via weakly supervised learning for real-time visual tracking
- Author(s): Weichao Shen ; Yuwei Wu ; Yunde Jia
- Source: IET Computer Vision, Volume 11, Issue 7, p. 585–595
- DOI: 10.1049/iet-cvi.2016.0287
- Type: Article
Object representations are of great importance for robust visual tracking. Although a high-dimensional representation can encode the input data with more information, exploiting it in a real-time tracking system is intractable due to the high computational cost and memory requirements. In this study, the authors propose a compact discriminative object representation that achieves both good tracking accuracy and efficiency. An ensemble of weak training sets is generated based on the self-representative ability of the tracking samples and is used to learn discriminative functions. Each candidate is then represented by the concatenation of its projection values on all the weak training sets. Tracking is carried out within a Bayesian inference framework in which the classification score of a support vector machine is used to construct the observation model. Evaluations on the TB50 benchmark dataset demonstrate that the proposed algorithm is much more computationally efficient than state-of-the-art methods, with comparable accuracy.
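A simplified rendering of the representation: train a weak linear discriminant on each weak training set and describe each candidate by the vector of its projection values. Weak-set generation from self-representative samples is reduced here to random subsets, which is a deliberate simplification of the paper's construction.

```python
import numpy as np
from sklearn.svm import LinearSVC

def build_weak_models(X, y, n_sets=16, subset=0.5, rng=np.random.default_rng(0)):
    """Train one weak linear discriminant per (random) weak training set."""
    n = len(X)
    models = []
    for _ in range(n_sets):
        idx = rng.choice(n, int(subset * n), replace=False)
        models.append(LinearSVC(max_iter=5000).fit(X[idx], y[idx]))
    return models

def compact_representation(models, candidates):
    # Each candidate becomes the vector of its projection (decision) values,
    # one per weak training set: a compact, discriminative descriptor.
    return np.stack([m.decision_function(candidates) for m in models], axis=1)
```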
Method for unconstrained text detection in natural scene image
- Author(s): Zhandong Liu ; Yong Li ; Xiangwei Qi ; Yong Yang ; Mei Nian ; Haijun Zhang ; Reziwanguli Xiamixiding
- Source: IET Computer Vision, Volume 11, Issue 7, p. 596–604
- DOI: 10.1049/iet-cvi.2016.0452
- Type: Article
Text detection in natural scene images is an important prerequisite for many content-based multimedia understanding applications. The authors present a simple and effective text detection method for natural scene images. First, MSERs are extracted as component candidates by the V-MSER algorithm from the G, H, S, O1, and O2 channels. Since text is composed of character candidates, the authors design an MRF model to exploit the relationship between characters. Second, in order to filter out non-text components, they design a two-layer filtering scheme: most of the non-text components are filtered by the first layer, while the second layer is an AdaBoost classifier trained on compactness, horizontal and vertical variance, and aspect ratio features. Then, only four simple features are adopted to generate component pairs. Finally, according to the orientation similarity of the component pairs, pairs with roughly the same orientation are merged into text lines. The proposed method is evaluated on two public datasets, ICDAR 2011 and MSRA-TD500, achieving F-measures of 82.94% and 75%, respectively. In particular, experimental results on the authors' URMQ_LHASA-TD220 dataset, which contains 220 images for multi-orientation and multi-language text line evaluation, show that the proposed method generalises to detecting scene text lines in different languages.
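The first stage, component-candidate extraction, can be sketched with OpenCV's stock MSER detector plus simple geometric filters; the V-MSER variant, the multi-channel extraction, the MRF model, and the AdaBoost layer are beyond this snippet, and the area/aspect thresholds are illustrative.

```python
import cv2

def component_candidates(gray, min_area=30, max_aspect=5.0):
    """Extract MSER regions and drop clearly non-text ones geometrically."""
    mser = cv2.MSER_create()
    regions, boxes = mser.detectRegions(gray)
    keep = []
    for (x, y, w, h) in boxes:
        # Reject tiny regions and extreme aspect ratios as unlikely characters.
        if w * h >= min_area and max(w / h, h / w) <= max_aspect:
            keep.append((x, y, w, h))
    return keep
```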
Deep neural network with attention model for scene text recognition
- Author(s): Shuohao Li ; Min Tang ; Qiang Guo ; Jun Lei ; Jun Zhang
- Source: IET Computer Vision, Volume 11, Issue 7, p. 605–612
- DOI: 10.1049/iet-cvi.2016.0404
- Type: Article
The authors present a deep neural network (DNN) with an attention model for scene text recognition. The proposed model does not require any segmentation of the input text image. The framework is inspired by the attention models recently presented for speech recognition and image captioning. In the proposed framework, feature extraction, feature attention, and sequence recognition are integrated in a jointly trainable network. Compared with previous approaches, the main contributions are as follows. (i) The attention model is applied within a DNN to recognise scene text, and it can effectively solve the sequence recognition problem caused by variable-length labels. (ii) Rigorous experiments are performed across a number of challenging benchmarks, including the IIIT5K, SVT, ICDAR2003, and ICDAR2013 datasets; the results show that the proposed model is comparable to or better than state-of-the-art methods. (iii) The model contains only 6.5 million parameters; compared with other DNN models for scene text recognition, it has the fewest parameters reported so far.
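One content-based attention step over encoder feature columns, of the kind such frameworks use per decoding step, can be sketched as follows; the dimensions and the scoring function are illustrative and not the paper's exact architecture.

```python
import torch
import torch.nn.functional as F

def attention_step(features, hidden, W_f, W_h, v):
    """features: (T, d) encoder columns; hidden: (h,) decoder state."""
    scores = torch.tanh(features @ W_f + hidden @ W_h) @ v  # (T,) alignment
    alpha = F.softmax(scores, dim=0)                        # attention weights
    context = alpha @ features                              # (d,) glimpse
    return context, alpha

# Toy dimensions and random tensors, purely to show the shapes involved.
T, d, h = 20, 256, 128
features, hidden = torch.randn(T, d), torch.randn(h)
W_f, W_h, v = torch.randn(d, h), torch.randn(h, h), torch.randn(h)
context, alpha = attention_step(features, hidden, W_f, W_h, v)
```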
Retrieval of striated toolmarks using convolutional neural networks
- Author(s): Manuel Keglevic and Robert Sablatnig
- Source: IET Computer Vision, Volume 11, Issue 7, p. 613–619
- DOI: 10.1049/iet-cvi.2017.0161
- Type: Article
The authors propose TripNet as a method for calculating similarities between striated toolmark images. The objective of this system is to detect and compare characteristics of the tools while being invariant to varying parameters such as angle of attack, substrate material, and lighting conditions. Instead of designing a handcrafted feature extractor customised for this task, the authors propose the use of a convolutional neural network. With the proposed system, one-dimensional profiles extracted from images of striated toolmarks are mapped into an embedding. The system is trained by minimising a triplet loss function, so that a similarity measure is defined by the distance in this embedding. The performance is evaluated on the NFI Toolmark database, published by the Netherlands Forensic Institute, which contains 300 striated toolmarks of screwdrivers. The proposed system is able to adapt to a large range of angles of attack, achieving a mean average precision of 0.95 for toolmark comparisons with differences in angle of attack of –. Furthermore, four different triplet selection approaches are proposed and their effect on the retrieval of toolmarks from a database of unseen tools is evaluated in detail.
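The training objective can be sketched as embedding 1D profiles with a small network and minimising a triplet margin loss, so that embedding distance acts as the similarity measure; the architecture, profile length, and margin below are illustrative rather than TripNet's actual configuration.

```python
import torch
import torch.nn as nn

# Small 1D-convolutional embedding network for toolmark profiles.
embed = nn.Sequential(
    nn.Conv1d(1, 16, kernel_size=9, padding=4), nn.ReLU(),
    nn.AdaptiveAvgPool1d(32), nn.Flatten(), nn.Linear(16 * 32, 64),
)
loss_fn = nn.TripletMarginLoss(margin=1.0)

# Random stand-ins for batches of (batch, channel, length) profiles:
# anchor/positive from the same tool, negative from a different tool.
anchor = torch.randn(8, 1, 512)
positive = torch.randn(8, 1, 512)
negative = torch.randn(8, 1, 512)

# Pull same-tool profiles together and push different tools apart.
loss = loss_fn(embed(anchor), embed(positive), embed(negative))
loss.backward()
```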