IET Computer Vision
Volume 13, Issue 3, April 2019
Heteroscedastic watermark detector in the contourlet domain
- Author(s): Maryam Amirmazlaghani
- Source: IET Computer Vision, Volume 13, Issue 3, pp. 249–260
- DOI: 10.1049/iet-cvi.2018.5254
- Type: Article
A new contourlet domain image watermark detector is proposed in the present study. Since the detector's performance depends entirely on the accuracy of the statistical model, the statistical properties of the contourlet coefficients are studied first. These coefficients are shown to be both heavy-tailed and heteroscedastic, characteristics that previously proposed models cannot capture simultaneously. A two-dimensional generalised autoregressive conditional heteroscedasticity (2D GARCH) model is suggested to overcome this problem; its efficient structure explains the dependencies of the contourlet coefficients. A 2D GARCH model-based contourlet domain watermark detector is designed and its performance is analysed by computing the receiver operating characteristics. The experimental results verify the high accuracy of the proposed detector, its robustness under several types of attacks, and its superiority over alternative watermarking methods.
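The 2D GARCH idea above can be illustrated with a minimal conditional-variance recursion. This is a hedged sketch: the causal neighbourhood (left and upper neighbours) and the parameters a0, a, b are illustrative assumptions, not the paper's exact specification.

```python
import numpy as np

def garch2d_variance(x, a0=0.1, a=0.2, b=0.5):
    """Conditional variance of a 2D GARCH(1,1)-style model over an
    image of coefficients x:

        s2[i,j] = a0 + a * (x[i-1,j]**2 + x[i,j-1]**2)
                     + b * (s2[i-1,j] + s2[i,j-1])

    (illustrative neighbourhood; the paper's support may differ)."""
    h, w = x.shape
    s2 = np.full((h, w), a0)
    for i in range(h):
        for j in range(w):
            past_x2 = past_s2 = 0.0
            if i > 0:
                past_x2 += x[i - 1, j] ** 2
                past_s2 += s2[i - 1, j]
            if j > 0:
                past_x2 += x[i, j - 1] ** 2
                past_s2 += s2[i, j - 1]
            s2[i, j] = a0 + a * past_x2 + b * past_s2
    return s2
```

The local variance surface produced this way is what makes the model heteroscedastic: the variance at each position depends on nearby coefficient magnitudes rather than being constant.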
Dempster–Shafer evidence theory-based multi-feature learning and fusion method for non-rigid 3D model retrieval
- Author(s): Hui Zeng ; Ran Zhang ; Xiuqing Wang ; Dongmei Fu ; Qingting Wei
- Source: IET Computer Vision, Volume 13, Issue 3, pp. 261–266
- DOI: 10.1049/iet-cvi.2018.5293
- Type: Article
This study introduces a novel multi-feature-based non-rigid three-dimensional (3D) model retrieval method. First, for each 3D model, the scale-invariant heat kernel signature (SI-HKS) descriptor and the wave kernel signature (WKS) descriptor are computed for each vertex. Then, normalised weighted bag-of-phrases features are obtained and fed to convolutional neural networks. The trust degree of each kind of descriptor is computed, from which the total trust degree is obtained. Finally, the fusion network is trained and the retrieval results are obtained according to the ranking of the total trust degrees. The authors define different computation methods for the trust degrees and the total trust degrees in the training and testing phases. The Dempster–Shafer (DS) evidence-based total trust degrees are used not only in the feature layer but also in the decision layer, and their final decision results are used in the network learning process, so the proposed method can make full use of the complementary information of the SI-HKS and WKS descriptors. Extensive experiments show that the proposed multi-feature fusion method outperforms both single-feature-based methods and other existing state-of-the-art methods.
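The evidence fusion behind the total trust degrees rests on Dempster's rule of combination. A minimal sketch of the standard rule for two mass functions, with focal elements represented as frozensets, is:

```python
def ds_combine(m1, m2):
    """Dempster's rule of combination for two mass functions.

    m1, m2 map frozenset focal elements to masses summing to 1.
    Intersecting focal elements reinforce each other; mass assigned
    to the empty intersection is treated as conflict and renormalised
    away."""
    combined = {}
    conflict = 0.0
    for a, ma in m1.items():
        for b, mb in m2.items():
            inter = a & b
            if inter:
                combined[inter] = combined.get(inter, 0.0) + ma * mb
            else:
                conflict += ma * mb
    if conflict >= 1.0:
        raise ValueError("sources are in total conflict")
    k = 1.0 - conflict
    return {s: v / k for s, v in combined.items()}
```

For example, combining evidence from two descriptors that partially agree on hypothesis 'A' concentrates the fused mass on 'A'; how the paper maps descriptor outputs to mass functions is not reproduced here.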
Articulated deformable structure approach to human motion segmentation and shape recovery from an image sequence
- Author(s): Peter Boyi Zhang and Yeung Sam Hung
- Source: IET Computer Vision, Volume 13, Issue 3, pp. 267–276
- DOI: 10.1049/iet-cvi.2018.5365
- Type: Article
The aim of this study is to perform motion segmentation and three-dimensional (3D) shape recovery of a dynamic human body from an image sequence. The authors note that human body motion generally consists of large articulations between body parts and small local deformations within each part. On the basis of this observation, they develop an integrated framework that combines articulated structure from motion (SFM) and non-rigid SFM to estimate human body motion and shape as an articulated deformable structure. Unlike existing approaches that apply a low-rank subspace method for motion segmentation, they use a metric constraint for identifying rigid subsets, which is more robust and therefore allows a more relaxed error threshold to be set for fitting rigid subsets, catering for small deformations within individual subsets. An automated statistical procedure is provided for setting this error threshold. The rigid subsets are then linked into articulated kinematic chains by a minimum spanning tree search over a graph of joint costs. Finally, the blend-shape method is applied to model the local deformations of each subset. Experimental results show that the proposed method provides better human motion segmentation and shape recovery than existing methods.
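The minimum spanning tree search that links rigid subsets into kinematic chains can be sketched with Prim's algorithm over a joint-cost graph. The node indexing and the dictionary format for edge costs are assumptions made for illustration:

```python
import heapq

def mst_edges(n, cost):
    """Prim's minimum spanning tree over n nodes.

    cost maps ordered pairs (i, j) with i < j to a joint cost;
    missing pairs are treated as unconnectable. Returns the list of
    tree edges, i.e. the kinematic links between rigid subsets."""
    def c(i, j):
        return cost.get((min(i, j), max(i, j)), float('inf'))

    in_tree = {0}
    edges = []
    heap = [(c(0, j), 0, j) for j in range(1, n)]
    heapq.heapify(heap)
    while heap and len(in_tree) < n:
        w, i, j = heapq.heappop(heap)
        if j in in_tree:
            continue  # a cheaper edge already connected j
        in_tree.add(j)
        edges.append((i, j))
        for k in range(n):
            if k not in in_tree:
                heapq.heappush(heap, (c(j, k), j, k))
    return edges
```

The tree structure guarantees exactly one chain of joints between any two rigid subsets, which is what an articulated skeleton requires.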
Adaptive dual fractional-order variational optical flow model for motion estimation
- Author(s): Bin Zhu ; Lian-Fang Tian ; Qi-Liang Du ; Qiu-Xia Wu ; Farisi Zeyad Sahl ; Yao Yeboah
- Source: IET Computer Vision, Volume 13, Issue 3, pp. 277–284
- DOI: 10.1049/iet-cvi.2018.5285
- Type: Article
Insufficient illumination and illumination variation in image sequences make it challenging for algorithms to obtain clear outlines of objects in motion. This study proposes a high-performance adaptive dual fractional-order variational optical flow model to resolve these issues. The proposed method revitalises the original dual fractional-order optical flow model and adopts a fractional differential mask in both the data and smoothness terms of the traditional Horn–Schunck model. The main innovation of this work is to fit each region of the flow field with its own fractional-order differential mask, with the domain of each region determined adaptively. The order and size of the mask for each region are adjusted according to the image signal-to-noise ratio, while its shape is regulated to prevent interference from surrounding regions. Adapting the masks in this way enables the proposed method to accurately segment moving objects even in regions of poor or variable illumination. The experimental results show that the algorithm outperforms current state-of-the-art algorithms on low-light real-scene videos and achieves competitive results on the Middlebury, KITTI and MPI Sintel public benchmarks.
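Fractional differential masks of the kind used in the data and smoothness terms are commonly built from Grünwald–Letnikov coefficients. A minimal sketch of the 1D kernel, under the assumption of the standard GL construction (the paper's exact mask may differ), is:

```python
def gl_mask(alpha, size):
    """Gruenwald-Letnikov coefficients w_k = (-1)^k * C(alpha, k),
    the 1D kernel of a fractional-order differential mask of order
    alpha, computed via the recurrence

        w_0 = 1,  w_k = w_{k-1} * (1 - (alpha + 1) / k).

    For integer alpha = 1 this reduces to the ordinary first-order
    finite difference [1, -1, 0, ...]."""
    w = [1.0]
    for k in range(1, size):
        w.append(w[-1] * (1.0 - (alpha + 1.0) / k))
    return w
```

Varying alpha per region, as the abstract describes, interpolates smoothly between integer-order derivatives, which is what lets the mask order track the local signal-to-noise ratio.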
Fully-connected semantic segmentation of hyperspectral and LiDAR data
- Author(s): Hakan Aytaylan and Seniha Esen Yuksel
- Source: IET Computer Vision, Volume 13, Issue 3, pp. 285–293
- DOI: 10.1049/iet-cvi.2018.5067
- Type: Article
Semantic segmentation is an emerging field in the computer vision community in which an object is segmented and labelled at once by considering the effects of neighbouring pixels. In this study, the authors propose a new semantic segmentation model that fuses hyperspectral images with light detection and ranging (LiDAR) data in the three-dimensional space defined by Universal Transverse Mercator (UTM) coordinates and solves the task using a fully-connected conditional random field (CRF). First, the pairwise energy in the CRF model takes into account the UTM coordinates of the data and performs fusion in real-world coordinates. Second, as opposed to the commonly used Markov random fields (MRFs), which consider only nearby pixels, the fully-connected CRF treats all the pixels in an image as connected; the authors show that these long-range interactions significantly improve the results compared to traditional MRF models. Third, an adaptive scaling scheme is proposed to decide the weights of the LiDAR and hyperspectral sensors in shadowy or sunny regions. Experimental results on the Houston dataset indicate the effectiveness of the method compared with several MRF-based approaches as well as other competing methods.
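A pairwise potential over real-world coordinates, as used in fully-connected CRFs, is typically a Gaussian kernel of the distance between pixel positions. This sketch assumes that standard form; the weight w and bandwidth theta are illustrative, not the paper's values:

```python
import math

def pairwise_energy(p_i, p_j, w=1.0, theta=10.0):
    """Gaussian pairwise potential between two pixels given their
    real-world (e.g. UTM) coordinates p_i and p_j.

    Nearby pixels get a strong potential (close to w), encouraging
    consistent labels; distant pixels contribute little, but in a
    fully-connected CRF every pair still interacts."""
    d2 = sum((a - b) ** 2 for a, b in zip(p_i, p_j))
    return w * math.exp(-d2 / (2.0 * theta ** 2))
```

Because the kernel is Gaussian, fully-connected inference over all pixel pairs can be made tractable with efficient filtering, which is what distinguishes this family of CRFs from neighbourhood-limited MRFs.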
Multi-task learning for captioning images with novel words
- Author(s): He Zheng ; Jiahong Wu ; Rui Liang ; Ye Li ; Xuzhi Li
- Source: IET Computer Vision, Volume 13, Issue 3, pp. 294–301
- DOI: 10.1049/iet-cvi.2018.5005
- Type: Article
Recent captioning models are limited in their ability to describe concepts unseen in paired image–sentence data. This study presents a multi-task learning framework for describing novel words not present in existing image-captioning datasets. The framework takes advantage of external sources: labelled images from image classification datasets and semantic knowledge extracted from annotated text. The authors propose minimising a joint objective that can learn from these diverse data sources and leverage distributional semantic embeddings. At inference, they modify the beam search step to consider both the captioning model and a language model, enabling the model to generalise to novel words outside the image-captioning datasets. They demonstrate that adding annotated text data to the framework helps the image captioning model describe images with the right novel words. Extensive experiments on both the AI Challenger and Microsoft COCO (MSCOCO) image captioning datasets, in two different languages, demonstrate the ability of the framework to describe novel words such as scenes and objects.
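The modified beam search that scores candidates with both the captioning model and the language model can be sketched as a single expansion step. The log-linear interpolation weight lam and the toy vocabularies are assumptions, not the paper's configuration:

```python
import heapq
import math

def beam_step(beams, cap_probs, lm_probs, lam=0.3, k=2):
    """One beam-search expansion step.

    beams     : list of (token_sequence, cumulative_log_score)
    cap_probs : captioning-model word probabilities for the next token
    lm_probs  : language-model word probabilities (may cover words the
                caption data never paired with images)
    Each candidate word is scored by a weighted sum of the two models'
    log-probabilities; the top-k hypotheses survive."""
    cands = []
    for seq, score in beams:
        for word, p_cap in cap_probs.items():
            s = score \
                + (1 - lam) * math.log(p_cap) \
                + lam * math.log(lm_probs.get(word, 1e-9))
            cands.append((seq + [word], s))
    return heapq.nlargest(k, cands, key=lambda t: t[1])
```

The language-model term is what lets a novel word with weak caption-model support still enter the beam when the surrounding text strongly predicts it.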
Salient object detection via reliable boundary seeds and saliency refinement
- Author(s): Xiyin Wu ; Xiaodi Ma ; Jinxia Zhang ; Zhong Jin
- Source: IET Computer Vision, Volume 13, Issue 3, pp. 302–311
- DOI: 10.1049/iet-cvi.2018.5013
- Type: Article
Salient object detection identifies the most distinctive objects in a scene. In this study, a novel graph-based approach is proposed to detect salient objects via reliable boundary seeds and saliency refinement. A natural image is first mapped to a graph with superpixels as nodes, and saliency information is then diffused over the graph using seeds. Because boundary nodes may themselves be salient, it is not appropriate to use all boundary nodes as background seeds; a boundary saliency measurement is therefore proposed to obtain more accurate background seeds. The information from the background seeds is then diffused by a two-stage scheme, which generates a background-based map and a foreground-based map. Furthermore, to enhance detection accuracy, a refinement model is presented to fuse the information of the background-based and foreground-based maps. Experiments on seven public datasets show that the proposed algorithm outperforms state-of-the-art salient object detection algorithms.
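Seed-based diffusion over a superpixel graph is often realised by manifold ranking; the sketch below makes that assumption and is not necessarily the paper's exact diffusion scheme:

```python
import numpy as np

def diffuse_saliency(W, seeds, alpha=0.99):
    """Diffuse seed labels over a superpixel graph by manifold
    ranking (a common choice for seed-based saliency).

    W     : symmetric affinity matrix (n x n, zero diagonal)
    seeds : indicator vector y, 1 for seed nodes and 0 elsewhere

    Solves f = (D - alpha * W)^-1 y, where D is the degree matrix,
    then normalises the ranking scores to [0, 1]."""
    D = np.diag(W.sum(axis=1))
    f = np.linalg.solve(D - alpha * W, seeds.astype(float))
    return f / (f.max() + 1e-12)
```

Running this once with background seeds and once with the resulting foreground estimate mirrors the two-stage background-then-foreground scheme the abstract describes.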
Online multiple object tracking using confidence score-based appearance model learning and hierarchical data association
- Author(s): Mingjie Liu ; Cheng-Bin Jin ; Bin Yang ; Xuenan Cui ; Hakil Kim
- Source: IET Computer Vision, Volume 13, Issue 3, pp. 312–318
- DOI: 10.1049/iet-cvi.2018.5499
- Type: Article
The goal of multiple object tracking (MOT) is to estimate the locations of objects and maintain their identities consistently to yield their individual trajectories. MOT has developed enormously, but it remains challenging owing to the similar appearances of different objects and occlusion by other objects or the background in complex scenes. In this study, the authors propose confidence score-based appearance model learning and hierarchical data association for MOT. First, the confidence score divides the tracklet-detection associations of the first-stage data association into confident and unconfident results; in the second stage, data association is applied to the unconfident tracklet-detection pairs to improve performance. Furthermore, the confidence score can be employed to enhance the robustness of the appearance model, and because it is fast to compute, it balances accuracy and processing time. The experimental results on challenging public datasets show distinct performance improvements over other state-of-the-art methods and demonstrate the effectiveness of the authors' method for online MOT.
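The confidence-based split into two association stages can be sketched greedily. The threshold tau and the greedy matcher are illustrative assumptions; the paper's association may use a different optimiser:

```python
def hierarchical_associate(pairs, tau=0.5):
    """Two-stage data association by confidence score.

    pairs : list of (tracklet_id, detection_id, confidence)

    Stage 1 matches confident pairs (confidence >= tau) first;
    stage 2 resolves the remaining unconfident pairs. Within each
    stage, pairs are matched greedily in descending confidence, and
    each tracklet/detection is used at most once."""
    matches, used_t, used_d = [], set(), set()
    for stage in (
        [p for p in pairs if p[2] >= tau],   # confident results
        [p for p in pairs if p[2] < tau],    # unconfident results
    ):
        for t, d, c in sorted(stage, key=lambda p: -p[2]):
            if t not in used_t and d not in used_d:
                matches.append((t, d))
                used_t.add(t)
                used_d.add(d)
    return matches
```

Deferring low-confidence pairs to a second pass keeps ambiguous detections from stealing matches that confident pairs should claim first.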
Learning to recognise 3D human action from a new skeleton-based representation using deep convolutional neural networks
- Author(s): Huy-Hieu Pham ; Louahdi Khoudour ; Alain Crouzil ; Pablo Zegers ; Sergio A. Velastin
- Source: IET Computer Vision, Volume 13, Issue 3, pp. 319–328
- DOI: 10.1049/iet-cvi.2018.5014
- Type: Article
Recognising human actions in untrimmed videos is an important and challenging task. An effective three-dimensional (3D) motion representation and a powerful learning model are the two key factors influencing recognition performance. In this study, the authors introduce a new skeleton-based representation for 3D action recognition in videos. The key idea is to transform the 3D joint coordinates of the human body carried in skeleton sequences into RGB images via a colour encoding process. By normalising the 3D joint coordinates and dividing each skeleton frame into five parts, in which the joints are concatenated according to the order of their physical connections, the colour-coded representation captures the spatio-temporal evolution of complex 3D motions independently of sequence length. Different deep convolutional neural networks based on the residual network architecture are then designed and trained on the obtained image-based representations to learn 3D motion features and classify them into action classes. The proposed method is evaluated on two widely used action recognition benchmarks: MSR Action3D and NTU-RGB+D, a very large-scale dataset for 3D human action recognition. The experimental results demonstrate that the proposed method outperforms previous state-of-the-art approaches while requiring less computation for training and prediction.
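The colour encoding of a skeleton sequence into an RGB image can be sketched as follows. Only the normalisation-to-[0, 255] step is shown; the five-part division and physical joint ordering from the paper are omitted:

```python
import numpy as np

def skeleton_to_image(seq):
    """Map a skeleton sequence (frames x joints x 3 coordinates) to
    an RGB image: the x/y/z coordinates become the three colour
    channels, joints index the rows and frames the columns.

    Coordinates are min-max normalised over the whole sequence and
    quantised to [0, 255], so the image size depends only on the
    joint count and frame count, not on the coordinate range."""
    seq = np.asarray(seq, dtype=float)
    lo, hi = seq.min(), seq.max()
    norm = (seq - lo) / (hi - lo + 1e-12)
    img = np.round(255 * norm).astype(np.uint8)
    return img.transpose(1, 0, 2)  # joints x frames x 3
```

Once motions are images, standard residual CNNs for image classification apply unchanged, which is the point of the representation.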
Spontaneous facial expression database for academic emotion inference in online learning
- Author(s): Cunling Bian ; Ya Zhang ; Fei Yang ; Wei Bi ; Weigang Lu
- Source: IET Computer Vision, Volume 13, Issue 3, pp. 329–337
- DOI: 10.1049/iet-cvi.2018.5281
- Type: Article
Academic emotions have a great impact on the learning effect. Normally, emotions are expressed externally in students' facial expressions, speech and behaviour. This paper focuses on automatic academic emotion inference from facial expressions in online learning. Considering the lack of training samples for the inference algorithm, a spontaneous facial expression database is established. It covers the facial expressions of five common academic emotions and consists of two subsets: a video clip database and an image database. In total, 1,274 video clips and 30,184 images from 82 students are included, with samples labelled by both the participants and external coders. An extensive analysis is carried out on the image database using a convolutional neural network (CNN)-based algorithm to infer the self-annotations. Several data augmentation algorithms are applied to improve performance. Additionally, an adaptive data augmentation algorithm based on a spatial transformer network is introduced, which can remove some confounding factors from the original images and clearly improves inference performance, as shown by comparing evaluation indicators before and after its adoption. Such a database should accelerate the application of affective computing in the educational field.
Multi-stream 3D CNN structure for human action recognition trained by limited data
- Author(s): Vahid Ashkani Chenarlogh and Farbod Razzazi
- Source: IET Computer Vision, Volume 13, Issue 3, pp. 338–344
- DOI: 10.1049/iet-cvi.2018.5088
- Type: Article
Here, the authors propose a solution to improve training performance in the limited-training-data case for human action recognition, introducing three different convolutional neural network (CNN) architectures. First, four information channels (optical flows and gradients in the horizontal and vertical directions) are generated from each frame and applied to three-dimensional (3D) CNNs. The three proposed architectures are single-stream, two-stream, and four-stream 3D CNNs. In the single-stream model, all four information channels from each frame are applied to a single stream. In the two-stream architecture, optical flow-x and optical flow-y are applied to one stream and gradient-x and gradient-y to another. In the four-stream architecture, each information channel is applied to a separate stream. The architectures were evaluated in an action recognition system on IXMAS, a dataset recorded simultaneously by five cameras. The four-stream architecture outperformed the others, achieving recognition rates of 87.5, 91.66, 91.11, 88.05, and 81.94% for cameras 0–4, respectively (88.05% on average).
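The four per-frame information channels (optical flow x/y plus horizontal and vertical intensity gradients) can be assembled as below. Optical flow is assumed to be precomputed elsewhere; only the stacking and gradient computation are shown:

```python
import numpy as np

def four_channel_input(frame, flow_x, flow_y):
    """Stack the four information channels fed to the 3D CNNs.

    frame          : greyscale image (H x W)
    flow_x, flow_y : precomputed optical flow components (H x W)

    np.gradient returns derivatives along axis 0 (vertical) and
    axis 1 (horizontal); the result is a 4 x H x W array ordered as
    [flow-x, flow-y, gradient-x, gradient-y]."""
    grad_y, grad_x = np.gradient(frame.astype(float))
    return np.stack([flow_x, flow_y, grad_x, grad_y], axis=0)
```

In the single-stream model all four channels go to one network; the two- and four-stream variants simply slice this array across their streams.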
Adaptive convolutional layer selection based on historical retrospect for visual tracking
- Author(s): Fuhui Tang ; Xiankai Lu ; Xiaoyu Zhang ; Lingkun Luo ; Shiqiang Hu ; Huanlong Zhang
- Source: IET Computer Vision, Volume 13, Issue 3, pp. 345–353
- DOI: 10.1049/iet-cvi.2018.5194
- Type: Article
Visual tracking has recently advanced greatly with the use of convolutional neural networks (CNNs). Existing CNN-based trackers usually exploit features from a single layer or a fixed combination of multiple layers. However, such features characterise an object from only one invariable aspect and cannot adapt to scene variation, which limits the performance of these trackers. To overcome this limitation, the authors study the problem from a new perspective and propose a novel convolutional layer selection method. To obtain robust appearance representations, they investigate the advantages of features extracted from different convolutional layers. To determine the correctness of the tracking prediction and the updated model, they design a verification mechanism based on historical retrospect, which estimates the deviation for each layer by bidirectionally locating the target; this deviation serves as the layer-wise selection criterion. Extensive evaluations on the OTB-2013, visual object tracking (VOT)-2016 and VOT-2017 benchmarks demonstrate that the proposed tracker performs favourably against several state-of-the-art trackers.
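The bidirectional-location deviation behind the verification mechanism can be sketched as a forward-backward error. This is a simplified scalar version of the idea, not the paper's exact score:

```python
def forward_backward_error(track_fwd, track_bwd):
    """Bidirectional verification deviation.

    track_fwd : positions tracked forward from the start frame
    track_bwd : positions tracked backward from the last forward
                position to the start frame

    A reliable layer's features should bring the target back near
    its original position, so the Euclidean distance between the
    original start and the back-tracked start measures the
    deviation used to rank candidate layers."""
    (x0, y0) = track_fwd[0]
    (xb, yb) = track_bwd[-1]
    return ((x0 - xb) ** 2 + (y0 - yb) ** 2) ** 0.5
```

Computing this deviation per convolutional layer and selecting the layer with the smallest value realises the layer-wise selection criterion the abstract describes.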