IET Computer Vision
Volume 12, Issue 3, April 2018
- Author(s): Yinzhong Qian ; Wenbin Chen ; I-fan Shen
- Source: IET Computer Vision, Volume 12, Issue 3, p. 233 –240
- DOI: 10.1049/iet-cvi.2017.0233
- Type: Article
Action recognition in static images is challenging. The authors propose mutually incoherent pose bases, implicit poselet co-occurrences learned by dictionary training, to describe body pose. Poselets in a pose basis are not constrained in space or quantity, so a pose basis can describe body pose more flexibly than a k-poselet. In their method, the body pose in an image is represented by a sparse linear combination of pose bases, because pose within an action varies while each image captures only a snapshot from a single viewpoint. The challenge in dictionary training is to stabilise the sparse representation, which is the input to a support vector machine (SVM) for action recognition, because the original pose signal is ambiguous while the dictionary is an overcomplete matrix. Their solution is to add cumulative coherence as a penalty in the objective function, inducing the pose bases to become mutually incoherent. They evaluate the method on two popular datasets, and the experimental results show that the pose representation achieves encouraging performance in action recognition. Furthermore, they empirically exploit the complementary role of the local pose feature with deep convolutional neural network features from the holistic image; experimental results demonstrate a marked performance improvement from concatenating the two features.
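The cumulative-coherence penalty can be made concrete with a small sketch. Assuming unit-norm dictionary atoms, the quantity below is the Babel function mu_1(k); the function name and test dictionaries are illustrative, not the authors' code:

```python
import numpy as np

def cumulative_coherence(D, k):
    """Babel function mu_1(k): for each atom, sum the k largest absolute
    inner products with the other (unit-norm) atoms, then take the max."""
    D = D / np.linalg.norm(D, axis=0, keepdims=True)  # unit-norm columns
    G = np.abs(D.T @ D)                               # absolute Gram matrix
    np.fill_diagonal(G, 0.0)                          # ignore self-correlation
    topk = -np.sort(-G, axis=1)[:, :k]                # k largest per row
    return topk.sum(axis=1).max()

# An orthonormal dictionary is perfectly incoherent: mu_1(k) == 0
print(cumulative_coherence(np.eye(4), 2))  # 0.0
```

Adding this quantity as a penalty during dictionary training discourages atoms from correlating, which is what stabilises the sparse codes fed to the SVM.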
- Author(s): Zefenfen Jin ; Zhiqiang Hou ; Wangsheng Yu ; Xin Wang
- Source: IET Computer Vision, Volume 12, Issue 3, p. 241 –251
- DOI: 10.1049/iet-cvi.2017.0176
- Type: Article
Aiming at efficient feature matching and similarity search in visual tracking, this study proposes a tracking algorithm based on the quantum genetic algorithm, exploiting its global optimisation ability. In this framework, pixel positions are taken as individuals in the population, while scale-invariant feature transform and colour features form the target model. By defining an objective function, each individual's fitness value can be measured. Tracking is realised by searching for the pixel with the highest fitness value and returning its corresponding position. Experimental results show that the proposed tracking algorithm performs more efficiently than state-of-the-art tracking algorithms.
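As a rough illustration of fitness-driven position search, here is a toy classical genetic search over pixel positions. It is a stand-in for the paper's quantum genetic algorithm, and the fitness, population size and mutation scheme are invented for the example:

```python
import random

def genetic_search(fitness, width, height, pop=30, gens=40, seed=0):
    """Toy classical GA over pixel positions (stand-in for the paper's
    quantum GA): keep the fittest half, mutate the elite to refill."""
    rng = random.Random(seed)
    popu = [(rng.randrange(width), rng.randrange(height)) for _ in range(pop)]
    for _ in range(gens):
        popu.sort(key=fitness, reverse=True)
        elite = popu[: pop // 2]
        popu = elite + [
            (min(width - 1, max(0, x + rng.randint(-3, 3))),
             min(height - 1, max(0, y + rng.randint(-3, 3))))
            for (x, y) in elite
        ]
    return max(popu, key=fitness)

# Hypothetical target hidden at (25, 40); the fitness peaks there.
best = genetic_search(lambda p: -((p[0] - 25) ** 2 + (p[1] - 40) ** 2), 64, 64)
print(best)
```

In the paper, the fitness would instead compare SIFT and colour features at a candidate position against the target model.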
- Author(s): Chayanut Petpairote ; Suthep Madarasmi ; Kosin Chamnongthai
- Source: IET Computer Vision, Volume 12, Issue 3, p. 252 –260
- DOI: 10.1049/iet-cvi.2017.0352
- Type: Article
Conventional personalised-face neutralisation methods use facial-expression databases; however, creating and maintaining such databases is tedious and should be minimised. Moreover, the face-shape template is a crucial factor and should also be considered. This study proposes a personalised-face neutralisation method that uses the best-matched face-shape template together with a neutral-face database. In personalised-face neutralisation, the best-matched face-shape template, assumed to be the most similar to the expression face being neutralised, is found with a coarse-to-fine strategy and used for warping textures. Additionally, closed eyes are detected and opened by using the eye shape of the best-matched face shape and blending the intensities of the original closed eye with the best-matched one. To evaluate the proposed method, experiments were performed on the CMU Multi-PIE database; the results reveal that the method reduces gradient mean square error by 0.07% on average and improves face recognition accuracy by approximately 1.13% compared with the conventional method, while requiring only a single neutral-face database without expression images.
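The eye-opening step mixes the intensities of the original closed eye with the best-matched template eye. A minimal pixel-wise linear blend, with an illustrative blend weight (the paper's exact mixing rule may differ):

```python
def blend_eye(closed_eye, template_eye, alpha=0.5):
    """Pixel-wise linear blend of the original closed-eye patch with the
    best-matched template eye patch (alpha chosen here for illustration)."""
    return [[alpha * c + (1.0 - alpha) * t
             for c, t in zip(row_c, row_t)]
            for row_c, row_t in zip(closed_eye, template_eye)]

closed = [[10.0, 20.0], [30.0, 40.0]]
template = [[30.0, 40.0], [50.0, 60.0]]
print(blend_eye(closed, template))  # [[20.0, 30.0], [40.0, 50.0]]
```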
- Author(s): Mehdi Dehghani ; Hamed Kharrati ; Hadi Seyedarabi ; Mahdi Baradarannia
- Source: IET Computer Vision, Volume 12, Issue 3, p. 261 –275
- DOI: 10.1049/iet-cvi.2017.0128
- Type: Article
In conventional navigation systems, inertial sensors consist of accelerometers and gyroscopes. These sensors suffer from built-in errors, accumulated drift, and high sensitivity to noise. Accurate gyroscopes are expensive and unsuitable for cost-sensitive applications. One way to minimise these disadvantages is to combine inertial sensors with different aiding sensors. To lower cost, a redundant accelerometer structure used as a gyro-free inertial measurement unit (GFIMU) has been proposed. In this study, gyro-free navigation errors using four tri-axial accelerometers are illustrated. Compensation of errors in angular velocity and position estimation is verified by adding a simple gyroscope and inexpensive stereo cameras, as well as creating an easy-to-use topological map. The topological map is created by means of the scale-invariant feature transform method. The angular velocity estimate is corrected by fusing the measurements from the GFIMU and a simple gyroscope with an unscented Kalman filter. Position is corrected by comparing the position estimated from the GFIMU against the observations of the stereo cameras together with the topological map. The results show that the collaboration of the GFIMU, stereo cameras and a simple gyroscope significantly improves the robustness and accuracy of navigation.
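The fusion step can be illustrated in a heavily simplified static form as inverse-variance weighting of the two angular-velocity estimates. The paper uses an unscented Kalman filter, not this closed form, and the variances below are invented for the example:

```python
def fuse(omega_gfimu, var_gfimu, omega_gyro, var_gyro):
    """Inverse-variance fusion of two angular-velocity estimates: a
    simplified, static stand-in for the paper's unscented Kalman filter."""
    w1, w2 = 1.0 / var_gfimu, 1.0 / var_gyro
    omega = (w1 * omega_gfimu + w2 * omega_gyro) / (w1 + w2)
    var = 1.0 / (w1 + w2)
    return omega, var

# The lower-variance gyroscope reading pulls the fused estimate towards it.
omega, var = fuse(0.10, 0.04, 0.14, 0.01)
print(round(omega, 3), round(var, 4))  # 0.132 0.008
```

The fused variance is always smaller than either input variance, which is the basic reason combining the GFIMU with even a simple gyroscope helps.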
- Author(s): Shaheena Noor and Vali Uddin
- Source: IET Computer Vision, Volume 12, Issue 3, p. 276 –287
- DOI: 10.1049/iet-cvi.2017.0141
- Type: Article
The authors propose a method to improve activity recognition by including contextual information from first-person vision (FPV). Adding context, i.e. the objects seen while performing an activity, increases activity recognition precision, because in goal-oriented tasks human gaze precedes the action and tends to focus on relevant objects. They extract object information from FPV images and combine it with activity information from external or FPV videos to train an artificial neural network (ANN). They evaluated four configurations combining gaze/eye-tracker, head-mounted and externally mounted cameras on three standard cooking datasets: Georgia Tech Egocentric Activities Gaze, the Technische Universität München kitchen dataset and the CMU multi-modal activity database. Adding object information when training the ANN increased the precision (and accuracy) of activity recognition from an average of 58.02% (and 89.78%) to 74.03% (and 93.42%). Experiments also showed that when objects are not considered, an external camera is necessary; when objects are considered, however, the combination of internal and external cameras is optimal because of their complementary advantages in observing hands and objects. Adding object information also reduces the number of ANN training cycles from 513.25 to 139, showing that it provides critical information that speeds up training.
- Author(s): Sayan Kahali ; Sudip Kumar Adhikari ; Jamuna Kanta Sing
- Source: IET Computer Vision, Volume 12, Issue 3, p. 288 –297
- DOI: 10.1049/iet-cvi.2016.0278
- Type: Article
The magnetic resonance (MR) imaging technique has become indispensable in image-guided diagnosis and clinical research. However, present MR image acquisition introduces a slowly varying intensity inhomogeneity (IIH) in MR image data. This study presents a novel technique based on the convolution of three-dimensional (3D) Gaussian surfaces, denoted 'Co3DGS', for volumetric IIH estimation and correction in 3D brain MR image data. A 3D Gaussian surface is approximated using local voxel gradients on each tissue volume corresponding to the grey matter, white matter and cerebrospinal fluid of the 3D brain MR image data, and is then convolved to partially estimate the IIH, which is subsequently removed from the image data. These steps are repeated until there is no significant change in the voxel gradients. The Co3DGS technique has been tested on both synthetic and in-vivo human 3D brain MR image data of different pulse sequences. The empirical results, both qualitative and quantitative, including the coefficient of joint variation, index of variation, index of joint variation, index of class separability and root mean square error, collectively demonstrate that Co3DGS efficiently estimates and removes the IIH from 3D brain MR image data and is superior to several state-of-the-art methods.
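The idea of estimating a slowly varying bias with smooth surfaces can be illustrated with a much simpler 1D stand-in (not the authors' Co3DGS): treat a heavily smoothed copy of the signal as the multiplicative gain and divide it out. All values here are invented for the example:

```python
import numpy as np

def gaussian_kernel(sigma, radius):
    x = np.arange(-radius, radius + 1, dtype=float)
    k = np.exp(-0.5 * (x / sigma) ** 2)
    return k / k.sum()

def correct_inhomogeneity(signal, sigma=20.0):
    """Illustrative low-pass correction (not the authors' Co3DGS): use the
    heavily smoothed signal as the slowly varying gain and divide it out."""
    k = gaussian_kernel(sigma, int(3 * sigma))
    padded = np.pad(signal, len(k) // 2, mode="edge")
    bias = np.convolve(padded, k, mode="valid")
    return signal / bias

# Piecewise-constant "tissue" intensities corrupted by a slow linear gain.
truth = np.repeat([1.0, 2.0, 1.0], 100)
gain = np.linspace(0.8, 1.2, truth.size)
corrected = correct_inhomogeneity(truth * gain)
```

After division, the two ends of the signal (which share the same true intensity) come back into agreement, which is the behaviour an IIH correction is judged on.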
- Author(s): Jiannan Zheng ; Liang Zou ; Z. Jane Wang
- Source: IET Computer Vision, Volume 12, Issue 3, p. 298 –304
- DOI: 10.1049/iet-cvi.2016.0335
- Type: Article
There has been growing interest in food image recognition for a wide range of applications. Among existing methods, mid-level image-part-based approaches show promising performance owing to their suitability for modelling deformable food parts (FPs). However, the achievable accuracy is limited by FP representations based on low-level features. Benefiting from the capacity to learn powerful features from labelled data, deep learning approaches have achieved state-of-the-art performance on several food image recognition problems. Mid-level-based approaches and deep convolutional neural network (DCNN) approaches clearly have their respective advantages and, most importantly, can be considered complementary. As such, the authors propose a novel framework that better utilises DCNN features for food images by jointly exploiting the advantages of both mid-level-based and DCNN approaches. Furthermore, they tackle the challenge of training a DCNN model with unlabelled mid-level part data by designing a clustering-based FP label mining scheme that generates part-level labels from unlabelled data. They test on three benchmark food image datasets, and the numerical results demonstrate that the proposed approach achieves competitive performance compared with existing food image recognition approaches.
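The clustering-based label mining step can be sketched as plain k-means pseudo-labelling of part descriptors. The cluster count, initialisation and data below are illustrative, not the paper's configuration:

```python
import numpy as np

def mine_part_labels(feats, k=3, iters=20):
    """Minimal k-means pseudo-labelling sketch: cluster unlabelled part
    descriptors and use the cluster index as a part-level label."""
    step = max(1, len(feats) // k)
    centres = feats[::step][:k].astype(float)   # spread-out initial centres
    labels = np.zeros(len(feats), dtype=int)
    for _ in range(iters):
        d = np.linalg.norm(feats[:, None, :] - centres[None], axis=2)
        labels = d.argmin(axis=1)               # assign to nearest centre
        for j in range(k):
            if (labels == j).any():
                centres[j] = feats[labels == j].mean(axis=0)
    return labels

# Three well-separated blobs of unlabelled "part" descriptors.
rng = np.random.default_rng(1)
feats = np.vstack([rng.normal(c, 0.1, (20, 8)) for c in (0.0, 5.0, 10.0)])
labels = mine_part_labels(feats)
```

The resulting pseudo-labels can then supervise a part-level DCNN in place of the missing human annotations.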
- Author(s): Hu Zhang ; Wei Wu ; Ding Wang
- Source: IET Computer Vision, Volume 12, Issue 3, p. 305 –311
- DOI: 10.1049/iet-cvi.2016.0338
- Type: Article
The classification of natural scene images is inherently multi-instance multi-label (MIML), since many labels can exist in a single natural scene image. The traditional way of solving MIML is to degenerate it into single-instance single-label (SISL) learning. However, the precision of this method can decrease owing to information loss during the degeneration process. How to solve the MIML problem properly is therefore key to obtaining high accuracy in this research area. An instance-based MIML algorithm combining sparse coding with a deep neural network is proposed. First, an instance-based sparse representation with dictionary learning is adopted. Second, an MIML description model based on a deep network is proposed, which realises parameter self-learning in combination with sparse representations. Third, the residuals of the sparse representation are introduced into the deep neural network. Experimental results show that the method outperforms a number of state-of-the-art approaches.
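The residual introduced in the third step is simply what the dictionary fails to reconstruct, x - D·alpha. A minimal sketch with an invented dictionary and code:

```python
import numpy as np

def sparse_residual(x, D, alpha):
    """Residual of a sparse representation: the part of the instance x that
    the dictionary D and sparse code alpha do not reconstruct."""
    return x - D @ alpha

D = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])  # illustrative dictionary
x = np.array([1.0, 2.0, 3.0])
alpha = np.array([1.0, 2.0])                        # illustrative sparse code
print(sparse_residual(x, D, alpha))  # [0. 0. 0.]
```

A non-zero residual carries instance information lost by the sparse code, which is why feeding it to the network alongside the code can help.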
- Author(s): Mao Wang ; Lili Zhao ; Yuewei Ming ; En Zhu ; Jianping Yin
- Source: IET Computer Vision, Volume 12, Issue 3, p. 312 –321
- DOI: 10.1049/iet-cvi.2016.0504
- Type: Article
In image retrieval, bag-of-visual-words model-based approaches combined with the spatial verification (SP) post-processing step have achieved considerable progress. However, in practice, especially for retrieving landmark images, the authors have observed that this baseline suffers from the problem of burst matches. This issue is caused by repetitive visual patterns that appear frequently among images. Local features derived from these burst patterns can redundantly match others, resulting in many invalid matches that vote over-estimated similarity scores for irrelevant images. Essentially, this problem can be mainly attributed to two reasons: (i) non-exclusive matching leads to one-to-many matches, and (ii) the SP fails to filter burst matches that are closely located. To tackle this problem, a burstiness detection approach using the geometric and visual word information of local features is proposed. Firstly, a geometric filtering strategy is employed to remove matches that are not consistent with the global scale variation. Then, a one-to-one matching strategy is applied to detect and eliminate one-to-many matches. Finally, a down-weighting burstiness strategy is adopted to penalise the voting weight of burst matches. Experimental results on three public datasets demonstrate that the proposed approach achieves comparable or even better accuracy than other popular approaches.
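The one-to-one matching and down-weighting steps can be sketched as follows. The 1/sqrt(n) penalty is one common burst-weighting choice and is assumed here, as is the match data structure:

```python
import math
from collections import Counter

def vote_score(matches):
    """Sketch of burst handling: enforce one-to-one matching greedily over
    (query_feat, db_feat, visual_word) triples assumed pre-sorted by quality,
    then penalise words that still match n times by a 1/sqrt(n) weight."""
    used_q, used_d, kept = set(), set(), []
    for q, d, word in matches:
        if q not in used_q and d not in used_d:
            used_q.add(q)
            used_d.add(d)
            kept.append(word)
    burst = Counter(kept)
    return sum(1.0 / math.sqrt(burst[w]) for w in kept)

# Four one-to-many matches of the same query feature collapse to weight 1.0.
print(vote_score([(0, 0, "w1"), (0, 1, "w1"), (0, 2, "w1"), (0, 3, "w1")]))  # 1.0
```

Without these two steps, the example above would contribute a similarity vote of 4.0 from a single repetitive pattern.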
- Author(s): Imran N. Junejo and Naveed Ahmed
- Source: IET Computer Vision, Volume 12, Issue 3, p. 322 –331
- DOI: 10.1049/iet-cvi.2017.0187
- Type: Article
In this study, the authors propose a novel method to perform foreground extraction for freely moving RGBD cameras. Although foreground extraction, or background subtraction, has long been explored by the computer vision community, depth-based subtraction is relatively new and has not yet been extensively addressed. Most current methods make heavy use of geometric reconstruction, making the solutions quite restrictive. The authors make novel use of RGB and depth data: from the RGB frame, they first extract corner features and represent them with the histogram of oriented gradients (HoG) descriptor, then train a non-linear SVM on these HoG descriptors. During the test phase, they exploit the fact that the foreground object has a distinct depth ordering with respect to the rest of the scene: the positively classified FAST (features from accelerated segment test) features on the test frame initiate a region growing algorithm that obtains an accurate foreground segmentation from the depth data alone. The authors demonstrate the proposed method on six datasets, with encouraging quantitative and qualitative results.
- Author(s): Ali Seydi Keçeli ; Aydın Kaya ; Ahmet Burak Can
- Source: IET Computer Vision, Volume 12, Issue 3, p. 331 –339
- DOI: 10.1049/iet-cvi.2017.0204
- Type: Article
The use of depth sensors in activity recognition is an emerging technology in human–computer interaction. This study presents an approach to recognising human-to-human interactions from depth information. Both hand-crafted features and deep features extracted from depth frames are studied. After selecting and ranking strong features with the ReliefF algorithm, depth frames are assigned to words; interaction sequences are then represented as histograms of words, and a non-linear input mapping is applied over the histogram bins to minimise differences among subjects. Random forest, k-nearest neighbour and support vector machine (SVM) classifiers are trained on these histograms. The final model is tested on the SBU and K3HI datasets and compared with methods in the literature. In the experiments, joint distances, joint angles and the spherical coordinates of the joints were the best-performing features. The most successful results are obtained with a composite-kernel SVM combined with the ReliefF and input-mapping methods. While the ReliefF algorithm helps to select and rank the best features in the feature set, the input mapping reduces differences among interactions of various actors.
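The histogram-of-words step can be sketched as follows, using a square-root transform as one plausible non-linear input mapping (the paper's exact mapping is not specified here):

```python
import math

def word_histogram(frame_words, vocab_size):
    """Build the normalised histogram-of-words for one interaction sequence,
    then apply a square-root mapping over the bins, which compresses large
    counts and so reduces differences between subjects."""
    hist = [0.0] * vocab_size
    for w in frame_words:
        hist[w] += 1.0
    total = sum(hist)
    hist = [h / total for h in hist]
    return [math.sqrt(h) for h in hist]

# Six depth frames of an interaction, each assigned to a word index.
h = word_histogram([0, 0, 1, 2, 2, 2], vocab_size=4)
```

After the square-root mapping, the histogram is L2-normalised by construction, which suits kernel classifiers such as the SVM used here.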
- Author(s): Aditya Roshan and Yun Zhang
- Source: IET Computer Vision, Volume 12, Issue 3, p. 341 –349
- DOI: 10.1049/iet-cvi.2017.0209
- Type: Article
Moving object detection in video streams is a challenging and integral part of computer vision, used in surveillance, traffic and site monitoring, and navigation. Compared with background-based techniques, the frame-differencing technique is computationally inexpensive; however, it detects only the boundary of a moving object. Under changing light conditions, shadows, poor contrast between object and background, or slow object motion, the detection rate of frame differencing falls, because the number of noisy frames and frames with a missing or partially detected object increases. Applying morphological operations with a large kernel size fails to remove the noise, as they may also remove the boundary (or part) of a moving object. In this study, the authors propose a methodology that improves the frame-differencing technique using the footstep sound generated by a moving object. Audio recorded with the video system is processed, and footstep sound is detected using audio features computed as mel-frequency cepstral coefficients. The number of frames within each footstep sound is counted and processed. Spatial segmentation is used to find the moving object in noisy frames, and a missing or partially detected object is recovered by fitting an ellipse using the moving object from neighbouring frames.
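Counting the video frames inside each footstep can be sketched with a plain short-time-energy threshold as a stand-in for the MFCC-based detector described above; the energies and threshold are invented for the example:

```python
def count_footstep_frames(energy, threshold):
    """Return the length (in frames) of each contiguous run of frames whose
    audio energy exceeds the threshold. A simple stand-in for the paper's
    MFCC-based footstep detector."""
    lengths, run = [], 0
    for e in energy:
        if e >= threshold:
            run += 1
        elif run:
            lengths.append(run)
            run = 0
    if run:
        lengths.append(run)
    return lengths

# Two footsteps: one spanning 2 frames, one spanning 3 frames.
print(count_footstep_frames([0.1, 0.9, 0.8, 0.2, 0.7, 0.9, 0.9, 0.1], 0.5))  # [2, 3]
```

Each detected run tells the video pipeline which frames should contain motion, so noisy or empty difference frames inside a footstep can be repaired rather than discarded.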
- Author(s): Jie Zhu ; Shufang Wu ; Xizhao Wang ; Guoqing Yang ; Liyan Ma
- Source: IET Computer Vision, Volume 12, Issue 3, p. 350 –356
- DOI: 10.1049/iet-cvi.2017.0261
- Type: Article
One of the central problems in object recognition is developing appropriate representations for the objects in images. The authors present a novel graph-based approach to image representation. In the proposed image graph, each node represents a patch, and edges connect neighbouring nodes. First, class-specific match-set graphs are generated by matching the image graphs within the same category, and the multi-image matching problem is solved with a seed-expansion strategy. Then, the matches between the match-set graphs and an image graph are taken to be the object patches in that image. Finally, the features extracted from these patches are used for the image representation. Extensive experiments demonstrate that the approach obtains state-of-the-art results on several challenging datasets.
- Author(s): Han-Mu Park ; Dae-Yong Cho ; Kuk-Jin Yoon
- Source: IET Computer Vision, Volume 12, Issue 3, p. 357 –363
- DOI: 10.1049/iet-cvi.2017.0208
- Type: Article
Recently developed object detectors rely on automatically generated object proposals instead of a dense sliding-window search; generating good object proposals has therefore become crucial for improving both the computational cost and the accuracy of object detection. In particular, shape and location errors in object proposals propagate directly to object detection unless additional processes refine the shape and location of the bounding boxes. In this study, the authors present an object proposal refinement algorithm that improves localisation accuracy and refines the shape of object proposals by searching for a boundary-aligned minimum bounding box. They assume that an object consists of several image regions and that the optimal object proposal is well aligned with image region boundaries. Based on this assumption, they design novel boundary-region alignment measures and propose a greedy refinement method built on them. Experiments on the PASCAL VOC 2007 dataset show that the proposed method produces well-localised object proposals and genuinely improves proposal quality.
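A heavily simplified version of boundary-aligned refinement: snap a proposal to the minimum bounding box of the regions it mostly contains. The overlap test below is an illustrative simplification of the authors' alignment measures, and regions are represented by their bounding boxes:

```python
def refine_proposal(proposal, regions, min_overlap=0.5):
    """Keep the image regions whose area lies mostly inside the proposal,
    then snap the proposal to their minimum bounding box. Boxes are
    (x0, y0, x1, y1) tuples."""
    px0, py0, px1, py1 = proposal
    kept = []
    for (x0, y0, x1, y1) in regions:
        ix = max(0, min(px1, x1) - max(px0, x0))   # intersection width
        iy = max(0, min(py1, y1) - max(py0, y0))   # intersection height
        area = (x1 - x0) * (y1 - y0)
        if area and ix * iy / area >= min_overlap:
            kept.append((x0, y0, x1, y1))
    if not kept:
        return proposal
    return (min(r[0] for r in kept), min(r[1] for r in kept),
            max(r[2] for r in kept), max(r[3] for r in kept))

# Two regions inside the proposal define the refined box; the third is mostly
# outside and is ignored.
print(refine_proposal((0, 0, 10, 10),
                      [(1, 1, 4, 4), (3, 3, 9, 9), (9, 9, 20, 20)]))  # (1, 1, 9, 9)
```

A loose proposal thus shrinks to the region boundaries it actually covers, which is the localisation error the paper targets.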
Action recognition from mutually incoherent pose bases in static image
Target tracking approach via quantum genetic algorithm
Personalised-face neutralisation using best-matched face shape with a neutral-face database
Improvement of angular velocity and position estimation in gyro-free inertial navigation based on vision aid equipment
Using context from inside-out vision for improved activity recognition
Convolution of 3D Gaussian surfaces for volumetric intensity inhomogeneity estimation and correction in 3D brain MR image data
Mid-level deep Food Part mining for food image recognition
Multi-instance multi-label learning of natural scene images via sparse coding and multi-layer neural network
Boosting landmark retrieval baseline with burstiness detection
Foreground extraction for freely moving RGBD cameras
Depth features to recognise dyadic interactions
Using mel-frequency audio features from footstep sound and spatial segmentation techniques to improve frame-based moving object detection
Multi-image matching for object recognition
Greedy refinement of object proposals via boundary-aligned minimum bounding box search