IET Computer Vision
Volume 14, Issue 8, December 2020
ADFNet: accumulated decoder features for real-time semantic segmentation
- Author(s): Hyunguk Choi ; Hoyeon Ahn ; Joonmo Kim ; Moongu Jeon
- Source: IET Computer Vision, Volume 14, Issue 8, p. 555–563
- DOI: 10.1049/iet-cvi.2019.0289
- Type: Article
Semantic segmentation is one of the key technologies in autonomous driving, and ensuring both its real-time operation and high accuracy is of utmost importance for the safety of pedestrians and passengers. To improve segmentation performance with deep neural networks that operate in real time, the authors propose a simple and efficient method called ADFNet, which uses accumulated decoder features. ADFNet operates using only the decoder information, without skip connections between the encoder and decoder. The authors demonstrate that the performance of ADFNet is superior to that of state-of-the-art methods, including the baseline network, on the Cityscapes dataset. Further, they analyse the results obtained via ADFNet using class activation maps and RGB representations of the segmentation results.
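As a rough illustration of the decoder-only accumulation idea, the following PyTorch sketch upsamples and sums successive decoder features without any encoder skip connections. The channel sizes, fusion rule, and module layout are illustrative assumptions, not the authors' architecture.

```python
# Illustrative sketch only: accumulating decoder features without
# encoder-decoder skip connections. Channel sizes and the fusion rule
# (element-wise sum after upsampling) are assumptions, not the paper's spec.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AccumulatingDecoder(nn.Module):
    def __init__(self, in_channels=512, num_classes=19):
        super().__init__()
        self.stages = nn.ModuleList([
            nn.Conv2d(in_channels, 128, 3, padding=1),
            nn.Conv2d(128, 128, 3, padding=1),
            nn.Conv2d(128, 128, 3, padding=1),
        ])
        self.classifier = nn.Conv2d(128, num_classes, 1)

    def forward(self, x):
        acc = None  # running accumulation of decoder features
        for stage in self.stages:
            x = F.relu(stage(x))
            x = F.interpolate(x, scale_factor=2, mode='bilinear',
                              align_corners=False)
            # resize the accumulator to the current resolution and add
            acc = x if acc is None else x + F.interpolate(
                acc, size=x.shape[-2:], mode='bilinear', align_corners=False)
        return self.classifier(acc)
```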
Partial disentanglement of hierarchical variational auto-encoder for texture synthesis
- Author(s): Marek Jakab ; Lukas Hudec ; Wanda Benesova
- Source: IET Computer Vision, Volume 14, Issue 8, p. 564–574
- DOI: 10.1049/iet-cvi.2019.0416
- Type: Article
Multiple research studies have recently demonstrated that deep networks can generate realistic-looking textures and stylised images from a single texture example. However, these approaches suffer from drawbacks. Generative adversarial networks are in general difficult to train, and the multiple feature variations encoded in their latent representation require a priori information to generate images with specific features. Auto-encoders, in turn, are prone to generating blurry output, one of the main reasons being their inability to parameterise complex distributions. The authors present a novel texture generative model architecture extending the variational auto-encoder approach that gradually increases the accuracy of details in the reconstructed images. Thanks to the proposed architecture, the model is able to learn a higher level of detail as a result of the partial disentanglement of latent variables, and it is also capable of synthesising complex real-world textures. The model consists of multiple separate latent layers responsible for learning gradual levels of texture detail; the separate training of these latent representations increases the stability of the learning process and provides partial disentanglement of latent variables. Experiments with the proposed architecture demonstrate the potential of variational auto-encoders in the domain of texture synthesis and tend to yield sharper reconstructed as well as synthesised texture images.
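The following PyTorch sketch illustrates the general idea of separate latent layers for different levels of detail, using a plain two-level VAE in which a fine latent refines a coarse reconstruction. All layer sizes, the residual refinement rule, and the omission of the KL terms are simplifying assumptions, not the authors' design.

```python
# Minimal sketch of separate latent layers for coarse and fine detail.
# Standard VAE reparameterisation; KL terms and convolutional structure
# omitted for brevity. Not the authors' architecture.
import torch
import torch.nn as nn

class TwoLevelLatentVAE(nn.Module):
    def __init__(self, feat_dim=256, z_coarse=32, z_fine=32):
        super().__init__()
        self.enc_coarse = nn.Linear(feat_dim, 2 * z_coarse)  # mu, logvar
        self.enc_fine = nn.Linear(feat_dim, 2 * z_fine)      # mu, logvar
        self.dec_coarse = nn.Linear(z_coarse, feat_dim)
        self.dec_fine = nn.Linear(z_fine, feat_dim)

    @staticmethod
    def sample(stats):
        mu, logvar = stats.chunk(2, dim=-1)
        return mu + torch.randn_like(mu) * (0.5 * logvar).exp()

    def forward(self, x):
        z1 = self.sample(self.enc_coarse(x))  # coarse structure
        z2 = self.sample(self.enc_fine(x))    # residual detail
        coarse = self.dec_coarse(z1)
        return coarse + self.dec_fine(z2)     # fine latent refines coarse output
```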
GLStyleNet: exquisite style transfer combining global and local pyramid features
- Author(s): Zhizhong Wang ; Lei Zhao ; Sihuan Lin ; Qihang Mo ; Huiming Zhang ; Wei Xing ; Dongming Lu
- Source: IET Computer Vision, Volume 14, Issue 8, p. 575–586
- DOI: 10.1049/iet-cvi.2019.0844
- Type: Article
Recent studies using deep neural networks have shown remarkable success in style transfer, especially for artistic and photo-realistic images. However, these methods cannot solve more sophisticated problems: approaches using global statistics fail to capture the small, intricate textures and correct texture scales of artworks, while those based on local patches are deficient in global effect. To address these issues, this study presents a unified model, the global and local style network (GLStyleNet), to achieve exquisite style transfer of higher quality. Specifically, a simple yet effective perceptual loss is proposed that simultaneously considers global semantic-level structure, local patch-level style, and global channel-level effect. This helps transfer not just large-scale, obvious style cues but also subtle, exquisite ones, and dramatically improves the quality of style transfer. In addition, the authors introduce a novel deep pyramid feature fusion module to provide a more flexible style expression and a more efficient transfer process, which helps retain both high-frequency pixel information and low-frequency structural information. They demonstrate the effectiveness and superiority of their approach on numerous style transfer tasks, especially Chinese ancient painting style transfer. Experimental results indicate that this unified approach improves image style transfer quality over previous state-of-the-art methods.
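A hedged Python sketch of a multi-term perceptual loss in this spirit: a global semantic (content) term plus a channel-level Gram-matrix style term. The local patch-level term is omitted and the weights are illustrative; this is not the authors' exact formulation.

```python
# Sketch of a two-term perceptual loss: feature MSE for semantic structure,
# Gram-matrix MSE for channel-level style. Patch-level term and all weights
# are assumptions, not GLStyleNet's specification.
import torch

def gram(feat):
    # feat: (B, C, H, W) -> (B, C, C) channel correlation matrix
    b, c, h, w = feat.shape
    f = feat.reshape(b, c, h * w)
    return f @ f.transpose(1, 2) / (c * h * w)

def perceptual_loss(f_out, f_content, f_style, w_content=1.0, w_style=10.0):
    content_term = torch.mean((f_out - f_content) ** 2)          # semantic structure
    style_term = torch.mean((gram(f_out) - gram(f_style)) ** 2)  # channel-level effect
    return w_content * content_term + w_style * style_term
```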
Multi-mode neural network for human action recognition
- Author(s): Haohua Zhao ; Weichen Xue ; Xiaobo Li ; Zhangxuan Gu ; Li Niu ; Liqing Zhang
- Source: IET Computer Vision, Volume 14, Issue 8, p. 587–596
- DOI: 10.1049/iet-cvi.2019.0761
- Type: Article
Video data have two intrinsic modes, in-frame and temporal, and it is beneficial to incorporate static in-frame features when acquiring dynamic features for video applications. However, some existing methods, such as recurrent neural networks, do not perform well, while others, such as 3D convolutional neural networks (CNNs), are both memory- and time-consuming. This study proposes an effective framework that takes advantage of deep learning for static image feature extraction to tackle video data. After extracting in-frame feature vectors with a pretrained deep network, the authors integrate them into a multi-mode feature matrix, which preserves the multi-mode structure and high-level representation. They propose two models for the follow-up classification. The first is a temporal CNN, which feeds the multi-mode feature matrix directly into a CNN. However, since the characteristics of the multi-mode features differ significantly across modes, the authors further propose the multi-mode neural network (MMNN), in which different modes deploy different types of layers. They evaluate their algorithm on the task of human action recognition. The experimental results show that the MMNN achieves much better performance than existing long short-term memory-based methods and consumes far fewer resources than existing 3D end-to-end models.
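The first step, building the multi-mode feature matrix, can be pictured with a short PyTorch sketch: per-frame vectors from a pretrained image CNN are stacked so that one axis is temporal and the other is the in-frame feature mode. The choice of torchvision's ResNet-18 as the backbone is an assumption for illustration.

```python
# Sketch of building the multi-mode feature matrix from per-frame features.
# Backbone choice (ResNet-18) and input resolution are assumptions.
import torch
import torchvision.models as models

backbone = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
backbone.fc = torch.nn.Identity()   # expose the 512-d feature vector
backbone.eval()

@torch.no_grad()
def multi_mode_matrix(frames):
    # frames: (T, 3, 224, 224) video clip -> (T, 512) feature matrix
    return backbone(frames)

clip = torch.randn(16, 3, 224, 224)    # dummy 16-frame clip
feat_matrix = multi_mode_matrix(clip)  # rows: temporal mode, columns: in-frame mode
print(feat_matrix.shape)               # torch.Size([16, 512])
```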
Detecting dense text in natural images
- Author(s): Dianzhuan Jiang ; Shengsheng Zhang ; Yaping Huang ; Qi Zou ; Xingyuan Zhang ; Mengyang Pu ; Junbo Liu
- Source: IET Computer Vision, Volume 14, Issue 8, p. 597–604
- DOI: 10.1049/iet-cvi.2019.0916
- Type: Article
Most existing text detection methods are motivated by deep learning-based object detection approaches, which may result in serious overlapping between detected text lines, especially in dense text scenarios. This is problematic because, unlike general objects in natural scenes, text boxes do not commonly overlap; moreover, text detection requires higher localisation accuracy than object detection. To tackle these problems, the authors propose a novel dense text detection network (DTDN) to localise tighter text lines without overlapping. Their main novelties are: (i) an intersection-over-union overlap loss, which considers the correlations between an anchor and the ground-truth (GT) boxes and measures how much text area an anchor contains; (ii) a novel anchor sample selection strategy, named CMax-OMin, to select tighter positive samples for training. The CMax-OMin strategy not only considers whether an anchor has the largest overlap with its corresponding GT box (CMax), but also ensures that the anchor overlaps other GT boxes as little as possible (OMin). In addition, they train a bounding-box regressor as post-processing to further improve text localisation. Experiments on scene text benchmark datasets and the authors' proposed dense text dataset demonstrate that DTDN achieves competitive performance, especially in dense text scenarios.
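A minimal NumPy sketch of the CMax-OMin idea: an anchor is kept as a positive sample only if it has the maximum IoU with its own GT box and at most a small overlap with every other GT box. The threshold value is an assumption.

```python
# Illustrative CMax-OMin positive-sample test. The 0.2 threshold is an
# assumption for illustration, not the paper's value.
import numpy as np

def iou(box, boxes):
    # box: (4,), boxes: (N, 4) in (x1, y1, x2, y2) format
    x1 = np.maximum(box[0], boxes[:, 0]); y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2]); y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = lambda b: (b[..., 2] - b[..., 0]) * (b[..., 3] - b[..., 1])
    return inter / (area(box) + area(boxes) - inter)

def cmax_omin_positive(anchor, gt_boxes, gt_idx, omin_thresh=0.2):
    overlaps = iou(anchor, gt_boxes)
    cmax = overlaps.argmax() == gt_idx   # largest IoU with its own GT box
    others = np.delete(overlaps, gt_idx)
    omin = others.size == 0 or others.max() < omin_thresh  # little overlap with other GTs
    return bool(cmax and omin)
```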
Robust locality preserving projections using angle-based adaptive weight method
- Author(s): Yunlong Gao ; Shuxin Zhong ; Kangli Hu ; Jinyan Pan
- Source: IET Computer Vision, Volume 14, Issue 8, p. 605–613
- DOI: 10.1049/iet-cvi.2019.0403
- Type: Article
The locality preserving projections (LPP) method is a classical manifold learning method for dimensionality reduction. However, LPP is sensitive to outliers, since the squared L2-norm may exaggerate their distances, and the normalisation constraint of LPP may impair its robustness during embedding. Motivated by these observations, the authors propose a novel robust LPP method using angle-based adaptive weights (RLPP-AAW). RLPP-AAW considers not only the distance metric of the training samples but also the reconstruction error, so as to reduce the influence of outliers and noise in the embedding process. The two terms are combined in the objective function in a novel way, based on the angle between the distance metric and the reconstruction error. In addition, RLPP-AAW employs the L2,1-norm criterion, which retains rotational invariance and is more robust than the squared L2-norm. An iterative algorithm is presented to solve the objective function of RLPP-AAW. Experimental results on benchmark databases illustrate the effectiveness of the proposed algorithm.
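For reference, the standard row-wise L2,1-norm of a matrix \(M \in \mathbb{R}^{n \times d}\) with rows \(m_i\) is

\[
\|M\|_{2,1} \;=\; \sum_{i=1}^{n} \|m_i\|_2 \;=\; \sum_{i=1}^{n} \sqrt{\sum_{j=1}^{d} m_{ij}^2},
\]

whereas a squared-L2 criterion sums \(\|m_i\|_2^2\), so each outlier's residual enters squared and can dominate the objective. This is the generic definition; the paper's exact objective function is not reproduced here.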
Converting video classification problem to image classification with global descriptors and pre-trained network
- Author(s): Saeedeh Zebhi ; SMT Al-Modarresi ; Vahid Abootalebi
- Source: IET Computer Vision, Volume 14, Issue 8, p. 614–624
- DOI: 10.1049/iet-cvi.2019.0625
- Type: Article
A motion history image (MHI) is a spatio-temporal template in which temporal motion information is collapsed into a single image whose intensity is a function of motion recency; it also retains spatial information. An energy image (EI), based on the magnitude of optical flow, is a temporal template that captures only the temporal information of motion. Each video can be described by these templates, and four new methods are introduced in this study. The first three are called basic methods. In method 1, each video is split into N groups of consecutive frames and an MHI is calculated for each group; transfer learning with fine-tuning is used to classify these templates. In method 2, EIs are used for classification in the same manner. Method 3 fuses the two streams of these templates, and method 4 adds spatial information. Among these, method 4 outperforms the others and is designated the proposed method. It achieves recognition accuracies of 92.30% and 94.50% on the UCF Sport and UCF-11 action data sets, respectively. The proposed method is also compared with state-of-the-art approaches, and the results show that it delivers the best performance.
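For context, the classical MHI recurrence of Bobick and Davis, which per-group MHI computations of this kind typically follow, is

\[
H_\tau(x, y, t) =
\begin{cases}
\tau & \text{if } D(x, y, t) = 1,\\[2pt]
\max\bigl(0,\; H_\tau(x, y, t-1) - \delta\bigr) & \text{otherwise,}
\end{cases}
\]

where \(D\) is a binary motion mask, \(\tau\) the temporal window, and \(\delta\) the decay step (often 1). Whether the paper uses exactly these parameters is not stated in the abstract.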
Referring expression comprehension model with matching detection and linguistic feedback
- Author(s): Jianming Wang ; Enjie Cui ; Kunliang Liu ; Yukuan Sun ; Jiayu Liang ; Chunmiao Yuan ; Xiaojie Duan ; Guanghao Jin ; Tae-Sun Chung
- Source: IET Computer Vision, Volume 14, Issue 8, p. 625–633
- DOI: 10.1049/iet-cvi.2019.0483
- Type: Article
The task of referring expression comprehension (REC) is to localise an image region containing a specific object described by a natural language expression, and existing REC methods all assume that the object described by the referring expression is located in the given image. However, this assumption does not hold in some real applications. For example, a visually impaired user might tell their robot ‘please take the laptop on the table to me’ when, in fact, the laptop is no longer on the table. To address this problem, the authors propose a novel REC model that deals with situations where expression-image mismatching occurs and explains the mismatch through linguistic feedback. The model consists of four modules: the expression parsing module, the entity detection module, the relationship detection module, and the matching detection module. The authors build a data set called NP-RefCOCO+ from RefCOCO+, containing both positive and negative samples: the positive samples are the original expression-image pairs in RefCOCO+, while the negative samples are expression-image pairs from RefCOCO+ whose expressions have been replaced. They evaluate the model on NP-RefCOCO+, and the experimental results show the advantages of their method in dealing with the problem of expression-image mismatching.
Combination of temporal-channels correlation information and bilinear feature for action recognition
- Author(s): Jiahui Cai ; Jianguo Hu ; Shiren Li ; Jialing Lin ; Jun Wang
- Source: IET Computer Vision, Volume 14, Issue 8, p. 634–641
- DOI: 10.1049/iet-cvi.2020.0023
- Type: Article
In this study, the authors focus on improving the spatio-temporal representation ability of three-dimensional (3D) convolutional neural networks (CNNs) in the video domain. They observe two unfavourable issues: (i) the convolutional filters are dedicated only to learning local representations along the input channels, and they treat channel-wise features equally, without emphasising the important ones; (ii) the traditional global average pooling layer captures only first-order statistics, ignoring the finer detail features useful for classification. To mitigate these problems, the authors propose two modules to boost the performance of 3D CNNs: a temporal-channel correlation (TCC) module and a bilinear pooling module. The TCC module captures inter-channel correlations over the temporal domain and generates channel-wise dependencies that adaptively re-weight the channel-wise features, so the network can focus on learning important features. The bilinear pooling module captures more complex second-order statistics in the deep features and generates a second-order classification vector; more accurate classification results are obtained by combining the first-order and second-order classification vectors. Extensive experiments show that adding the proposed modules to the I3D network consistently improves performance and outperforms state-of-the-art methods. The code and models are available at https://github.com/caijh33/I3D_TCC_Bilinear.
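A hedged PyTorch sketch of second-order (bilinear) pooling over 3D-CNN features, in which the channel-wise outer product replaces plain global average pooling. The signed square-root and L2 normalisation steps are common practice, assumed here rather than taken from the paper.

```python
# Sketch of bilinear (second-order) pooling over a 3D-CNN feature map.
# Normalisation steps are conventional assumptions, not the paper's spec.
import torch
import torch.nn.functional as F

def bilinear_pool(feat):
    # feat: (B, C, T, H, W) feature map from a 3D CNN
    b, c = feat.shape[:2]
    f = feat.reshape(b, c, -1)                           # flatten spatio-temporal dims
    second_order = f @ f.transpose(1, 2) / f.shape[-1]   # (B, C, C) channel outer product
    v = second_order.reshape(b, -1)
    v = torch.sign(v) * torch.sqrt(torch.abs(v) + 1e-8)  # signed square-root
    return F.normalize(v, dim=-1)                        # L2 normalisation
```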
Domain-invariant adversarial learning with conditional distribution alignment for unsupervised domain adaptation
- Author(s): Xingmei Wang ; Boxuan Sun ; Hongbin Dong
- Source: IET Computer Vision, Volume 14, Issue 8, p. 642–649
- DOI: 10.1049/iet-cvi.2019.0514
- Type: Article
Unsupervised domain adaptation aims to reduce the divergence between the source domain and the target domain, with the final objective of learning domain-invariant features from both domains that minimise the expected error on the target domain. The divergence between domains, also called domain shift, lies mainly between the distributions of the domains' samples; label shift is a further tricky challenge in domain adaptation. In this study, domain-invariant adversarial learning with conditional distribution alignment is proposed to alleviate the effect of domain shift combined with label shift. To obtain domain-invariant features, the proposed method modifies the adversarial auto-encoder architecture and performs semi-supervised learning to enlarge the inter-class discrepancy. The marginal distribution is aligned during the adversarial learning process of extracting domain-invariant features, and label information is incorporated in this process to align the conditional distribution. The work also provides a theoretical analysis of the generalisation bound of the proposed model. Finally, the proposed method is evaluated on several domain adaptation tasks, including digit classification and object recognition, and achieves state-of-the-art performance.
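A generic PyTorch sketch of adversarial marginal-distribution alignment: a domain discriminator learns to tell source features from target features, while the encoder learns to fool it. The layer sizes and input dimension (784, as for flattened MNIST digits) are assumptions, and the authors' adversarial auto-encoder modification and conditional (label-aware) alignment are not reproduced.

```python
# Generic adversarial marginal alignment, not the paper's exact architecture.
import torch
import torch.nn as nn

encoder = nn.Sequential(nn.Linear(784, 256), nn.ReLU(), nn.Linear(256, 64))
domain_disc = nn.Sequential(nn.Linear(64, 64), nn.ReLU(), nn.Linear(64, 1))
bce = nn.BCEWithLogitsLoss()

def adversarial_losses(x_src, x_tgt):
    f_src, f_tgt = encoder(x_src), encoder(x_tgt)
    # Discriminator step: label source features 1, target features 0
    d_loss = bce(domain_disc(f_src.detach()), torch.ones(len(x_src), 1)) + \
             bce(domain_disc(f_tgt.detach()), torch.zeros(len(x_tgt), 1))
    # Encoder step: make target features indistinguishable from source
    g_loss = bce(domain_disc(f_tgt), torch.ones(len(x_tgt), 1))
    return d_loss, g_loss
```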
Creative and diverse artwork generation using adversarial networks
- Author(s): Haibo Chen ; Lei Zhao ; Lihong Qiu ; Zhizhong Wang ; Huiming Zhang ; Wei Xing ; Dongming Lu
- Source: IET Computer Vision, Volume 14, Issue 8, p. 650–657
- DOI: 10.1049/iet-cvi.2020.0014
- Type: Article
Existing style transfer methods have achieved great success in artwork generation by transferring artistic styles onto everyday photographs while keeping their contents unchanged. Despite this success, these methods have one inherent limitation: they cannot produce newly created image content, and thus lack creativity and flexibility. Generative adversarial networks (GANs), on the other hand, can synthesise images with new content but cannot specify the artistic style of those images. The authors consider combining style transfer with convolutional GANs to generate more creative and diverse artworks. Instead of simply concatenating two networks, the first synthesising new content and the second transferring artistic styles, which is inefficient and inconvenient, they design an end-to-end network called ArtistGAN that performs both operations at the same time and achieves visually better results. Moreover, to generate images of higher quality, they propose a bi-discriminator GAN containing a pixel discriminator and a feature discriminator, which constrain the generated image at the pixel level and the feature level, respectively. They conduct extensive experiments and comparisons to evaluate their methods quantitatively and qualitatively, and the experimental results verify the effectiveness of their methods.
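A minimal sketch of how a generator objective might combine pixel-level and feature-level discriminators in a bi-discriminator setup. The discriminators, feature extractor, and weighting are hypothetical placeholders, not the ArtistGAN specification.

```python
# Hypothetical bi-discriminator generator objective: one discriminator
# judges raw pixels, the other judges deep features of the generated image.
import torch
import torch.nn as nn

bce = nn.BCEWithLogitsLoss()

def generator_loss(d_pixel, d_feature, feat_net, fake_img, w_feat=1.0):
    p_logits = d_pixel(fake_img)              # pixel-level judgement
    f_logits = d_feature(feat_net(fake_img))  # feature-level judgement
    pixel_term = bce(p_logits, torch.ones_like(p_logits))
    feat_term = bce(f_logits, torch.ones_like(f_logits))
    return pixel_term + w_feat * feat_term    # generator wants both fooled
```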
Diversified Fisher kernel: encoding discrimination in Fisher features to compete deep neural models for visual classification task
- Author(s): Sarah Ahmed and Tayyaba Azim
- Source: IET Computer Vision, Volume 14, Issue 8, p. 658–664
- DOI: 10.1049/iet-cvi.2019.0208
- Type: Article
Fisher kernels derived from stochastic probabilistic models such as restricted and deep Boltzmann machines have shown competitive visual classification results compared to widely popular deep discriminative models. This genre of Fisher kernels bridges the gap between the shallow and deep learning paradigms by inducing the characteristics of deep architectures into the Fisher kernel, which is then deployed for classification in discriminative classifiers. Despite their success, the memory and computational costs of Fisher vectors make them unsuitable for large-scale visual retrieval and classification tasks. This study introduces a novel feature selection technique, inspired by the functional characteristics of neural architectures, for learning discriminative feature representations that boost the performance of Fisher kernels against deep discriminative models. The proposed technique condenses the high-dimensional Fisher features for kernel learning and shows improvements in classification performance and storage cost on leading benchmark data sets. A comparison with other state-of-the-art feature selection techniques demonstrates the proposed method's superior performance and reports the time complexity required to learn in the reduced Fisher space.
Moving shadow detection via binocular vision and colour clustering
- Author(s): Lei Lu ; Ming Xu ; Jeremy S. Smith ; Yuyao Yan
- Source: IET Computer Vision, Volume 14, Issue 8, p. 665–673
- DOI: 10.1049/iet-cvi.2019.0175
- Type: Article
A pedestrian segmentation algorithm for scenes with cast shadows is presented in this study. The novelty of the algorithm lies in the fusion of multi-view and multi-plane homographic projections of foregrounds, and in the use of the fused data to guide colour clustering. This gives it an advantage over existing binocular algorithms in that it can remove cast shadows while keeping pedestrians' body parts that occlude shadows. Phantom detection, which is inherent to the binocular method, is also investigated. Experimental results on real-world videos demonstrate the efficiency of the algorithm.
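A short OpenCV sketch of the underlying projection step: each view's foreground mask is warped onto a common plane by a homography and the warped masks are intersected. The pre-calibrated homographies and the simple intersection rule are assumptions; the paper's multi-plane fusion and colour clustering are not reproduced here.

```python
# Sketch of multi-view foreground fusion on one reference plane.
# Homographies are assumed pre-calibrated; intersection is a simplification.
import cv2
import numpy as np

def fuse_foregrounds(masks, homographies, out_size):
    # masks: list of binary (uint8) foreground masks, one per camera view
    # homographies: list of 3x3 matrices mapping each view onto the plane
    # out_size: (width, height) of the reference-plane image
    fused = np.ones(out_size[::-1], dtype=np.uint8)
    for mask, H in zip(masks, homographies):
        warped = cv2.warpPerspective(mask, H, out_size)
        fused &= (warped > 0).astype(np.uint8)  # keep pixels foreground in all views
    return fused
```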
Human-like evaluation method for object motion detection algorithms
- Author(s): Abimael Guzman-Pando ; Mario Ignacio Chacon-Murguia ; Lucia B. Chacon-Diaz
- Source: IET Computer Vision, Volume 14, Issue 8, p. 674–682
- DOI: 10.1049/iet-cvi.2019.0997
- Type: Article
This study proposes a new method to evaluate the performance of moving object detection algorithms (MODA) in video sequences. The proposed method is based on intervals of human performance metrics, instead of the ideal metric values (0 or 1) commonly used in the literature. These intervals are intended to establish a more reliable evaluation and comparison, and to identify areas of improvement in the evaluation of MODA. The contributions of the study include the determination of human segmentation performance metric intervals, their comparison with state-of-the-art MODA, and the evaluation of the algorithms' segmentation results in a tracking task to establish the relation between performance and practical utility. Results show that human participants had difficulty achieving a perfect segmentation score. Deep learning algorithms achieved performance above the human average, while other techniques achieved performance between 88 and 92%. Furthermore, the authors demonstrate that algorithms not ranked at the top of the quantitative metrics worked satisfactorily in a tracking experiment, and therefore should not be discarded for real applications.
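A toy Python sketch of interval-based rating: an algorithm's score is compared against an empirical human-performance interval rather than the ideal value 1.0. The interval endpoints below are made up for illustration, not the study's measured values.

```python
# Toy interval-based evaluation. The interval is hypothetical, not the
# human performance interval measured in the study.
HUMAN_F1_INTERVAL = (0.88, 0.95)  # hypothetical human segmentation F1 range

def rate_algorithm(f1, interval=HUMAN_F1_INTERVAL):
    lo, hi = interval
    if f1 >= hi:
        return 'above human performance'
    if f1 >= lo:
        return 'within human performance'
    return 'below human performance'

print(rate_algorithm(0.91))  # -> 'within human performance'
```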