Computer vision is an interdisciplinary scientific field that deals with how computers obtain, store, interpret and understand digital images or videos using artificial intelligence based on neural networks, machine learning and deep learning methodologies. These techniques are used in countless applications such as image retrieval and classification, driving and transport monitoring, medical diagnostics and aerial monitoring. Written by a team of international experts, this edited book covers the state-of-the-art of advanced research in the fields of computer vision and recognition systems, from fundamental concepts to methodologies, technologies and real-world applications including object detection, biometrics, DeepFake detection, sentiment and emotion analysis, traffic enforcement camera monitoring, vehicle control and aerial remote sensing imagery. The book will be useful for industry and academic researchers, scientists and engineers in the fields of computer vision, machine vision, image processing and recognition, multimedia, AI, machine and deep learning, data science, biometrics, security, and signal processing. It will also make a great course reference for advanced students and lecturers in these fields of research.
Inspec keywords: image classification; learning (artificial intelligence); computer vision; convolutional neural nets; feature extraction
Other keywords: learning (artificial intelligence); convolutional neural nets; image classification; mobile computing; computer vision; support vector machines; driver information systems; object detection; feature extraction; deep learning (artificial intelligence)
Subjects: Computer vision and image processing techniques; General electrical engineering topics; General and management topics; Traffic engineering computing; Education and training; Neural nets; Optical, image and video signal processing
Computer vision, pattern recognition, deep learning (DL), expert systems, cognitive computing, and the Internet of things are some of the innovations and terminologies that have sprung up as artificial intelligence (AI) has grown in popularity. Among these, computer vision is one of the innovations that allow computers to perceive and comprehend the visual world: computers recognize and classify objects using digital images and DL representations. Computer vision technologies have exploded in popularity in the fields of automation and logistics, and automation appears to be one of the most promising areas for recently developed AI solutions, primarily computer and machine vision frameworks. Among the most important problems in automation is the safety of human-computer and human-machine interactions, which necessitates the "explainability" of techniques and thus precludes the use of many DL-based solutions, regardless of their success in computer vision applications. Robotic platforms have been created to automate some aspects of the manual labor involved, but many current systems rely on traditional analytic methods. Usually, automation is not end-to-end, necessitating user involvement to transfer vials, create analytical methods for each compound, and interpret raw data. This chapter addresses the issues involved in computer vision and recognition-based safe automated systems.
Machine learning (ML) algorithms and applications have been deployed, together with the Internet of things (IoT), to support growth across various technologies aimed at fully smart cities. Graphics processing unit (GPU)-based systems, or GPU-central processing unit (CPU)-based systems, have been used to implement the computations of various ML models, including deep neural networks (DNNs), convolutional neural networks (CNNs), and recurrent neural networks (RNNs), which exploit parallel multiply-accumulate (MAC) operations. GPU-based systems offer the flexibility to implement different ML models and support both their training and inference phases, but as the number of neural network layers increases, their energy efficiency degrades because of growing memory accesses. As highly accurate image processing, pattern recognition, and speech recognition applications are deployed and their complexity grows, methods have had to be devised to tackle this problem. Hence, software (SW), hardware (HW), and SW-HW co-design approaches have been proposed to address the challenges of memory capacity, delay, energy consumption, and bandwidth requirements. One such approach concerns the deep learning accelerator's (DLA) communication infrastructure, which connects the processing elements (PEs). The communication infrastructure distributes the traffic of trained models among the PEs, and it can take various structures and designs, such as application-specific integrated circuit (ASIC), network-on-chip (NoC), and field-programmable gate array (FPGA) implementations. For example, ASIC-based designs have less flexibility and reconfigurability than NoC- and FPGA-based communication structures and can only support a specific purpose, such as image processing. In this chapter, we focus on hardware approaches to improving a GPU-based system's energy efficiency and performance in the inference phase, described as a deep learning accelerator comprising memory, communication infrastructure, and PEs.
We first explain the role of different communication networks in improving or deteriorating the transfer of trained DNN model data between memory, the network, and the processing elements. Next, we describe various approaches, including dataflow mapping, dataflow stationarity, traffic patterns, and partitioning methods, and investigate their impact on a DLA-based system's efficiency.
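The MAC operations that such accelerators parallelize can be illustrated with a minimal, purely illustrative sketch (the function names are ours, not from any DLA toolchain): each PE multiplies weight/activation pairs and accumulates the results, and a convolution decomposes into many such MACs that the communication infrastructure distributes among the PEs.

```python
def pe_mac(weights, activations, acc=0.0):
    """One processing element: multiply-accumulate over a stream of
    weight/activation pairs -- the core operation a DLA's PE array parallelizes."""
    for w, a in zip(weights, activations):
        acc += w * a
    return acc

def conv1d_as_macs(x, kernel):
    """A 1-D convolution expressed as a sequence of MACs; each output value
    is one PE's accumulated result."""
    k = len(kernel)
    return [pe_mac(kernel, x[i:i + k]) for i in range(len(x) - k + 1)]

out = conv1d_as_macs([1.0, 2.0, 3.0, 4.0], [1.0, 0.0, -1.0])  # → [-2.0, -2.0]
```

In a real accelerator the choice of which PE computes which MAC, and which operands stay resident (the "stationary" dataflow), determines memory traffic and energy.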
In the past decades of the digital era, the amount of electronic data such as text, audio, and images has increased tremendously. A study reveals that the number of cameras in the world now exceeds the number of human eyes, and reports suggest that approximately 7.4 trillion images will have been generated by the end of 2020. An image retrieval system is used to search a large image database for images similar to a query image, and it assists in processing, organizing, and handling image data efficiently. Companies such as Google and Pinterest use image retrieval systems to provide users with related images. In the 1970s, text-based image retrieval techniques were implemented in which images were first annotated manually and then searched using a text query. This method was extremely time-consuming, labor-intensive, and prone to annotation errors due to the subjectivity of human perception. In the early 1980s, content-based image retrieval (CBIR) systems were introduced which used visual features extracted from an image. This method used traditional image processing tools to capture the color, texture, and shape-based features of the image, but it suffered from inappropriate and limited feature representations, resulting in less efficient and poorly generalized algorithms. Advances in deep learning (DL) related to image processing have enabled better processing of image data. Deep learning was inspired by the working of the biological neural network; it is widely used for classification tasks and achieves state-of-the-art performance. Today, we use convolutional neural networks (CNNs) to perform image retrieval tasks. The CNN is a variant of DL widely used in computer vision applications because of its inherent ability to extract features from an image, and it has shown prominent results compared to traditional image processing techniques.
In a CNN, convolutional layers are used for feature representation and extraction. After the features are extracted, distance metrics are used to measure the similarity between the query image and the images in the database. Generally, similarity metrics such as Euclidean distance, cosine distance, and Manhattan distance facilitate finding similar images in the database. Further, the authors discuss image retrieval using convolutional auto-encoders and improving an image retrieval system's usability using generative adversarial networks (GANs). In a convolutional autoencoder, the encoder portion is used to compute the latent space representation, which summarizes the content of the image in a feature vector. The query image's feature vector is compared with the feature vectors of all the images in the database to find similar images. GANs can produce an image from text or a simple sketch; alternatively, GANs can be used to create an image with new features. The image generated by a GAN can be used for retrieval, thereby adding another way to query an image retrieval system. Convolutional neural networks, convolutional auto-encoders, and GANs are the three prominent methodologies for image retrieval discussed by the authors.
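As a minimal NumPy sketch of how such similarity search works (the function names are illustrative, not from the chapter): each metric compares the query's feature vector against every database feature vector, and the smallest distances indicate the most similar images.

```python
import numpy as np

def euclidean(a, b):
    return float(np.linalg.norm(a - b))

def manhattan(a, b):
    return float(np.abs(a - b).sum())

def cosine_distance(a, b):
    # 1 - cosine similarity, so that smaller still means more similar
    return 1.0 - float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

def rank_database(query_vec, db_vecs, metric=euclidean):
    """Return database indices sorted from most to least similar."""
    dists = [metric(query_vec, v) for v in db_vecs]
    return sorted(range(len(db_vecs)), key=lambda i: dists[i])

q = np.array([1.0, 0.0])
db = [np.array([0.9, 0.1]), np.array([0.0, 1.0]), np.array([1.0, 0.0])]
ranking = rank_database(q, db)  # → [2, 0, 1], most similar first
```

In a CBIR pipeline `query_vec` and `db_vecs` would be the CNN (or encoder) feature vectors rather than raw pixels.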
Handwritten documents have been a valuable resource in human transactions for many years, and today there is an immediate need for computer-based techniques to intelligently read and analyze such documents. Handwritten numerals are of particular importance due to their role in finance, business, post, etc. Although there is much research on English handwritten number recognition, little attention has been paid to developing reliable recognition systems for non-English scripts. In this chapter, an overview of the state-of-the-art in handwritten number recognition (with a focus on non-English languages) is presented. Dictionary learning, a supervised learning technique that has recently shown great success in image classification problems, is introduced. We describe how one can design discriminative dictionaries for the classification of handwritten numbers; the obtained dictionaries convey exclusive features of the associated numerals. To improve the classification performance of handwritten numbers using dictionary learning, two novel approaches are presented. First, an incoherence penalty is combined with the learning process to fine-tune the structure of the dictionaries learned for each class. Second, class label information is embedded into the learning process to produce class-specific weights that improve the discriminative power of the learned dictionaries. We further adopt a new feature space, the histogram of oriented gradients (HOG), to generate the dictionary atoms. HOG is a strong descriptor for most handwritten images, especially those studied in this chapter. Four different handwritings, namely Chinese, Persian, Arabic, and English, are used to evaluate the performance of the proposed methods. We also present a convolutional neural network model to compare the performance of deep learning with that of dictionary learning for handwritten digit recognition.
The obtained results and their comparisons with benchmark methods confirm the effectiveness and robustness of the proposed approaches for recognition of handwritten numbers.
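To give a feel for the HOG features used as dictionary atoms, here is a deliberately simplified NumPy sketch of one cell's orientation histogram (it omits block normalization, soft binning, and overlapping cells, all of which full HOG implementations include):

```python
import numpy as np

def hog_cell_histogram(cell, n_bins=9):
    """Unsigned-orientation gradient histogram for one image cell:
    each pixel votes for its gradient orientation bin, weighted by magnitude."""
    gy, gx = np.gradient(cell.astype(float))          # row and column gradients
    mag = np.hypot(gx, gy)
    ang = np.rad2deg(np.arctan2(gy, gx)) % 180.0      # unsigned orientation, [0, 180)
    bin_width = 180.0 / n_bins
    idx = np.minimum((ang / bin_width).astype(int), n_bins - 1)
    hist = np.zeros(n_bins)
    for b in range(n_bins):
        hist[b] = mag[idx == b].sum()
    return hist

# A vertical edge produces horizontal gradients, so energy lands in bin 0
cell = np.zeros((8, 8))
cell[:, 4:] = 1.0
h = hog_cell_histogram(cell)
```

A full descriptor concatenates such histograms over all cells, which is the kind of feature vector a discriminative dictionary would then sparsely encode.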
Handwriting recognition, the well-known problem of converting handwriting to text efficiently, can be solved using a deep learning approach, though a large data set is required to build an excellent model. This research uses a public handwriting data set to train the neural network and validates model performance with our augmented data set to show that the model handles unseen handwriting styles well. It investigates several state-of-the-art handwriting models, namely the Flôr, Bluche, and PuigCerver models, with the modified data set. The data set, originally provided by the Research Group on Computer Vision and Artificial Intelligence INF, University of Bern, contains 112,746 words. The experimental methodology explores extensions to improve accuracy, considering modifications such as adding more depth in the encoder and decoder, adding a skipped connection, changing the activation function, and adding state-of-the-art neural network architecture components, namely the Squeeze-and-Excitation block and the Bottleneck block from the MobileNetV2 model. We also report the model size, accuracy, and inference time for each configuration. The results show that our refinement using the skipped connection performs best on the unseen test data set among all designs.
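The Squeeze-and-Excitation block mentioned above can be sketched as a NumPy forward pass (the channel count, reduction ratio, and random weights are illustrative assumptions; in practice the block sits inside a trained network):

```python
import numpy as np

def se_block(feature_map, w1, w2):
    """Squeeze-and-Excitation forward pass on a (C, H, W) feature map.
    w1: (C//r, C) and w2: (C, C//r) are the two FC layers of the excitation step."""
    squeeze = feature_map.mean(axis=(1, 2))           # global average pool -> (C,)
    hidden = np.maximum(w1 @ squeeze, 0.0)            # FC + ReLU (squeeze to C//r)
    gates = 1.0 / (1.0 + np.exp(-(w2 @ hidden)))      # FC + sigmoid -> per-channel gates
    return feature_map * gates[:, None, None]         # re-weight each channel

rng = np.random.default_rng(0)
fmap = rng.standard_normal((4, 8, 8))                 # 4 channels, reduction ratio 2
w1 = rng.standard_normal((2, 4))
w2 = rng.standard_normal((4, 2))
out = se_block(fmap, w1, w2)
```

Because the gates lie in (0, 1), the block can only attenuate channels, letting the network learn which feature maps matter for a given input.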
Real-time object detection involves detecting one or more objects with high precision and very low latency. It is a longstanding problem in the field of computer vision, and various methods and algorithms have tried to make the task faster and more efficient. Early real-time object detection relied on traditional image processing algorithms and techniques, which were fast but had very low accuracy. With the advent of deep learning and the availability of graphics processing unit (GPU) compute, deep learning methods became very popular for real-time detection tasks. Real-time object detection needs to process a stream of images or videos. Traditionally, high compute was not available on Internet of things (IoT) devices, so client-server architectures were popular: webcam or sensor data was sent to remote servers for processing. This caused heavy latency, making real-time detection very difficult. Edge IoT devices solved the latency issue by creating a compute environment close to the sensors, making real-time object detection with low latency possible.
Synthetic audio and video content is growing enormously on the Internet today, creating many problems in recent years. The use of techniques such as artificial intelligence over multimedia to create fake content is known as DeepFake. It challenges the reality and genuineness of high-quality audio and video content: the synthetic media created is almost identical to the real media, and with the naked eye it is almost impossible to differentiate them. Though DeepFake has been a boon in many multimedia fields such as film making, advertising, and the animation industry for creating compelling media, it has also become a threat to society. Software available across the Internet can be used even by a novice to create fake multimedia content that looks very realistic and can be used for various criminal activities. Applications such as Snapchat, Instagram, Facebook, Twitter, and TikTok use multimedia as their content; such content can be easily falsified using DeepFake technology and can cause severe personal and psychological harm. Hence, there is a need for tools to detect forged content in the media and authenticate the data's genuineness and integrity. This chapter highlights the various challenges the human race faces because of DeepFake technology and the limitations of the different forensic tools available, analyzes and forecasts the challenging future, and discusses the solutions incorporated to secure the current data situation.
In recent years, eye-gaze detection and monitoring has been an active area of research, as it provides convenience to a range of applications and is considered an effective nontraditional form of human-computer interaction. Detection of head movement has also gained the attention and interest of researchers, as it has been found to be an easy and efficient form of interaction. Both technologies are considered among the simplest alternative-device approaches for the significant number of chronically handicapped people with limited motor skills. Several different methods incorporating various algorithms have been proposed and used for both eye tracking and head movement detection, and given the amount of research performed on both technologies, researchers continue to seek robust approaches that can be used efficiently in different applications. This chapter provides a study of eye tracking and head motion identification approaches. Examples of various application areas of both innovations are also discussed, such as human-computer interaction, driver assistance systems, and assistive technology.
Analyzing sentiments using computational techniques is one of the prominent areas of research these days. Both research domains, sentiment analysis and deep learning, are promising AI technologies that have been widely used to solve complex real-life problems. In this modern era of E-commerce, decisions depend immensely on the opinions, reviews, and sentiments shared and posted on the world wide web. Traditionally, text classification was performed manually, and engineering tasks applied to the text were accomplished using handcrafted features. This was achieved by labeling the text through predefined knowledge-based techniques, making use of dictionaries, or using ontologies that join hierarchical components seen in texts via graph data structures. Humans did all these tasks with little or no automation. In this fast-developing phase, the need is to develop automatic procedures for classifying text with least or no human intervention, using artificial intelligence techniques and algorithms for such feature engineering tasks. The chapter covers the concepts of deep learning and its algorithms, mainly convolutional neural networks, for classifying sentiments. Deep learning is an important subset of machine learning used to work with images, text, sound, etc.; its most attractive feature is relevant feature extraction and transformation. The elementary building block of deep learning is the neural network, which is extended to form a deep neural network. The term 'deep' refers to the total number of layers present in the network: the more layers the network has, the deeper it is. Compared with a traditional neural network, a deep neural network can have hundreds of layers to obtain more accurate results. It is a combination of various processing layers inspired by biological nervous systems.
There are basically one input layer, multiple hidden layers, and one output layer, all connected through neurons, where the output of one hidden layer becomes the input of the next layer. The multilayer nature of deep learning gives fruitful results, especially in the fields of speech and image recognition. Various neural network architectures exist for deep learning: multilayer perceptrons, the oldest and simplest ones, and convolutional neural networks (CNNs), especially for image processing and text classification. This chapter proposes a novel method of classifying opinions by automatically training the classifier. The details of the layers and other parameters are discussed in the chapter, highlighting how learning is achieved through word representations. The accuracy achieved, measured using several information retrieval metrics, is illustrated using visualization tools.
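The layer structure just described (one input layer, hidden layers whose outputs feed the next layer, and one output layer) can be sketched as a NumPy forward pass; the dimensions and random weights here are illustrative only:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def softmax(z):
    e = np.exp(z - z.max())       # shift for numerical stability
    return e / e.sum()

def forward(x, layers):
    """Each hidden layer's output becomes the next layer's input;
    the final layer produces class probabilities."""
    for w, b in layers[:-1]:
        x = relu(w @ x + b)
    w, b = layers[-1]
    return softmax(w @ x + b)

rng = np.random.default_rng(1)
dims = [8, 16, 16, 3]             # input layer, two hidden layers, 3-class output
layers = [(rng.standard_normal((dims[i + 1], dims[i])) * 0.1, np.zeros(dims[i + 1]))
          for i in range(len(dims) - 1)]
probs = forward(rng.standard_normal(8), layers)
```

For sentiment classification, the input vector would be a word representation (e.g. an averaged embedding) and the output classes the sentiment labels.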
In several pattern recognition and classification problems, prefeature extraction techniques combined with deep learning methods have shown outstanding precision. This chapter presents a study of prefeature extraction with different convolutional neural network (CNN) models and demonstrates its benefits on the face emotion classification problem. The prefeature extraction techniques considered are Gaussian filtering with Canny edge detection, most significant bit (MSB) plane slicing, and Gabor filtering with element-wise maximum feature extraction. We evaluate the efficiency of these prefeature extraction techniques with respect to three CNN architectures, namely LeNet, AlexNet, and VGG16. In our experiments, all CNN techniques were implemented on CPU-based systems, and the Gabor filter with element-wise maximum feature extraction achieved state-of-the-art accuracy and lower execution time for the face emotion classification of vehicle drivers.
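Of the prefeature extraction techniques listed, MSB plane slicing is the simplest to illustrate; a minimal NumPy sketch (the function names are ours) keeps only the most significant bit of each 8-bit pixel:

```python
import numpy as np

def bit_plane(img, plane):
    """Extract one bit plane from an 8-bit grayscale image (plane 7 = MSB)."""
    return (img >> plane) & 1

def msb_image(img):
    """Keep only the most significant bit, rescaled to 0/255 for display."""
    return bit_plane(img, 7) * np.uint8(255)

img = np.array([[0, 64], [128, 255]], dtype=np.uint8)
m = msb_image(img)  # → [[0, 0], [255, 255]]
```

The MSB plane carries most of a pixel's intensity information, so this slicing acts as a crude but cheap contrast-preserving reduction before the CNN.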
With advances in technology, computer vision aims to imitate the capabilities of human vision, and many computer vision applications are finding a place in our lives. The deep convolutional neural network (DCNN) is a driving force behind these applications. It first extracts low-level features such as lines and edges, then progressively extracts higher-level, more complex features from the data to give the desired output. The more layers in the network, the higher its accuracy tends to be; a large network undoubtedly gives better results, but it consumes too much power and becomes heavy. Nowadays, smartphones have become an integral part of day-to-day life; if computer vision applications can run on smartphones, users can use them anywhere and anytime. The increasing adoption of edge devices has motivated researchers to focus on networks that work with these resource-restricted devices. A large family of DCNNs works with floating-point precision (float32), which makes the networks heavy and increases computation and inference time. Such heavyweight networks are not suitable for direct deployment on resource-constrained devices, which have limited computational speed, power, and storage. On mobiles, if cloud services are used for processing and analyzing visual information, not only does system inference time increase, but good Internet connectivity and bandwidth are also required. Apart from this, there are certain domains where the confidentiality of data is of utmost importance: in the healthcare domain, for example, the privacy of data must be preserved. This leads to the need for on-device machine learning models. Mobile and embedded devices require lightweight models, and the MobileNet architectures are efficient models for such applications.
The desirable properties of such a network are small model size, low latency, low power consumption, and sufficiently high accuracy. To deploy a large DCNN on mobile devices, one needs to optimize it, and quantization techniques are used to compress the models. VGG16, InceptionV3, MobileNetV1, MobileNetV2, and NASNetMobile models are first fine-tuned; then quantization-aware training is used to optimize the fine-tuned networks. These optimized models can be used for any computer vision application. In this study, a diabetic retinopathy dataset is used for experimental purposes, and model complexity is analyzed by counting the number of learnable parameters. Experimental results show that the optimized fine-tuned model size is reduced, since the number of trainable parameters is smaller than in the counterpart fine-tuned models, with only a marginal difference in accuracy between the fine-tuned and quantized models. InceptionV3 performs better than VGG16, and the NASNetMobile architecture outperforms MobileNetV1 and MobileNetV2. This chapter presents the MobileNet architecture, built especially for use on mobile devices.
To ensure the enforcement of traffic rules, the most essential yet difficult task is to identify traffic rule violators. One such task is detecting vehicles driving the wrong way. In this chapter, traffic enforcement camera monitoring techniques are demonstrated that spot the wrong-way movement of automobiles using a deep convolutional neural network. The idea is to recognize such vehicles when they enter a region covered by a closed-circuit television camera and alert the driver. Different case studies are presented using various vehicle detection and monitoring techniques to perform operations such as counting vehicles, detecting vehicles, and detecting wrong-direction travel.
In this modern and connected world, the travel industry is one of the fastest developing industries and is a central factor in the economies of countries such as the UAE and Singapore. 'Tourism' refers to the movement of individuals from their original place of living to somewhere else with the intention of returning after a brief period. According to statistics, around 1.4 billion individuals travelled in the year 2018, generating roughly USD 250 billion worldwide. However, many travelers experience issues when visiting foreign nations, such as not understanding the language, difficulty understanding routes, and lack of fundamental local knowledge, and consequently suffer immense losses of money and time. Therefore, we have compared and analyzed various features that are essential for helping a traveler, alongside proposing a proficient method to incorporate all these fundamental features and other advanced options to support a vacationer, based on Kaldi, called the Kaldi-based speech interaction system (KBSIS), and on OpenCV for performing image analysis. Alongside this, we have used Android Packages (APKs) from leading developers in numerous fields so that we do not rely on onboard high processing power. For the present, we have included features such as navigation, interpreting text from images, plant and animal identification, face recognition, and other essentials such as note-taking, time, weather forecasts, and playing music. All these functions are included on a glass frame which can project the necessary information and images onto the glasses. All controls depend on speech recognition, and the output is either visual or audible.
In this chapter, a grey wolf optimizer-based support vector machine method is proposed to detect renal calculi. The proposed method includes a preprocessing step consisting of two main sub-processes, filtering and histogram equalization, which enhance image quality by removing speckle noise and normalizing the images. The extracted features are fed to a support vector machine for the classification of renal calculi. The proposed technique performs better when evaluated against existing techniques, attaining an accuracy of 96% during classification. This method is expected to aid medical image diagnosis systems with better speed and reliability.
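The histogram equalization step mentioned above can be sketched in NumPy via the standard CDF mapping (a minimal sketch assuming an 8-bit, non-constant image; the chapter's actual preprocessing pipeline may differ):

```python
import numpy as np

def hist_equalize(img):
    """Histogram equalization of an 8-bit grayscale image: remap intensities
    through the normalized cumulative histogram to spread them over [0, 255]."""
    hist = np.bincount(img.ravel(), minlength=256)
    cdf = hist.cumsum()
    cdf_min = cdf[cdf > 0][0]                 # first non-zero bin of the CDF
    lut = np.round((cdf - cdf_min) / (cdf[-1] - cdf_min) * 255).astype(np.uint8)
    return lut[img]                           # apply the lookup table per pixel

img = np.array([[50, 50], [51, 52]], dtype=np.uint8)
eq = hist_equalize(img)                       # intensities stretched to 0..255
```

Stretching the narrow intensity range this way is what "normalizing the images" accomplishes before feature extraction.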
Computer vision and image processing are excelling in the fields of segmentation, feature extraction, and object detection from image data. In this decade, machine learning, and especially deep learning, has brought about significant breakthroughs in vision systems, notably in the object detection and recognition area. A major challenging problem in object detection is locating a specific object among multiple objects. There has been a sustained increase in research, in industry and academia, on machine learning in general and deep learning in particular for object detection and recognition using drones or unmanned aerial vehicles (UAVs) for crop and forest analysis, traffic monitoring, robotics, aerial surveillance, etc. Unlike stationary surveillance, the camera platform of a UAV is in constant motion, which makes object extraction difficult. Recent research involving transfer learning and re-use methods for multi-class image classification, tested on large-scale datasets, has built trust to explore further and optimize algorithms with respect to accuracy, speed, parameter reduction, etc. The objectives of this chapter are set to benefit readers interested in keeping abreast of recent research on object detection and classification from aerial images using deep learning methods and their efficiency. The challenges faced, the respective training issues and testing metrics, and the available databases and development platforms with useful applications are also discussed.
Citrus fruits are cultivated in large quantities throughout the world, and the market requires good-quality fruit. However, harvesting using manual methods can be a time-consuming, inefficient, and labor-intensive process, and labor costs are increasing day by day. People are working to find efficient methods of agriculture where the investment is less and the profit is more. Many research studies have been carried out on both the software and hardware parts of automatic mature fruit identification. In our project, we provide a software solution using a machine learning approach. Due to changes in sunlight exposure, weather conditions, the random positions of fruits, and many other conditions, the images change dramatically. In this project, we capture images of citrus trees; our objective of covering many natural variations was achieved by collecting images of citrus fruits under all possible conditions. Using image processing techniques and a multi-class support vector machine, the fruits are segmented, and a feed-forward neural network is used to locate the fruit in three dimensions. The result is the detection of fruits and clusters of fruits in the images.
Cashew quality is one of the significant parameters defining the price of the product. The machine vision system is a non-destructive, highly efficient alternative to the prevailing manual and mechanical methods of evaluating cashew quality. Leveraging image processing and machine learning techniques for cataloging cashew kernel quality reduces production expenditure and increases classification accuracy, and combining the detection of cashew defects with the grading of cashew kernels is beneficial for building a robust machine vision system [1,2]. An automated cashew kernel grading system using machine vision is proposed and presented, along with a study of the effects of various preprocessing techniques on the grading process. The proposed work focuses on a cashew defect detector and the segregation of high-quality cashew images. The various defects in the cashews are demarcated before the grading process, and the cashew kernels are classified into different grades (WW-180, WW-320, WW-450, splits, and SW-240). Image preprocessing and segmentation of the cashew kernels are performed using the image processing toolbox of MATLAB®. The Lucy filter and Wiener filter are applied to eliminate blurring effects, and cashew image segmentation is performed by combining color thresholding and Otsu's segmentation method. Significant features, namely color, size and shape, and texture features, are extracted from the segmented cashew kernel image; texture features are extracted using the gray-level co-occurrence matrix. Multi-class support vector machine (SVM) models and a random forest classifier are constructed from training samples and their labels using the machine learning toolbox.
Cataloging of cashew kernel quality is performed by the SVM classifier and the random forest classifier based on the trained models. The total set, comprising both defective and good-quality cashews, consists of 444 samples for training and 136 samples for testing. The defective and good-quality cashew kernels are demarcated using the proposed methodology with an SVM classifier accuracy of 89.4% and a random forest classifier accuracy of 94%. From the results, it can be concluded that WW-180, splits, and SW-240 are efficiently classified, and classifier accuracy can be improved by increasing the number of samples.
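The gray-level co-occurrence matrix behind the texture features can be sketched in NumPy (a minimal, unnormalized, single-offset version with common feature formulas; MATLAB's `graycomatrix`/`graycoprops` provide the full equivalents used in such toolbox pipelines):

```python
import numpy as np

def glcm(img, dx=1, dy=0, levels=256):
    """Count how often gray level i co-occurs with gray level j at offset (dx, dy)."""
    h, w = img.shape
    m = np.zeros((levels, levels), dtype=np.int64)
    for y in range(h - dy):
        for x in range(w - dx):
            m[img[y, x], img[y + dy, x + dx]] += 1
    return m

def glcm_features(m):
    """Contrast, energy, and homogeneity from a normalized co-occurrence matrix."""
    p = m / m.sum()
    i, j = np.indices(p.shape)
    return {
        "contrast": float(((i - j) ** 2 * p).sum()),
        "energy": float((p ** 2).sum()),
        "homogeneity": float((p / (1.0 + np.abs(i - j))).sum()),
    }

img = np.array([[0, 0, 1], [0, 1, 1], [1, 1, 2]], dtype=np.uint8)
feats = glcm_features(glcm(img, levels=3))
```

Such scalar texture features, concatenated with color and shape features, form the input vector for the SVM and random forest classifiers.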