Network Classification for Traffic Management: Anomaly detection, feature selection, clustering and classification
2: Department of Computer Information Systems, Albaha University, Saudi Arabia
3: Department of Computer Science, King Abdulaziz University, Saudi Arabia
With the massive increase of data and traffic on the Internet within the 5G, IoT, and smart-cities frameworks, current network classification and analysis techniques are falling short. Novel approaches using machine-learning algorithms, including supervised, semi-supervised, and unsupervised classification techniques, are needed to cope with and manage real-world network traffic. Accurate and effective classification of network traffic will lead to better quality of service and to more secure and manageable networks. This authored book investigates network traffic classification solutions by proposing transport-layer methods for better-run and better-operated enterprise-scale networks. The authors explore novel methods for enhancing network statistics at the transport layer, identify optimal feature selection through a global optimization approach, and provide automatic labelling of raw traffic through the SemTra framework while maintaining provable privacy properties on information disclosure.
Inspec keywords: computer network security; learning (artificial intelligence); data privacy; pattern clustering; feature selection; telecommunication traffic; computer network management; pattern classification
Other keywords: traffic data publishing; privacy preservation; unsupervised feature selection; clustering algorithms; hybrid clustering-classification; semisupervised network traffic labelling; feature selection; anomaly detection; transport layer statistics quality; enterprise-scale networks; network traffic classification; traffic management
Subjects: Knowledge engineering techniques; Network management; General and management topics; General electrical engineering topics; Computer communications; Computer installation management; Computing security management; Computer networks and techniques; Data handling techniques
- Book DOI: 10.1049/PBPC032E
- Chapter DOI: 10.1049/PBPC032E
- ISBN: 9781785619212
- e-ISBN: 9781785619229
- Page count: 288
- Format: PDF
Front Matter
1 Introduction
pp. 1–10 (10)
In recent years, knowing what information is passing through networks has become more and more complex due to the ever-growing list of applications shaping today's Internet traffic. Consequently, traffic monitoring and analysis have become crucial for tasks ranging from intrusion detection and traffic engineering to capacity planning. Network traffic classification is the process of analyzing the nature of the traffic flows on a network; it classifies these flows mainly on the basis of protocols (e.g., TCP, UDP, and IMAP) or of different classes of applications (e.g., HTTP, peer-to-peer (P2P), and games). Network traffic classification can address fundamental questions in numerous network-management activities for Internet Service Providers (ISPs) and their equipment vendors, enabling better quality-of-service (QoS) treatment. In particular, network operators need accurate and efficient classification of traffic for effective network planning and design, application prioritization, traffic shaping/policing, and security control. It is essential that network operators understand the trends in their networks so that they can react quickly to support their business goals. Traffic classification can also be part of an intrusion detection system (IDS), whose main goal is to detect a wide range of unusual or anomalous events and to block unwanted traffic.
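The classic protocol/port-based classification mentioned above can be illustrated with a minimal sketch; the port table and function name here are hypothetical, not taken from the book (authoritative mappings come from the IANA registry):

```python
# Hypothetical port-to-application table (a tiny subset for illustration).
WELL_KNOWN_PORTS = {80: "HTTP", 443: "HTTPS", 25: "SMTP", 53: "DNS"}

def classify_by_port(dst_port):
    """Classic port-based classification: look up the destination port."""
    return WELL_KNOWN_PORTS.get(dst_port, "unknown")
```

The "unknown" fallback is exactly where port-based methods break down for dynamic-port P2P traffic, which motivates the statistical methods of later chapters.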
2 Background
pp. 11–20 (10)
This chapter provides the necessary background that will enable a reader to better understand the remaining chapters of this book. It briefly describes and reviews the progress that has been made in three fields, namely dimensionality reduction, clustering-based methods, and data-driven intrusion detection systems (IDSs). These three areas will hopefully provide a reader with a comprehensive background that will facilitate an understanding of the work carried out in this book.
3 Related work
pp. 21–44 (24)
The main purpose of a network scheduler is to classify packets that are to be processed differently. Today, a myriad of methods is used to perform network classification. The simplest of these correlates parts of the data patterns with well-known protocols. A more advanced method statistically analyzes packet inter-arrival times, byte frequencies, and packet sizes. Once a traffic flow has been classified as belonging to a certain protocol, a preset policy is applied to it and to the other flows in order to achieve a particular quality of service. This classification should be performed at the point where traffic enters the network, and in a manner that allows individual flows to be isolated and queued, so that traffic management can shape each of them differently. The network traffic classification approaches in [7,9,17] are considered the most reliable, as they involve a full analysis of the protocol. However, these approaches have certain disadvantages, the first being encrypted and proprietary protocols: as they have no public description, they cannot be classified. Moreover, although implementing every single protocol present in the network is a thorough approach, in reality it is extremely difficult, and tracking the state of even a single protocol may demand considerable resources. Consequently, the method becomes impractical.
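The statistical analysis mentioned above (inter-arrival times and packet sizes) can be sketched as a per-flow feature extractor; the particular feature set and function name are illustrative assumptions, not the chapter's implementation:

```python
from statistics import mean, stdev

def flow_features(timestamps, sizes):
    """Compute simple per-flow statistics of the kind used by
    statistics-based classifiers (illustrative feature set)."""
    # Inter-arrival times between consecutive packets.
    iat = [b - a for a, b in zip(timestamps, timestamps[1:])]
    return {
        "mean_iat": mean(iat),
        "std_iat": stdev(iat) if len(iat) > 1 else 0.0,
        "mean_size": mean(sizes),
        "max_size": max(sizes),
        "n_packets": len(sizes),
    }
```

Such vectors, one per flow, are what the clustering and feature-selection chapters later operate on.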
4 A taxonomy and empirical analysis of clustering algorithms for traffic classification
pp. 45–67 (23)
Clustering algorithms have emerged as an alternative, powerful meta-learning tool to accurately analyze the massive volumes of data generated by modern applications. Their main goal is to categorize data into clusters such that objects within the same cluster are "similar" according to specific metrics. There is a vast body of knowledge in the area of clustering, and there have been attempts to analyze and categorize clustering algorithms for a large number of applications. However, one of the major issues in using clustering algorithms for big data, and one that has created confusion among practitioners, is the lack of consensus on the definition of their properties and the lack of a formal categorization. With the intention of alleviating these problems, this chapter introduces concepts and algorithms related to clustering, concisely surveys existing clustering algorithms, and compares them from both a theoretical and an empirical perspective. Theoretically, a categorizing framework is developed based on the main properties pointed out in previous studies. Empirically, extensive experiments are carried out in which the most representative algorithm from each category is compared using a large number of real (big) datasets. The effectiveness of the candidate clustering algorithms is measured through a number of internal and external validity metrics, as well as stability, runtime, and scalability tests. Additionally, the set of clustering algorithms that perform best on big data is highlighted.
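The internal and external validity metrics mentioned above can be sketched in a simplified one-dimensional form; the two functions below (within-cluster sum of squares and cluster purity) are common textbook examples, not necessarily the metrics used in the chapter:

```python
from collections import Counter

def wcss(points, labels, centroids):
    """Internal validity: within-cluster sum of squares for 1-D points
    (lower means tighter clusters; needs no ground truth)."""
    return sum((p - centroids[l]) ** 2 for p, l in zip(points, labels))

def purity(cluster_labels, true_labels):
    """External validity: fraction of points whose cluster's majority
    class matches their true class (needs ground-truth labels)."""
    clusters = {}
    for c, t in zip(cluster_labels, true_labels):
        clusters.setdefault(c, []).append(t)
    correct = sum(max(Counter(ts).values()) for ts in clusters.values())
    return correct / len(true_labels)
```

Internal metrics score a clustering from the data alone, whereas external metrics compare it against known classes; the chapter's empirical study uses both kinds.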
5 Toward an efficient and accurate unsupervised feature selection
pp. 69–89 (21)
Both redundant and nonrepresentative features result in large-volume, high-dimensional data, which degrades the accuracy and performance of classification as well as clustering algorithms. Most existing feature selection (FS) methods have limitations when dealing with high-dimensional data, as they search different subsets of features to find accurate representations of all features. Obviously, searching different combinations of features is computationally very expensive, which makes existing methods inefficient for high-dimensional data. The work carried out in this chapter, the design of an efficient and accurate similarity-based unsupervised feature selection (AUFS) method, tackles the high-dimensionality issue by selecting a reduced set of representative and nonredundant features without the need for data class labels.
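A minimal sketch of similarity-based unsupervised selection, assuming Pearson correlation as the similarity metric (an illustrative choice; the chapter's AUFS method is not reproduced here):

```python
def pearson(x, y):
    """Pearson correlation between two equal-length numeric columns."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x) ** 0.5
    vy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (vx * vy)

def drop_redundant(features, threshold=0.95):
    """Greedy unsupervised filter: keep a feature only if it is not highly
    correlated with any already-kept feature. No class labels needed."""
    kept = []
    for name, col in features.items():
        if all(abs(pearson(col, features[k])) < threshold for k in kept):
            kept.append(name)
    return kept
```

Because the filter compares only feature columns with each other, it runs without labels, which is the defining property of unsupervised FS.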
6 Optimizing feature selection to improve transport layer statistics quality
pp. 91–119 (29)
There is significant interest in the network-management and industrial-security communities in improving the quality of transport layer statistics (TLS) and identifying the "best" and most relevant features. The ability to eliminate redundant and irrelevant features is important for improving the classification accuracy and for reducing the computational complexity of constructing the classifier. In practice, several feature selection (FS) methods can be used as a preprocessing step to eliminate redundant and irrelevant features, and as a knowledge-discovery tool to reveal the "best" features in many soft-computing applications. This chapter investigates the advantages and disadvantages of such FS methods using newly proposed metrics, namely goodness, stability, and similarity. The aim is to come up with an integrated FS method that builds on the key strengths of existing FS methods. A novel way is described to identify the "best" features efficiently and accurately by first combining the results of some well-known FS methods to find consistent features, and then using the proposed concept of support to select the smallest set of features that optimally covers the data. An empirical study over ten high-dimensional network traffic datasets demonstrates a significant gain in accuracy and improved runtime performance of a classifier compared with the individual results of well-known FS methods.
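The combine-then-select idea can be sketched as follows, reading "support" simply as the number of FS methods that selected a feature (an illustrative interpretation of the chapter's concept, not its definition):

```python
from collections import Counter

def consistent_features(selections, min_support=2):
    """Ensemble feature selection: each element of `selections` is the set
    of features chosen by one FS method; keep features whose support
    (number of methods selecting them) reaches `min_support`."""
    votes = Counter(f for selected in selections for f in selected)
    return sorted(f for f, v in votes.items() if v >= min_support)
```

Raising `min_support` shrinks the feature set toward features all methods agree on, trading coverage for consistency.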
7 Optimality and stability of feature set for traffic classification
pp. 121–148 (28)
Feature selection (FS) methods can be used as a preprocessing step to eliminate meaningless features, and also as a tool to reveal the set of optimal features. Unfortunately, as detailed in Chapter 6, such methods are often sensitive to small variations in the traffic data collected over different periods of time. Thus, obtaining a stable feature set is crucial for enhancing the confidence of network operators. This chapter describes a robust approach, called the global optimization approach (GOA), to identify both optimal and stable features, relying on a multi-criterion fusion-based FS method and an information-theoretic method. GOA first combines multiple well-known FS methods to yield possible optimal feature subsets across different traffic datasets, and then uses the proposed adaptive threshold, which is based on entropy, to extract the stable features. A new goodness measure is proposed within a random forest framework to estimate the final optimal feature subset. The effectiveness of GOA is demonstrated through several experiments on network traffic data in the spatial and temporal domains. Experimental results show that GOA provides up to 98.5% accuracy, exhibits up to 50% reduction in the feature-set size, and speeds up the runtime of a classifier by 50% compared with the individual results produced by other well-known FS methods.
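A toy reading of the entropy-based stability idea: treat each feature's selection frequency across datasets as a Bernoulli variable and keep the frequently selected, low-entropy features. This is illustrative only; GOA's actual adaptive threshold is not reproduced here:

```python
from math import log2

def selection_entropy(frequencies):
    """Binary Shannon entropy of each feature's selection frequency across
    runs: 0 means always or never selected (perfectly stable),
    1 means selected half the time (maximally unstable)."""
    def h(p):
        return 0.0 if p in (0.0, 1.0) else -p * log2(p) - (1 - p) * log2(1 - p)
    return {f: h(p) for f, p in frequencies.items()}

def stable_features(frequencies, max_entropy=0.5):
    """Keep features selected in most runs with low selection entropy."""
    ent = selection_entropy(frequencies)
    return sorted(f for f, p in frequencies.items()
                  if p > 0.5 and ent[f] <= max_entropy)
```

A feature selected in 90% of runs has entropy about 0.47 and survives the default threshold, while one selected half the time (entropy 1.0) is discarded as unstable.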
8 A privacy-preserving framework for traffic data publishing
pp. 149–179 (31)
As explained in Chapter 7, sharing network traffic data has become a vital requirement for machine-learning (ML) algorithms when building an efficient and accurate network traffic classification and intrusion detection system (IDS). However, inappropriate sharing and usage of network traffic data could threaten the privacy of companies and prevent the sharing of such data. This chapter presents a privacy-preserving, strategy-based permutation framework, called PrivTra, in which data privacy, statistical properties, and data-mining utilities can be controlled at the same time. In particular, PrivTra involves the following: (i) vertically partitioning the original dataset to improve the performance of perturbation; (ii) developing a framework to deal with various types of network traffic data, including numerical, categorical, and hierarchical attributes; (iii) grouping the partitioned sets into a number of clusters based on the proposed framework; and (iv) accomplishing the perturbation process by replacing each original attribute value with a new value (its cluster centroid). The effectiveness of PrivTra is demonstrated through several experiments on real network traffic, intrusion detection, and simulated network datasets. Through this experimental analysis, the chapter shows that PrivTra deals effectively with multivariate traffic attributes, produces results comparable to those obtained on the original data, improves the performance of the five supervised approaches, and provides a high level of privacy protection.
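Step (iv), replacing original values with cluster centroids, can be sketched in one dimension as follows; the contiguous-group "clustering" below is a toy stand-in for PrivTra's actual clustering step:

```python
from statistics import mean

def perturb_with_centroids(values, k):
    """Toy centroid perturbation: split the sorted values into k contiguous
    groups and publish each value as its group mean, hiding the exact
    original value while preserving coarse statistics."""
    order = sorted(range(len(values)), key=values.__getitem__)
    size = -(-len(values) // k)  # ceiling division: group size
    out = values[:]
    for i in range(0, len(order), size):
        group = order[i:i + size]
        centroid = mean(values[j] for j in group)
        for j in group:
            out[j] = centroid
    return out
```

Larger groups (smaller `k`) give stronger privacy but coarser published statistics, which is exactly the privacy/utility trade-off the chapter controls.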
9 A semi-supervised approach for network traffic labeling
pp. 181–211 (31)
As discussed in the previous two chapters, recent promising studies of network classification have relied on the analysis of the statistics of traffic flows and the use of machine-learning (ML) methods. However, due to the high cost of manual labeling, it is hard to obtain sufficient, reliable, and up-to-date labeled data for effective IP traffic classification. This chapter discusses a novel semi-supervised approach, called SemTra, which automatically alleviates the shortage of labeled flows for ML by exploiting the advantages of both supervised and unsupervised models. In particular, SemTra involves the following: (i) generating multi-view representations of the original data based on dimensionality-reduction methods to obtain strong discrimination ability; (ii) incorporating the generated representations into an ensemble clustering model to provide a combined clustering output with better quality and stability; (iii) adapting the concept of self-training to iteratively utilize the few labeled instances along with the unlabeled ones, from both local and global viewpoints; and (iv) obtaining the final class decision by combining the decisions of the cluster-mapping strategy and of the local and global self-training approaches. Extensive experiments were carried out to compare the effectiveness of SemTra with representative semi-supervised methods on 16 network traffic datasets. The results clearly show that SemTra yields noticeable improvements in accuracy (as high as 94.96%) and stability (as high as 95.04%) in the labeling process.
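The self-training concept in step (iii) can be sketched with a one-dimensional nearest-neighbour model; this is illustrative only, as SemTra's multi-view ensemble is far richer:

```python
def nearest_label(x, labeled):
    """Predict by the nearest labeled point; also return the distance,
    used here as an (inverse) confidence score."""
    px, py = min(labeled, key=lambda p: abs(p[0] - x))
    return py, abs(px - x)

def self_train(labeled, unlabeled, max_dist=1.0):
    """Minimal self-training loop: repeatedly promote the unlabeled point
    the current model is most confident about (closest neighbour),
    stopping once no confident prediction remains."""
    labeled = list(labeled)
    pool = list(unlabeled)
    while pool:
        x = min(pool, key=lambda u: nearest_label(u, labeled)[1])
        label, dist = nearest_label(x, labeled)
        if dist > max_dist:
            break
        labeled.append((x, label))
        pool.remove(x)
    return labeled
```

Each promoted point immediately becomes labeled evidence for the next iteration, which is how a handful of labeled flows can propagate labels across a large unlabeled pool.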
10 A hybrid clustering-classification for accurate and efficient network classification
pp. 213–227 (15)
Traffic classification is the foundation of many network activities, such as quality of service (QoS), security monitoring, lawful interception, and intrusion detection systems (IDSs). Statistics-based methods, which address the unsatisfactory results of traditional port-based and payload-based methods, have recently attracted attention. However, the presence of non-informative attributes and noisy instances degrades their performance. To address this problem, this chapter describes a hybrid clustering-classification method (called CluClas) that improves the accuracy and efficiency of network traffic classification by selecting informative attributes and representative instances. An extensive empirical study on four traffic datasets shows the effectiveness of the CluClas method.
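The hybrid idea, cluster first to keep representative instances and then classify against only those, can be sketched in one dimension as follows (illustrative, not the CluClas algorithm):

```python
from statistics import mean

def representatives(instances):
    """Pick one representative per class: the instance closest to its
    class mean, a crude stand-in for a cluster centroid."""
    by_class = {}
    for x, y in instances:
        by_class.setdefault(y, []).append(x)
    reps = []
    for y, xs in by_class.items():
        c = mean(xs)
        reps.append((min(xs, key=lambda x: abs(x - c)), y))
    return reps

def classify(x, reps):
    """1-NN against the representatives only: far fewer comparisons than
    against the full training set, with noisy outliers discarded."""
    return min(reps, key=lambda p: abs(p[0] - x))[1]
```

Replacing the full training set with a few representatives is what buys the efficiency gain, while dropping noisy instances is what helps accuracy.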
11 Conclusion
pp. 229–233 (5)
Network traffic classification has the potential to resolve key issues for network operators, including network-management problems, quality-of-service provisioning, Internet accounting and charging, and lawful interception [1]. Traditional network classification techniques, which rely mostly on well-known port numbers, have long been used to identify Internet traffic. This approach was successful because traditional applications used fixed port numbers; however, [9,10] show that current generations of peer-to-peer (P2P) applications try to hide their traffic by using dynamic port numbers. Consequently, applications whose port numbers are unknown cannot be identified in advance.
Back Matter