[RNN+Encrypted Traffic A] EBSNN: Extended Byte Segment Neural Network for Network Traffic Classification

Article directory

  • Introduction to the paper
    • Summary
    • Problems
    • Paper contribution
      • 1. EBSNN
      • 2. Experiment
    • Summary
      • Paper content
      • Dataset
      • Readable citations

Introduction to the paper

Original title: EBSNN: Extended Byte Segment Neural Network for Network Traffic Classification
Published in journal: IEEE Transactions on Dependable and Secure Computing (TDSC)
Publication date: 2022-09-01
Author: Xi Xiao
BibTeX citation:

@article{xiao2021ebsnn,
  title={EBSNN: Extended byte segment neural network for network traffic classification},
  author={Xiao, Xi and Xiao, Wentao and Li, Rui and Luo, Xiapu and Zheng, Haitao and Xia, Shutao},
  journal={IEEE Transactions on Dependable and Secure Computing},
  volume={19},
  number={5},
  pages={3521--3538},
  year={2021},
  publisher={IEEE}
}

Abstract

Network traffic classification is an important part of intrusion detection and network management. Existing methods are mostly based on machine learning techniques and rely on features manually extracted from flows or packets. However, with the rapid growth of network applications, it is difficult for these methods to handle new, complex applications. In this paper, we design a new neural network, the Extended Byte Segment Neural Network (EBSNN), to classify network traffic.

EBSNN first divides each packet into header segments and payload segments, and then feeds them into an encoder composed of recurrent neural networks with an attention mechanism. Based on the output, another encoder learns a high-level representation of the entire packet. In particular, side-channel features are learned from the header segments to improve performance. Finally, the label of the packet is obtained through the softmax function. Additionally, EBSNN can classify network flows by examining only the first few packets. Experiments on real-world datasets show that EBSNN achieves better performance than existing methods on both application identification and website identification tasks.

Existing problems

  1. Due to the large number of new applications, manually finding features that apply to all applications is time-consuming and error-prone.
  2. The network traffic of modern applications is complex and dynamic, so traditional fingerprinting methods cannot generalize well to application identification.

Thesis contribution

  1. A new deep learning network, the Extended Byte Segment Neural Network (EBSNN), is designed for traffic classification. EBSNN introduces an aggregation strategy that relies only on the first k packets of a flow to identify it. Experiments on flow-level classification and website identification tasks demonstrate the scalability of EBSNN.
  2. EBSNN exploits side channel characteristics to improve performance. The header is the input of EBSNN, from which appropriate side-channel features are automatically learned and utilized.
  3. Compared with the dataset in [10], which only contains 10 classes, in this work we collect and publish two large-scale real-world datasets covering 29 applications and 20 popular websites, spanning most domains of daily life. Additionally, more baseline methods are implemented for performance evaluation: besides a traditional machine-learning-based method (Securitas [11]), a deep-learning-based method (Deep Packet [12]) is also used as a baseline.

The paper’s approach to solving the above problems:

Compared with traditional machine-learning-based methods, deep learning offers strong generalization and robustness and can automatically learn more complex and expressive features. Since network packets can be viewed as a natural language spoken between network applications, we propose a new deep learning neural network, the Extended Byte Segment Neural Network (EBSNN), for traffic classification.

Thesis tasks:

  • Identify the application sending the traffic
  • Determine which websites were visited based on captured traffic

1. EBSNN

  • work process:

    1. Each packet is preprocessed and fed into the segment generator, where it is split into a series of segments, including header segments and payload segments.
    2. Then, the attention encoder converts each segment into a segment vector.
    3. After that, all the segment vectors of the packet are fed into another attention encoder to get the representation vector of the entire packet (in Figure 1, attention encoders drawn in different colors have different RNN layers).
    4. Finally, the representation vector is passed to the softmax function to compute the predicted label of the packet.
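The four steps above can be sketched as a single forward pass. This is a structural sketch only; every component name is a placeholder for a learned module, not code from the paper:

```python
def ebsnn_predict(packet, segment_generator, seg_encoder, pkt_encoder, classifier):
    """One EBSNN forward pass over a single packet (structural sketch)."""
    segments = segment_generator(packet)           # header + payload segments
    seg_vecs = [seg_encoder(s) for s in segments]  # first attention encoder
    packet_vec = pkt_encoder(seg_vecs)             # second attention encoder
    return classifier(packet_vec)                  # softmax -> predicted label
```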
  1. preprocessing

    The header consists of an Ethernet II header, an IPv4 header, and a TCP/UDP header.

    1. First read the packet in binary format and convert it into a sequence of hexadecimal integers, then divide it into 4 subsequences: Ethernet header, IPv4 header, TCP/UDP header, and payload.
    • IPv4 header length: determined by the Internet Header Length field in the IPv4 header
    • TCP/UDP header length: determined by the Data Offset field in the TCP header (the Length field in the UDP header)
    • U: the value of the Internet Header Length field
    • V: for the TCP header, the value of the Data Offset field; for the UDP header, the value of the Length field
    • W: the length of the original packet
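A minimal sketch of this splitting step, assuming IPv4 over Ethernet II with a TCP or UDP transport header. For UDP the fixed 8-byte header length is used here instead of parsing the Length field (which covers the whole datagram, not just the header):

```python
def split_packet(raw: bytes):
    """Split a raw Ethernet II frame into the four subsequences EBSNN uses."""
    eth = raw[:14]                       # Ethernet II header is 14 bytes
    u = raw[14] & 0x0F                   # U: Internet Header Length (32-bit words)
    ip_len = 4 * u
    ip = raw[14:14 + ip_len]
    proto = raw[14 + 9]                  # IPv4 protocol field (6 = TCP, 17 = UDP)
    off = 14 + ip_len
    if proto == 6:
        v = raw[off + 12] >> 4           # V: TCP Data Offset (32-bit words)
        l4_len = 4 * v
    else:
        l4_len = 8                       # UDP header is always 8 bytes
    l4 = raw[off:off + l4_len]
    payload = raw[off + l4_len:]         # length W - 14 - ip_len - l4_len
    return eth, ip, l4, payload
```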

    2. mask and padding

    mask: Not every byte in the header segments is meaningful. For example, the source IP and destination IP in the IPv4 header would cause overfitting, so these fields are masked with 0x00. The meaningless bytes in the headers are:

    • Ethernet II header: Since this header only contains EtherType, source MAC address and destination MAC address, we simply discard the entire Ethernet II header
    • IPv4 header: IP identification (bits 32-47 of the IPv4 header), IP checksum (bits 80-95), source IP address (bits 96-127), destination IP address (bits 128-159).
    • TCP/UDP header: source port (0-15 bits), destination port (16-31 bits)

    padding: the payload length M is in most cases not an integer multiple of N, so the tail is padded with 0x00 and the payload is divided into ⌈M/N⌉ byte segments.
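The masking and padding step can be sketched as follows. The byte offsets assume standard IPv4 and TCP/UDP layouts, and N = 16 matches the setting in the experiments section:

```python
def mask_and_pad(ip: bytes, l4: bytes, payload: bytes, n: int = 16):
    """Mask identity-revealing header fields with 0x00 and pad the payload
    so that it splits into ceil(M/N) fixed-length byte segments."""
    ip = bytearray(ip)
    ip[4:6] = b"\x00\x00"        # identification (bits 32-47)
    ip[10:12] = b"\x00\x00"      # header checksum (bits 80-95)
    ip[12:16] = b"\x00" * 4      # source IP address (bits 96-127)
    ip[16:20] = b"\x00" * 4      # destination IP address (bits 128-159)
    l4 = bytearray(l4)
    l4[0:2] = b"\x00\x00"        # source port (bits 0-15)
    l4[2:4] = b"\x00\x00"        # destination port (bits 16-31)
    m = len(payload)
    padded = payload + b"\x00" * ((-m) % n)   # pad tail to a multiple of N
    segments = [padded[i:i + n] for i in range(0, len(padded), n)]
    return bytes(ip), bytes(l4), segments
```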

    3. Input these 4 subsequences into the segment generator, which outputs two kinds of byte segments (header segments and payload segments). The length of each byte segment is fixed, denoted as N.

  2. Model

    • The first attention encoder:

      Each byte segment is encoded into a vector s; these are combined into S = [s_1, s_2, …, s_L] ∈ R^{f×L}, where L is the number of segments and f is the dimension of each segment vector.

    • The second attention encoder:

      d = AttentionEncoder([s_1, …, s_L]), where the output d captures the context information of all segments and serves as the representation vector of the packet.
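A simplified sketch of the attention pooling both encoders apply on top of their RNN hidden states. The tanh scoring form and all names here are illustrative assumptions, not the paper's exact formulation:

```python
import numpy as np

def attention_pool(H: np.ndarray, w: np.ndarray) -> np.ndarray:
    """Attention pooling over RNN hidden states (simplified sketch).

    H: (L, f) matrix of hidden states, w: (f,) learnable context vector.
    Returns the f-dimensional weighted summary vector.
    """
    scores = np.tanh(H) @ w                 # one relevance score per step
    alpha = np.exp(scores - scores.max())
    alpha /= alpha.sum()                    # softmax attention weights
    return alpha @ H                        # weighted sum of hidden states
```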

    • To address the class imbalance problem, focal loss is used as the training objective.
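Focal loss down-weights easy, well-classified examples so that hard examples dominate training. A single-sample sketch (the setting of the focusing parameter to 1 mirrors the experiments section; variable names are illustrative):

```python
import math

def focal_loss(probs, target, gamma=1.0):
    """Focal loss for one sample: FL = -(1 - p_t)^gamma * log(p_t).

    probs: softmax output (list of class probabilities),
    target: index of the true class. gamma=0 recovers cross-entropy.
    """
    p_t = probs[target]
    return -((1.0 - p_t) ** gamma) * math.log(p_t)
```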

  3. Extension to Flow-Level Classification


    Some studies [16], [47] show that the first k packets of a flow contain important information about the flow. Therefore, we input the first k packets, pass them through EBSNN, and finally aggregate the softmax voting results to output the classification results of the flow.

    Zero-length packets that contain no payload (pure ACK, SYN, etc.) are dropped.
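The aggregation over the first k packets can be sketched as soft voting over per-packet softmax outputs. The exact voting rule here (summing probability vectors) is an assumption:

```python
import numpy as np

def classify_flow(packet_probs: np.ndarray) -> int:
    """Aggregate per-packet softmax outputs of the first k packets of a
    flow into one flow label by summing probability vectors (soft voting).

    packet_probs: (k, C) array, one softmax row per payload-carrying packet.
    """
    return int(packet_probs.sum(axis=0).argmax())
```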

2. Experiment

  1. Datasets

    • D1 Dataset: D1 is a large-scale real-world dataset for application identification, which is based on 29 popular applications in many domains such as email, music, video, social networking, search, and shopping. Capture protocol-specific datagrams by tracing the corresponding application process. Specifically, the traffic collector first obtains the socket information of the target application. The datagram is then filtered and saved based on the socket information. In this way, network traffic can be collected without noise. Note that our dataset consists of encrypted and unencrypted network traffic data, which reflects real-world network scenarios.

    • D2 Dataset: D2 is another large real-world dataset for website identification, containing 20 popular websites. We collected D2 in 2020 by running a custom lightweight browser and a network traffic dump tool (i.e., Wireshark) in an isolated environment. The isolated environment is built with Docker and network namespaces, with only the browser and Wireshark running. To collect as much pure web traffic as possible, the custom browser disables all noisy web traffic (such as search recommendations, browser update checks, and account synchronization). For websites that require login accounts, such as QQ Mail and Twitter, we manually log in to personal accounts to obtain real network traffic. Compared with application identification, website identification [2], [49] is a noisier task because many websites share the same static files stored on the same CDNs and exhibit similar behaviors. Specifically, D2 consists of 20 websites: JD.com, NetEase Cloud Music, TED, Amazon, Baidu, Bing, Douban, Facebook, Google, IMDb, Instagram, iQiyi, QQ Mail, Reddit, Taobao, Tieba, Twitter, Sina Weibo, Youku, and YouTube, covering almost all aspects of daily life. Most of their traffic uses HTTPS and cannot be distinguished by destination IP and port.

  2. Hyperparameters:

    • input embedding size (E): 64

    • RNN embedding size: 100

    • attention encoder embedding size: 100

    • learning rate: 0.001

    • dropout rate: 0.5

    • N: 16

    • batch size: 128

    • focal loss parameter: 1

    • RNN types: unidirectional LSTM and GRU

  3. Experimental results

    • D1 dataset: (result figures omitted)

    • D2 dataset: (result figures omitted)

    • Per-class results: (figures omitted)

Summary

Paper content

  1. Methods learned

    Theoretical methods:

    1. How the datasets were collected
    2. Packet-level classification methods

Dataset

https://drive.google.com/drive/folders/1-I3a3lM6v_ANU6uu_AUmpNYt7rGu3kzt

Readable citations

  • Byte segment neural network for network traffic classification
  • Deep packet: A novel approach for encrypted traffic classification using deep learning
  • Robust smartphone app identification via encrypted network traffic analysis
  • MaMPF: Encrypted traffic classification based on multi-attribute Markov probability fingerprints