[RNN+Encrypted Traffic A] ET-BERT: A Contextualized Datagram Representation with Pre-training Transformers for Encrypted Traffic Classification

Article directory

  • Introduction to the paper
    • Abstract
    • Existing problems
    • Paper contributions
      • 1. ET-BERT
      • 2. Experiment
    • Summary
      • Paper content
      • Dataset
      • Readable citations
      • Reference link

Introduction to the paper

Original title: ET-BERT: A Contextualized Datagram Representation with Pre-training Transformers for Encrypted Traffic Classification
Chinese title: ET-BERT: A datagram contextual representation method based on pre-trained transformers for encrypted traffic classification
Conference: WWW ’22: The ACM Web Conference 2022
Publication date: 2022-04-25
Author: Xinjie Lin
LaTeX citation:

@inproceedings{lin2022bert,
  title={Et-bert: A contextualized datagram representation with pre-training transformers for encrypted traffic classification},
  author={Lin, Xinjie and Xiong, Gang and Gou, Gaopeng and Li, Zhen and Shi, Junzheng and Yu, Jing},
  booktitle={Proceedings of the ACM Web Conference 2022},
  pages={633--642},
  year={2022}
}

Abstract

Encrypted traffic classification requires obtaining discriminative and robust traffic representations from content-invisible and imbalanced traffic data to achieve accurate classification, which is challenging but essential for network security and network management. The main limitation of existing solutions is that they rely heavily on deep features, which are overly dependent on the size of the training data and are difficult to generalize to unseen data. How to leverage open-domain unlabeled traffic data to learn representations with strong generalization ability remains a key challenge.

In this paper, we propose a new traffic representation model called Encrypted Traffic Bidirectional Encoder Representations from Transformers (ET-BERT), which pre-trains deeply contextualized packet-level representations from large-scale unlabeled data. The pre-trained model can be fine-tuned on a small amount of task-specific labeled data and achieves state-of-the-art performance across five encrypted traffic classification tasks, raising the F1 of ISCX-VPN-Service to 98.9% (5.2%↑), Cross-Platform (Android) to 92.5% (5.4%↑), and CSTNET-TLS 1.3 to 97.4% (10.0%↑). Notably, we provide an explanation of the empirically powerful pre-trained model by analyzing the randomness of ciphers. It offers new insight into our understanding of the boundaries of classification ability over encrypted traffic.

Code is available at https://github.com/linwhitehat/ET-BERT.

Existing problems

  • Fingerprint matching based on plaintext features [37]: not applicable to emerging encryption technologies (such as TLS 1.3), because the plaintext becomes sparser or is obfuscated.
  • Machine learning based on statistical features [27, 36]: highly dependent on features designed by experts, with limited generalization ability.
  • Machine learning based on raw traffic features [21, 22]: highly dependent on the amount and distribution of labeled training data, which easily biases the model and makes it difficult to adapt to emerging encryption.

Paper contributions

  1. We propose a pre-training framework for encrypted traffic classification that leverages large-scale unlabeled encrypted traffic to learn a common datagram representation for a range of encrypted traffic classification tasks.
  2. Two traffic-specific self-supervised pre-training tasks are proposed, namely the Masked BURST Model and Same-origin BURST Prediction, which capture byte-level and BURST-level context to obtain a universal datagram representation.
  3. ET-BERT has strong generalization ability and is evaluated on five encrypted traffic classification tasks: general encrypted application classification, encrypted malware classification, encrypted traffic classification on VPN, encrypted application classification on Tor, and encrypted application classification on TLS 1.3. It sets new state-of-the-art performance, significantly outperforming existing works by 5.4%, 0.2%, 5.2%, 4.4%, and 10.0%, respectively.
  4. In addition, the strong performance of the pre-trained model is explained and analyzed theoretically.

The paper’s approach to solving the above problems:

Pretraining-based methods employ large amounts of unlabeled data to learn unbiased data representations. Such data representations can be easily transferred to downstream tasks by fine-tuning on a limited amount of labeled data.

  • Enhanced generalization ability
  • Not dependent on expert knowledge
  • Ability to identify emerging encrypted traffic (not present in the training data)
  • No need to label data

Task of the paper:

Use BERT for multi-class classification of encrypted traffic. The pre-trained model architecture is kept unchanged, and the downstream task is framed as single-sentence prediction.
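This is the standard BERT single-sequence classification setup: the token sequence of one datagram (or flow) is encoded, and the class is predicted from the [CLS] representation. A minimal sketch, assuming a HuggingFace-compatible checkpoint; the class, checkpoint path, and variable names below are placeholders rather than the authors' code:

import torch.nn as nn
from transformers import BertModel

class TrafficClassifier(nn.Module):
    """A BERT-style encoder with a linear head over the [CLS] token."""
    def __init__(self, encoder_path: str, num_classes: int):
        super().__init__()
        self.encoder = BertModel.from_pretrained(encoder_path)
        self.head = nn.Linear(self.encoder.config.hidden_size, num_classes)

    def forward(self, input_ids, attention_mask):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        cls_vec = out.last_hidden_state[:, 0]    # [CLS] representation
        return self.head(cls_vec)                # logits over traffic classes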

1. ET-BERT

  • Datagram2Token

    1. The BURST generator extracts the consecutive server-to-client or client-to-server packets in a session flow, called a BURST [28, 33], which represents partially complete information of the session.

      BURST: A set of time-contiguous network packets from requests or responses within a single session stream.

      A BURST is bounded by the maximum number of unidirectional packets in each direction (source→destination and destination→source), as given in the paper's formal definition.

      Put simply, a BURST is a segment of a session obtained by cutting the flow at each change of direction. In the paper's figure, if dark blue denotes the forward direction, then light blue denotes the reverse direction.

    2. The BURST2Token process then converts the datagram in each BURST into token embeddings via a bi-gram model. This process also splits each BURST into two parts to prepare for the pre-training tasks (a tokenization sketch follows at the end of this Datagram2Token list).

      1. Represent the byte stream in the BURST as a hexadecimal sequence.
      2. Encode the hexadecimal sequence with a bi-gram, i.e., each token unit consists of two adjacent bytes. Each unit therefore takes a value in 0–65535, so the dictionary size is |V| = 65536.

      3. Divide the encoded hexadecimal sequence evenly into two subsequences: sub-BURST^A and sub-BURST^B.

      4. Add special tokens: [CLS], [SEP], [PAD], [MASK]
    3. Finally, Token2Embedding combines the token embedding, position embedding, and segment embedding of each token to form the pre-training input representation.

      We represent each token obtained in BURST2Token through three embeddings: token embedding, position embedding and segment embedding.
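    To make Datagram2Token concrete, below is a minimal sketch of BURST extraction by direction change, hex bi-gram tokenization, and the sub-BURST split (my own illustration; the function names and toy bytes are placeholders, not the authors' implementation):

from typing import List

def split_bursts(directions: List[str], payloads: List[bytes]) -> List[bytes]:
    """Group consecutive same-direction packets of a session into BURSTs."""
    bursts, current, prev_dir = [], b"", None
    for direction, payload in zip(directions, payloads):
        if prev_dir is not None and direction != prev_dir:
            bursts.append(current)        # direction flipped: close the BURST
            current = b""
        current += payload
        prev_dir = direction
    if current:
        bursts.append(current)
    return bursts

def bigram_tokens(payload: bytes) -> List[str]:
    """Encode bytes as bi-grams of adjacent bytes, i.e. 4-hex-digit units (|V| = 65536)."""
    hex_units = [f"{b:02x}" for b in payload]          # one 2-hex-digit unit per byte
    return [hex_units[i] + hex_units[i + 1] for i in range(len(hex_units) - 1)]

def burst_to_model_input(burst: bytes) -> List[str]:
    """Split the BURST's tokens into sub-BURST A/B and add BERT-style special tokens."""
    tokens = bigram_tokens(burst)
    half = len(tokens) // 2
    return ["[CLS]"] + tokens[:half] + ["[SEP]"] + tokens[half:] + ["[SEP]"]

# Toy example: two client->server packets followed by one server->client packet.
bursts = split_bursts(["c2s", "c2s", "s2c"], [b"\x16\x03\x03", b"\x00\x45", b"\x16\x03"])
print(burst_to_model_input(bursts[0]))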

  • Pre-training

    • Masked BURST Model: This task is similar to the masked language model used by BERT [6]. ET-BERT is trained to predict the tokens at masked positions from their context. The loss is the standard masked-token prediction loss (see the reconstruction of both pre-training objectives at the end of this list).

    • Same-origin BURST Prediction: For this task, a binary classifier is used to predict whether two sub-BURSTs come from the same BURST. Specifically:

      • With 50% probability, sub-BURST^B is the actual second half of sub-BURST^A (i.e., the two halves come from the same BURST).

      • With 50% probability, sub-BURST^B is a random sub-BURST taken from a different BURST.

      The loss is a binary classification loss over pairs B_j = (sub-BURST_j^A, sub-BURST_j^B) with labels y_j ∈ {0, 1} (see the reconstruction at the end of this list).
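      In standard BERT-style notation, the two pre-training objectives can be written roughly as follows (my own reconstruction; the paper's exact symbols may differ). Here \mathcal{M} is the set of masked positions, \tilde{X} is the masked input sequence, \Theta are the model parameters, and p(B_j) is the classifier's predicted probability that the pair B_j is same-origin:

      % Masked BURST Model: cross-entropy over the masked token positions
      \mathcal{L}_{\mathrm{MBM}} = -\sum_{i \in \mathcal{M}} \log P\!\left(x_i \mid \tilde{X};\, \Theta\right)

      % Same-origin BURST Prediction: binary cross-entropy over sub-BURST pairs
      \mathcal{L}_{\mathrm{SBP}} = -\sum_{j} \Bigl[\, y_j \log p\!\left(B_j\right) + (1 - y_j) \log\!\left(1 - p\!\left(B_j\right)\right) \Bigr]

      As in BERT, the two objectives are presumably optimized jointly during pre-training.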

  • Fine-tuning
    Since the structures used for fine-tuning and pre-training are basically the same, we feed the task-specific packet or flow representation into the pre-trained ET-BERT and fine-tune all parameters end to end. At the output layer, the [CLS] representation is fed into a multi-class classifier for prediction. Two fine-tuning strategies are proposed to adapt to different classification scenarios:

    • ET-BERT(packet): takes the packet level as input, specifically to test whether ET-BERT can adapt to more fine-grained traffic data.
    • ET-BERT(flow): takes the flow level as input, and is intended for a fair and objective comparison of ET-BERT with other methods.

    The main difference between the two fine-tuned models is the amount of information in the input traffic. For the flow level, the spliced datagrams of consecutive packets in the flow are used as input, where the number of packets is set to 5. Section 4.1 of the paper describes the traffic processing in detail.
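    A minimal sketch of how the flow-level input could be assembled, assuming each packet has already been tokenized as in Datagram2Token (the names and the 512-token limit are my assumptions, not the authors' code):

from typing import List

MAX_LEN = 512          # assumed BERT-style maximum sequence length
PACKETS_PER_FLOW = 5   # number of consecutive packets spliced per flow input

def flow_input(packet_token_seqs: List[List[str]]) -> List[str]:
    """ET-BERT(flow): splice the token sequences of the first 5 packets of a flow.
    ET-BERT(packet) would instead pass a single packet's token sequence."""
    tokens: List[str] = ["[CLS]"]
    for seq in packet_token_seqs[:PACKETS_PER_FLOW]:
        tokens.extend(seq)
    return tokens[:MAX_LEN - 1] + ["[SEP]"]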

2. Experiment

  • Dataset

    • Task 1: Generic Encryption Application Classification (GEAC): Classify application traffic based on standard encryption protocols.
    • Task 2: Encrypted Malware Classification (EMC): a collection of encrypted traffic consisting of malware and benign applications [41].
    • Task 3: VPN Encrypted Traffic Classification (ETCV): Classify encrypted traffic using VPN for network communication.
    • Task 4: Tor-based Encrypted Application Classification (EACT): Aims to classify encrypted traffic that uses The Onion Router (Tor) to enhance communication privacy.
    • Task 5: Encrypted application classification based on TLS 1.3 (EAC1.3): Aims to classify encrypted traffic based on the new encryption protocol TLS 1.3. This dataset consists of 120 applications collected under CSTNET from March to July 2021, named CSTNET-TLS 1.3. To our knowledge, it is the first TLS 1.3 dataset to date. The applications are drawn from the Alexa Top-5000 [3] sites that have deployed TLS 1.3, and each session flow is labeled with its Server Name Indication (SNI). In CSTNET-TLS 1.3 the SNI is still accessible because of TLS 1.3's backward compatibility. The ECH mechanism will hide the SNI and compromise labeling accuracy in the future, but ideas for overcoming this are discussed in Section 5 of the paper.
  • Data preprocessing

    1. Address Resolution Protocol (ARP) and Dynamic Host Configuration Protocol (DHCP) packets are removed, since they are not relevant to the content-carrying traffic.
    2. To avoid interference from packet headers, which carry strongly identifying information (such as IP addresses and ports [19, 25, 40]) on a limited dataset, the Ethernet header, the IP header, and the port fields of the TCP header are removed (see the preprocessing sketch after this list).
    3. In the fine-tuning phase, we randomly select up to 500 flows and 5000 packets from each class across all datasets.
    4. Each dataset is split into training, validation, and test sets at a ratio of 8:1:1.
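    A minimal preprocessing sketch along the lines of steps 1–2, assuming Scapy and a pcap path of your own; dropping the first four bytes of the TCP segment to remove the ports is my reading of the steps above, not the authors' exact script:

from scapy.all import rdpcap, ARP, DHCP, TCP  # pip install scapy

def clean_payloads(pcap_path: str) -> list:
    """Drop ARP/DHCP packets and strip Ethernet/IP headers and TCP ports."""
    payloads = []
    for pkt in rdpcap(pcap_path):
        if pkt.haslayer(ARP) or pkt.haslayer(DHCP):
            continue                      # protocol chatter, not content traffic
        if not pkt.haslayer(TCP):
            continue                      # keep the sketch TCP-only for brevity
        segment = bytes(pkt[TCP])         # TCP header + payload, no Ethernet/IP
        payloads.append(segment[4:])      # drop the 4 bytes of src/dst ports
    return payloads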
  • Parameters (see the configuration sketch at the end of this experiment section):

    Packet level:

    • batch_size = 32
    • learning_rate = 0.00002
    • ratio of warmup = 0.1
    • epoch = 10

    Flow level:

    • batch_size = 32
    • learning_rate = 0.00006
    • dropout = 0.5
  • Results

  • Ablation study
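  As a rough illustration of the packet-level fine-tuning configuration listed above, assuming a HuggingFace-style training loop (the checkpoint path, label count of 120 for CSTNET-TLS 1.3, and dataset variables are placeholders, not the authors' pipeline):

from transformers import BertForSequenceClassification, Trainer, TrainingArguments

# Placeholder checkpoint and datasets; substitute pre-trained ET-BERT weights
# and tokenized traffic datasets of your own.
model = BertForSequenceClassification.from_pretrained("path/to/et-bert", num_labels=120)

args = TrainingArguments(
    output_dir="finetune-packet",
    per_device_train_batch_size=32,   # batch_size = 32
    learning_rate=2e-5,               # packet-level learning rate
    warmup_ratio=0.1,                 # ratio of warmup
    num_train_epochs=10,              # epoch = 10
)

trainer = Trainer(model=model, args=args,
                  train_dataset=train_dataset,   # placeholder: tokenized training set
                  eval_dataset=val_dataset)      # placeholder: tokenized validation set
trainer.train()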

Summary

Paper content

  1. Methods learned

    Theoretical approach: using BERT for encrypted traffic classification.

  2. Strengths and weaknesses of the paper

    Strengths:

    1. Provides ideas for BERT’s application in network security

    Weaknesses:

    1. The changes to BERT are limited; the model is largely reused as-is, with little methodological innovation.
  3. Ideas for innovation

    Consider whether changes could be made to the downstream task or to the data granularity (BURST is used here).

Dataset

  • Datasets used for pre-training: public datasets [9, 32]:
    • CICIDS2018
  • Datasets used for fine-tuning: public datasets [37, 41, 9, 10]:
    • Cross-Platform (iOS), Cross-Platform (Android)
    • USTC-TFC
    • ISCX-VPN-Service, ISCX-VPN-App
    • ISCX-Tor
  • Traffic passively collected on the China Science and Technology Network (CSTNET)

Readable citations

Encrypted traffic detection:

  • FlowPrint: Semi-Supervised Mobile-App Fingerprinting on Encrypted Network Traffic
  • Robust Smartphone App Identification via Encrypted Network Traffic Analysis
  • Deep Fingerprinting: Undermining Website Fingerprinting Defenses with Deep Learning
  • TSCRNN: A novel classification scheme of encrypted traffic based on flow spatiotemporal features for efficient management of IIoT
  • Deep packet: a novel approach for encrypted traffic classification using deep learning
  • Exploiting Diversity in Android TLS Implementations for Mobile App Traffic Classification

Pre-trained model:

  • RoBERTa: RoBERTa: A Robustly Optimized BERT Pretraining Approach
  • ALBERT: A Lite BERT for Self-supervised Learning of Language Representations
  • ERNIE: ERNIE: Enhanced Language Representation with Informative Entities
  • DistilBERT: DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter

Reference link

  • Code: https://github.com/linwhitehat/ET-BERT