Paper Title

ET-BERT: A Contextualized Datagram Representation with Pre-training Transformers for Encrypted Traffic Classification

Paper Authors

Xinjie Lin, Gang Xiong, Gaopeng Gou, Zhen Li, Junzheng Shi, Jing Yu

Paper Abstract

Encrypted traffic classification requires discriminative and robust traffic representations, captured from content-invisible and imbalanced traffic data, for accurate classification, which is challenging but indispensable for network security and network management. The major limitation of existing solutions is that they rely heavily on deep features, which are overly dependent on data size and hard to generalize to unseen data. How to leverage open-domain unlabeled traffic data to learn representations with strong generalization ability remains a key challenge. In this paper, we propose a new traffic representation model called Encrypted Traffic Bidirectional Encoder Representations from Transformer (ET-BERT), which pre-trains deep contextualized datagram-level representations from large-scale unlabeled data. The pre-trained model can be fine-tuned on a small amount of task-specific labeled data and achieves state-of-the-art performance across five encrypted traffic classification tasks, remarkably pushing the F1 of ISCX-Tor to 99.2% (4.4% absolute improvement), ISCX-VPN-Service to 98.9% (5.2% absolute improvement), Cross-Platform (Android) to 92.5% (5.4% absolute improvement), and CSTNET-TLS 1.3 to 97.4% (10.0% absolute improvement). Notably, we explain the empirically powerful pre-trained model by analyzing the randomness of ciphers, which gives us insights into the boundary of classification ability over encrypted traffic. The code is available at: https://github.com/linwhitehat/ET-BERT.
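The abstract describes a standard pre-train/fine-tune workflow: a BERT-style encoder is first pre-trained on large-scale unlabeled datagrams, then fine-tuned on a small amount of labeled traffic for each classification task. The sketch below illustrates only that fine-tuning step using the Hugging Face transformers API; it is not the authors' released pipeline (that lives at https://github.com/linwhitehat/ET-BERT), and the checkpoint path, the hex-token rendering of payload bytes, and the label count are illustrative assumptions.

import torch
from transformers import BertTokenizer, BertForSequenceClassification

# Hypothetical local directory holding pre-trained ET-BERT-style weights (assumption).
checkpoint = "./etbert_pretrained"
tokenizer = BertTokenizer.from_pretrained(checkpoint)
model = BertForSequenceClassification.from_pretrained(checkpoint, num_labels=12)

# Encrypted payload bytes rendered as space-separated hex tokens (one possible scheme).
datagram = "4a f2 0c 1d 9e 33 7b 08 5c d4"
inputs = tokenizer(datagram, truncation=True, max_length=128, return_tensors="pt")

# Forward pass with a task label; the loss drives task-specific fine-tuning.
label = torch.tensor([3])
outputs = model(**inputs, labels=label)
outputs.loss.backward()      # one gradient step of fine-tuning
print(outputs.logits.shape)  # torch.Size([1, 12])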
