How to Configure Masked Event Anomaly Detection on Software Logs?

Paper Title

How to Configure Masked Event Anomaly Detection on Software Logs?

Authors

Jesse Nyyssölä, Mika Mäntylä, Martín Varela

Abstract

Software log anomaly detection with masked event prediction can be implemented with various technical approaches, each with numerous configurations and parameters. Our objective is to provide a baseline of settings for similar studies in the future. We use three models: the N-Gram model, a classic approach in natural language processing (NLP), and two deep learning (DL) models, long short-term memory (LSTM) and convolutional neural network (CNN). We use four datasets: Profilence, BlueGene/L (BGL), Hadoop Distributed File System (HDFS), and Hadoop. The other settings we vary are the size of the sliding window, which determines how many surrounding events are used to predict a given event; the mask position (the position within the window that is predicted); whether only unique sequences are used; and the portion of the data used for training. The results show clear indications of settings that generalize across datasets. The performance of the DL models does not deteriorate as the window size increases, whereas the N-Gram model performs worse with large window sizes on the BGL and Profilence datasets. Despite the popularity of next event prediction, the results show that in this context it is better not to predict events at the edges of the subsequence, i.e., the first or last event; the best result comes from predicting the fourth event when the window size is five. Regarding the amount of data used for training, the results differ across datasets and models. For example, the N-Gram model appears to be more sensitive to a lack of data than the DL models. Overall, for similar experimental setups we suggest the following general baseline: window size 10, mask position second to last, no filtering of non-unique sequences, and half of the total data used for training.
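
To make the window size and mask position settings concrete, the following is a minimal sketch of masked event prediction with a count-based, N-Gram-style predictor. It is not the authors' implementation: the names `masked_windows` and `MaskedNGram` are hypothetical, events are assumed to be already parsed into hashable event IDs, and flagging an event when its context was never seen in training is an assumption made for illustration. The defaults follow the baseline suggested in the abstract (window size 10, mask position second to last).

```python
from collections import Counter, defaultdict


def masked_windows(events, window_size=10, mask_pos=None):
    """Yield (context, target) pairs from a sliding window over one sequence.

    mask_pos is the 0-based index of the masked event inside the window.
    The default (second to last) mirrors the baseline in the abstract.
    """
    if mask_pos is None:
        mask_pos = window_size - 2
    events = list(events)
    for start in range(len(events) - window_size + 1):
        window = events[start:start + window_size]
        # The context is every event in the window except the masked one.
        context = tuple(window[:mask_pos] + window[mask_pos + 1:])
        yield context, window[mask_pos]


class MaskedNGram:
    """Hypothetical count-based predictor: most frequent target per context."""

    def __init__(self, window_size=10, mask_pos=None):
        self.window_size = window_size
        self.mask_pos = window_size - 2 if mask_pos is None else mask_pos
        self.counts = defaultdict(Counter)

    def fit(self, sequences):
        for seq in sequences:
            for context, target in masked_windows(
                seq, self.window_size, self.mask_pos
            ):
                self.counts[context][target] += 1
        return self

    def predict(self, context):
        """Return the most frequent masked event for a context, or None."""
        seen = self.counts.get(tuple(context))
        return seen.most_common(1)[0][0] if seen else None

    def anomalies(self, seq):
        """Return indices of events that differ from the prediction.

        Assumption: an unseen context (prediction None) also counts as
        a mismatch, so events near an anomaly may be flagged as well.
        """
        flagged = []
        pairs = masked_windows(seq, self.window_size, self.mask_pos)
        for start, (context, target) in enumerate(pairs):
            if self.predict(context) != target:
                flagged.append(start + self.mask_pos)
        return flagged


# Usage: predict the fourth event in a window of five.
model = MaskedNGram(window_size=5, mask_pos=3).fit([list("ABCDE") * 4])
normal = list("ABCDE") * 2
suspicious = list("ABCDEABCXEABCDE")  # "X" replaces the expected "D"
normal_flags = model.anomalies(normal)
suspicious_flags = model.anomalies(suspicious)
```

With a longer window the context tuple becomes more specific, so a count-based model sees fewer exact matches in training. This is one plausible reading of why the abstract reports the N-Gram model degrading at large window sizes, though the sketch itself does not reproduce that experiment.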

Paper Title