Paper title
Streaming Temporal Graphs: Subgraph Matching
Paper authors
Abstract
We investigate solutions to subgraph matching within a temporal stream of data. We present the Streaming Analytics Language (SAL), a high-level language for describing temporal subgraphs of interest. SAL programs are translated into C++ code that runs in parallel on a cluster. We call this implementation of SAL the Streaming Analytics Machine (SAM). SAL programs are succinct, requiring about 20 times fewer lines of code than using the SAM library directly or writing an implementation in Apache Flink. To benchmark SAM, we find temporal triangles within streaming netflow data, and we compare SAM to an implementation written for Flink. We find that SAM scales to 128 nodes, or 2560 cores, while Apache Flink reaches maximum throughput at 32 nodes and degrades thereafter. Apache Flink has an advantage when triangles are rare: its maximum aggregate throughput at 32 nodes is greater than the maximum achievable rate of SAM. In our experiments, when triangles occurred at a rate faster than five per second per node, SAM performed better. Both frameworks may miss results due to latencies in network communication. SAM consistently reported an average of 93.7% of expected results, while Flink's rate decreased from 83.7% to 52.1% as we increased to the maximum size of the cluster. Overall, SAM can obtain rates of 91.8 billion netflows per day.
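To make the benchmark query concrete, the following is a minimal single-machine sketch of temporal triangle matching. It is not SAL, SAM, or Flink code. It assumes one plausible definition that the abstract does not spell out: three edges a->b, b->c, c->a over three distinct vertices, with strictly increasing timestamps, all falling within a fixed window of the first edge. The names `Edge`, `temporal_triangles`, and `window` are hypothetical and chosen only for illustration.

```python
from collections import defaultdict
from typing import NamedTuple


class Edge(NamedTuple):
    """A simplified netflow record: source, destination, start time (seconds)."""
    src: str
    dst: str
    time: float


def temporal_triangles(edges, window):
    """Return all (e1, e2, e3) with e1: a->b, e2: b->c, e3: c->a,
    e1.time < e2.time < e3.time, and e3.time <= e1.time + window.

    Brute-force search over an in-memory edge list (an assumption made for
    clarity; a streaming system would process edges incrementally).
    """
    # Index edges by source vertex so each triangle can be extended edge by edge.
    by_src = defaultdict(list)
    for e in edges:
        by_src[e.src].append(e)

    found = []
    for e1 in edges:
        deadline = e1.time + window
        for e2 in by_src[e1.dst]:
            # The second edge must start strictly after the first, within the window.
            if not (e1.time < e2.time <= deadline):
                continue
            # Require three distinct vertices to rule out self-loops and 2-cycles.
            if len({e1.src, e1.dst, e2.dst}) != 3:
                continue
            for e3 in by_src[e2.dst]:
                # The third edge must close the cycle back to the first vertex.
                if e3.dst == e1.src and e2.time < e3.time <= deadline:
                    found.append((e1, e2, e3))
    return found


if __name__ == "__main__":
    flows = [
        Edge("a", "b", 1.0),
        Edge("b", "c", 2.0),
        Edge("c", "a", 3.0),   # closes the triangle within the window
        Edge("c", "a", 20.0),  # too late to close it
    ]
    for tri in temporal_triangles(flows, window=10.0):
        print(tri)
```

In a distributed setting, the three edges of a triangle may arrive at different nodes, so partial matches must be forwarded over the network. That is consistent with the abstract's remark that both frameworks can miss results because of communication latency.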