Paper Title
Lightweight Lexical Test Prioritization for Immediate Feedback
Paper Authors
Paper Abstract
The practice of unit testing enables programmers to obtain automated feedback on whether a currently edited program is consistent with the expectations specified in test cases. Feedback is most valuable when it arrives immediately, as defects can be corrected on the spot, before they become harder to fix. With growing and longer-running test suites, however, feedback is obtained less frequently and lags behind program changes. The objective of test prioritization is to rank tests so that defects, if present, are found as early as possible or at the lowest cost. While there are numerous static approaches that output a ranking of tests based solely on the current version of a program, we focus on change-based test prioritization, which recommends tests that are likely to fail in response to the most recent program change. The canonical approach relies on coverage data and prioritizes tests that cover the changed region, but obtaining and updating coverage data is costly. More recently, information retrieval techniques that exploit the overlapping vocabulary between a change and the tests have proven to be powerful yet lightweight. In this work, we demonstrate the capabilities of information retrieval for prioritizing tests in dynamic programming languages, using Python as an example. We discuss and measure previously understudied variation points, including how contextual information around a program change can be used, and we design alternatives to the widespread TF-IDF retrieval model that are tailored to retrieving failing tests. To obtain program changes with associated test failures, we design a tool that generates a large set of faulty changes, together with their test results, from version history. Using this data set, we compare existing and new lexical prioritization strategies on four open-source Python projects, showing large improvements over untreated and random test orders and results consistent with related work on statically typed languages. We conclude that lightweight IR-based prioritization strategies are effective tools for predicting failing tests in the absence of coverage data or when static analysis is intractable, as in dynamic languages. This knowledge can benefit both individual programmers who rely on fast feedback and operators of continuous integration infrastructure, where resources can be freed sooner by detecting defects earlier in the build cycle.
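To make the general idea concrete, below is a minimal sketch (not the authors' implementation) of change-based lexical prioritization as described in the abstract: the textual diff of the latest change is treated as a query, each test's source code as a document, and tests are ranked by TF-IDF cosine similarity. All names (prioritize, tokenize) and the toy diff and test bodies are hypothetical illustrations.

import math
import re
from collections import Counter

IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")

def tokenize(text):
    # Split source text into identifier-like tokens, lowercased.
    return [token.lower() for token in IDENTIFIER.findall(text)]

def tfidf_vectors(token_lists):
    # Build one sparse TF-IDF vector (dict) per document, plus the IDF table.
    n = len(token_lists)
    df = Counter()
    for tokens in token_lists:
        df.update(set(tokens))
    idf = {term: math.log(n / df[term]) + 1.0 for term in df}
    vectors = [{term: count * idf[term] for term, count in Counter(tokens).items()}
               for tokens in token_lists]
    return vectors, idf

def cosine(u, v):
    # Cosine similarity between two sparse vectors represented as dicts.
    dot = sum(weight * v[term] for term, weight in u.items() if term in v)
    norm = math.sqrt(sum(w * w for w in u.values())) * math.sqrt(sum(w * w for w in v.values()))
    return dot / norm if norm else 0.0

def prioritize(change_diff, test_sources):
    # Rank test names by lexical similarity between the change and each test.
    names = list(test_sources)
    vectors, idf = tfidf_vectors([tokenize(test_sources[name]) for name in names])
    query = {term: count * idf.get(term, 1.0)
             for term, count in Counter(tokenize(change_diff)).items()}
    scores = {name: cosine(query, vector) for name, vector in zip(names, vectors)}
    return sorted(names, key=scores.get, reverse=True)

if __name__ == "__main__":
    diff = "+ def parse_config(path): return load_yaml(path)"
    tests = {
        "test_parse_config": "def test_parse_config(): assert parse_config('cfg.yml')",
        "test_render_html": "def test_render_html(): assert render_html(make_page())",
    }
    print(prioritize(diff, tests))  # expected: test_parse_config ranked first

In this toy run, the test sharing identifiers with the diff (parse_config) is ranked first; the paper's variation points (e.g., how much context around the change to include in the query, or alternative retrieval models to TF-IDF) would plug into how the query and document vectors are built.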