Paper Title
SpeedLimit: Neural Architecture Search for Quantized Transformer Models
Paper Authors
Paper Abstract
While research in the field of transformer models has primarily focused on enhancing performance metrics such as accuracy and perplexity, practical applications in industry often necessitate a rigorous consideration of inference latency constraints. Addressing this challenge, we introduce SpeedLimit, a novel Neural Architecture Search (NAS) technique that optimizes accuracy whilst adhering to an upper-bound latency constraint. Our method incorporates 8-bit integer quantization in the search process to outperform the current state-of-the-art technique. Our results underline the feasibility and efficacy of seeking an optimal balance between performance and latency, providing new avenues for deploying state-of-the-art transformer models in latency-sensitive environments.
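To make the abstract's core idea concrete, the following is a minimal sketch of a latency-constrained architecture search that applies 8-bit integer quantization inside the search loop, so that candidates are ranked by the latency of the int8 model that would actually be deployed. It uses plain random search and PyTorch dynamic quantization as stand-ins; the search space, the latency budget, and the placeholder fitness score are illustrative assumptions and do not reflect SpeedLimit's actual search strategy or accuracy evaluation.

# Sketch: latency-constrained NAS with int8 quantization in the loop.
# All names (LATENCY_BUDGET_MS, sample_architecture, the toy score)
# are hypothetical, not SpeedLimit's implementation.
import random
import time

import torch
import torch.nn as nn

LATENCY_BUDGET_MS = 20.0   # assumed upper-bound latency constraint
SEQ_LEN, BATCH = 128, 1    # assumed deployment-time input shape

def sample_architecture():
    # Draw a candidate transformer configuration from a toy search space.
    return {
        "num_layers": random.choice([2, 4, 6]),
        "d_model": random.choice([128, 256, 512]),
        "num_heads": random.choice([2, 4, 8]),
    }

def build_model(cfg):
    layer = nn.TransformerEncoderLayer(
        d_model=cfg["d_model"], nhead=cfg["num_heads"], batch_first=True
    )
    return nn.TransformerEncoder(layer, num_layers=cfg["num_layers"])

def measure_latency_ms(model, cfg, n_runs=10):
    # Average wall-clock inference latency on CPU.
    x = torch.randn(BATCH, SEQ_LEN, cfg["d_model"])
    model.eval()
    with torch.no_grad():
        model(x)  # warm-up run, excluded from timing
        start = time.perf_counter()
        for _ in range(n_runs):
            model(x)
    return (time.perf_counter() - start) / n_runs * 1000.0

best = None
for _ in range(20):  # small random-search budget for illustration
    cfg = sample_architecture()
    model = build_model(cfg)
    # Quantize *inside* the search loop: the latency check is done on
    # the int8 model, matching what would be served in production.
    qmodel = torch.quantization.quantize_dynamic(
        model, {nn.Linear}, dtype=torch.qint8
    )
    latency = measure_latency_ms(qmodel, cfg)
    if latency > LATENCY_BUDGET_MS:
        continue  # violates the upper-bound latency constraint
    # Placeholder fitness; the paper would evaluate task accuracy here.
    score = cfg["num_layers"] * cfg["d_model"]
    if best is None or score > best[0]:
        best = (score, cfg, latency)

if best:
    print(f"best config {best[1]} at {best[2]:.2f} ms")

The key design point the sketch illustrates is the ordering: quantization happens before the latency check, so the constraint filters architectures by their quantized inference cost rather than by full-precision latency, which is what lets the search exploit the speedup that int8 execution provides.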