Paper Title
Q-EEGNet: an Energy-Efficient 8-bit Quantized Parallel EEGNet Implementation for Edge Motor-Imagery Brain--Machine Interfaces

Authors

Schneider, Tibor, Wang, Xiaying, Hersche, Michael, Cavigelli, Lukas, Benini, Luca

Abstract

Motor-Imagery Brain--Machine Interfaces (MI-BMIs) promise direct and accessible communication between human brains and machines by analyzing brain activities recorded with Electroencephalography (EEG). Latency, reliability, and privacy constraints make it unsuitable to offload the computation to the cloud. Practical use cases demand a wearable, battery-operated device with low average power consumption for long-term use. Recently, sophisticated algorithms, in particular deep learning models, have emerged for classifying EEG signals. While reaching outstanding accuracy, these models often exceed the limitations of edge devices due to their memory and computational requirements. In this paper, we demonstrate algorithmic and implementation optimizations for EEGNet, a compact Convolutional Neural Network (CNN) suitable for many BMI paradigms. We quantize weights and activations to 8-bit fixed-point with a negligible accuracy loss of 0.4% on 4-class MI, and present an energy-efficient hardware-aware implementation on the Mr.Wolf parallel ultra-low power (PULP) System-on-Chip (SoC) by utilizing its custom RISC-V ISA extensions and 8-core compute cluster. With our proposed optimization steps, we obtain an overall speedup of 64x and a reduction of up to 85% in memory footprint with respect to a single-core layer-wise baseline implementation. Our implementation takes only 5.82 ms and consumes 0.627 mJ per inference. At 21.0 GMAC/s/W, it is 256x more energy-efficient than an EEGNet implementation on an ARM Cortex-M7 (0.082 GMAC/s/W).
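The 8-bit fixed-point quantization mentioned in the abstract can be illustrated with a generic symmetric per-tensor scheme. This is a minimal sketch for intuition only, not the paper's exact quantization procedure; the function names and the per-tensor scale choice are illustrative assumptions.

```python
import numpy as np

def quantize_int8(x: np.ndarray):
    """Symmetric per-tensor quantization to int8 (illustrative sketch,
    not the paper's exact scheme): map floats to [-128, 127] with a
    single scale derived from the tensor's maximum magnitude."""
    scale = np.max(np.abs(x)) / 127.0
    q = np.clip(np.round(x / scale), -128, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Map int8 values back to approximate float values."""
    return q.astype(np.float32) * scale

# Example: round-trip random "weights" and measure the quantization error
rng = np.random.default_rng(0)
w = rng.standard_normal(1000).astype(np.float32)
q, s = quantize_int8(w)
err = np.max(np.abs(dequantize(q, s) - w))
print(f"max abs error: {err:.5f} (half step = {s / 2:.5f})")
```

Rounding to the nearest quantization level bounds the per-element error by half a quantization step, which is why such small bit-widths can preserve classification accuracy when the value distributions are well-scaled.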
