Luca Urbinati, Edward Manca, et al.

The demand for edge machine learning (ML) in Internet of Things (IoT) applications has driven interest in Microcontroller Unit (MCU)-based TinyML solutions, especially with the rise of the RISC-V Instruction Set Architecture. While MCUs are power-efficient, their limited resources challenge the deployment of complex ML models. Mixed-Precision Quantization (MPQ) can achieve a better trade-off among model size, energy consumption, and accuracy than uniform quantization by assigning different precisions to individual model layers. However, MCU-class processors often lack hardware support for MPQ. We present an end-to-end flow, from training to hardware deployment, designed to run Mixed-Precision Quantized Neural Networks (MP-QNNs) efficiently on MCU-class processors. Central to our approach is STAR-MAC, a precision-scalable Multiply-and-Accumulate unit that supports flexible MPQ operations on 16-, 8-, and 4-bit integer data. STAR-MAC combines two subword-parallel techniques, Sum-Together and Sum-Apart, in a unified multiplier architecture that reconfigures efficiently for Fully-Connected, 2D Convolution, and Depth-wise Convolution layers. We integrate STAR-MAC into the low-power RISC-V Ibex core and validate our flow on an FPGA-based System-on-Chip setup. Inference results on MLPerf Tiny MP-QNN models deployed with our modified TensorFlow Lite for Microcontrollers (TFLM) flow show a 68% latency reduction, with little to no accuracy drop, compared with their 8-bit counterparts running on the standard TFLM runtime. Synthesis in a 28-nm CMOS technology indicates limited area and power overhead with respect to the original Ibex core. We open-source our framework to foster MP-QNN deployment on MCU-class RISC-V processors for low-power, low-latency IoT data processing.
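To make the per-layer precision idea behind MPQ concrete, the sketch below shows plain symmetric integer quantization with a layer-dependent bit width. It is our own minimal illustration, not code from the paper's training flow; the `scale` values are arbitrary example choices.

```c
#include <stdint.h>
#include <math.h>
#include <stdio.h>

/* Illustrative symmetric quantizer: each layer is assigned its own
 * bit width b, e.g. 16, 8, or 4 bits as supported by STAR-MAC. */
static int32_t quantize(float x, float scale, int bits) {
    int32_t qmax = (1 << (bits - 1)) - 1;   /* 127 for 8-bit, 7 for 4-bit */
    int32_t q = (int32_t)lroundf(x / scale);
    if (q > qmax)      q = qmax;            /* clip to the signed range */
    if (q < -qmax - 1) q = -qmax - 1;
    return q;
}

int main(void) {
    /* The same weight quantized at the three precisions; scales are
     * hypothetical and would normally be calibrated per layer. */
    float w = 0.37f;
    printf("16-bit: %d, 8-bit: %d, 4-bit: %d\n",
           quantize(w, 1.0f / (1 << 14), 16),
           quantize(w, 1.0f / (1 << 6), 8),
           quantize(w, 1.0f / (1 << 2), 4));
    return 0;
}
```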
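Sum-Together and Sum-Apart refer to two subword-parallel packing identities: in the first, the subword products of one wide multiply land in the same output field and sum into a dot product; in the second, they land in separate fields and stay independent. The following C sketch, our own software emulation with unsigned 4-bit subwords rather than the STAR-MAC datapath itself, demonstrates both identities; the field width K and guard-bit choice are assumptions for this toy case.

```c
#include <stdint.h>
#include <stdio.h>
#include <assert.h>

/* K >= 9: two 4x4-bit products sum to at most 450 < 2^9, so a field
 * width of 10 bits leaves a guard bit and no cross-field carries. */
#define K 10u

int main(void) {
    uint32_t a0 = 13, a1 = 7, b0 = 9, b1 = 15;  /* 4-bit operands */

    /* Sum-Together (ST): pack the two operand pairs in opposite orders
     * so the middle field of one wide product is the dot product
     * a0*b0 + a1*b1, ready to be accumulated in a single MAC step. */
    uint32_t A   = a0 | (a1 << K);
    uint32_t B   = b1 | (b0 << K);
    uint32_t dot = (A * B >> K) & ((1u << K) - 1);
    assert(dot == a0 * b0 + a1 * b1);

    /* Sum-Apart (SA): share one operand so a single wide multiply
     * yields two independent products, kept apart in separate fields. */
    uint32_t P  = (a0 | (a1 << K)) * b0;
    uint32_t p0 = P & ((1u << K) - 1);  /* a0*b0 */
    uint32_t p1 = P >> K;               /* a1*b0 */
    assert(p0 == a0 * b0 && p1 == a1 * b0);

    printf("ST dot = %u, SA products = %u, %u\n", dot, p0, p1);
    return 0;
}
```

In hardware, the same reuse of one wide multiplier array is what lets a single unit serve dot-product-style layers (ST) and layers that broadcast one operand over several others (SA); the sketch only checks the underlying arithmetic, not signed handling or accumulation.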