Luca Urbinati, Edward Manca, et al.

The demand for edge machine learning (ML) in Internet of Things (IoT) applications has driven interest in Microcontroller Unit (MCU)-based TinyML solutions, especially with the rise of the RISC-V Instruction Set Architecture. While MCUs are power-efficient, their limited resources challenge the deployment of complex ML models. Mixed-Precision Quantization (MPQ) can achieve a better trade-off among model size, energy consumption, and accuracy than uniform quantization by assigning different precisions to individual model layers. However, MCU-class processors often lack hardware support for MPQ. We present an end-to-end flow, from training to hardware deployment, designed to run Mixed-Precision Quantized Neural Networks (MP-QNNs) efficiently on MCU-class processors. Central to our approach is STAR-MAC, a precision-scalable Multiply-and-Accumulate unit that supports flexible MPQ operations on 16-, 8-, and 4-bit integer data. STAR-MAC combines two subword-parallel techniques, Sum-Together and Sum-Apart, in a unified multiplier architecture that reconfigures efficiently for Fully-Connected, 2D Convolution, and Depth-wise Convolution layers. We integrate STAR-MAC into the low-power RISC-V Ibex core and validate our flow on an FPGA-based System-on-Chip setup. Inference results on MLPerf Tiny MP-QNN models deployed with our modified TensorFlow Lite for Microcontrollers (TFLM) flow show a 68% latency reduction, with little to no accuracy drop, compared with their 8-bit counterparts running on the standard TFLM runtime. Synthesis in a 28-nm CMOS technology indicates limited area and power overhead with respect to the original Ibex core. We open-source our framework to foster MP-QNN deployment on MCU-class RISC-V processors for low-power, low-latency IoT data processing.
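To make the per-layer precision idea behind MPQ concrete, the sketch below shows plain symmetric integer quantization with a layer-dependent bit width. It is our own minimal illustration, not code from the paper's training flow; the `scale` values are arbitrary example choices.

```c
#include <stdint.h>
#include <math.h>
#include <stdio.h>

/* Illustrative symmetric quantizer: each layer is assigned its own
 * bit width b, e.g. 16, 8, or 4 bits as supported by STAR-MAC. */
static int32_t quantize(float x, float scale, int bits) {
    int32_t qmax = (1 << (bits - 1)) - 1;   /* 127 for 8-bit, 7 for 4-bit */
    int32_t q = (int32_t)lroundf(x / scale);
    if (q > qmax)      q = qmax;            /* clip to the signed range */
    if (q < -qmax - 1) q = -qmax - 1;
    return q;
}

int main(void) {
    /* The same weight quantized at the three precisions; scales are
     * hypothetical and would normally be calibrated per layer. */
    float w = 0.37f;
    printf("16-bit: %d, 8-bit: %d, 4-bit: %d\n",
           quantize(w, 1.0f / (1 << 14), 16),
           quantize(w, 1.0f / (1 << 6), 8),
           quantize(w, 1.0f / (1 << 2), 4));
    return 0;
}
```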
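Sum-Together and Sum-Apart refer to two subword-parallel packing identities: in the first, the subword products of one wide multiply land in the same output field and sum into a dot product; in the second, they land in separate fields and stay independent. The following C sketch, our own software emulation with unsigned 4-bit subwords rather than the STAR-MAC datapath itself, demonstrates both identities; the field width K and guard-bit choice are assumptions for this toy case.

```c
#include <stdint.h>
#include <stdio.h>
#include <assert.h>

/* K >= 9: two 4x4-bit products sum to at most 450 < 2^9, so a field
 * width of 10 bits leaves a guard bit and no cross-field carries. */
#define K 10u

int main(void) {
    uint32_t a0 = 13, a1 = 7, b0 = 9, b1 = 15;  /* 4-bit operands */

    /* Sum-Together (ST): pack the two operand pairs in opposite orders
     * so the middle field of one wide product is the dot product
     * a0*b0 + a1*b1, ready to be accumulated in a single MAC step. */
    uint32_t A   = a0 | (a1 << K);
    uint32_t B   = b1 | (b0 << K);
    uint32_t dot = (A * B >> K) & ((1u << K) - 1);
    assert(dot == a0 * b0 + a1 * b1);

    /* Sum-Apart (SA): share one operand so a single wide multiply
     * yields two independent products, kept apart in separate fields. */
    uint32_t P  = (a0 | (a1 << K)) * b0;
    uint32_t p0 = P & ((1u << K) - 1);  /* a0*b0 */
    uint32_t p1 = P >> K;               /* a1*b0 */
    assert(p0 == a0 * b0 && p1 == a1 * b0);

    printf("ST dot = %u, SA products = %u, %u\n", dot, p0, p1);
    return 0;
}
```

In hardware, the same reuse of one wide multiplier array is what lets a single unit serve dot-product-style layers (ST) and layers that broadcast one operand over several others (SA); the sketch only checks the underlying arithmetic, not signed handling or accumulation.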