The demand for efficient edge machine learning (ML) applications has driven interest in Microcontroller Unit (MCU)-based TinyML solutions, especially with the rise of the RISC-V Instruction Set Architecture. While MCUs are power-efficient, their limited resources challenge the deployment of complex ML models. Mixed-Precision Quantization (MPQ) can achieve the best trade-off between model size, energy consumption, and accuracy by using different precisions for weights and activations across model layers. However, MCU-class processors often lack hardware support for MPQ. We present an end-to-end flow, from training to hardware deployment, designed to efficiently run Mixed-Precision Quantized Neural Networks (MP-QNNs) on MCU-class processors. Central to our approach is STAR-MAC, a precision-scalable Multiply-and-Accumulate unit that supports flexible MPQ operations on 16-, 8-, and 4-bit integer data. STAR-MAC combines two subword-parallel techniques, Sum-Together and Sum-Apart, within a unified multiplier architecture that reconfigures efficiently for Fully-Connected, 2D Convolution, and Depth-wise layers. We integrate STAR-MAC into the low-power RISC-V Ibex core and validate our flow on a Field-Programmable Gate Array-based System-on-Chip setup. Inference results on the four MLPerf Tiny MP-QNN models, deployed with our modified TensorFlow Lite for Microcontrollers (TFLM) flow, show an average latency reduction of 68% with respect to their 8-bit counterparts running on the standard TFLM runtime, a 27% average reduction in flatbuffer size, and little to no accuracy drop. Synthesis in 28-nm CMOS technology indicates limited area and power overheads over the original Ibex (+11.2% and +8.5%, respectively). We open-source our framework to encourage further development of MP-QNNs on RISC-V MCUs.
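To make the two subword-parallel modes concrete, the following minimal C sketch models how a 2x8-bit configuration could behave: Sum-Together adds the packed partial products into a single accumulator, as in the dot products of Fully-Connected and 2D Convolution layers, while Sum-Apart keeps them in separate accumulators, as in Depth-wise layers where channels do not mix. The function names and the 2x8-bit packing here are illustrative assumptions, not the actual STAR-MAC datapath, which also covers 16- and 4-bit operands.

```c
#include <stdint.h>
#include <stdio.h>

/* Software model of the two subword-parallel MAC modes (illustrative;
 * the real STAR-MAC performs these in one reconfigurable multiplier). */

/* Sum-Together: two 8-bit products are reduced into ONE accumulator,
 * matching the inner dot product of Fully-Connected / Conv2D layers. */
static int32_t mac_sum_together(int8_t a0, int8_t a1,
                                int8_t b0, int8_t b1,
                                int32_t acc)
{
    return acc + (int32_t)a0 * b0 + (int32_t)a1 * b1;
}

/* Sum-Apart: the same two products are kept in SEPARATE accumulators,
 * matching Depth-wise convolution, where output channels are independent. */
static void mac_sum_apart(int8_t a0, int8_t a1,
                          int8_t b0, int8_t b1,
                          int32_t acc[2])
{
    acc[0] += (int32_t)a0 * b0;
    acc[1] += (int32_t)a1 * b1;
}

int main(void)
{
    int32_t st = mac_sum_together(3, -2, 5, 7, 0);  /* 3*5 + (-2)*7 = 1 */
    int32_t sa[2] = {0, 0};
    mac_sum_apart(3, -2, 5, 7, sa);                 /* {15, -14} */
    printf("ST: %d  SA: {%d, %d}\n", (int)st, (int)sa[0], (int)sa[1]);
    return 0;
}
```

In hardware, both modes would share the same partial-product array and differ only in how the final reduction tree is steered, which is what makes a unified, precision-scalable multiplier attractive for MCU-class cores.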