The increasing complexity of Deep Neural Network (DNN) models, coupled with rising demands for energy efficiency and computational speed in edge devices, necessitates a reevaluation of conventional numerical representations and computational strategies. Traditional DNN quantization, which often employs low-bit integers because of their simple fixed-point hardware implementation, applies uniform granularity across all values; this approach can be inefficient when certain data points require finer granularity to maintain accuracy. In contrast, floating-point (FP) quantization provides greater flexibility in bit allocation, enabling finer granularity where it is needed. This paper proposes a mixed-precision FP quantization scheme that uses a genetic algorithm to find the optimal precision for each layer. To minimize rounding errors, stochastic rounding is incorporated, and layer-specific exponent bias adjustments are applied to improve the precision of the data representation. Experimental results from applying the proposed mixed-precision FP quantization to the YOLOv2-tiny model show a 2.9 times reduction in energy consumption per image and a 50% reduction in memory requirements, with only a negligible 0.13% loss in mean Average Precision (mAP) on the VOC dataset, compared to Bfloat16, a well-known low-cost floating-point format.
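To illustrate the stochastic rounding component mentioned above, the following Python sketch quantizes values to a generic low-precision FP format with configurable mantissa width, exponent width, and exponent bias; the function name and all parameter defaults are illustrative assumptions, not the authors' implementation, and subnormal handling is simplified.

```python
import numpy as np

def stochastic_round_fp(x, mantissa_bits=4, exp_bits=3, exp_bias=7, rng=None):
    """Quantize x to a low-precision float format using stochastic rounding.

    Each value is snapped to one of its two neighboring representable values;
    rounding up happens with probability proportional to the distance from the
    lower neighbor, so the rounding error is zero in expectation.
    (Illustrative sketch only; format parameters are assumptions.)
    """
    rng = np.random.default_rng() if rng is None else rng
    x = np.asarray(x, dtype=np.float64)
    sign = np.sign(x)
    mag = np.abs(x)

    # Per-value binade exponent, clamped to the representable normal range.
    min_exp = 1 - exp_bias                      # smallest normal exponent
    max_exp = (2**exp_bits - 2) - exp_bias      # largest finite exponent
    exp = np.clip(np.floor(np.log2(np.where(mag > 0, mag, 1.0))), min_exp, max_exp)

    # Grid spacing (ulp) within that binade, then probabilistic rounding.
    ulp = 2.0 ** (exp - mantissa_bits)
    lower = np.floor(mag / ulp) * ulp
    prob_up = (mag - lower) / ulp
    rounded = lower + ulp * (rng.random(mag.shape) < prob_up)

    # Saturate at the largest representable magnitude.
    max_val = (2.0 - 2.0**-mantissa_bits) * 2.0**max_exp
    return sign * np.minimum(rounded, max_val)
```

Under this sketch, a layer-specific exponent bias adjustment corresponds to passing a different `exp_bias` per layer, which shifts the representable range toward that layer's value distribution without changing the bit width.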