On-device DNN training faces constraints in storage capacity and energy supply. Existing works primarily focus on optimizing the training of convolutional and batch normalization (BN) layers to improve the compute-to-communication (CTC) ratio and to reduce the energy cost of off-chip memory access. However, training activation layers remains challenging because derivative calculations require additional off-chip memory accesses. This paper proposes MASL-AFU, an architecture designed to accelerate the activation layer in on-device DNN training. MASL-AFU leverages non-uniform piecewise linear (NUPWL) functions to speed up forward propagation (FP) through the activation layer. During backward propagation (BP), derivatives are retrieved from lookup tables (LUTs), eliminating the need to re-fetch the original input data. By storing LUT indices instead of the original activation inputs, MASL-AFU both reduces the volume of off-chip memory access and accelerates it. Compared with other activation function units, MASL-AFU offers up to a 5.8× improvement in computational and off-chip memory access efficiency. Additionally, MASL-AFU incorporates two dimensions of scalability: data precision and the number of LUT entries. These scalable, hardware-friendly methods improve MASL-AFU's area efficiency by up to 3.24× and its energy efficiency by up to 3.85×.
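The central idea described above, evaluating a NUPWL activation in the forward pass and retrieving derivatives in the backward pass from the same LUT via stored segment indices, can be illustrated in software. The following Python sketch is purely illustrative and makes its own assumptions (the class name, the 3-segment table, and the uint8 index width are hypothetical); it is not the MASL-AFU hardware design or the authors' implementation.

```python
# Minimal sketch (assumption-laden, not the authors' design): a LUT-based
# activation layer that keeps per-element segment indices instead of the
# activation inputs, so the backward pass reads only slopes from the table.
import numpy as np

class NUPWLActivation:
    """Non-uniform piecewise-linear approximation of an activation function.

    `breakpoints` are hypothetical non-uniform segment boundaries; `slopes`
    and `intercepts` define y = slope * x + intercept on each segment.
    """

    def __init__(self, breakpoints, slopes, intercepts):
        self.breakpoints = np.asarray(breakpoints)   # shape (S-1,)
        self.slopes = np.asarray(slopes)             # shape (S,)
        self.intercepts = np.asarray(intercepts)     # shape (S,)
        self.saved_idx = None                        # LUT indices kept for BP

    def forward(self, x):
        # FP: locate the segment each input falls into and evaluate the
        # linear piece. Only the small index is retained, not x itself.
        idx = np.searchsorted(self.breakpoints, x)
        self.saved_idx = idx.astype(np.uint8)        # narrow index, not full-precision x
        return self.slopes[idx] * x + self.intercepts[idx]

    def backward(self, grad_out):
        # BP: the derivative of a linear segment is its slope, so it is read
        # back from the LUT via the stored index; no input re-fetch is needed.
        return grad_out * self.slopes[self.saved_idx]


# Example: a coarse 3-segment "hard tanh"-style table (illustrative values).
act = NUPWLActivation(breakpoints=[-1.0, 1.0],
                      slopes=[0.0, 1.0, 0.0],
                      intercepts=[-1.0, 0.0, 1.0])
y = act.forward(np.array([-2.0, 0.5, 3.0]))      # -> [-1.0, 0.5, 1.0]
dx = act.backward(np.array([1.0, 1.0, 1.0]))     # -> [ 0.0, 1.0, 0.0]
```

In this sketch the memory saving comes from storing a narrow index (here one byte per element) instead of the full-precision activation input, which mirrors the abstract's claim that LUT indices replace the original activation inputs between FP and BP.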