This work presents an end-to-end Residue Number System (RNS) Deep Neural Network (DNN) accelerator targeting edge-AI devices. The developed architecture translates the advantages that RNS offers for a single multiply-add operation into system-level power-efficiency gains. This is made possible by novel architectural features, such as the amortization of non-trivial RNS operations (base extension, activation, and scaling), and by bespoke RNS low-power techniques, such as a clock-gating scheme that exploits the periodic usage of the non-trivial RNS units, and voltage scaling that exploits the ability of RNS to meet clock-frequency constraints at lower supply voltages. A systematic analysis of the trade-offs between hardware performance metrics (area, power, throughput) of the RNS implementation across various operation scenarios and RNS bases, together with comparisons against a conventional positional binary implementation, guides the optimal selection of design-space parameters. Silicon power measurements on 22-nm FDSOI prototype chips corroborate the theoretical analysis and simulation results, demonstrating considerable benefits of RNS-based DNN processing. These measurements prove that RNS not only increases the maximum achievable frequency of the arithmetic circuits, but also delivers 1.33× more energy-efficient processing than conventional binary counterparts. Compared to the state-of-the-art RNS-based DNN accelerator, the proposed architecture is shown to be 9× more power efficient, reaching a peak power efficiency of 4.92 TOPS/W.
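To make the core idea concrete, the following is a minimal sketch (not taken from the paper) of how an RNS multiply-add decomposes into small, independent modular channels. The moduli set {2^k − 1, 2^k, 2^k + 1} used here is a common balanced RNS base; the function names are illustrative assumptions, not the accelerator's actual interface.

```python
from math import prod

# Assumed balanced RNS base {2^8 - 1, 2^8, 2^8 + 1}; the moduli are
# pairwise coprime, so the dynamic range is their product (16,776,960).
MODULI = (255, 256, 257)

def to_rns(x):
    """Encode an integer as its residues modulo each channel."""
    return tuple(x % m for m in MODULI)

def rns_mac(acc, a, b):
    """Multiply-accumulate performed independently per channel:
    each residue needs only a small (8/9-bit) multiplier, and no
    carries propagate between channels -- the source of the speed
    and energy advantage for a single multiply-add."""
    return tuple((r + x * y) % m for r, x, y, m in zip(acc, a, b, MODULI))

def from_rns(residues):
    """Decode back to a conventional integer via the Chinese
    Remainder Theorem (a non-trivial RNS operation, like base
    extension and scaling, whose cost the architecture amortizes)."""
    M = prod(MODULI)
    x = 0
    for r, m in zip(residues, MODULI):
        Mi = M // m
        x += r * Mi * pow(Mi, -1, m)  # pow(..., -1, m): modular inverse
    return x % M

acc = to_rns(0)
acc = rns_mac(acc, to_rns(123), to_rns(456))
assert from_rns(acc) == 123 * 456
```

Because each channel's arithmetic is narrow and carry-free across channels, hardware can clock the modular datapaths faster, or reach the same frequency at a lower supply voltage, which is the effect the abstract's voltage-scaling scheme exploits.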