Deep convolutional neural networks (CNNs) generate a large volume of inter-layer data during inference, which demands substantial on-chip memory and off-chip bandwidth. To relieve this memory constraint, this paper proposes an accelerator equipped with a compression technique that reduces the inter-layer data by removing both intra- and inter-channel redundancy. Principal component analysis (PCA) is employed in the compression process to concentrate inter-channel information into a small number of components. Within each feature map, spatial differencing, truncation, and reconfigurable bit-width coding are applied to eliminate intra-channel redundancy. Moreover, a dedicated data arrangement is introduced to enhance data continuity, which improves the PCA analysis and the compression performance. A CNN accelerator with the proposed compression technique is designed to support on-the-fly compression by pipelining reconstruction, CNN computation, and compression. The prototype accelerator is implemented in 28-nm CMOS technology and achieves a peak throughput of 819.2 GOPS and an energy efficiency of 3.75 TOPS/W at a power consumption of 218.5 mW. Experiments show that the proposed compression technique achieves a compression ratio of 21.5%–43.0% (8-bit mode) and 9.8%–19.3% (16-bit mode) on state-of-the-art CNNs with negligible accuracy loss.
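
The abstract describes a two-stage compression pipeline: PCA across channels to remove inter-channel redundancy, followed by spatial differencing and truncation within each feature map to remove intra-channel redundancy. The following is a minimal numerical sketch of that idea, not the paper's hardware implementation; the tensor shape, the number of retained components `k`, the coarse truncation step (a stand-in for the reconfigurable bit-width coding), and the function names are hypothetical choices for illustration only.

```python
# Illustrative sketch of PCA-based inter-channel compression plus intra-channel
# spatial differencing and truncation. All parameters here are assumptions.
import numpy as np

def compress(fmap, k, trunc_bits=2):
    """fmap: (C, H, W) inter-layer activation tensor; k: retained PCA components."""
    C, H, W = fmap.shape
    X = fmap.reshape(C, -1).astype(np.float32)        # channels x pixels

    # Inter-channel step: PCA over channels (pixels act as samples).
    mean = X.mean(axis=1, keepdims=True)
    Xc = X - mean
    cov = Xc @ Xc.T / Xc.shape[1]                      # C x C channel covariance
    eigvals, eigvecs = np.linalg.eigh(cov)
    basis = eigvecs[:, ::-1][:, :k]                    # top-k principal directions
    scores = basis.T @ Xc                              # k x (H*W), energy concentrated

    # Intra-channel step: spatial differences along each row, then coarse
    # truncation of the residuals.
    maps = scores.reshape(k, H, W)
    diffs = np.concatenate([maps[:, :, :1],
                            np.diff(maps, axis=2)], axis=2)
    truncated = np.round(diffs / (1 << trunc_bits)) * (1 << trunc_bits)
    return truncated, basis, mean

def decompress(truncated, basis, mean, shape):
    """Invert the pipeline: undo spatial differencing, then project back to channels."""
    C, H, W = shape
    maps = np.cumsum(truncated, axis=2)                # undo spatial differencing
    X = basis @ maps.reshape(basis.shape[1], -1) + mean
    return X.reshape(C, H, W)

# Example: compress a random activation tensor, keeping 16 of 64 channels'
# worth of principal components, and check the reconstruction error.
fmap = np.random.randn(64, 14, 14).astype(np.float32) * 8
comp, basis, mean = compress(fmap, k=16)
rec = decompress(comp, basis, mean, fmap.shape)
print("reconstruction RMSE:", float(np.sqrt(np.mean((fmap - rec) ** 2))))
```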