Hongyi Li

and 10 more

The rapid development of deep learning has propelled many real-world artificial intelligence (AI) applications. Many of these applications integrate multiple neural network (multi-NN) models to provide different functionalities. Although a number of multi-NN acceleration technologies have been explored, few fully deliver the flexibility and scalability required by emerging and diverse AI workloads, especially on mobile devices. Among these technologies, homogeneous multi-core architectures have great potential to support multi-NN execution by leveraging decentralized parallelism and intrinsic scalability. However, the advantages of multi-core systems are underexploited because they adopt the bulk synchronous parallel (BSP) model, which handles the diversity of multi-NN workloads inefficiently. This paper reports a hierarchical multi-core architecture with asynchronous parallelism that enhances multi-NN execution for higher performance and utilization. Its theoretical foundation is the hierarchical asynchronous parallel (HASP) model, which establishes a programmable, grouped, dynamic synchronous-asynchronous framework for multi-NN acceleration. HASP can be implemented on a typical multi-core processor with minor modifications. We further developed a prototype chip to validate the hardware effectiveness of this design. We also developed a mapping strategy that combines spatial partitioning and temporal tuning, which allows the proposed architecture to improve resource utilization and throughput simultaneously.