Robert Kent

and 1 more

We have recently introduced a general system for building fast single-stage hardware N-sorters and N-filters, with N>=3. When N-sorter/N-filters designed with this system are constructed using a design logic block in the FPGA central to the Amazon AWS EC2 F1 instance, they were shown to be significantly faster than the prior state-of-the-art sorting devices, which are networks of 2-sorters and 2-filters. Here we show that N-sorter/N-filters are even faster when they are implemented using carry chain design logic blocks, featuring lookahead acceleration and dedicated routing, also found in the same FPGA product. A new 32-bit carry chain 7-sorter operates in 1.469 nS, a speedup of 1.23 versus our original state-of-the-art 7-sorter. Faster and larger N-sorters and much larger N-max/N-min filters are constructed when product term splitting and a new Sum-of-Products output multiplexer equation are added to the general design system, and then combined with carry chain logic. While we were previously limited to sorting 10 or fewer input values, we now fully sort 16 32-bit values in 2.024 nS, a speedup of 4.61 versus a comparable 2-sorter network. Our largest prior max pooling 32-bit N-max filter was a 9-max 3x3 2D image filter, but we have now produced a 125-max 5x5x5 3D video filter that operates in 2.075 nS, a speedup of 3.43 versus a 2-max network. Using 32-max single-stage filters, a 32-bit 1024-max 2-stage network operates in 3.557 nS, less than a 280 MHz clock’s period, and with a 2.85 speedup versus the equivalent 10-stage 2-max network.

Robert Kent

and 1 more

In hardware such as FPGAs, Kenneth Batcher’s Odd-Even Merge Sort and Bitonic Merge Sort are the state-of-the-art methodologies used to quickly sort a list of more than 16 input values. Both sorting networks feature merges of 2 sorted input lists into a single sorted output list. For both, a full sort of 64 and 512 input values requires 21 and 45 serial stages, respectively. Multiway merge sorting networks described here require significantly fewer serial stages. For example, 8-way merge networks fully sort 64 and 512 input values in 9 and 20 serial stages, less than half the number of the respective 2-way networks. When the multiway merge sorting networks utilize the single-stage N-sorters recently defined by the authors, they are considerably faster than Batcher’s networks. In the AMD-Xilinx Ultrascale+ xcvu9p FPGA, the two 8-way merge networks have speedups of 1.85 and 1.74 versus the comparable 2-way networks. A fully pipelined 3-way merge network in this FPGA is capable of fully sorting 500 million lists of 729 unsorted 32-bit values in one second. In software, multiway merge methods are used to find the median of certain pixel rectangles in images, since the median can be determined in fewer stages than are required to fully sort the rectangle. However, the software still requires a series of many 2-sorter operations to find a median. These multiway merge median methods are dramatically sped up in hardware, where the authors’ new single-stage N-sorters and N-filters operate in parallel in each stage of the merge process.