Achronix’s innovative machine learning processor (MLP) breaks through traditional FPGA timing performance bottlenecks
Author: Yang Yu, Senior Field Application Engineer at Achronix
Introduction: This article will focus on describing example applications of AlexNet-based 2D convolution kernels.
MLP_Conv2D is a fully functional design that convolves a 2D input image with multiple kernels simultaneously. The design takes full advantage of the MLP and BRAM blocks, with each MLP performing 12 int8 multiplications per cycle. Additionally, both the MLP columns and the BRAM blocks are cascaded to pass image data efficiently, allowing multiple kernels to be processed simultaneously.
The design uses NoC Access Points (NAPs) to read data from and write data to the Network on Chip (NoC). The NoC connects to the GDDR6 controller in the Speedster7t device, and through it to the external memory.
Although the MLP_Conv2D design was originally configured for the AlexNet image and kernel sizes, 2D convolution is a general procedure, so the design can be reconfigured and adapted to many different 2D convolution methods.
The general principle of 2D convolution is to pass a kernel (a 2D matrix) over an image (itself another 2D matrix). For each computation, the kernel is centered on a pixel of the input image, and each kernel value (called a weight) is multiplied by the pixel it is currently aligned with. The sum of these multiplications is the convolution result for that image pixel. The kernel then moves to the next pixel and the process repeats.
With trained kernels, 2D convolution produces an output image that highlights specific features of the input image, such as vertical lines, horizontal lines, diagonal lines at varying angles, and curves of varying radii. These features can then be fed into further processing layers (including other 2D convolutions) and eventually identified (usually in software) as specific objects.
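The procedure described above can be sketched in a few lines of plain Python. This is only an illustration of the general principle, not the MLP_Conv2D implementation: the function name `conv2d`, the tiny 4 × 4 image, and the `vertical_edge` kernel are all made up for the example. The sketch anchors the kernel at its upper left corner rather than its center, uses no padding, and does not flip the kernel, which is the usual convention in machine learning.

```python
def conv2d(image, kernel, stride=1):
    """Slide `kernel` over `image` (both lists of rows), without padding."""
    k_rows, k_cols = len(kernel), len(kernel[0])
    out_rows = (len(image) - k_rows) // stride + 1
    out_cols = (len(image[0]) - k_cols) // stride + 1
    result = []
    for r in range(out_rows):
        row = []
        for c in range(out_cols):
            acc = 0
            # Multiply every weight by the pixel it is aligned with and sum.
            for i in range(k_rows):
                for j in range(k_cols):
                    acc += kernel[i][j] * image[r * stride + i][c * stride + j]
            row.append(acc)
        result.append(row)
    return result


# Hypothetical test data: a small image and a vertical-edge kernel.
image = [[1, 2, 3, 0],
         [4, 5, 6, 0],
         [7, 8, 9, 0],
         [0, 0, 0, 0]]
vertical_edge = [[1, 0, -1],
                 [1, 0, -1],
                 [1, 0, -1]]

print(conv2d(image, vertical_edge))  # prints [[-6, 15], [-4, 13]]
```

Each output value is one multiply-accumulate pass of the kernel over a 3 × 3 patch; a larger `stride` simply skips patch positions.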
Therefore, 2D convolution processing should not be viewed as a complete solution for image recognition, but as a single key component in the chain of processing operations.
The challenge with 2D convolution is the number of multiplications required, which is where the dedicated multiplier array in the MLP comes in. For the AlexNet configuration, each kernel has 11 × 11 = 121 weight values. However, the convolution is actually 3D because the input image has three layers (RGB), so a set of kernels needs 121 × 3 = 363 multiplications to produce a single output result. The AlexNet input image is 227 × 227, and the convolution uses a stride of 4 (the kernel is shifted by four pixels between computations). This process produces an output matrix of 54 × 54 = 2916 results. Therefore, 363 × 2916 = 1,058,508 multiplications are required for an image; that is, more than a million multiply-accumulate operations are required to process an image. The dynamic schematic diagram of a single kernel performing 2D convolution is as follows:
Figure 1 Dynamic schematic diagram of 2D convolution performed by a single Kernel
MLP_Conv2D is designed to apply 60 kernels to an image at a time, performing more than 60 million multiply-accumulate operations per image.
The MLP_Conv2D design can run at 750 MHz. A single MLP can convolve a single 227 × 227 RGB input image with an 11 × 11 kernel in 137 µs, equivalent to 15.4 GOPS (counting both multiplies and adds). An instance of MLP_Conv2D consists of 60 MLPs running in parallel, which convolve the input image with 60 kernels simultaneously, equivalent to 924 GOPS. Finally, instantiating up to 40 MLP_Conv2D instances in a single device, each transferring data to and from GDDR6 memory via its own NAPs, enables a combined performance of up to 37 TOPS, equivalent to 288,000 images per second (this design is mainly for convolution kernels).
MLP_Conv2D is designed around the MLP and BRAM blocks and uses their respective internal cascade paths. Likewise, the NAPs allow data to be routed directly to and from external memory. As a result, very little additional logic or routing is required, as the following utilization tables show:
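The operation counts quoted above can be reproduced with a few lines of arithmetic. In this sketch, the constant names are illustrative, and the 54 × 54 output size is taken directly from the text rather than derived from the image size and stride.

```python
KERNEL_SIZE = 11   # AlexNet first-layer kernel is 11 x 11
LAYERS = 3         # RGB
OUTPUT_SIZE = 54   # output matrix edge, as stated in the article
BATCH = 60         # kernels processed per MLP_Conv2D instance

mults_per_result = KERNEL_SIZE * KERNEL_SIZE * LAYERS   # 363
results_per_image = OUTPUT_SIZE * OUTPUT_SIZE           # 2916
macs_per_kernel = mults_per_result * results_per_image  # 1,058,508
macs_per_batch = macs_per_kernel * BATCH                # 63,510,480

print(f"{macs_per_kernel:,} MACs for one kernel")
print(f"{macs_per_batch:,} MACs for {BATCH} kernels")
```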
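As a rough cross-check of these throughput figures, the sketch below counts each multiply-accumulate as two operations and divides by the quoted 137 µs. The variable names are illustrative only. The computed values come out slightly above the quoted 15.4 GOPS and 924 GOPS, which suggests the quoted per-MLP figure was truncated before being scaled up.

```python
MACS_PER_KERNEL = 363 * 2916   # multiply-accumulates for one kernel on one image
SECONDS_PER_IMAGE = 137e-6     # quoted time for one MLP to process one image
MLPS_PER_INSTANCE = 60
INSTANCES_PER_DEVICE = 40

ops_per_kernel = 2 * MACS_PER_KERNEL  # one multiply plus one add per MAC
gops_per_mlp = ops_per_kernel / SECONDS_PER_IMAGE / 1e9
gops_per_instance = gops_per_mlp * MLPS_PER_INSTANCE
tops_per_device = gops_per_instance * INSTANCES_PER_DEVICE / 1e3

print(f"per MLP:      {gops_per_mlp:.2f} GOPS")
print(f"per instance: {gops_per_instance:.0f} GOPS")
print(f"per device:   {tops_per_device:.1f} TOPS")
```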
Figure 2 Resource usage of a single MLP_Conv2D instance
Figure 3 Resource usage of 40 MLP_Conv2D instances in parallel
Figure 4 MLP_Conv2D block diagram
Data Flow: Single MLP
Each MLP has an adjacent BRAM. In this design, the BRAM stores the kernel and passes it to the MLP multiple times. On initialization, the different kernels are read from the input NAP and written to their corresponding BRAMs. Each BRAM is configured as 72 bits wide on the write side and 144 bits wide on the read side. During operation, only 96 bits are used as kernel weights, read as 4 weights × 3 layers × 8 bits.
The initial image data is read from the NAP into the input FIFO, which stores the image as a series of lines. Although this input memory is listed as a FIFO, it acts as a repeatable FIFO because rows can be read from it multiple times. The memory is configured to be 144 bits wide, of which only 96 bits are used, and consists of two BRAM72Ks. Each word consists of 4 pixels × 3 layers × 8 bits. On initialization, enough lines are read to match the number of lines in the kernel plus the number of lines required for the vertical stride.
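To make the 4 pixels × 3 layers × 8 bits word format concrete, here is a minimal sketch of packing four RGB pixels into one 96-bit value and unpacking it again. The byte order within the word (first pixel in the least significant bits, R before G before B) is an assumption made for illustration; the article does not specify it. The function names are hypothetical.

```python
def pack_word(pixels):
    """Pack 4 RGB pixels (tuples of three 0..255 values) into a 96-bit integer.

    Assumed layout: pixel 0 occupies the lowest 24 bits, R in the lowest byte.
    """
    if len(pixels) != 4:
        raise ValueError("a word holds exactly 4 pixels")
    word = 0
    shift = 0
    for pixel in pixels:
        for value in pixel:
            if not 0 <= value <= 255:
                raise ValueError("each layer value must fit in 8 bits")
            word |= value << shift
            shift += 8
    return word


def unpack_word(word):
    """Inverse of pack_word: return a list of 4 (R, G, B) tuples."""
    values = [(word >> (8 * i)) & 0xFF for i in range(12)]
    return [tuple(values[i:i + 3]) for i in range(0, 12, 3)]


pixels = [(255, 0, 0), (0, 255, 0), (0, 0, 255), (17, 34, 51)]
word = pack_word(pixels)
print(f"{word:024x}")           # 24 hex digits = 96 bits
print(unpack_word(word) == pixels)
```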
Once the initial data and kernel are loaded, the computation begins.
First, the first image line is read from the input FIFO, for as many pixels as match the horizontal size of the kernel. As these pixels are read, the matching kernel weights are read as well. The MLP multiplies the two 96-bit streams together as 12 int8 value pairs per cycle and accumulates the results. The input FIFO then advances to the second row, and the process repeats until every row of the kernel has been multiplied by the corresponding pixels in the upper left corner of the input image. Throughout this process the MLP accumulates the result, so the final value is the 2D convolution of the upper left corner of the image with the kernel. The result is output from the MLP as a 16-bit value. The process is then repeated with the input FIFO advanced along the line by the number of pixels set by the STRIDE parameter (fixed at 4 in the current design). Each processing pass generates another result, until the appropriate number of results has been obtained horizontally.
Then the input FIFO is shifted down by STRIDE lines, and the process is repeated to generate the convolution results for the next set of lines in the input image. As the input FIFO moves down, its initial rows are no longer needed, so the next STRIDE rows of the input image are loaded in parallel with the MLP computation. In terms of bandwidth requirements on the external memory, the image and the kernels are each read from memory only once. They are then reused from their respective BRAMs, reducing the overall burden on external memory bandwidth, as shown in Figure 1.
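The word-by-word accumulation for a single output result can be modelled in software as follows. This is a behavioural sketch under stated assumptions, not the hardware implementation: it assumes an 11-pixel kernel row is zero-padded to three 4-pixel words, it does not model the reduction of the accumulator to a 16-bit output, and all function names are hypothetical. One call to `mlp_cycle` stands for one MLP cycle of 12 int8 multiplications.

```python
import random

PIXELS_PER_WORD = 4
LAYERS = 3
WORD_VALUES = PIXELS_PER_WORD * LAYERS  # 12 int8 values per 96-bit word


def to_words(row, n_words):
    """Flatten a row of (R, G, B) tuples into zero-padded 12-value words."""
    flat = [v for pixel in row for v in pixel]
    flat += [0] * (n_words * WORD_VALUES - len(flat))
    return [flat[i:i + WORD_VALUES] for i in range(0, len(flat), WORD_VALUES)]


def mlp_cycle(pixel_word, weight_word):
    """One modelled MLP cycle: 12 multiplications, summed."""
    return sum(p * w for p, w in zip(pixel_word, weight_word))


def convolve_one_result(image, kernel, top, left):
    """Accumulate one convolution result; return (result, cycles used)."""
    k = len(kernel)
    words_per_row = -(-k // PIXELS_PER_WORD)  # ceiling division
    acc, cycles = 0, 0
    for i in range(k):  # one kernel row at a time, as the input FIFO advances
        pixel_words = to_words(image[top + i][left:left + k], words_per_row)
        weight_words = to_words(kernel[i], words_per_row)
        for pw, ww in zip(pixel_words, weight_words):
            acc += mlp_cycle(pw, ww)
            cycles += 1
    return acc, cycles


# Random test data (hypothetical), checked against a direct triple sum.
random.seed(0)
K = 11
image = [[tuple(random.randint(0, 127) for _ in range(LAYERS))
          for _ in range(K + 4)] for _ in range(K + 4)]
kernel = [[tuple(random.randint(-128, 127) for _ in range(LAYERS))
           for _ in range(K)] for _ in range(K)]

acc, cycles = convolve_one_result(image, kernel, top=4, left=4)
direct = sum(image[4 + i][4 + j][l] * kernel[i][j][l]
             for i in range(K) for j in range(K) for l in range(LAYERS))
print(acc == direct, cycles)  # prints: True 33
```

Under the zero-padding assumption, each result takes 11 rows × 3 words = 33 modelled cycles.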
Data Flow: Multiple MLPs
A distinguishing feature of the MLP is the ability to cascade data and results from one MLP or BRAM to the next in the same column. MLP_Conv2D takes advantage of these cascade paths by placing the MLPs and their associated BRAMs in columns. When kernels are loaded into the BRAMs, a cascade path is used to pipeline the data to each BRAM, and the BRAM to which a kernel is written is selected using the BRAM block address mode.
During computation, incoming image data is cascaded along the column of MLPs so that each MLP receives the image data one cycle after its previous neighbor. At the same time, the BRAM read address that controls the kernel reads is cascaded along the BRAM column with the same one-cycle delay. In this way, each MLP receives the same image data and the same kernel read address one cycle after the MLP before it. The only computational difference between MLPs is that each associated BRAM holds different kernel data. The result is an image convolved with multiple kernels in parallel. The number of parallel convolutions is called BATCH.
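The one-cycle skew between neighbouring MLPs can be illustrated with a small cycle-by-cycle model. This is a simplified sketch with hypothetical names and toy data: every MLP sees the same stream of image words and the same read address, delayed by its position in the column, and differs only in the weights held in its own BRAM.

```python
def run_column(image_words, kernels):
    """Model a column in which MLP n sees each image word n cycles after MLP 0.

    image_words: list of words, each a list of pixel values.
    kernels: one entry per MLP; each is a list of weight words matching
             image_words in shape (the contents of that MLP's BRAM).
    Returns (accumulators, cycle on which each MLP finished).
    """
    batch = len(kernels)
    n_words = len(image_words)
    acc = [0] * batch
    finish_cycle = [None] * batch
    for cycle in range(n_words + batch - 1):
        for n in range(batch):
            idx = cycle - n  # cascaded read address, delayed by n cycles
            if 0 <= idx < n_words:
                acc[n] += sum(p * w for p, w
                              in zip(image_words[idx], kernels[n][idx]))
                if idx == n_words - 1:
                    finish_cycle[n] = cycle
    return acc, finish_cycle


# Toy data: two 2-value image words and three different kernels (BATCH = 3).
image_words = [[1, 2], [3, 4]]
kernels = [
    [[1, 1], [1, 1]],
    [[1, 0], [0, 1]],
    [[-1, -1], [-1, -1]],
]
results, finished = run_column(image_words, kernels)
print(results)   # prints [10, 5, -10]
print(finished)  # prints [1, 2, 3]
```

Each MLP produces the result for its own kernel, and each finishes exactly one cycle after the MLP before it.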
Data Flow: Computational Results
As mentioned before, each MLP produces a 16-bit result for each convolution of the kernel and image parts.
The MLPs are arranged in columns of 16, so each column generates a 256-bit word made up of the results of every MLP in that column. This 256-bit word is then written to the output NAP. This arrangement stores the convolution results in memory as layers of the same image, matching the input word arrangement, in which the three RGB layers are stored in a single input word.
Since the activation function can then be executed as 16 parallel instances on the full 256-bit result, this arrangement allows the convolved results to be processed in parallel in the activation layer. Likewise, once the 256-bit result is written back to memory via the output NAP, it can be read back into another 2D convolution circuit.
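The following sketch shows how sixteen 16-bit results could be packed into one 256-bit word and how an activation function could then be applied to all sixteen lanes. Several details here are assumptions made for illustration: the results are treated as signed two's-complement values, lane 0 is placed in the least significant bits, and ReLU is used as the activation function because the article does not name one. The function names are hypothetical.

```python
def pack_results(results):
    """Pack sixteen signed 16-bit results into one 256-bit integer."""
    if len(results) != 16:
        raise ValueError("expected exactly 16 results")
    word = 0
    for lane, value in enumerate(results):
        if not -32768 <= value <= 32767:
            raise ValueError("result does not fit in 16 bits")
        word |= (value & 0xFFFF) << (16 * lane)  # two's-complement encoding
    return word


def unpack_results(word):
    """Inverse of pack_results: recover sixteen signed 16-bit values."""
    values = []
    for lane in range(16):
        raw = (word >> (16 * lane)) & 0xFFFF
        values.append(raw - 0x10000 if raw & 0x8000 else raw)
    return values


def relu_word(word):
    """Apply ReLU to each of the 16 lanes (assumed activation function)."""
    return pack_results([max(0, v) for v in unpack_results(word)])


results = [100, -200, 300, -400, 0, 1, -1, 32767,
           -32768, 5, -5, 7, -7, 9, -9, 11]
word = pack_results(results)
print(unpack_results(word) == results)
print(unpack_results(relu_word(word)))
```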
Figure 5 MLP_Conv2D layout diagram
In the Speedster7t architecture, each NAP corresponds to 32 MLPs. The design is optimized to use two NAPs, one for read and one for write, so it can correspond to 64 MLPs.
However, the input and output FIFOs each require two BRAM72K memory blocks to create a 256-bit wide combined memory. Therefore, these memories consume four of the 64 available locations for data I/O.
The design is arranged as four columns of MLPs associated with two NAPs. The first and last columns each use 14 MLPs, leaving two MLP locations in each for the input and output FIFOs, respectively. The middle two columns use all 16 available MLPs. In the floorplan, the columns are arranged so that the first column (with the input FIFO memory at the bottom) is adjacent to the NAP to improve timing.
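The site budget for one instance follows directly from these numbers, as the short check below shows. The constant names are illustrative.

```python
MLPS_PER_NAP = 32
NAPS_PER_INSTANCE = 2          # one NAP for reads, one for writes
BRAMS_PER_FIFO = 2             # each FIFO is built from two BRAM72Ks
FIFOS = 2                      # input FIFO and output FIFO
COLUMN_MLPS = [14, 16, 16, 14] # first and last columns give up two sites each

available_sites = MLPS_PER_NAP * NAPS_PER_INSTANCE  # 64
fifo_sites = BRAMS_PER_FIFO * FIFOS                 # 4
batch = sum(COLUMN_MLPS)                            # 60

print(available_sites, fifo_sites, batch)  # prints 64 4 60
```

The 60 MLPs left after reserving four sites for the FIFOs match the BATCH=60 configuration described in the article.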
An example of the actual layout of the design using 60 MLPs (Batch=60) is shown below (with routes highlighted):
Figure 6 60 MLP layouts
When 40 instances are used in a full-chip build, each instance uses its own NAPs to communicate with memory. The resulting design still achieves an FMax of 750 MHz while using all 80 NAPs in the chip and 94% of the MLPs and BRAM72Ks.
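The full-chip figures can be sanity-checked in the same way. Note that the total number of MLP sites used below is an assumption derived from the article's own numbers (80 NAPs × 32 MLPs per NAP), not a figure quoted from a device datasheet.

```python
INSTANCES = 40
MLPS_PER_INSTANCE = 60
NAPS_PER_INSTANCE = 2
MLPS_PER_NAP = 32

naps_used = INSTANCES * NAPS_PER_INSTANCE   # 80
mlps_used = INSTANCES * MLPS_PER_INSTANCE   # 2400
mlp_sites = naps_used * MLPS_PER_NAP        # 2560 (assumed device total)
utilization = mlps_used / mlp_sites         # 0.9375

print(f"{naps_used} NAPs, {mlps_used} MLPs, {utilization:.0%} of MLP sites")
```

The computed 93.75% is consistent with the quoted figure of 94%.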
Figure 7 2400 MLP layouts
The next issue will introduce the floating-point architecture and performance of MLP with examples, so stay tuned.