# A full-parallel digital implementation for pre-trained NNs

Tamás Szabó, Lőrinc Antoni, Gábor Horváth, Béla Fehér

Technical University of Budapest, Department of Measurement and Information Systems, H-1521. Budapest, Muegyetem rkp. **9,** Bldg. R. 1./113. Hungary, Tel.: +36 1 463 2057, Fax.: +36 1 463 4112, E-mails: [szabo, antoni, horvath, feher]@miit.brne.hu

**Abstract** *In many applications the most significant advantages of neural networks come mainly from their parallel architectures ensuring rather high operation speed. The dificulties of parallel digital hardware implementation arise mostly from the high complexity* of *the parallel many-multiplier structure. This paper suggests a new bit-serial/parallel neural network implementation method for pre-trained networks. The method makes*  possible significant hardware cost savings. The proposed approach - which is based on the results of a previ*ously suggested method for eficient implementation of digital filters* - *uses bit-serial distribvted aritlimetic. The eficient implementation* of *a matrix-vector multiplier is based on an optimization algorithm which utilizes the advantages of CSD (Canonic Signed Digit) encoding and bit-level pattern coincidences. The resulting architecture performs full-precision computation and allows high-speed bit-level pipe-line operation. The proposed approach seems to be a promising one for FPGA and ASIC realization* of *pre-trained neural networks*  and can be integrated into automatic neural network design environments. However, these implementation *methods can be useful in many other fields of digital signal processing.* 

## **1 Introduction and Motivation**

Neural Networks (NNs) mean very attractive solutions for numerous real-world computational problems of digital signal processing. However, the lack of suitable implementation impedes the spread of NNs iri everyday systems. There is a very wide scale of attempts using different implementation mediums. The main research directions are: analog neurochips, digital circuits, mixed signal (pulse-coded or other hybrid hardware) implementations and some optical devices [1] [2]. Each solution has its strengths and weaknesses.

Both analog and digital implementations have advantages and disadvantages. Analog implementations are quite sensitive to noise and thermal drift, and because of the variance of the parameters only a 5-8 bit precision can be attained. On the other hand adding, multiplication, non-linear functions etc. are relatively easy to implement using analog technique and their implementation operates at high speed.

Digital implementations assure arbitrary precision and they are not sensitive to noise and thermal drift. Nevertheless their operating speed is quite low and their realization is expensive **as** they need large chipsurface.

Our goal was to examine digital NN implementation techniques and develop new methods that exploit the advantages of digital implementations, but allow efficient high speed operation at reasonable hardware cost. The proposed methods are intended for dedicated (single chip) neurochip designs. However, these solutions are not only for NNs, but they can be very suitable to implement various digital signal processing algorithms.

The focal points of this paper are:

- to find an efficient digital implementation for fixed NNs,
- *0* to develop an optimization algorithm for a bit-serial matrix-vector multiplier.

## **2 Background**

There are numerous NN families in the literature. They have different structures, but a common feature is that they consist of neurons in a layered structure, typically using one or two hidden layers. **A** layer is composed of neurons working on the same inputs and generating different outputs. **.4** neuron produces a function of the weighted sum of its inputs (a function over a linear combination of the input). This function is called activation or squashing function, which is typically some monotone nonlinear function (e.g. sigmoid, tangent hyperbolic, etc.).

The operation of a neuron can be interpreted as a function over a vector-vector scalar product (MIS0 system), where one of the vectors is composed by the inputs and the other one is the weight vector. If a whole NN layer is considered then it is a MIMO system where the input and output are also vectors. The projection between the input and output of an NN can be described as a function of a matrix-vector product:

**<sup>&#</sup>x27;This work was supported by the Hungarian Fund for Scientific Research (OTKA) under contract TO21003** 

$$
y = f(\mathbf{W}\mathbf{x}) \tag{1}
$$

where **x** and **y** are the input and output vectors, **W** is the weight matrix of the layer and function  $f(.)$ represents the activation function of the neurons. NNs have two basic operation modes: the learning phase and the recall phase. If adaptivity is not required in a system then it is enough to implement a fixed network working in recall phase (fixed transformation). The elements of **W** are constant in this case, contrary to the learning phase where the weights -the matrix **W-** change.

Here we deal with the implementation of constant matrix - variable vector multiplication for pre-trained NNs. We shall not deal with the implementation of the activation function in this paper. It is less difficult to implement the squashing function than the implementation of the matrix-vector multiplier, it can be done for example with Look-Up Tables (LUT) or a LUT-based interpolation technique.

One of the most essential property of the NNs is their inherent parallelism. Mainly this property is responsible for their high computational speed, power and fault tolerance. We can define full-parallel implementation in the following sense: the layer is computed full-parallel if all the products in every neuron are computed simultaneously. However, as it was mentioned, the implementation of digital parallel multipliers is very expensive. Usually the technical limits of the realizations do not allow full-parallel implementations because bit-level parallelism results in many connections that are difficult to implement. Besides multipliers, another bottleneck of the hardware realization of NNs is the high number of connections, because the implementation of wires consumes large chip surface. Bit-serial communication and computation is a good candidate to help us to overcome these problems. Bit-serial Computation reduces the number of necessary arithmetic elements and keeps the number of connections tractable. The reduction -respecting the number of necessary wires and arithmetic elements- is about one order of magnitude depending on the precision of the computation. Unfortunately, bit-serial computation decreases the operation speed, too, but we hope that the proposed full-parallel implementation compensates this speed fall.

We can conclude from the properties described above that in our implementation we should exploit the inherent parallelism of NNs and that a good candidate to implement the matrix-vector multiplier (where the matrix is constant and the vector is variable) should be a structure which processes its input vectors parallely in viewpoint of the elements of the vector, but uses bit-serial data-flow. Certainly, the output vector of the multiplication **(y)** is also available simultaneously in bit-serial manner. See Fig. 1.



Figure 1: Bit-serial matrix-vector multiplier

In the following section we will present an approach to implement the constant matrix-variable vector multiplier. The proposed approach decreases the redundancies of the implementation in order to achieve efficient hardware.

# **3 The proposed constant matrix, variable vector multiplier architecture**

-4 corripletely new approach has been published in **[3]** for vector-vector inner product computation used in digital filter implementations. This approach does not require the storage of the constant coefficients in the product but "builds" them into the structure of the multiplier. This technique can be called as a special type of Distributed Arithmetic (DA) approach. In this paper we present the generalization of this inner product arithmetic for matrix-vector products and propose a new optimization algorithm for matrix-vector multipliers and a new application area for DA, namely the field of NNs. One of our previous papers **[4]**  deals also with a similar problem but the solution proposed there is less efficient than the current one.

It is possible to distinguish bit-serial/parallel and bit-parallel/parallel multipliers. Bit-serial/parallel multipliers read one operand bit-serially arid the other one bit-parallely, while bit-parallel/parallel units read both operands parallely. Bit-serial/parallel multipliers can be derived from bit-parallel/parallel multipliers by the projection of spatial coordinates to time domain scheduling. The resulting bit-serial/parallel multiplier is composed of one-bit serial adders, delay elements and one-bit multipliers (AND gates). This transformation can be done in several ways but there is no significant difference between the hardware cost of the resulting architectures. Our method uses LSB first, bit-serial/parallel arithmetic as a constant coefficient multiplier. The coefficients are built into the multiplier "netlist".

Here we review the methods we applied to reduce the hardware cost of the implementation. The first technique that helps us to realize this reduction is Canonic Signed Digit (CSD) encoding *[5].* It can be considered that if the coefficients are constant then it is not necessary to implement any hardware in the "0" digit positions of the constant numbers. The application of CSD encoding exploits this possibility. Besides this, we created an algorithm to attain bit-level optimization.

#### **3.1 The CSD representation**

CSD coding is a convenient method to reduce complexity in DSP hardware [5]. Allowing  $\{0, +1, -1\}$  in the representation of the coefficients the relative frequency of non-zero elements in CSD representation equals to  $B_c/3$  (against  $B_c/2$  in two's complement or binary data), where  $B_c$  is the number of bits (word length) of this representation. Nevertheless, usage of  $\pm 1$  values does not mean extra hardware cost because  $+1$  means bitserial adder while  $-1$  means bit-serial substractor and their hardware costs are the same. This implies that a constant multiplier can be implemented by less hardware using CSD coding while the dynamic range of the multiplier also slightly exparids using the same number of bits. (The largest and smallest numbers which can be represented in a normalized CSD code are **-4/3** and **4/3).** The CSD representation is unique, i.e. a given number may be represented in only one way. Let us see an example:  $-0.4414_{DEC} = 0.0 - 1001000 - 1_{CSD}$ that is  $-2^{-1}+2^{-4}-2^{-8}$ . Its CSD representation contains three non-zero elements while its two's complement representation is composed by five ones.

#### **3.2 Bit-level optimization**

The first idea was the application of CSD coding, the second one was bit-level reduction. To understand it Let us follow this simple example. Assume an  $N * M = 2 * 1$  dimension matrix-vector product:

$$
y = \begin{bmatrix} \frac{13}{32} \cdot 2E_C & -\frac{39}{32} \cdot 2E_C \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = \begin{bmatrix} 0 & 1 & 0 & -1 & 0 & 1 & CSD \\ -1 & 0 & -1 & 0 & 1 & 0 & CSD \end{bmatrix}^T \begin{bmatrix} x_1 \\ x_2 \end{bmatrix}
$$
(2)

This can be rewritten as:

$$
y = (2^{-1} - 2^{-3} + 2^{-5})x_1 + (-2^0 - 2^{-2} + 2^{-4})x_2 = -2^0 x_2 + (2^{-2} - 2^{-4})(2^1 x_1 - x_2) + 2^{-5} x_1
$$
 (3)

If we introduce  $x_3 = 2^1 x_1 - x_2$  we obtain:

 $\boldsymbol{y}$ 

e obtain:  
= 
$$
-2^0x_2 + 2^{-2}x_3 - 2^{-4}x_3 + 2^{-5}x_1
$$
 (4)

We can compare the original (corresponding to Eq. **3)** and the optimized multiplier **(Eq.** 4) structure on Fig. **2.** The difference is obvious. Both of them use CSD encoding but the reduced architecture performs the same operation using less building blocks. The original one contains six arithmetic and five delay elements while the reduced one needs only five arithmetic and three delay elements.



Figure **2:** Bit-serial vector-vector product implementations: a) original, b) reduced

We shall not examine the mathematical formalism of the bit-level optimization algorithm in detail in this paper. The description of the formalism can be found in **[3].** 

### **3.3 The optimization algorithm of the constant matrix, variable vector multiplier**

In this section we introduce an optimization algorithms which utilize the ideas of sections 3.1 and 3.2 for bit-serial matrix-vector multipliers. First we introduce some notations. Given a decimal number  $w_{i,j}$  in  $[-4/3, 4/3]$ . Its  $B_C$ -bit long CSD representation can be interpreted as a  $B_C$ -long vector  $h_{i,j}$ , the elements of which are the bits of the CSD number  $\mathbf{h}_{i,j} = \begin{bmatrix} h_{i,j}^0 2^0 & h_{i,j}^1 2^{-1} & \dots & h_{i,j}^{Bc-1} 2^{-B_c-1} \end{bmatrix}^{\text{U}}$  (e.g. if  $w_{i,j} = -0.875$ ,  $B_C = 8$ ,  $\mathbf{h}_{i,j} = [0. -10010000]$ ). Similarly, let us denote  $B_x$  the word le

Thus, we rewrite the matrix-vector multiplication of Eq. 1 to a series of vector-vector products:

$$
\begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_M \end{bmatrix} = \begin{bmatrix} f(\sum_{k=1}^N w_{1,k} x_k) \\ f(\sum_{k=1}^N w_{2,k} x_k) \\ \vdots \\ f(\sum_{k=1}^N w_{M,k} x_k) \end{bmatrix}
$$
 (5)

In this equation the partial matrix-vector products,  $\sum_{k=1}^{N} w_{i,k} x_k$  are actually inner products of the input vector **x** and a constant coefficient vector **wi. A** two-dimensional data structure in which the rows are the binary representations of the coefficient vector elements with an implicit + sign between the cell-array row elements can represent a one-dimensional coefficient vector. Eq. **6** shows this.

$$
\begin{bmatrix}\nw_{i,1} \\
w_{i,2} \\
\vdots \\
w_{i,N}\n\end{bmatrix}^T = \begin{bmatrix}\nh_1^i \\
h_2^i \\
\vdots \\
h_N^i\n\end{bmatrix} = \mathbf{H_i} = \begin{bmatrix}\nh_{i,1}^0 2^0 & h_{i,1}^1 2^{-1} & \cdots & h_{i,1}^{B_c-1} 2^{-B_c-1} \\
h_{i,2}^0 2^0 & h_{i,2}^1 2^{-1} & \cdots & h_{i,2}^{B_c-1} 2^{-B_c-1} \\
\vdots & \vdots & & \vdots \\
h_{i,N}^0 2^0 & h_{i,N}^1 2^{-1} & \cdots & h_{i,N}^{B_c-1} 2^{-B_c-1}\n\end{bmatrix}
$$
\n(6)

**As** we have seen in the previous section identical or contradictory bit patterns have to be found in these **His.** It can be seen that longer matching bit patterns result in a higher reduction rate, so we prefer the reduction step considering the longest bit-pattern. First of all, let us examine how these matching bit patterns can be found in a single **Hi** if we use CSD encoding.

#### **3.3.1 Identification of matching bit-patterns**

During this task we will exploit the feature of CSD encoding that in a CSD number there cannot be non-zero values on adjacent digits, that means we must search for patterns having non-zero values at the first and the last position. The shortest pattern of this kind is 3-bit long, for example **{lOl}.** The identification of bit-patterns must be carried out between the rows of **Hi,** after the partial product corresponding to the bit-pattern can be separated that means the introduction of a new variable as we have discussed it in section **3.2.** In our research we have focused on the identification of the identical bit-patterns only in one  $\mathbf{H}_{i}$ , we have not extended the algorithm to search for patterns in different **His.** This is because it would significaritly increase the searching space and the complexity of the optimization algorithm but would not surely yield better result. However, the reductions are performed between the different **His** if necessary as we will see later. It means that our algorithm makes reduction respecting the whole matrix-vector product in every step.

Thus, we must search for identical bit-patterns between the rows of **Hi.** When doing this we compare the rows  $q, r = 1...N$  of  $\mathbf{H_i}$  and store the result in so-called statistical matrices  $\mathbf{S_i}$ *ps.* The statistical matrices  $\mathbf{S}_i p(q, r)$  contain the largest number of non-zero elements at the same position of coinciding identical (or contradicting) bit patterns from the q-th and the p-position right-shifted  $r$ -th rows of the actual  $\mathbf{H}_i$  matrix with regard to the sign. The sign of the element of  $\mathbf{S}_i p(q, r)$  will be positive if the sign of the corresponding bit-pattern is the same, and negative if it is different. The diagonals of  $S_i p$  are cleared to ignore selfcoincidences.

With the statistical matrices we have determined the number of non-zero values of the corresponding bit-patterns in each row. In our algorithm, among the bit-patterns the one is chosen in each step which results the highest gain. From this point of view the algorithm is similar to the gradient-methods as it always chooses the steepest gradient. It is possible to find the optimal or nearly optimal solution with the  $\mathbf{S}_i p(q, r)$ statistical matrices and a  $C_{AGG}$  () cost-function. The cost-function must consider two other important factors, as the possible gain depends not only on the number of non-zero elements of the bit-pattern chosen, but:

- *0* on the gain the given bit-pattern results in the whole **Wx** matrix-vector multiplication (that means in all the  $H_i$ s)
- *0* on whether we must introduce additional delay units into the structure or not.

The delay units mentioned in the second point realize the possible  $2^D$  delay (delay by *D* clock cycles). We must consider that we may have realized a certain delay for  $x_k$  already in a preceding step of the optimization, so we should consider only the  $\Delta D_k$  difference between the existing and the desired delays. In this case the cost of the implementation of the needed delay can be described as:

$$
C_{\mathcal{D}}(p,k) = 1 + \Delta D_k C \tag{7}
$$

where  $C$  is a technology-depending parameter that represents the cost-ratio of the implementation of the additional delay unit(s) and the arithmetic operator (1-bit adder or substractor).

It is not sure that the pattern containing the most non-zero values must be chosen during the optimization. It can be possible that a pattern containing less non-zero values but involved in more partial products (that means in more  $H_i$ s) is more reasonable to choose as it results more gain, so the  $C_{AGG}$ ( ) cost-function can be chosen in many different ways like 8 or 9:

$$
C_{\mathcal{A}\mathcal{G}\mathcal{G}}(p,q,r,C) = absmax_{p,q,r} [\mathbf{S}_{i}p(q,r)] - C_{\mathcal{D}}(p,r), \qquad (8)
$$

$$
\mathcal{C}_{Agg}(p,q,r,C) = absmax_{p,q,r} \left[S_{i}p(q,r)\right] - \mathcal{C}_{\mathcal{D}}(p,r),
$$
\n
$$
\mathcal{C}_{Agg}(p,q,r,C) = absmax_{p,q,r} \left[\sum_{\forall i, S_{i}p(q,r)>0} S_{i}p(q,r), \sum_{\forall i, S_{i}p(q,r)<0} S_{i}p(q,r)\right] - \mathcal{C}_{\mathcal{D}}(p,r) \qquad (9)
$$
\nThe cost-function Eq. (8) always chooses the longest bit-pattern to separate with no respect how many times

a shorter pattern occurs. The cost-function Eq. **9** adds separately the number of non-zero values at the same position of the bit-patterns and after chooses the locally maximum gain that can be obtained.

There is one more important step of the algorithm that must be considered. One must approximate the number *p* (the difference between the position of the identical bit-patterns) to know how many  $S_i p$ -s must be generated. If we consider that in CSD representation the maximum number of non-zero values in a number is  $\frac{B_c}{2}$  we can determine an upper limit to the value of p. According to practical experiences we can say that the optimal value of *p* is about  $p_{max} = 2...3$ .

When the algorithm is running the size of both the statistical matrices  $S_i p$  and the  $H_i s$  changes dynamically. However, it can be seen that it is enough to re-compute in each iteration step the rows of the concerned **Hi** and the values of the concerned *Sip.* Like this the complexity of the algorithm will be linearly proportional with the number of the elements of **W.** 

#### **3.4 Representative examples**

The proposed approaches have been validated in this section by some representative examples. For these examples we assume that the weight distribution of a trained neural network is approximately Gaussian (see [6]). This is why the elements of the sample coefficient matrices were generated randomly in  $[-(2^{Bc-1})$ , examples we assume that the weight distribution of a trained neural network<br>(see [6]). This is why the elements of the sample coefficient matrices were general<br> $2^{Bc-1}$ ] with zero mean and( $\sigma_2 = \frac{2^{Bc-1}}{3}$ ). (Approxi

The following table (Table **1)** contains some representative results (averages of **5** runs using different matrices) for coefficient word length  $B_c = 8,12,16$ , respectively. The number of inputs *N* and the outputs *M* are 5,10,20,40. The columns contain the number of adders, substractors and delay units required. The hardware cost of the trivial realization can be approximated by  $C_{TRI} = \frac{M*N*(B_c+1)}{2}$ . The "R %" column M are 5, 10, 20, 40. The columns contain the number of adders, substractors and delay units required. The hardware cost of the trivial realization can be approximated by  $C_{TRI} = \frac{M*N*(B_c+1)}{2}$ . The "R %" column shows hard

We can conclude from these examples that the reduction ratio increases proportionally to the dimensions *(N* and *M).* This can easily be explained because the more operations are in the matrix-vector product the more to reduce. However, it is not definitely true because the algorithm can execute different reduction steps, but it can be considered as a trend.

### **4 Application proposition**

As it has been mentioned, our goal was to develop implementation techniques which are suitable for embedded applications. The proposed application area is the fixed, non-adaptive (single-chip) neural network implementation. The presented method can be very fast because of the full parallel design and at the same time can provide quite a large -theoretically arbitrary- commutational precision. Certainly, the parallel architecture can be exploited only if the inputs are presented in parallel manner. Such applications are, among



#### Table **1:** Reduction example

others, the image processing tasks (e.g. OCR, Optical Character Recognition, and other pattern and texture recognition), control applications (MIMO, static and dynamic function approximation).

Beyond the Multiple-Layer Perception (MLP) which is the plausible application area of this matrix-vector multiplier structure we have suggested another interesting application area in one of our previous paper [7]: the tasks of the CNN (Cellular NN) -operations with space invariant templates- that can be redefined into a form which exploits the benefits of the new matrix-vector multiplier structure.

### *5* **Conclusion**

We have presented a new implementation approach with significant hardware-reduction to a matrix-vector multiplier in this paper. In our approach the reduction in the hardware cost is based on CSD coding and bit-level optimization of the matrix-vector product. The architecture obtained allows high-speed bit-level pipe-line operation. Our approach is convenient for FPGA arid ASIC realization of pre-trained NNs. The proposed method can be used in an automatic NN design environment where the input **(W)** of the matrixvector multiplier optimization algorithm comes from an NN simulator and its output is processed by a silicon compiler. Future research can be carried out on the optimization algorithm as it is possible to find different methods in the searching of identical bit-patterns.

### **References**

- **PI**  Manferd Glesner and Werner Pochmiiller, *Neurocomputers, an overview of neural networks in VLSI,* Neural Computing. Chapman & Hall, **1994.**
- **PI**  J. G. Delgado-Frias and W. R. Moore, Eds., *VLSI for Artificial Intelligence and Neural Networks,* Pelnum Press, **1991.**
- **[31**  Bkla Fehkr, "Efficient synthesis of distributed vector multipliers," *Microprocessing and Microprogramming,* vol. **38,** pp. **345-360, 1993.**
- [4] Tamás Szabó, Béla Fehér, and Gábor Horváth, "Neural network implementation using distributed arithmetic," in *proceedings of the International Conference on Knowledge- based Electronic Systems,* Adelaide, Australia, **1998,**  vol. **3,** pp. **511-520.**
- **151**  Richard I. Hartly and Keshab K. Parhi, *Digit-serial computation,* Kluwer Academic Publishers, **1995.**
- 161 I. Bellido and E. Fiesler, "DO backpropagation trained neural networks have normal weight distributions?," in *Proceedings of ICANN'93,* **1993.**
- **171**  Tamás Szabó, Béla Fehér, and Gábor Horváth, "Dedicated digital implementation of the CNN universal machine," in *Proceedings of IEEE International Workshop on Intelligent Signal Processing,* **1999,** vol. **1,** pp. **194-199.**