Multilayered networks can compute a wide range of Boolean functions with only a proportional increase in computational effort [30]. Using gradient descent, the backpropagation algorithm searches for the minimum of the error function in weight space [31, 32]; the combination of weights that minimizes the error function is taken as the solution of the learning problem. Backpropagation entails both a forward and a backward step: the backward pass adjusts the model's parameters so as to minimize the error function [33]. In the forward pass, \(c_x\) denotes the input presented to input neuron \(x\), \(k\) indexes the hidden-layer neurons, and \(w_{xk}\) is the weight of the interconnection between input neuron \(x\) and hidden neuron \(k\). The hidden layer can be defined as:
$${\text{H}}\left( {\text{k}} \right) = \mathop \sum \limits_{{{\text{x}} = 1}}^{{\text{N}}} {\text{c}}_{{\text{x}}} {\text{w}}_{{{\text{xk}}}} + {\text{b}}_{{\text{h}}}$$
(1)
where \(b_h\) is the bias of the hidden layer. In the next step, the hidden-layer output is passed through an activation function [22]. The overall output is then calculated by multiplying the activations of the hidden-layer neurons by the corresponding output weights and passing the result through an activation function. The aim is to minimize the loss function \(E(\omega)\) by adjusting the weights until a global minimum is reached; this can be described by the following update rule:
$$\omega \to \omega - \eta \nabla E\left( \omega \right)$$
(2)
$$\nabla {\text{E}}\left( \omega \right) = \left( {\frac{{\partial {\text{E}}}}{{\partial \omega_{1} }},\frac{{\partial {\text{E}}}}{{\partial \omega_{2} }}, \ldots ,\frac{{\partial {\text{E}}}}{{\partial \omega_{{\text{n}}} }}} \right)$$
(3)
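As an illustration of Eqs. (1)–(3), the following minimal NumPy sketch performs the forward pass of Eq. (1) and a single gradient-descent step; the layer sizes, the sigmoid activation, and the squared-error loss are assumptions made for the example and are not prescribed by the text.

```python
import numpy as np

def hidden_layer(c, w, b_h):
    """Eq. (1): H(k) = sum_x c_x * w_xk + b_h for every hidden neuron k."""
    return c @ w + b_h                         # shape: (num_hidden,)

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

# Toy dimensions, chosen only for the example.
rng = np.random.default_rng(0)
num_inputs, num_hidden = 4, 3
c   = rng.normal(size=num_inputs)              # input vector
w   = rng.normal(size=(num_inputs, num_hidden))
b_h = np.zeros(num_hidden)

# Forward pass: weighted sum (Eq. (1)), then activation.
H = hidden_layer(c, w, b_h)
a = sigmoid(H)

# One gradient-descent step of Eq. (2) on an assumed squared-error loss
# E(w) = 0.5 * ||a - target||^2.
target = np.array([1.0, 0.0, 1.0])
eta = 0.01
grad_H = (a - target) * a * (1.0 - a)          # dE/dH via the chain rule
grad_w = np.outer(c, grad_H)                   # dE/dw_xk = c_x * dE/dH_k, stacked as in Eq. (3)
w = w - eta * grad_w                           # move against the gradient
```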
To obtain the gradient of E with respect to \(w_{pq}\), we use the chain rule:
$$\frac{\partial E}{\partial w_{pq}} = \sum\limits_{k} \frac{\partial E}{\partial H\left( k \right)} \frac{\partial H\left( k \right)}{\partial w_{pq}} = \sum\limits_{k} \left( q_{k}\left( z \right) - p\left( y \right) \right) c_{p} \delta_{kq}$$
(4)
The gradient of the error function E is
$$E_{W}\left( W \right) = \left( E_{w_{0}}^{T}\left( W \right),\; E_{w_{1}}^{T}\left( W \right),\; \ldots,\; E_{w_{n}}^{T}\left( W \right) \right)^{T}$$
(5)
each component of which takes the form
$$\frac{\partial E}{{\partial {\text{w}}_{{{\text{xk}}}} }} = {\text{c}}_{{\text{x}}} \left( {{\text{q}}_{{\text{i}}} \left( {\text{z}} \right) - {\text{p}}\left( {\text{x}} \right)} \right)$$
(6)
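To make the chain-rule gradients of Eqs. (4)–(6) concrete, the sketch below compares an analytic gradient with a finite-difference estimate; the squared-error loss and the single linear layer are illustrative assumptions rather than the paper's exact setup.

```python
import numpy as np

def loss(w, c, target):
    """Illustrative squared-error loss over a single linear hidden layer."""
    H = c @ w                                   # Eq. (1) without the bias
    return 0.5 * np.sum((H - target) ** 2)

def analytic_grad(w, c, target):
    """Chain rule as in Eq. (4): dE/dw_xk = c_x * (H_k - target_k)."""
    H = c @ w
    return np.outer(c, H - target)

rng = np.random.default_rng(1)
c, target = rng.normal(size=4), rng.normal(size=3)
w = rng.normal(size=(4, 3))

# Central finite-difference check of the chain-rule gradient.
eps, num_grad = 1e-6, np.zeros_like(w)
for idx in np.ndindex(*w.shape):
    w_plus, w_minus = w.copy(), w.copy()
    w_plus[idx] += eps
    w_minus[idx] -= eps
    num_grad[idx] = (loss(w_plus, c, target) - loss(w_minus, c, target)) / (2 * eps)

assert np.allclose(num_grad, analytic_grad(w, c, target), atol=1e-5)
```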
Given the initial weights \(W_0\), the iterative weight-update formula takes the form
$$W_{n + 1} = W_{n} - \eta E_{W}\left( W_{n} \right)$$
(7)
where η > 0 is the learning rate, which determines how far to move along the negative direction of the gradient. However, the convergence speed is very slow because of the saturation behavior of the activation functions in the network, and it becomes even worse for networks with multiple hidden layers [34]. When an output unit saturates, the corresponding descent gradient takes a small value even if the output error is large, so the weight adjustment makes no significant progress. The second disadvantage of this method is the difficulty of choosing a learning rate η that achieves fast learning while keeping the learning procedure stable [35]. These problems limit the applicability of conventional BP to a wide range of applications.
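The saturation effect described above can be illustrated numerically: for a sigmoid activation (chosen here purely as an example), the derivative collapses toward zero at large pre-activations, so the descent gradient stays small even when the output error is large.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def sigmoid_derivative(t):
    s = sigmoid(t)
    return s * (1.0 - s)

# Derivative of the activation at increasingly saturated pre-activations.
for t in (0.0, 2.0, 5.0, 10.0):
    print(f"t = {t:5.1f}  ->  derivative = {sigmoid_derivative(t):.2e}")
# At t = 10 the derivative is roughly 4.5e-05: even a large output error is
# multiplied by this tiny factor, so the weight adjustment barely moves.
```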
The momentum term prevents the search from deviating by using two successive gradient steps, the first controlling or reinforcing the second; it is a fraction of the previous weight correction. Over the last few years, different modified versions of BP have been introduced, and most of this work was concerned with the effect of the momentum and learning rates on the speed of convergence, since these two parameters directly govern the convergence and underdamped-oscillation behavior. The modification is usually achieved by adding to Eq. (7) a fraction of the previous weight adjustment, which leads to
$${\text{W}}_{{{\text{n}} + 1}} = {\text{W}}_{{\text{n}}} - \eta {\text{E}}_{{\text{W}}} \left( {{\text{W}}_{{\text{n}}} } \right) + \alpha \left( {{\text{W}}_{{\text{n}}} - {\text{W}}_{{{\text{n}} - 1}} } \right)$$
(8)
Letting \(\Delta W_{n-1} = W_{n} - W_{n-1}\), the above equation can be rewritten as
$$\Delta {\text{W}}_{{\text{n}}} = - \eta {\text{E}}_{{\text{W}}} \left( {{\text{W}}_{{\text{n}}} } \right) + \alpha \Delta {\text{W}}_{{{\text{n}} - 1}} \;\;{\text{n}} = 0, 1, \ldots$$
(9)
where \(\alpha \Delta W_{n-1}\) is the momentum term and \(\alpha\) is the momentum coefficient, a positive number with \(0 < \alpha < 1\).
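A minimal sketch of the momentum update of Eqs. (8)–(9); the gradient callable and the quadratic error used in the usage example are assumptions for illustration.

```python
import numpy as np

def momentum_step(W, delta_W_prev, grad_E, eta=0.01, alpha=0.9):
    """One update of Eq. (9): dW_n = -eta * E_W(W_n) + alpha * dW_{n-1}.

    grad_E is any callable returning E_W(W); eta and alpha are the
    learning rate and the momentum coefficient (0 < alpha < 1).
    """
    delta_W = -eta * grad_E(W) + alpha * delta_W_prev
    W_next = W + delta_W                        # equivalent to Eq. (8)
    return W_next, delta_W

# Usage on a simple quadratic error E(W) = 0.5 * ||W||^2 (assumed), whose
# gradient is W itself.
W = np.array([2.0, -3.0])
delta_W = np.zeros_like(W)
for _ in range(50):
    W, delta_W = momentum_step(W, delta_W, grad_E=lambda W: W)
```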
Backpropagation with adaptive momentum
Conventional BP uses constant learning-rate and momentum terms; adjusting these terms during the training process is an effective way to accelerate learning convergence. A small learning rate induces only a small change in the network weights from one iteration to the next, leading to a smoother learning curve, whereas a larger learning rate produces larger weight changes that may cause network instability and oscillation. Suitable momentum coefficients and learning rates are therefore required to achieve fast and stable convergence during training. This study introduces a BP algorithm with a variable adaptive momentum coefficient and learning rate. The proposed variable momentum is given by:
$$\alpha \left( n \right) = \frac{\beta}{1 + \exp \left( { - \left| \frac{1}{\sqrt{E\left( n \right) E\left( n - 1 \right)}} \right|} \right)}$$
(10)
where \(\beta\) is the forgetting factor \((0 \ll \beta < 1)\).
The initial value of β is expected to be large enough that the term \(1 - \beta^{n}\) is close to unity; as a result, the initial value of α(n) is relatively large. Rapid convergence of the updated weights is then expected within a minimal number of iterations and is enhanced further as the momentum becomes smaller, providing low-error performance for the weight update in (7). The momentum tracks the error E(n) at each epoch and decreases or increases within a given range. A velocity variable is created to store the momentum for every parameter.
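One possible reading of Eq. (10) in code; the safeguard against a vanishing error product (eps) is an added assumption of this sketch, not part of the original formula.

```python
import numpy as np

def adaptive_momentum(E_n, E_prev, beta=0.9, eps=1e-12):
    """Eq. (10): alpha(n) = beta / (1 + exp(-|1 / sqrt(E(n) * E(n-1))|)).

    beta is the forgetting factor; eps guards against division by zero when
    the error product becomes extremely small (an added assumption).
    """
    product = max(E_n * E_prev, eps)
    return beta / (1.0 + np.exp(-abs(1.0 / np.sqrt(product))))

# With larger errors the exponent is small and alpha sits nearer beta/2;
# as the error product shrinks, alpha(n) approaches beta.
print(adaptive_momentum(E_n=1.0,  E_prev=1.2))   # early in training
print(adaptive_momentum(E_n=0.01, E_prev=0.02))  # near convergence
```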
Given the gradient of the error function (3) with respect to w and V (the velocity), and the initial weights \(w_0\), \(w_1\) and \(v_0\), \(v_1\), the momentum algorithm updates the weights w and v iteratively,
where \(\alpha \in \left( {0,1} \right)\) is the variable adaptive momentum coefficient given by Eq. (10), and \(\eta \in \left( {0,1} \right)\) is the learning rate (0.01).
Then Eq. (11) can be written as
$$\left\{ \begin{aligned} & \Delta {\text{w}}_{{{\text{n}} + 1}} = \alpha \Delta {\text{w}}_{{\text{n}}} - \eta {\text{E}}_{{\text{w}}} \left( {{\text{w}}_{{\text{n}}} ,{\text{V}}_{{\text{n}}} } \right) \\ & \Delta {\text{v}}_{{{\text{n}} + 1}}^{{\text{i}}} = \alpha_{{{\text{n}},{\text{i}}}} \Delta {\text{v}}_{{\text{n}}}^{{\text{i}}} - \eta {\text{E}}_{{{\text{v}}_{{\text{i}}} }} \left( {{\text{w}}_{{\text{n}}} ,{\text{V}}_{{\text{n}}} } \right)\quad {\text{i}} = 1, \ldots , {\text{N}},\;\;{\text{n}} = 1,2, \ldots . \\ \end{aligned} \right.$$
(13)
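A sketch of one iteration of the coupled update in Eq. (13), with α(n) computed as in Eq. (10); the gradient callables, the shared α(n) applied to every \(v_i\), and the parameter defaults are assumptions of this sketch rather than the authors' exact implementation.

```python
import numpy as np

def adaptive_momentum_step(w, V, dw_prev, dV_prev, grad_w, grad_V,
                           E_n, E_prev, eta=0.01, beta=0.9, eps=1e-12):
    """One iteration of Eq. (13) with the variable momentum of Eq. (10).

    grad_w(w, V) returns E_w(w, V) and grad_V(w, V) returns the stacked
    gradients E_{v_i}(w, V), i = 1..N.  Applying the same alpha(n) to every
    v_i is a simplifying assumption of this sketch.
    """
    # Eq. (10): variable momentum coefficient driven by successive errors.
    product = max(E_n * E_prev, eps)
    alpha = beta / (1.0 + np.exp(-abs(1.0 / np.sqrt(product))))

    dw = alpha * dw_prev - eta * grad_w(w, V)   # Delta w_{n+1} in Eq. (13)
    dV = alpha * dV_prev - eta * grad_V(w, V)   # Delta v_{n+1}^i in Eq. (13)
    return w + dw, V + dV, dw, dV
```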
The adaptive momentum algorithm is weakly convergent under the following assumptions:
(a) The activation function \(f(t)\) applied to Eq. (1), together with its derivatives \(f'(t)\) and \(f''(t)\), is uniformly bounded for all \(t \in \mathbb{R}\).

(b) The weights \(W_{n}\) \((n = 1, 2, \ldots)\) are uniformly bounded.

(c) The following set has a finite number of elements:
$$\varphi = \left\{ \left( w, V \right) \mid E_{w}\left( w, V \right) = 0,\; E_{v_{i}}\left( w, V \right) = 0,\; i = 1, \ldots, N \right\}$$
(14)
Assume that the error function is given by (12) and that the weight sequence \(\{W_n\}\) is generated by (13) from an initial weight value \(W_0\). If assumptions (a), (b), and (c) hold, then for the final network output:
1. \(E\left( W_{n+1} \right) \le E\left( W_{n} \right),\; n = 0, 1, \ldots\)

2. There is \(E^{*} \ge 0\) such that \(\lim_{n \to \infty} E\left( w_{n}, V_{n} \right) = E^{*}\)

3. \(\lim_{n \to \infty} E_{W}\left( w_{n}, V_{n} \right) = 0\)

4. \(\lim_{n \to \infty} E_{v_{i}}\left( w_{n}, V_{n} \right) = 0,\; i = 1, \ldots, N\)
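The four conclusions above suggest simple run-time checks: the error should be non-increasing across epochs and the gradient norms should tend to zero. The sketch below assumes the training loop records these quantities; the tolerance and the data layout are illustrative assumptions.

```python
import numpy as np

def check_convergence(errors, grad_w_norms, grad_v_norms, tol=1e-6):
    """Empirical check of conclusions 1-4 on a recorded training run.

    errors       : E(W_n) per epoch                      (conclusions 1 and 2)
    grad_w_norms : ||E_W(w_n, V_n)|| per epoch           (conclusion 3)
    grad_v_norms : ||E_{v_i}(w_n, V_n)|| per epoch, all i (conclusion 4)
    """
    monotone = all(e_next <= e + tol for e, e_next in zip(errors, errors[1:]))
    w_grad_vanishes = grad_w_norms[-1] < tol
    v_grad_vanishes = np.all(np.asarray(grad_v_norms[-1]) < tol)
    return monotone and w_grad_vanishes and v_grad_vanishes
```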
For any input, the output of the hidden neurons is obtained from Eq. (1) through the activation function, and the network output is obtained by propagating these values through the output layer, as in (15).
Hence, if assumption (c) is satisfied, then (15) converges to a local minimum \(\left( w^{*}, V^{*} \right)\), which means
$${\text{lim}}_{{{\text{n}} \to \infty }} {\text{w}}_{{\text{n}}} = {\text{w}}^{*} ,\;{\text{lim}}_{{{\text{n}} \to \infty }} {\text{V}}_{{\text{n}}} = {\text{V}}^{*}$$
(16)
$${\text{E}}_{{\text{w}}} \left( {{\text{w}}^{*} ,{\text{V}}^{*} } \right) = 0,\;\;{\text{E}}_{{{\text{v}}_{{\text{i}}} }} \left( {{\text{w}}^{*} ,{\text{V}}^{*} } \right) = 0,\;\;{\text{i}} = 1,{ }..,{\text{ N}}$$
(17)
The proposed variable momentum algorithm differs from previous state-of-the-art methods in that it uses the error to update the momentum term; the momentum term is therefore directly related to the error value and behaves so as to reduce the error. The next section explains the medical data preprocessing, in which the data are transformed into a form that the proposed variable adaptive momentum algorithm can easily parse. In our experiments, three medical datasets were selected to cover all classification tasks: binary, multi-label, and multi-class classification.