$$
\Delta_{ji}(t)=
\begin{cases}
\eta^{+}\cdot\Delta_{ji}(t-1), & \text{if } \dfrac{\partial\xi(t-1)}{\partial w_{ji}(t-1)}\cdot\dfrac{\partial\xi(t)}{\partial w_{ji}(t)}>0\\[2mm]
\eta^{-}\cdot\Delta_{ji}(t-1), & \text{if } \dfrac{\partial\xi(t-1)}{\partial w_{ji}(t-1)}\cdot\dfrac{\partial\xi(t)}{\partial w_{ji}(t)}<0\\[2mm]
\Delta_{ji}(t-1), & \text{else}
\end{cases}
\qquad 0<\eta^{-}<1<\eta^{+}
$$
$$
\Delta w_{ji}(t)=
\begin{cases}
-\Delta w_{ji}(t-1), & \text{if } \dfrac{\partial\xi(t-1)}{\partial w_{ji}(t-1)}\cdot\dfrac{\partial\xi(t)}{\partial w_{ji}(t)}<0\\[2mm]
-\Delta_{ji}(t), & \text{if } \dfrac{\partial\xi(t)}{\partial w_{ji}(t)}>0\\[2mm]
+\Delta_{ji}(t), & \text{if } \dfrac{\partial\xi(t)}{\partial w_{ji}(t)}<0\\[2mm]
0, & \text{else}
\end{cases}
$$
Therefore, when two successive derivatives point in the same direction, convergence is accelerated by increasing the magnitude of the weight change. However, when they point in opposite directions, oscillation has begun, and the weight change should be reduced to allow a more precise adjustment.
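As a concrete illustration, here is a minimal NumPy sketch of one RPROP iteration following the equations above; the function name and the η defaults are illustrative, and the Δ_min/Δ_max clamps are a conventional safeguard not shown in the equations.

```python
import numpy as np

def rprop_step(w, grad, prev_grad, delta, prev_dw,
               eta_plus=1.2, eta_minus=0.5,
               delta_min=1e-6, delta_max=50.0):
    """One RPROP iteration with weight backtracking.

    All array arguments share the shape of w. The delta_min/delta_max
    clamps are a conventional safeguard, not part of the source equations.
    """
    grad = grad.copy()                    # keep the sign reset below local
    delta = delta.copy()
    prod = prev_grad * grad               # sign of the derivative product
    dw = np.zeros_like(w)

    same = prod > 0                       # same direction: enlarge the step
    delta[same] = np.minimum(delta[same] * eta_plus, delta_max)

    flipped = prod < 0                    # oscillation: shrink and backtrack
    delta[flipped] = np.maximum(delta[flipped] * eta_minus, delta_min)
    dw[flipped] = -prev_dw[flipped]
    grad[flipped] = 0.0                   # force the 'else' branch next time

    rest = ~flipped                       # the two sign-based branches
    dw[rest] = -np.sign(grad[rest]) * delta[rest]

    return w + dw, grad, delta, dw
```

The caller feeds the returned gradient and weight change back in as `prev_grad` and `prev_dw` on the next iteration.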
In practice, only part of the algorithm is adopted. One eclectic method is:
$$
\Delta_{ji}(t)=
\begin{cases}
\eta^{+}\cdot\Delta_{ji}(t-1), & \text{if } \dfrac{\partial\xi(t-1)}{\partial w_{ji}(t-1)}\cdot\dfrac{\partial\xi(t)}{\partial w_{ji}(t)}>0\\[2mm]
\eta^{-}\cdot\Delta_{ji}(t-1), & \text{if } \dfrac{\partial\xi(t-1)}{\partial w_{ji}(t-1)}\cdot\dfrac{\partial\xi(t)}{\partial w_{ji}(t)}<0\\[2mm]
\Delta_{ji}(t-1), & \text{else}
\end{cases}
$$
$$
\Delta w_{ji}(t)=
\begin{cases}
-\Delta_{ji}(t), & \text{if } \dfrac{\partial\xi(t)}{\partial w_{ji}(t)}>0\\[2mm]
+\Delta_{ji}(t), & \text{if } \dfrac{\partial\xi(t)}{\partial w_{ji}(t)}<0
\end{cases}
$$
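A sketch of this simplified variant under the same illustrative conventions as above; it keeps the step-size adaptation but drops the backtracking branch, always updating by $-\operatorname{sign}(\partial\xi/\partial w)\cdot\Delta$:

```python
import numpy as np

def rprop_step_simplified(w, grad, prev_grad, delta,
                          eta_plus=1.2, eta_minus=0.5):
    """Simplified RPROP variant: same step-size adaptation, but the
    weight update is always -sign(grad) * delta, with no backtracking."""
    prod = prev_grad * grad
    delta = np.where(prod > 0, delta * eta_plus,
                     np.where(prod < 0, delta * eta_minus, delta))
    dw = -np.sign(grad) * delta
    return w + dw, delta
```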
Conjugate Gradient Algorithms (CGA)
Rather than always following the steepest descent direction as in the BP series of algorithms, CGA searches along conjugate directions. Given the quadratic function $F(x)=\frac{1}{2}x^{T}Hx+d^{T}x+c$, a set of vectors $\{p_k\}$ is mutually conjugate with respect to the positive definite Hessian matrix $H$ if and only if $p_k^{T}Hp_j=0$ for $k\neq j$ (Hagan et al., 1996). The first search iteration still uses the steepest descent direction, $p(0)=-g(0)$; after that, CGA follows the rule:
$$
w_{ji}(t+1)=w_{ji}(t)+\alpha(t)\,p(t)
$$
$$
p(t)=-g(t)+\beta(t)\,p(t-1)
$$
Different CGAs define $\beta$ differently, for example the Fletcher-Reeves method:
$$
\beta(t)=\frac{g^{T}(t)\,g(t)}{g^{T}(t-1)\,g(t-1)},
$$
the Polak-Ribiére method:
$$
\beta(t)=\frac{\Delta g^{T}(t-1)\,g(t)}{g^{T}(t-1)\,g(t-1)},
$$
and the Hestenes-Stiefel method:
$$
\beta(t)=\frac{\Delta g^{T}(t-1)\,g(t)}{\Delta g^{T}(t-1)\,p(t-1)},
$$
where $\Delta g(t-1)=g(t)-g(t-1)$.
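A minimal sketch of the CGA loop with the Fletcher-Reeves $\beta$; the crude backtracking line search and all names are illustrative stand-ins (a practical implementation would enforce Wolfe conditions):

```python
import numpy as np

def conjugate_gradient(f, grad_f, w, n_iter=100, tol=1e-8):
    """Minimize f by nonlinear CG with the Fletcher-Reeves beta."""
    g = grad_f(w)
    p = -g                                  # first direction: steepest descent
    for _ in range(n_iter):
        alpha = line_search(f, w, p)        # step length alpha(t)
        w = w + alpha * p
        g_new = grad_f(w)
        if np.linalg.norm(g_new) < tol:
            break
        beta = (g_new @ g_new) / (g @ g)    # Fletcher-Reeves beta(t)
        p = -g_new + beta * p
        g = g_new
    return w

def line_search(f, w, p, alpha=1.0, shrink=0.5, max_tries=30):
    """Crude backtracking line search: halve alpha until f decreases."""
    f0 = f(w)
    for _ in range(max_tries):
        if f(w + alpha * p) < f0:
            return alpha
        alpha *= shrink
    return alpha
```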
Newton Algorithms
As stated in Hagan et al. (1996), the Newton method is actually based on the second-order Taylor series:
$$
F(w_{ji}(t+1))=F(w_{ji}(t)+\Delta w_{ji}(t))\approx F(w_{ji}(t))+g^{T}(t)\,\Delta w_{ji}(t)+\frac{1}{2}\Delta w_{ji}^{T}(t)\,H(t)\,\Delta w_{ji}(t)
$$
where $g(t)$ is the gradient vector and $H(t)$ is the Hessian matrix. Setting the derivative with respect to $\Delta w_{ji}(t)$ to zero gives $g(t)+H(t)\,\Delta w_{ji}(t)=0$, hence $\Delta w_{ji}(t)=-H^{-1}(t)\,g(t)$.
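A minimal sketch of the resulting Newton step, solving the linear system rather than forming $H^{-1}$ explicitly (names are illustrative):

```python
import numpy as np

def newton_step(w, g, H):
    """Solve H(t) * dw = -g(t) for the Newton update dw."""
    dw = np.linalg.solve(H, -g)   # cheaper and more stable than inverting H
    return w + dw
```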
Levenberg-Marquardt algorithm (LMA)
LMA is based on the Newton method. Rather than calculating the exact Hessian matrix at every iteration, LMA uses the approximation $G(t)=H(t)+\mu(t)I$ with $H(t)\approx J^{T}(w_{ji}(t))\,J(w_{ji}(t))$, so that
$$
\Delta w_{ji}(t)=-\left[J^{T}(w_{ji}(t))\,J(w_{ji}(t))+\mu(t)I\right]^{-1}J^{T}(w_{ji}(t))\,e,
$$
where $J$ is the Jacobian matrix and $e$ is the vector of network errors. A detailed description can be found in Hagan et al. (1996).
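A sketch of a single LMA update, assuming the Jacobian $J$ and error vector $e$ are already computed; the $\mu(t)$ adaptation schedule is omitted and all names are illustrative:

```python
import numpy as np

def lm_step(w, J, e, mu):
    """One LM update: dw = -(J^T J + mu*I)^{-1} J^T e."""
    A = J.T @ J + mu * np.eye(w.size)   # approximate Hessian G = H + mu*I
    dw = np.linalg.solve(A, -(J.T @ e))
    return w + dw
```

Large $\mu(t)$ makes the step behave like small-step gradient descent, while small $\mu(t)$ recovers the Gauss-Newton step, which is why $\mu(t)$ is typically decreased after successful iterations and increased otherwise.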