I haven't seen an answer from a trusted source, but I'll try to answer it myself, with a simple example (based on my current knowledge).
In general, note that training an MLP using back-propagation is usually implemented with matrices.
Time complexity of matrix multiplication
The time complexity of matrix multiplication for $M_{ij} * M_{jk}$ is simply $O(i*j*k)$.
Note that we are assuming the simplest multiplication algorithm here: there exist some other algorithms with somewhat better time complexity.
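For illustration, here is a minimal Python sketch of that naive algorithm; the three nested loops make the $O(i*j*k)$ count of scalar multiplications explicit (NumPy is used only to hold the matrices):

```python
import numpy as np

def matmul_naive(A, B):
    """Multiply an (i x j) matrix A by a (j x k) matrix B with three nested loops."""
    i, j = A.shape
    j2, k = B.shape
    assert j == j2, "inner dimensions must match"
    C = np.zeros((i, k))
    for a in range(i):          # i iterations
        for b in range(k):      # k iterations
            for c in range(j):  # j iterations -> i * j * k scalar multiplications in total
                C[a, b] += A[a, c] * B[c, b]
    return C
```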
Feedforward algorithm
The feedforward propagation algorithm is as follows.
First, to go from layer $i$ to $j$, you do
$$S_j = W_{ji} * Z_i$$
Then you apply the activation function
$$Z_j = f(S_j)$$
If we have $N$ layers (including the input and output layers), this will run $N-1$ times.
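As a rough sketch of this loop (assuming NumPy, a sigmoid activation and one column per training example; none of these choices is required by the argument):

```python
import numpy as np

def sigmoid(S):
    return 1.0 / (1.0 + np.exp(-S))

def feedforward(Z_input, weights, f=sigmoid):
    """Run the forward pass. `weights` is a list of N-1 matrices; weights[n] has
    (nodes in layer n+1) rows and (nodes in layer n) columns, matching W_ji above."""
    Z = Z_input                      # Z_i: one column per training example
    for W in weights:                # executed N-1 times
        S = W @ Z                    # S_j = W_ji * Z_i
        Z = f(S)                     # Z_j = f(S_j)
    return Z
```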
Example
As an example, let's compute the time complexity for the feedforward algorithm of an MLP with $4$ layers, where $i$ denotes the number of nodes of the input layer, $j$ the number of nodes in the second layer, $k$ the number of nodes in the third layer and $l$ the number of nodes in the output layer.
Since there are $4$ layers, you need $3$ matrices to represent the weights between these layers. Let's denote them by $W_{ji}$, $W_{kj}$ and $W_{lk}$, where $W_{ji}$ is a matrix with $j$ rows and $i$ columns ($W_{ji}$ contains the weights going from layer $i$ to layer $j$).
Suppose you have $t$ training examples. To propagate from layer $i$ to $j$, we first have
$$S_{jt} = W_{ji} * Z_{it}$$
and this operation (i.e. matrix multiplication) has $O(j*i*t)$ time complexity. Then we apply the activation function
$$Z_{jt} = f(S_{jt})$$
and this has $O(j*t)$ time complexity, because it is an element-wise operation.
So, in total, we have
$$O(j*i*t + j*t) = O(j*t*(i+1)) = O(j*i*t)$$
Using the same logic, for going $j \to k$, we have $O(k*j*t)$, and, for $k \to l$, we have $O(l*k*t)$.
In total, the time complexity for feedforward propagation will be
$$O(j*i*t + k*j*t + l*k*t) = O(t*(ij + jk + kl))$$
I don't think this can be simplified further: the sum $ij + jk + kl$ does not collapse into a single product, and a bound like $O(t*i*j*k*l)$ would be much looser.
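To make the formula concrete, here is a toy count with made-up layer sizes (the numbers are purely illustrative):

```python
# Hypothetical layer sizes and number of training examples, just to plug into
# the dominant term t * (i*j + j*k + k*l).
i, j, k, l = 784, 128, 64, 10   # input, second, third, output layer nodes (illustrative only)
t = 1000                        # training examples

multiplications = t * (i * j + j * k + k * l)
print(f"{multiplications:,}")   # 109,184,000 scalar multiplications for one forward pass
```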
Back-propagation algorithm
The back-propagation algorithm proceeds as follows. Starting from the output layer ($l \to k$), we compute the error signal, $E_{lt}$, a matrix containing the error signals for the nodes at layer $l$
$$E_{lt} = f'(S_{lt}) \odot (Z_{lt} - O_{lt})$$
where $\odot$ means element-wise multiplication. Note that $E_{lt}$ has $l$ rows and $t$ columns: it simply means each column is the error signal for one training example.
We then compute the "delta weights", $D_{lk} \in \mathbb{R}^{l \times k}$ (between layer $l$ and layer $k$)
$$D_{lk} = E_{lt} * Z_{tk}$$
where $Z_{tk}$ is the transpose of $Z_{kt}$.
We then adjust the weights
$$W_{lk} = W_{lk} - D_{lk}$$
For $l \to k$, we thus have the time complexity $O(lt + lt + ltk + lk) = O(l*t*k)$.
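A minimal sketch of this output-layer step, assuming a sigmoid activation (so $f'(S) = f(S)(1 - f(S))$) and following the formulas above, without a learning-rate factor:

```python
import numpy as np

def sigmoid_prime(S):
    z = 1.0 / (1.0 + np.exp(-S))
    return z * (1.0 - z)

def backprop_output_layer(S_l, Z_l, O_l, Z_k, W_lk):
    """One output-layer update. Shapes: S_l, Z_l, O_l are (l x t); Z_k is (k x t); W_lk is (l x k)."""
    E_l = sigmoid_prime(S_l) * (Z_l - O_l)   # O(l*t): element-wise error signal
    D_lk = E_l @ Z_k.T                       # O(l*t*k): delta weights, (l x t) times (t x k)
    W_lk = W_lk - D_lk                       # O(l*k): weight update
    return W_lk, E_l                         # E_l is needed to propagate the error back to layer k
```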
Now, going back from $k \to j$, we first have
$$E_{kt} = f'(S_{kt}) \odot (W_{kl} * E_{lt})$$
Then
$$D_{kj} = E_{kt} * Z_{tj}$$
And then
$$W_{kj} = W_{kj} - D_{kj}$$
where $W_{kl}$ is the transpose of $W_{lk}$. For $k \to j$, we have the time complexity $O(kt + klt + ktj + kj) = O(k*t*(l+j))$.
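A similar sketch for this hidden-layer step, under the same assumptions (sigmoid activation, shapes as defined above):

```python
import numpy as np

def sigmoid_prime(S):
    z = 1.0 / (1.0 + np.exp(-S))
    return z * (1.0 - z)

def backprop_hidden_layer(S_k, E_l, W_lk, Z_j, W_kj):
    """One hidden-layer update. Shapes: S_k is (k x t); E_l is (l x t); W_lk is (l x k);
    Z_j is (j x t); W_kj is (k x j)."""
    E_k = sigmoid_prime(S_k) * (W_lk.T @ E_l)  # O(k*l*t) for the product, O(k*t) element-wise
    D_kj = E_k @ Z_j.T                         # O(k*t*j): delta weights
    W_kj = W_kj - D_kj                         # O(k*j): weight update
    return W_kj, E_k
```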
And finally, for $j \to i$, we have $O(j*t*(k+i))$. In total, we have
$$O(ltk + tk(l+j) + tj(k+i)) = O(t*(lk + kj + ji))$$
which is the same as for the feedforward pass. Since they are the same, the total time complexity for one epoch will be $O(t*(ij + jk + kl))$.
This time complexity is then multiplied by the number of iterations (epochs). So, we have $O(n*t*(ij + jk + kl))$,
where $n$ is the number of iterations.
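Putting the pieces together, here is a compact sketch of one possible batch-gradient-descent training loop for the 4-layer MLP; the layer sizes are arbitrary, and I added a learning-rate factor that the formulas above omit (it does not change the complexity):

```python
import numpy as np

def sigmoid(S):
    return 1.0 / (1.0 + np.exp(-S))

def train(X, O, sizes=(4, 5, 3, 2), n_epochs=100, lr=0.1):
    """Batch gradient descent on a 4-layer MLP. X is (i x t), O is (l x t);
    sizes = (i, j, k, l) is illustrative only."""
    rng = np.random.default_rng(0)
    # One weight matrix per layer transition: W[m] has shape (sizes[m+1], sizes[m])
    W = [rng.standard_normal((sizes[m + 1], sizes[m])) * 0.1 for m in range(3)]
    for epoch in range(n_epochs):                    # n iterations
        # Forward pass: O(t * (ij + jk + kl))
        Z = [X]
        for Wm in W:
            Z.append(sigmoid(Wm @ Z[-1]))
        # Backward pass: same asymptotic cost
        E = Z[-1] * (1 - Z[-1]) * (Z[-1] - O)        # error signal at the output layer
        for m in (2, 1, 0):
            D = E @ Z[m].T                           # delta weights for W[m]
            if m > 0:                                # propagate the error one layer back
                E = Z[m] * (1 - Z[m]) * (W[m].T @ E) # f'(S) = Z * (1 - Z) for the sigmoid
            W[m] = W[m] - lr * D
    return W
```

For instance, `train(np.random.rand(4, 100), np.random.rand(2, 100))` would run 100 epochs on 100 random examples with the default (made-up) layer sizes.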
Notes
Note that these matrix operations can be greatly parallelized by GPUs.
Conclusion
We tried to find the time complexity for training a neural network that has 4 layers with respectively $i$, $j$, $k$ and $l$ nodes, with $t$ training examples and $n$ epochs. The result was $O(nt*(ij + jk + kl))$.
We assumed the simplest form of matrix multiplication, which has cubic time complexity. We used the batch gradient descent algorithm. The results for stochastic and mini-batch gradient descent should be the same. (Let me know if you think otherwise: note that batch gradient descent is the general form; with little modification, it becomes stochastic or mini-batch.)
Also, if you use momentum optimization, you will have the same time complexity, because the extra matrix operations required are all element-wise operations; hence, they will not affect the time complexity of the algorithm.
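For example, a momentum update for one weight matrix only adds element-wise work (the names `beta` and `lr` here are illustrative, not from the formulas above):

```python
def momentum_update(W_lk, V_lk, D_lk, lr=0.1, beta=0.9):
    """Momentum step for one weight matrix; every operation here is element-wise, O(l*k)."""
    V_lk = beta * V_lk + D_lk    # velocity accumulates past delta weights
    W_lk = W_lk - lr * V_lk      # same cost as the plain update W_lk - D_lk
    return W_lk, V_lk
```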
I'm not sure what the results would be using other optimizers such as RMSprop.
Sources
The following article, http://briandolhansky.com/blog/2014/10/30/artificial-neural-networks-matrix-form-part-5, describes an implementation using matrices. Although this implementation uses a "row major" layout, the time complexity is not affected by this.
If you're not familiar with back-propagation, check this article:
http://briandolhansky.com/blog/2013/9/27/artificial-neural-networks-backpropagation-part-4