Deep neural network: backpropagation with ReLU

I am having some difficulty deriving backpropagation with ReLU. I have done some work, but I am not sure whether I am on the right track.

Cost function: $\frac{1}{2}(y-\hat y)^2$, where $y$ is the real value and $\hat y$ is the predicted value. Also assume that $x$ > 0 always.

1 Layer ReLU, where the weight in the 1st layer is $w_1$

$\frac{dC}{dw_1}=\frac{dC}{dR}\frac{dR}{dw_1}$

$\frac{dC}{dw_1}=(y-ReLU(w_1x))(x)$

2 Layer ReLU, where the weight in the 1st layer is $w_2$ and the weight in the 2nd layer is $w_1$. I want to update the 1st layer weight $w_2$.

$\frac{dC}{dw_2}=\frac{dC}{dR}\frac{dR}{dw_2}$

$\frac{dC}{dw_2}=(y-ReLU(w_1*ReLU(w_2x)))(w_1x)$

Since $ReLU(w_1*ReLU(w_2x))=w_1w_2x$

3 Layer ReLU, where the weights are $w_3$ in the 1st layer, $w_2$ in the 2nd layer and $w_1$ in the 3rd layer

$\frac{dC}{dw_3}=\frac{dC}{dR}\frac{dR}{dw_3}$

$\frac{dC}{dw_3}=(y-ReLU(w_1*ReLU(w_2*ReLU(w_3x))))(w_1w_2x)$

Since $ReLU(w_1*ReLU(w_2*ReLU(w_3x)))=w_1w_2w_3x$

So the chain rule only ever involves 2 derivatives here, whereas with a sigmoid it could be as long as the number of layers $n$.

Say I want to update all 3 layer weights, where $w_3$ is the 1st layer, $w_2$ is the 2nd layer and $w_1$ is the 3rd layer

$\frac{dC}{dw_1}=(y-ReLU(w_1x))(x)$

$\frac{dC}{dw_2}=(y-ReLU(w_1*ReLU(w_2x)))(w_1x)$

$\frac{dC}{dw_3}=(y-ReLU(w_1*ReLU(w_2*ReLU(w_3x))))(w_1w_2x)$

If this derivation is correct, how does this prevent vanishing? Compared to sigmoid, where we have a lot of multiplication by 0.25 in the equation, ReLU does not have any constant value multiplication. If there's thousands of layers, there would be a lot of multiplication due to weights, then wouldn't this cause vanishing or exploding gradient?

neural-network backpropagation

— user1157751

@NeilSlater Thanks for your reply! Can you elaborate? I am not sure what you meant.

— user1157751

Ah, I think I know what you meant. Well, the reason I raised this question is to check whether the derivation is correct. I searched around and did not find an example of ReLU backpropagation derived entirely from scratch.

— user1157751

Working definitions of the ReLU function and its derivative:

$ReLU(x) = \begin{cases} 0, & \text{if } x < 0, \\ x, & \text{otherwise}. \end{cases}$

$\frac{d}{dx} ReLU(x) = \begin{cases} 0, & \text{if } x < 0, \\ 1, & \text{otherwise}. \end{cases}$

The derivative is the unit step function. This does ignore a problem at $x=0$, where the gradient is not strictly defined, but that is not a practical concern for neural networks. With the above formula, the derivative at 0 is 1, but you could equally treat it as 0, or 0.5, with no real impact on neural network performance.
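As a minimal sketch of these working definitions in Python (the function names `relu` and `step` are illustrative choices, not taken from any library, and the derivative at exactly 0 is taken as 1 to match the formula above):

```python
def relu(x: float) -> float:
    """ReLU as defined above: 0 for x < 0, x otherwise."""
    return 0.0 if x < 0 else x


def step(x: float) -> float:
    """Derivative of ReLU under the convention above: 0 for x < 0, 1 otherwise."""
    return 0.0 if x < 0 else 1.0


# Evaluate both functions on a negative input, the kink at 0, and a positive input.
for value in (-2.0, 0.0, 3.0):
    print(value, relu(value), step(value))
```

Changing the `else` branch of `step` at exactly 0 is the only place where a different convention for the kink would show up.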

Simplified network

With those definitions, let's take a look at your example networks.

You are running regression with the cost function $C = \frac{1}{2}(y-\hat{y})^2$. You have defined $R$ as the output of the artificial neuron, but you have not defined an input value. I'll add that for completeness: call it $z$, add some indexing by layer, and I prefer lower-case for vectors and upper-case for matrices, so $r^{(1)}$ is the output of the first layer, $z^{(1)}$ is its input and $W^{(0)}$ is the weight connecting the neuron to its input $x$ (in a larger network, that might connect to a deeper $r$ value instead). I have also adjusted the index number for the weight matrix; why that is will become clearer for the larger network. NB I am ignoring having more than one neuron in each layer for now.

Looking at your simple 1 layer, 1 neuron network, the feed-forward equations are:

$z^{(1)} = W^{(0)}x$

$\hat{y} = r^{(1)} = ReLU(z^{(1)})$

The derivative of the cost function w.r.t. an example estimate is:

$\frac{\partial C}{\partial \hat{y}} = \frac{\partial C}{\partial r^{(1)}} = \frac{\partial}{\partial r^{(1)}}\frac{1}{2}(y-r^{(1)})^2 = \frac{1}{2}\frac{\partial}{\partial r^{(1)}}(y^2 - 2yr^{(1)} + (r^{(1)})^2) = r^{(1)} - y$
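As a cross-check, the same result follows from the chain rule instead of expanding the square. Writing $U = y - r^{(1)}$ (a helper symbol used only for this check), so that $C = \frac{1}{2}U^2$:

$\frac{\partial C}{\partial r^{(1)}} = \frac{\partial C}{\partial U}\frac{\partial U}{\partial r^{(1)}} = U \cdot (-1) = -(y - r^{(1)}) = r^{(1)} - y$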

Using the chain rule for back propagation to the pre-transform ( $z$ ) value:

$\frac{\partial C}{\partial z^{(1)}} = \frac{\partial C}{\partial r^{(1)}} \frac{\partial r^{(1)}}{\partial z^{(1)}} = (r^{(1)} - y)Step(z^{(1)}) = (ReLU(z^{(1)}) - y)Step(z^{(1)})$

This $\frac{\partial C}{\partial z^{(1)}}$ is an interim stage and a critical part of backprop, linking steps together. Derivations often skip this part because clever combinations of cost function and output layer mean that it is simplified. Here it is not.

To get the gradient with respect to the weight $W^{(0)}$, it takes another iteration of the chain rule:

$\frac{\partial C}{\partial W^{(0)}} = \frac{\partial C}{\partial z^{(1)}} \frac{\partial z^{(1)}}{\partial W^{(0)}} = (ReLU(z^{(1)}) - y)Step(z^{(1)})x = (ReLU(W^{(0)}x) - y)Step(W^{(0)}x)x$

... because $z^{(1)} = W^{(0)}x$, and therefore $\frac{\partial z^{(1)}}{\partial W^{(0)}} = x$

That is the full solution for your simplest network.
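One way to sanity-check this result is to compare it against a numerical (central finite difference) gradient. The sketch below assumes scalar values and uses illustrative names (`w0` for $W^{(0)}$, `cost`, `grad_w0`) and arbitrary example numbers that are not from the question:

```python
def relu(z: float) -> float:
    return 0.0 if z < 0 else z


def step(z: float) -> float:
    return 0.0 if z < 0 else 1.0


def cost(w0: float, x: float, y: float) -> float:
    """C = 1/2 (y - y_hat)^2 with y_hat = ReLU(w0 * x)."""
    y_hat = relu(w0 * x)
    return 0.5 * (y - y_hat) ** 2


def grad_w0(w0: float, x: float, y: float) -> float:
    """Analytic gradient: (ReLU(z1) - y) * Step(z1) * x, with z1 = w0 * x."""
    z1 = w0 * x
    return (relu(z1) - y) * step(z1) * x


# Arbitrary example values (assumptions for illustration only).
w0, x, y = 0.5, 2.0, 3.0
eps = 1e-6

analytic = grad_w0(w0, x, y)
numeric = (cost(w0 + eps, x, y) - cost(w0 - eps, x, y)) / (2 * eps)
print(analytic, numeric)  # the two values should agree closely
```

Note the sign of the first factor: it is $(\hat{y} - y)$, not $(y - \hat{y})$, and the $Step$ factor switches the gradient off entirely whenever $z^{(1)} < 0$.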

However, in a layered network, you also need to carry the same logic down to the next layer. Also, you typically have more than one neuron in a layer.

More general ReLU network

If we add in more generic terms, then we can work with two arbitrary layers. Call them Layer $(k)$ indexed by $i$ , and Layer $(k+1)$ indexed by $j$ . The weights are now a matrix. So our feed-forward equations look like this:

$z^{(k+1)}_j = \sum_{\forall i} W^{(k)}_{ij}r^{(k)}_i$

$r^{(k+1)}_j = ReLU(z^{(k+1)}_j)$

In the output layer, the initial gradient w.r.t. $r^{output}_j$ is still $r^{output}_j - y_j$. However, ignore that for now, and look at the generic way to back propagate, assuming we have already found $\frac{\partial C}{\partial r^{(k+1)}_j}$ (just note that this is ultimately where we get the output cost function gradients from). Then there are 3 equations we can write out following the chain rule:

First we need to get to the neuron input before applying ReLU:

$\frac{\partial C}{\partial z^{(k+1)}_j} = \frac{\partial C}{\partial r^{(k+1)}_j} \frac{\partial r^{(k+1)}_j}{\partial z^{(k+1)}_j} = \frac{\partial C}{\partial r^{(k+1)}_j}Step(z^{(k+1)}_j)$

We also need to propagate the gradient to previous layers, which involves summing up all connected influences to each neuron:

$\frac{\partial C}{\partial r^{(k)}_i} = \sum_{\forall j} \frac{\partial C}{\partial z^{(k+1)}_j} \frac{\partial z^{(k+1)}_j}{\partial r^{(k)}_i} = \sum_{\forall j} \frac{\partial C}{\partial z^{(k+1)}_j} W^{(k)}_{ij}$

And we need to connect this to the weights matrix in order to make adjustments later:

$\frac{\partial C}{\partial W^{(k)}_{ij}} = \frac{\partial C}{\partial z^{(k+1)}_j} \frac{\partial z^{(k+1)}_j}{\partial W^{(k)}_{ij}} = \frac{\partial C}{\partial z^{(k+1)}_j} r^{(k)}_{i}$

You can resolve these further (by substituting in previous values), or combine them (often steps 1 and 2 are combined to relate pre-transform gradients layer by layer). However the above is the most general form. You can also substitute the $Step(z^{(k+1)}_j)$ in equation 1 for whatever the derivative function is of your current activation function - this is the only place where it affects the calculations.
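The three equations can be sketched for a single pair of layers in plain Python. This is only an illustration under stated assumptions: the function names `forward` and `backward`, the nested-list layout `W[i][j]`, and the example numbers are hypothetical, and layer $(k+1)$ is treated as the output layer with cost $\frac{1}{2}\sum_j (y_j - r_j)^2$ so that the incoming gradient is $r_j - y_j$.

```python
def relu(z: float) -> float:
    return 0.0 if z < 0 else z


def step(z: float) -> float:
    return 0.0 if z < 0 else 1.0


def forward(W, r_k):
    """Feed-forward for one layer; W[i][j] connects neuron i (layer k) to neuron j (layer k+1)."""
    n_i, n_j = len(W), len(W[0])
    z_next = [sum(W[i][j] * r_k[i] for i in range(n_i)) for j in range(n_j)]
    r_next = [relu(z) for z in z_next]
    return z_next, r_next


def backward(W, r_k, z_next, dC_dr_next):
    """Apply the three chain-rule equations for one layer."""
    n_i, n_j = len(W), len(W[0])
    # Equation 1: gradient at the neuron inputs, before ReLU was applied.
    dC_dz = [dC_dr_next[j] * step(z_next[j]) for j in range(n_j)]
    # Equation 2: propagate to the previous layer's outputs, summing over j.
    dC_dr_k = [sum(dC_dz[j] * W[i][j] for j in range(n_j)) for i in range(n_i)]
    # Equation 3: gradient for each weight W[i][j].
    dC_dW = [[dC_dz[j] * r_k[i] for j in range(n_j)] for i in range(n_i)]
    return dC_dr_k, dC_dW


# Hypothetical example: 2 neurons in layer k, 2 neurons in the output layer.
r_k = [1.0, 2.0]
W = [[0.5, -1.0], [0.25, 0.25]]
y = [2.0, 1.0]

z_next, r_next = forward(W, r_k)
dC_dr_next = [r_next[j] - y[j] for j in range(len(y))]  # output-layer gradient
dC_dr_k, dC_dW = backward(W, r_k, z_next, dC_dr_next)
print(dC_dr_k)
print(dC_dW)
```

In this example the second output neuron has a negative input, so its step factor is 0 and it passes no gradient to its weights or to the previous layer. For a deeper network, `dC_dr_k` would be fed into `backward` for the layer below.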

Back to your questions:

If this derivation is correct, how does this prevent vanishing?

Your derivation was not correct. However, that does not completely address your concerns.

The difference between using sigmoid versus ReLU is just in the step function compared to e.g. sigmoid's $y(1-y)$ , applied once per layer. As you can see from the generic layer-by-layer equations above, the gradient of the transfer function appears in one place only. The sigmoid's best case derivative adds a factor of 0.25 (when $x = 0, y = 0.5$ ), and it gets worse than that and saturates quickly to near zero derivative away from $x=0$ . The ReLU's gradient is either 0 or 1, and in a healthy network will be 1 often enough to have less gradient loss during backpropagation. This is not guaranteed, but experiments show that ReLU has good performance in deep networks.
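To put rough numbers on this, the sketch below multiplies only the activation-derivative factors along a path of `n_layers` layers, ignoring the weights entirely. It assumes the sigmoid's best case ($x = 0$ at every layer) and a ReLU path on which every unit is active; `n_layers = 10` is an arbitrary illustrative value.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_grad(x: float) -> float:
    """Sigmoid derivative y * (1 - y), where y = sigmoid(x)."""
    y = sigmoid(x)
    return y * (1.0 - y)


def step(x: float) -> float:
    """ReLU derivative: 0 for x < 0, 1 otherwise."""
    return 0.0 if x < 0 else 1.0


n_layers = 10  # arbitrary depth for illustration

# Best case for sigmoid: a factor of 0.25 at every layer.
best_case_sigmoid = sigmoid_grad(0.0) ** n_layers
# ReLU path where every unit is active: a factor of 1 at every layer.
active_relu = step(1.0) ** n_layers

print(best_case_sigmoid, active_relu)
print(sigmoid_grad(5.0))  # away from x = 0 the sigmoid factor is far below 0.25
```

A single inactive ReLU unit on the path would make the ReLU product 0 for that path, which is why the comparison depends on units being active often enough.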

If there's thousands of layers, there would be a lot of multiplication due to weights, then wouldn't this cause vanishing or exploding gradient?

Yes this can have an impact too. This can be a problem regardless of transfer function choice. In some combinations, ReLU may help keep exploding gradients under control too, because it does not saturate (so large weight norms will tend to be poor direct solutions and an optimiser is unlikely to move towards them). However, this is not guaranteed.
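The effect of the weights can be seen in a toy chain of scalar ReLU neurons. The sketch below is an illustration only: it assumes one neuron per layer, the same weight `w` shared by every layer, and a positive input, none of which comes from the question; `input_gradient` is a hypothetical helper name.

```python
def relu(z: float) -> float:
    return 0.0 if z < 0 else z


def step(z: float) -> float:
    return 0.0 if z < 0 else 1.0


def input_gradient(w: float, x: float, n_layers: int) -> float:
    """d(output)/d(input) for n_layers stacked scalar ReLU neurons sharing weight w."""
    # Forward pass, remembering each pre-activation z.
    r = x
    zs = []
    for _ in range(n_layers):
        z = w * r
        zs.append(z)
        r = relu(z)
    # Backward pass: each layer contributes Step(z) * w.
    grad = 1.0
    for z in reversed(zs):
        grad *= step(z) * w
    return grad


for w in (0.5, 1.0, 1.5):
    print(w, input_gradient(w, 1.0, 50))
```

With all units active the gradient is simply `w ** n_layers`, so it shrinks towards zero for `w = 0.5` and grows very large for `w = 1.5`, even though the ReLU derivative itself is always 1 here.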

— Neil Slater

Was a chain rule performed on $\frac{dC}{d \hat y}$?

— user1157751

@user1157751: No, $\frac{\partial C}{\partial \hat{y}} = \frac{\partial C}{\partial r^{(1)}}$ because $\hat{y} = r^{(1)}$. The cost function $C$ is simple enough that you can take its derivative immediately. The only thing I haven't shown there is the expansion of the square - would you like me to add it?

— Neil Slater

But $C$ is $\frac{1}{2}(y- \hat y)^2$, don't we need to perform the chain rule so that we can take the derivative with respect to $\hat y$? $\frac{dC}{d \hat y}=\frac{dC}{dU}\frac{dU}{d \hat y}$, where $U = y - \hat y$. Apologies for asking really simple questions, my maths ability is probably causing trouble for you :(

— user1157751

If expanding makes things simpler, then please do expand the square.

— user1157751

@user1157751: Yes you could use the chain rule in that way, and it would give the same answer as I show. I just expanded the square - I'll show it.

— Neil Slater