Is the cost function of a neural network non-convex?

36

The cost function of a neural network is $J(W,b)$, and it is claimed to be non-convex. I don't understand why that is so, since as I see it, it is quite similar to the cost function of logistic regression, right?

If it is non-convex, then the second-order derivative $\frac{\partial J}{\partial W} < 0$, right?

UPDATE

Thanks to the answers below, as well as @gung's comment, I got your point: if there are no hidden layers, the problem is convex, just like logistic regression. But if there are hidden layers, then by permuting the nodes in the hidden layers, together with the weights in the subsequent connections, we could have multiple solutions for the weights that result in the same loss.

Now more questions:

1) There are multiple local minima, and some of them should have the same value, since they correspond to permutations of nodes and weights, right?

2) If the nodes and weights are not permuted at all, then it is convex, right? And the minima will be global minima. If so, the answer to 1) is that all those local minima will have the same value, correct?

neural-networks loss-functions

— avocado

It is not convex because there can be multiple local minima.

— gung - Reinstate Monica

2

It depends on the neural network. Neural networks with linear activation functions and squared loss will yield a convex optimization problem (if my memory serves me right, so will radial basis function networks with fixed variances). However, neural networks are mostly used with non-linear activation functions (e.g. sigmoid), and then the optimization becomes non-convex.

— Cagdas Ozgenc

@gung, I got your point, and now I have more questions, see my update :-)

— avocado

55

At this point (2 years later), it might be better to roll your question back to the previous version, accept one of the answers below, and ask a new follow-up question that links to this one for context.

— gung - Reinstate Monica

1

@gung, yes, you're right, but now I'm not so sure about some aspects of the answer I upvoted before. Since I've left some new comments on the answers below, I'll wait a while to see whether a new question is necessary.

— avocado

25

The cost function of a neural network is in general neither convex nor concave. This means that the matrix of all second partial derivatives (the Hessian) is neither positive semidefinite nor negative semidefinite. Since the second derivative is a matrix, it is possible for it to be neither one nor the other.

To make this analogous to functions of one variable, one could say that the cost function is shaped neither like the graph of $x^2$ nor like the graph of $-x^2$. Another example of a non-convex, non-concave function is $\sin(x)$ on $\mathbb{R}$. One of the most striking differences is that $\pm x^2$ has only one extremum, whereas $\sin$ has infinitely many maxima and minima.
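The one-variable analogy can be checked numerically with the midpoint test: a convex function never lies above its chord at the midpoint, and a concave function never lies below it. The following is only a minimal sketch; the helper names `violates_convexity` and `violates_concavity` are made up for this illustration.

```python
import math

def violates_convexity(f, a, b):
    # A convex f satisfies f((a+b)/2) <= (f(a)+f(b))/2 for every a, b.
    return f((a + b) / 2) > (f(a) + f(b)) / 2

def violates_concavity(f, a, b):
    # A concave f satisfies f((a+b)/2) >= (f(a)+f(b))/2 for every a, b.
    return f((a + b) / 2) < (f(a) + f(b)) / 2

def square(x):
    return x * x

# sin on [0, pi]: the midpoint value is 1, the chord value is about 0.
sin_not_convex = violates_convexity(math.sin, 0.0, math.pi)
# sin on [pi, 2*pi]: the midpoint value is -1, the chord value is about 0.
sin_not_concave = violates_concavity(math.sin, math.pi, 2 * math.pi)

# x^2 never violates the convexity inequality on this small integer grid.
grid = range(-5, 6)
square_violations = [(a, b) for a in grid for b in grid
                     if violates_convexity(square, a, b)]

print(sin_not_convex, sin_not_concave, square_violations)
```

A single violating pair of points is enough to rule out convexity (or concavity), whereas passing the test on a grid is only evidence, not a proof.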

How does this relate to our neural network? A cost function $J(W,b)$ also has a number of local maxima and minima, as you can see in this picture, for example.

The fact that $J$ has multiple minima can also be interpreted in a nice way. In each layer you use multiple nodes, which are assigned different parameters in order to make the cost function small. Apart from the values of their parameters, these nodes are identical. So you could exchange the parameters of the first node in a layer with those of the second node in the same layer, and account for this change in the subsequent layers. You would end up with a different set of parameters, but the value of the cost function would be indistinguishable (basically, you just moved a node to another place but kept all the inputs/outputs the same).
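The node-swapping argument can be illustrated with a tiny network. This is a sketch under assumptions that are not in the answer: one hidden layer with two tanh units, a linear output, mean squared error as the cost, and arbitrary made-up weights and data.

```python
import math

def forward(x, W1, b1, w2, b2):
    # Hidden layer: one tanh unit per row of W1.
    hidden = [math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
              for row, b in zip(W1, b1)]
    # Linear output unit.
    return sum(v * h for v, h in zip(w2, hidden)) + b2

def cost(data, W1, b1, w2, b2):
    # Mean squared error over the data set.
    return sum((forward(x, W1, b1, w2, b2) - y) ** 2 for x, y in data) / len(data)

# Made-up data and weights, chosen only for illustration.
data = [((0.0, 0.0), 0.0), ((0.0, 1.0), 1.0), ((1.0, 0.0), 1.0), ((1.0, 1.0), 0.0)]
W1 = [[0.5, -1.2], [2.0, 0.3]]
b1 = [0.1, -0.4]
w2 = [1.5, -0.7]
b2 = 0.2

# Swap the two hidden nodes: their incoming weights and biases,
# and the matching outgoing weights in the next layer.
W1_swapped = [W1[1], W1[0]]
b1_swapped = [b1[1], b1[0]]
w2_swapped = [w2[1], w2[0]]

J_original = cost(data, W1, b1, w2, b2)
J_swapped = cost(data, W1_swapped, b1_swapped, w2_swapped, b2)
print(J_original, J_swapped)
```

The two parameter sets are different points in weight space, yet they give the same cost, because the swapped network computes the same function.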

— Roland

OK, I understand the permutation explanation you gave, and I think it makes sense, but now I wonder whether this is the authentic explanation of why the neural network is non-convex.

— avocado

1

What do you mean by 'authentic'?

— Roland

I mean, this is how it should be interpreted, not just an analogy.

— avocado

44

@loganecolss You are right that this is not the only reason why cost functions are non-convex, but it is one of the most obvious ones. Depending on the network and the training set, there may be other reasons why there are multiple minima. But the bottom line is: permutation alone creates non-convexity, regardless of other effects.

— Roland

1

Sorry, I can't follow the last paragraph. I also don't see why I mentioned max(0,x) here. In any case, I think the right way to show that there may be multiple modes (multiple local minima) is to prove it in some way. P.S. An indefinite Hessian by itself says nothing: a quasiconvex function can have an indefinite Hessian and still be unimodal.

— bruziuz

17

If you permute the neurons in the hidden layer and apply the same permutation to the weights of the adjacent layers, the loss does not change. Hence, if there is a non-zero global minimum as a function of the weights, it cannot be unique, since the permutation of the weights gives another minimum. Hence the function is not convex.
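One way to make the last step concrete: if the cost were convex, its value at the midpoint of two minimizers could not exceed their average. The sketch below builds a case where it does. The setup is an assumption chosen for illustration and is not part of the answer: one input, two tanh hidden units without biases, a linear output, and targets generated by the network itself so that the minimum cost is zero.

```python
import math

def predict(x, theta):
    # theta = (w1, w2, v1, v2): two hidden weights, two output weights.
    w1, w2, v1, v2 = theta
    return v1 * math.tanh(w1 * x) + v2 * math.tanh(w2 * x)

def cost(theta, data):
    # Mean squared error over the data set.
    return sum((predict(x, theta) - y) ** 2 for x, y in data) / len(data)

# Targets produced by the function 2*tanh(x), which theta below reproduces.
xs = [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0]
data = [(x, 2.0 * math.tanh(x)) for x in xs]

theta = (1.0, -1.0, 1.0, -1.0)
theta_perm = (-1.0, 1.0, -1.0, 1.0)  # the same network with its hidden units swapped
midpoint = tuple((a + b) / 2 for a, b in zip(theta, theta_perm))  # all zeros

J_theta = cost(theta, data)
J_perm = cost(theta_perm, data)
J_mid = cost(midpoint, data)
print(J_theta, J_perm, J_mid)
```

Both permuted parameter vectors fit the data (cost about zero), but their midpoint is the all-zero network, which predicts 0 everywhere and has a strictly larger cost. That violates the midpoint inequality, so this cost function is not convex.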

— Abhinav

5

Whether the objective function is convex or not depends on the details of the network. In the case where multiple local minima exist, you ask whether they're all equivalent. In general, the answer is no, but the chance of finding a local minimum with good generalization performance appears to increase with network size.

This paper is of interest:

Choromanska et al. (2015). The Loss Surfaces of Multilayer Networks

http://arxiv.org/pdf/1412.0233v3.pdf

From the introduction:

For large-size networks, most local minima are equivalent and yield similar performance on a test set.

The probability of finding a "bad" (high value) local minimum is non-zero for small-size networks and decreases quickly with network size.

Struggling to find the global minimum on the training set (as opposed to one of the many good local ones) is not useful in practice and may lead to overfitting.

They also cite some papers describing how saddle points are a bigger issue than local minima when training large networks.

— user20160

4

Some answers for your updates:

1) Yes, there are in general multiple local minima. (If there were only one, it would be called the global minimum.) The local minima will not necessarily have the same value. In general, there may be no local minima sharing the same value.

2) No, it's not convex unless it's a one-layer network. In the general multiple-layer case, the parameters of the later layers (the weights and activation parameters) can be highly recursive functions of the parameters in previous layers. Generally, multiplication of decision variables introduced by some recursive structure tends to destroy convexity. Another great example of this is MA(q) models in time series analysis.

Side note: I don't really know what you mean by permuting nodes and weights. If the activation function varies across nodes, for instance, and you permute the nodes, you're essentially optimizing a different neural network. That is, while the minima of this permuted network may be the same minima, this is not the same network so you can't make a statement about the multiplicity of the same minima. For an analogy of this in the least-squares framework, you are for example swapping some rows of $y$ and $X$ and saying that since the minimum of $\|y - X\beta\|$ is the same as before that there are as many minimizers as there are permutations.

— Mustafa S Eisa

1

"one-layer network" would be just what "softmax" or logistic regression looks like, right?

— avocado

By "permuting nodes and weights", I mean "swapping", and that's what I got from the above 2 old answers, and as I understood their answers, by "swapping" nodes and weights in hidden layers, we might end up having the same output in theory, and that's why we might have multiple minima. You mean this explanation is not correct?

— avocado

You have the right idea, but it's not quite the same. For networks, the loss may not necessarily be binomial loss, the activation functions may not necessarily be sigmoids, etc.

— Mustafa S Eisa

Yes, I don't think it's correct. Even though it's true that you'll get the same performance whether you permute these terms or not, this doesn't define the convexity or non-convexity of any problem. The optimization problem is convex if, for a fixed loss function (not any permutation of the terms in the loss), the objective function is convex in the model parameters and the feasible region upon which you are optimizing is convex and closed.

— Mustafa S Eisa

I see, so if it's "one-layer", it might not be "softmax".

— avocado

2

You will have one global minimum if the problem is convex or quasiconvex.

About convex "building blocks" during building neural networks (Computer Science version)

I think there are several of them which can be mentioned:

1) max(0,x): convex and increasing.

2) log-sum-exp: convex and increasing in each parameter.

3) y = Ax is affine, and so convex, in A (it may be increasing or decreasing). Likewise, y = Ax is affine, and so convex, in x (it may be increasing or decreasing). Unfortunately it is not convex in (A, x) jointly, because it looks like an indefinite quadratic form.

4) The usual mathematical discrete convolution (by "usual" I mean defined with a repeating signal), Y = h*X. It looks like an affine function of h, or of X, so it is convex in h or in X separately. In both variables jointly, I don't think so, because when h and X are scalars the convolution reduces to an indefinite quadratic form.

5) max(f,g): if f and g are convex, then max(f,g) is also convex.

If you substitute one function into another to build compositions, then you stay within convex functions for y = h(g(x), q(x)) provided that g and q are convex and that h is convex and non-decreasing in each argument.

Why neural networks are non-convex:

1) I think the convolution Y = h*X is not necessarily increasing in h. So unless you make extra assumptions about the kernel, you leave convex optimization immediately after you apply a convolution. So the composition rule does not go through.

2) Also, convolution and matrix multiplication are not convex when both parameters are considered jointly, as mentioned above. So there is even a problem with matrix multiplication: it is a non-convex operation in the parameters (A, x).

3) y = Ax can be quasiconvex in (A, x), but extra assumptions have to be taken into account as well.
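As a small worked example of the joint non-convexity of y = Ax, consider the scalar case, where A and x reduce to real numbers $a$ and $x$ (the points used below are chosen only for illustration):

$$f(a,x) = a\,x, \qquad \nabla^2 f = \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}, \qquad \text{eigenvalues } \pm 1 .$$

The Hessian is indefinite, so $f$ is neither convex nor concave in $(a,x)$. The midpoint test shows the same thing directly: $f(1,-1) = f(-1,1) = -1$, but at their midpoint $f(0,0) = 0 > -1$, which contradicts convexity; and $f(1,1) = f(-1,-1) = 1$, but $f(0,0) = 0 < 1$, which contradicts concavity.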

Please let me know if you disagree or have any extra consideration. The question is also very interesting to me.

P.S. Max-pooling, which is downsampling by selecting the maximum, looks like a modification of elementwise max operations with an affine precomposition (to pull out the needed blocks), and it looks convex to me.

About other questions

1) No, logistic regression is neither convex nor concave, but it is log-concave. This means that after applying the logarithm you get a concave function of the explanatory variables. So the maximum log-likelihood trick works great here.

2) If there is more than one minimum, nothing can be said about the relation between the local minima. Or at least you cannot use convex optimization and its extensions for this, because that area of math relies heavily on a global underestimator.

Maybe this is the source of your confusion. In practice, people who build such schemes just do "something" and get "something" back, because unfortunately we do not have a perfect mechanism for tackling non-convex optimization (in general).

But there are even simpler problems than neural networks that cannot be solved, such as non-linear least squares: https://youtu.be/l1X4tOoIHYo?t=2992 (EE263, L8, 50:10)

— bruziuz