How to handle invalid moves in reinforcement learning?


I want to create an AI that can play five-in-a-row / Gomoku. As I mentioned in the title, I want to use reinforcement learning for this.

I use the policy gradient method, namely REINFORCE with a baseline. For the value and policy function approximation I use a neural network. It has convolutional and fully connected layers. All layers except the outputs are shared. The policy output layer has 8×8=64 output units (the board size) with a softmax over them, so it is stochastic. But what if the network produces a very high probability for an invalid move? An invalid move is when the agent wants to mark a square that already has an "X" or an "O" on it. I think it can get stuck in that game state.
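For concreteness, the kind of shared-trunk network described here might look like the following PyTorch sketch. The layer sizes, and the choice of 3 input planes (X stones, O stones, side to move), are my assumptions, not the actual implementation:

```python
import torch
import torch.nn as nn

class PolicyValueNet(nn.Module):
    """Shared conv trunk with a 64-way softmax policy head and a scalar value head."""
    def __init__(self, board_size=8):
        super().__init__()
        self.shared = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.Flatten(),
            nn.Linear(32 * board_size * board_size, 128), nn.ReLU(),
        )
        # Policy head: one logit per square; softmax is applied when sampling
        self.policy_head = nn.Linear(128, board_size * board_size)
        # Value head: the baseline for REINFORCE
        self.value_head = nn.Linear(128, 1)

    def forward(self, x):
        h = self.shared(x)
        return self.policy_head(h), self.value_head(h)
```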

Could you recommend a solution to this problem?

My guess is to use the actor-critic method. For an invalid move, we should give a negative reward and pass the turn to the opponent.



Just ignore the invalid moves.

For exploration, you are probably not just executing the move with the highest probability, but choosing moves randomly according to the output probabilities. If you only punish illegal moves, they will still retain some probability (however small) and will therefore be executed from time to time (albeit rarely). So you will always keep an agent that occasionally makes illegal moves.

To me it makes more sense to simply set the probabilities of all illegal moves to zero and renormalize the output vector before choosing your move.
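That renormalization can be sketched in a few lines of NumPy (the `probs` vector and boolean `legal` mask here are hypothetical example values):

```python
import numpy as np

def mask_and_renormalize(probs, legal):
    """Zero out illegal moves and rescale the rest to sum to 1."""
    masked = np.where(legal, probs, 0.0)
    return masked / masked.sum()

probs = np.array([0.3, 0.4, 0.2, 0.1])
legal = np.array([False, True, False, True])
print(mask_and_renormalize(probs, legal))  # prints [0, 0.8, 0, 0.2] (up to float formatting)
```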

Thanks. It probably wasn't clear, but I do choose the move randomly according to the emitted probabilities. I'll try your advice of setting the probability of illegal moves to zero and see what happens. Have a nice day.
Molnár István


Usually, softmax methods in policy gradient methods that use linear function approximation use the following formula to compute the probability of choosing action $a$. Here, the weights are $\theta$, and the features $\phi$ are a function of the current state $s$ and an action from the set of actions $A$:

$$\pi(a \mid s) = \frac{e^{\theta^\top \phi(s, a)}}{\sum_{b \in A} e^{\theta^\top \phi(s, b)}}$$


To eliminate illegal moves, one would limit the set of actions to only those that are legal, hence $Legal(A)$:

$$\pi(a \mid s) = \frac{e^{\theta^\top \phi(s, a)}}{\sum_{b \in Legal(A)} e^{\theta^\top \phi(s, b)}}, \quad a \in Legal(A)$$


In pseudocode the formula may look like this:

action_probs = Agent.getActionProbs(state)
legal_actions = filterLegalActions(state, action_probs)
best_legal_action = softmax(legal_actions)

Whether you use linear or non-linear function approximation (your neural network), the idea is to include only the legal moves when computing the softmax. This way only valid moves will be produced by the agent, which is good if you want to change your game later on, and the difference in value between the limited set of actions will be easier for the agent to discriminate. It will also be faster, as the number of possible actions decreases.

Very useful. Thanks for posting both the equations and pseudocode!

The maths and pseudocode do not match here. Softmax over the legal move probabilities will distort the relative probabilities. E.g. (0.3, 0.4, 0.2, 0.1) filtered with the first and third items removed would be (0.0, 0.8, 0.0, 0.2) with your formula, but would be (0.0, 0.57, 0.0, 0.43) using the pseudocode. The pseudocode needs to take the logits, prior to the action probability calculation.
Neil Slater
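One way to make the pseudocode agree with the maths is to mask the logits, not the probabilities: set illegal logits to -inf before the softmax. A NumPy sketch (the example logits are chosen so the unmasked softmax reproduces the probabilities in the comment above):

```python
import numpy as np

def masked_softmax(logits, legal):
    """Softmax restricted to legal actions: illegal logits -> -inf -> probability 0."""
    z = np.where(legal, logits, -np.inf)
    z = z - z.max()        # numerical stability; max is over legal entries
    e = np.exp(z)          # exp(-inf) == 0, so illegal entries vanish
    return e / e.sum()

logits = np.log(np.array([0.3, 0.4, 0.2, 0.1]))  # softmax(logits) == (0.3, 0.4, 0.2, 0.1)
legal = np.array([False, True, False, True])
print(masked_softmax(logits, legal))             # prints [0, 0.8, 0, 0.2] (up to float formatting)
```

Masking at the logit level gives exactly the renormalized distribution, rather than a softmax-of-probabilities.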

How does one compute the gradient of the filtered version of softmax? Seems like this would be necessary for backpropagation to work successfully, yes?

@brianberns Did you manage to find an answer? It seems like that would be the case to me, but somehow in my toy example I'm only getting the right answer when using the log probabilities of the unfiltered softmax...
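For what it's worth, if the filtering is done by setting illegal logits to -inf, the usual log-softmax gradient still applies: the gradient of log π(a) with respect to the logits is one-hot(a) − π, with zeros at the masked entries. A finite-difference check of that claim (NumPy, with hypothetical values):

```python
import numpy as np

def masked_log_prob(logits, legal, a):
    """log pi(a) under a softmax with illegal logits masked to -inf."""
    z = np.where(legal, logits, -np.inf)
    z = z - z.max()
    return z[a] - np.log(np.exp(z).sum())

logits = np.array([0.5, 1.0, -0.3, 0.2])
legal = np.array([True, True, False, True])
a = 1  # action whose log-probability we differentiate

# Analytic gradient: one-hot(a) - pi, zero on masked entries
z = np.where(legal, logits, -np.inf)
p = np.exp(z - z.max()); p /= p.sum()
analytic = -p
analytic[a] += 1.0
analytic[~legal] = 0.0

# Finite-difference approximation over the legal entries
eps = 1e-6
numeric = np.zeros_like(logits)
for i in np.where(legal)[0]:
    d = np.zeros_like(logits); d[i] = eps
    numeric[i] = (masked_log_prob(logits + d, legal, a)
                  - masked_log_prob(logits - d, legal, a)) / (2 * eps)

print(np.allclose(analytic, numeric, atol=1e-5))  # → True
```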


IMHO the idea of invalid moves is itself invalid. Imagine placing an "X" at coordinates (9, 9). You could consider it to be an invalid move and give it a negative reward. Absurd? Sure!

But in fact your invalid moves are just a relic of the representation (which itself is straightforward and fine). The best treatment of them is to exclude them completely from any computation.

This gets more apparent in chess:

  • In a positional representation, you might consider the move a1-a8, which only belongs in the game if there's a Rook or a Queen at a1 (and some other conditions hold).

  • In a different representation, you might consider the move Qb2. Again, this may or may not belong to the game. When the current player has no Queen, then it surely does not.

As the invalid moves are related to the representation rather than to the game, they should not be considered at all.

Great point. In [M] games, which are played on Sudoku boards, the constraints make many positions (coordinates + value) illegal after the first placement. There is no value in considering these illegal positions for placement, but an important strategic layer is recognizing which placements minimize the value of the remaining, unplayed positions (i.e. if I place an 8 here, it blocks my opponent from placing an 8 in that row, column, or region; essentially, "how many strategic positions does this placement remove from the game board?").


I faced a similar issue recently with Minesweeper.

The way I solved it was by ignoring the illegal/invalid moves entirely.

  1. Use the Q-network to predict the Q-values for all of your actions (valid and invalid)
  2. Pre-process the Q-values by setting all of the invalid moves to a Q-value of zero/negative number (depends on your scenario)
  3. Use a policy of your choice to select an action from the refined Q-values (i.e. greedy or Boltzmann)
  4. Execute the selected action and resume your DQN logic
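The steps above can be sketched as follows. NumPy stands in for the Q-network output, and the `q_values`, mask, and function names are hypothetical:

```python
import numpy as np

def select_action(q_values, valid_mask, greedy=True, temperature=1.0):
    """Pick an action among valid ones only (steps 2-3 above)."""
    # Step 2: push invalid actions out of contention
    masked_q = np.where(valid_mask, q_values, -np.inf)
    if greedy:
        # Greedy policy: highest valid Q-value
        return int(np.argmax(masked_q))
    # Boltzmann policy over the valid actions
    z = masked_q / temperature
    z = z - masked_q.max()
    probs = np.exp(z)            # exp(-inf) == 0 for invalid actions
    probs /= probs.sum()
    return int(np.random.choice(len(q_values), p=probs))

q = np.array([2.0, -1.0, 0.5, 3.0])
valid = np.array([True, True, True, False])  # action 3 is invalid
print(select_action(q, valid))               # → 0 (highest valid Q-value; 3.0 is filtered out)
```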

Hope this helps.

About the only thing I would add to this is that you have to remember to do a backprop pass on the DQN when you set the Q-values for illegal (s, a) pairs to a large negative value, so it's trained not to pick those state-action pairs next time.

But I do wonder what setting large -ve target Q values does to the continuity or shape of the loss/error function (thereby affecting the gradient search). What was your experience?

@S.N. I see your point. The idea is to pick the action with the highest Q-value that is not an invalid action. Next, you execute that action and use that action in your update rule (i.e. train your DQN to favor this action in the long run). What this does is make the future Q-values of the selected action higher and thus more favorable. It will NOT make the illegal actions Q-value lower though, which doesn't matter because they are always filtered out (not considered). Let me know if you want me to elaborate more with an example. :)

@Sanavesa Sure, makes sense; you're essentially counting on the DQN eventually learning what the correct choices are through the school of hard knocks. But in situations where there are only one or a few legal choices, you'll end up with very slow learning. The approach I'm suggesting is a way of incorporating domain knowledge into the problem to speed up that learning. It's also what I thought you were doing in your original post, where you wrote of "setting invalid moves to a Q-value of zero/negative number".

@S.N. Precisely! Both approaches have their merits. It depends on the application whether it's easier to learn the legal moves or just outright ignore them. For large, complex applications, I feel ignoring the invalid moves is much faster for the agent to learn, but don't quote me on that.
Licensed under cc by-sa 3.0 with attribution required.