Why is there always at least one policy that is better than or equal to all other policies?

14

Solving a reinforcement learning task means, roughly, finding a policy that achieves a lot of reward over the long run. For finite MDPs, we can precisely define an optimal policy in the following way. Value functions define a partial ordering over policies. A policy $\pi$ is defined to be better than or equal to a policy $\pi'$ if its expected return is greater than or equal to that of $\pi'$ for all states. In other words, $\pi \geq \pi'$ if and only if $v_\pi(s) \geq v_{\pi'}(s)$ for all $s \in \mathcal{S}$. There is always at least one policy that is better than or equal to all other policies. This is an optimal policy.

markov-process reinforcement-learning

— sh1ng

A very detailed proof (which uses the Banach fixed-point theorem) appears in Section 6.2 of Puterman's "Markov Decision Processes".

— Toghs

3

Just past the quoted part, the same paragraph actually tells you what this policy is: it is the one that takes the best action in every state. In an MDP, the action we take in one state does not affect the rewards for actions taken in others, so we can simply maximize the policy state by state.

— Don reba

Isn't this answer completely wrong? How can you say that optimizing the policy state by state leads to an optimal policy? If I optimize over state $S_t$ and it takes me to $S_{t+1}$, and then optimizing at $S_{t+1}$ leads to an optimal value function $V_{t+1}$, but there is another policy in which $S_t$ leads suboptimally to $S_l$ and the optimal value function of $S_l$ is higher than $V_{t+1}$. How can you rule this out with such a cursory analysis?

— MiloMinderbinder

@MiloMinderbinder If the optimal policy at $S_t$ is to choose $S_{t+1}$, then the value of $S_{t+1}$ is higher than the value of $S_l$.

— Don Reba

My bad. Typo corrected: 'Isn't this answer completely wrong? How can you say that optimizing the policy state by state leads to an optimal policy? If I optimize over state $S_t$ and it takes me to $S_{t+1}$, and then optimizing at $S_{t+1}$ leads to an optimal value function $V_{t+2}$ of $S_{t+2}$, but there is another policy in which $S_t$ leads, albeit suboptimally, to $S_{l+1}$, and therefore the value function of $S_{t+1}$ is higher than $V_{l+1}$, but the value function of $S_{t+2}$ is higher under this policy than under the policy found by optimizing state by state. How do you rule this out?'

— MiloMinderbinder

I think the definition of $V$ will prevent this from happening in the first place, since it should also account for future returns.

— Flying_Banana

The question would be: why does $q_*$ exist? You cannot get around the Banach fixed-point theorem :-)

— Fabian Werner

10

The existence of an optimal policy is not obvious. To see why, note that the value function provides only a partial ordering over the space of policies. This means:

$$\pi' \geq \pi \iff v_{\pi'}(s) \geq v_{\pi}(s), \forall s \in S$$

Since this is only a partial ordering, there could be a case where two policies, $\pi_1$ and $\pi_2$, are not comparable. In other words, there are subsets of the state space, $S_1$ and $S_2$, such that:

$$v_{\pi_1}(s) \geq v_{\pi_2}(s), \forall s \in S_1$$

$$v_{\pi_2}(s) \geq v_{\pi_1}(s), \forall s \in S_2$$

with strict inequality for at least one state in each subset.

In this case, we can't say that one policy is better than the other. But if we are dealing with finite MDPs with bounded value functions, then such a scenario never rules out an optimal policy: there is always at least one policy that is better than or equal to every other policy. There is exactly one optimal value function, though there might be multiple optimal policies.

For a proof of this, you need to understand the Banach Fixed Point theorem. For a detailed analysis, please refer.
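As a sanity check rather than a proof, the claim can be verified by brute force on a tiny MDP. The sketch below assumes a made-up 2-state, 2-action MDP (the names `P`, `R`, `GAMMA`, `evaluate` and `dominates` are all hypothetical), evaluates every deterministic policy, and lists the policies whose value function is at least as large as every other policy's in every state.

```python
from itertools import product

# Hypothetical 2-state, 2-action MDP; every number here is made up for illustration.
# P[s][a] is a list of (next_state, probability); R[s][a] is the expected reward.
P = {
    0: {0: [(0, 0.9), (1, 0.1)], 1: [(0, 0.2), (1, 0.8)]},
    1: {0: [(0, 0.7), (1, 0.3)], 1: [(1, 1.0)]},
}
R = {0: {0: 1.0, 1: 0.0}, 1: {0: 0.5, 1: 2.0}}
GAMMA = 0.9
STATES = [0, 1]
ACTIONS = [0, 1]


def evaluate(policy, sweeps=2000):
    """Iterative policy evaluation for a deterministic policy (tuple of actions)."""
    v = [0.0 for _ in STATES]
    for _ in range(sweeps):
        v = [
            R[s][policy[s]] + GAMMA * sum(p * v[s2] for s2, p in P[s][policy[s]])
            for s in STATES
        ]
    return v


def dominates(v1, v2, tol=1e-9):
    """True if v1(s) >= v2(s) in every state, up to a numerical tolerance."""
    return all(a >= b - tol for a, b in zip(v1, v2))


# Value function of each of the 4 deterministic policies.
values = {pi: evaluate(pi) for pi in product(ACTIONS, repeat=len(STATES))}

# Policies that are better than or equal to every policy in every state.
optimal = [
    pi for pi, v in values.items()
    if all(dominates(v, w) for w in values.values())
]
print(optimal)
```

Enumeration only covers deterministic stationary policies and only scales to toy problems; it illustrates the statement but does not replace the fixed-point argument.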

— Karthik Thiagarajan

7

$\newcommand{\mc}{\mathcal} \newcommand{\mb}{\mathbb}$

Setting

We consider the following setting:

Discrete actions
Discrete states
Bounded rewards
Stationary policy
Infinite horizon

The optimal policy is defined as:

$$\pi^\ast \in \arg \max_\pi V^\pi(s), \forall s \in \mc{S} \tag{1}$$

and the optimal value function is:

$$V^\ast(s) = \max_\pi V^\pi (s), \forall s \in \mc S \tag{2}$$

There can be a set of policies which achieve the maximum. But there is only one optimal value function:

$$V^\ast = V^{\pi^\ast} \tag{3}$$

The question

How to prove that there exists at least one $\pi^\ast$ which satisfies (1) simultaneously for all $s \in \mc{S}$ ?

Outline of proof

1. Construct the optimality equation, to be used as a temporary surrogate definition of the optimal value function; step 2 proves that it is equivalent to the definition via Eq. (2).

   $$V^\ast(s) = \max_{a \in \mc A} [ R(s, a) + \gamma \, \sum_{s^\prime \in \mc S} T(s, a, s^\prime) V^\ast(s^\prime)] \tag{4}$$

2. Derive the equivalence of defining the optimal value function via Eq. (4) and via Eq. (2).

   (Note that in fact we only need the necessity direction in the proof, because the sufficiency is obvious since we constructed Eq. (4) from Eq. (2).)

3. Prove that there is a unique solution to Eq. (4).
4. By step 2, we know that the solution obtained in step 3 is also a solution to Eq. (2), so it is an optimal value function.
5. From an optimal value function, we can recover an optimal policy by choosing the maximizer action in Eq. (4) for each state.
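The last three steps of the outline can be sketched numerically: iterate the right-hand side of Eq. (4) until it stops changing, then read off a greedy action per state. This is a minimal illustration, not part of the proof; the MDP below and the names `q_value`, `bellman_optimality`, `value_iteration` and `greedy_policy` are hypothetical.

```python
GAMMA = 0.9
N_STATES, N_ACTIONS = 2, 2

# Hypothetical MDP (made-up numbers): T[s][a][s2] is the transition probability
# from s to s2 under action a, and R[s][a] is the expected immediate reward.
T = [
    [[0.9, 0.1], [0.2, 0.8]],
    [[0.7, 0.3], [0.0, 1.0]],
]
R = [[1.0, 0.0], [0.5, 2.0]]


def q_value(V, s, a):
    """Bracketed term of Eq. (4) for a given state-action pair."""
    return R[s][a] + GAMMA * sum(T[s][a][s2] * V[s2] for s2 in range(N_STATES))


def bellman_optimality(V):
    """Apply the right-hand side of Eq. (4) to every state."""
    return [max(q_value(V, s, a) for a in range(N_ACTIONS)) for s in range(N_STATES)]


def value_iteration(tol=1e-10):
    """Iterate the optimality operator until successive iterates agree within tol."""
    V = [0.0] * N_STATES
    while True:
        V_new = bellman_optimality(V)
        if max(abs(x - y) for x, y in zip(V_new, V)) < tol:
            return V_new
        V = V_new


def greedy_policy(V):
    """Pick a maximizing action of Eq. (4) in each state (step 5)."""
    return [
        max(range(N_ACTIONS), key=lambda a: q_value(V, s, a))
        for s in range(N_STATES)
    ]


V_star = value_iteration()
pi_star = greedy_policy(V_star)
print(V_star, pi_star)
```

The loop terminates because each application of the operator shrinks the sup-norm distance between iterates by a factor of at most `GAMMA`, which is what step 3 establishes.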

Details of the steps

1

Since $V^\ast(s) = V^{\pi^\ast}(s) = \mb E_a [Q^{\pi^\ast}(s, a)]$, we have $V^{\pi^\ast}(s) \le \max_{a \in \mc A} Q^{\pi^\ast} (s, a)$. And if there is any $\tilde{s}$ such that $V^{\pi^\ast}(\tilde s) \neq \max_{a \in \mc A} Q^{\pi^\ast} (\tilde s, a)$, we can choose a better policy by maximizing $Q^{\ast} (s, a) = Q^{\pi^\ast} (s, a)$ over $a$.

2

(=>)

Follows by step 1.

(<=)

i.e. If $\tilde V$ satisfies $\tilde V(s) = \max_{a \in \mc A} [ R(s, a) + \gamma \, \sum_{s^\prime \in \mc S} T(s, a, s^\prime) \tilde V(s^\prime)]$ , then $\tilde V(s) = V^\ast(s) = \max_\pi V^\pi(s), \forall s \in \mc S$ .

Define the optimal Bellman operator as

$$\mc T V(s) = \max_{a \in \mc A} [ R(s, a) + \gamma \, \sum_{s^\prime \in \mc S} T(s, a, s^\prime) V(s^\prime)] \tag{5}$$

So our goal is to prove that if $\tilde V = \mc T \tilde V$, then $\tilde V = V^\ast$. We show this by combining two results, following Puterman [1]:

a) If $\tilde V \ge \mc T \tilde V$ , then $\tilde V \ge V^\ast$ .

b) If $\tilde V \le \mc T \tilde V$ , then $\tilde V \le V^\ast$ .

Proof:

a)

For any $\pi = (d_1, d_2, \dots)$,

$$\begin{aligned} \tilde V &\ge \mc T \tilde V = \max_{d} [ R_d + \gamma \, P_d \tilde V] \\ &\ge R_{d_1} + \gamma \, P_{d_1} \tilde V \end{aligned}$$

Here $d$ is the decision rule (the action profile at a specific time), $R_d$ is the vector of immediate rewards induced by $d$, and $P_d$ is the transition matrix induced by $d$.

By induction, for any $n$,

$$\tilde V \ge R_{d_1} + \sum_{i=1}^{n-1} \gamma^i P_\pi^i R_{d_{i+1}} + \gamma^n P_\pi^n \tilde V$$

where $P_\pi^j$ represents the $j$-step transition matrix under $\pi$.
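The induction step is not spelled out here. One way to fill it in, assuming $P_\pi^n = P_{d_1} P_{d_2} \cdots P_{d_n}$ (a reading of the notation that the answer does not state explicitly): this matrix has nonnegative entries, so it preserves componentwise inequalities, and applying $\tilde V \ge \mc T \tilde V \ge R_{d_{n+1}} + \gamma \, P_{d_{n+1}} \tilde V$ to the last term gives

$$\begin{aligned} \gamma^n P_\pi^n \tilde V &\ge \gamma^n P_\pi^n \left( R_{d_{n+1}} + \gamma \, P_{d_{n+1}} \tilde V \right) \\ &= \gamma^n P_\pi^n R_{d_{n+1}} + \gamma^{n+1} P_\pi^{n+1} \tilde V \end{aligned}$$

Substituting this into the bound for $n$ yields the same bound with $n$ replaced by $n+1$.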

Since

$$V^\pi = R_{d_1} + \sum_{i=1}^{\infty}\gamma^i P_\pi^i R_{d_{i+1}}$$

we have

$$\tilde V - V^\pi \ge \underbrace{\gamma^n P_\pi^n \tilde V -\sum_{i=n}^{\infty}\gamma^i P_\pi^i R_{d_{i+1}}}_{\rightarrow 0 \ \text{as}\ n\rightarrow \infty}$$

So we have $\tilde V \ge V^\pi$. And since this holds for any $\pi$, we conclude that

$$\tilde V \ge \max_\pi V^\pi = V^\ast$$

b)

Follows from step 1.

3

The optimal Bellman operator is a contraction in $L_\infty$ norm, cf. [2].

Proof: For any $s$,

$$\begin{aligned} \left\vert \mc T V_1(s) - \mc TV_2(s) \right\vert &= \left\vert \max_{a \in \mc A} [ R(s, a) + \gamma \, \sum_{s^\prime \in \mc S} T(s, a, s^\prime) V_1(s^\prime)] -\max_{a^\prime \in \mc A} [ R(s, a^\prime) + \gamma \, \sum_{s^\prime \in \mc S} T(s, a^\prime, s^\prime) V_2(s^\prime)]\right\vert \\ &\overset{(*)}{\le} \max_{a \in \mc A} \left\vert \gamma \, \sum_{s^\prime \in \mc S} T(s, a, s^\prime) (V_1(s^\prime) - V_2(s^\prime)) \right\vert \\ &\le \gamma \Vert V_1 - V_2 \Vert_\infty \end{aligned}$$

where in (*) we used the fact that

$$\max_a f(a) - \max_{a^\prime} g(a^\prime) \le \max_a [f(a) - g(a)]$$

applied in both directions (with the roles of $f$ and $g$ swapped), so that the bound holds for the absolute value. The last inequality holds because $T(s, a, \cdot)$ is a probability distribution.

Thus, for $\gamma < 1$, it follows by the Banach fixed-point theorem that $\mc T$ has a unique fixed point.
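The contraction bound can also be checked numerically. The sketch below is an empirical spot check under assumed inputs, not a proof: it builds a random MDP (the sizes, the seed and the helper names `random_distribution`, `bellman_optimality` and `sup_norm` are all hypothetical) and records the largest observed ratio between the sup-norm distance after applying the operator and the distance before.

```python
import random

GAMMA = 0.9
N_STATES, N_ACTIONS = 4, 3
random.seed(0)  # fixed seed so the run is reproducible


def random_distribution(n):
    """Random probability vector of length n."""
    weights = [random.random() for _ in range(n)]
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical random MDP: T[s][a] is a distribution over next states,
# R[s][a] is a bounded expected reward.
T = [[random_distribution(N_STATES) for _ in range(N_ACTIONS)] for _ in range(N_STATES)]
R = [[random.uniform(-1, 1) for _ in range(N_ACTIONS)] for _ in range(N_STATES)]


def bellman_optimality(V):
    """Optimal Bellman operator of Eq. (5) applied to a value vector V."""
    return [
        max(
            R[s][a] + GAMMA * sum(T[s][a][s2] * V[s2] for s2 in range(N_STATES))
            for a in range(N_ACTIONS)
        )
        for s in range(N_STATES)
    ]


def sup_norm(u, v):
    """L-infinity distance between two value vectors."""
    return max(abs(x - y) for x, y in zip(u, v))


worst_ratio = 0.0
for _ in range(1000):
    V1 = [random.uniform(-10, 10) for _ in range(N_STATES)]
    V2 = [random.uniform(-10, 10) for _ in range(N_STATES)]
    ratio = sup_norm(bellman_optimality(V1), bellman_optimality(V2)) / sup_norm(V1, V2)
    worst_ratio = max(worst_ratio, ratio)

# The contraction property predicts that this never exceeds GAMMA.
print(worst_ratio, "<=", GAMMA)
```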

References

[1] Puterman, Martin L.. “Markov Decision Processes : Discrete Stochastic Dynamic Programming.” (2016).

[2] A. Lazaric. http://researchers.lille.inria.fr/~lazaric/Webpage/MVA-RL_Course14_files/slides-lecture-02-handout.pdf

— LoveIris

-1

The policy $a=\pi(s)$ gives the best action $a$ to execute in state $s$ according to policy $\pi$ , i.e. the value function $v_\pi(s)=\max_{a \in A} q_\pi (s,a)$ is highest for action $a$ in state $s$ .

There is always at least one policy that is better than or equal to all other policies.

Thus there is always a policy $\pi_*$ which gives equal or higher expected rewards than policy $\pi$ . Note that this implies that $\pi$ could be an/the optimal policy ( $\pi_*$ ) itself.

— agold

3

How does this answer the question? You're basically repeating statements written in the quote.

— nbro