The notation I will use is from two different David Silver lectures and is also informed by these slides.
The expected Bellman equation is
$$v_\pi(s) = \sum_{a \in \mathcal{A}} \pi(a|s) \left( \mathcal{R}_s^a + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}_{ss'}^a \, v_\pi(s') \right) \tag{1}$$
If we let
$$\mathcal{P}_{ss'}^\pi = \sum_{a \in \mathcal{A}} \pi(a|s) \, \mathcal{P}_{ss'}^a \tag{2}$$
and
$$\mathcal{R}_s^\pi = \sum_{a \in \mathcal{A}} \pi(a|s) \, \mathcal{R}_s^a \tag{3}$$
then we can rewrite (1) as
$$v_\pi(s) = \mathcal{R}_s^\pi + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}_{ss'}^\pi \, v_\pi(s') \tag{4}$$
This can be written in matrix form:
$$\begin{bmatrix} v_\pi(1) \\ \vdots \\ v_\pi(n) \end{bmatrix} = \begin{bmatrix} \mathcal{R}_1^\pi \\ \vdots \\ \mathcal{R}_n^\pi \end{bmatrix} + \gamma \begin{bmatrix} \mathcal{P}_{11}^\pi & \dots & \mathcal{P}_{1n}^\pi \\ \vdots & \ddots & \vdots \\ \mathcal{P}_{n1}^\pi & \dots & \mathcal{P}_{nn}^\pi \end{bmatrix} \begin{bmatrix} v_\pi(1) \\ \vdots \\ v_\pi(n) \end{bmatrix} \tag{5}$$
Or, more compactly,
$$v_\pi = \mathcal{R}^\pi + \gamma \mathcal{P}^\pi v_\pi \tag{6}$$
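Since $\mathcal{R}^\pi$ is just an $n$-vector and $\mathcal{P}^\pi$ an $n \times n$ matrix, (6) is a plain linear system and can be solved directly for $v_\pi$. Here is a minimal NumPy sketch of that idea; the two-state rewards and transition probabilities are invented purely for illustration.

```python
import numpy as np

# Invented two-state example: P_pi[s, s'] and R_pi[s] are the
# policy-averaged transition matrix and reward vector from (2) and (3).
P_pi = np.array([[0.7, 0.3],
                 [0.4, 0.6]])
R_pi = np.array([1.0, -0.5])
gamma = 0.9

# Rearranging (6): (I - gamma * P_pi) v_pi = R_pi, an ordinary linear system.
v_pi = np.linalg.solve(np.eye(2) - gamma * P_pi, R_pi)

# v_pi should satisfy (6) exactly (up to floating-point error).
assert np.allclose(v_pi, R_pi + gamma * P_pi @ v_pi)
print(v_pi)
```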
Notice that both sides of (6) are $n$-dimensional vectors. Here $n = |\mathcal{S}|$ is the size of the state space. We can then define an operator $T^\pi : \mathbb{R}^n \to \mathbb{R}^n$ as
$$T^\pi(v) = \mathcal{R}^\pi + \gamma \mathcal{P}^\pi v \tag{7}$$
for any $v \in \mathbb{R}^n$. This is the expected Bellman operator.
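As a sketch, (7) is simply a function from $\mathbb{R}^n$ to $\mathbb{R}^n$; the function name and the small example inputs below are my own choices for illustration, not anything from the lectures.

```python
import numpy as np

def expected_bellman_operator(v, R_pi, P_pi, gamma):
    """T^pi from (7): maps a value vector v in R^n to R_pi + gamma * P_pi @ v."""
    return R_pi + gamma * P_pi @ v

# Example use on the same invented two-state quantities as above.
P_pi = np.array([[0.7, 0.3],
                 [0.4, 0.6]])
R_pi = np.array([1.0, -0.5])
print(expected_bellman_operator(np.zeros(2), R_pi, P_pi, gamma=0.9))
```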
Similarly, you can rewrite the Bellman optimality equation
$$v_*(s) = \max_{a \in \mathcal{A}} \left( \mathcal{R}_s^a + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}_{ss'}^a \, v_*(s') \right) \tag{8}$$
as the Bellman optimality operator
$$T^*(v) = \max_{a \in \mathcal{A}} \left( \mathcal{R}^a + \gamma \mathcal{P}^a v \right) \tag{9}$$
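In (9), $\mathcal{R}^a$ and $\mathcal{P}^a$ are the reward vector and transition matrix for a fixed action $a$, and the max over actions is taken state by state (elementwise). A hedged NumPy sketch, with array shapes that are my own assumption for illustration:

```python
import numpy as np

def bellman_optimality_operator(v, R, P, gamma):
    """T* from (9), with assumed shapes:
    R[a, s]     -- expected reward for taking action a in state s
    P[a, s, s'] -- probability of moving from s to s' under action a
    The max over actions is taken separately for each state."""
    q = R + gamma * P @ v   # q[a, s] = R[a, s] + gamma * sum_s' P[a, s, s'] v[s']
    return q.max(axis=0)
```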
The Bellman operators are "operators" in that they are mappings from one point to another within the vector space of state values, $\mathbb{R}^n$.
Rewriting the Bellman equations as operators is useful for proving that certain dynamic programming algorithms (e.g. policy iteration, value iteration) converge to a unique fixed point. The usefulness comes from a body of existing work in operator theory, which lets us exploit special properties of the Bellman operators.
Specifically, the fact that the Bellman operators are contractions gives the useful results that, for any policy $\pi$ and any initial vector $v$,
$$\lim_{k \to \infty} (T^\pi)^k v = v_\pi \tag{10}$$
$$\lim_{k \to \infty} (T^*)^k v = v_* \tag{11}$$
where $v_\pi$ is the value of policy $\pi$ and $v_*$ is the value of an optimal policy $\pi_*$. The proof follows from the contraction mapping theorem.
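To make (10) and (11) concrete, here is a hedged sketch that iterates both operators to numerical convergence on an invented two-state, two-action MDP. Iterating $T^\pi$ in this way is iterative policy evaluation, and iterating $T^*$ is value iteration.

```python
import numpy as np

def iterate_to_fixed_point(operator, v0, tol=1e-10, max_iters=10_000):
    """Repeatedly apply an operator until the iterates stop changing."""
    v = v0
    for _ in range(max_iters):
        v_next = operator(v)
        if np.max(np.abs(v_next - v)) < tol:
            return v_next
        v = v_next
    return v

gamma = 0.9
# Invented MDP: R[a, s] and P[a, s, s'] are per-action rewards and transitions.
R = np.array([[1.0, -0.5],
              [0.5,  0.8]])
P = np.array([[[0.7, 0.3], [0.4, 0.6]],
              [[0.2, 0.8], [0.9, 0.1]]])

# Expected Bellman operator for the policy that always takes action 0.
R_pi, P_pi = R[0], P[0]
T_pi = lambda v: R_pi + gamma * P_pi @ v

# Bellman optimality operator (max over actions, state by state).
T_star = lambda v: (R + gamma * P @ v).max(axis=0)

v0 = np.zeros(2)  # any starting vector works, per (10) and (11)
v_pi = iterate_to_fixed_point(T_pi, v0)
v_star = iterate_to_fixed_point(T_star, v0)

# The T^pi fixed point matches the direct linear solve of (6).
assert np.allclose(v_pi, np.linalg.solve(np.eye(2) - gamma * P_pi, R_pi))
print(v_pi, v_star)
```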