Setting
We consider the following setting:
- Discrete actions
- Discrete states
- Bounded rewards
- Stationary policy
- Infinite horizon
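Throughout, $V^{\pi}$ denotes the usual infinite-horizon discounted value function, with a discount factor $\gamma \in [0,1)$ (this discount factor, used from Eq. (4) onward, is assumed here):
$$V^{\pi}(s) = \mathbb{E}\Big[\sum_{t=0}^{\infty} \gamma^{t} R(s_t, a_t) \;\Big|\; s_0 = s,\ a_t \sim \pi(\cdot \mid s_t),\ s_{t+1} \sim T(s_t, a_t, \cdot)\Big].$$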
An optimal policy is defined as:
$$\pi^* \in \arg\max_{\pi} V^{\pi}(s), \quad \forall s \in \mathcal{S} \tag{1}$$
and the optimal value function is:
$$V^*(s) = \max_{\pi} V^{\pi}(s), \quad \forall s \in \mathcal{S} \tag{2}$$
There can be several policies that achieve the maximum, but there is only one optimal value function:
$$V^* = V^{\pi^*} \tag{3}$$
The question
How can we prove that there exists at least one $\pi^*$ satisfying Eq. (1) simultaneously for all $s \in \mathcal{S}$?
Outline of proof
1. Construct the Bellman optimality equation, to be used as a temporary surrogate definition of the optimal value function; step 2 proves that it is equivalent to the definition via Eq. (2):
$$V^*(s) = \max_{a \in \mathcal{A}} \Big[ R(s,a) + \gamma \sum_{s' \in \mathcal{S}} T(s,a,s') V^*(s') \Big] \tag{4}$$
2. Derive the equivalence between defining the optimal value function via Eq. (4) and via Eq. (2). (In fact we only need the necessity direction, because sufficiency is immediate from the way Eq. (4) is constructed from Eq. (2) in step 1.)
3. Prove that Eq. (4) has a unique solution.
4. By step 2, the solution obtained in step 3 is also a solution to Eq. (2), so it is the optimal value function.
5. From the optimal value function, recover an optimal policy by choosing, at each state, an action that attains the maximum in Eq. (4); a runnable sketch of steps 3 and 5 follows this list.
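To make the outline concrete, here is a minimal sketch in Python (the two-state MDP, the arrays `R` and `T`, and the helper `bellman_opt` are hypothetical, introduced only for illustration): value iteration repeatedly applies the right-hand side of Eq. (4) as an update, which converges to the unique fixed point of step 3, and a maximizing action at each state then gives a policy as in step 5.

```python
import numpy as np

# Hypothetical two-state, two-action MDP, used only for illustration.
# T[s, a, s2] is the transition probability, R[s, a] the expected immediate reward.
T = np.array([[[0.9, 0.1],
               [0.2, 0.8]],
              [[0.5, 0.5],
               [0.1, 0.9]]])
R = np.array([[1.0, 0.0],
              [0.5, 2.0]])
gamma = 0.9

def bellman_opt(V):
    """Right-hand side of Eq. (4): max_a [R(s, a) + gamma * sum_s' T(s, a, s') V(s')]."""
    return (R + gamma * T @ V).max(axis=1)

# Step 3: iterate until we (approximately) reach the unique fixed point V*.
V = np.zeros(2)
for _ in range(10_000):
    V_new = bellman_opt(V)
    if np.max(np.abs(V_new - V)) < 1e-12:
        break
    V = V_new

# Step 5: recover an optimal policy by picking a maximizing action in Eq. (4).
policy = (R + gamma * T @ V).argmax(axis=1)
print("V* ≈", V, "  greedy policy:", policy)
```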
Details of the steps
1
Since $V^*(s) = V^{\pi^*}(s) = \mathbb{E}_{a \sim \pi^*(\cdot \mid s)}\big[ Q^{\pi^*}(s,a) \big]$, we have $V^{\pi^*}(s) \le \max_{a \in \mathcal{A}} Q^{\pi^*}(s,a)$. If there were some $\tilde{s}$ with $V^{\pi^*}(\tilde{s}) < \max_{a \in \mathcal{A}} Q^{\pi^*}(\tilde{s},a)$, we could obtain a strictly better policy by maximizing $Q^*(s,a) = Q^{\pi^*}(s,a)$ over $a$ at every state, contradicting the optimality of $\pi^*$. Hence $V^*(s) = \max_{a \in \mathcal{A}} Q^*(s,a)$, and expanding $Q^*(s,a) = R(s,a) + \gamma \sum_{s' \in \mathcal{S}} T(s,a,s') V^*(s')$ yields Eq. (4).
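For completeness, here is the improvement step written out (the greedy policy $\pi'$ below is introduced only for this sketch): define $\pi'(s) \in \arg\max_{a \in \mathcal{A}} Q^{\pi^*}(s,a)$ for every $s$. Then
$$V^{\pi^*}(s) = \mathbb{E}_{a \sim \pi^*(\cdot \mid s)}\big[ Q^{\pi^*}(s,a) \big] \le \max_{a \in \mathcal{A}} Q^{\pi^*}(s,a) = Q^{\pi^*}\big(s, \pi'(s)\big), \quad \forall s \in \mathcal{S},$$
and the standard policy improvement argument gives $V^{\pi'} \ge V^{\pi^*}$, with strict inequality at any $\tilde{s}$ where the inequality above is strict, which would contradict the optimality of $\pi^*$.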
2
(=>)
Follows from step 1: the argument there shows that $V^*$, as defined by Eq. (2), satisfies Eq. (4).
(<=)
That is, if $\tilde{V}$ satisfies $\tilde{V}(s) = \max_{a \in \mathcal{A}} \big[ R(s,a) + \gamma \sum_{s' \in \mathcal{S}} T(s,a,s') \tilde{V}(s') \big]$ for all $s$, then $\tilde{V}(s) = V^*(s) = \max_{\pi} V^{\pi}(s)$ for all $s \in \mathcal{S}$.
Define the optimal Bellman operator $\mathcal{T}$ (written calligraphically to distinguish it from the transition kernel $T(s,a,s')$) as
$$(\mathcal{T}V)(s) = \max_{a \in \mathcal{A}} \Big[ R(s,a) + \gamma \sum_{s' \in \mathcal{S}} T(s,a,s') V(s') \Big] \tag{5}$$
So our goal is to prove that if $\tilde{V} = \mathcal{T}\tilde{V}$, then $\tilde{V} = V^*$. We show this by combining two results, following Puterman [1]:
a) If $\tilde{V} \ge \mathcal{T}\tilde{V}$, then $\tilde{V} \ge V^*$.
b) If $\tilde{V} \le \mathcal{T}\tilde{V}$, then $\tilde{V} \le V^*$.
Proof:
a)
For any policy $\pi = (d_1, d_2, \dots)$,
$$\tilde{V} \ge \mathcal{T}\tilde{V} = \max_{d} \big[ R_{d} + \gamma P_{d} \tilde{V} \big] \ge R_{d_1} + \gamma P_{d_1} \tilde{V}.$$
Here $d$ is a decision rule (the action choice at a specific time step), $R_{d}$ is the vector of immediate rewards induced by $d$, and $P_{d}$ is the transition matrix induced by $d$.
By induction, for any $n$,
$$\tilde{V} \ge R_{d_1} + \sum_{i=1}^{n-1} \gamma^{i} P^{i}_{\pi} R_{d_{i+1}} + \gamma^{n} P^{n}_{\pi} \tilde{V},$$
where $P^{j}_{\pi}$ denotes the $j$-step transition matrix under $\pi$.
Since
$$V^{\pi} = R_{d_1} + \sum_{i=1}^{\infty} \gamma^{i} P^{i}_{\pi} R_{d_{i+1}},$$
we have
$$\tilde{V} - V^{\pi} \ge \gamma^{n} P^{n}_{\pi} \tilde{V} - \sum_{i=n}^{\infty} \gamma^{i} P^{i}_{\pi} R_{d_{i+1}} \to 0 \quad \text{as } n \to \infty,$$
where both terms on the right-hand side vanish because $\gamma < 1$, the rewards are bounded, and $\tilde{V}$ is bounded.
So $\tilde{V} \ge V^{\pi}$. Since this holds for any $\pi$, we conclude that
$$\tilde{V} \ge \max_{\pi} V^{\pi} = V^*.$$
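As a numerical sanity check of a) (the random MDP and the constant-shift construction below are assumptions made only for this sketch): shifting an arbitrary $V_0$ up by a constant $c \ge \max_s \big(\mathcal{T}V_0 - V_0\big)(s)/(1-\gamma)$ yields a $\tilde{V}$ with $\tilde{V} \ge \mathcal{T}\tilde{V}$, because $\mathcal{T}(V_0 + c\mathbf{1}) = \mathcal{T}V_0 + \gamma c\mathbf{1}$; such a $\tilde{V}$ should then dominate $V^*$ componentwise.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, gamma = 5, 3, 0.9

# Hypothetical random MDP, used only for this sanity check.
T = rng.random((n_states, n_actions, n_states))
T /= T.sum(axis=2, keepdims=True)            # each T[s, a, :] is a distribution
R = rng.random((n_states, n_actions))

def bellman_opt(V):
    """Optimal Bellman operator of Eq. (5)."""
    return (R + gamma * T @ V).max(axis=1)

# Build V_tilde >= bellman_opt(V_tilde) by shifting an arbitrary V0 up by c.
V0 = rng.random(n_states)
c = max(0.0, (bellman_opt(V0) - V0).max() / (1.0 - gamma))
V_tilde = V0 + c

# Compute V* by value iteration (cf. step 3).
V_star = np.zeros(n_states)
for _ in range(2000):
    V_star = bellman_opt(V_star)

print("V_tilde >= bellman_opt(V_tilde):", bool(np.all(V_tilde >= bellman_opt(V_tilde) - 1e-12)))
print("V_tilde >= V* (claim a)        :", bool(np.all(V_tilde >= V_star - 1e-9)))
```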
b)
Take a decision rule $d$ that attains the maximum in $\mathcal{T}\tilde{V}$, so that $\tilde{V} \le \mathcal{T}\tilde{V} = R_{d} + \gamma P_{d} \tilde{V}$. Iterating this inequality as in a) gives $\tilde{V} \le V^{\pi}$ for the stationary policy $\pi = (d, d, \dots)$, and hence $\tilde{V} \le V^*$.
3
The optimal Bellman operator is a $\gamma$-contraction in the $L^{\infty}$ norm, cf. [2].
Proof:
For any $s$,
$$\begin{aligned}
\big| (\mathcal{T}V_1)(s) - (\mathcal{T}V_2)(s) \big|
&= \bigg| \max_{a \in \mathcal{A}} \Big[ R(s,a) + \gamma \sum_{s' \in \mathcal{S}} T(s,a,s') V_1(s') \Big]
   - \max_{a' \in \mathcal{A}} \Big[ R(s,a') + \gamma \sum_{s' \in \mathcal{S}} T(s,a',s') V_2(s') \Big] \bigg| \\
&\overset{(*)}{\le} \bigg| \max_{a \in \mathcal{A}} \Big[ \gamma \sum_{s' \in \mathcal{S}} T(s,a,s') \big( V_1(s') - V_2(s') \big) \Big] \bigg| \\
&\le \gamma \, \lVert V_1 - V_2 \rVert_{\infty},
\end{aligned}$$
where in $(*)$ we used the fact that $\max_{a} f(a) - \max_{a'} g(a') \le \max_{a} \big[ f(a) - g(a) \big]$ (assuming without loss of generality that $(\mathcal{T}V_1)(s) \ge (\mathcal{T}V_2)(s)$; otherwise swap $V_1$ and $V_2$).
Thus, by the Banach fixed-point theorem, $\mathcal{T}$ has a unique fixed point (the space of bounded functions on $\mathcal{S}$ equipped with the sup norm is complete).
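The contraction bound is easy to check numerically as well; the sketch below (the random MDP and all names are hypothetical) verifies $\lVert \mathcal{T}V_1 - \mathcal{T}V_2 \rVert_{\infty} \le \gamma \lVert V_1 - V_2 \rVert_{\infty}$ on random pairs and shows the geometric convergence of successive approximations guaranteed by the Banach fixed-point theorem.

```python
import numpy as np

rng = np.random.default_rng(1)
n_states, n_actions, gamma = 4, 2, 0.8

# Hypothetical random MDP, used only to test the bound numerically.
T = rng.random((n_states, n_actions, n_states))
T /= T.sum(axis=2, keepdims=True)            # each T[s, a, :] is a distribution
R = rng.random((n_states, n_actions))

def bellman_opt(V):
    """Optimal Bellman operator of Eq. (5)."""
    return (R + gamma * T @ V).max(axis=1)

# Contraction: the sup-norm distance shrinks by at least a factor gamma.
for _ in range(5):
    V1, V2 = rng.normal(size=n_states), rng.normal(size=n_states)
    lhs = np.max(np.abs(bellman_opt(V1) - bellman_opt(V2)))
    rhs = gamma * np.max(np.abs(V1 - V2))
    assert lhs <= rhs + 1e-12, (lhs, rhs)

# Banach fixed point in action: the residual decays geometrically.
V = np.zeros(n_states)
for n in range(1, 8):
    V = bellman_opt(V)
    residual = np.max(np.abs(bellman_opt(V) - V))
    print(f"n = {n}:  sup-norm residual = {residual:.2e}")
```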
References
[1] Puterman, Martin L. “Markov Decision Processes: Discrete Stochastic Dynamic Programming.” (2016).
[2] A. Lazaric. http://researchers.lille.inria.fr/~lazaric/Webpage/MVA-RL_Course14_files/slides-lecture-02-handout.pdf