The policy gradient theorem, also known as the likelihood ratio policy gradient, transforms the policy gradient into a sample-based estimation problem. The derivation is as follows.
We want to optimize the overall utility of using a policy $\pi_\theta$ over a state-action sequence, a trajectory, $\tau = (s_0, a_0, s_1, a_1, \ldots, s_T, a_T)$, where $s_t$ and $a_t$ are the state and action at time step $t$. We can express the utility as:

$$U(\theta) = \mathbb{E}\!\left[\sum_{t=0}^{T} R(s_t, a_t) \,\Big|\, \pi_\theta\right]$$
Given an expectation under a distribution, we can turn it into a sum over all possible events weighted by their probabilities (the definition of expectation). In our case, we can rewrite this expectation in terms of a probability function, $P(\tau; \theta)$, of a trajectory, $\tau$, under policy $\pi_\theta$ and the corresponding reward, $R(\tau) = \sum_{t=0}^{T} R(s_t, a_t)$:

$$U(\theta) = \sum_{\tau} P(\tau; \theta)\, R(\tau)$$
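As a quick numerical illustration (a minimal sketch with a made-up three-outcome distribution, not part of the derivation), an expectation can be computed either as a probability-weighted sum or approximated by averaging over samples drawn from the distribution:

```python
import numpy as np

# Hypothetical discrete "trajectory" space with three outcomes.
probs = np.array([0.2, 0.5, 0.3])     # P(tau) for each outcome
rewards = np.array([1.0, -2.0, 4.0])  # R(tau) for each outcome

# Expectation as a probability-weighted sum (the definition used above).
exact = np.sum(probs * rewards)

# The same expectation approximated by sampling outcomes from P.
rng = np.random.default_rng(0)
samples = rng.choice(len(probs), size=100_000, p=probs)
estimate = rewards[samples].mean()

print(exact, estimate)  # the sample average converges to the exact value
```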
The goal is to find the parameter $\theta$ that gives the maximum utility:

$$\max_{\theta} U(\theta) = \max_{\theta} \sum_{\tau} P(\tau; \theta)\, R(\tau)$$
Next, we use gradient-based optimization to solve this problem. We take the gradient of $U(\theta)$ with respect to $\theta$:

$$\nabla_\theta U(\theta) = \nabla_\theta \sum_{\tau} P(\tau; \theta)\, R(\tau)$$
By the linearity of the gradient, the gradient of a sum is the sum of the gradients. Therefore, we have:

$$\nabla_\theta U(\theta) = \sum_{\tau} \nabla_\theta P(\tau; \theta)\, R(\tau)$$
Here, we want a probability-weighted sum so that we can sample trajectories instead of enumerating all of them. To get there, we multiply and divide by $P(\tau; \theta)$:

$$\nabla_\theta U(\theta) = \sum_{\tau} P(\tau; \theta)\, \frac{\nabla_\theta P(\tau; \theta)}{P(\tau; \theta)}\, R(\tau)$$
Notice that we now have the derivative of a logarithm:

$$\nabla_\theta \log P(\tau; \theta) = \frac{\nabla_\theta P(\tau; \theta)}{P(\tau; \theta)}$$
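To see this identity concretely, here is a small numerical check (a sketch using a made-up one-parameter Bernoulli distribution rather than the trajectory distribution itself): the finite-difference derivative of the log-probability matches the derivative of the probability divided by the probability.

```python
import numpy as np

def p(x, theta):
    """Probability of x in {0, 1} under a Bernoulli with success prob sigmoid(theta)."""
    q = 1.0 / (1.0 + np.exp(-theta))
    return q if x == 1 else 1.0 - q

theta, x, eps = 0.7, 1, 1e-6

# Finite-difference derivatives of p and log p with respect to theta.
dp = (p(x, theta + eps) - p(x, theta - eps)) / (2 * eps)
dlogp = (np.log(p(x, theta + eps)) - np.log(p(x, theta - eps))) / (2 * eps)

# The log-derivative identity: d/dtheta log p = (d/dtheta p) / p
print(dlogp, dp / p(x, theta))  # the two values agree
```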
Our gradient becomes:

$$\nabla_\theta U(\theta) = \sum_{\tau} P(\tau; \theta)\, \nabla_\theta \log P(\tau; \theta)\, R(\tau)$$
Here, we can apply the definition of expectation again, now in reverse:

$$\nabla_\theta U(\theta) = \mathbb{E}_{\tau \sim P(\tau; \theta)}\!\left[\nabla_\theta \log P(\tau; \theta)\, R(\tau)\right]$$
which gives us the expected value of the function $\nabla_\theta \log P(\tau; \theta)\, R(\tau)$ under the distribution $P(\tau; \theta)$. This allows us to use a sample-based estimate of $\nabla_\theta U(\theta)$ instead of enumerating all possible trajectories. Using an empirical estimate of the expectation with $m$ sampled trajectories $\tau^{(1)}, \ldots, \tau^{(m)}$ (i.e., rollouts under policy $\pi_\theta$), we get:

$$\nabla_\theta U(\theta) \approx \frac{1}{m} \sum_{i=1}^{m} \nabla_\theta \log P(\tau^{(i)}; \theta)\, R(\tau^{(i)})$$
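The following is a minimal sketch of how this sample-based estimate might be computed in practice. The toy environment, the tabular softmax policy, and helper names such as `sample_trajectory` are illustrative assumptions, not anything prescribed by the derivation.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, horizon, m = 3, 2, 5, 64
theta = np.zeros((n_states, n_actions))  # tabular softmax policy parameters

def policy(state, theta):
    """Action probabilities pi_theta(a | state) from a softmax over logits."""
    logits = theta[state]
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()

def sample_trajectory(theta):
    """Roll out a toy environment; return grad of log P(tau; theta) and R(tau)."""
    grad_log_p, total_reward, state = np.zeros_like(theta), 0.0, 0
    for _ in range(horizon):
        probs = policy(state, theta)
        action = rng.choice(n_actions, p=probs)
        # Only the policy terms of log P(tau; theta) depend on theta,
        # so the (unknown) dynamics drop out of the gradient.
        grad_log_p[state] -= probs
        grad_log_p[state, action] += 1.0
        total_reward += 1.0 if action == state % n_actions else 0.0  # toy reward
        state = rng.integers(n_states)  # toy random dynamics
    return grad_log_p, total_reward

# Empirical estimate: average of grad log P(tau; theta) * R(tau) over m rollouts.
grad_estimate = np.zeros_like(theta)
for _ in range(m):
    g, R = sample_trajectory(theta)
    grad_estimate += g * R
grad_estimate /= m
print(grad_estimate)
```

The estimate can then be plugged into a gradient-ascent update on $\theta$, which is the basis of likelihood-ratio methods such as REINFORCE.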