The policy gradient theorem, also known as the likelihood ratio policy gradient, transforms the policy gradient into a sample-based estimation problem. The derivation is as follows.
We want to optimize the overall utility of using a policy $\pi_\theta$ over a state-action sequence, a trajectory, $\tau = (s_0, a_0, s_1, a_1, \ldots, s_T, a_T)$, where $s_t$ and $a_t$ are the state and action at time step $t$. We can express the utility as:

$$U(\theta) = \mathbb{E}\!\left[\sum_{t=0}^{T} R(s_t, a_t) \,\Big|\, \pi_\theta\right]$$
Given an expectation under a distribution, we can turn it into a sum over all possible events weighted by their probabilities (the definition of expectation). In our case, we can rewrite this expectation in terms of a probability function, $P(\tau; \theta)$, of a trajectory, $\tau$, under policy $\pi_\theta$ and the corresponding reward, $R(\tau) = \sum_{t=0}^{T} R(s_t, a_t)$:

$$U(\theta) = \sum_{\tau} P(\tau; \theta)\, R(\tau)$$
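As a quick numerical illustration (a minimal sketch with a made-up three-outcome distribution, not part of the derivation), an expectation can be computed either as a probability-weighted sum or approximated by averaging over samples drawn from the distribution:

```python
import numpy as np

# Hypothetical discrete "trajectory" space with three outcomes.
probs = np.array([0.2, 0.5, 0.3])     # P(tau) for each outcome
rewards = np.array([1.0, -2.0, 4.0])  # R(tau) for each outcome

# Expectation as a probability-weighted sum (the definition used above).
exact = np.sum(probs * rewards)

# The same expectation approximated by sampling outcomes from P.
rng = np.random.default_rng(0)
samples = rng.choice(len(probs), size=100_000, p=probs)
estimate = rewards[samples].mean()

print(exact, estimate)  # the sample average converges to the exact value
```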
The goal is to find the parameter $\theta$ that gives the maximum utility:

$$\max_{\theta} U(\theta) = \max_{\theta} \sum_{\tau} P(\tau; \theta)\, R(\tau)$$
Next, we use gradient-based optimization to solve this problem. We take the gradient of $U(\theta)$ with respect to $\theta$:

$$\nabla_\theta U(\theta) = \nabla_\theta \sum_{\tau} P(\tau; \theta)\, R(\tau)$$
By the linearity of the gradient, the gradient of a sum is the sum of the gradients. Therefore, we have:

$$\nabla_\theta U(\theta) = \sum_{\tau} \nabla_\theta P(\tau; \theta)\, R(\tau)$$
Here, we want a probability-weighted sum so that we can sample trajectories instead of enumerating all of them. To get there, we multiply and divide by $P(\tau; \theta)$:

$$\nabla_\theta U(\theta) = \sum_{\tau} P(\tau; \theta)\, \frac{\nabla_\theta P(\tau; \theta)}{P(\tau; \theta)}\, R(\tau)$$
Notice that we now have the derivative of a logarithm:

$$\nabla_\theta \log P(\tau; \theta) = \frac{\nabla_\theta P(\tau; \theta)}{P(\tau; \theta)}$$
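To see this identity concretely, here is a small numerical check (a sketch using a made-up one-parameter Bernoulli distribution rather than the trajectory distribution itself): the finite-difference derivative of the log-probability matches the derivative of the probability divided by the probability.

```python
import numpy as np

def p(x, theta):
    """Probability of x in {0, 1} under a Bernoulli with success prob sigmoid(theta)."""
    q = 1.0 / (1.0 + np.exp(-theta))
    return q if x == 1 else 1.0 - q

theta, x, eps = 0.7, 1, 1e-6

# Finite-difference derivatives of p and log p with respect to theta.
dp = (p(x, theta + eps) - p(x, theta - eps)) / (2 * eps)
dlogp = (np.log(p(x, theta + eps)) - np.log(p(x, theta - eps))) / (2 * eps)

# The log-derivative identity: d/dtheta log p = (d/dtheta p) / p
print(dlogp, dp / p(x, theta))  # the two values agree
```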
Our gradient becomes:

$$\nabla_\theta U(\theta) = \sum_{\tau} P(\tau; \theta)\, \nabla_\theta \log P(\tau; \theta)\, R(\tau)$$
Here, we can apply the definition of expectation again, now in reverse:

$$\nabla_\theta U(\theta) = \mathbb{E}_{\tau \sim P(\tau; \theta)}\!\left[\nabla_\theta \log P(\tau; \theta)\, R(\tau)\right]$$
which gives us the expected value of the function $\nabla_\theta \log P(\tau; \theta)\, R(\tau)$ under the distribution $P(\tau; \theta)$. This allows us to use a sample-based estimate of $\nabla_\theta U(\theta)$ instead of enumerating all possible trajectories. Using an empirical estimate of the expectation with $m$ sampled trajectories $\tau^{(1)}, \ldots, \tau^{(m)}$ (i.e., rollouts under policy $\pi_\theta$), we get:

$$\nabla_\theta U(\theta) \approx \frac{1}{m} \sum_{i=1}^{m} \nabla_\theta \log P(\tau^{(i)}; \theta)\, R(\tau^{(i)})$$
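The following is a minimal sketch of how this sample-based estimate might be computed in practice. The toy environment, the tabular softmax policy, and helper names such as `sample_trajectory` are illustrative assumptions, not anything prescribed by the derivation.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, horizon, m = 3, 2, 5, 64
theta = np.zeros((n_states, n_actions))  # tabular softmax policy parameters

def policy(state, theta):
    """Action probabilities pi_theta(a | state) from a softmax over logits."""
    logits = theta[state]
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()

def sample_trajectory(theta):
    """Roll out a toy environment; return grad of log P(tau; theta) and R(tau)."""
    grad_log_p, total_reward, state = np.zeros_like(theta), 0.0, 0
    for _ in range(horizon):
        probs = policy(state, theta)
        action = rng.choice(n_actions, p=probs)
        # Only the policy terms of log P(tau; theta) depend on theta,
        # so the (unknown) dynamics drop out of the gradient.
        grad_log_p[state] -= probs
        grad_log_p[state, action] += 1.0
        total_reward += 1.0 if action == state % n_actions else 0.0  # toy reward
        state = rng.integers(n_states)  # toy random dynamics
    return grad_log_p, total_reward

# Empirical estimate: average of grad log P(tau; theta) * R(tau) over m rollouts.
grad_estimate = np.zeros_like(theta)
for _ in range(m):
    g, R = sample_trajectory(theta)
    grad_estimate += g * R
grad_estimate /= m
print(grad_estimate)
```

The estimate can then be plugged into a gradient-ascent update on $\theta$, which is the basis of likelihood-ratio methods such as REINFORCE.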