#Policy gradient is... 0?
I don't understand how you moved da_t under the gradient in step 4?
Since only the gradient term depends on a_t, I split the expression into two integrals and then noted that, for a well-behaved policy, the gradient can be swapped with the integral.
(see https://en.wikipedia.org/wiki/Leibniz_integral_rule for swapping derivative and integral)
Yes, I understand this 😅
Now I'm trying to think about p(g_t, a_t, s_t). Why does it not depend on a_t after the transformation to p(g_t, s_t)\pi(a_t | s_t), and how do you get that? 🤔
I use the product rule of probability (the identity behind Bayes' theorem) to write p(g_t, a_t, s_t) = p(g_t, s_t) p(a_t|g_t, s_t), and the agent chooses action a_t based entirely on s_t, so we can write p(a_t|g_t, s_t) = p(a_t|s_t), which is just \pi(a_t|s_t).
Actually, the result is correct if you assume that g_t does not depend on a_t, and a zero gradient makes intuitive sense in that case. In general, however, g_t has to stay inside that inner integral; you then push the gradient out and get the gradient of the expectation of g_t, as expected.
This lecture may help if you want to understand the details better: https://www.youtube.com/watch?v=y3oqOjHilio&list=PLqYmG7hTraZDVH599EItlEWsUOsJbAodm
Research Scientist Hado van Hasselt covers policy algorithms that can learn policies directly and actor critic algorithms that combine value predictions for more efficient learning.
Slides: https://dpmd.ai/policygradient
Full video lecture series: https://dpmd.ai/DeepMindxUCL21
I agree that the random variable G_t depends on A_t, but when we take an expectation and integrate over g_t and a_t, aren't these simply scalars/vectors we are integrating over, with no dependence between them, whereas p(g_t, a_t, s_t) is what actually describes the dependence between G_t and A_t?
Just as, even if X and Y depend on each other, we still have the expectation E[XY] = \int dx dy p(x, y) xy = \int dx dy p(x) p(y|x) xy, with p(y|x) describing the dependence between X and Y, but x and y themselves are just scalars and don't depend on each other?
Yes that's true. If you're conditioning on the probability of g_t already, then g_t itself is just a value, so my bad there. The point is essentially the same, though: g_t depends on a_t, so you cannot separate the probabilities like that, as @bitter kettle implied already. You would need to write p(g_t|a_t, s_t), and then you can't do the inner integral anymore.
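To make the point above concrete, here is a sketch of the factorization that keeps the dependence of g_t on a_t. It assumes the quantity being computed is E[G_t \nabla_\theta \log \pi_\theta(A_t|S_t)], which is my reading of the thread rather than something stated in it, and q(s_t, a_t) is just a name I introduce for the inner integral over g_t.

```latex
\begin{aligned}
\mathbb{E}\big[G_t \nabla_\theta \log \pi_\theta(A_t \mid S_t)\big]
&= \int ds_t\, da_t\, dg_t\;
   p(s_t)\, \pi_\theta(a_t \mid s_t)\, p(g_t \mid a_t, s_t)\;
   g_t\, \nabla_\theta \log \pi_\theta(a_t \mid s_t) \\
&= \int ds_t\, p(s_t) \int da_t\, \nabla_\theta \pi_\theta(a_t \mid s_t)\, q(s_t, a_t),
\qquad q(s_t, a_t) := \int dg_t\, p(g_t \mid a_t, s_t)\, g_t . \\[6pt]
\text{If } q(s_t, a_t) = b(s_t):\quad
& \int da_t\, \nabla_\theta \pi_\theta(a_t \mid s_t)\, b(s_t)
 = b(s_t)\, \nabla_\theta \int da_t\, \pi_\theta(a_t \mid s_t)
 = b(s_t)\, \nabla_\theta 1 = 0 .
\end{aligned}
```

The second line uses \pi_\theta \nabla_\theta \log \pi_\theta = \nabla_\theta \pi_\theta. The integral over a_t only collapses to zero in the last line, where the expected return does not depend on the action.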
Hmm ok so when factorizing (as in attached) as p(g_t, a_t, s_t) = p(g_t, s_t) p(a_t|g_t, s_t) we actually have that generally p(a_t|g_t, s_t) \approx p(a_t|s_t) rather than equality, and this subtle approximation is what determines whether it is 0 or not? But in the case of a deterministic policy where there is no uncertainty on a_t given s_t, then we do indeed have equality p(a_t|g_t, s_t) = p(a_t|s_t), then giving 0, because the conditioning on g_t is redundant as all information about a_t is stored in s_t.
So this implies: if one uses a deterministic policy (s_t \mapsto a_t without any stochastic sampling, so g_t holds no relevant information), then the policy gradient is 0? This seems weird.
I don't see how you can state it to be approximately equal in general. Knowing g_t might give you a lot of information about a_t and vice versa, which completely warps the probabilities.
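A small numerical sketch of how observing g_t changes the distribution over a_t. Everything here is made up for illustration: one fixed state, two actions, a uniform policy, and a hypothetical helper `posterior_over_actions` that applies Bayes' rule.

```python
def posterior_over_actions(policy, p_return_given_action):
    """Return p(a | g, s) for one fixed state s and one observed return g.

    policy[a]                is pi(a | s)
    p_return_given_action[a] is p(g | a, s) for the observed g
    """
    joint = [pi_a * p_g for pi_a, p_g in zip(policy, p_return_given_action)]
    evidence = sum(joint)  # p(g | s)
    return [j / evidence for j in joint]


# Assumed toy numbers: a uniform policy over two actions, where action 0
# yields a high return with probability 0.9 and action 1 with probability 0.2.
policy = [0.5, 0.5]
p_high_return = [0.9, 0.2]

posterior = posterior_over_actions(policy, p_high_return)
print(posterior)  # action 0 becomes much more likely than 0.5 once a high return is observed
```

In this toy case p(a_t|g_t, s_t) moves far away from \pi(a_t|s_t), so treating the two as equal or approximately equal drops exactly the dependence that matters.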
If it's a deterministic policy, you could still have a stochastic g_t, but if you assume a bijection between a_t and g_t, you still can't say that g_t is fully described by s_t, because a_t is not actually solely described by s_t. \pi_\theta is actually p(a_t|s_t, \theta). If \theta were constant, then yes, your conclusion would hold, but \theta being constant already implies that the gradient is 0.
So you can't just get rid of g_t's dependency on a_t, which means you can't do the inner integral without it.
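For anyone who wants to check this numerically, here is a minimal sketch on a one-state, two-action problem with a softmax policy. It computes E[G \nabla_\theta \log \pi_\theta(A)] exactly by summing over actions. The setup and the names `softmax` and `expected_score_gradient` are my own assumptions for illustration, not something from the derivation in question.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def expected_score_gradient(logits, returns):
    """Exact E[G * grad_theta log pi_theta(A)] for a one-state softmax policy.

    returns[a] is the (assumed deterministic) return after taking action a.
    For a softmax policy, d/d theta_k log pi(a) = 1[a == k] - pi(k).
    """
    pi = softmax(logits)
    n = len(logits)
    grad = [0.0] * n
    for a in range(n):
        for k in range(n):
            dlogpi = (1.0 if a == k else 0.0) - pi[k]
            grad[k] += pi[a] * returns[a] * dlogpi
    return grad


# Return depends on the action: the gradient is nonzero.
grad_dependent = expected_score_gradient([0.0, 0.0], [1.0, 0.0])

# Return is the same for every action: the gradient vanishes.
grad_independent = expected_score_gradient([0.0, 0.0], [1.0, 1.0])

print(grad_dependent)
print(grad_independent)
```

The zero only shows up in the second case, where the return carries no information about which action was taken.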