Policy gradient is... 0? | Learn AI Together | Page 1

wooden saffron Dec 17, 2023, 8:33 PM

#

What is incorrect about the following proof that the expression for the gradient of the value function, given by the Policy Gradient Theorem, is 0? This is obviously false, but I am unsure where the mistake is.
(p(g_t, a_t, s_t) represents the joint density for (G_t, A_t, S_t))

bitter kettle Dec 17, 2023, 10:37 PM

#

I don't understand how you put da_t under the gradient in step 4?

wooden saffron Dec 17, 2023, 10:58 PM

#

Since only the gradient depends on a_t, I split the expression into two integrals and then noted that, for a well-behaved policy, the gradient can be swapped with the integral.

wooden saffron Dec 18, 2023, 8:33 AM

#

(see https://en.wikipedia.org/wiki/Leibniz_integral_rule for swapping derivative and integral)
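The swap described above is what makes the inner integral vanish: once the gradient is outside, it acts on \int da_t \pi(a_t|s_t) = 1. A minimal numerical sketch of that step, assuming a softmax policy over three discrete actions (so the integral becomes a sum); the parameter values and function names are illustrative, not taken from the thread:

```python
import numpy as np

def softmax_policy(theta):
    """pi(a) for a softmax policy over discrete actions (state dropped for brevity)."""
    z = np.exp(theta - theta.max())
    return z / z.sum()

def policy_jacobian(theta):
    """jac[a, k] = d pi(a) / d theta_k for the softmax policy."""
    pi = softmax_policy(theta)
    return np.diag(pi) - np.outer(pi, pi)

theta = np.array([0.3, -1.2, 0.8])  # hypothetical parameters
jac = policy_jacobian(theta)

# Summing d pi(a)/d theta over actions is the gradient of sum_a pi(a) = 1,
# so every component should be zero up to floating-point error.
grad_of_normalizer = jac.sum(axis=0)
print(grad_of_normalizer)
```

This only confirms that the inner integral on its own is zero; whether the return can legitimately be left outside it is a separate question.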

bitter kettle Dec 18, 2023, 12:47 PM

#

Yes, I understand this 😅
Now I'm trying to think about p(g_t, a_t, s_t). Why doesn't it depend on a_t after it is transformed into p(g_t, s_t)\pi(a_t | s_t), and how do you get that? 🤔

wooden saffron Dec 18, 2023, 2:28 PM

#

I use Bayes' theorem (the product rule) to write p(g_t, a_t, s_t) = p(g_t, s_t) p(a_t|g_t, s_t), and the agent chooses action a_t based entirely on s_t, so we can write p(a_t|g_t, s_t) = p(a_t|s_t), which is just \pi(a_t|s_t).
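Whether p(a_t|g_t, s_t) equals \pi(a_t|s_t) can be checked on a toy case. The sketch below assumes a single state, two actions, and a binary return; all probabilities are made-up numbers for illustration:

```python
import numpy as np

pi = np.array([0.5, 0.5])            # pi(a | s) for one fixed state s
p_g1_given_a = np.array([0.9, 0.2])  # P(G = 1 | a, s), hypothetical values

# Joint p(g, a | s), laid out as joint[g, a].
joint = np.stack([(1 - p_g1_given_a) * pi, p_g1_given_a * pi])

# Distribution over actions after observing the return: p(a | g, s).
p_a_given_g = joint / joint.sum(axis=1, keepdims=True)

print("pi(a | s)        =", pi)
print("p(a | g = 1, s)  =", p_a_given_g[1])
print("p(a | g = 0, s)  =", p_a_given_g[0])
```

In this example, observing a high return makes action 0 much more likely than the policy alone would suggest. The agent picks a_t from s_t only, but conditioning on the later return g_t still carries information about which action was taken.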

wooden saffron Dec 18, 2023, 7:21 PM

#

Policy gradient is... 0?

gray sorrel Dec 19, 2023, 1:41 AM

#

wooden saffron What is incorrect about the following proof that the expression for the gradient...

Actually, this is correct if you assume that g_t does not depend on a_t, and in that case a zero gradient makes intuitive sense. In general, however, g_t has to stay inside that inner integral. When you then push the gradient out, you get the gradient of the expectation of g_t, as expected.
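The difference between keeping the return inside the inner sum and pulling it out can be sketched numerically. This assumes a one-state, two-action problem with a softmax policy; `q` (the expected return per action) and the parameter values are hypothetical:

```python
import numpy as np

def softmax(theta):
    z = np.exp(theta - theta.max())
    return z / z.sum()

theta = np.array([0.0, 0.0])  # hypothetical policy parameters
q = np.array([0.9, 0.2])      # E[G | a] for each action, hypothetical

pi = softmax(theta)
jac = np.diag(pi) - np.outer(pi, pi)  # jac[a, k] = d pi(a) / d theta_k

# Return kept inside the sum over actions: sum_a grad pi(a) * E[G | a].
grad_correct = q @ jac

# Return treated as independent of the action: E[G] * sum_a grad pi(a).
grad_flawed = (pi @ q) * jac.sum(axis=0)

# Central finite differences on the expected return, as an independent check.
def expected_return(th):
    return softmax(th) @ q

eps = 1e-6
grad_fd = np.array([
    (expected_return(theta + eps * e) - expected_return(theta - eps * e)) / (2 * eps)
    for e in np.eye(len(theta))
])

print(grad_correct, grad_flawed, grad_fd)
```

With these numbers the version that keeps the return inside the sum matches the finite-difference gradient, while the version that pulls it out is identically zero.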

#

This lecture may help if you want to understand the details better: https://www.youtube.com/watch?v=y3oqOjHilio&list=PLqYmG7hTraZDVH599EItlEWsUOsJbAodm

YouTube

Google DeepMind

DeepMind x UCL RL Lecture Series - Policy-Gradient and Actor-Critic...

Research Scientist Hado van Hasselt covers policy algorithms that can learn policies directly and actor critic algorithms that combine value predictions for more efficient learning.

Slides: https://dpmd.ai/policygradient
Full video lecture series: https://dpmd.ai/DeepMindxUCL21


wooden saffron Dec 19, 2023, 8:28 AM

#

I agree that the random variable G_t depends on A_t, but when we take an expectation and integrate over g_t and a_t, aren't these simply scalars/vectors we are integrating over, with no dependence between them, whereas p(g_t, a_t, s_t) is what actually describes the dependence between G_t and A_t?

Just as, even if X and Y depend on each other, we still have the expectation E[XY] = \int dx dy p(x, y) xy = \int dx dy p(x) p(y|x) xy, with p(y|x) describing the dependence between X and Y, while x and y themselves are just scalars and don't depend on each other?
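The E[XY] identity can be checked on a small discrete example, with sums in place of integrals. The joint table below is a made-up distribution for illustration:

```python
import numpy as np

# Hypothetical joint p(x, y): rows index x in {0, 1, 2}, columns index y in {1, 2}.
x_vals = np.array([0.0, 1.0, 2.0])
y_vals = np.array([1.0, 2.0])
joint = np.array([[0.10, 0.20],
                  [0.25, 0.05],
                  [0.10, 0.30]])

xy = np.outer(x_vals, y_vals)  # xy[i, j] = x_i * y_j, plain numbers

# E[XY] from the joint directly.
e_joint = (joint * xy).sum()

# E[XY] from the factorization p(x) p(y | x).
p_x = joint.sum(axis=1)
p_y_given_x = joint / p_x[:, None]
e_factored = (p_x[:, None] * p_y_given_x * xy).sum()

# Replacing p(y | x) with the marginal p(y) silently assumes independence.
p_y = joint.sum(axis=0)
e_independent = (np.outer(p_x, p_y) * xy).sum()

print(e_joint, e_factored, e_independent)
```

The first two values agree, which supports the point that x and y are just integration variables and the density carries the dependence. The third differs, which shows that dropping a conditioning variable from the density changes the result.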

gray sorrel Dec 19, 2023, 9:17 AM

#

Yes that's true. If you're conditioning on the probability of g_t already, then g_t itself is just a value, so my bad there. The point is essentially the same, though: g_t depends on a_t, so you cannot separate the probabilities like that, as @bitter kettle implied already. You would need to write p(g_t|a_t, s_t), and then you can't do the inner integral anymore.
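A sketch of the factorization with p(g_t|a_t, s_t), assuming the expression under discussion has the usual form E[G_t \nabla_\theta \log \pi_\theta(A_t|S_t)] (the attached proof is not reproduced in this log). The symbol q_\pi(s_t, a_t), meaning the expected return given the state and action, is shorthand introduced here:

```latex
\begin{align*}
\mathbb{E}\left[ G_t \, \nabla_\theta \log \pi_\theta(A_t \mid S_t) \right]
&= \int ds_t \, da_t \, dg_t \; p(s_t) \, \pi_\theta(a_t \mid s_t) \, p(g_t \mid a_t, s_t) \, g_t \, \nabla_\theta \log \pi_\theta(a_t \mid s_t) \\
&= \int ds_t \, p(s_t) \int da_t \, \nabla_\theta \pi_\theta(a_t \mid s_t) \int dg_t \, p(g_t \mid a_t, s_t) \, g_t \\
&= \int ds_t \, p(s_t) \int da_t \, \nabla_\theta \pi_\theta(a_t \mid s_t) \, q_\pi(s_t, a_t)
\end{align*}
```

Because q_\pi(s_t, a_t) depends on a_t, it cannot be moved outside the integral over a_t, so that integral no longer reduces to \nabla_\theta \int da_t \pi_\theta(a_t|s_t) = 0.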

wooden saffron Dec 19, 2023, 5:07 PM

#

Hmm ok, so when factorizing (as in the attachment) as p(g_t, a_t, s_t) = p(g_t, s_t) p(a_t|g_t, s_t), we generally have p(a_t|g_t, s_t) \approx p(a_t|s_t) rather than equality, and this subtle approximation is what determines whether the result is 0 or not? But for a deterministic policy, where there is no uncertainty about a_t given s_t, we do have equality p(a_t|g_t, s_t) = p(a_t|s_t), which gives 0, because conditioning on g_t is redundant: all information about a_t is stored in s_t.

So this implies: if one uses a deterministic policy (s_t \mapsto a_t without any stochastic sampling, so g_t holds no relevant information), then the policy gradient is 0? This seems weird.

gray sorrel Dec 19, 2023, 11:25 PM

#

I don't see how you can state it to be approximately equal in general. Knowing g_t might give you a lot of information about a_t and vice versa, which completely warps the probabilities.
If it's a deterministic policy, you could still have a stochastic g_t, but even if you assume a bijection between a_t and g_t, you still can't say that g_t is fully described by s_t, because a_t is not actually described by s_t alone. \pi_\theta is actually p(a_t|s_t, \theta). If \theta were constant, then yes, your conclusion would hold, but \theta being constant already implies that the gradient is 0.
So you can't just get rid of g_t's dependency on a_t, which means you can't do the inner integral without it.
