#Policy gradient is... 0?

wooden saffron
#

What is incorrect about the following proof that the expression for the gradient of the value function, given by the Policy Gradient Theorem, is 0? This is obviously false, but I am unsure where the mistake is.
(p(g_t, a_t, s_t) represents the joint density for (G_t, A_t, S_t))

bitter kettle
#

I don't understand how you moved da_t under the gradient in step 4.

wooden saffron
#

Since only the gradient term depends on a_t, I split the expression into two integrals and then used the fact that, for a well-behaved policy, the gradient can be swapped with the integral.

bitter kettle
#

Yes, I understand this 😅
Now I'm trying to think about p(g_t, a_t, s_t). Why doesn't it depend on a_t after being rewritten as p(g_t, s_t)\pi(a_t | s_t), and how do you get that form? 🤔

wooden saffron
#

I use the product rule of probability to write p(g_t, a_t, s_t) = p(g_t, s_t) p(a_t|g_t, s_t), and the agent chooses action a_t based entirely on s_t, so we can write p(a_t|g_t, s_t) = p(a_t|s_t), which is just \pi(a_t|s_t).

gray sorrel
#

This lecture may help if you want to understand the details better: https://www.youtube.com/watch?v=y3oqOjHilio&list=PLqYmG7hTraZDVH599EItlEWsUOsJbAodm

Research Scientist Hado van Hasselt covers policy algorithms that can learn policies directly and actor critic algorithms that combine value predictions for more efficient learning.

Slides: https://dpmd.ai/policygradient
Full video lecture series: https://dpmd.ai/DeepMindxUCL21

wooden saffron
#

I agree that the random variable G_t depends on A_t, but when we take an expectation and integrate over g_t and a_t, aren't these simply scalars/vectors we are integrating over, with no dependence between them, whereas p(g_t, a_t, s_t) is what actually describes the dependence between G_t and A_t?

Just as, even if X and Y depend on each other, we still have the expectation E[XY] = \int dx dy p(x, y) xy = \int dx dy p(x) p(y|x) xy, with p(y|x) describing the dependence between X and Y, while x and y themselves are just scalars and don't depend on each other?

gray sorrel
#

Yes, that's true. If you're already weighting by the probability of g_t, then g_t itself is just a value, so my bad there. The point is essentially the same, though: the distribution of G_t depends on a_t, so you cannot separate the probabilities like that, as @bitter kettle implied already. You would need to write p(g_t|a_t, s_t), and then you can't do the inner integral over a_t on its own anymore.
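
To make that concrete, here is a rough sketch of the factorization I mean. The symbols q(s_t, a_t) and b(s_t) are my own shorthand and not from the attached proof: q is the expected return given the state and action, and b is any function of the state only.

```latex
\begin{align*}
p(g_t, a_t, s_t)
  &= p(s_t)\,\pi_\theta(a_t \mid s_t)\,p(g_t \mid a_t, s_t) \\
\mathbb{E}\big[G_t \nabla_\theta \log \pi_\theta(A_t \mid S_t)\big]
  &= \int ds_t\, p(s_t) \int da_t\, \pi_\theta(a_t \mid s_t)\,
     \nabla_\theta \log \pi_\theta(a_t \mid s_t)
     \int dg_t\, g_t\, p(g_t \mid a_t, s_t) \\
  &= \int ds_t\, p(s_t) \int da_t\, \nabla_\theta \pi_\theta(a_t \mid s_t)\, q(s_t, a_t),
     \qquad q(s_t, a_t) := \mathbb{E}[G_t \mid S_t = s_t, A_t = a_t] \\
\int da_t\, \nabla_\theta \pi_\theta(a_t \mid s_t)\, b(s_t)
  &= b(s_t)\, \nabla_\theta \int da_t\, \pi_\theta(a_t \mid s_t)
   = b(s_t)\, \nabla_\theta 1 = 0
\end{align*}
```

Because q(s_t, a_t) depends on a_t, it stays inside the integral over a_t, so the "gradient of 1" step does not apply to it. Only a term like b(s_t), which has no dependence on a_t, integrates to zero.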

wooden saffron
#

Hmm ok so when factorizing (as in attached) as p(g_t, a_t, s_t) = p(g_t, s_t) p(a_t|g_t, s_t) we actually have that generally p(a_t|g_t, s_t) \approx p(a_t|s_t) rather than equality, and this subtle approximation is what determines whether it is 0 or not? But in the case of a deterministic policy where there is no uncertainty on a_t given s_t, then we do indeed have equality p(a_t|g_t, s_t) = p(a_t|s_t), then giving 0, because the conditioning on g_t is redundant as all information about a_t is stored in s_t.

So this implies: if one uses a deterministic policy (s_t \mapsto a_t without any stochastic sampling, so g_t holds no relevant information), then the policy gradient is 0? This seems weird.

gray sorrel
#

I don't see how you can state it to be approximately equal in general. Knowing g_t might give you a lot of information about a_t and vice versa, which completely changes the probabilities.
If it's a deterministic policy, you could still have a stochastic g_t. But even if you assume a bijection between a_t and g_t, you still can't say that g_t is fully described by s_t, because a_t is not described by s_t alone: \pi_\theta is actually p(a_t|s_t, \theta). If \theta were constant, then yes, your conclusion would hold, but \theta being constant already implies that the gradient is 0.
So you can't just get rid of g_t's dependence on a_t, which means you can't do the inner integral without it.
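
If it helps, here is a minimal numerical sketch of the same point on a toy problem I made up for illustration: a single state, two actions, a sigmoid policy with one parameter theta, and binary returns with hypothetical probabilities 0.2 and 0.8. None of these numbers come from the attached proof. It compares the posterior p(a|g) with \pi(a), and the correct expectation with the one you get after substituting \pi(a) for p(a|g).

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


# Hypothetical toy setup: one state, actions {0, 1}, returns {0, 1}.
theta = 0.0
pi = {0: 1.0 - sigmoid(theta), 1: sigmoid(theta)}  # pi_theta(a)
p_g1_given_a = {0: 0.2, 1: 0.8}                     # assumed p(g=1 | a)


def p_g_given_a(g, a):
    return p_g1_given_a[a] if g == 1 else 1.0 - p_g1_given_a[a]


def score(a):
    # d/dtheta log pi_theta(a) for the sigmoid policy above.
    return a - sigmoid(theta)


pairs = [(g, a) for g in (0, 1) for a in (0, 1)]

# Correct joint: p(g, a) = pi(a) * p(g | a).
joint = {(g, a): pi[a] * p_g_given_a(g, a) for g, a in pairs}
p_g = {g: joint[(g, 0)] + joint[(g, 1)] for g in (0, 1)}

# Posterior over actions given the return, keyed as (a, g).
posterior = {(a, g): joint[(g, a)] / p_g[g] for g, a in pairs}

# E[G * score(A)] under the correct joint.
grad_correct = sum(joint[(g, a)] * g * score(a) for g, a in pairs)

# Same sum after (wrongly) replacing p(a | g) with pi(a).
grad_flawed = sum(p_g[g] * pi[a] * g * score(a) for g, a in pairs)

print("p(a=1 | g=1) =", posterior[(1, 1)], "vs pi(a=1) =", pi[1])
print("correct gradient:", grad_correct)
print("flawed gradient: ", grad_flawed)
```

In this toy case the posterior p(a=1|g=1) differs from \pi(a=1), the correct sum matches the derivative of the expected return with respect to theta, and the flawed sum collapses to zero, which is exactly the step where the proof goes wrong.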