Why does the discount rate (γ) in the Monte-Carlo Policy-Gradient Method (episodic) appear twice?
I was reading the book Reinforcement Learning: An Introduction by Richard S. Sutton and Andrew G. Barto (complete draft, November 5, 2017).
On page 271, the pseudo-code for the episodic Monte-Carlo Policy-Gradient Method is presented. Looking at this pseudo-code, I can't understand why the discount rate appears twice: once in the update step and a second time inside the return. [See the figure below]
It seems that the returns for the steps after step 1 are just truncations of the return of the first step. Also, if you look just one page above in the book, you find an equation with just one discount rate (the one inside the return).
Why then does the pseudo-code seem to be different? My guess is that I am misunderstanding something:
$$
\boldsymbol{\theta}_{t+1} \doteq \boldsymbol{\theta}_t + \alpha G_t \frac{\nabla_{\boldsymbol{\theta}} \pi\left(A_t \middle| S_t, \boldsymbol{\theta}_t\right)}{\pi\left(A_t \middle| S_t, \boldsymbol{\theta}_t\right)}.
\tag{13.6}
$$
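For reference, since the figure with the pseudo-code isn't reproduced here, the update line in that pseudo-code (as best I can transcribe it from the draft) is roughly
$$
\boldsymbol{\theta} \leftarrow \boldsymbol{\theta} + \alpha \gamma^t G \, \nabla_{\boldsymbol{\theta}} \ln \pi\left(A_t \middle| S_t, \boldsymbol{\theta}\right),
\qquad G = \sum_{k=t+1}^{T} \gamma^{\,k-t-1} R_k,
$$
so $\gamma$ shows up both as the $\gamma^t$ factor in the update and inside the return $G$.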
Thanks in advance
reinforcement-learning algorithm theorics
asked Aug 22 at 18:06 by Diego Orellana, edited Aug 22 at 21:04 by Nat
Welcome to SE:AI! Thanks for posting the graphic from the book. (Free pdf of the working draft here, which includes the cited page 271)
– DukeZhou, Aug 22 at 18:16
2 Answers
Accepted answer (Neil Slater, answered Aug 22 at 18:58):
The discount factor does appear twice, and this is correct.
This is because the function you are trying to maximise in REINFORCE for an episodic problem (by taking the gradient) is the expected return from a given (distribution of) start state:
$$J(\theta) = \mathbb{E}_{\pi(\theta)}\left[G_t \mid S_t = s_0, t=0\right]$$
Therefore, during the episode, when you sample the returns $G_1$, $G_2$, etc., they are less relevant to the problem you are solving: they get reduced by the discount factor a second time, as you noted. At the extreme, with an episodic problem and $\gamma = 0$, REINFORCE will only find an optimal policy for the first action.
Other algorithms that work in continuing problems, such as Actor-Critic, use different formulations for $J(\theta)$, so they do not have that factor of $\gamma^t$.
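To make the two appearances of $\gamma$ concrete, here is a minimal, illustrative Python sketch of the episodic REINFORCE loop described above. This is my own code, not the book's pseudocode verbatim; the reinforce_update and grad_log_pi names and the toy bandit at the end are hypothetical.

import numpy as np

def reinforce_update(theta, episode, alpha, gamma, grad_log_pi):
    # One REINFORCE pass over a finished episode.
    # episode: list of (state, action, reward) tuples, where the reward is the
    #          one received after taking `action` in `state`.
    # grad_log_pi: caller-supplied function (state, action, theta) -> grad of ln pi.
    T = len(episode)
    rewards = [r for (_, _, r) in episode]
    for t, (state, action, _) in enumerate(episode):
        # gamma appearance #1: inside the return G_t, discounting rewards
        # relative to step t.
        G = sum(gamma ** (k - t) * rewards[k] for k in range(t, T))
        # gamma appearance #2: the gamma**t factor in the update itself,
        # which discounts the whole step-t contribution back to the start state.
        theta = theta + alpha * (gamma ** t) * G * grad_log_pi(state, action, theta)
    return theta

# Toy usage: a 2-armed bandit with a state-independent softmax policy.
def grad_log_softmax(state, action, theta):
    probs = np.exp(theta) / np.exp(theta).sum()
    grad = -probs
    grad[action] += 1.0          # grad of ln softmax = onehot(action) - probs
    return grad

rng = np.random.default_rng(0)
theta = np.zeros(2)
for _ in range(500):
    probs = np.exp(theta) / np.exp(theta).sum()
    action = rng.choice(2, p=probs)
    reward = float(rng.random() < (0.8 if action == 1 else 0.2))
    theta = reinforce_update(theta, [(0, action, reward)], alpha=0.1, gamma=0.9,
                             grad_log_pi=grad_log_softmax)
print(theta)                     # preference for arm 1 should end up higher

Recomputing G from scratch at every step is O(T^2), but it keeps each of the two uses of gamma visible on its own line; in practice you would accumulate the returns backwards in a single pass.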
Answer (Dennis Soemers, answered Aug 22 at 19:11):
Neil's answer already provides some intuition as to why the pseudocode (with the extra $\gamma^t$ term) is correct.
I'd just like to additionally clarify that you do not seem to be misunderstanding anything: Equation (13.6) in the book is indeed different from the pseudocode.
Now, I don't have the edition of the book that you mentioned right here, but I do have a later draft from March 22, 2018, and the text on this particular topic seems to be similar. In that edition:
- Near the end of page 326, it is explicitly mentioned that they'll assume $\gamma = 1$ in their proof of the Policy Gradient Theorem.
- That proof eventually leads to the same Equation (13.6) on page 329.
- Immediately below the pseudocode, on page 330, they actually briefly address the difference between the equation and the pseudocode, saying that the difference is due to the assumption of $\gamma = 1$ in the proof.
- Right below that, in Exercise 13.2, they give some hints as to what you should be looking at if you'd like to derive the modified proof for the case where $\gamma < 1$ (see the sketch right after this list).
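For completeness, a rough sketch (my notation, not a quote from the book) of where the extra factor lands when $\gamma < 1$: with the start-state objective $J(\boldsymbol{\theta}) = \mathbb{E}_{\pi}\left[G_0\right]$, the policy gradient takes the form
$$
\nabla J(\boldsymbol{\theta}) = \mathbb{E}_{\pi}\left[\sum_{t=0}^{T-1} \gamma^{t}\, G_t\, \nabla_{\boldsymbol{\theta}} \ln \pi\left(A_t \middle| S_t, \boldsymbol{\theta}\right)\right],
$$
whose per-step sample is exactly the pseudocode's update; Equation (13.6) corresponds to the special case $\gamma = 1$, where the $\gamma^t$ weight disappears.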
Thanks. The explanation of your third point was missing in the 2017 draft.
– Diego Orellana, Aug 22 at 19:19
@DiegoOrellana I can't find a link to the March 22 draft anymore; there appears to be an even later draft (I can't find a date mentioned) here. This version actually has a fancy cover, so it might even be a final version rather than a draft. If the link does get broken in the future, I suspect a new link will be made available here.
– Dennis Soemers, Aug 22 at 19:22