Why does the discount rate (γ) in the Monte-Carlo Policy-Gradient Method (episodic) appear twice?
I was reading the book Reinforcement Learning: An Introduction by Richard S. Sutton and Andrew G. Barto (complete draft, November 5, 2017).
On page 271, the pseudo-code for the episodic Monte-Carlo Policy-Gradient Method is presented. Looking at this pseudo-code, I can't understand why the discount rate appears twice: once in the update step and a second time inside the return. [See the figure below]
It seems that the returns for the steps after step 1 are just truncations of the return of the first step. Also, if you look just one page above in the book, you find an equation with just one discount rate (the one inside the return).
Why then does the pseudo-code seem to be different? My guess is that I am misunderstanding something:
$$
\boldsymbol{\theta}_{t+1} \doteq \boldsymbol{\theta}_t + \alpha G_t \frac{\nabla_{\boldsymbol{\theta}} \pi\left(A_t \middle| S_t, \boldsymbol{\theta}_t\right)}{\pi\left(A_t \middle| S_t, \boldsymbol{\theta}_t\right)}.
\tag{13.6}
$$
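For reference, since the figure with the pseudo-code isn't reproduced here, the update line in that pseudo-code (as best I can transcribe it from the draft) is roughly
$$
\boldsymbol{\theta} \leftarrow \boldsymbol{\theta} + \alpha \gamma^t G \, \nabla_{\boldsymbol{\theta}} \ln \pi\left(A_t \middle| S_t, \boldsymbol{\theta}\right),
\qquad G = \sum_{k=t+1}^{T} \gamma^{\,k-t-1} R_k,
$$
so $\gamma$ shows up both as the $\gamma^t$ factor in the update and inside the return $G$.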
Thanks in advance
reinforcement-learning algorithm theorics
asked Aug 22 at 18:06 by Diego Orellana, edited Aug 22 at 21:04 by Nat
Welcome to SE:AI! Thanks for posting the graphic from the book. (Free pdf of the working draft here, which includes the cited page 271)
– DukeZhou, Aug 22 at 18:16
2 Answers
Accepted answer (Neil Slater, answered Aug 22 at 18:58):
The discount factor does appear twice, and this is correct.
This is because the function you are trying to maximise in REINFORCE for an episodic problem (by taking the gradient) is the expected return from a given (distribution of) start state:
$$J(\theta) = \mathbb{E}_{\pi(\theta)}\left[G_t \mid S_t = s_0, t=0\right]$$
Therefore, during the episode, when you sample the returns $G_1$, $G_2$, etc., they are less relevant to the problem you are solving: they get reduced by the discount factor a second time, as you noted. At the extreme, with an episodic problem and $\gamma = 0$, REINFORCE will only find an optimal policy for the first action.
Other algorithms that work in continuing problems, such as Actor-Critic, use different formulations for $J(\theta)$, so they do not have that factor of $\gamma^t$.
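To make the two appearances of $\gamma$ concrete, here is a minimal, illustrative Python sketch of the episodic REINFORCE loop described above. This is my own code, not the book's pseudocode verbatim; the reinforce_update and grad_log_pi names and the toy bandit at the end are hypothetical.

import numpy as np

def reinforce_update(theta, episode, alpha, gamma, grad_log_pi):
    # One REINFORCE pass over a finished episode.
    # episode: list of (state, action, reward) tuples, where the reward is the
    #          one received after taking `action` in `state`.
    # grad_log_pi: caller-supplied function (state, action, theta) -> grad of ln pi.
    T = len(episode)
    rewards = [r for (_, _, r) in episode]
    for t, (state, action, _) in enumerate(episode):
        # gamma appearance #1: inside the return G_t, discounting rewards
        # relative to step t.
        G = sum(gamma ** (k - t) * rewards[k] for k in range(t, T))
        # gamma appearance #2: the gamma**t factor in the update itself,
        # which discounts the whole step-t contribution back to the start state.
        theta = theta + alpha * (gamma ** t) * G * grad_log_pi(state, action, theta)
    return theta

# Toy usage: a 2-armed bandit with a state-independent softmax policy.
def grad_log_softmax(state, action, theta):
    probs = np.exp(theta) / np.exp(theta).sum()
    grad = -probs
    grad[action] += 1.0          # grad of ln softmax = onehot(action) - probs
    return grad

rng = np.random.default_rng(0)
theta = np.zeros(2)
for _ in range(500):
    probs = np.exp(theta) / np.exp(theta).sum()
    action = rng.choice(2, p=probs)
    reward = float(rng.random() < (0.8 if action == 1 else 0.2))
    theta = reinforce_update(theta, [(0, action, reward)], alpha=0.1, gamma=0.9,
                             grad_log_pi=grad_log_softmax)
print(theta)                     # preference for arm 1 should end up higher

Recomputing G from scratch at every step is O(T^2), but it keeps each of the two uses of gamma visible on its own line; in practice you would accumulate the returns backwards in a single pass.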
Answer (Dennis Soemers, answered Aug 22 at 19:11):
Neil's answer already provides some intuition as to why the pseudocode (with the extra $\gamma^t$ term) is correct.
I'd just like to additionally clarify that you do not seem to be misunderstanding anything: Equation (13.6) in the book is indeed different from the pseudocode.
Now, I don't have the edition of the book that you mentioned right here, but I do have a later draft from March 22, 2018, and the text on this particular topic seems to be similar. In that edition:
- Near the end of page 326, it is explicitly mentioned that they'll assume $\gamma = 1$ in their proof of the Policy Gradient Theorem.
- That proof eventually leads to the same Equation (13.6) on page 329.
- Immediately below the pseudocode, on page 330, they actually briefly address the difference between the equation and the pseudocode, saying that the difference is due to the assumption of $\gamma = 1$ in the proof.
- Right below that, in Exercise 13.2, they give some hints as to what you should be looking at if you'd like to derive the modified proof for the case where $\gamma < 1$ (see the sketch right after this list).
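For completeness, a rough sketch (my notation, not a quote from the book) of where the extra factor lands when $\gamma < 1$: with the start-state objective $J(\boldsymbol{\theta}) = \mathbb{E}_{\pi}\left[G_0\right]$, the policy gradient takes the form
$$
\nabla J(\boldsymbol{\theta}) = \mathbb{E}_{\pi}\left[\sum_{t=0}^{T-1} \gamma^{t}\, G_t\, \nabla_{\boldsymbol{\theta}} \ln \pi\left(A_t \middle| S_t, \boldsymbol{\theta}\right)\right],
$$
whose per-step sample is exactly the pseudocode's update; Equation (13.6) corresponds to the special case $\gamma = 1$, where the $\gamma^t$ weight disappears.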
Thanks. The explanation of your third point was missing in the 2017 draft.
– Diego Orellana, Aug 22 at 19:19
@DiegoOrellana I can't find a link to the March 22 draft anymore; there appears to be an even later draft (I can't find a date mentioned) here. This version actually has a fancy cover, so it might even be a final version rather than a draft. If the link does get broken in the future, I suspect a new link will be made available here.
– Dennis Soemers, Aug 22 at 19:22