Why does the discount rate (γ) in the Monte-Carlo Policy-Gradient Method (episodic) appear twice?

I was reading the book Reinforcement Learning: An Introduction by Richard S. Sutton and Andrew G. Barto (complete draft, November 5, 2017).



On page 271, the pseudo-code for the episodic Monte-Carlo Policy-Gradient Method is presented. Looking at this pseudo-code, I can't understand why the discount rate appears twice: once in the update step and a second time inside the return. [See the figure below.]



[Figure: the book's pseudo-code for the episodic Monte-Carlo Policy-Gradient Method (REINFORCE), page 271]



It seems that the returns for the steps after step 1 are just truncations of the return of the first step. Also, just one page above in the book there is an equation with only one discount rate (the one inside the return).
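
For concreteness, this is the definition of the episodic return I have in mind (my own restatement of the book's definition); each later return just restarts the discounting at its own time step:

$$
G_t ~\doteq~ R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots + \gamma^{T-t-1} R_T = \sum_{k=t+1}^{T} \gamma^{k-t-1} R_k.
$$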



Why then does the pseudo-code seem to be different? My guess is that I am misunderstanding something:



$$
\mathbf{\theta}_{t+1} ~\doteq~ \mathbf{\theta}_t + \alpha G_t \frac{\nabla_{\mathbf{\theta}} \pi\left(A_t \middle| S_t, \mathbf{\theta}_t\right)}{\pi\left(A_t \middle| S_t, \mathbf{\theta}_t\right)}.
\tag{13.6}
$$
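
To make the comparison explicit, here is a minimal sketch (mine, not the book's) of the two updates as I read them, using the identity $\nabla \pi / \pi = \nabla \ln \pi$; grad_log_pi stands for the (precomputed) gradient of $\ln \pi(A_t \mid S_t, \mathbf{\theta}_t)$ and is just a placeholder:

    def update_eq_13_6(theta, alpha, G_t, grad_log_pi):
        # Equation (13.6): theta_{t+1} = theta_t + alpha * G_t * grad log pi
        return theta + alpha * G_t * grad_log_pi

    def update_pseudocode_p271(theta, alpha, gamma, t, G_t, grad_log_pi):
        # The page-271 pseudo-code as I read it: the same update, but with an
        # extra gamma**t factor in front of the return
        return theta + alpha * (gamma ** t) * G_t * grad_log_pi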



Thanks in advance







asked Aug 22 at 18:06 by Diego Orellana, edited Aug 22 at 21:04 by Nat
• Welcome to SE:AI! Thanks for posting the graphic from the book. (Free pdf of the working draft here, which includes the cited page 271) – DukeZhou♦, Aug 22 at 18:16















2 Answers






Accepted answer (4 votes) – Neil Slater, answered Aug 22 at 18:58









The discount factor does appear twice, and this is correct.



This is because the function you are trying to maximise in REINFORCE for an episodic problem (by taking the gradient) is the expected return from a given (distribution of) start state:



$$J(\theta) = \mathbb{E}_{\pi(\theta)}[G_t \mid S_t = s_0, t=0]$$



Therefore, during the episode, when you sample the returns $G_1$, $G_2$, etc., these will be less relevant to the problem you are solving, reduced by the discount factor a second time as you noted. In the extreme case of an episodic problem with $\gamma = 0$, REINFORCE will only find an optimal policy for the first action.



Other algorithms that work on continuing problems, such as Actor-Critic, use different formulations for $J(\theta)$, so they do not have that factor of $\gamma^t$.
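
To make the "discounted twice" point concrete, here is a rough sketch of one episode of REINFORCE with the $\gamma^t$ factor; it is my own illustration rather than the book's code, and run_episode and grad_log_pi are hypothetical placeholders for your environment rollout and policy-gradient function:

    def reinforce_episode(theta, alpha, gamma, run_episode, grad_log_pi):
        # run_episode(theta)       -> (states, actions, rewards), with rewards[t] = R_{t+1}
        # grad_log_pi(theta, s, a) -> gradient of log pi(a | s, theta)
        states, actions, rewards = run_episode(theta)
        T = len(rewards)
        for t in range(T):
            # G_t discounts future rewards relative to step t (the first gamma) ...
            G_t = sum(gamma ** (k - t) * rewards[k] for k in range(t, T))
            # ... while gamma**t discounts step t relative to the start of the
            # episode (the second gamma), because J(theta) is the expected
            # return from t = 0.
            theta = theta + alpha * (gamma ** t) * G_t * grad_log_pi(theta, states[t], actions[t])
        return theta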






Answer (4 votes) – Dennis Soemers, answered Aug 22 at 19:11
Neil's answer already provides some intuition as to why the pseudocode (with the extra $\gamma^t$ term) is correct.

I'd just like to additionally clarify that you do not seem to be misunderstanding anything: Equation (13.6) in the book is indeed different from the pseudocode.

Now, I don't have the edition of the book that you mentioned right here, but I do have a later draft from March 22, 2018, and the text on this particular topic seems to be similar. In this edition:

• Near the end of page 326, it is explicitly mentioned that they'll assume $\gamma = 1$ in their proof of the Policy Gradient Theorem.

• That proof eventually leads to the same Equation (13.6) on page 329.

• Immediately below the pseudocode, on page 330, they briefly address the difference between the equation and the pseudocode, saying that it is due to the assumption of $\gamma = 1$ in the proof.

• Right below that, in Exercise 13.2, they give some hints as to what you should look at if you'd like to derive the modified proof for the case where $\gamma < 1$ (a sketch of the resulting update follows below).
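
For reference, this is my reading of the update that the pseudocode actually performs once the $\gamma < 1$ case is accounted for (my own rendering, not a quotation from the book):

$$
\mathbf{\theta}_{t+1} ~\doteq~ \mathbf{\theta}_t + \alpha \gamma^t G_t \nabla_{\mathbf{\theta}} \ln \pi\left(A_t \middle| S_t, \mathbf{\theta}_t\right)
$$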





• Thanks. The explanation of your third point was missing in the 2017 draft. – Diego Orellana, Aug 22 at 19:19

• @DiegoOrellana I can't find a link to the March 22 draft anymore; there appears to be an even later draft (can't find a date mentioned) here. This version actually has a fancy cover, so it might even be a final version rather than a draft. If the link does get broken in the future, I suspect a new link will be made available here. – Dennis Soemers, Aug 22 at 19:22









