Is the discount not needed in a deterministic environment for Reinforcement Learning?

I'm currently reading the book "Deep Reinforcement Learning Hands-On", and in the chapter on AlphaGo Zero the author writes the following:




Self-play



In AlphaGo Zero, the NN is used to approximate the prior probabilities of the actions and evaluate the position, which is very similar to the Actor-Critic (A2C) two-headed setup. On the input of the network, we pass the current game position (augmented with several previous positions) and return two values. The policy head returns the probability distribution over the actions and the value head estimates the game outcome as seen from the player's perspective. This value is undiscounted, as moves in Go are deterministic. Of course, if you have stochasticity in the game, like in backgammon, some discounting should be used.




All the environments that I have seen so far are stochastic environments, and I understand that the discount factor is needed in a stochastic environment.
I also understand that the discount factor should be used in infinite-horizon environments (episodes that never end) in order to keep the sum of rewards from becoming infinite.



But I have never heard (at least so far, in my limited learning) that the discount factor is NOT needed in a deterministic environment. Is that correct? And if so, why is it not needed?







asked Aug 15 at 16:58 by Blaszard · edited Aug 15 at 20:25 by DukeZhou♦










          1 Answer

          The motivation for adding the discount factor $\gamma$ is generally, at least initially, based simply on "theoretical convenience". Ideally, we'd like to define the "objective" of an RL agent as maximizing the sum of all the rewards it gathers; its return, defined as:



          $$\sum_{t = 0}^{\infty} R_t,$$



          where $R_t$ denotes the immediate reward at time $t$. As you also already noted in your question, this is inconvenient from a theoretical point of view, because we can have many different such sums that all end up being equal to $\infty$, and then the objective of "maximizing" that quantity becomes quite meaningless. So, by far the most common solution is to introduce a discount factor $0 \leq \gamma < 1$, and formulate our objective as maximizing the discounted return:



          $$\sum_{t = 0}^{\infty} \gamma^t R_t.$$



          Now we have an objective that will never be equal to $\infty$, so maximizing that objective always has a well-defined meaning.
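
          To see concretely why this discounted sum stays finite (assuming, as is standard, that the rewards are bounded in magnitude by some $R_{\max}$), note that it is dominated by a geometric series:

          $$\left| \sum_{t = 0}^{\infty} \gamma^t R_t \right| \leq \sum_{t = 0}^{\infty} \gamma^t R_{\max} = \frac{R_{\max}}{1 - \gamma} < \infty.$$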




          As far as I am aware, the motivation described above is the only motivation for a discount factor being strictly necessary / needed. This is not related to the problem being stochastic or deterministic.



          If we have a stochastic environment, which is guaranteed to have a finite duration of at most $T$, we can define our objective as maximizing the following quantity:



          $$\sum_{t = 0}^{T} R_t,$$



          where $R_t$ is a random variable drawn from some distribution. Even in the case of stochastic environments, this is well-defined, so we do not strictly need a discount factor.
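
          As a minimal sketch (hypothetical reward values, purely illustrative), the return of a finite episode is a perfectly well-defined number whether or not we discount; $\gamma < 1$ only becomes strictly necessary once the horizon is infinite:

              def episode_return(rewards, gamma=1.0):
                  """Return of one finite episode: G_0 = sum_t gamma^t * R_t."""
                  g = 0.0
                  for r in reversed(rewards):  # backwards: G_t = R_t + gamma * G_{t+1}
                      g = r + gamma * g
                  return g

              rewards = [0.0, 0.0, 1.0]            # hypothetical three-step episode
              print(episode_return(rewards))       # 1.0  (undiscounted, gamma = 1)
              print(episode_return(rewards, 0.9))  # 0.81 (discounted)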




          Above, I addressed the question of whether or not a discount factor is necessary. This does not tell the full story though. Even in cases where a discount factor is not strictly necessary, it still might be useful.



          Intuitively, discount factors $\gamma < 1$ tell us that rewards that are nearby in a temporal sense (reachable in a low number of time steps) are more important than rewards that are far away. In problems with a finite time horizon $T$, this is probably not true, but it can still be a useful heuristic / rule of thumb.



          Such a rule of thumb is particularly useful in stochastic environments, because stochasticity can introduce greater variance / uncertainty over long amounts of time than over short amounts of time. So, even if in an ideal world we'd prefer to maximize our expected sum of undiscounted rewards, it is often easier to learn how to effectively maximize a discounted sum; we'll learn behaviour that mitigates uncertainty caused by stochasticity because it prioritizes short-term rewards over long-term rewards.



          This rule of thumb especially makes a lot of sense in stochastic environments, but I don't agree with the implication in that book that it would be restricted to stochastic environments. A discount factor $\gamma < 1$ has also often been found to be beneficial for learning performance in deterministic environments, even if afterwards we evaluate an algorithm's performance according to the undiscounted returns, likely because it leads to a "simpler" learning problem. In a deterministic environment there may not be any uncertainty / variance that grows over time due to the environment itself, but during the training process there is still uncertainty / variance in our agent's behaviour which grows over time. For example, it will often be selecting suboptimal actions for the sake of exploration.
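
          As a minimal sketch of that last point (a hypothetical toy example, not from the book): tabular Q-learning on a small deterministic chain, trained with a discount factor $\gamma = 0.99$, but with the learned greedy policy evaluated by its plain undiscounted episode return.

              import random

              # Hypothetical deterministic chain MDP: states 0..N-1, action 0 moves
              # left, action 1 moves right; reaching the rightmost state ends the
              # episode with reward +1, every other step gives reward 0.
              N = 6

              def step(state, action):
                  next_state = max(0, state - 1) if action == 0 else state + 1
                  done = next_state == N - 1
                  return next_state, (1.0 if done else 0.0), done

              def greedy(qvals):
                  best = max(qvals)
                  return random.choice([a for a, q in enumerate(qvals) if q == best])

              # Train tabular Q-learning WITH a discount factor gamma < 1 ...
              gamma, alpha, epsilon = 0.99, 0.5, 0.1
              Q = [[0.0, 0.0] for _ in range(N)]

              for _ in range(500):
                  s, done = 0, False
                  for _ in range(100):  # cap episode length during training
                      a = random.randrange(2) if random.random() < epsilon else greedy(Q[s])
                      s2, r, done = step(s, a)
                      target = r if done else r + gamma * max(Q[s2])
                      Q[s][a] += alpha * (target - Q[s][a])
                      s = s2
                      if done:
                          break

              # ... but EVALUATE the greedy policy by its undiscounted episode return.
              s, done, undiscounted_return = 0, False, 0.0
              for _ in range(100):
                  s, r, done = step(s, greedy(Q[s]))
                  undiscounted_return += r
                  if done:
                      break
              print(undiscounted_return)  # 1.0 once the policy has learned to walk right

          The training target uses the discounted bootstrap $r + \gamma \max_a Q(s', a)$, while the evaluation loop simply sums raw rewards, mirroring the common practice described above.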






          answered Aug 15 at 17:32 by Dennis Soemers · edited Aug 15 at 17:37






















          • Quite elucidating. So glad to see the math formatting getting immediate use. Possibly dumb question, but can I ask why t is superscripted with the gamma?
            – DukeZhou♦ Aug 15 at 20:23

          • @DukeZhou It's $\gamma$ raised to the power $t$ (time). Suppose, for example, that $\gamma = 0.9$. Then our first reward ($R_0$) will be multiplied by $0.9^0 = 1$ (fully valued). The second reward ($R_1$) is multiplied by $0.9^1 = 0.9$ (only "90% important"). The third reward is multiplied by $0.9^2 = 0.81$ (only "81% important"), etc. Such a sum can be proven to never reach $\infty$ (assuming that none of the individual rewards $R_t$ are equal to $\infty$).
            – Dennis Soemers Aug 16 at 8:11









