Is the discount not needed in a deterministic environment for Reinforcement Learning?
I'm currently reading the book "Deep Reinforcement Learning Hands-On", and the author says the following in the chapter about AlphaGo Zero:
Self-play
In AlphaGo Zero, the NN is used to approximate the prior probabilities of the actions and evaluate the position, which is very similar to the Actor-Critic (A2C) two-headed setup. On the input of the network, we pass the current game position (augmented with several previous positions) and return two values. The policy head returns the probability distribution over the actions and the value head estimates the game outcome as seen from the player's perspective. This value is undiscounted, as moves in Go are deterministic. Of course, if you have stochasticity in the game, like in backgammon, some discounting should be used.
All the environments I have seen so far are stochastic, and I understand that the discount factor is needed in stochastic environments.
I also understand that a discount factor is needed in infinite-horizon environments (no episode end) so that the return does not become an infinite sum.
But I have never heard (at least so far, in my limited learning) that the discount factor is NOT needed in deterministic environments. Is that correct? And if so, why is it not needed?
reinforcement-learning q-learning discount-factor
asked Aug 15 at 16:58 by Blaszard
edited Aug 15 at 20:25 by DukeZhou♦
1 Answer
Accepted answer (6 votes)
The motivation for adding the discount factor $\gamma$ is generally, at least initially, based simply on "theoretical convenience". Ideally, we'd like to define the "objective" of an RL agent as maximizing the sum of all the rewards it gathers; its return, defined as:
$$\sum_{t=0}^{\infty} R_t,$$
where $R_t$ denotes the immediate reward at time $t$. As you already noted in your question, this is inconvenient from a theoretical point of view, because we can have many different such sums that all end up being equal to $\infty$, and then the objective of "maximizing" that quantity becomes quite meaningless. So, by far the most common solution is to introduce a discount factor $0 \leq \gamma < 1$ and formulate our objective as maximizing the discounted return:
$$\sum_{t=0}^{\infty} \gamma^t R_t.$$
Now we have an objective that will never be equal to $\infty$ (assuming the individual rewards are bounded), so maximizing that objective always has a well-defined meaning.
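To make the convergence concrete, here is a minimal Python sketch (my own illustration, not part of the original answer) comparing partial sums of an undiscounted and a discounted reward stream; the constant reward of 1 per step and $\gamma = 0.99$ are arbitrary choices for the example:

```python
# Minimal sketch: constant reward of 1 per step, gamma chosen arbitrarily.
gamma = 0.99

undiscounted = 0.0
discounted = 0.0
for t in range(10_000):
    reward = 1.0                          # immediate reward R_t
    undiscounted += reward                # grows without bound as t increases
    discounted += (gamma ** t) * reward   # geometric series, bounded by 1 / (1 - gamma)

print(undiscounted)  # 10000.0, and still growing if we kept going
print(discounted)    # ~100.0, i.e. close to 1 / (1 - gamma)
```

For a constant reward $R$, the discounted sum converges to $R / (1 - \gamma)$, which is exactly why the discounted objective stays well defined.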
As far as I am aware, the motivation described above is the only motivation for a discount factor being strictly necessary / needed. This is not related to the problem being stochastic or deterministic.
If we have a stochastic environment, which is guaranteed to have a finite duration of at most $T$, we can define our objective as maximizing the following quantity:
$$\sum_{t=0}^{T} R_t,$$
where $R_t$ is a random variable drawn from some distribution. Even in the case of stochastic environments, this is well defined, so we do not strictly need a discount factor.
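For example, here is a small Python sketch (again my own, with a made-up reward distribution) estimating the expected undiscounted return of a finite-horizon stochastic episode; no discount factor is involved and everything stays finite:

```python
import random

T = 50             # finite episode length (assumed for illustration)
EPISODES = 10_000  # number of Monte Carlo rollouts

def episode_return():
    # Undiscounted return: sum of T random immediate rewards, each drawn
    # from a toy Gaussian distribution chosen purely for illustration.
    return sum(random.gauss(1.0, 0.5) for _ in range(T))

# Monte Carlo estimate of the expected undiscounted return.
estimate = sum(episode_return() for _ in range(EPISODES)) / EPISODES
print(estimate)    # close to T * 1.0 = 50.0
```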
Above, I addressed the question of whether or not a discount factor is necessary. This does not tell the full story though. Even in cases where a discount factor is not strictly necessary, it still might be useful.
Intuitively, discount factors $\gamma < 1$ tell us that rewards that are nearby in a temporal sense (reachable in a low number of time steps) are more important than rewards that are far away. In problems with a finite time horizon $T$, this is probably not true, but it can still be a useful heuristic / rule of thumb.
Such a rule of thumb is particularly useful in stochastic environments, because stochasticity can introduce greater variance / uncertainty over long amounts of time than over short amounts of time. So, even if in an ideal world we'd prefer to maximize our expected sum of undiscounted rewards, it is often easier to learn how to effectively maximize a discounted sum; we'll learn behaviour that mitigates uncertainty caused by stochasticity because it prioritizes short-term rewards over long-term rewards.
This rule of thumb especially makes a lot of sense in stochastic environments, but I don't agree with the implication in that book that it would be restricted to stochastic environments. A discount factor $\gamma < 1$ has also often been found to be beneficial for learning performance in deterministic environments, even if afterwards we evaluate an algorithm's performance according to the undiscounted returns, likely because it leads to a "simpler" learning problem. In a deterministic environment there may not be any uncertainty / variance that grows over time due to the environment itself, but during a training process there is still uncertainty / variance in our agent's behaviour, which grows over time; for example, it will often be selecting suboptimal actions for the sake of exploration.
answered Aug 15 at 17:32 (edited Aug 15 at 17:37) by Dennis Soemers
Quite elucidating. So glad to see the math formatting getting immediate use. Possibly dumb question, but can I ask why $t$ appears as a superscript on the gamma?
– DukeZhou♦
Aug 15 at 20:23
@DukeZhou It's $\gamma$ raised to the power $t$ (time). Suppose, for example, that $\gamma = 0.9$. Then our first reward ($R_0$) will be multiplied by $0.9^0 = 1$ (fully valued). The second reward ($R_1$) is multiplied by $0.9^1 = 0.9$ (only "90% important"). The third reward is multiplied by $0.9^2 = 0.81$ (only "81% important"), etc. Such a sum can be proven to never reach $\infty$ (assuming that none of the individual rewards $R_t$ are equal to $\infty$).
– Dennis Soemers
Aug 16 at 8:11