Gradient descent optimization

I am trying to understand gradient descent optimization in ML algorithms. I understand that there is a cost function, and the aim is to minimize the error $\hat{y} - y$. In a scenario where two weights, w1 and w2, are optimized to give the minimum error: when the optimization proceeds through partial derivatives, does each iteration change both w1 and w2, or does it work on one weight at a time, e.g. only w1 is changed for a few iterations and, once w1 no longer reduces the error, the derivative moves on to w2, until a local minimum is reached? The application could be a linear regression model, a logistic regression model, or a boosting algorithm.










      optimization gradient-descent






asked 1 hour ago by Pb89

          2 Answers
Gradient descent is applied to both w1 and w2 in each iteration. In each iteration the parameters are updated according to their gradients, and w1 and w2 will generally have different partial derivatives.

See https://spin.atomicobject.com/2014/06/24/gradient-descent-linear-regression.
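As an illustration only (not part of the linked post), here is a minimal NumPy sketch assuming a simple linear model y ≈ w1*x + w2 with a squared-error cost; both weights are updated in every iteration, each using its own partial derivative:

import numpy as np

# Toy data generated from w1 = 2, w2 = 1 (illustrative assumption)
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.0, 3.0, 5.0, 7.0])

w1, w2 = 0.0, 0.0    # initial weights
eta = 0.05           # learning rate (problem-specific)

for step in range(500):
    y_hat = w1 * x + w2              # predictions
    err = y_hat - y                  # residuals
    grad_w1 = 2 * np.mean(err * x)   # partial derivative of the MSE w.r.t. w1
    grad_w2 = 2 * np.mean(err)       # partial derivative of the MSE w.r.t. w2
    # both weights move in the same iteration, each along its own partial derivative
    w1 -= eta * grad_w1
    w2 -= eta * grad_w2

print(w1, w2)   # approaches (2.0, 1.0)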






answered 53 mins ago by SmallChess
Gradient descent updates all parameters at each step. You can see this in the update rule:

$$
w^{(t+1)} = w^{(t)} - \eta \nabla f\left(w^{(t)}\right).
$$

Since the gradient of the loss function $\nabla f(w)$ is vector-valued with dimension matching that of $w$, all parameters are updated at each iteration.
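For a concrete, purely illustrative sketch of that rule, take the quadratic loss $f(w) = \lVert Aw - b \rVert^2$ as a stand-in for any differentiable loss; the whole weight vector is updated in one operation:

import numpy as np

# Hypothetical quadratic loss f(w) = ||A w - b||^2, chosen only so the gradient is easy to write
A = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
b = np.array([1.0, 2.0, 3.0])

def grad_f(w):
    # gradient of ||A w - b||^2 is 2 A^T (A w - b); it has the same dimension as w
    return 2.0 * A.T @ (A @ w - b)

w = np.zeros(2)   # w = (w1, w2)
eta = 0.005       # learning rate

for t in range(5000):
    w = w - eta * grad_f(w)   # every component of w is updated at once

print(w)   # close to the minimiser (0.0, 0.5) for this particular A and b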






answered 39 mins ago by Sycorax
• So the algorithm may try different combinations, like increasing w1 and decreasing w2, based on the direction given by the partial derivatives, in order to reach a local minimum, and, just to confirm, it will not necessarily find the global minimum?
  – Pb89, 22 mins ago











• And does the partial derivative also tell you by how much w1 and w2 should be increased or decreased, or is that determined by the learning rate/shrinkage, with the partial derivative only providing the direction of descent?
  – Pb89, 19 mins ago










• The gradient is a vector, so it gives a direction and a magnitude. A vector can be arbitrarily rescaled by a positive scalar and it will have the same direction, but the rescaling will change its magnitude.
  – Sycorax, 9 mins ago










• If the magnitude is also given by the gradient, then what is the role of shrinkage or the learning rate?
  – Pb89, 7 mins ago










• The learning rate rescales the gradient. Suppose $\nabla f(x)$ has a large norm (length). Taking a large step will move you to a distant part of the loss surface (jumping from one mountain to another). The core justification of gradient descent is that it's a linear approximation in the vicinity of $w^{(t)}$. That approximation is always inexact, but it's probably worse the farther away you move -- hence, you want to take small steps, so you use some small $\eta$, where 'small' is entirely problem-specific.
  – Sycorax, 4 mins ago
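To make that concrete (an illustration of my own, not from the thread), the same gradient vector gives steps of very different lengths depending on $\eta$, while its direction is unchanged:

import numpy as np

grad = np.array([3.0, 4.0])   # a hypothetical gradient at the current point; its norm is 5
w = np.array([10.0, 10.0])    # current weights

for eta in (1.0, 0.1, 0.01):
    step = -eta * grad        # same direction every time, only the length changes
    print(eta, w + step, np.linalg.norm(step))

# eta = 1.0  -> a step of length 5.0  (a big jump across the loss surface)
# eta = 0.01 -> a step of length 0.05 (stays where the linear approximation is reliable)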










