Gradient descent optimization

I am trying to understand gradient descent optimization in ML algorithms. I understand that there is a cost function, where the aim is to minimize the error ŷ - y. Now, in a scenario where two weights w1 and w2 are being optimized to give the minimum error, does the optimization through partial derivatives change both w1 and w2 in each turn, or is it a combination where only w1 is changed for a few iterations and, once w1 no longer reduces the error, the derivative moves on to w2, in order to reach the local minimum? The application can be a linear regression model, a logistic regression model, or boosting algorithms.

Tags: optimization, gradient-descent

asked 4 hours ago by Pb89, edited 17 mins ago by Berkan

  • gradient descent is a decent optimisation algorithm.
    – Berkan
    16 mins ago

4 Answers

"When the optimization does occur through partial derivatives, does each turn change both w1 and w2, or is it a combination where only w1 is changed for a few iterations and, once w1 is no longer reducing the error, the derivative moves on to w2, to reach the local minimum?"




In each iteration, the algorithm changes all the weights at the same time, based on the gradient vector. The gradient is in fact a vector, with the same number of entries as there are weights in the model.



On the other hand, changing one parameter at a time does exist: it is called the coordinate descent algorithm, which is a type of gradient-free optimization algorithm. In practice, it may not work as well as gradient-based algorithms.
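
To make the contrast concrete, here is a minimal sketch (my own illustration, not part of the original answer) on a toy least-squares loss; the data, the learning rate and the cyclic coordinate order are arbitrary choices. Note that the coordinate updates below still use one partial derivative per step, while some coordinate-descent variants are fully derivative-free (e.g. using a line search per coordinate):

    import numpy as np

    # Toy least-squares loss: f(w) = 0.5 * ||X w - y||^2
    X = np.array([[3.0, 1.0], [1.0, 2.0], [0.0, 1.0]])
    y = np.array([1.0, 2.0, 3.0])

    def grad(w):
        # Gradient vector: one partial derivative per weight (length 2 here)
        return X.T @ (X @ w - y)

    eta = 0.05

    # Gradient descent: every weight is moved in every iteration
    w_gd = np.zeros(2)
    for _ in range(500):
        w_gd = w_gd - eta * grad(w_gd)

    # Coordinate descent: only one weight is adjusted per iteration
    w_cd = np.zeros(2)
    for t in range(500):
        j = t % 2                        # cycle through the coordinates w1, w2
        w_cd[j] -= eta * grad(w_cd)[j]   # update a single coordinate at a time

    print(w_gd, w_cd)                    # both approach the least-squares solution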



Here is an interesting answer on gradient-free algorithms:



Is it possible to train a neural network without backpropagation?






answered 2 hours ago by hxd1011, edited 18 mins ago

    Gradient descent updates all parameters at each step. You can see this in the update rule:



    $$
    w^{(t+1)} = w^{(t)} - \eta \nabla f\left(w^{(t)}\right).
    $$



    Since the gradient of the loss function $\nabla f(w)$ is vector-valued with dimension matching that of $w$, all parameters are updated at each iteration.



    The learning rate $\eta$ is a positive number that re-scales the gradient. Taking too large a step can endlessly bounce you across the loss surface with no improvement in your loss function; too small a step can mean tediously slow progress towards the optimum.
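
    As a minimal illustration of this update rule (my own sketch, not part of the original answer; the toy loss and the value of $\eta$ are arbitrary):

        import numpy as np

        def f(w):          # toy loss: f(w) = (w1 - 1)^2 + 10 * (w2 + 2)^2
            return (w[0] - 1.0) ** 2 + 10.0 * (w[1] + 2.0) ** 2

        def grad_f(w):     # its gradient: one partial derivative per parameter
            return np.array([2.0 * (w[0] - 1.0), 20.0 * (w[1] + 2.0)])

        eta = 0.05                       # learning rate (illustrative value)
        w = np.array([5.0, 5.0])
        for t in range(200):
            w = w - eta * grad_f(w)      # w^(t+1) = w^(t) - eta * grad f(w^(t)); all parameters move
        print(w, f(w))                   # approaches the minimizer (1, -2)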






    answered 4 hours ago by Sycorax, edited 3 hours ago

    • So the algorithm may try different combinations, like increasing w1 and decreasing w2 based on the direction from the partial derivatives, to reach the local minimum? And just to confirm, the algorithm will not necessarily always find the global minimum?
      – Pb89
      4 hours ago











    • And does the partial derivative also indicate how much w1 and w2 have to be increased or decreased, or is that done by the learning rate/shrinkage, while the partial derivative only provides the direction of descent?
      – Pb89
      4 hours ago










    • The gradient is a vector, so it gives a direction and a magnitude. A vector can be arbitrarily rescaled by a positive scalar and it will have the same direction, but the rescaling will change its magnitude.
      – Sycorax
      3 hours ago










    • If magnitude is also given by the gradient then what is the role of shrinkage or learning rate?
      – Pb89
      3 hours ago










    • The learning rate rescales the gradient. Suppose $\nabla f(x)$ has a large norm (length). Taking a large step will move you to a distant part of the loss surface (jumping from one mountain to another). The core justification of gradient descent is that it's a linear approximation in the vicinity of $w^{(t)}$. That approximation is always inexact, but it's probably worse the farther away you move; hence, you want to take small steps, so you use some small $\eta$, where 'small' is entirely problem-specific.
      – Sycorax
      3 hours ago


















    The aim of gradient descent is to minimize the cost function. This minimization is achieved by adjusting the weights, in your case w1 and w2. In general there could be n such weights.



    Gradient descent is done in the following way:



    1. Initialize the weights randomly.

    2. Compute the cost function and the gradient with the initialized weights.

    3. Update the weights. It might happen that the gradient is 0 for some
      weights; in that case those weights do not change after the update.
      For example, if the gradient is [1, 0], then w2 will remain unchanged.

    4. Check the cost function with the updated weights; if the decrease is
      still acceptable, continue iterating, otherwise terminate.

    While updating the weights, which weight (w1 or w2) gets changed is decided
    entirely by the gradient. All the weights get updated, though some may not
    change if their gradient component is zero. A minimal sketch of these steps
    is given below.
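
    A minimal sketch of these four steps (my own illustration, not part of the
    original answer; the toy data, learning rate and tolerance are assumed
    values, not defaults of any library):

        import numpy as np

        # Toy data for a least-squares cost J(w) = 0.5 * ||X w - y||^2
        X = np.array([[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]])
        y = np.array([1.0, 2.0, 3.0])

        def cost(w):
            return 0.5 * np.sum((X @ w - y) ** 2)

        def grad(w):
            return X.T @ (X @ w - y)

        rng = np.random.default_rng(0)
        w = rng.normal(size=2)       # step 1: initialize weights randomly
        eta, tol = 0.1, 1e-8         # learning rate and stopping tolerance (assumed)
        prev = cost(w)               # step 2: cost and gradient at the initial weights
        for _ in range(10_000):
            w = w - eta * grad(w)    # step 3: update all weights at once;
                                     # a zero gradient entry leaves that weight unchanged
            curr = cost(w)           # step 4: stop once the decrease becomes negligible
            if prev - curr < tol:
                break
            prev = curr
        print(w, curr)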






    answered 2 hours ago by A Santosh Kumar (new contributor)

    • "if the decrement is acceptable enough continue the iterations else terminate", is there a default value which is applied in packages of python ( sklearn) or R packages such as caret? It can be user specified only in a manually created gradient descent function?
      – Pb89
      23 mins ago

















    Gradient descent is applied to both w1 and w2 in each iteration. During each iteration, the parameters are updated according to their gradients, and w1 and w2 will likely have different partial derivatives.



    Check here.
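
    To see that the two weights generally get different partial derivatives, a
    tiny sketch (my own illustration; the data point and learning rate are
    arbitrary) for a squared-error loss on one observation:

        # L(w1, w2) = (w1*x1 + w2*x2 - y)^2 for a single data point
        x1, x2, y = 2.0, 5.0, 1.0
        w1, w2 = 0.3, -0.1

        residual = w1 * x1 + w2 * x2 - y
        dL_dw1 = 2.0 * residual * x1     # partial derivative w.r.t. w1
        dL_dw2 = 2.0 * residual * x2     # partial derivative w.r.t. w2 (different, since x2 != x1)

        eta = 0.01
        w1 -= eta * dL_dw1               # both weights are moved in the same iteration,
        w2 -= eta * dL_dw2               # each by its own partial derivative
        print(dL_dw1, dL_dw2, w1, w2)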






    answered 4 hours ago by SmallChess, edited 1 hour ago by Sven Hohenstein