Gradient descent optimization

I am trying to understand gradient descent optimization in ML algorithms. I understand that there is a cost function, where the aim is to minimize the error ŷ - y. Now, in a scenario where two weights w1 and w2 are being optimized to give the minimum error, does the optimization through partial derivatives change both w1 and w2 in each turn, or is it a combination where only w1 is changed for a few iterations and, once w1 no longer reduces the error, the derivative moves on to w2, in order to reach the local minimum? The application can be a linear regression model, a logistic regression model, or boosting algorithms.

Tags: optimization, gradient-descent

asked 4 hours ago by Pb89, edited 17 mins ago by Berkan

  • gradient descent is a decent optimisation algorithm.
    – Berkan
    16 mins ago

4 Answers

"When the optimization does occur through partial derivatives, does each turn change both w1 and w2, or is it a combination where only w1 is changed for a few iterations and, once w1 is no longer reducing the error, the derivative moves on to w2, to reach the local minimum?"




In each iteration, the algorithm changes all the weights at the same time, based on the gradient vector. The gradient is in fact a vector, with the same number of entries as there are weights in the model.



On the other hand, changing one parameter at a time does exist: it is called the coordinate descent algorithm, which is a type of gradient-free optimization algorithm. In practice, it may not work as well as gradient-based algorithms.
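
To make the contrast concrete, here is a minimal sketch (my own illustration, not part of the original answer) on a toy least-squares loss; the data, the learning rate and the cyclic coordinate order are arbitrary choices. Note that the coordinate updates below still use one partial derivative per step, while some coordinate-descent variants are fully derivative-free (e.g. using a line search per coordinate):

    import numpy as np

    # Toy least-squares loss: f(w) = 0.5 * ||X w - y||^2
    X = np.array([[3.0, 1.0], [1.0, 2.0], [0.0, 1.0]])
    y = np.array([1.0, 2.0, 3.0])

    def grad(w):
        # Gradient vector: one partial derivative per weight (length 2 here)
        return X.T @ (X @ w - y)

    eta = 0.05

    # Gradient descent: every weight is moved in every iteration
    w_gd = np.zeros(2)
    for _ in range(500):
        w_gd = w_gd - eta * grad(w_gd)

    # Coordinate descent: only one weight is adjusted per iteration
    w_cd = np.zeros(2)
    for t in range(500):
        j = t % 2                        # cycle through the coordinates w1, w2
        w_cd[j] -= eta * grad(w_cd)[j]   # update a single coordinate at a time

    print(w_gd, w_cd)                    # both approach the least-squares solution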



Here is an interesting answer on gradient-free algorithms:



Is it possible to train a neural network without backpropagation?






answered 2 hours ago by hxd1011, edited 18 mins ago

    Gradient descent updates all parameters at each step. You can see this in the update rule:



    $$
    w^{(t+1)} = w^{(t)} - \eta \nabla f\left(w^{(t)}\right).
    $$



    Since the gradient of the loss function $\nabla f(w)$ is vector-valued with dimension matching that of $w$, all parameters are updated at each iteration.



    The learning rate $\eta$ is a positive number that re-scales the gradient. Taking too large a step can endlessly bounce you across the loss surface with no improvement in your loss function; too small a step can mean tediously slow progress towards the optimum.
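
    As a minimal illustration of this update rule (my own sketch, not part of the original answer; the toy loss and the value of $\eta$ are arbitrary):

        import numpy as np

        def f(w):          # toy loss: f(w) = (w1 - 1)^2 + 10 * (w2 + 2)^2
            return (w[0] - 1.0) ** 2 + 10.0 * (w[1] + 2.0) ** 2

        def grad_f(w):     # its gradient: one partial derivative per parameter
            return np.array([2.0 * (w[0] - 1.0), 20.0 * (w[1] + 2.0)])

        eta = 0.05                       # learning rate (illustrative value)
        w = np.array([5.0, 5.0])
        for t in range(200):
            w = w - eta * grad_f(w)      # w^(t+1) = w^(t) - eta * grad f(w^(t)); all parameters move
        print(w, f(w))                   # approaches the minimizer (1, -2)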






    answered 4 hours ago by Sycorax, edited 3 hours ago

    • So the algorithm may try different combinations, like increasing w1 and decreasing w2 based on the direction from the partial derivatives, to reach the local minimum? And just to confirm, the algorithm will not necessarily always find the global minimum?
      – Pb89
      4 hours ago











    • And does the partial derivative also indicate how much w1 and w2 have to be increased or decreased, or is that done by the learning rate/shrinkage, while the partial derivative only provides the direction of descent?
      – Pb89
      4 hours ago










    • The gradient is a vector, so it gives a direction and a magnitude. A vector can be arbitrarily rescaled by a positive scalar and it will have the same direction, but the rescaling will change its magnitude.
      – Sycorax
      3 hours ago










    • If magnitude is also given by the gradient then what is the role of shrinkage or learning rate?
      – Pb89
      3 hours ago










    • The learning rate rescales the gradient. Suppose $\nabla f(x)$ has a large norm (length). Taking a large step will move you to a distant part of the loss surface (jumping from one mountain to another). The core justification of gradient descent is that it's a linear approximation in the vicinity of $w^{(t)}$. That approximation is always inexact, but it's probably worse the farther away you move; hence, you want to take small steps, so you use some small $\eta$, where 'small' is entirely problem-specific.
      – Sycorax
      3 hours ago


















    The aim of gradient descent is to minimize the cost function. This minimization is achieved by adjusting the weights, in your case w1 and w2. In general there could be n such weights.



    Gradient descent is done in the following way:



    1. Initialize the weights randomly.

    2. Compute the cost function and the gradient with the initialized weights.

    3. Update the weights. It might happen that the gradient is 0 for some
      weights; in that case those weights do not change after the update.
      For example, if the gradient is [1, 0], then w2 will remain unchanged.

    4. Check the cost function with the updated weights; if the decrease is
      still acceptable, continue iterating, otherwise terminate.

    While updating the weights, which weight (w1 or w2) gets changed is decided
    entirely by the gradient. All the weights get updated, though some may not
    change if their gradient component is zero. A minimal sketch of these steps
    is given below.
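
    A minimal sketch of these four steps (my own illustration, not part of the
    original answer; the toy data, learning rate and tolerance are assumed
    values, not defaults of any library):

        import numpy as np

        # Toy data for a least-squares cost J(w) = 0.5 * ||X w - y||^2
        X = np.array([[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]])
        y = np.array([1.0, 2.0, 3.0])

        def cost(w):
            return 0.5 * np.sum((X @ w - y) ** 2)

        def grad(w):
            return X.T @ (X @ w - y)

        rng = np.random.default_rng(0)
        w = rng.normal(size=2)       # step 1: initialize weights randomly
        eta, tol = 0.1, 1e-8         # learning rate and stopping tolerance (assumed)
        prev = cost(w)               # step 2: cost and gradient at the initial weights
        for _ in range(10_000):
            w = w - eta * grad(w)    # step 3: update all weights at once;
                                     # a zero gradient entry leaves that weight unchanged
            curr = cost(w)           # step 4: stop once the decrease becomes negligible
            if prev - curr < tol:
                break
            prev = curr
        print(w, curr)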






    answered 2 hours ago by A Santosh Kumar (new contributor)

    • "if the decrement is acceptable enough continue the iterations else terminate", is there a default value which is applied in packages of python ( sklearn) or R packages such as caret? It can be user specified only in a manually created gradient descent function?
      – Pb89
      23 mins ago

















    Gradient descent is applied to both w1 and w2 in each iteration. During each iteration, the parameters are updated according to their gradients, and w1 and w2 will likely have different partial derivatives.



    Check here.
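
    To see that the two weights generally get different partial derivatives, a
    tiny sketch (my own illustration; the data point and learning rate are
    arbitrary) for a squared-error loss on one observation:

        # L(w1, w2) = (w1*x1 + w2*x2 - y)^2 for a single data point
        x1, x2, y = 2.0, 5.0, 1.0
        w1, w2 = 0.3, -0.1

        residual = w1 * x1 + w2 * x2 - y
        dL_dw1 = 2.0 * residual * x1     # partial derivative w.r.t. w1
        dL_dw2 = 2.0 * residual * x2     # partial derivative w.r.t. w2 (different, since x2 != x1)

        eta = 0.01
        w1 -= eta * dL_dw1               # both weights are moved in the same iteration,
        w2 -= eta * dL_dw2               # each by its own partial derivative
        print(dL_dw1, dL_dw2, w1, w2)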






    answered 4 hours ago by SmallChess, edited 1 hour ago by Sven Hohenstein