Gradient descent optimization

I am trying to understand gradient descent optimization in ML algorithms. I understand that there is a cost function, and the aim is to minimize the error $\hat{y} - y$. In a scenario where two weights, w1 and w2, are optimized to give the minimum error: when the optimization proceeds through partial derivatives, does each iteration change both w1 and w2, or does it work on one weight at a time, e.g. only w1 is changed for a few iterations and, once w1 no longer reduces the error, the derivative moves on to w2, until a local minimum is reached? The application could be a linear regression model, a logistic regression model, or a boosting algorithm.










      optimization gradient-descent






asked 1 hour ago by Pb89

          2 Answers
Gradient descent is applied to both w1 and w2 in each iteration. In each iteration the parameters are updated according to their gradients, and w1 and w2 will generally have different partial derivatives.

See https://spin.atomicobject.com/2014/06/24/gradient-descent-linear-regression.
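As an illustration only (not part of the linked post), here is a minimal NumPy sketch assuming a simple linear model y ≈ w1*x + w2 with a squared-error cost; both weights are updated in every iteration, each using its own partial derivative:

import numpy as np

# Toy data generated from w1 = 2, w2 = 1 (illustrative assumption)
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.0, 3.0, 5.0, 7.0])

w1, w2 = 0.0, 0.0    # initial weights
eta = 0.05           # learning rate (problem-specific)

for step in range(500):
    y_hat = w1 * x + w2              # predictions
    err = y_hat - y                  # residuals
    grad_w1 = 2 * np.mean(err * x)   # partial derivative of the MSE w.r.t. w1
    grad_w2 = 2 * np.mean(err)       # partial derivative of the MSE w.r.t. w2
    # both weights move in the same iteration, each along its own partial derivative
    w1 -= eta * grad_w1
    w2 -= eta * grad_w2

print(w1, w2)   # approaches (2.0, 1.0)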






answered 53 mins ago by SmallChess
Gradient descent updates all parameters at each step. You can see this in the update rule:

$$
w^{(t+1)} = w^{(t)} - \eta \nabla f\left(w^{(t)}\right).
$$

Since the gradient of the loss function $\nabla f(w)$ is vector-valued with dimension matching that of $w$, all parameters are updated at each iteration.
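For a concrete, purely illustrative sketch of that rule, take the quadratic loss $f(w) = \lVert Aw - b \rVert^2$ as a stand-in for any differentiable loss; the whole weight vector is updated in one operation:

import numpy as np

# Hypothetical quadratic loss f(w) = ||A w - b||^2, chosen only so the gradient is easy to write
A = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
b = np.array([1.0, 2.0, 3.0])

def grad_f(w):
    # gradient of ||A w - b||^2 is 2 A^T (A w - b); it has the same dimension as w
    return 2.0 * A.T @ (A @ w - b)

w = np.zeros(2)   # w = (w1, w2)
eta = 0.005       # learning rate

for t in range(5000):
    w = w - eta * grad_f(w)   # every component of w is updated at once

print(w)   # close to the minimiser (0.0, 0.5) for this particular A and b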






answered 39 mins ago by Sycorax
• So the algorithm may try different combinations, like increasing w1 and decreasing w2, based on the direction given by the partial derivatives, in order to reach a local minimum, and, just to confirm, it will not necessarily find the global minimum?
  – Pb89, 22 mins ago











• And does the partial derivative also tell you by how much w1 and w2 should be increased or decreased, or is that determined by the learning rate/shrinkage, with the partial derivative only providing the direction of descent?
  – Pb89, 19 mins ago










• The gradient is a vector, so it gives a direction and a magnitude. A vector can be arbitrarily rescaled by a positive scalar and it will have the same direction, but the rescaling will change its magnitude.
  – Sycorax, 9 mins ago










• If the magnitude is also given by the gradient, then what is the role of shrinkage or the learning rate?
  – Pb89, 7 mins ago










• The learning rate rescales the gradient. Suppose $\nabla f(x)$ has a large norm (length). Taking a large step will move you to a distant part of the loss surface (jumping from one mountain to another). The core justification of gradient descent is that it's a linear approximation in the vicinity of $w^{(t)}$. That approximation is always inexact, but it's probably worse the farther away you move -- hence, you want to take small steps, so you use some small $\eta$, where 'small' is entirely problem-specific.
  – Sycorax, 4 mins ago
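To make that concrete (an illustration of my own, not from the thread), the same gradient vector gives steps of very different lengths depending on $\eta$, while its direction is unchanged:

import numpy as np

grad = np.array([3.0, 4.0])   # a hypothetical gradient at the current point; its norm is 5
w = np.array([10.0, 10.0])    # current weights

for eta in (1.0, 0.1, 0.01):
    step = -eta * grad        # same direction every time, only the length changes
    print(eta, w + step, np.linalg.norm(step))

# eta = 1.0  -> a step of length 5.0  (a big jump across the loss surface)
# eta = 0.01 -> a step of length 0.05 (stays where the linear approximation is reliable)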










