Explanation of generalization of Newton's Method for multiple dimensions

I've been following the CS 229 lecture videos for machine learning, and in lecture 4 (~14:00), Ng explains Newton's method for optimization to maximize an objective function ($f$), but doesn't clearly explain the derivation of the higher-dimensional generalization:



$$
\theta := \theta - H^{-1} \nabla_\theta f(\theta)
$$



I've read that the Hessian matrix ($H$) is a multidimensional generalization of the second derivative, but beyond that I'm not sure I understand how the Hessian is formulated or why we multiply by its inverse; perhaps I need to brush up on linear algebra.



What's an intuitive explanation of the Hessian and its inverse in this formula?










asked by Michael Yang
1 Answer

















answered by user20160













Overview

Suppose we want to minimize an objective function $f$ that maps a parameter vector $x \in \mathbb{R}^d$ to a scalar value. The idea behind Newton's method is to locally approximate $f$ with a quadratic function, then solve for the point that minimizes this approximation. We then jump to this new point, where we build a new approximation, and repeat the process until convergence.
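
For concreteness, here is a minimal sketch of that loop in Python/NumPy (my own illustration, not from the lecture or this answer; the `grad` and `hess` callables, the tolerance, and the function name are all assumptions):

```python
import numpy as np

def newtons_method(grad, hess, x0, tol=1e-8, max_iter=50):
    """Minimize f by repeatedly jumping to the minimizer of its local
    quadratic approximation. `grad` and `hess` return the gradient vector
    and Hessian matrix of f at a given point."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad(x)                      # gradient at the current point
        H = hess(x)                      # Hessian at the current point
        step = np.linalg.solve(H, g)     # solves H @ step = g, i.e. step = H^{-1} g
        x = x - step                     # jump to the minimizer of the quadratic model
        if np.linalg.norm(step) < tol:   # stop when the update is negligible
            break
    return x
```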



Local quadratic approximation

Say our current location in parameter space is the vector $x_0$. We can approximate $f$ in the vicinity of $x_0$ using the second order Taylor series expansion:

$$f(x) \approx f_T(x) =
f(x_0) + (x-x_0)^T \nabla f(x_0) + \frac{1}{2}(x-x_0)^T H(x_0) (x-x_0)$$



This is a quadratic function whose value at $x_0$, as well as its first and second partial derivatives, match those of $f$. $\nabla f(x_0)$ is the gradient of $f$ evaluated at $x_0$. This is a vector containing the partial derivatives of $f$, and is analogous to the derivative for functions of a single variable. Similarly, $H(x_0)$ is the Hessian of $f$ evaluated at $x_0$. It's a matrix containing the second partial derivatives of $f$, and is analogous to the second derivative for functions of a single variable.

$$\nabla f(x_0) = \left[ \begin{matrix}
\frac{\partial f}{\partial x_1}(x_0) \\
\vdots \\
\frac{\partial f}{\partial x_d}(x_0)
\end{matrix} \right]
\quad
H(x_0) = \left[ \begin{matrix}
\frac{\partial^2 f}{\partial x_1^2}(x_0) &
\cdots &
\frac{\partial^2 f}{\partial x_1 \partial x_d}(x_0) \\
\vdots & \ddots & \vdots \\
\frac{\partial^2 f}{\partial x_d \partial x_1}(x_0) &
\cdots &
\frac{\partial^2 f}{\partial x_d^2}(x_0)
\end{matrix} \right]$$
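
If it helps to see these two objects concretely, here is a rough numerical sketch (my own illustration, not part of the original answer; the sample function and finite-difference step sizes are arbitrary choices) that builds the gradient vector and Hessian matrix by central differences and shows their shapes:

```python
import numpy as np

def numerical_grad(f, x0, eps=1e-5):
    """Approximate the gradient of f at x0 with central differences."""
    d = len(x0)
    g = np.zeros(d)
    for i in range(d):
        e = np.zeros(d); e[i] = eps
        g[i] = (f(x0 + e) - f(x0 - e)) / (2 * eps)
    return g                                  # shape (d,)

def numerical_hess(f, x0, eps=1e-4):
    """Approximate the Hessian of f at x0 by differencing the gradient."""
    d = len(x0)
    H = np.zeros((d, d))
    for j in range(d):
        e = np.zeros(d); e[j] = eps
        H[:, j] = (numerical_grad(f, x0 + e) - numerical_grad(f, x0 - e)) / (2 * eps)
    return H                                  # shape (d, d), approximately symmetric

# Example: f(x) = x1^2 + 3*x1*x2 + 2*x2^2 (chosen arbitrarily for illustration)
f = lambda x: x[0]**2 + 3*x[0]*x[1] + 2*x[1]**2
x0 = np.array([1.0, 2.0])
print(numerical_grad(f, x0))   # ~ [ 8., 11.]  i.e. [2*x1 + 3*x2, 3*x1 + 4*x2]
print(numerical_hess(f, x0))   # ~ [[2., 3.], [3., 4.]]
```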



Minimizing the local approximation

Given the local approximation $f_T$ at the current point $x_0$, we want to find a new point $x$ that minimizes $f_T$. This can be done in closed form by taking the gradient of $f_T$ with respect to $x$, setting it to zero, then solving for $x$.

The gradient of $f_T$ can be found by differentiating the expression above (keeping in mind that $\nabla f(x_0)$ is just a fixed vector and $H(x_0)$ is just a fixed matrix):

$$\nabla f_T(x) = \nabla f(x_0) + H(x_0) (x - x_0)$$



Setting it to zero, we have:

$$H(x_0) (x - x_0) = -\nabla f(x_0)$$



We need to isolate $x$, so we multiply both sides by the inverse of the Hessian, then move $x_0$ to the right-hand side:



$$x = x_0 - \big( H(x_0) \big)^{-1} \nabla f(x_0)$$
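
As a quick sanity check (my own example, not part of the original answer): when $f$ is exactly quadratic, the local model $f_T$ equals $f$, so a single Newton step from any starting point lands on the exact minimizer. The quadratic below is an arbitrary choice:

```python
import numpy as np

# f(x) = x1^2 + x1*x2 + 2*x2^2 - 3*x1 + x2   (an arbitrary convex quadratic)
grad = lambda x: np.array([2*x[0] + x[1] - 3, x[0] + 4*x[1] + 1])
hess = lambda x: np.array([[2.0, 1.0], [1.0, 4.0]])   # constant Hessian for a quadratic

x0 = np.zeros(2)
step = np.linalg.solve(hess(x0), grad(x0))   # solves H(x0) step = grad(x0)
x1 = x0 - step
print(x1)           # [ 1.857..., -0.714...], i.e. [13/7, -5/7]
print(grad(x1))     # ~ [0., 0.]  -> the quadratic model was exact, so one step suffices
```

Note that the code solves the linear system $H(x_0)(x - x_0) = -\nabla f(x_0)$ rather than forming $\big(H(x_0)\big)^{-1}$ explicitly; numerically this is the usual way to apply the update.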





