Explanation of generalization of Newton's Method for multiple dimensions
I've been following the CS 229 lecture videos for machine learning, and in lecture 4 (~14:00), Ng explains Newton's Method for optimization to maximize an objective function ($f$), but doesn't clearly explain the derivation of its higher-dimensional generalization:
$$
\theta := \theta - H^{-1} \nabla_\theta f(\theta)
$$
I've read that the Hessian matrix ($H$) is a multi-dimensional generalization of the second derivative, but other than that I'm not sure I understand how the Hessian is formed or why we multiply by its inverse; perhaps I need to brush up on linear algebra.
What's an intuitive explanation of the Hessian and its inverse in this formula?
optimization linear-algebra matrix-inverse hessian
asked 5 hours ago by Michael Yang
1 Answer
Overview
Suppose we want to minimize an objective function $f$ that maps a parameter vector $x \in \mathbb{R}^d$ to a scalar value. The idea behind Newton's method is to locally approximate $f$ with a quadratic function, then solve for the point that minimizes this approximation. We then jump to this new point, where we build a new approximation, and repeat the process until convergence.
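To make that iteration concrete, here is a minimal Python/NumPy sketch of the loop just described. The names (`newtons_method`, `grad`, `hess`) are placeholders for this illustration; it assumes you can evaluate the gradient and Hessian of $f$ at any point.

```python
import numpy as np

def newtons_method(grad, hess, x0, tol=1e-8, max_iter=50):
    """Minimize f by repeatedly jumping to the minimizer of a local
    quadratic model. `grad(x)` returns the gradient vector of f at x,
    `hess(x)` returns the Hessian matrix of f at x."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:          # gradient ~ 0: stationary point
            break
        # Solve H(x) @ step = -g rather than forming the inverse explicitly.
        step = np.linalg.solve(hess(x), -g)
        x = x + step
    return x
```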
Local quadratic approximation
Say our current location in parameter space is the vector $x_0$. We can approximate $f$ in the vicinity of $x_0$ using the second order Taylor series expansion:
$$f(x) \approx f_T(x) =
f(x_0) + (x-x_0)^T \nabla f(x_0) + \frac{1}{2}(x-x_0)^T H(x_0) (x-x_0)$$
This is a quadratic function whose value at $x_0$, as well as its first and second partial derivatives, match those of $f$. $\nabla f(x_0)$ is the gradient of $f$ evaluated at $x_0$. This is a vector containing the partial derivatives of $f$, and is analogous to the derivative for functions of a single variable. Similarly, $H(x_0)$ is the Hessian of $f$ evaluated at $x_0$. It's a matrix containing the second partial derivatives of $f$, and is analogous to the second derivative for functions of a single variable.
$$\nabla f(x_0) = \left[ \begin{matrix}
\frac{\partial f}{\partial x_1}(x_0) \\
\vdots \\
\frac{\partial f}{\partial x_d}(x_0)
\end{matrix} \right]
\quad
H(x_0) = \left[ \begin{matrix}
\frac{\partial^2 f}{\partial x_1^2}(x_0) &
\cdots &
\frac{\partial^2 f}{\partial x_1 \partial x_d}(x_0) \\
\vdots & \ddots & \vdots \\
\frac{\partial^2 f}{\partial x_d \partial x_1}(x_0) &
\cdots &
\frac{\partial^2 f}{\partial x_d^2}(x_0)
\end{matrix} \right]$$
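If it helps to see how these objects are built entry by entry, here is a rough finite-difference sketch (my own illustration, with made-up helper names, not how you would compute them in practice): each gradient entry is a partial derivative $\partial f / \partial x_i$, and each Hessian entry $(i,j)$ is a second partial $\partial^2 f / \partial x_i \partial x_j$.

```python
import numpy as np

def numerical_gradient(f, x0, eps=1e-5):
    """Central-difference approximation of the gradient of f at x0."""
    d = x0.size
    g = np.zeros(d)
    for i in range(d):
        e = np.zeros(d)
        e[i] = eps
        g[i] = (f(x0 + e) - f(x0 - e)) / (2 * eps)
    return g

def numerical_hessian(f, x0, eps=1e-4):
    """Central-difference approximation of the Hessian of f at x0.
    Entry (i, j) approximates the mixed partial d^2 f / (dx_i dx_j)."""
    d = x0.size
    H = np.zeros((d, d))
    for i in range(d):
        for j in range(d):
            ei = np.zeros(d); ei[i] = eps
            ej = np.zeros(d); ej[j] = eps
            H[i, j] = (f(x0 + ei + ej) - f(x0 + ei - ej)
                       - f(x0 - ei + ej) + f(x0 - ei - ej)) / (4 * eps ** 2)
    return H
```

For a smooth $f$ the resulting matrix is (approximately) symmetric, since $\partial^2 f / \partial x_i \partial x_j = \partial^2 f / \partial x_j \partial x_i$.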
Minimizing the local approximation
Given the local approximation $f_T$ at the current point $x_0$, we want to find a new point $x$ that minimizes $f_T$. This can be done in closed form by taking the gradient of $f_T$ with respect to $x$, setting it to zero, then solving for $x$.
The gradient of $f_T$ can be found by differentiating the expression above (keeping in mind that $\nabla f(x_0)$ is just a fixed vector and $H(x_0)$ is just a fixed matrix):
$$\nabla f_T(x) = \nabla f(x_0) + H(x_0) (x - x_0)$$
Setting it to zero, we have:
$$H(x_0) (x - x_0) = -\nabla f(x_0)$$
We need to isolate $x$, so multiply both sides by the inverse of the Hessian, then move $x_0$ to the right-hand side:
$$x = x_0 - \big( H(x_0) \big)^{-1} \nabla f(x_0)$$
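Replacing $x$ with $\theta$ and iterating this update gives the rule from the question. As a quick sanity check (my own example, not from the lecture), for a quadratic objective the local model is exact, so a single Newton step lands on the minimizer; in practice we solve the linear system $H(x_0)\,\Delta = -\nabla f(x_0)$ rather than explicitly inverting the Hessian:

```python
import numpy as np

# Quadratic test objective f(x) = 0.5 * x^T A x - b^T x,
# whose gradient is A x - b, Hessian is A, and minimizer solves A x = b.
A = np.array([[3.0, 1.0],
              [1.0, 2.0]])                  # symmetric positive definite
b = np.array([1.0, 1.0])

grad = lambda x: A @ x - b                  # gradient of f at x
hess = lambda x: A                          # Hessian is constant for a quadratic

x0 = np.zeros(2)
# One Newton step: x1 = x0 - H(x0)^{-1} grad f(x0), computed via a linear solve.
x1 = x0 - np.linalg.solve(hess(x0), grad(x0))

print(np.allclose(x1, np.linalg.solve(A, b)))   # True: exact minimum in one step
```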
answered 51 mins ago by user20160