Explanation of generalization of Newton's Method for multiple dimensions

I've been following the CS 229 lecture videos for machine learning, and in lecture 4 (~14:00), Ng explains Newton's method for optimization to maximize an objective function ($f$), but doesn't clearly explain the derivation of the higher-dimensional generalization:



$$
\theta := \theta - H^{-1} \nabla_\theta f(\theta)
$$



I've read that the Hessian matrix ($H$) is a multidimensional generalization of the second derivative, but beyond that I'm not sure I understand how the Hessian is formulated or why we multiply by its inverse; perhaps I need to brush up on linear algebra.



What's an intuitive explanation of the Hessian and its inverse in this formula?










asked by Michael Yang
1 Answer

















answered by user20160













Overview

Suppose we want to minimize an objective function $f$ that maps a parameter vector $x \in \mathbb{R}^d$ to a scalar value. The idea behind Newton's method is to locally approximate $f$ with a quadratic function, then solve for the point that minimizes this approximation. We then jump to this new point, where we build a new approximation, and repeat the process until convergence.
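
For concreteness, here is a minimal sketch of that loop in Python/NumPy (my own illustration, not from the lecture or this answer; the `grad` and `hess` callables, the tolerance, and the function name are all assumptions):

```python
import numpy as np

def newtons_method(grad, hess, x0, tol=1e-8, max_iter=50):
    """Minimize f by repeatedly jumping to the minimizer of its local
    quadratic approximation. `grad` and `hess` return the gradient vector
    and Hessian matrix of f at a given point."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad(x)                      # gradient at the current point
        H = hess(x)                      # Hessian at the current point
        step = np.linalg.solve(H, g)     # solves H @ step = g, i.e. step = H^{-1} g
        x = x - step                     # jump to the minimizer of the quadratic model
        if np.linalg.norm(step) < tol:   # stop when the update is negligible
            break
    return x
```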



Local quadratic approximation

Say our current location in parameter space is the vector $x_0$. We can approximate $f$ in the vicinity of $x_0$ using the second order Taylor series expansion:

$$f(x) \approx f_T(x) =
f(x_0) + (x-x_0)^T \nabla f(x_0) + \frac{1}{2}(x-x_0)^T H(x_0) (x-x_0)$$



This is a quadratic function whose value at $x_0$, as well as its first and second partial derivatives, match those of $f$. $\nabla f(x_0)$ is the gradient of $f$ evaluated at $x_0$. This is a vector containing the partial derivatives of $f$, and is analogous to the derivative for functions of a single variable. Similarly, $H(x_0)$ is the Hessian of $f$ evaluated at $x_0$. It's a matrix containing the second partial derivatives of $f$, and is analogous to the second derivative for functions of a single variable.

$$\nabla f(x_0) = \left[ \begin{matrix}
\frac{\partial f}{\partial x_1}(x_0) \\
\vdots \\
\frac{\partial f}{\partial x_d}(x_0)
\end{matrix} \right]
\quad
H(x_0) = \left[ \begin{matrix}
\frac{\partial^2 f}{\partial x_1^2}(x_0) &
\cdots &
\frac{\partial^2 f}{\partial x_1 \partial x_d}(x_0) \\
\vdots & \ddots & \vdots \\
\frac{\partial^2 f}{\partial x_d \partial x_1}(x_0) &
\cdots &
\frac{\partial^2 f}{\partial x_d^2}(x_0)
\end{matrix} \right]$$
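
If it helps to see these two objects concretely, here is a rough numerical sketch (my own illustration, not part of the original answer; the sample function and finite-difference step sizes are arbitrary choices) that builds the gradient vector and Hessian matrix by central differences and shows their shapes:

```python
import numpy as np

def numerical_grad(f, x0, eps=1e-5):
    """Approximate the gradient of f at x0 with central differences."""
    d = len(x0)
    g = np.zeros(d)
    for i in range(d):
        e = np.zeros(d); e[i] = eps
        g[i] = (f(x0 + e) - f(x0 - e)) / (2 * eps)
    return g                                  # shape (d,)

def numerical_hess(f, x0, eps=1e-4):
    """Approximate the Hessian of f at x0 by differencing the gradient."""
    d = len(x0)
    H = np.zeros((d, d))
    for j in range(d):
        e = np.zeros(d); e[j] = eps
        H[:, j] = (numerical_grad(f, x0 + e) - numerical_grad(f, x0 - e)) / (2 * eps)
    return H                                  # shape (d, d), approximately symmetric

# Example: f(x) = x1^2 + 3*x1*x2 + 2*x2^2 (chosen arbitrarily for illustration)
f = lambda x: x[0]**2 + 3*x[0]*x[1] + 2*x[1]**2
x0 = np.array([1.0, 2.0])
print(numerical_grad(f, x0))   # ~ [ 8., 11.]  i.e. [2*x1 + 3*x2, 3*x1 + 4*x2]
print(numerical_hess(f, x0))   # ~ [[2., 3.], [3., 4.]]
```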



Minimizing the local approximation

Given the local approximation $f_T$ at the current point $x_0$, we want to find a new point $x$ that minimizes $f_T$. This can be done in closed form by taking the gradient of $f_T$ with respect to $x$, setting it to zero, then solving for $x$.

The gradient of $f_T$ can be found by differentiating the expression above (keeping in mind that $\nabla f(x_0)$ is just a fixed vector and $H(x_0)$ is just a fixed matrix):

$$\nabla f_T(x) = \nabla f(x_0) + H(x_0) (x - x_0)$$



Setting it to zero, we have:

$$H(x_0) (x - x_0) = -\nabla f(x_0)$$



We need to isolate $x$, so we multiply both sides by the inverse of the Hessian, then move $x_0$ to the right-hand side:



$$x = x_0 - \big( H(x_0) \big)^{-1} \nabla f(x_0)$$
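
As a quick sanity check (my own example, not part of the original answer): when $f$ is exactly quadratic, the local model $f_T$ equals $f$, so a single Newton step from any starting point lands on the exact minimizer. The quadratic below is an arbitrary choice:

```python
import numpy as np

# f(x) = x1^2 + x1*x2 + 2*x2^2 - 3*x1 + x2   (an arbitrary convex quadratic)
grad = lambda x: np.array([2*x[0] + x[1] - 3, x[0] + 4*x[1] + 1])
hess = lambda x: np.array([[2.0, 1.0], [1.0, 4.0]])   # constant Hessian for a quadratic

x0 = np.zeros(2)
step = np.linalg.solve(hess(x0), grad(x0))   # solves H(x0) step = grad(x0)
x1 = x0 - step
print(x1)           # [ 1.857..., -0.714...], i.e. [13/7, -5/7]
print(grad(x1))     # ~ [0., 0.]  -> the quadratic model was exact, so one step suffices
```

Note that the code solves the linear system $H(x_0)(x - x_0) = -\nabla f(x_0)$ rather than forming $\big(H(x_0)\big)^{-1}$ explicitly; numerically this is the usual way to apply the update.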





