Variations of the squared error function

Depending on the source, I find people using different variations of the "squared error function". How can that be?




Variation 1

$$E_{total} = \sum \frac{1}{2} \left( \text{target} - \text{output} \right)^2$$


Variation 2

$$\frac{1}{m} \sum_{i = 1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2$$



Notice that it is multiplied by $\frac{1}{m}$, as opposed to the $\frac{1}{2}$ in variation 1.

The stuff inside the $(\ldots)^2$ is simply notation, I get that, but multiplying by $\frac{1}{m}$ versus $\frac{1}{2}$ will clearly give a different result. Which version is the "correct" one, or is there no such thing as a correct or "official" squared error function?
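
To make the difference concrete, here is a small Python sketch (a toy illustration; the numbers are made up) that computes both variations on the same predictions:

    import numpy as np

    # Hypothetical model outputs h_theta(x^(i)) and targets y^(i) for m = 4 examples
    h = np.array([0.9, 0.2, 0.7, 0.4])
    y = np.array([1.0, 0.0, 1.0, 0.0])
    m = len(y)

    sq_errors = (h - y) ** 2               # squaring makes the sign irrelevant

    variation_1 = 0.5 * sq_errors.sum()    # half the total squared error
    variation_2 = sq_errors.sum() / m      # mean squared error

    print(variation_1, variation_2)        # 0.15 vs 0.075 (up to float rounding)

For a fixed $m$ the two values differ only by a constant factor, which is why both appear in practice.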










neural-networks






asked 1 hour ago by Sebastian Nielsen
  • As you go deeper into Machine Learning, you'll see that constants everywhere matter less and less.
    – DuttaA
    1 hour ago




As you go in depth of Machine Learning you'll see constants everywhere matter less and less.
– DuttaA
1 hour ago










1 Answer










The first variation is named "$E_{total}$". It contains a sum which is not very well-specified (it has no index, no limits). Rewriting it using the notation of the second variation would lead to:



$$E_{total} = \sum_{i = 1}^{m} \frac{1}{2} \left( y^{(i)} - h_\theta(x^{(i)}) \right)^2,$$



where:




  • $x^{(i)}$ denotes the $i$th training example


  • $h_\theta(x^{(i)})$ denotes the model's output for that instance/example


  • $y^{(i)}$ denotes the ground truth / target / label for that instance


  • $m$ denotes the number of training examples

Because the term inside the large brackets is squared, the sign doesn't matter, so we can rewrite it (switch around the subtracted terms) to:



$$E_{total} = \sum_{i = 1}^{m} \frac{1}{2} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2.$$




Now it already looks quite a lot like your second variation.



The second variation still has a $\frac{1}{m}$ term outside the sum. That is because your second variation computes the mean squared error over all the training examples, rather than the total error computed by the first variation.



Either error can be used for training. I'd personally lean towards using the mean error rather than the total error, mainly because the scale of the mean error is independent of the batch size $m$, whereas the scale of the total error is proportional to the batch size used for training. Either option is valid, but they'll likely require different hyperparameter values (especially for the learning rate), due to the difference in scale.
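
As a rough numerical sketch of that scale difference (my own illustration, not part of the original sources):

    import numpy as np

    rng = np.random.default_rng(0)

    def total_error(h, y):
        # variation 1: half the summed squared error; scale grows with m
        return 0.5 * np.sum((h - y) ** 2)

    def mean_error(h, y):
        # variation 2: mean squared error; scale is independent of m
        return np.mean((h - y) ** 2)

    for m in (10, 100, 1000):
        h = rng.normal(size=m)   # fake model outputs
        y = rng.normal(size=m)   # fake targets
        print(m, total_error(h, y), mean_error(h, y))

The total error grows roughly 100-fold from $m = 10$ to $m = 1000$, while the mean error stays on the same scale; the gradients inherit this difference, which is why the learning rate typically needs retuning when switching between the two.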




With that $\frac{1}{m}$ term explained, the only remaining difference is the $\frac{1}{2}$ term inside the sum (it can also be pulled out of the sum), which is present in the first variation but not in the second. The reason for including that term is given in the page you linked to for the first variation:




The $\frac{1}{2}$ is included so that the exponent is cancelled when we differentiate later on. The result is eventually multiplied by a learning rate anyway, so it doesn't matter that we introduce a constant here.
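
To spell that cancellation out, in the answer's notation: differentiating a single $\frac{1}{2}$-scaled term with respect to the model's output gives

$$\frac{\partial}{\partial h_\theta(x^{(i)})} \, \frac{1}{2} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2 = h_\theta(x^{(i)}) - y^{(i)},$$

so the factor of $2$ from the power rule is absorbed exactly, and any remaining constant factor gets folded into the learning rate.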







answered 1 hour ago by Dennis Soemers
  • I think the main caveat you should mention is that the "constant" outside does not matter, as the learning rate will take care of it anyway. Since $\frac{1}{2}$ is also introduced for convenience, technically it should have been $\frac{1}{2m}$; otherwise there is not much to answer.
    – DuttaA
    1 hour ago

  • @DuttaA That's already mentioned in the final quote from the original source on the first variation, right?
    – Dennis Soemers
    1 hour ago

  • Sorry then... I didn't check the sources.
    – DuttaA
    1 hour ago

  • @DuttaA I mean the quote right at the very bottom of my answer (which I quoted from the page, but now it's also inside my answer in the form of a quote box).
    – Dennis Soemers
    1 hour ago

  • Well explained, Dennis!
    – Sebastian Nielsen
    1 min ago









