Variations of the squared error function

Depending on the source, I find people using different variations of the "squared error function". How can that be?




Variation 1




$$E_{total} = \sum \frac{1}{2}(\text{target} - \text{output})^2$$




Variation 2




$$\frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)^2$$



Notice that it is divided by $m$, as opposed to the $\frac{1}{2}$ in variation 1.



The stuff inside the $(\cdot)^2$ is simply notation, I get that, but multiplying by $\frac{1}{m}$ versus $\frac{1}{2}$ will clearly give a different result. Which version is the "correct" one, or is there no such thing as a correct or "official" squared error function?
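To make it concrete, here is a quick numeric check with made-up numbers (not from either source), scoring the same predictions with both variations:

import numpy as np

# Made-up predictions and targets, just to compare the two formulas.
h = np.array([0.9, 0.2, 0.4])  # model outputs, h_theta(x^(i))
y = np.array([1.0, 0.0, 1.0])  # targets, y^(i)
m = len(y)

variation_1 = np.sum(0.5 * (y - h) ** 2)  # sum of half squared errors
variation_2 = np.sum((h - y) ** 2) / m    # mean squared error

print(variation_1)  # ~0.205
print(variation_2)  # ~0.1367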







neural-networks






asked 1 hour ago by Sebastian Nielsen

  • As you go deeper into machine learning, you'll see that constants everywhere matter less and less.
    – DuttaA
    1 hour ago

1 Answer

The first variation is named "$E_{total}$". It contains a sum which is not very well-specified (it has no index and no limits). Rewriting it using the notation of the second variation would lead to:



$$E_{total} = \sum_{i = 1}^{m} \frac{1}{2} \left( y^{(i)} - h_\theta(x^{(i)}) \right)^2,$$



where:




  • $x^{(i)}$ denotes the $i$th training example

  • $h_\theta(x^{(i)})$ denotes the model's output for that instance/example

  • $y^{(i)}$ denotes the ground truth / target / label for that instance

  • $m$ denotes the number of training examples

Because the term inside the large brackets is squared, the sign doesn't matter, so we can rewrite it (switch around the subtracted terms) to:



$$E_{total} = \sum_{i = 1}^{m} \frac{1}{2} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2.$$




Now it already looks quite a lot like your second variation.
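(A quick sanity check on that sign flip, with arbitrary numbers, since squaring discards the sign:)

y, h = 1.0, 0.4
assert (y - h) ** 2 == (h - y) ** 2  # squaring makes the order of subtraction irrelevant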



The second variation still has a $\frac{1}{m}$ term outside the sum. That is because your second variation computes the mean squared error over all the training examples, rather than the total error computed by the first variation.



Either error can be used for training. I'd personally lean towards using the mean error rather than the total error, mainly because the scale of the mean error is independent of the batch size $m$, whereas the scale of the total error is proportional to the batch size used for training. Either option is valid, but they'll likely require different hyperparameter values (especially for the learning rate), due to the difference in scale.
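To make that scale difference concrete, here is a minimal sketch (hypothetical data, assuming NumPy) showing that the total error grows roughly linearly with the batch size $m$, while the mean error stays on the same scale:

import numpy as np

rng = np.random.default_rng(0)

def total_error(h, y):
    return np.sum(0.5 * (h - y) ** 2)  # first variation: total, with the 1/2

def mean_error(h, y):
    return np.mean((h - y) ** 2)       # second variation: mean squared error

for m in (10, 100, 1000):
    h = rng.normal(size=m)                 # fake model outputs
    y = h + rng.normal(scale=0.1, size=m)  # targets close to those outputs
    print(m, total_error(h, y), mean_error(h, y))
    # total_error grows with m; mean_error hovers around 0.01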




With that $\frac{1}{m}$ term explained, the only remaining difference is the $\frac{1}{2}$ term inside the sum (which can also be pulled out of the sum), which is present in the first variation but not in the second. The reason for including that term is given in the page you linked to for the first variation:




The $\frac{1}{2}$ is included so that exponent is cancelled when we differentiate later on. The result is eventually multiplied by a learning rate anyway so it doesn’t matter that we introduce a constant here.
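To see the cancellation the quote refers to, here is a minimal finite-difference check (made-up numbers): the derivative of $\frac{1}{2}(h - y)^2$ with respect to $h$ is simply $(h - y)$, with no stray factor of $2$:

h, y, eps = 0.7, 1.0, 1e-6

def loss(h):
    return 0.5 * (h - y) ** 2  # half squared error for a single example

numeric = (loss(h + eps) - loss(h - eps)) / (2 * eps)  # central finite difference
analytic = h - y                                       # the 1/2 cancels the exponent

print(numeric, analytic)  # both approximately -0.3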







answered 1 hour ago by Dennis Soemers
  • I think the main caveat you should mention is that the 'constant' outside does not matter, as the learning rate will take care of it anyway... since 1/2 is also introduced for convenience, technically it should have been 1/2m... otherwise there is not much to answer
    – DuttaA
    1 hour ago










  • @DuttaA That's already mentioned in the final quote from the original source on the first variation, right?
    – Dennis Soemers
    1 hour ago










  • Sorry then...I didn't check the sources.
    – DuttaA
    1 hour ago






  • @DuttaA I mean the quote right at the very bottom of my answer (which I quoted from the page, but now it's also inside my answer in the form of a quote box).
    – Dennis Soemers
    1 hour ago










  • Well explained Dennis!
    – Sebastian Nielsen
    1 min ago









