Variations of the squared error function
Depending on the source, I find people using different variations of the "squared error function". How can that be?
Variation 1
Variation 2
Notice that variation 2 is multiplied by 1/m (i.e. divided by m), as opposed to variation 1, which is multiplied by 1/2.
The stuff inside the (...)^2 is simply notation, I get that, but multiplying by 1/m versus 1/2 will clearly give a different result. Which version is the "correct" one, or is there no such thing as a correct or "official" squared error function?
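For reference, the two forms appear to be (written here in the same notation as the answer below; the exact symbols in the linked sources may differ):

$$E_{total} = \sum \frac{1}{2}\left(y^{(i)} - h_\theta(x^{(i)})\right)^2 \qquad\text{and}\qquad \frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)^2.$$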
neural-networks
asked 1 hour ago by Sebastian Nielsen
As you go deeper into machine learning, you'll see that constants like these matter less and less.
– DuttaA, 1 hour ago
1 Answer
The first variation is named "$E_{total}$". It contains a sum which is not very well specified (it has no index and no limits). Rewriting it using the notation of the second variation would lead to:
$$E_{total} = \sum_{i=1}^{m} \frac{1}{2} \left( y^{(i)} - h_\theta(x^{(i)}) \right)^2,$$
where:
$x^{(i)}$ denotes the $i$-th training example
$h_\theta(x^{(i)})$ denotes the model's output for that instance/example
$y^{(i)}$ denotes the ground truth / target / label for that instance
$m$ denotes the number of training examples
Because the term inside the large brackets is squared, the sign doesn't matter, so we can rewrite it (switching the order of the subtracted terms) to:
$$E_{total} = \sum_{i=1}^{m} \frac{1}{2} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2.$$
Now it already looks quite a lot like your second variation.
The second variation still has a $\frac{1}{m}$ term outside the sum. That is because your second variation computes the mean squared error over all the training examples, rather than the total error computed by the first variation.
Either error can be used for training. I'd personally lean towards using the mean error rather than the total error, mainly because the scale of the mean error is independent of the batch size $m$, whereas the scale of the total error is proportional to the batch size used for training. Either option is valid, but they'll likely require different hyperparameter values (especially for the learning rate), due to the difference in scale.
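For instance, if every example contributes a squared error of roughly 0.5, the mean error stays around 0.5 whether $m$ is 10 or 1000, while the total error grows from about 5 to about 500; the gradients scale in the same way, which is why the learning rate would typically need to be adjusted accordingly.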
With that $\frac{1}{m}$ term explained, the only remaining difference is the $\frac{1}{2}$ term inside the sum (which can also be pulled out of the sum); it is present in the first variation but not in the second. The reason for including that term is given on the page you linked to for the first variation:
The $\frac{1}{2}$ is included so that the exponent is cancelled when we differentiate later on. The result is eventually multiplied by a learning rate anyway, so it doesn't matter that we introduce a constant here.
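To make that relationship concrete, here is a minimal NumPy sketch (the array names are placeholders, not taken from either linked page) that computes both variations and their gradients with respect to the predictions; the two losses, and likewise the two gradients, differ only by a constant factor:

```python
import numpy as np

# Placeholder data: m = 4 training examples with predictions h and targets y.
h = np.array([0.9, 0.2, 0.7, 0.4])   # model outputs h_theta(x^(i))
y = np.array([1.0, 0.0, 1.0, 0.0])   # ground-truth labels y^(i)
m = len(y)

# Variation 1: total error with the 1/2 factor inside the sum.
e_total = np.sum(0.5 * (h - y) ** 2)

# Variation 2: mean squared error with the 1/m factor and no 1/2.
mse = np.mean((h - y) ** 2)

# Gradients with respect to the predictions h.
grad_total = (h - y)            # d e_total / d h: the 1/2 cancels the exponent
grad_mse = (2.0 / m) * (h - y)  # d mse / d h

print(e_total, mse)           # the losses differ by a constant factor
print(grad_total / grad_mse)  # constant ratio m/2 for every example
```

The ratio between the two gradients is the same constant for every example, which is why either variation can be used for training once the learning rate is tuned accordingly.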
answered 1 hour ago by Dennis Soemers

I think the main caveat you should mention is that the 'constant' outside does not matter, as the learning rate will take care of it anyway... since the 1/2 is also introduced for convenience, technically it should have been 1/(2m)... otherwise there is not much to answer.
– DuttaA, 1 hour ago

@DuttaA That's already mentioned in the final quote from the original source on the first variation, right?
– Dennis Soemers, 1 hour ago

Sorry then... I didn't check the sources.
– DuttaA, 1 hour ago

@DuttaA I mean the quote right at the very bottom of my answer (which I quoted from the page, but now it's also inside my answer in the form of a quote box).
– Dennis Soemers, 1 hour ago

Well explained, Dennis!
– Sebastian Nielsen, 1 min ago