L1 and L2 penalty vs L1 and L2 norms


I understand the usage of L1 and L2 norms; however, I am unsure of the usage of the L1 and L2 penalty when building models.



From what I understand:



L1: Laplace prior; L2: Gaussian prior



are two of the penalty terms. I have tried to read about these, but there is surprisingly little discussion of them; it always leads to Lasso and Ridge, which I understand.



Can someone help me bridge the gap: what do these refer to, and if they are related to the L1 and L2 norms in the end, how?



Thanks for your help.










regularization

asked 4 hours ago by power.puffed (new contributor)
  • Could you be even more explicit regarding what you do not understand? When you refer to these, do you mean penalty terms or priors, or what?
    – Richard Hardy
    4 hours ago










  • Hi @RichardHardy, I meant I wanted to know more about the L1 and L2 penalty and how they are different from the L1 and L2 norms
    – power.puffed
    2 hours ago

















1 Answer






























In mathematics, a norm is a function that measures the "length" or "size" of a vector. Among the popular norms are the $L_1$, $L_2$, and $L_p$ norms, defined as



$$\begin{align}
\| \boldsymbol{x} \|_1 &= \sum_i | x_i | \\
\| \boldsymbol{x} \|_2 &= \sqrt{ \sum_i x_i^2 } \\
\| \boldsymbol{x} \|_p &= \left( \sum_i | x_i |^p \right)^{1/p}
\end{align}$$
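
For concreteness, here is a minimal NumPy sketch of these definitions (the example vector and variable names are arbitrary):

```python
import numpy as np

x = np.array([3.0, -4.0, 1.0])

l1 = np.abs(x).sum()             # |3| + |-4| + |1| = 8.0
l2 = np.sqrt((x ** 2).sum())     # sqrt(9 + 16 + 1) ≈ 5.10
p = 3
lp = (np.abs(x) ** p).sum() ** (1 / p)

# np.linalg.norm computes the same quantities
assert np.isclose(l1, np.linalg.norm(x, 1))
assert np.isclose(l2, np.linalg.norm(x, 2))
assert np.isclose(lp, np.linalg.norm(x, p))
```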



In machine learning, we often want to predict some $y$ using a function $f$ of $\mathbf{x}$ parametrized by a vector of parameters $\boldsymbol\beta$. To achieve this, we minimize a loss function $\mathcal{L}$. We sometimes want to penalize the parameters by forcing them to have smaller values; the rationale for this is described, for example, here, here, or here. One way of achieving this is to add a regularization term, e.g. the $L_2$ norm of the vector of weights, to the loss and minimize the whole thing:



$$
\underset{\boldsymbol\beta}{\operatorname{arg\,min}} \; \mathcal{L}\big(y, \, f(\mathbf{x}; \, \boldsymbol\beta)\big) + \lambda \, \| \boldsymbol\beta \|_2
$$



where $\lambda \ge 0$ is a hyperparameter. So basically, we use the norms here to measure the "size" of the model weights. By adding the size of the weights to the loss function, we force the minimization algorithm to seek a solution that, along with minimizing the loss, also keeps the "size" of the weights small. The $\lambda$ hyperparameter controls how large an effect this has on the optimization.
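
As an illustrative sketch of adding such a penalty to a squared-error loss (toy data; the names `penalized_loss` and `lam` are mine, with `lam` playing the role of $\lambda$; note that ridge regression conventionally uses the squared $L_2$ norm as the penalty):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))                      # toy design matrix
beta_true = np.array([2.0, 0.0, -1.0, 0.0, 0.5])
y = X @ beta_true + rng.normal(scale=0.1, size=100)

def penalized_loss(beta, lam, penalty="l2"):
    """Squared-error loss plus lam times a penalty on the weights."""
    loss = ((y - X @ beta) ** 2).sum()
    if penalty == "l2":
        return loss + lam * (beta ** 2).sum()      # ridge: squared L2 norm
    return loss + lam * np.abs(beta).sum()         # lasso: L1 norm

# With the squared-L2 penalty the minimizer has a closed form:
# beta = (X'X + lam * I)^{-1} X'y
lam = 10.0
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(5), X.T @ y)
# Larger lam shrinks the weights toward zero.
```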



Indeed, using the $L_2$ norm as a penalty may be seen as equivalent to using Gaussian priors for the parameters, while using the $L_1$ norm is equivalent to using Laplace priors (though in practice you need much stronger priors; see e.g. the paper Shrinkage priors for Bayesian penalized regression by van Erp et al.).
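
A quick sketch of the connection (standard MAP reasoning, not specific to any one model): the penalized minimization is the same as maximizing the posterior, since the negative log-posterior splits into the loss plus the negative log-prior,

$$
-\log p(\boldsymbol\beta \mid \mathbf{y}) = \underbrace{-\log p(\mathbf{y} \mid \boldsymbol\beta)}_{\text{loss}} \; \underbrace{- \log p(\boldsymbol\beta)}_{\text{penalty}} + \text{const},
$$

and a zero-mean Gaussian prior $p(\beta_j) \propto \exp(-\beta_j^2 / 2\tau^2)$ makes the penalty $\frac{1}{2\tau^2} \| \boldsymbol\beta \|_2^2$, while a Laplace prior $p(\beta_j) \propto \exp(-|\beta_j| / b)$ makes it $\frac{1}{b} \| \boldsymbol\beta \|_1$; the prior scale plays the role of $\lambda$.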



For more details, check e.g. the Why L1 norm for sparse models, Why does the Lasso provide Variable Selection?, and When should I use lasso vs ridge? threads.






answered 1 hour ago, edited 33 mins ago by Tim♦






















  • I think the OP would also benefit from a discussion of the distinction between MSE and MAE minimization
    – generic_user
    1 hour ago










  • @generic_user This is already described in a number of places on this site; I gave several links that discuss it.
    – Tim♦
    1 hour ago









