L1 and L2 penalty vs L1 and L2 norms
I understand the uses of the L1 and L2 norms, but I am unsure how the L1 and L2 penalties are used when building models.
From what I understand:
L1: Laplace prior, L2: Gaussian prior
are two of the penalty terms. I have tried to read about these, but there is surprisingly little discussion of them; searching always leads to Lasso and Ridge, which I understand.
Can someone help me bridge the gap: what do these penalties refer to, and if they are related to the L1 and L2 norms in the end, how?
Thanks for your help.
regularization
Could you be even more explicit regarding what you do not understand? When you refer to these, do you mean penalty terms or priors, or what?
– Richard Hardy
4 hours ago
Hi @RichardHardy, I meant that I want to know more about the L1 and L2 penalties and how they differ from the L1 and L2 norms.
– power.puffed
2 hours ago
1 Answer
In mathematics, a norm is a function that measures the "length" or "size" of a vector. Among the popular norms are the $L_1$, $L_2$ and $L_p$ norms, defined as
$$\begin{align}
\|\boldsymbol{x}\|_1 &= \sum_i | x_i | \\
\|\boldsymbol{x}\|_2 &= \sqrt{\sum_i x_i^2} \\
\|\boldsymbol{x}\|_p &= \left( \sum_i | x_i |^p \right)^{1/p}
\end{align}$$
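To make the definitions concrete, here is a small self-contained sketch of my own (not part of the original answer) that evaluates the three norms for a toy vector with NumPy and checks the results against `np.linalg.norm`:

```python
import numpy as np

x = np.array([3.0, -4.0, 1.0])

# L1 norm: sum of absolute values
l1 = np.sum(np.abs(x))

# L2 norm: square root of the sum of squares
l2 = np.sqrt(np.sum(x ** 2))

# General Lp norm for p >= 1
def lp_norm(v, p):
    return np.sum(np.abs(v) ** p) ** (1.0 / p)

print(l1, np.linalg.norm(x, ord=1))             # both 8.0
print(l2, np.linalg.norm(x, ord=2))             # both sqrt(26) ~= 5.10
print(lp_norm(x, 3), np.linalg.norm(x, ord=3))  # both the L3 norm
```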
In machine learning, we often want to predict some $y$ using a function $f$ of $\mathbf{x}$, parametrized by a vector of parameters $\boldsymbol\beta$. To achieve this, we minimize a loss function $\mathcal{L}$. We sometimes also want to penalize the parameters by forcing them to take smaller values; the rationale for this is described, for example, here, here, or here. One way of achieving this is to add a regularization term, e.g. the $L_2$ norm of the vector of weights, and minimize the whole thing:
$$
\underset{\boldsymbol\beta}{\operatorname{arg\,min}} \; \mathcal{L}\big(y, \, f(\mathbf{x}; \boldsymbol\beta)\big) + \lambda \, \|\boldsymbol\beta\|_2
$$
where $\lambda \ge 0$ is a hyperparameter. So basically, the norm is used here to measure the "size" of the model weights. By adding the size of the weights to the loss function, we force the minimization algorithm to seek a solution that, along with minimizing the loss, keeps the "size" of the weights small. The $\lambda$ hyperparameter controls how strong this effect is.
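As an illustration (a minimal sketch of my own, not taken from the original answer), here is what adding an $L_1$ or $L_2$ penalty to an ordinary squared-error loss looks like for linear regression, optimized by plain (sub)gradient descent. The `lam` parameter plays the role of $\lambda$; note that, as in ridge regression, the $L_2$ penalty below uses the squared norm:

```python
import numpy as np

def fit_penalized(X, y, penalty=None, lam=0.1, lr=0.01, n_iter=5000):
    """Minimize ||y - X @ beta||^2 / n + lam * penalty(beta) by (sub)gradient descent."""
    n, d = X.shape
    beta = np.zeros(d)
    for _ in range(n_iter):
        grad = 2.0 / n * X.T @ (X @ beta - y)   # gradient of the squared-error loss
        if penalty == "l2":
            grad += lam * 2.0 * beta            # gradient of lam * ||beta||_2^2
        elif penalty == "l1":
            grad += lam * np.sign(beta)         # subgradient of lam * ||beta||_1
        beta -= lr * grad
    return beta

# Toy data: only the first two features actually matter.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X @ np.array([3.0, -2.0, 0.0, 0.0, 0.0]) + rng.normal(scale=0.1, size=200)

print("no penalty:      ", fit_penalized(X, y, lam=0.0))
print("L2 (ridge-like): ", fit_penalized(X, y, penalty="l2", lam=0.5))
print("L1 (lasso-like): ", fit_penalized(X, y, penalty="l1", lam=0.5))
```

Increasing `lam` shrinks the fitted weights toward zero; with the $L_1$ penalty the irrelevant coefficients tend to be pushed to (near) zero, while the $L_2$ penalty shrinks all of them smoothly.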
Indeed, using the $L_2$ penalty may be seen as equivalent to using Gaussian priors for the parameters, while using the $L_1$ norm is equivalent to using Laplace priors (but in practice you need much stronger priors; see e.g. the paper Shrinkage priors for Bayesian penalized regression by van Erp et al.).
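To see where this correspondence comes from (a standard derivation, sketched here for completeness rather than quoted from the original answer): when the loss is proportional to the negative log-likelihood, maximizing the posterior is the same as minimizing the loss plus the negative log-prior,

$$
\hat{\boldsymbol\beta}_{\text{MAP}}
= \underset{\boldsymbol\beta}{\operatorname{arg\,max}} \; \log p(y \mid \mathbf{x}, \boldsymbol\beta) + \log p(\boldsymbol\beta)
= \underset{\boldsymbol\beta}{\operatorname{arg\,min}} \; \mathcal{L}\big(y, f(\mathbf{x}; \boldsymbol\beta)\big) - \log p(\boldsymbol\beta),
$$

and for independent Laplace$(0, b)$ priors $-\log p(\boldsymbol\beta) = \tfrac{1}{b}\|\boldsymbol\beta\|_1 + \text{const}$, while for independent $\mathcal{N}(0, \sigma^2)$ priors $-\log p(\boldsymbol\beta) = \tfrac{1}{2\sigma^2}\|\boldsymbol\beta\|_2^2 + \text{const}$, so the prior scale plays the role of $\lambda$.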
For more details, check e.g. the Why L1 norm for sparse models, Why does the Lasso provide Variable Selection?, or When should I use lasso vs ridge? threads.
I think the OP would also benefit from a discussion of the distinction between MSE and MAE minimization.
– generic_user
1 hour ago
@generic_user This has already been described in a number of places on this site; I gave several links that discuss it.
– Tim♦
1 hour ago