Can overfitting be a good thing in some cases?
I know the goal of machine learning is to create generalizable models, so overfitting is usually undesirable.
However, I wonder if it could be desirable in some cases. For example, say I want to predict whether a student will drop out of a course, and I want to do this before the course ends by using a proxy label: their assignment submission status. In this scenario, I would not mind if the model trained on the proxy label overfits and does not generalize to unseen data, since I only care about a specific set of users.
I wonder if this is a valid way of thinking for this specific scenario. Any ideas?
Tags: machine-learning, overfitting, train
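For concreteness, a minimal sketch of the setup the question describes, deriving a proxy label from submission status (the column names and the "at risk" rule are hypothetical, not from the question):

```python
import pandas as pd

# Hypothetical activity log: one row per student per assignment.
log = pd.DataFrame({
    "student_id": [1, 1, 2, 2, 3, 3],
    "assignment": ["a1", "a2", "a1", "a2", "a1", "a2"],
    "submitted":  [True, True, True, False, False, False],
})

# Proxy label: a student whose most recent assignment was not submitted
# is treated as "at risk of dropping out" before the course ends.
proxy = (
    log.groupby("student_id")["submitted"]
       .agg(lambda s: not s.iloc[-1])   # last assignment missed -> at risk
       .rename("at_risk")
)
print(proxy)
```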
Do you mean you only want an efficient way of memorizing the data (in that case overfitting is indeed good)? Or do you actually want to predict new data from the same users (then only a certain type of overfitting is good - this may e.g. change your cross-validation strategy)?
– Björn
30 mins ago
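A sketch of the distinction this comment draws, using scikit-learn. The data here is synthetic and the names (`X`, `y`, `user_id`) are illustrative; the point is only how the cross-validation strategy changes:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GroupKFold, KFold, cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))          # illustrative features
y = rng.integers(0, 2, size=200)       # illustrative labels
user_id = np.repeat(np.arange(40), 5)  # 40 users, 5 rows each

clf = RandomForestClassifier(random_state=0)

# Plain K-fold: rows from the same user can land in both the training
# and validation folds -- appropriate if future predictions target the
# SAME users the model was fit on.
same_users = cross_val_score(clf, X, y, cv=KFold(n_splits=5))

# Grouped K-fold: all rows of a user stay in one fold -- appropriate
# if the model must generalize to users it has never seen.
new_users = cross_val_score(clf, X, y, groups=user_id,
                            cv=GroupKFold(n_splits=5))

print(same_users.mean(), new_users.mean())
```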
2 Answers
Does your trained model perform well on both training and test data? If so, then it's not overfitting. If your model performs well on training data but worse on test data, then it is overfitting, which is undesirable for any application. If your training data and test data are the same, then overfitting wouldn't be an issue. But if you want the model to work on unseen test data, then overfitting is always a problem.
– Sanga
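A minimal sketch of the diagnostic this answer describes: compare the model's score on the data it was fit on against a held-out set (the dataset and model choice are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# An unconstrained decision tree can memorize the training set.
tree = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
print("train:", tree.score(X_tr, y_tr))  # typically ~1.0
print("test: ", tree.score(X_te, y_te))  # noticeably lower -> overfitting
```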
Actually, the labels "generalization" and "overfitting" might be a bit misleading here.
What you want in your example is a good prediction of the dropout status.
So, technically: in training you need an unbiased sample of dropout and non-dropout students. It is extremely important to prepare not only the model, but even more so the data you use to evaluate it (train, validation, etc.).
There are textbook examples of overfitting where you plot a performance indicator (e.g. the mislabelling rate) on your training data and compare it with the same indicator on validation data. The training performance keeps improving, but at some point the validation performance starts to worsen. At that point it is clear that you should stop the learning process before it degrades performance further.
What is meant by "generalization" is actually very specific: you want your trained model to perform as well as possible once it encounters previously unseen data. You use validation data because there you know "the truth", unlike with real data.
So, as above, you want your model to predict the student's status:
- If your model is overfitted, it will report better indicator values for the students in your training set, but will perform worse on data it was not trained on.
- If your model generalizes well, it will perform roughly equally well on training data and non-training data.
If you talk about "specific data sets", then either these are the basis of your training and validation, or you are simply doing it wrong. And this has nothing to do with generalization or overfitting in machine learning.
– cherub
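A sketch of the textbook picture this answer describes, tracking the mislabelling rate on training and validation data as model capacity grows (boosting stages here; the model, dataset, and noise level are illustrative):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# flip_y adds label noise so the overfitting regime is visible.
X, y = make_classification(n_samples=400, flip_y=0.2, random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, random_state=0)

gb = GradientBoostingClassifier(n_estimators=300, random_state=0)
gb.fit(X_tr, y_tr)

# Mislabelling rate after each boosting stage, on both data sets.
train_err = [np.mean(p != y_tr) for p in gb.staged_predict(X_tr)]
valid_err = [np.mean(p != y_va) for p in gb.staged_predict(X_va)]

# Training error keeps falling; validation error bottoms out and then
# rises -- the place to stop learning is the validation minimum.
print("best number of stages:", int(np.argmin(valid_err)) + 1)
```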