Can overfitting be a good thing in some cases?

I know the goal of machine learning is to build models that generalize, so overfitting is usually undesirable.

However, I wonder whether it could be desirable in some cases. For example, suppose I want to predict whether a student will drop out of a course, and I want to do this before the course ends by training on a proxy label: their assignment submission status. In this scenario I would not mind if the model trained on the proxy label overfits and generalizes poorly to unseen data, since I only care about this specific set of users.

Is this a valid way of thinking for this specific scenario? Any ideas?










  • You mean you only want to have a kind of efficient way of memorizing the data (in that case overfitting is indeed good)? Or do you actually want to predict new data from the same users (only a certain type of overfitting is good - this may e.g. change your cross-validation strategy)?
    – Björn
    30 mins ago
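The distinction the comment draws can be made concrete with a split strategy. Below is a minimal sketch with hypothetical record tuples `(student_id, week, submitted_assignment)`: if you only care about predicting the future behaviour of the *same* students, you validate on later records of those students (a split across time) rather than on held-out students (a split across users).

```python
# Hypothetical course records: (student_id, week, submitted_assignment).
records = [
    ("s1", 1, 1), ("s1", 2, 1), ("s1", 3, 0),
    ("s2", 1, 1), ("s2", 2, 0), ("s2", 3, 0),
    ("s3", 1, 0), ("s3", 2, 0), ("s3", 3, 0),
]

cutoff_week = 2  # train on weeks <= 2, validate on later weeks
train = [r for r in records if r[1] <= cutoff_week]
valid = [r for r in records if r[1] > cutoff_week]

# Every student appears in both sets: the model must generalize
# across time, not across users.
train_students = {r[0] for r in train}
valid_students = {r[0] for r in valid}
print(valid_students <= train_students)  # True
```

Under a user-level split (hold out whole students), the same model would instead be judged on its ability to generalize to students it has never seen, which is a stronger requirement than the question needs.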
















Tags: machine-learning, overfitting, train






asked 2 hours ago by renakre











2 Answers

















Does your trained model perform well on both the training and the test data? If so, it is not overfitting. If your model performs well on the training data but worse on the test data, it is overfitting, which is undesirable for any application. If you mean that your training data and test data are the same, then overfitting would indeed not be an issue there. But if you want the model to work on unseen test data, overfitting is always a problem.






answered 2 hours ago by Sanga (a new contributor); edited 54 mins ago by Ferdi
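The train/test gap this answer describes can be shown with a minimal, pure-Python sketch on hypothetical data: a 1-nearest-neighbour classifier memorizes the training set perfectly, yet does no better than chance on unseen points when the labels are pure noise.

```python
import random

random.seed(0)

def make_data(n):
    # one continuous feature, a label that is pure noise
    return [(random.random(), random.randint(0, 1)) for _ in range(n)]

train, test = make_data(200), make_data(200)

def predict(x):
    # 1-NN: return the label of the closest training point (memorization)
    return min(train, key=lambda p: abs(p[0] - x))[1]

def accuracy(data):
    return sum(predict(x) == y for x, y in data) / len(data)

train_acc, test_acc = accuracy(train), accuracy(test)
print(f"train accuracy: {train_acc:.2f}")  # memorized: 1.00
print(f"test accuracy:  {test_acc:.2f}")   # near chance, since labels are noise
```

Training accuracy is exactly 1.0 because every training point is its own nearest neighbour; the large gap to test accuracy is the overfitting the answer refers to.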
























Actually the labels "generalization" and "overfitting" might be a bit misleading here.

What you want in your example is a good prediction of the dropout status.

So, technically: in training you need an unbiased sample of dropout and non-dropout students. It is extremely important to prepare not only the model, but even more so the data you use to train, validate, and evaluate it.

There are textbook examples of overfitting where you plot a performance indicator (e.g. the misclassification rate) on your training data and compare it with the same indicator on validation data. The training performance keeps improving, but at some point the validation performance starts to get worse. At that point it is usually clear that you should stop the learning process before it degrades performance further.

What is meant by "generalization" is actually quite specific: you want your trained model to perform as well as possible when it encounters previously unseen data. You use the validation data because there you know "the truth", unlike with real data.

So, as above: you want your model to predict the student's status.
- If your model is overfitted, it will produce stronger indicators for the students in your training set, but it will perform worse on non-training data.
- If your model generalizes well, it will perform equally well on training data and non-training data.

If you are talking about "specific data sets", then either these form the basis of your training and validation, or you are simply doing it wrong. And that has nothing to do with generalization or overfitting in machine learning.
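The textbook picture described above can be sketched in pure Python on synthetic data (all names and the data-generating rule here are assumptions for illustration). Model complexity is controlled by `k` in a k-nearest-neighbour classifier: small `k` means high complexity. Training error is lowest at `k=1` (memorization), while validation error is typically lowest at some intermediate `k`; you would pick the complexity where validation error bottoms out.

```python
import random

random.seed(1)

def make_data(n, noise=0.2):
    # true rule: y = 1 iff x > 0.5, with 20% of the labels flipped (noise)
    data = []
    for _ in range(n):
        x = random.random()
        y = int(x > 0.5)
        if random.random() < noise:
            y = 1 - y
        data.append((x, y))
    return data

train, valid = make_data(100), make_data(100)

def knn_predict(x, k):
    # majority vote among the k nearest training points
    nearest = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    return int(sum(y for _, y in nearest) * 2 > k)

def error(data, k):
    return sum(knn_predict(x, k) != y for x, y in data) / len(data)

for k in (1, 5, 25, 75):
    print(f"k={k:2d}  train error={error(train, k):.2f}  "
          f"validation error={error(valid, k):.2f}")
```

At `k=1` the training error is exactly 0 (each point is its own nearest neighbour), yet the validation error includes the fitted label noise; choosing `k` by minimum validation error is the same move as stopping training before validation performance worsens.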






answered 16 mins ago by cherub



























                       
