Unbalanced training data for different classes

What precautions do I need to take when developing a CNN for image classification if there is much more training data for one label than for the others? For example:



label1 : 1000 images
label2 : 100 images
label3 : 100 images
label4 : 100 images


The numbers will become larger later, but the proportions are likely to stay the same.



Thanks for your insight.

Tags: neural-network convnet image-classification image-recognition

Asked by rnso

2 Answers

Answer by Danny (2 votes)













You can duplicate the images and add them, and you can use data augmentation techniques for the labels that have fewer images. The code below is for Keras.



from keras.preprocessing.image import ImageDataGenerator

# Randomly transform images of the under-represented labels
datagen = ImageDataGenerator(
    rotation_range=40,        # random rotations of up to 40 degrees
    width_shift_range=0.2,    # horizontal shifts of up to 20% of the width
    height_shift_range=0.2,   # vertical shifts of up to 20% of the height
    shear_range=0.2,          # shearing transformations
    zoom_range=0.2,           # random zooms
    horizontal_flip=True,     # randomly flip images horizontally
    fill_mode='nearest')      # fill pixels exposed by the transforms
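
As a rough sketch of how this generator might be applied to one of the smaller labels (the folder names and the number of copies below are placeholders, not part of the original answer), you can stream each image through datagen.flow and write the augmented variants to disk:

import os
from keras.preprocessing.image import img_to_array, load_img

src_dir = 'data/label2'        # hypothetical folder holding a minority class
out_dir = 'data/label2_aug'    # augmented copies are written here
os.makedirs(out_dir, exist_ok=True)

for fname in os.listdir(src_dir):
    img = load_img(os.path.join(src_dir, fname))   # load as a PIL image
    x = img_to_array(img)                          # array of shape (height, width, 3)
    x = x.reshape((1,) + x.shape)                  # turn it into a batch of one
    flow = datagen.flow(x, batch_size=1,
                        save_to_dir=out_dir,
                        save_prefix='aug',
                        save_format='jpeg')
    for i, _ in enumerate(flow):                   # the iterator loops forever
        if i >= 9:                                 # keep roughly 10 variants per image
            break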


I hope this helps. You should not be worried about one label having more data; rather, think about how to increase the data for the other labels.

• Is vertical_flip also useful? Can it be done with this function? Also, what do you think of the augmentation technique given in this post: datascience.stackexchange.com/questions/38795/… ? – rnso

• I think it will work if you change the (8 x 8) according to your image size, but it's always better to define a new function to suit your needs. – Danny

• You should first train your model on the unbalanced training set and check your results. These may serve as a baseline for further optimization. You can also try different settings, like making sure that your batches contain at least one example of each class. What I mean is: first check whether the unbalanced classes are in fact a problem before trying to solve it. – id-2205
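
As a hedged illustration of that baseline check (model, x_val and y_val are hypothetical names for an already-trained classifier and a held-out validation set, not something given in this thread), per-class metrics make it easy to see whether the majority class is dominating the predictions:

import numpy as np
from sklearn.metrics import classification_report, confusion_matrix

# model, x_val, y_val are assumed to exist: a trained Keras classifier and an
# integer-labelled validation set drawn from the same four classes.
y_pred = np.argmax(model.predict(x_val), axis=1)

print(confusion_matrix(y_val, y_pred))      # shows which classes get confused
print(classification_report(
    y_val, y_pred,
    target_names=['label1', 'label2', 'label3', 'label4']))  # per-class precision/recall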


















Answer by thanatoz (2 votes)













In this dataset, the large majority of the training data (1000 of 1300 images, roughly 77%) belongs to a single class, and that will strongly affect your results. Such an imbalance produces what are known as skewed classes. Skewed classes bias your predictions, and the learned model can end up simply predicting the majority class.



          In order to overcome this problem, you can do the following:




• Sampling: over-sample the minority classes or under-sample the majority class so that every class is equally represented.


• Discarding excess data: if the other classes have enough data, simply discard some data from the dominating class.


• Weighting: many training algorithms accept per-class weights that put more emphasis on the under-represented classes, which can help with skewed classes (see the sketch below).

This answer is based on this article; refer to it for a detailed explanation.
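
For the weighting option above, here is a minimal sketch (not part of the original answer) of how per-class weights could be derived from the counts in the question and passed to Keras; model, x_train, y_train, x_val and y_val are assumed to already exist:

# Hypothetical example: weight each class inversely to its frequency so that
# errors on the 100-image classes count about 10x more than errors on label1.
counts = {0: 1000, 1: 100, 2: 100, 3: 100}        # images per class index
total = sum(counts.values())
n_classes = len(counts)

class_weight = {c: total / (n_classes * n) for c, n in counts.items()}
# -> {0: 0.325, 1: 3.25, 2: 3.25, 3: 3.25}

model.fit(x_train, y_train,
          epochs=20,
          batch_size=32,
          class_weight=class_weight,   # Keras scales each sample's loss by its class weight
          validation_data=(x_val, y_val))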






• This answer seems promising, but have a look at this post in order to improve your answer. Look especially at the "Provide context for links" section. You can summarize the main points of the articles you shared in case the links are removed. – BrunoGL

• Thanks Bruno, I'll keep that in mind. – thanatoz









