How do ML algorithms treat unseen data? A conceptual discussion

I want to predict the occurrence of certain events, but these events occur only about 5% of the time in my data, so in 95% of the data there is nothing to learn.



To give the ML algorithm something to learn from, I have learned that I should single out that 5% and drop the rest of the data. Say I want to predict whether a picture shows a dog or a cat. In my data, 2.5% of the pictures are of dogs and 2.5% are of cats; the rest are just random pictures. So I single out the cat and dog pictures and label them so that the algorithm can learn from them. Am I broadly right so far?



So, if I train my algorithm on only cat and dog pictures and get satisfactory accuracy, what happens in live usage, where 95% of the pictures are of neither cats nor dogs? For example, if I show my model a picture of a house, what does it predict? Will it always predict either cat or dog, or will it somehow tell me that it has no idea what the picture is?



Any thoughts?







asked Aug 10 at 14:08 by cJc (last edited Aug 10 at 20:17)




2 Answers

















Accepted answer (2 votes), answered Aug 10 at 14:31 by Solver (edited Aug 10 at 15:26):
          Define two flag variables: flag_is_cat and flag_is_dog, which take on values of 1 if the picture shows a cat or dog, respectively, and 0 otherwise. Define another flag that takes on the value of 1 if the picture contains either a cat or a dog. In a word, label the data.



If you train the model using all of the pictures, even those with neither a cat nor a dog, then the model outputs a probability that the picture contains a cat, a probability that it contains a dog, and another that it contains either. This is the approach mentioned by @marco_gorelli. Dividing the probability that the picture has a cat by the probability that it has either a cat or a dog gives the probability that the picture has a cat conditional on the picture having at least one of them.



          Alternatively, if you train a model using only those pictures that contain either a cat or a dog, then the model would output the probability that a cat is contained in the picture and that a dog is contained in the picture conditional on at least one of them being in the picture.
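
For concreteness, here is a minimal sketch of the conditioning step described above, assuming a classifier with three probability outputs (cat, dog, either), one per flag; the function names and the 0.5 decision threshold are illustrative and not part of the original answer. The key identity is $P(\text{cat} \mid \text{cat or dog}) = P(\text{cat}) / P(\text{cat or dog})$.

    # Minimal sketch (assumed setup): the model emits three probabilities,
    # p_cat, p_dog and p_either, corresponding to the flags defined above.
    def conditional_cat_probability(p_cat: float, p_either: float, eps: float = 1e-9) -> float:
        """P(cat | cat or dog) = P(cat) / P(cat or dog)."""
        return p_cat / max(p_either, eps)

    def predict(p_cat: float, p_dog: float, p_either: float, threshold: float = 0.5) -> str:
        # If the "either" flag is unlikely, report "neither" instead of
        # forcing a cat/dog choice -- this is how a house picture gets handled.
        if p_either < threshold:
            return "neither"
        return "cat" if p_cat >= p_dog else "dog"

    print(conditional_cat_probability(0.30, 0.40))              # 0.75 = P(cat | cat or dog)
    print(predict(p_cat=0.02, p_dog=0.03, p_either=0.04))       # house-like picture -> "neither"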




























          • Thanks for your comments. From my (limited) experience with NNs, a model trained on all pictures would converge to always predicting 0 (neither cat nor dog), because it would then be right 95% of the time, even when balancing the class weights. I will experiment with what you say in your last paragraph. Cheers.
            – cJc
            Aug 10 at 15:43










          • @cJc Have you considered using all three types of pictures, but showing the cat and dog pictures more often than the no-animal pictures, so that the model ends up seeing each of the three kinds of pictures equally often?
            – Tanner Swett
            Aug 10 at 16:30










          • @TannerSwett What you are saying makes logical sense. I was hoping there was a universally accepted best practice for my question, but I guess this is more art than science. I will look into it, cheers.
            – cJc
            Aug 10 at 20:19
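
The resampling idea raised in these comments can be sketched in a few lines; the 0/1/2 label encoding and the use of plain NumPy (rather than a framework-specific sampler) are assumptions made purely for illustration.

    import numpy as np

    rng = np.random.default_rng(0)
    labels = np.array([0] * 950 + [1] * 25 + [2] * 25)  # 95% neither, 2.5% cat, 2.5% dog

    # Weight each example inversely to its class frequency...
    class_counts = np.bincount(labels)
    weights = 1.0 / class_counts[labels]
    weights /= weights.sum()

    # ...then sample one epoch's worth of indices with replacement, so the
    # model sees the three kinds of pictures roughly equally often.
    epoch_indices = rng.choice(len(labels), size=len(labels), replace=True, p=weights)
    print(np.bincount(labels[epoch_indices]))  # roughly [333, 333, 333]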


















Answer (2 votes), answered Aug 10 at 14:32 by marco_gorelli:













          From what I can remember from Andrew Ng's Deep Learning course on Coursera, he recommends making vectors of the kind
          $(y, b_x, b_y, b_w, b_h, c_0, c_1),$
          where:



          • $y$ indicates whether the picture contains one of the objects you're looking for (so it'd be $1$ in $5\%$ of your examples and $0$ in the others);

          • $b_x$, $b_y$ give the $x$ and $y$ coordinates of the midpoint of the object's bounding box within the picture;

          • $b_w$, $b_h$ give the width and height of that bounding box;

          • $c_0$ is $1$ if the picture contains a dog and $0$ otherwise;

          • $c_1$ is $1$ if the picture contains a cat and $0$ otherwise.

          So, for example, a picture of a dog would get tagged as
          $(1, .3, .7, .2, .2, 1, 0)$, a picture of a cat of the same size in the same position would get tagged as $(1, .3, .7, .2, .2, 0, 1)$, and a picture with neither would have $0$ as its first coordinate and it wouldn't matter what the other coordinates were, as the initial $0$ has already signalled that the picture doesn't contain either of the objects we're seeking.
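
A small sketch of constructing such label vectors, assuming the unused entries are simply zero-filled when $y = 0$; the helper function below is purely illustrative.

    import numpy as np

    def make_label(contains_object, box=(0.0, 0.0, 0.0, 0.0), is_dog=False, is_cat=False):
        # Layout: (y, b_x, b_y, b_w, b_h, c_0, c_1)
        y = 1.0 if contains_object else 0.0
        b_x, b_y, b_w, b_h = box
        return np.array([y, b_x, b_y, b_w, b_h, float(is_dog), float(is_cat)])

    dog_label   = make_label(True, box=(0.3, 0.7, 0.2, 0.2), is_dog=True)  # (1, .3, .7, .2, .2, 1, 0)
    cat_label   = make_label(True, box=(0.3, 0.7, 0.2, 0.2), is_cat=True)  # (1, .3, .7, .2, .2, 0, 1)
    house_label = make_label(False)  # y = 0; the loss can ignore the remaining entries
    print(dog_label, cat_label, house_label, sep="\n")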

























