How do ML algorithms treat unseen data? A conceptual discussion
I want to predict the occurrence of certain events, but these events occur in only about 5% of my data, so in the remaining 95% there is nothing to learn.
To give the ML algorithm something to learn from, I have been taught to single out the 5% and drop the rest of the data. Say I want to predict whether a picture shows a dog or a cat. In my data, 2.5% of the pictures are of dogs and 2.5% are of cats; the rest are just random pictures. So I single out the cat and dog pictures and label them so that the algorithm can learn from them. Am I broadly right so far?
If I train my algorithm on only cat and dog pictures and get satisfactory accuracy, what happens in live usage, when 95% of the pictures are of neither cats nor dogs? Say I show my model a picture of a house: what does it predict? Will it always predict either cat or dog, or can it somehow tell me that it has no clue what the picture is?
Any thoughts?
machine-learning
asked Aug 10 at 14:08 by cJc (edited Aug 10 at 20:17)
2 Answers
Accepted answer (answered Aug 10 at 14:31 by Solver, edited Aug 10 at 15:26):
Define two flag variables, flag_is_cat and flag_is_dog, which take the value 1 if the picture shows a cat or a dog, respectively, and 0 otherwise. Define a third flag that takes the value 1 if the picture contains either a cat or a dog. In short: label the data.
If you train the model on all of the pictures, even those with neither a cat nor a dog, then the model outputs a probability that the picture contains a cat, a probability that it contains a dog, and a probability that it contains either. This is the approach mentioned by @marco_gorelli. Dividing the probability that the picture has a cat by the probability that it has either a cat or a dog gives the probability that the picture has a cat, conditional on its containing at least one of them.
Alternatively, if you train a model using only those pictures that contain either a cat or a dog, then the model outputs the probabilities that the picture contains a cat or a dog conditional on at least one of them being in the picture.
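As a rough sketch of both schemes (the feature matrix, the label encoding 0 = neither / 1 = cat / 2 = dog, and the use of scikit-learn's logistic regression are all illustrative assumptions, not part of the answer above):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 16))   # placeholder features (e.g. image embeddings)
y = rng.choice([0, 1, 2], size=1000, p=[0.95, 0.025, 0.025])

# Scheme 1: train on ALL pictures, including the 95% with no animal.
clf_all = LogisticRegression(max_iter=1000).fit(X, y)
p_neither, p_cat, p_dog = clf_all.predict_proba(X[:1])[0]

# The conditioning step described above: P(cat | cat or dog).
p_cat_given_animal = p_cat / (p_cat + p_dog)

# Scheme 2: train only on the pictures that contain a cat or a dog; the
# model's outputs are then conditional on an animal being present.
mask = y != 0
clf_animals = LogisticRegression(max_iter=1000).fit(X[mask], y[mask])
```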
Thanks for your comments. From my (limited) experience with NNs, a model trained on all pictures would converge to always predicting 0 (neither cat nor dog), because the model would then be right 95% of the time, even when balancing the class weights. I will experiment with what you suggest in your last paragraph. Cheers.
– cJc
Aug 10 at 15:43
@cJc Have you considered using all three types of pictures, but showing the cat and dog pictures more often than the no-animal pictures, so that the model ends up seeing each of the three kinds of pictures equally often?
– Tanner Swett
Aug 10 at 16:30
@TannerSwett What you are saying makes sense. I was hoping there was a universally accepted best practice for my question, but I guess this is more art than science. I will look into it, cheers.
– cJc
Aug 10 at 20:19
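A rough sketch of the resampling idea from Tanner Swett's comment, under the same illustrative 0 = neither / 1 = cat / 2 = dog label encoding as above:

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.choice([0, 1, 2], size=1000, p=[0.95, 0.025, 0.025])

# Oversample each class (with replacement) up to the size of the largest
# class, so the model sees all three kinds of pictures equally often.
classes, counts = np.unique(y, return_counts=True)
n_target = counts.max()
balanced_idx = np.concatenate([
    rng.choice(np.where(y == c)[0], size=n_target, replace=True)
    for c in classes
])
rng.shuffle(balanced_idx)
# Train on X[balanced_idx], y[balanced_idx].
```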
Answer (answered Aug 10 at 14:32 by marco_gorelli):
From what I can remember of Andrew Ng's Deep Learning course on Coursera, he recommends making target vectors of the kind
$(y, b_x, b_y, b_w, b_h, c_0, c_1),$
where:
- $y$ indicates whether the picture contains one of the objects you're looking for (so it would be $1$ in $5\%$ of your examples and $0$ in the others);
- $b_x$, $b_y$ are the $x$ and $y$ coordinates of the midpoint of the object's bounding box;
- $b_w$, $b_h$ are the width and height of the object's bounding box;
- $c_0$ is $1$ if the picture contains a dog and $0$ otherwise;
- $c_1$ is $1$ if the picture contains a cat and $0$ otherwise.
So, for example, a picture of a dog would get tagged as $(1, .3, .7, .2, .2, 1, 0)$, and a picture of a cat of the same size in the same position would get tagged as $(1, .3, .7, .2, .2, 0, 1)$. A picture with neither would have $0$ as its first coordinate, and the other coordinates wouldn't matter: the initial $0$ already signals that the picture contains neither of the objects we're seeking.
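A rough sketch of building such target vectors (the helper name and the convention that coordinates and sizes are fractions of the image are illustrative assumptions, not taken from the course):

```python
import numpy as np

def make_target(contains_object, bx=0.0, by=0.0, bw=0.0, bh=0.0,
                is_dog=0, is_cat=0):
    """Return the target vector (y, b_x, b_y, b_w, b_h, c_0, c_1)."""
    return np.array([contains_object, bx, by, bw, bh, is_dog, is_cat],
                    dtype=float)

dog_label  = make_target(1, bx=0.3, by=0.7, bw=0.2, bh=0.2, is_dog=1)
cat_label  = make_target(1, bx=0.3, by=0.7, bw=0.2, bh=0.2, is_cat=1)
none_label = make_target(0)   # remaining entries are ignored by the loss
```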