How do ML algorithms treat unseen data? A conceptual discussion

I want to predict the occurrence of certain events, but these events occur only about 5% of the time in my data, so in 95% of the data there is nothing to learn.



To give the ML algorithm something to learn from, I have learned that I should single out that 5% and drop the rest of the data. Say I want to predict whether a picture shows a dog or a cat. In my data, 2.5% of the pictures are of dogs and 2.5% are of cats; the rest are just random pictures. So I single out the cat and dog pictures and label them so that the algorithm can learn from them. Am I broadly right so far?



So, if I train my algorithm on only cat and dog pictures and get satisfactory accuracy, what happens in live usage, where 95% of the pictures are of neither cats nor dogs? For example, if I show my model a picture of a house, what does it predict? Will it always predict either cat or dog, or will it somehow tell me that it has no idea what the picture is?



Any thoughts?







asked Aug 10 at 14:08 by cJc (last edited Aug 10 at 20:17)




2 Answers

















Accepted answer (2 votes), answered Aug 10 at 14:31 by Solver (edited Aug 10 at 15:26):
          Define two flag variables: flag_is_cat and flag_is_dog, which take on values of 1 if the picture shows a cat or dog, respectively, and 0 otherwise. Define another flag that takes on the value of 1 if the picture contains either a cat or a dog. In a word, label the data.



If you train the model using all of the pictures, even those with neither a cat nor a dog, then the model outputs a probability that the picture contains a cat, a probability that it contains a dog, and another that it contains either. This is the approach mentioned by @marco_gorelli. Dividing the probability that the picture has a cat by the probability that it has either a cat or a dog gives the probability that the picture has a cat conditional on the picture having at least one of them.



          Alternatively, if you train a model using only those pictures that contain either a cat or a dog, then the model would output the probability that a cat is contained in the picture and that a dog is contained in the picture conditional on at least one of them being in the picture.
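
For concreteness, here is a minimal sketch of the conditioning step described above, assuming a classifier with three probability outputs (cat, dog, either), one per flag; the function names and the 0.5 decision threshold are illustrative and not part of the original answer. The key identity is $P(\text{cat} \mid \text{cat or dog}) = P(\text{cat}) / P(\text{cat or dog})$.

    # Minimal sketch (assumed setup): the model emits three probabilities,
    # p_cat, p_dog and p_either, corresponding to the flags defined above.
    def conditional_cat_probability(p_cat: float, p_either: float, eps: float = 1e-9) -> float:
        """P(cat | cat or dog) = P(cat) / P(cat or dog)."""
        return p_cat / max(p_either, eps)

    def predict(p_cat: float, p_dog: float, p_either: float, threshold: float = 0.5) -> str:
        # If the "either" flag is unlikely, report "neither" instead of
        # forcing a cat/dog choice -- this is how a house picture gets handled.
        if p_either < threshold:
            return "neither"
        return "cat" if p_cat >= p_dog else "dog"

    print(conditional_cat_probability(0.30, 0.40))              # 0.75 = P(cat | cat or dog)
    print(predict(p_cat=0.02, p_dog=0.03, p_either=0.04))       # house-like picture -> "neither"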




























          • Thanks for your comments. From my (limited) experience with NNs, a model trained on all pictures would converge to always predicting 0 (neither cat nor dog), because it would then be right 95% of the time, even when balancing the class weights. I will experiment with what you say in your last paragraph. Cheers.
            – cJc
            Aug 10 at 15:43










          • @cJc Have you considered using all three types of pictures, but showing the cat and dog pictures more often than the no-animal pictures, so that the model ends up seeing each of the three kinds of pictures equally often?
            – Tanner Swett
            Aug 10 at 16:30










          • @TannerSwett What you are saying makes logical sense. I was hoping there was a universally accepted best practice for my question, but I guess this is more art than science. I will look into it, cheers.
            – cJc
            Aug 10 at 20:19
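
The resampling idea raised in these comments can be sketched in a few lines; the 0/1/2 label encoding and the use of plain NumPy (rather than a framework-specific sampler) are assumptions made purely for illustration.

    import numpy as np

    rng = np.random.default_rng(0)
    labels = np.array([0] * 950 + [1] * 25 + [2] * 25)  # 95% neither, 2.5% cat, 2.5% dog

    # Weight each example inversely to its class frequency...
    class_counts = np.bincount(labels)
    weights = 1.0 / class_counts[labels]
    weights /= weights.sum()

    # ...then sample one epoch's worth of indices with replacement, so the
    # model sees the three kinds of pictures roughly equally often.
    epoch_indices = rng.choice(len(labels), size=len(labels), replace=True, p=weights)
    print(np.bincount(labels[epoch_indices]))  # roughly [333, 333, 333]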


















Answer (2 votes), answered Aug 10 at 14:32 by marco_gorelli:













          From what I can remember from Andrew Ng's Deep Learning course on Coursera, he recommends making vectors of the kind
          $(y, b_x, b_y, b_w, b_h, c_0, c_1),$
          where:



          • $y$ indicates whether the picture contains one of the objects you're looking for (so it'd be $1$ in $5\%$ of your examples and $0$ in the others);

          • $b_x$, $b_y$ give the $x$ and $y$ coordinates of the midpoint of the object's bounding box within the picture;

          • $b_w$, $b_h$ give the width and height of that bounding box;

          • $c_0$ is $1$ if the picture contains a dog and $0$ otherwise;

          • $c_1$ is $1$ if the picture contains a cat and $0$ otherwise.

          So, for example, a picture of a dog would get tagged as
          $(1, .3, .7, .2, .2, 1, 0)$, a picture of a cat of the same size in the same position would get tagged as $(1, .3, .7, .2, .2, 0, 1)$, and a picture with neither would have $0$ as its first coordinate and it wouldn't matter what the other coordinates were, as the initial $0$ has already signalled that the picture doesn't contain either of the objects we're seeking.
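
A small sketch of constructing such label vectors, assuming the unused entries are simply zero-filled when $y = 0$; the helper function below is purely illustrative.

    import numpy as np

    def make_label(contains_object, box=(0.0, 0.0, 0.0, 0.0), is_dog=False, is_cat=False):
        # Layout: (y, b_x, b_y, b_w, b_h, c_0, c_1)
        y = 1.0 if contains_object else 0.0
        b_x, b_y, b_w, b_h = box
        return np.array([y, b_x, b_y, b_w, b_h, float(is_dog), float(is_cat)])

    dog_label   = make_label(True, box=(0.3, 0.7, 0.2, 0.2), is_dog=True)  # (1, .3, .7, .2, .2, 1, 0)
    cat_label   = make_label(True, box=(0.3, 0.7, 0.2, 0.2), is_cat=True)  # (1, .3, .7, .2, .2, 0, 1)
    house_label = make_label(False)  # y = 0; the loss can ignore the remaining entries
    print(dog_label, cat_label, house_label, sep="\n")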

























