How to model categorical variables / enums?
I am new to the field and I am trying to understand how it is possible to use categorical variables / enums.
Let's say we have a data set and two of its features are home_team and away_team; the possible values of these two features are all the NBA teams.
How can we "normalize" these features so that we can use them to build a deep network model (e.g. with TensorFlow)?
Any references on modelling techniques would also be much appreciated.
deep-learning categorical-data
asked Sep 1 at 11:32
Avraam Mavridis
2 Answers
Authors use many different approaches.
One approach is to have a different input neuron for each possible category, and then use a "1-hot" encoding. So if you have 10 categories, then you can encode this as 10 binary features.
Another is to use some sort of binary encoding. If you have 10 categories, it is sufficient to use 4 neurons to represent all possible categories by using binary numbers.
A third approach is to convert your categories to cardinal values, and then normalize them. This may be more effective if your categories really are cardinal (i.e. orderable). If there isn't a natural ordering to them though, this might lead to strange results or make the problem difficult to learn (since it ends up embedding non-linear relationships in the learning problem that don't need to exist).
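The first two encodings can be sketched as follows (a minimal illustration; the team names are just a hypothetical vocabulary, not from the original post):

```python
import numpy as np

# A hypothetical vocabulary of categories (here, a few NBA teams).
teams = ["Celtics", "Lakers", "Warriors", "Bulls"]
index = {team: i for i, team in enumerate(teams)}

def one_hot(team):
    """Encode a category as a binary vector with one input per category."""
    vec = np.zeros(len(teams), dtype=np.float32)
    vec[index[team]] = 1.0
    return vec

def binary_code(team, width=2):
    """Encode the category's integer index in `width` binary digits,
    so ceil(log2(n)) inputs suffice for n categories."""
    i = index[team]
    return np.array([(i >> b) & 1 for b in reversed(range(width))],
                    dtype=np.float32)

print(one_hot("Lakers"))     # [0. 1. 0. 0.]
print(binary_code("Bulls"))  # [1. 1.]
```

With 10 categories, `one_hot` would need 10 inputs while `binary_code` with `width=4` needs only 4, which is the trade-off the answer describes.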
answered Sep 1 at 12:14
John Doucette
I don't think binary encoding is effective. – DuttaA, Sep 1 at 12:33
@DuttaA I agree. It has many of the same problems as a cardinal encoding (i.e. introducing non-linear patterns to learn). I have seen it used from time to time however. – John Doucette, Sep 1 at 12:34
A one-hot encoding, as described in John's answer, is probably the most straightforward solution (maybe even the most common?). It is not without its problems, though. For example, if you have a large number of such categorical variables, and each has a large number of possible values, the number of binary inputs you need for one-hot encodings may grow too large.
Let's say we have a data set and two of its features are home_team and away_team; the possible values of these two features are all the NBA teams.
In this specific example, a different possible solution might be not to use the "identity" of a team as a feature itself, but to find a number of (ideally numeric) features describing that team.
For example, instead of trying to encode home_team in some way in your inputs, you could (if you manage to find the data you need to do this) use the following features (not really familiar with the NBA, so not sure if all of these make sense):
- Win percentage of home_team over some recent window of time
- Historical win percentage of home_team against away_team
- Average points scored per match by this team
- In football there's a statistic for how many minutes per game a team is "in control" of the ball; maybe there's something similar in the NBA?
- etc.
And then you can get a similar list of features for away_team.
This kind of solution would work for your example, and maybe also for various others. It won't work for all categorical features, though; in some cases you'd have to fall back on solutions like those in John's answer.
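The idea above can be sketched as follows; the team names and all statistics here are invented purely for illustration, and in practice you would look them up from real match data:

```python
# Replace team identity with numeric statistics describing the team.
# All values below are made up for the sake of the example.
recent_win_pct = {"Celtics": 0.65, "Lakers": 0.55}
avg_points = {"Celtics": 112.4, "Lakers": 108.9}
# Historical win rate of the home team in this specific pairing.
head_to_head = {("Celtics", "Lakers"): 0.60}

def match_features(home_team, away_team):
    """Build a numeric feature vector for one match-up, with no
    categorical inputs at all."""
    return [
        recent_win_pct[home_team],
        recent_win_pct[away_team],
        avg_points[home_team],
        avg_points[away_team],
        head_to_head.get((home_team, away_team), 0.5),  # 0.5 when no history
    ]

features = match_features("Celtics", "Lakers")
```

The resulting vector is already numeric, so it can be fed to a network directly (after the usual normalization), regardless of how many teams exist.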
answered Sep 1 at 14:16
Dennis Soemers