How to model categorical variables / enums?
I am new to the field and I am trying to understand how it is possible to use categorical variables / enums.
Let's say we have a data set and two of its features are home_team and away_team; the possible values of these two features are all the NBA teams.
How can we "normalize" these features so that we can use them to build a deep network model (e.g. with TensorFlow)?
Any references on modelling techniques would also be much appreciated.
deep-learning categorical-data
asked Sep 1 at 11:32
Avraam Mavridis
2 Answers
Authors use many different approaches.
One approach is to have a different input neuron for each possible category, and then use a "1-hot" encoding. So if you have 10 categories, then you can encode this as 10 binary features.
Another is to use some sort of binary encoding. If you have 10 categories, it is sufficient to use 4 neurons to represent all possible categories by using binary numbers.
A third approach is to convert your categories to cardinal values, and then normalize them. This may be more effective if your categories really are cardinal (i.e. orderable). If there isn't a natural ordering to them though, this might lead to strange results or make the problem difficult to learn (since it ends up embedding non-linear relationships in the learning problem that don't need to exist).
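The first two encodings can be sketched as follows (a minimal illustration; the team names are just a hypothetical vocabulary, not from the original post):

```python
import numpy as np

# A hypothetical vocabulary of categories (here, a few NBA teams).
teams = ["Celtics", "Lakers", "Warriors", "Bulls"]
index = {team: i for i, team in enumerate(teams)}

def one_hot(team):
    """Encode a category as a binary vector with one input per category."""
    vec = np.zeros(len(teams), dtype=np.float32)
    vec[index[team]] = 1.0
    return vec

def binary_code(team, width=2):
    """Encode the category's integer index in `width` binary digits,
    so ceil(log2(n)) inputs suffice for n categories."""
    i = index[team]
    return np.array([(i >> b) & 1 for b in reversed(range(width))],
                    dtype=np.float32)

print(one_hot("Lakers"))     # [0. 1. 0. 0.]
print(binary_code("Bulls"))  # [1. 1.]
```

With 10 categories, `one_hot` would need 10 inputs while `binary_code` with `width=4` needs only 4, which is the trade-off the answer describes.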
answered Sep 1 at 12:14
John Doucette
I don't think binary encoding is effective. – DuttaA, Sep 1 at 12:33
@DuttaA I agree. It has many of the same problems as a cardinal encoding (i.e. introducing non-linear patterns to learn). I have seen it used from time to time however. – John Doucette, Sep 1 at 12:34
A one-hot encoding, as described in John's answer, is probably the most straightforward solution (maybe even the most common?). It is not without its problems, though. For example, if you have a large number of such categorical variables, and each has a large number of possible values, the number of binary inputs you need for one-hot encodings may grow too large.
Let's say we have a data set and two of its features are home_team and away_team; the possible values of these two features are all the NBA teams.
In this specific example, a different possible solution might be not to use the "identity" of a team as a feature itself, but to find a number of (ideally numeric) features describing that team.
For example, instead of trying to encode home_team in some way in your inputs, you could (if you manage to find the data you need to do this) use the following features (not really familiar with the NBA, so not sure if all of these make sense):
- Win percentage of home_team over some recent window of time
- Historical win percentage of home_team against away_team
- Average points scored per match by this team
- In football there's a statistic for how many minutes per game a team is "in control" of the ball; maybe there's something similar in the NBA?
- etc.
And then you can get a similar list of features for away_team.
This kind of solution would work for your example, and maybe also for various others. It won't work for all categorical features, though; in some cases you'd have to fall back on solutions like those in John's answer.
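The idea above can be sketched as follows; the team names and all statistics here are invented purely for illustration, and in practice you would look them up from real match data:

```python
# Replace team identity with numeric statistics describing the team.
# All values below are made up for the sake of the example.
recent_win_pct = {"Celtics": 0.65, "Lakers": 0.55}
avg_points = {"Celtics": 112.4, "Lakers": 108.9}
# Historical win rate of the home team in this specific pairing.
head_to_head = {("Celtics", "Lakers"): 0.60}

def match_features(home_team, away_team):
    """Build a numeric feature vector for one match-up, with no
    categorical inputs at all."""
    return [
        recent_win_pct[home_team],
        recent_win_pct[away_team],
        avg_points[home_team],
        avg_points[away_team],
        head_to_head.get((home_team, away_team), 0.5),  # 0.5 when no history
    ]

features = match_features("Celtics", "Lakers")
```

The resulting vector is already numeric, so it can be fed to a network directly (after the usual normalization), regardless of how many teams exist.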
answered Sep 1 at 14:16
Dennis Soemers