Regression when both the predictor and outcome variables are proportions
Clash Royale CLAN TAG#URR8PPP
.everyoneloves__top-leaderboard:empty,.everyoneloves__mid-leaderboard:empty margin-bottom:0;
up vote
2
down vote
favorite
I am using
$X$ The estimated pre-game win probability of a sporting team playing on its Home field (estimated according to a certain model)
to predict
$Y$ Actual proportion of points scored by the Home team in the game (i.e. number of points scored by Home team divided by all points scored in the game).
Graphed, the data look like this.
Data viewable here.
I did a simple linear regression and it produced the parameters $b_0 = 0.3554$ and $b_1 = 0.2930$. Thus even at the maximum possible value of $x$ it doesn't produce a prediction that the team will score more than 100% of the points.
However, some reading of other questions here indicates that linear regression is generally considered inappropriate for situations in which the outcome variable is a proportion.
The question is highly similar to this one, in which a the poster was seeking to predict a team's winning percentage. There it was suggested that the poster should convert the proportion of wins to the number of wins. However, in my question it would not be the same thing for me to use the number of points scored by the team.
How inappropriate it is for me to use linear regression here?
What analysis should I use, keeping in mind that unlike in the linked question I can't just use the raw number of points scored by the team (since I really am interested in the proportion of points they will score). gung's answer here seems to indicate beta regression if the predictor is a continuous proportion, and logistic regression if it's a count proportion. However, I'm not sure which of those two my predictor is.
Does it make any difference that my predictor is also proportion?
regression
add a comment |Â
up vote
2
down vote
favorite
I am using
$X$ The estimated pre-game win probability of a sporting team playing on its Home field (estimated according to a certain model)
to predict
$Y$ Actual proportion of points scored by the Home team in the game (i.e. number of points scored by Home team divided by all points scored in the game).
Graphed, the data look like this.
Data viewable here.
I did a simple linear regression and it produced the parameters $b_0 = 0.3554$ and $b_1 = 0.2930$. Thus even at the maximum possible value of $x$ it doesn't produce a prediction that the team will score more than 100% of the points.
However, some reading of other questions here indicates that linear regression is generally considered inappropriate for situations in which the outcome variable is a proportion.
The question is highly similar to this one, in which a the poster was seeking to predict a team's winning percentage. There it was suggested that the poster should convert the proportion of wins to the number of wins. However, in my question it would not be the same thing for me to use the number of points scored by the team.
How inappropriate it is for me to use linear regression here?
What analysis should I use, keeping in mind that unlike in the linked question I can't just use the raw number of points scored by the team (since I really am interested in the proportion of points they will score). gung's answer here seems to indicate beta regression if the predictor is a continuous proportion, and logistic regression if it's a count proportion. However, I'm not sure which of those two my predictor is.
Does it make any difference that my predictor is also proportion?
regression
2
1). It's not great because your DV is bounded, 2). Beta regression would be the first choice, since it allows you to model your data as is (without needing to change it), 3). Yes, you could also use a logit transform $$textlogBig(cfracp1-pBig)$$ to transform your DV and do regular linear regression.
– user2974951
4 hours ago
add a comment |Â
up vote
2
down vote
favorite
up vote
2
down vote
favorite
I am using
$X$ The estimated pre-game win probability of a sporting team playing on its Home field (estimated according to a certain model)
to predict
$Y$ Actual proportion of points scored by the Home team in the game (i.e. number of points scored by Home team divided by all points scored in the game).
Graphed, the data look like this.
Data viewable here.
I did a simple linear regression and it produced the parameters $b_0 = 0.3554$ and $b_1 = 0.2930$. Thus even at the maximum possible value of $x$ it doesn't produce a prediction that the team will score more than 100% of the points.
However, some reading of other questions here indicates that linear regression is generally considered inappropriate for situations in which the outcome variable is a proportion.
The question is highly similar to this one, in which a the poster was seeking to predict a team's winning percentage. There it was suggested that the poster should convert the proportion of wins to the number of wins. However, in my question it would not be the same thing for me to use the number of points scored by the team.
How inappropriate it is for me to use linear regression here?
What analysis should I use, keeping in mind that unlike in the linked question I can't just use the raw number of points scored by the team (since I really am interested in the proportion of points they will score). gung's answer here seems to indicate beta regression if the predictor is a continuous proportion, and logistic regression if it's a count proportion. However, I'm not sure which of those two my predictor is.
Does it make any difference that my predictor is also proportion?
regression
I am using
$X$ The estimated pre-game win probability of a sporting team playing on its Home field (estimated according to a certain model)
to predict
$Y$ Actual proportion of points scored by the Home team in the game (i.e. number of points scored by Home team divided by all points scored in the game).
Graphed, the data look like this.
Data viewable here.
I did a simple linear regression and it produced the parameters $b_0 = 0.3554$ and $b_1 = 0.2930$. Thus even at the maximum possible value of $x$ it doesn't produce a prediction that the team will score more than 100% of the points.
However, some reading of other questions here indicates that linear regression is generally considered inappropriate for situations in which the outcome variable is a proportion.
The question is highly similar to this one, in which a the poster was seeking to predict a team's winning percentage. There it was suggested that the poster should convert the proportion of wins to the number of wins. However, in my question it would not be the same thing for me to use the number of points scored by the team.
How inappropriate it is for me to use linear regression here?
What analysis should I use, keeping in mind that unlike in the linked question I can't just use the raw number of points scored by the team (since I really am interested in the proportion of points they will score). gung's answer here seems to indicate beta regression if the predictor is a continuous proportion, and logistic regression if it's a count proportion. However, I'm not sure which of those two my predictor is.
Does it make any difference that my predictor is also proportion?
regression
regression
edited 3 hours ago
asked 4 hours ago
user1205901
3,289144595
3,289144595
2
1). It's not great because your DV is bounded, 2). Beta regression would be the first choice, since it allows you to model your data as is (without needing to change it), 3). Yes, you could also use a logit transform $$textlogBig(cfracp1-pBig)$$ to transform your DV and do regular linear regression.
– user2974951
4 hours ago
add a comment |Â
2
1). It's not great because your DV is bounded, 2). Beta regression would be the first choice, since it allows you to model your data as is (without needing to change it), 3). Yes, you could also use a logit transform $$textlogBig(cfracp1-pBig)$$ to transform your DV and do regular linear regression.
– user2974951
4 hours ago
2
2
1). It's not great because your DV is bounded, 2). Beta regression would be the first choice, since it allows you to model your data as is (without needing to change it), 3). Yes, you could also use a logit transform $$textlogBig(cfracp1-pBig)$$ to transform your DV and do regular linear regression.
– user2974951
4 hours ago
1). It's not great because your DV is bounded, 2). Beta regression would be the first choice, since it allows you to model your data as is (without needing to change it), 3). Yes, you could also use a logit transform $$textlogBig(cfracp1-pBig)$$ to transform your DV and do regular linear regression.
– user2974951
4 hours ago
add a comment |Â
1 Answer
1
active
oldest
votes
up vote
2
down vote
A glm with a binomial distribution and a logit link should work fine. If there are no probabilities of $0$ or $1$, beta regression is another possibility. In this case, they yield almost indistinguishable results (not shown).
The relationship between the logit of the actual and the logit of the estimated probabilities looks quite linear (see picture below), which would lead me to use the logit of the estimated probabilities as predictor. The R
code using to generate the plots is at the bottom of this answer.
Let's visualize the model on the original scale (the points are partial residuals):
R
code:
library(betareg)
library(visreg)
# Convert probabilities to log-odds
est_p_logit <- log(est_prob/(1 - est_prob))
act_p_logit <- log(act_prob/(1 - act_prob))
# Check linearity
scatter.smooth(act_p_logit~est_p_logit, las = 1)
# GLM with binomial distribution and logit-link
mod <- glm(act_prob~est_p_logit, family = binomial)
summary(mod)
# Visualize GLM model
res_glm <- visreg(mod, scale = "response", ylim = c(0, 1), partial = TRUE, rug = 2)
# Beta regression
mod_beta <- betareg(act_prob~est_p_logit)
# Compare GLM results with beta regression
res_beta <- visreg(mod_beta, scale = "response", ylim = c(0, 1))
lines(res_beta$fit$visregFit~res_beta$fit$est_p_logit, type = "l", col = "red")
add a comment |Â
1 Answer
1
active
oldest
votes
1 Answer
1
active
oldest
votes
active
oldest
votes
active
oldest
votes
up vote
2
down vote
A glm with a binomial distribution and a logit link should work fine. If there are no probabilities of $0$ or $1$, beta regression is another possibility. In this case, they yield almost indistinguishable results (not shown).
The relationship between the logit of the actual and the logit of the estimated probabilities looks quite linear (see picture below), which would lead me to use the logit of the estimated probabilities as predictor. The R
code using to generate the plots is at the bottom of this answer.
Let's visualize the model on the original scale (the points are partial residuals):
R
code:
library(betareg)
library(visreg)
# Convert probabilities to log-odds
est_p_logit <- log(est_prob/(1 - est_prob))
act_p_logit <- log(act_prob/(1 - act_prob))
# Check linearity
scatter.smooth(act_p_logit~est_p_logit, las = 1)
# GLM with binomial distribution and logit-link
mod <- glm(act_prob~est_p_logit, family = binomial)
summary(mod)
# Visualize GLM model
res_glm <- visreg(mod, scale = "response", ylim = c(0, 1), partial = TRUE, rug = 2)
# Beta regression
mod_beta <- betareg(act_prob~est_p_logit)
# Compare GLM results with beta regression
res_beta <- visreg(mod_beta, scale = "response", ylim = c(0, 1))
lines(res_beta$fit$visregFit~res_beta$fit$est_p_logit, type = "l", col = "red")
add a comment |Â
up vote
2
down vote
A glm with a binomial distribution and a logit link should work fine. If there are no probabilities of $0$ or $1$, beta regression is another possibility. In this case, they yield almost indistinguishable results (not shown).
The relationship between the logit of the actual and the logit of the estimated probabilities looks quite linear (see picture below), which would lead me to use the logit of the estimated probabilities as predictor. The R
code using to generate the plots is at the bottom of this answer.
Let's visualize the model on the original scale (the points are partial residuals):
R
code:
library(betareg)
library(visreg)
# Convert probabilities to log-odds
est_p_logit <- log(est_prob/(1 - est_prob))
act_p_logit <- log(act_prob/(1 - act_prob))
# Check linearity
scatter.smooth(act_p_logit~est_p_logit, las = 1)
# GLM with binomial distribution and logit-link
mod <- glm(act_prob~est_p_logit, family = binomial)
summary(mod)
# Visualize GLM model
res_glm <- visreg(mod, scale = "response", ylim = c(0, 1), partial = TRUE, rug = 2)
# Beta regression
mod_beta <- betareg(act_prob~est_p_logit)
# Compare GLM results with beta regression
res_beta <- visreg(mod_beta, scale = "response", ylim = c(0, 1))
lines(res_beta$fit$visregFit~res_beta$fit$est_p_logit, type = "l", col = "red")
add a comment |Â
up vote
2
down vote
up vote
2
down vote
A glm with a binomial distribution and a logit link should work fine. If there are no probabilities of $0$ or $1$, beta regression is another possibility. In this case, they yield almost indistinguishable results (not shown).
The relationship between the logit of the actual and the logit of the estimated probabilities looks quite linear (see picture below), which would lead me to use the logit of the estimated probabilities as predictor. The R
code using to generate the plots is at the bottom of this answer.
Let's visualize the model on the original scale (the points are partial residuals):
R
code:
library(betareg)
library(visreg)
# Convert probabilities to log-odds
est_p_logit <- log(est_prob/(1 - est_prob))
act_p_logit <- log(act_prob/(1 - act_prob))
# Check linearity
scatter.smooth(act_p_logit~est_p_logit, las = 1)
# GLM with binomial distribution and logit-link
mod <- glm(act_prob~est_p_logit, family = binomial)
summary(mod)
# Visualize GLM model
res_glm <- visreg(mod, scale = "response", ylim = c(0, 1), partial = TRUE, rug = 2)
# Beta regression
mod_beta <- betareg(act_prob~est_p_logit)
# Compare GLM results with beta regression
res_beta <- visreg(mod_beta, scale = "response", ylim = c(0, 1))
lines(res_beta$fit$visregFit~res_beta$fit$est_p_logit, type = "l", col = "red")
A glm with a binomial distribution and a logit link should work fine. If there are no probabilities of $0$ or $1$, beta regression is another possibility. In this case, they yield almost indistinguishable results (not shown).
The relationship between the logit of the actual and the logit of the estimated probabilities looks quite linear (see picture below), which would lead me to use the logit of the estimated probabilities as predictor. The R
code using to generate the plots is at the bottom of this answer.
Let's visualize the model on the original scale (the points are partial residuals):
R
code:
library(betareg)
library(visreg)
# Convert probabilities to log-odds
est_p_logit <- log(est_prob/(1 - est_prob))
act_p_logit <- log(act_prob/(1 - act_prob))
# Check linearity
scatter.smooth(act_p_logit~est_p_logit, las = 1)
# GLM with binomial distribution and logit-link
mod <- glm(act_prob~est_p_logit, family = binomial)
summary(mod)
# Visualize GLM model
res_glm <- visreg(mod, scale = "response", ylim = c(0, 1), partial = TRUE, rug = 2)
# Beta regression
mod_beta <- betareg(act_prob~est_p_logit)
# Compare GLM results with beta regression
res_beta <- visreg(mod_beta, scale = "response", ylim = c(0, 1))
lines(res_beta$fit$visregFit~res_beta$fit$est_p_logit, type = "l", col = "red")
edited 1 hour ago
answered 1 hour ago
COOLSerdash
15.6k75090
15.6k75090
add a comment |Â
add a comment |Â
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
StackExchange.ready(
function ()
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstats.stackexchange.com%2fquestions%2f373835%2fregression-when-both-the-predictor-and-outcome-variables-are-proportions%23new-answer', 'question_page');
);
Post as a guest
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
2
1). It's not great because your DV is bounded, 2). Beta regression would be the first choice, since it allows you to model your data as is (without needing to change it), 3). Yes, you could also use a logit transform $$textlogBig(cfracp1-pBig)$$ to transform your DV and do regular linear regression.
– user2974951
4 hours ago