Regression when both the predictor and outcome variables are proportions

The name of the pictureThe name of the pictureThe name of the pictureClash Royale CLAN TAG#URR8PPP





.everyoneloves__top-leaderboard:empty,.everyoneloves__mid-leaderboard:empty margin-bottom:0;







up vote
2
down vote

favorite












I am using



$X$ The estimated pre-game win probability of a sporting team playing on its Home field (estimated according to a certain model)



to predict



$Y$ Actual proportion of points scored by the Home team in the game (i.e. number of points scored by Home team divided by all points scored in the game).



Graphed, the data look like this.



Home team estimated probability of winning vs Home team actual proportion of score



Data viewable here.



I did a simple linear regression and it produced the parameters $b_0 = 0.3554$ and $b_1 = 0.2930$. Thus even at the maximum possible value of $x$ it doesn't produce a prediction that the team will score more than 100% of the points.



However, some reading of other questions here indicates that linear regression is generally considered inappropriate for situations in which the outcome variable is a proportion.



The question is highly similar to this one, in which a the poster was seeking to predict a team's winning percentage. There it was suggested that the poster should convert the proportion of wins to the number of wins. However, in my question it would not be the same thing for me to use the number of points scored by the team.



  1. How inappropriate it is for me to use linear regression here?


  2. What analysis should I use, keeping in mind that unlike in the linked question I can't just use the raw number of points scored by the team (since I really am interested in the proportion of points they will score). gung's answer here seems to indicate beta regression if the predictor is a continuous proportion, and logistic regression if it's a count proportion. However, I'm not sure which of those two my predictor is.


  3. Does it make any difference that my predictor is also proportion?










share|cite|improve this question



















  • 2




    1). It's not great because your DV is bounded, 2). Beta regression would be the first choice, since it allows you to model your data as is (without needing to change it), 3). Yes, you could also use a logit transform $$textlogBig(cfracp1-pBig)$$ to transform your DV and do regular linear regression.
    – user2974951
    4 hours ago

















up vote
2
down vote

favorite












I am using



$X$ The estimated pre-game win probability of a sporting team playing on its Home field (estimated according to a certain model)



to predict



$Y$ Actual proportion of points scored by the Home team in the game (i.e. number of points scored by Home team divided by all points scored in the game).



Graphed, the data look like this.



Home team estimated probability of winning vs Home team actual proportion of score



Data viewable here.



I did a simple linear regression and it produced the parameters $b_0 = 0.3554$ and $b_1 = 0.2930$. Thus even at the maximum possible value of $x$ it doesn't produce a prediction that the team will score more than 100% of the points.



However, some reading of other questions here indicates that linear regression is generally considered inappropriate for situations in which the outcome variable is a proportion.



The question is highly similar to this one, in which a the poster was seeking to predict a team's winning percentage. There it was suggested that the poster should convert the proportion of wins to the number of wins. However, in my question it would not be the same thing for me to use the number of points scored by the team.



  1. How inappropriate it is for me to use linear regression here?


  2. What analysis should I use, keeping in mind that unlike in the linked question I can't just use the raw number of points scored by the team (since I really am interested in the proportion of points they will score). gung's answer here seems to indicate beta regression if the predictor is a continuous proportion, and logistic regression if it's a count proportion. However, I'm not sure which of those two my predictor is.


  3. Does it make any difference that my predictor is also proportion?










share|cite|improve this question



















  • 2




    1). It's not great because your DV is bounded, 2). Beta regression would be the first choice, since it allows you to model your data as is (without needing to change it), 3). Yes, you could also use a logit transform $$textlogBig(cfracp1-pBig)$$ to transform your DV and do regular linear regression.
    – user2974951
    4 hours ago













up vote
2
down vote

favorite









up vote
2
down vote

favorite











I am using



$X$ The estimated pre-game win probability of a sporting team playing on its Home field (estimated according to a certain model)



to predict



$Y$ Actual proportion of points scored by the Home team in the game (i.e. number of points scored by Home team divided by all points scored in the game).



Graphed, the data look like this.



Home team estimated probability of winning vs Home team actual proportion of score



Data viewable here.



I did a simple linear regression and it produced the parameters $b_0 = 0.3554$ and $b_1 = 0.2930$. Thus even at the maximum possible value of $x$ it doesn't produce a prediction that the team will score more than 100% of the points.



However, some reading of other questions here indicates that linear regression is generally considered inappropriate for situations in which the outcome variable is a proportion.



The question is highly similar to this one, in which a the poster was seeking to predict a team's winning percentage. There it was suggested that the poster should convert the proportion of wins to the number of wins. However, in my question it would not be the same thing for me to use the number of points scored by the team.



  1. How inappropriate it is for me to use linear regression here?


  2. What analysis should I use, keeping in mind that unlike in the linked question I can't just use the raw number of points scored by the team (since I really am interested in the proportion of points they will score). gung's answer here seems to indicate beta regression if the predictor is a continuous proportion, and logistic regression if it's a count proportion. However, I'm not sure which of those two my predictor is.


  3. Does it make any difference that my predictor is also proportion?










share|cite|improve this question















I am using



$X$ The estimated pre-game win probability of a sporting team playing on its Home field (estimated according to a certain model)



to predict



$Y$ Actual proportion of points scored by the Home team in the game (i.e. number of points scored by Home team divided by all points scored in the game).



Graphed, the data look like this.



Home team estimated probability of winning vs Home team actual proportion of score



Data viewable here.



I did a simple linear regression and it produced the parameters $b_0 = 0.3554$ and $b_1 = 0.2930$. Thus even at the maximum possible value of $x$ it doesn't produce a prediction that the team will score more than 100% of the points.



However, some reading of other questions here indicates that linear regression is generally considered inappropriate for situations in which the outcome variable is a proportion.



The question is highly similar to this one, in which a the poster was seeking to predict a team's winning percentage. There it was suggested that the poster should convert the proportion of wins to the number of wins. However, in my question it would not be the same thing for me to use the number of points scored by the team.



  1. How inappropriate it is for me to use linear regression here?


  2. What analysis should I use, keeping in mind that unlike in the linked question I can't just use the raw number of points scored by the team (since I really am interested in the proportion of points they will score). gung's answer here seems to indicate beta regression if the predictor is a continuous proportion, and logistic regression if it's a count proportion. However, I'm not sure which of those two my predictor is.


  3. Does it make any difference that my predictor is also proportion?







regression






share|cite|improve this question















share|cite|improve this question













share|cite|improve this question




share|cite|improve this question








edited 3 hours ago

























asked 4 hours ago









user1205901

3,289144595




3,289144595







  • 2




    1). It's not great because your DV is bounded, 2). Beta regression would be the first choice, since it allows you to model your data as is (without needing to change it), 3). Yes, you could also use a logit transform $$textlogBig(cfracp1-pBig)$$ to transform your DV and do regular linear regression.
    – user2974951
    4 hours ago













  • 2




    1). It's not great because your DV is bounded, 2). Beta regression would be the first choice, since it allows you to model your data as is (without needing to change it), 3). Yes, you could also use a logit transform $$textlogBig(cfracp1-pBig)$$ to transform your DV and do regular linear regression.
    – user2974951
    4 hours ago








2




2




1). It's not great because your DV is bounded, 2). Beta regression would be the first choice, since it allows you to model your data as is (without needing to change it), 3). Yes, you could also use a logit transform $$textlogBig(cfracp1-pBig)$$ to transform your DV and do regular linear regression.
– user2974951
4 hours ago





1). It's not great because your DV is bounded, 2). Beta regression would be the first choice, since it allows you to model your data as is (without needing to change it), 3). Yes, you could also use a logit transform $$textlogBig(cfracp1-pBig)$$ to transform your DV and do regular linear regression.
– user2974951
4 hours ago











1 Answer
1






active

oldest

votes

















up vote
2
down vote













A glm with a binomial distribution and a logit link should work fine. If there are no probabilities of $0$ or $1$, beta regression is another possibility. In this case, they yield almost indistinguishable results (not shown).



The relationship between the logit of the actual and the logit of the estimated probabilities looks quite linear (see picture below), which would lead me to use the logit of the estimated probabilities as predictor. The R code using to generate the plots is at the bottom of this answer.



loess_plot



Let's visualize the model on the original scale (the points are partial residuals):



enter image description here



R code:



library(betareg)
library(visreg)

# Convert probabilities to log-odds
est_p_logit <- log(est_prob/(1 - est_prob))
act_p_logit <- log(act_prob/(1 - act_prob))

# Check linearity
scatter.smooth(act_p_logit~est_p_logit, las = 1)

# GLM with binomial distribution and logit-link
mod <- glm(act_prob~est_p_logit, family = binomial)

summary(mod)

# Visualize GLM model
res_glm <- visreg(mod, scale = "response", ylim = c(0, 1), partial = TRUE, rug = 2)

# Beta regression
mod_beta <- betareg(act_prob~est_p_logit)

# Compare GLM results with beta regression
res_beta <- visreg(mod_beta, scale = "response", ylim = c(0, 1))
lines(res_beta$fit$visregFit~res_beta$fit$est_p_logit, type = "l", col = "red")





share|cite|improve this answer






















    Your Answer




    StackExchange.ifUsing("editor", function ()
    return StackExchange.using("mathjaxEditing", function ()
    StackExchange.MarkdownEditor.creationCallbacks.add(function (editor, postfix)
    StackExchange.mathjaxEditing.prepareWmdForMathJax(editor, postfix, [["$", "$"], ["\\(","\\)"]]);
    );
    );
    , "mathjax-editing");

    StackExchange.ready(function()
    var channelOptions =
    tags: "".split(" "),
    id: "65"
    ;
    initTagRenderer("".split(" "), "".split(" "), channelOptions);

    StackExchange.using("externalEditor", function()
    // Have to fire editor after snippets, if snippets enabled
    if (StackExchange.settings.snippets.snippetsEnabled)
    StackExchange.using("snippets", function()
    createEditor();
    );

    else
    createEditor();

    );

    function createEditor()
    StackExchange.prepareEditor(
    heartbeatType: 'answer',
    convertImagesToLinks: false,
    noModals: false,
    showLowRepImageUploadWarning: true,
    reputationToPostImages: null,
    bindNavPrevention: true,
    postfix: "",
    onDemand: true,
    discardSelector: ".discard-answer"
    ,immediatelyShowMarkdownHelp:true
    );



    );













     

    draft saved


    draft discarded


















    StackExchange.ready(
    function ()
    StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstats.stackexchange.com%2fquestions%2f373835%2fregression-when-both-the-predictor-and-outcome-variables-are-proportions%23new-answer', 'question_page');

    );

    Post as a guest






























    1 Answer
    1






    active

    oldest

    votes








    1 Answer
    1






    active

    oldest

    votes









    active

    oldest

    votes






    active

    oldest

    votes








    up vote
    2
    down vote













    A glm with a binomial distribution and a logit link should work fine. If there are no probabilities of $0$ or $1$, beta regression is another possibility. In this case, they yield almost indistinguishable results (not shown).



    The relationship between the logit of the actual and the logit of the estimated probabilities looks quite linear (see picture below), which would lead me to use the logit of the estimated probabilities as predictor. The R code using to generate the plots is at the bottom of this answer.



    loess_plot



    Let's visualize the model on the original scale (the points are partial residuals):



    enter image description here



    R code:



    library(betareg)
    library(visreg)

    # Convert probabilities to log-odds
    est_p_logit <- log(est_prob/(1 - est_prob))
    act_p_logit <- log(act_prob/(1 - act_prob))

    # Check linearity
    scatter.smooth(act_p_logit~est_p_logit, las = 1)

    # GLM with binomial distribution and logit-link
    mod <- glm(act_prob~est_p_logit, family = binomial)

    summary(mod)

    # Visualize GLM model
    res_glm <- visreg(mod, scale = "response", ylim = c(0, 1), partial = TRUE, rug = 2)

    # Beta regression
    mod_beta <- betareg(act_prob~est_p_logit)

    # Compare GLM results with beta regression
    res_beta <- visreg(mod_beta, scale = "response", ylim = c(0, 1))
    lines(res_beta$fit$visregFit~res_beta$fit$est_p_logit, type = "l", col = "red")





    share|cite|improve this answer


























      up vote
      2
      down vote













      A glm with a binomial distribution and a logit link should work fine. If there are no probabilities of $0$ or $1$, beta regression is another possibility. In this case, they yield almost indistinguishable results (not shown).



      The relationship between the logit of the actual and the logit of the estimated probabilities looks quite linear (see picture below), which would lead me to use the logit of the estimated probabilities as predictor. The R code using to generate the plots is at the bottom of this answer.



      loess_plot



      Let's visualize the model on the original scale (the points are partial residuals):



      enter image description here



      R code:



      library(betareg)
      library(visreg)

      # Convert probabilities to log-odds
      est_p_logit <- log(est_prob/(1 - est_prob))
      act_p_logit <- log(act_prob/(1 - act_prob))

      # Check linearity
      scatter.smooth(act_p_logit~est_p_logit, las = 1)

      # GLM with binomial distribution and logit-link
      mod <- glm(act_prob~est_p_logit, family = binomial)

      summary(mod)

      # Visualize GLM model
      res_glm <- visreg(mod, scale = "response", ylim = c(0, 1), partial = TRUE, rug = 2)

      # Beta regression
      mod_beta <- betareg(act_prob~est_p_logit)

      # Compare GLM results with beta regression
      res_beta <- visreg(mod_beta, scale = "response", ylim = c(0, 1))
      lines(res_beta$fit$visregFit~res_beta$fit$est_p_logit, type = "l", col = "red")





      share|cite|improve this answer
























        up vote
        2
        down vote










        up vote
        2
        down vote









        A glm with a binomial distribution and a logit link should work fine. If there are no probabilities of $0$ or $1$, beta regression is another possibility. In this case, they yield almost indistinguishable results (not shown).



        The relationship between the logit of the actual and the logit of the estimated probabilities looks quite linear (see picture below), which would lead me to use the logit of the estimated probabilities as predictor. The R code using to generate the plots is at the bottom of this answer.



        loess_plot



        Let's visualize the model on the original scale (the points are partial residuals):



        enter image description here



        R code:



        library(betareg)
        library(visreg)

        # Convert probabilities to log-odds
        est_p_logit <- log(est_prob/(1 - est_prob))
        act_p_logit <- log(act_prob/(1 - act_prob))

        # Check linearity
        scatter.smooth(act_p_logit~est_p_logit, las = 1)

        # GLM with binomial distribution and logit-link
        mod <- glm(act_prob~est_p_logit, family = binomial)

        summary(mod)

        # Visualize GLM model
        res_glm <- visreg(mod, scale = "response", ylim = c(0, 1), partial = TRUE, rug = 2)

        # Beta regression
        mod_beta <- betareg(act_prob~est_p_logit)

        # Compare GLM results with beta regression
        res_beta <- visreg(mod_beta, scale = "response", ylim = c(0, 1))
        lines(res_beta$fit$visregFit~res_beta$fit$est_p_logit, type = "l", col = "red")





        share|cite|improve this answer














        A glm with a binomial distribution and a logit link should work fine. If there are no probabilities of $0$ or $1$, beta regression is another possibility. In this case, they yield almost indistinguishable results (not shown).



        The relationship between the logit of the actual and the logit of the estimated probabilities looks quite linear (see picture below), which would lead me to use the logit of the estimated probabilities as predictor. The R code using to generate the plots is at the bottom of this answer.



        loess_plot



        Let's visualize the model on the original scale (the points are partial residuals):



        enter image description here



        R code:



        library(betareg)
        library(visreg)

        # Convert probabilities to log-odds
        est_p_logit <- log(est_prob/(1 - est_prob))
        act_p_logit <- log(act_prob/(1 - act_prob))

        # Check linearity
        scatter.smooth(act_p_logit~est_p_logit, las = 1)

        # GLM with binomial distribution and logit-link
        mod <- glm(act_prob~est_p_logit, family = binomial)

        summary(mod)

        # Visualize GLM model
        res_glm <- visreg(mod, scale = "response", ylim = c(0, 1), partial = TRUE, rug = 2)

        # Beta regression
        mod_beta <- betareg(act_prob~est_p_logit)

        # Compare GLM results with beta regression
        res_beta <- visreg(mod_beta, scale = "response", ylim = c(0, 1))
        lines(res_beta$fit$visregFit~res_beta$fit$est_p_logit, type = "l", col = "red")






        share|cite|improve this answer














        share|cite|improve this answer



        share|cite|improve this answer








        edited 1 hour ago

























        answered 1 hour ago









        COOLSerdash

        15.6k75090




        15.6k75090



























             

            draft saved


            draft discarded















































             


            draft saved


            draft discarded














            StackExchange.ready(
            function ()
            StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstats.stackexchange.com%2fquestions%2f373835%2fregression-when-both-the-predictor-and-outcome-variables-are-proportions%23new-answer', 'question_page');

            );

            Post as a guest













































































            Comments

            Popular posts from this blog

            What does second last employer means? [closed]

            List of Gilmore Girls characters

            Confectionery